
Why Exploratory Data Analysis (EDA) is So Hard and So Manual

Exploratory Data Analysis (EDA) is crucial for gaining a solid understanding of your data and uncovering potential insights. However, the process is typically manual and involves a lot of routine, repetitive work. Despite numerous technological advancements, EDA still requires significant manual effort, technical skills, and substantial computational power. In this post, we will explore why EDA is so challenging and examine some modern tools and techniques that can make it easier.

Analogy: Exploring an Uncharted Island with Modern Technology

Imagine you’ve been tasked with exploring a vast, uncharted island. This island represents your database, and your mission is to find hidden treasures (insights) that can help answer important questions (business queries).

Starting with a Map and Limited Guidance

Your journey begins with a rough map (the business question and dataset) that shows where the island might have treasures, but it’s incomplete and lacks detailed guidance. There are many areas to explore (numerous tables), and the landmarks (documentation) are either missing or vague. This makes it difficult to decide where to start your search.

Navigating Without Context

As you step onto the island, you realize that understanding the terrain (contextual business knowledge) is essential. Without knowing the history and geography (how data is used), you might overlook significant clues or misinterpret the signs. Having an experienced guide or reference materials (query repositories and existing business logic) can help you get oriented, but they don’t provide all the answers. They might show you paths taken by previous explorers (how data has been used), but you still need to figure out much on your own.

Understanding the Terrain

Once you start exploring, you have to understand the lay of the land (the data itself). For smaller areas (small datasets), you can quickly get a sense of what’s around you by looking closely at your surroundings (eyeballing a few rows). However, for larger regions (large datasets), you need tools like binoculars and compasses (queries and statistical summaries) to get a broader view. This process involves a lot of trial and error—climbing trees to see the landscape (running SQL or Python queries) and digging in the dirt to find hidden artifacts (work that takes computational power and technical skill).

The Challenges of Exploration

The larger and more complex the island, the harder it is to get a quick overview. Simple reconnaissance (basic queries) might help you find some treasures on the surface, but to uncover the real gems, you need to delve deeper and navigate through dense forests and treacherous swamps (poorly documented or context-lacking data). This is a significant challenge that requires persistence, skill, and often, a bit of luck.

Leveraging Modern Tools for Efficient Exploration

In the past, to systematically scan the land, you would have needed to rent a lot of expensive equipment and hire a team to help survey it, much like using costly cloud computing resources. However, technology has evolved, making it possible to do more with less. Modern tools are now more accessible and cost-effective, similar to having advanced features available on a smartphone.

  • DuckDB for Fast Analytics: Think of DuckDB as a high-speed ATV that allows you to quickly traverse the island without getting bogged down. Unlike relying on expensive external survey teams (cloud computing), DuckDB enables you to perform fast, efficient analytics directly on your desktop. This local approach avoids the high costs and latency associated with cloud solutions, giving you immediate, powerful insights without breaking the bank.
  • Automated Profiling Queries: These act like a team of robotic scouts that systematically survey the land, automatically profiling and summarizing data to highlight key areas of interest.
  • ChatGPT for Plain English Explanations: Imagine having a holographic guide who explains complex findings in simple terms, making it easier to understand and communicate the insights you discover.

By combining these modern tools, you can navigate the uncharted island of your data more effectively, uncovering valuable treasures (insights) with greater speed and accuracy, all without the high costs previously associated with such technology.

Starting with Business Questions and Data Sets

EDA typically begins with a business question and a data set or database. Someone asks a question, and we get pointed to a database that’s supposed to have the answers. But that’s where the challenges start. Databases often have numerous tables with little to no documentation. This makes it hard to figure out where to look and what data to use. On top of that, the amount of data can be large, which only adds to the complexity.

Lack of Contextual Business Knowledge

One of the biggest hurdles is not having the contextual business knowledge about how the data is used. Without this context, it’s tough to know what you’re looking for or how to interpret the data. This is where query repositories and existing business logic come in handy. These resources can help orient you in the database by showing how data has been used in the past, what tables are involved, and what calculations or formulas have been applied. They provide a starting point, but they don’t solve all the problems.

Challenges in Understanding Data

Once you’re oriented, the next step is to understand the data itself. For small files, you might be able to eyeball a few rows to get a sense of what’s there. But with larger datasets, this isn’t practical. You have to run queries to get a feel for the data—things like averaging a number column or counting distinct values in a categorical column. These queries give you a snapshot, but they can be time-consuming and require you to write a lot of SQL or Python code.

The larger the data set, the harder it is to get a quick overview. Simple queries can help, but they only scratch the surface. Understanding the full scope of the data, especially when it’s poorly documented or lacks context, is a significant challenge.

The Manual Nature of EDA

Running Queries to Get Metadata Insights

Exploratory Data Analysis is still very much a hands-on process. To get insights, we have to run various queries to extract metadata from the data set. This includes operations like averaging numeric columns, counting distinct values in categorical columns, and summarizing data to get an initial understanding of what’s there. Each of these tasks requires writing and running multiple queries, which can be tedious and repetitive.

Why EDA is Still Manual

EDA remains a manual process for several reasons:

  1. Computational Expense: When dealing with large datasets in cloud environments like BigQuery, running numerous exploratory queries can become prohibitively expensive. Each query costs money, and the more data you process, the higher the bill.
  2. Time-Consuming: Running multiple exploratory queries can be slow, especially with big datasets. Waiting for queries to finish can take a significant amount of time, which delays the entire analysis process.
  3. Data Cleanup Issues: Real-world data is messy. You often encounter missing values, incorrect labels, and redundant columns. Cleaning and prepping the data for analysis is a complex task that requires meticulous attention to detail.
  4. Technical Skills Required: Automating parts of EDA requires advanced SQL or Python skills. Not everyone has the expertise to write efficient queries or scripts to streamline the process. This technical barrier makes EDA less accessible to those without a strong programming background.

These challenges collectively make EDA a labor-intensive task, requiring significant manual effort and technical know-how to navigate and analyze large datasets effectively.

Modern Solutions and Tools

Advancements in Technology

Recent advancements in technology have made it easier to tackle some of the challenges in EDA. Modern laptops are more powerful than ever, allowing us to store and analyze significant amounts of data locally. This means we can avoid the high costs associated with cloud environments for exploratory work and work faster without the delays caused by network latency.

Tools for Local Analysis

For local data analysis, Pandas has been a go-to tool. It allows us to manipulate and analyze data efficiently on our local machines. However, Pandas has its limitations, especially with very large datasets. This is where DuckDB comes in. DuckDB is a database management system designed for analytical queries, and it can handle large datasets efficiently right on your local machine. It combines the flexibility of SQL with the performance benefits of a local database, making it a powerful tool for EDA.
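
To give a concrete sense of how the two fit together, DuckDB can run SQL directly against a Pandas DataFrame that is already in memory. The snippet below is a minimal sketch; the DataFrame and its columns are made up purely for illustration.

import duckdb
import pandas as pd

# A tiny made-up DataFrame standing in for data you already have locally
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "revenue": [120.0, 80.0, 200.0, 150.0],
})

# DuckDB can reference a DataFrame in the current Python scope by its variable name,
# so you can mix SQL analytics with an existing Pandas workflow
result = duckdb.sql("SELECT region, SUM(revenue) AS total_revenue FROM df GROUP BY region").df()
print(result)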

Integrating AI in EDA

AI models, like ChatGPT, are revolutionizing the way we approach EDA. These models can help to translate complex statistical insights into plain English. This is particularly helpful for those who may not have a strong background in statistics. By feeding summarized results and metadata into AI, we can quickly understand the data and identify potential insights or anomalies. AI can also assist in automating some of the more tedious aspects of EDA, such as generating initial descriptive statistics or identifying trends, allowing us to focus on deeper analysis and interpretation.

Benefits of Automation in EDA

Automating parts of the Exploratory Data Analysis process offers several significant advantages:

  • Faster Initial Analysis
    • Automates routine queries and data processing
    • Provides a broad dataset overview quickly
    • Identifies key metrics, distributions, and areas of interest faster
  • Reduced Computational Costs
    • Optimizes use of computational resources
    • Focuses on relevant data, avoiding unnecessary computations
    • Lowers expenses, especially in cloud environments with large datasets
  • Ability to Identify Underlying Trends and Insights
    • Applies consistent analysis logic across different datasets
    • Systematically detects patterns, anomalies, and correlations
    • Enhances trend identification with AI, offering plain language explanations

By leveraging automation in EDA, you can streamline the analysis process, reduce costs, and uncover deeper insights more reliably.

Practical Examples

To illustrate how automation and modern tools can streamline EDA, let’s look at a few practical examples. These examples show how to use Python, DuckDB, and AI to perform common EDA tasks more efficiently. You can adapt these examples to fit your specific needs and datasets.

Example 1: Initial Data Overview with Pandas and DuckDB

DuckDB is straightforward to use, and it’s available in Google Colab by default. There’s a Python API to access it, and the DuckDB documentation includes a tutorial on how to use it.

import duckdb

# Define the URL of the public CSV file
csv_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

# Connect to DuckDB (you can use an in-memory database for temporary usage)
con = duckdb.connect(database=':memory:')

# Read the CSV file from the URL into a DuckDB table
con.execute(f"CREATE TABLE my_table AS SELECT * FROM read_csv_auto('{csv_url}')")

# Verify the data
df = con.execute("SELECT * FROM my_table").df()

# Display the data
df.head()
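
You also don’t have to materialize a table just to take a quick look: DuckDB can query the CSV at the URL directly. A minimal sketch reusing the csv_url defined above:

# Quick one-off check without creating a table first
con.sql(f"SELECT COUNT(*) AS row_count FROM read_csv_auto('{csv_url}')").show()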

Example 2: Automating Metadata Extraction

A benefit of using DuckDB is its support for standard metadata queries like DESCRIBE, which lets you inspect table and column definitions (names, types, and nullability). DuckDB enforces uniform data types within columns, making it easier to understand column types and run accurate descriptive queries, such as calculating the standard deviation on numeric columns. Running SQL queries in DuckDB provides a concise way to analyze your data’s structure. Additionally, the SUMMARIZE statement in DuckDB offers detailed statistics for every column.

con.sql("DESCRIBE my_table")

con.sql("SUMMARIZE my_table")

Here’s an example of a query to get statistics for all numeric columns in your DuckDB database. By leveraging DuckDB, you can efficiently iterate through your data and store the results in a way that is both performant and memory-efficient.

# Define the table name
table = 'my_table'

# Fetch the table description to get column metadata
describe_query = f"DESCRIBE {table}"
columns_df = con.execute(describe_query).df()

# Filter numeric columns (read_csv_auto typically infers BIGINT/DOUBLE, so match the common numeric types)
numeric_columns = columns_df[columns_df['column_type'].str.contains('INT|DOUBLE|FLOAT|DECIMAL|NUMERIC|REAL')]['column_name'].tolist()

# Define the template for summary statistics query
NUMBER_COLUMN_SUMMARY_STATS_TEMPLATE = """
SELECT 
  '{column}' AS column_name,
  COUNT(*) AS total_count,
  COUNT({column}) AS non_null_count,
  1 - (COUNT({column}) / COUNT(*)) AS null_percentage,
  COUNT(DISTINCT {column}) AS unique_count,
  COUNT(DISTINCT {column}) / COUNT({column}) AS unique_percentage,
  MIN({column}) AS min,
  MAX({column}) AS max,
  AVG({column}) AS avg,
  SUM({column}) AS sum,
  STDDEV({column}) AS stddev,
  percentile_disc(0.05) WITHIN GROUP (ORDER BY {column}) AS percentile_5th,
  percentile_disc(0.25) WITHIN GROUP (ORDER BY {column}) AS percentile_25th,
  percentile_disc(0.50) WITHIN GROUP (ORDER BY {column}) AS percentile_50th,
  percentile_disc(0.75) WITHIN GROUP (ORDER BY {column}) AS percentile_75th,
  percentile_disc(0.95) WITHIN GROUP (ORDER BY {column}) AS percentile_95th
FROM {table}
"""

# Iterate through the numeric columns and generate summary statistics
summary_stats_queries = []
for column in numeric_columns:
    summary_stats_query = NUMBER_COLUMN_SUMMARY_STATS_TEMPLATE.format(column=column, table=table)
    summary_stats_queries.append(summary_stats_query)

# Combine all the summary statistics queries into one
combined_summary_stats_query = " UNION ALL ".join(summary_stats_queries)

# Execute the combined query and create a new table
summary_table_name = 'numeric_columns_summary_stats'
con.execute(f"CREATE TABLE {summary_table_name} AS {combined_summary_stats_query}")

# Verify the results
summary_df = con.execute(f"SELECT * FROM {summary_table_name}").df()
print(summary_df)

For text columns, a helpful query is to find the top N and bottom N values by frequency:

# Define the template for top/bottom value counts (formatted per column, like the numeric template above)
TOP_AND_BOTTOM_VALUES_TEMPLATE = """
WITH sorted_values AS (
    SELECT
      {column} AS value,
      COUNT(*) AS count,
      ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS rn_desc,
      ROW_NUMBER() OVER (ORDER BY COUNT(*) ASC) AS rn_asc
    FROM {table}
    WHERE {column} IS NOT NULL
    GROUP BY {column}
  )
  SELECT '{column}' AS column_name, value, count, rn_desc, rn_asc
  FROM sorted_values
  WHERE rn_desc <= 10 OR rn_asc <= 10
  ORDER BY rn_desc, rn_asc
"""
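
To apply this to every text column, you can reuse the columns_df from the metadata step above and run the template once per VARCHAR column. A minimal sketch under that assumption:

import pandas as pd

# Pick out the text (VARCHAR) columns from the DESCRIBE output
text_columns = columns_df[columns_df['column_type'].str.contains('VARCHAR')]['column_name'].tolist()

# Run the top/bottom query for each text column and stack the results into one DataFrame
text_value_frames = [
    con.execute(TOP_AND_BOTTOM_VALUES_TEMPLATE.format(column=column, table=table)).df()
    for column in text_columns
]
text_values_df = pd.concat(text_value_frames, ignore_index=True)
print(text_values_df)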

Example 3: Using AI for Insight Generation

Now that you have a process to generate metadata for each column, you can iterate through and create prompts for ChatGPT. Converting the data into human-readable text yields the best responses. This step is particularly valuable because it transforms statistical data into narratives that business users can easily understand. You don’t need a statistics degree to comprehend your data. The output will ideally highlight the next steps for data cleanup, identify outliers, and suggest ways to use the data for further insights and analysis.

df = con.execute(f"SELECT * FROM {summary_table_name} WHERE column_name = 'Fare'").df().squeeze()
data_dict = df.to_dict()

column_summary_text = ''
for key, value in data_dict.items():
    column_summary_text += f"{key}: {value}\n"

print(column_summary_text)

prompt = f"""You are an expert data analyst at a SaaS company. Your task is to understand source data and derive actionable business insights. You excel at simplifying complex technical concepts and communicating them clearly to colleagues. Using the metadata provided below, analyze the data and provide insights that could drive business decisions and strategies. Please provide an answers in paragraph form.

Metadata:
{column_summary_text}
"""

Wrapping Up: Streamlining EDA with Modern Tools and Techniques

Exploratory Data Analysis is a crucial but often challenging and manual process. The lack of contextual business knowledge, the complexity of understanding large datasets, and the technical skills required make it daunting. However, advancements in technology, such as powerful local analysis tools like Pandas and DuckDB, and the integration of AI models like ChatGPT, are transforming how we approach EDA. Automating EDA tasks can lead to faster initial analysis, reduced computational costs, and the ability to uncover deeper insights. By leveraging these modern tools and techniques, we can make EDA more efficient and effective, ultimately driving better business decisions.

Thanks for reading!

