Data gaps can occur when data is organized into time intervals but observations are missing for certain intervals. For example, let’s say you are tracking sales of snow shovels by month. Snow shovels are typically only in demand during winter months, so it is likely that there will be months with no sales at all. When charting sales data for the last 12 months, for example, it is important to include all months in order to get an accurate picture of the trend. Failing to do so and only showing months with sales can be misleading and hinder accurate interpretation of the results. In this post, we will explore a few techniques on to fill in date gaps.
Here’s an example of a SQL query that might return data with gaps:
SELECT MONTH(date_column) AS month, YEAR(date_column) AS year, SUM(sales) AS total_sales FROM sales_table GROUP BY MONTH(date_column), YEAR(date_column)
The query could return the results below. At first glance, the data looks continuous which can be misleading if your analysis stops here. For this type of data, it’s better to show the months where there are 0 sales to get a true picture.
month | year | total_sales |
1 | 2022 | 470 |
2 | 2022 | 200 |
3 | 2022 | 85 |
7 | 2022 | 50 |
9 | 2022 | 100 |
10 | 2022 | 250 |
11 | 2022 | 410 |
12 | 2022 | 550 |
Tips for Dealing with Data Gaps
- Create a date range list then left-join the original data
- Resampling data using Python
- The visualization tool does it for you
Create a date range list then left-join original data
There are many ways to perform this action but it depends on the tool you are using. Creating a range of dates can be done once and placed in a permanent table or dynamically every time you run the data. Python Pandas has a date_range
function that allows you to create a date range at your desired frequency.
#Create the sample data df = pd.DataFrame( {'dates': ['1/31/2022', '2/28/2022', '3/31/2022', '7/31/2022', '9/30/2022', '10/31/2022', '11/30/2022', '12/31/2022'], 'sales': [470,200,85,50,100,250,410,550]} df['dates'] = pd.to_datetime(df['dates']) #Create a dataset with the dates date_range = pd.date_range('1/1/2022','12/31/2022', freq='M') date_df = pd.DataFrame(date_range, columns=['dates']) #Left join the data df2 = pd.merge( date_df, df, how='left', left_on=['dates'], right_on=['dates'] ) #Fill in the gap dates with 0 df2.fillna(0, inplace=True) df2
Some databases allow you to dynamically generate a date range. Here’s how you could do it in BigQuery:
WITH date_range AS ( SELECT DATE_TRUNC(DATE('2022-02-01'), MONTH) AS start_date, DATE_TRUNC(DATE('2023-01-01'), MONTH) AS end_date ) SELECT DATE_SUB(DATE_TRUNC(month_date, MONTH), INTERVAL 1 DAY) AS month_end_date FROM date_range, UNNEST(GENERATE_DATE_ARRAY(start_date, end_date, INTERVAL 1 MONTH)) AS month_date
Resampling Data Using Python
Using Python Pandas, you can quickly and efficiently fill in the gaps by using Re Sampling. Here’s an example of a data table with date gaps:
week_df = pd.DataFrame(data=[1,2,3,4], index=pd.date_range(start='1/1/2020', end='1/31/2020', freq = 'W-MON'), columns=['number']) week_df
Now that we have some data, we can resample the data to a day frequency:
week_df.resample("D").sum()
As you can see, the resample function added the missing days from the start and end of the dataset.
The Visualization Tool Does It for You
The above methods to fix the gaps in the data are not necessary if your visualization tool will fill in the gaps. Here’s an example of how Plotly correctly displays the data.
First I’ll create a dataset:
df = pd.DataFrame( {'dates': pd.date_range('1/1/2022', '12/31/2022', freq="M"), 'sales': [470,200,85,None,None,None,50,None,100,250,410,550]}) df.dropna(inplace=True) df
Next, I’ll plot it.
import plotly.express as px fig = px.bar(df, x='dates', y="sales") fig.update_layout(template='simple_white') fig.show()
Final Thoughts
Check out more Python tricks in this Colab Notebook or in my recent Python Posts.
Thanks for reading!