Importing and Analyzing TSA Throughput Data in Python

The TSA has started to publish the daily volume of passengers going through checkpoints on its website. The data set also includes the numbers from 2019, which makes it possible to measure the impact of COVID-19 on travel. It’s a really interesting leading indicator of a recovery in the travel sector. In this post, I’ll show you how to pull the data and analyze it in Python.

I first came across this website during a webinar, and the data set stood out as a simple way to track how quickly travel is recovering from COVID-19.

To start, let’s import the Python packages we need to do the analysis.

from bs4 import BeautifulSoup
import requests
import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

To retrieve the data, I am using the requests package to download the page and the BeautifulSoup package to parse the data table it contains. The following code loads the HTML of the page so that we can build a dataframe from it using Pandas.

source = requests.get('https://www.tsa.gov/coronavirus/passenger-throughput').text
soup = BeautifulSoup(source, 'lxml')

You’ll need to have a basic understanding of HTML to understand the next step. Essentially, the data is stored in an HTML table and we can identify the rows based on the “tr” tag. We can pull every instance of the “tr” tag using the find_all method.

I then use a loop to iterate through each row after the header row, split the row’s text on the newline character, and collect the results as a list of lists.

row = soup.find_all("tr")

data = []
for i in row[1:]:
    string = i.get_text()
    a = string.split("\n")
    data.append(a)

print(data)

Here is a preview of what the data now looks like:

[['5/28/2020', '321,776', '2,485,770', ''], ['5/27/2020', '261,170', '2,269,035', ''], ...
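Splitting on the newline character depends on how the page’s HTML happens to be laid out. Here is a minimal sketch of an alternative, assuming each row stores its values in “td” cells: read the text of each cell directly. The sample_html string is made-up markup for illustration only, not the actual source of the TSA page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the shape of the table; not the real page source.
sample_html = """
<table>
  <tr><th>Date</th><th>This Year</th><th>Last Year</th></tr>
  <tr><td>5/28/2020</td><td>321,776</td><td>2,485,770</td></tr>
  <tr><td>5/27/2020</td><td>261,170</td><td>2,269,035</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find_all("tr")[1:]:  # skip the header row
    # one string per cell, with surrounding whitespace removed
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    sample_rows.append(cells)

print(sample_rows)
```

With this approach there is no trailing empty string in each row, so a dataframe built from sample_rows would only need three column names.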

This next code snippet creates the dataframe from the list of lists. The fourth column, Empty, holds the blank string that the split leaves at the end of each row.

df = pd.DataFrame(data, columns=['Date','This Year','Last Year','Empty'])

Let’s take a quick look at the data types:

print(df.dtypes)

Date         object
This Year    object
Last Year    object
Empty        object
dtype: object

Now that we’ve imported the data, you’ll notice that every column has the object data type, which means the values are stored as text. We need to convert them in order to manipulate and plot the data.

For the numerical columns, we need to strip out the commas and convert the values to numbers. For the dates, we can use the Pandas to_datetime function to turn the strings into dates.

df['This Year'] = df['This Year'].replace(',','', regex=True)
df['Last Year'] = df['Last Year'].replace(',','', regex=True)
df['This Year'] = pd.to_numeric(df['This Year'])
df['Last Year'] = pd.to_numeric(df['Last Year'])
df['Date']= pd.to_datetime(df['Date'])

Now when we print the data types again, we get the following output:

Date         datetime64[ns]
This Year           float64
Last Year           float64
Empty                object
dtype: object
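The same cleaning steps can be tried on a couple of rows in isolation. This is only a sketch: the raw and clean dataframes are illustrative names, and the errors='coerce' option and the explicit date format are my additions rather than part of the code above.

```python
import pandas as pd

# Two sample rows in the same shape as the scraped data.
raw = pd.DataFrame(
    [['5/28/2020', '321,776', '2,485,770'],
     ['5/27/2020', '261,170', '2,269,035']],
    columns=['Date', 'This Year', 'Last Year'],
)

clean = raw.copy()
for col in ['This Year', 'Last Year']:
    # Remove the thousands separators, then convert the text to numbers.
    # errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
    without_commas = clean[col].str.replace(',', '', regex=False)
    clean[col] = pd.to_numeric(without_commas, errors='coerce')

# Spell out the month/day/year format so the parsing is unambiguous.
clean['Date'] = pd.to_datetime(clean['Date'], format='%m/%d/%Y')

print(clean.dtypes)
```

Looping over the column names keeps the conversion in one place if more numeric columns are added later.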

That looks better! Now let’s add a column that calculates the percent change in passenger volume from one year ago:

df['Percent Change'] = df['This Year'] / df['Last Year'] -1

Let’s take a quick look at the data:

         Date  This Year  Last Year Empty  Percent Change
0  2020-05-28   321776.0  2485770.0             -0.870553
1  2020-05-27   261170.0  2269035.0             -0.884898
2  2020-05-26   264843.0  2453649.0             -0.892062
3  2020-05-25   340769.0  2512237.0             -0.864356
4  2020-05-24   267451.0  2070716.0             -0.870841
..        ...        ...        ...   ...             ...
85 2020-03-05  2130015.0  2402692.0             -0.113488
86 2020-03-04  1877401.0  2143619.0             -0.124191
87 2020-03-03  1736393.0  1979558.0             -0.122838
88 2020-03-02  2089641.0  2257920.0             -0.074528
89 2020-03-01  2280522.0  2301439.0             -0.009089

[90 rows x 5 columns]
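As a quick check of the formula, the first row can be recomputed by hand. The two inputs are the This Year and Last Year values for 5/28/2020 from the table above; the variable names are just for illustration.

```python
this_year = 321776    # This Year value for 5/28/2020
last_year = 2485770   # Last Year value for 5/28/2020

pct_change = this_year / last_year - 1

print(f"{pct_change:.6f}")  # -0.870553
print(f"{pct_change:.1%}")  # -87.1%
```

A value of -0.87 means that volume was about 87% lower than on the comparable day a year earlier.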

Looks good! Now we are ready to plot the data. Let’s first take a look at the volume of passengers this year compared to last year.

plt.clf()

sns.set_style("whitegrid")
sns.set_context("poster", font_scale = 1, rc={"grid.linewidth": 0.6})
f, ax = plt.subplots(figsize=(20,7))
sns.lineplot(data = df, x = 'Date', y= 'This Year')
sns.lineplot(data = df, x = 'Date', y= 'Last Year')

plt.ylabel('Number of Travelers')
plt.xlabel('Date')
plt.title('Traveler Throughput')
plt.legend(['This Year','Last Year'])

#here is how to change the format of the Y axis labels to have a comma
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])


#save before showing, otherwise the saved image can come out blank
plt.savefig('last_year_and_this_year.png')
plt.show()
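Relabeling ticks with set_yticklabels works, but newer versions of Matplotlib may emit a warning when the tick positions have not been fixed first. One alternative is to attach a formatter to the axis. The sketch below assumes nothing about the chart above: demo_ax and comma_formatter are illustrative names, the y values are a few This Year figures from the table, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed when no display is available
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter

# Formats a tick value such as 2485770 as '2,485,770'.
comma_formatter = FuncFormatter(lambda value, pos: '{:,}'.format(int(value)))

fig, demo_ax = plt.subplots()
demo_ax.plot([0, 1, 2], [2280522, 2130015, 321776])  # a few values from the table
demo_ax.yaxis.set_major_formatter(comma_formatter)   # applied to every tick on the Y axis
fig.canvas.draw()  # render once so the formatter is exercised
```

Because the formatter is applied whenever the ticks are drawn, the labels stay correct even if the axis limits change later.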

Quite the decline. The chart would be more valuable if it showed the percent change year over year. It would also be interesting to mark the low point and where we are today in order to get a sense of the recovery.

plt.clf()

sns.set_style("whitegrid")
#sns.set_context("poster", font_scale = .5)
f, ax = plt.subplots(figsize=(20,7))
sns.lineplot(data = df, x = 'Date', y= 'Percent Change')
plt.ylabel('')
plt.xlabel('Date')
plt.title('% Change in Traveler Throughput Year over Year')
plt.legend(['% Change in Passengers'])
ax.set_yticks(sorted([ round(i,1) for i in np.arange(-1.0,0.1,.1)],reverse=True))
#ax.set_yticks([0, -0.10, -0.20, -0.30, -0.40, -0.5, -0.60, -0.70, -0.80, -0.90, -1])

ax.set_yticklabels(['0%','-10%','-20%', '-30%', '-40%','-50%','-60%','-70%', '-80%', '-90%', "-100%"])
sns.set_context("poster", font_scale = 1, rc={"grid.linewidth": 0.6})

ymin = df['Percent Change'].min()
filt = df['Percent Change'] == df['Percent Change'].min()
xmin = df.loc[filt]['Date'].min()

yrecent = df.iloc[0,4]
xrecent = df.iloc[0,0]

#show the min value
ax.annotate('Lowest Point ' + str("{:.1f}".format(ymin*100))+"%", xy=(xmin, ymin), xytext=(xmin, ymin+.2),
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='center', verticalalignment='top'
            #arrowprops=dict(facecolor='black', shrink=0.05),
            )

#show the most recent value
ax.annotate('Yesterday ' + str("{:.1f}".format(yrecent*100))+"%", xy=(xrecent, yrecent), xytext=(xrecent, yrecent+.2),
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='right', verticalalignment='top'
            #arrowprops=dict(facecolor='black', shrink=0.05),
            )

#save before showing, otherwise the saved image can come out blank
plt.savefig('change_yoy.png')
plt.show()
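The low point and the most recent value can also be looked up by label rather than by row and column position. This is a sketch on three rows taken from the table, so the “low” here is only the lowest of these three rows; sample, low_row and recent_row are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    'Date': pd.to_datetime(['2020-05-28', '2020-05-27', '2020-05-26']),
    'Percent Change': [-0.870553, -0.884898, -0.892062],
})

# idxmin returns the index label of the smallest value, idxmax the largest.
low_row = sample.loc[sample['Percent Change'].idxmin()]
recent_row = sample.loc[sample['Date'].idxmax()]

print(low_row['Date'].date(), f"{low_row['Percent Change']:.1%}")
print(recent_row['Date'].date(), f"{recent_row['Percent Change']:.1%}")
```

Looking up the latest date with idxmax does not rely on the newest row being first, which helps if the dataframe is ever re-sorted.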

One thing I think would be interesting is to combine all the data into one chart. The next example shows how to plot multiple lines against two different Y axes.

plt.clf()

x = df['Date']
y1 = df['Last Year']
y2 = df['This Year']
y3 = df['Percent Change']

sns.set_style("whitegrid")
sns.set_context("poster", font_scale = 1, rc={"grid.linewidth": 0.6})

fig, ax1 = plt.subplots(figsize=(20,7))
ax1.plot(x, y1, color = '#D3D3D3')
ax1.plot(x, y2, color = '#D3D3D3')
ax1.set_ylabel('Number of Travelers')

ax1.set_title('Traveler Throughput and % Change Year over Year')
ax1.set_yticklabels(['{:,}'.format(int(x)) for x in ax1.get_yticks().tolist()])


ax2 = ax1.twinx()
ax2.plot(x, y3)
ax2.set_yticks(sorted([ round(i,1) for i in np.arange(-1.0,0.1,.1)],reverse=True))
ax2.set_yticklabels(['0%','-10%','-20%', '-30%', '-40%','-50%','-60%','-70%', '-80%', '-90%', "-100%"])
ax2.set_ylabel('% Change Year over Year')

plt.savefig('last_year_and_this_year_and_change_yoy.png')
plt.show()
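Stripped down to its essentials, the two-axis technique looks like the sketch below. It uses three data points from the table with placeholder x values, and it swaps the hand-written percent labels for Matplotlib’s PercentFormatter, which is my substitution rather than what the code above does. The names left_ax, right_ax and pct_formatter are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed when no display is available
from matplotlib import pyplot as plt
from matplotlib.ticker import PercentFormatter

days = [1, 2, 3]                            # placeholder x values for illustration
volume = [2280522, 2130015, 321776]         # This Year values from the table
change = [-0.009089, -0.113488, -0.870553]  # matching Percent Change values

fig, left_ax = plt.subplots()
left_ax.plot(days, volume, color='#D3D3D3')
left_ax.set_ylabel('Number of Travelers')

right_ax = left_ax.twinx()  # second Y axis that shares the same X axis
right_ax.plot(days, change)
right_ax.set_ylim(-1.0, 0.0)

# xmax=1.0 tells the formatter that a value of 1.0 corresponds to 100%.
pct_formatter = PercentFormatter(xmax=1.0, decimals=0)
right_ax.yaxis.set_major_formatter(pct_formatter)
right_ax.set_ylabel('% Change Year over Year')

fig.canvas.draw()  # render once so both axes are laid out
```

Each axis keeps its own scale, which is what lets passenger counts in the millions share a chart with fractions between -1 and 0.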

Looking at the combined graph, it doesn’t seem to add much value because the % change line and last year’s passenger count overlap. The best view is likely to stack the first and second graphs, one above the other. Here’s how to do that:

plt.clf()

sns.set_style("whitegrid")
sns.set_context("poster", font_scale = 1, rc={"grid.linewidth": 0.6})
f = plt.figure(figsize=(20,14)) # empty figure, so no extra full-size axes sits behind the two charts
# top chart
ax = plt.subplot(2, 1, 1)
plt.subplots_adjust(top=0.95, hspace=0.3) # adds padding in between both charts
sns.lineplot(data = df, x = 'Date', y= 'This Year')
sns.lineplot(data = df, x = 'Date', y= 'Last Year')

plt.ylabel('Number of Travelers')
plt.xlabel('Date')
plt.title('Traveler Throughput')
plt.legend(['This Year','Last Year'])
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])

#bottom chart
ax2 = plt.subplot(2, 1, 2)

sns.lineplot(data = df, x = 'Date', y= 'Percent Change')
plt.ylabel('')
plt.xlabel('Date')
plt.title('% Change in Traveler Throughput Year over Year')
plt.legend(['% Change in Passengers'])
ax2.set_yticks(sorted([ round(i,1) for i in np.arange(-1.0,0.1,.1)],reverse=True))


ax2.set_yticklabels(['0%','-10%','-20%', '-30%', '-40%','-50%','-60%','-70%', '-80%', '-90%', "-100%"])
sns.set_context("poster", font_scale = 1, rc={"grid.linewidth": 0.6})

ymin = df['Percent Change'].min()
filt = df['Percent Change'] == df['Percent Change'].min()
xmin = df.loc[filt]['Date'].min()

yrecent = df.iloc[0,4]
xrecent = df.iloc[0,0]

#show the min value
ax2.annotate('Lowest Point ' + str("{:.1f}".format(ymin*100))+"%", xy=(xmin, ymin), xytext=(xmin, ymin+.2),
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='center', verticalalignment='top'
            #arrowprops=dict(facecolor='black', shrink=0.05),
            )

#show the most recent value
ax2.annotate('Yesterday ' + str("{:.1f}".format(yrecent*100))+"%", xy=(xrecent, yrecent), xytext=(xrecent, yrecent+.2),
            arrowprops=dict(facecolor='black', shrink=0.05),
            horizontalalignment='right', verticalalignment='top'
            #arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.savefig('two_charts_last_year_and_this_year_and_yoy.png')
plt.show()

Thanks for following along! If you have any suggestions about how to improve my code, I’d love to hear them! Please post a comment below.

To get a copy of the Jupyter notebook, check out my GitHub: https://github.com/prdesalvo/python

