Let’s say you have a column of data that is a string list separated by commas. In this post, I’ll walk you through two ways to break up the data and count the frequency of the list items.
Method #1 – Converting Column to a Python List
A simple way to return all the values is to use a string split method and return the values as a list.
unique_countries = [val.strip() for sublist in netflix_df.country.dropna().str.split(",").tolist() for val in sublist]
To find the frequencies, we can convert the list back into a DataFrame and apply a value count.
country_summary = pd.DataFrame(unique_countries,columns=['country']).value_counts().reset_index().rename(columns={0:'count'})
Method #2 – Split the list and Append it back to the Original DataFrame
With this method, we can break up the column into multiple columns. This effectively splits up the list and allows us to isolate each item in the list. Finally, we can melt the DataFrame to bring all the items back into one DataFrame column.
countries_expanded = netflix_df['country'].str.split(',', expand=True) countries_expanded.columns = ['Country'+str(i) for i in countries_expanded.columns] countries_expanded_concat = pd.concat([netflix_df,countries_expanded], axis=1) year_country_produced_df = pd.melt(countries_expanded_concat, id_vars=['show_id','release_year'], value_vars=countries_expanded.columns, var_name='Country Number', value_name='Country Produced').dropna()
The benefit of this second method is that you can leverage other columns from the original data to group the data. Here is the link to my Kaggle Notebook where I created these examples.
Thanks for reading!