Pandas allow for almost anything as a column header and I’ll show you how to get your columns parquet and database ready.
Here is an example error you will see if trying to save a DataFrame to parquet with invalid characters for columns. Since databases and Parquet files have more restrictive naming conventions for columns than Pandas, you need to ensure that your columns contain only letters, numbers, and underscores.
Here is what you will learn in this post:
- How to keep only letters, numbers, and underscores in column names
- How to rename columns that only contain numbers
- How to deduplicate column names
Keep only letters, numbers, and underscores in column names
Consider the following DataFrame:
import pandas as pd import numpy as np df = pd.DataFrame(np.random.randint(100,size=(5, 12)),columns=['Column with @#$@#$ characters','Question #1','Test?','Test','TEST','Hello','hello', 'test.testing', 'comma,testing',0,1,2]) df
As you can see, Pandas let me create the DataFrame with no errors. We can clean these up by leveraging a combination of string methods to remove special characters.
df.columns = (df.columns .astype(str) #convert all the elements to a string because Pandas allows you to have numbers for column names .str.replace('[.,]', '_') #Replace specific symbols with underscores .str.extractall(r'([\w\d]+)') #extract just words and numbers .unstack() #bring all the groups back into one row .apply(lambda x: '_'.join(x.dropna() #replace empty space with an underscore .astype(str)),axis=1) #convert column names to a string .str.lower() #lowercase .str.strip() #clean up white spaces if applicable ) df
Here is the resulting DataFrame:
Renaming columns that are numbers only
Numbers as column names only pose a problem if the column name is composed entirely of numbers. My trick is to dynamically convert the numbers to words using the inflect package. So for example, 1
will turn into one
. I’ll show you how to apply this using a function in the next section.
!pip install inflect import inflect p = inflect.engine()
Deduplicate column names
You can first check for duplicate columns with the following snippet:
#Check for duplicate columns df.columns.duplicated().any()
In my case, this returns True
because some columns are duplicates. I’ve created the following function that loops through the column names and appends 2, 3, etc. depending on the occurrence of the duplicate column.
I am also checking to see if the column name is a straight-up number with the function .isdecimal()
. If that returns True
, then it runs the inflect module to convert the number to a word.
def create_unique_cols(x): col_lst = [] #existing col list unique_col_lst = [] #new col list with modified col names for the duplicates for i in x.columns: col_lst.append(i) #build the list as you go through the elements if i.isdecimal(): #convert numbers to a string. In my case, I am just spelling out the number i = p.number_to_words(i) else: pass if col_lst.count(i) > 1: unique_col_lst.append(f'{i}_{col_lst.count(i)}') #add the duplicated column with a number else: unique_col_lst.append(i) #add the column as is to the new list return unique_col_lst
The Final Output
Now you can safely export to a database or parquet!
Final Thoughts
Check out more Python tricks in this Colab Notebook or in my recent Python Posts.
Thanks for reading!