How to Convert your Pandas Columns for a Structured Database like SQL

Python Pandas allows for a lot of flexibility when naming your which can cause trouble if you are trying to import the data into a structured database like SQL or BigQuery. In this post, I’ll show you how to format your Pandas columns to make them compliant with structured databases.

Clean up Steps

Here are the steps I’ll walk you through so that your column names are formated correctly before exporting:

String clean up
Dealing with integer columns
Identifying and modifying duplicate column names

Example DataFrame

Take a look at this Pandas DataFrame with columns that wouldn’t be acceptable in MySQL, for example.

df = pd.DataFrame(np.random.randint(100,size=(5, 10)),columns=['Column with @#$@#$ characters','Question #1','Test?','Test','TEST','Hello','hello',0,1,2])

df

Python
​x
 
df = pd.DataFrame(np.random.randint(100,size=(5, 10)),columns=['Column with @#$@#$ characters','Question #1','Test?','Test','TEST','Hello','hello',0,1,2])
​
df

The column names violate many naming restrictions found in structured databases because it has special characters, spaces in between words, columns that are integers, and upper-cased words.

String Clean up

Convert all the column names to a string
Extract just letters and digits (ie remove any special characters)
Join the words of the column names back together using a underscore (since the previous step seperated out each word)
Convert the column names to a string again as some of the names are still just numbers
lower case the column names

df.columns = (df.columns
              .astype(str) #convert all the elements to a string because Pandas allows you to have numbers for column names
              .str.extractall(r'([\w\d]+)') #extract just words and numbers
              .unstack() #bring all the groups back into one row
              .apply(lambda x: '_'.join(x.dropna() #replace empty space with an underscore
              .astype(str)),axis=1) #convert column names to a string
              .str.lower()) #lowercase

Python
xxxxxxxxxx
 
df.columns = (df.columns
              .astype(str) #convert all the elements to a string because Pandas allows you to have numbers for column names
              .str.extractall(r'([\w\d]+)') #extract just words and numbers
              .unstack() #bring all the groups back into one row
              .apply(lambda x: '_'.join(x.dropna() #replace empty space with an underscore
              .astype(str)),axis=1) #convert column names to a string
              .str.lower()) #lowercase

Dealing with integer columns

One way you can deal with numbers is just to convert the number to the word version. So it will convert 1 to one. You could also create a dictionary to convert them too. Whatever you prefer!

One package you can use to convert an integer to a number is called inflect.

!pip install inflect
import inflect
p = inflect.engine()

Python
xxxxxxxxxx
 
!pip install inflect
import inflect
p = inflect.engine()

I’ll be running this function in the next section.

Identifying and modifying duplicate columns

You can check for duplicate columns by running df.columns.duplicated(). If you dont’ have any duplicate columns then you can skip this section.

You’ll find yourself in one of three situations: no duplicate column names, one duplicate, or multiple duplicates. The latter is tricky and that’s the one I am going to solve for. In my example DataFrame, I have test as a column name for three columns. My goal is to append _2, _3, etc. dynamically so that I don’t have to worry about the number of duplicates. To do that, I am going to loop through the column name, append them to a new list, and then count the number of occurrences. If there are two or more occurrences, I’ll append the count as the new column name suffix.

Additionally, I will check if the column name is just a number by running .isdecimal() which is a built-in function that return True or False if the string is just comprised of numbers. If it is a number, I’ll run the function in inflect to convert the number to a spelled-out version.

def create_unique_cols(x):
  col_lst = [] #existing col list
  unique_col_lst = [] #new col list with modified col names for the duplicates

  for i in x.columns:
    col_lst.append(i) #build the list as you go through the elements

    if i.isdecimal(): #convert numbers to a string. In my case, I am just spelling out the number
      i = p.number_to_words(i)
    else:
      pass 

    if col_lst.count(i) > 1:
      unique_col_lst.append(f'{i}_{col_lst.count(i)}') #add the duplicated column with a number
    else:
      unique_col_lst.append(i) #add the column as is to the new list
  
  return unique_col_lst

Python
xxxxxxxxxx
 
def create_unique_cols(x):
  col_lst = [] #existing col list
  unique_col_lst = [] #new col list with modified col names for the duplicates
​
  for i in x.columns:
    col_lst.append(i) #build the list as you go through the elements
​
    if i.isdecimal(): #convert numbers to a string. In my case, I am just spelling out the number
      i = p.number_to_words(i)
    else:
      pass 
​
    if col_lst.count(i) > 1:
      unique_col_lst.append(f'{i}_{col_lst.count(i)}') #add the duplicated column with a number
    else:
      unique_col_lst.append(i) #add the column as is to the new list
  
  return unique_col_lst

Replace Original Column Names

In this last step, we can now take the columns and replace the ones in the original DataFrame.

df.columns = create_unique_cols(df)

Python
xxxxxxxxxx
 
df.columns = create_unique_cols(df)

Final Thoughts

Check out more Python tricks in this Colab Notebook or in my recent Python Posts.

Thanks for reading!

Posted

July 23, 2021

Python

Paul

Tags: