How to Convert your Pandas Columns for a Structured Database like SQL

Python Pandas allows for a lot of flexibility when naming your which can cause trouble if you are trying to import the data into a structured database like SQL or BigQuery. In this post, I’ll show you how to format your Pandas columns to make them compliant with structured databases.

Clean up Steps

Here are the steps I’ll walk you through so that your column names are formated correctly before exporting:

  1. String clean up
  2. Dealing with integer columns
  3. Identifying and modifying duplicate column names

Example DataFrame

Take a look at this Pandas DataFrame with columns that wouldn’t be acceptable in MySQL, for example.

df = pd.DataFrame(np.random.randint(100,size=(5, 10)),columns=['Column with @#$@#$ characters','Question #1','Test?','Test','TEST','Hello','hello',0,1,2])

df

The column names violate many naming restrictions found in structured databases because it has special characters, spaces in between words, columns that are integers, and upper-cased words.

String Clean up

  • Convert all the column names to a string
  • Extract just letters and digits (ie remove any special characters)
  • Join the words of the column names back together using a underscore (since the previous step seperated out each word)
  • Convert the column names to a string again as some of the names are still just numbers
  • lower case the column names
df.columns = (df.columns
              .astype(str) #convert all the elements to a string because Pandas allows you to have numbers for column names
              .str.extractall(r'([\w\d]+)') #extract just words and numbers
              .unstack() #bring all the groups back into one row
              .apply(lambda x: '_'.join(x.dropna() #replace empty space with an underscore
              .astype(str)),axis=1) #convert column names to a string
              .str.lower()) #lowercase

Dealing with integer columns

One way you can deal with numbers is just to convert the number to the word version. So it will convert 1 to one. You could also create a dictionary to convert them too. Whatever you prefer!

One package you can use to convert an integer to a number is called inflect.

!pip install inflect
import inflect
p = inflect.engine()

I’ll be running this function in the next section.

Identifying and modifying duplicate columns

You can check for duplicate columns by running df.columns.duplicated(). If you dont’ have any duplicate columns then you can skip this section.

You’ll find yourself in one of three situations: no duplicate column names, one duplicate, or multiple duplicates. The latter is tricky and that’s the one I am going to solve for. In my example DataFrame, I have test as a column name for three columns. My goal is to append _2, _3, etc. dynamically so that I don’t have to worry about the number of duplicates. To do that, I am going to loop through the column name, append them to a new list, and then count the number of occurrences. If there are two or more occurrences, I’ll append the count as the new column name suffix.

Additionally, I will check if the column name is just a number by running .isdecimal() which is a built-in function that return True or False if the string is just comprised of numbers. If it is a number, I’ll run the function in inflect to convert the number to a spelled-out version.

def create_unique_cols(x):
  col_lst = [] #existing col list
  unique_col_lst = [] #new col list with modified col names for the duplicates

  for i in x.columns:
    col_lst.append(i) #build the list as you go through the elements

    if i.isdecimal(): #convert numbers to a string. In my case, I am just spelling out the number
      i = p.number_to_words(i)
    else:
      pass 

    if col_lst.count(i) > 1:
      unique_col_lst.append(f'{i}_{col_lst.count(i)}') #add the duplicated column with a number
    else:
      unique_col_lst.append(i) #add the column as is to the new list
  
  return unique_col_lst

Replace Original Column Names

In this last step, we can now take the columns and replace the ones in the original DataFrame.

df.columns = create_unique_cols(df)

Final Thoughts

Check out more Python tricks in this Colab Notebook or in my recent Python Posts.

Thanks for reading!


Posted

in

by

Tags: