In this post, I’ll show you how to extract every number from a string in Python using regular expressions.
Extracting phone numbers or simply removing the symbols from numbers are the two most common reasons I am looking to strip out everything but the actual numbers from a Python Pandas string column.
Consider the following DataFrame:
import pandas as pd df = pd.DataFrame(['+1-555-555-5555','+1 (123) 456 7899', 'Here is my number 1-555-555-6666',123456789],columns=['Numbers and Text']) df
Here is how you can run to return a new column with only the numbers:
df['Numbers Only'] = df['Numbers and Text'].astype('str').str.extractall('(\d+)').unstack().fillna('').sum(axis=1).astype(int)
Breakdown
.astype('str')
First, we need to make sure that the entire column is a string in order to apply regex. This step is important if you have mixed data types in your column.
.str.extractall('(\d+)')
Using extract all, we can return every number from the string. The (\d)+
will look for any digit or any length in the string.
The result ends up being a longer DataFrame where every match becomes its own row in a multi-index DataFrame.
.unstack()
This brings the matched numbers into the same row as the original string. Each grouping will be in a separate column.
.fillna('')
Since certain strings may have more matched groups than others, there may be NaN
values in your DataFrame. This step removes them because otherwise, any operation that involves a NaN
value results in NaN
value.
.sum(axis=1)
Finally, to bring all the numbers together into one cell, we can run a sum. The axis=1
argument will sum the row values. The groupings are each stored as a string so it doesn’t add the numbers but will concatenate the numbers. The resulting column will be stored as a float data type.
.astype(int)
This last step is optional if you don’t want your number to be a float.
Final Thoughts
Check out more Python tricks in this Colab Notebook or in my recent Python Posts.
Thanks for reading!