If you have a long-running query, splitting it up into smaller queries can help with performance. With Python, we can dynamically loop through each query.
I have run into situations where my query joins two large tables and takes forever to run. However, when I filter each table down to a single account, query performance improves dramatically. With Python, we can loop through a list of items, dynamically fill in the query statement for each one, and run each query in turn. Let’s get started.
Step 1: Generate a List
The first step is to generate a list of IDs, or whatever values you are filtering on in your SQL WHERE clause. In my case, I am converting a Pandas column to a list.
ids = reminders_df.ids.unique().tolist()
You could also create a random list:
from numpy import random
ids = [random.randint(100) for i in range(0, 100)]
Step 2: Break up List into Smaller Lists
Next, we can break up the list into smaller chunks. In this example, my chunk size will be 10.
chunk_size = 10
ids_chunks = [ids[x:x+chunk_size] for x in range(0, len(ids), chunk_size)]
The final result is a list of lists, where each sublist has 10 items. The last sublist will likely have fewer than 10 items unless your original list size is a multiple of 10.
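To make the chunking step concrete, here is a small self-contained sketch using 25 fake IDs (the sample data here is made up for illustration):

```python
# 25 sample IDs standing in for the real list from Step 1
ids = list(range(1, 26))

chunk_size = 10
ids_chunks = [ids[x:x + chunk_size] for x in range(0, len(ids), chunk_size)]

print(len(ids_chunks))   # 3 sublists
print(ids_chunks[-1])    # [21, 22, 23, 24, 25] -- the leftover 5 items
```

Notice the last sublist picks up whatever remains, so no IDs are dropped.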
Step 3: Loop Through your Query
Now that we have our list of lists, we can perform the following:
- Dynamically add each chunk into the WHERE statement of our SQL query
- Run the query
- Append the results to a Pandas DataFrame
Here’s an example:
import pandas as pd

#Query starts here with brackets to insert your list items
query = '''SELECT * FROM accounts WHERE account_id IN ({})'''

#Function starts here that will take your list of chunks as the parameter
def query_loop(lst):
    #Create an empty DataFrame that will hold the results as the queries run
    combined_df = pd.DataFrame()
    #Here is the loop. We first need to convert each chunk into a
    #comma-separated string, then use format to insert it into the query
    for i in lst:
        df = function_to_run_query(query.format(",".join(str(int(j)) for j in i)))
        #Concat each chunk's results onto the combined DataFrame
        combined_df = pd.concat([combined_df, df])
    return combined_df
Once you do this, you can call query_loop(ids_chunks).
Note that I use a function called function_to_run_query() that I don’t define anywhere in this post. The basic idea is that it takes one parameter (your query), runs it against your database, and returns the results as a Pandas DataFrame. You can check out my post on connecting to a remote MySQL database to learn more.
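As a rough sketch, such a function might look like the following. This version uses Python's built-in sqlite3 purely for illustration; the original post connects to a remote MySQL database instead, so treat the connection details as assumptions:

```python
import sqlite3
import pandas as pd

def function_to_run_query(query):
    # Illustrative stand-in: an in-memory SQLite database. In practice you
    # would open a connection to your real database (e.g. remote MySQL).
    conn = sqlite3.connect(":memory:")
    try:
        # pandas executes the query and returns the rows as a DataFrame
        return pd.read_sql(query, conn)
    finally:
        conn.close()
```

Any function with that signature, one query in and one DataFrame out, will work with the query_loop() above.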
Final Thoughts
Looping through a query is one way to speed up a long-running query. The other thing to consider is why your query takes so long to run. Sometimes, it can be faster to break apart the query and import the raw data from the tables into Python. Then you can leverage Pandas to merge the datasets.
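For example, if you pull each table into its own DataFrame, a Pandas merge can stand in for the SQL join. The frame and column names below are made up for illustration:

```python
import pandas as pd

# Hypothetical raw data imported separately from each table
accounts = pd.DataFrame({"account_id": [1, 2, 3],
                         "name": ["Acme", "Globex", "Initech"]})
reminders = pd.DataFrame({"account_id": [1, 1, 3],
                          "reminder": ["call", "email", "invoice"]})

# Equivalent of an inner join on account_id, done in Pandas instead of SQL
merged = accounts.merge(reminders, on="account_id", how="inner")
```

This keeps the heavy join out of the database entirely, which can be faster when the tables themselves export quickly.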
Thanks for reading!