Leverage Python and the Salesforce API to efficiently manage and clean your data.
I find myself doing bulk updates a lot. The requests typically come from adding new data fields, retroactively updating fields, and cleaning Salesforce data. The approach I've found leads to the fewest mistakes is the following:
- Create a report with the Salesforce ID and columns you want to modify
- Export the data out of Salesforce and make the changes
- Re-import the data using the Salesforce ID as the lookup column
Before the API, I would create a report, export it to a CSV, make the changes, then re-upload the data with dataloader.io. The process works great, but there are a lot of steps and a lot of room for error. I find that I can automate most of those steps now with Python. Check out my post on Getting Started with the Salesforce API and Python.
Step 1 – Query your data
The idea here is that you’ll query your data which needs to have at least:
- Salesforce ID (so that we can perform an update request later on)
- Target Column (this is the column with the data you want to modify)
The query below will return the Salesforce ID and the Industry from the Account object. The triple single quotes ''' allow the string to run over multiple lines.
# query the Id and Industry for every Account
query_results = sf.query('''
    select
        Id,
        Industry
    from Account
''')
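If you only want to touch a subset of records, you can build the SOQL string with a Python variable before passing it to sf.query. This is a sketch: the where clause and the target_industry variable are my own additions, not part of the original query.

```python
# build a SOQL query that filters on a Python variable
target_industry = 'Transportation'
soql = '''
    select
        Id,
        Industry
    from Account
    where Industry = '{}'
'''.format(target_industry)

# query_results = sf.query(soql)  # runs against your Salesforce org
```

Be careful to quote string values inside the SOQL, and only interpolate values you control.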
Step 2 – Convert the Results into a Pandas Dataframe
The code saves the query results to a variable called query_results. If you print out query_results, you'll notice that it prints as an ordered dictionary. What I like to do next is to convert that to a dataframe and make the changes.
query_results_df = pd.DataFrame(query_results['records']).drop(columns='attributes')
Next, we can leverage Pandas to modify the target column. In the example below, I am just changing the column to a different string value.
query_results_df['Industry'] = 'Airline'
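Setting the whole column to one value is the simplest case. More often I only want to change rows that match a condition, which Pandas handles with .loc. Here is a sketch using made-up sample data in place of the real query results:

```python
import pandas as pd

# hypothetical sample standing in for query_results_df
query_results_df = pd.DataFrame({
    'Id': ['001A', '001B', '001C'],
    'Industry': ['Airlines', 'Banking', 'Airlines'],
})

# only rewrite rows whose Industry matches the old value
query_results_df.loc[query_results_df['Industry'] == 'Airlines', 'Industry'] = 'Airline'
```

Rows that don't match the condition are left untouched, which keeps the update surgical.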
The JSON originally returned in query_results has a few extra columns that need to be excluded before the upload. I just create a new dataframe with only the target columns I need: typically the Id column and the column I am updating.
new_df = query_results_df[['Id','Industry']]
The column names need to match the Salesforce field API name. To find the API name, go to Salesforce and click on the Gear Icon > Setup. Then click on the Object Manager tab, click the target object, and open Fields & Relationships. Then click on the target field.
For custom fields, you’ll see an API Name. For standard fields, you won’t see an API Name and will just use the Field Name. We need the column name in Pandas to match either the API name or the Field Name. Here is how to rename dataframe columns:
new_df = new_df.rename(columns={"existing_name": "new_name", "existing_name2": "new_name2"})
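For example, if the dataframe column came back as industry but the field's API name is Industry__c (a hypothetical custom-field name used only for illustration), the rename would look like this:

```python
import pandas as pd

# hypothetical dataframe whose column does not yet match the Salesforce API name
new_df = pd.DataFrame({'Id': ['001A'], 'industry': ['Airline']})

# rename to the (assumed) API name of the custom field
new_df = new_df.rename(columns={'industry': 'Industry__c'})
```

rename returns a new dataframe by default, so reassigning it back to new_df keeps the change.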
Next, I convert the dataframe to JSON so that we can re-upload the data:
import json

# convert the dataframe to a JSON string
results_json = new_df.to_json(orient='records')
# parse the JSON string back into a list of dictionaries
accounts_to_update = json.loads(results_json)
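To see what the upload payload looks like, here is the same to_json / json.loads round trip on a small sample dataframe (the data is made up). The result is a list of dictionaries, one per record, keyed by column name:

```python
import json
import pandas as pd

# hypothetical sample standing in for new_df
sample_df = pd.DataFrame({'Id': ['001A', '001B'], 'Industry': ['Airline', 'Airline']})

sample_json = sample_df.to_json(orient='records')
records = json.loads(sample_json)
# records is now a list of plain dicts, e.g.
# [{'Id': '001A', 'Industry': 'Airline'}, {'Id': '001B', 'Industry': 'Airline'}]
```

That list-of-dicts shape is what gets handed to the bulk upload in the next step.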
Step 3 – Upload the Data back to Salesforce
Finally, we can take the modified JSON data and upload it back into Salesforce. The target object is named in the call itself (sf.bulk.Account.update), and the function takes the list of records to upload.
sf.bulk.Account.update(accounts_to_update, batch_size=10000, use_serial=True)
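Rather than eyeballing Salesforce afterwards, you can also scan the return value of the bulk call for failures. In my experience simple_salesforce returns one result row per record with a success flag and an errors list; treat that shape as an assumption and check against your own org. The results list below is fabricated sample data standing in for the real return value:

```python
# hypothetical results standing in for the return value of sf.bulk.Account.update(...)
results = [
    {'success': True, 'created': False, 'id': '001A', 'errors': []},
    {'success': False, 'created': False, 'id': None, 'errors': ['MALFORMED_ID']},
]

# collect any records that failed to update
failures = [row for row in results if not row['success']]
for row in failures:
    print('update failed:', row['errors'])
```

If failures is empty, every record went through.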
That’s it! Now you can check Salesforce to ensure the changes went through.