Google Colab is Google’s version of a Jupyter Notebook and takes advantage of the same features you find in other Google Apps to make Python coding easy.
Code in the cloud
Your Python code is automatically saved and backed up in Google Drive. You no longer have to worry about backing up your code, and your changes are tracked automatically. This is so helpful when you forget to push (or simply haven't pushed) to GitHub.
Easily access Google Drive files
With one line of code, you can easily access all your Drive files in a Colab notebook. Now you don't have to store large files on your machine; you can simply access them directly from your Google Drive.
Google Colab also works nicely with Google Sheets and Google BigQuery. A common workflow for me is to leverage Google Colab as a simple data pipeline, then store the final results in either Google Sheets or Google BigQuery, depending on the project. I think it's an easier option than exporting a CSV.
Share your work with others
Oftentimes, I find myself working with others who could benefit tremendously from Python automation, but they aren't technical and don't have a Python environment set up. With Google Colab, they don't need one on their local machine. They just open the notebook like any other Google Doc and run your code. This means you can set up automation for colleagues, and they can run the code whenever they need to. If you need to make changes, you can simply modify the notebook. No need to send them a new Python file.
Technical colleagues can easily collaborate with you, much like in other Google Docs. They can view your work and interact with the same visualizations if you're using an interactive charting library like Plotly.
You don't have to manage a local Python environment
I mentioned this before, but it's worth repeating. Your Python code is hosted in the cloud, so you don't have to worry about setting up a local Python environment. This gives me peace of mind knowing that if my computer crashes and I have to replace it, I can easily pick up where I left off on a different computer.
Add interactive widgets
Google Colab allows you to add what it calls Forms, which add a GUI layer to your notebook for changing inputs. Instead of editing a variable directly, you get a nicer interface to change the underlying input, and you can also put some validation around what goes into the variable. It's essentially like creating an interactive web application.
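As a quick illustration, here's a minimal sketch of a form-annotated cell (the variable names and options are just placeholders). Colab renders each #@param annotation as an input widget next to the cell:

```python
#@title Report settings
# Colab turns these annotations into form widgets next to the cell
start_date = "2024-01-01"  #@param {type:"date"}
region = "EMEA"  #@param ["EMEA", "APAC", "AMER"]
max_rows = 1000  #@param {type:"integer"}

print(start_date, region, max_rows)
```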
Helpful Google Colab Commands
Since Google Colab lives in Google Drive, there are some useful packages that integrate with other Google Apps and bring the power of Python to those applications.
Mount Google Drive
To access files from your Google Drive, you’ll have to mount it. There are two ways to mount.
Method #1 – Using a link to authenticate
In this method, you'll run the code below, which will prompt you to open a link and authenticate with your Google account. You have to use this method if your notebook is shared.
```python
from google.colab import drive

drive.mount('/content/drive')
```
Once authenticated, you’ll see your drive on the left-hand side!
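If you want to confirm the mount worked from code, a quick sanity check is to list the standard mount point (assuming the /content/drive path used above):

```python
# List the top level of your Drive to confirm the mount worked
!ls "/content/drive/MyDrive"
```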
Method #2 – Auto Mounting
In the Files area, you can click on the folder icon with the Drive logo to mount your drive. The benefit of this method is that it auto-mounts your notebook to Drive, so you only have to authenticate once; however, your notebook must remain private. The first method, by contrast, requires you to authenticate every time the notebook is initialized.
To access a file, just find it in the directory, click on the three dots, then click Copy path.
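From there, you can paste that path straight into your code. A minimal sketch, assuming a hypothetical CSV file stored in your Drive:

```python
import pandas as pd

# Hypothetical path copied from the Files sidebar
df = pd.read_csv('/content/drive/MyDrive/data/sales.csv')
df.head()
```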
Importing Environment Variables and Custom Python Files
Google Colab doesn't have a way for you to save environment variables, because your workspace is temporary: code runs on a virtual machine. To get around that, I find the easiest thing to do is create a separate Python file that holds the information. This way, even if you share the Colab notebook, someone would need access to your Google Drive to see your credentials.
```python
import sys

sys.path.append('/content/')  # Google Drive file location
import db_connection  # name of the Python file to import
```
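For reference, here's a sketch of what that hypothetical db_connection.py file might contain (the names and values are placeholders):

```python
# db_connection.py -- lives in Google Drive, not in the shared notebook
DB_HOST = "db.example.com"
DB_USER = "analytics"
DB_PASSWORD = "password123"
```

Your notebook can then reference db_connection.DB_PASSWORD without the value ever appearing in the shared notebook itself.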
If you still want to create a traditional environment variable, you can, but know that the credentials are exposed in your code:
```python
import os

os.environ["password"] = "password123"
```
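Any later cell can then read it back:

```python
import os

password = os.environ.get("password")  # returns None if the variable isn't set
```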
Connecting to Google Sheets
Using Python with Google Sheets is really powerful, and I find the connection especially useful for automating data clean-up steps. Google Sheets can get cumbersome when you try to perform a number of more complex transformations. Here are some of the benefits of working with Google Colab and Google Sheets:
- Leverage Python to make API Requests
- Easily connect to other data sources by using Python packages
- Automate data cleaning steps using Pandas
- Work with very large files (Google Sheets has a limit of 5 million cells per sheet)
- Speed up processing with Pandas (Google Sheets can become incredibly slow when running a formula on thousands of rows of data)
First, you have to authenticate using this code:
```python
from google.colab import auth
auth.authenticate_user()  # authenticate your Google account first

import gspread
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)
```
Importing from Google Sheets
Once authenticated, you’ll need a few things before importing your data from a Google Sheet:
- The URL of the Google Sheet (the address in the address bar)
- The Sheet name
- Column names for your DataFrame
Here’s an example:
```python
import pandas as pd

wb = gc.open_by_url('https://docs.google.com/spreadsheets...')
sheet = wb.worksheet('Sheet1')
rows = sheet.get_all_values()

# First row becomes the column names, the rest become the data
df = pd.DataFrame.from_records(rows[1:], columns=rows[0])
```
Writing to Google Sheets
I mostly use Google Sheets to dump my DataFrame. The bit of code I have below checks for two things:
- If the sheet name does not exist, then it creates a new sheet
- If the sheet exists, then it clears the sheet
```python
from gspread_dataframe import set_with_dataframe

sheet_name = 'Sheet1'
try:
    # If the sheet name doesn't exist, create a new sheet
    sheet = wb.add_worksheet(title=sheet_name, rows="10", cols="10")
except Exception:
    # The sheet already exists, so grab it and clear it
    sheet = wb.worksheet(sheet_name)
    sheet.clear()

set_with_dataframe(sheet, df)
```
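One caveat: gspread_dataframe may not be pre-installed in your Colab runtime. If the import fails, install it first:

```python
!pip install gspread-dataframe
```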
Uploading DataFrame to Google BigQuery
Eventually, your DataFrame will get too big for Google Sheets and its 5-million-cell limit. In that case, a good next place to store your DataFrame is Google BigQuery, Google's cloud data warehouse. There is a free tier, so you can get started without having to pay.
Here’s a sample DataFrame:
```python
import pandas as pd

df_test = pd.DataFrame([[4, 5, 6], [4, 5, 6], [4, 5, 6]], columns=['A', 'B', 'C'])
df_test
```
To load this data into BigQuery, there are a few things you have to do:
- Create a BigQuery account and set up billing (billing has to be set up just in case you go over the free tier limits)
- Enable the BigQuery API
- Create a Service Account and download the API key (you will need this to authenticate your account in Python)
This article walks through how to complete those steps.
For step #3, what I like to do is save that authentication file in Google Drive. After you have mounted Google Drive, you can then reference it in your code.
```python
# BigQuery
!pip install --upgrade google-cloud-bigquery
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'  # to return results from BQ into Pandas

import os
from google.cloud import bigquery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'path_to_key'
client = bigquery.Client()
```
Google Colab doesn't let you persist environment variables because each session is temporary. To get around that, you can simply save those values in a separate Python file and import that file, as described in the Importing Environment Variables and Custom Python Files section above.
The final thing you have to do before uploading is create the table where the data will go. The easiest way to do that is to create an empty dataset. You can do this directly in the Console (which I find pretty easy) or programmatically with some Python code. Here are those instructions.
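If you'd rather do it in code, here's a minimal sketch using the same BigQuery client (the project and dataset names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical "project.dataset" ID; substitute your own project and dataset names
dataset = bigquery.Dataset("my-project.my_dataset")
dataset.location = "US"

# exists_ok=True avoids an error if the dataset is already there
dataset = client.create_dataset(dataset, exists_ok=True)
```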
Now you are ready to upload data! There are two things you will pass into the code below:
- Your DataFrame
- Your Table id in BigQuery
```python
from google.cloud import bigquery

df = df_test  # set your DataFrame here
table_id = ""  # set your table ID here

# Construct a BigQuery client object.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # appends to the table
    # write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrites the table
)

# Make an API request and wait for the job to complete.
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
```
Final Thoughts
Check out more Python tricks in this Colab Notebook or in my recent Python Posts.
Thanks for reading!