Installing Spark on your local machine can be a pain. In this post, I’ll show you how to install Spark on Google Colab so that you can easily get going with PySpark.
Run this code in a Google Colab cell to install Spark and start a session:
# Install Java 8 and download and unpack Spark 3.2.0
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
!tar xf spark-3.2.0-bin-hadoop3.2.tgz

# Install the Python helpers
!pip install -q findspark
!pip install pyspark

# Point the environment at the Java and Spark installs
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

# Let findspark locate the Spark install
import findspark
findspark.init()
findspark.find()

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

# Start (or reuse) a local Spark session
spark = SparkSession \
    .builder \
    .appName("Our First Spark example") \
    .getOrCreate()

spark
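Once that cell finishes, the final spark line should display the session details. As a quick sanity check (the names and values below are just made up for illustration), you can build a tiny DataFrame and display it:

# Build a tiny DataFrame to confirm the session works
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.show()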
Why Use PySpark?
For me, the switch to PySpark happened when my Pandas functions became too slow and I started running out of RAM just loading a data file.
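As a rough sketch of what that switch looks like (the file path and the "amount" column are hypothetical), Spark's transformations are lazy, so the heavy lifting only happens when you ask for a result:

import pyspark.sql.functions as F

# Hypothetical large file -- swap in your own path and column names
df = spark.read.csv("/content/big_file.csv", header=True, inferSchema=True)

# Transformations are lazy: this line only builds a query plan
over_100 = df.filter(F.col("amount") > 100)

# The count() action is what actually triggers the computation
over_100.count()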
Is Spark Hard to Learn?
Thanks to the Spark SQL module, you can run transformations using SQL syntax instead of having to learn Spark's own syntax. In my view, this has made Spark much more approachable, since you have fewer Spark-specific functions to learn.
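As a small illustration (the table, columns, and query below are just for demonstration), you register a DataFrame as a temporary view and then query it with plain SQL:

# Register a small DataFrame as a temporary SQL view
people_df = spark.createDataFrame(
    [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
    ["name", "dept", "salary"],
)
people_df.createOrReplaceTempView("people")

# Query it with familiar SQL instead of chained DataFrame methods
spark.sql("""
    SELECT dept, AVG(salary) AS avg_salary
    FROM people
    GROUP BY dept
""").show()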
Check out this Google Colab Notebook, which includes the setup code above along with some starter code for analyzing data.
Final Thoughts
One alternative to Google Colab is the Databricks community edition, where you don't have to install Spark at all. However, in my opinion, you can do more for free with Google Colab, and you get the benefit of keeping your work private.
For more Python tips and tricks, check out my recent Python Posts.
Thanks for reading!