Spark Dataframes and MLlib

By Lucas | August 24, 2015

NOTE: I have created an updated version of my Python Spark Dataframes tutorial that is based on Spark 2.1 and uses an easier, updated Spark ML API. I would encourage readers to check that out over this older post.

A couple of months ago, I got my first experience with Apache Spark. I am only just starting to apply it to meaningful problems, but in my experience with new tools and technologies, just getting one’s feet wet can be crucial to getting a learning snowball rolling. Although Spark is primarily used for “big data” problems on clusters, I have been experimenting with a very “small data” problem: a simple linear regression on California home prices. You can find the data set here. I’ve decided to put the resulting tutorial up on this blog. Although there is nothing earth-shattering in this post, I think some people will find it helpful for the following reasons.

The method I used for working with the data is dataframes. Dataframes are a relatively new paradigm in Spark. They have only been available since Spark 1.3, which was released in March 2015.
I am using the Python API. While I suspect that PySpark is going to grow rapidly in popularity, there seem to be more resources for Scala at this time.
I could find very few tutorials or even significant Q&A threads about using PySpark syntax and dataframes on Stack Overflow. That gives me cause to believe that even this simple tutorial about reading a CSV into Spark, doing some trivial data wrangling with dataframes, and performing a linear regression could be helpful to some individuals.

When I originally ran this post in early July of 2015, I had written the code in this Jupyter notebook on a VM running Spark 1.3.1. Since then, I’ve installed Ubuntu on my PC and upgraded to the latest version of Spark (as of August 2015), version 1.4.1. I’ve made some significant changes to this post, partly because I have learned a little more about Spark, and partly because I can now take advantage of the latest features in Spark. Spark is evolving rapidly, with a major release every few months.