Spark with Python Notebook on Mac


25 April 2015

First things first…

To use Spark, we first need to configure the Hadoop ecosystem components YARN and HDFS. This can be done by following my previous tutorial, Installing Hadoop on Yosemite.

Install Homebrew

Found here: or simply paste this inside the terminal

ruby -e "$(curl -fsSL"

To install Spark

brew install apache-spark

This installs Spark to the directory /usr/local/Cellar/apache-spark/1.2.0/ (the version in the path depends on the release Homebrew installs).

Create Python HDFS directory and dataset

Create the HDFS directory we will use for input and output.

hdfs dfs -mkdir /Python

Download a book for Word Count

mv 30760-0.txt book.txt
hdfs dfs -put book.txt /Python/
hdfs dfs -ls /Python/
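If you would rather script the HDFS staging steps than type them, something along these lines could work. This is only a sketch: the function names `staging_commands` and `stage` are made up for illustration, and it assumes the `hdfs` command is on your PATH and that `book.txt` is in the current directory. It defaults to a dry run that just prints the commands.

```python
import subprocess


def staging_commands(local_file, hdfs_dir="/Python"):
    """Return the hdfs CLI calls used above as argument lists (hypothetical helper)."""
    target = hdfs_dir.rstrip("/") + "/"
    return [
        ["hdfs", "dfs", "-mkdir", hdfs_dir],          # create the HDFS directory
        ["hdfs", "dfs", "-put", local_file, target],  # upload the local file
        ["hdfs", "dfs", "-ls", target],               # list the directory to verify
    ]


def stage(local_file, hdfs_dir="/Python", dry_run=True):
    """Print each command; run it too when dry_run is False."""
    for cmd in staging_commands(local_file, hdfs_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check_call raises CalledProcessError if the command exits non-zero
            subprocess.check_call(cmd)
```

Calling `stage("book.txt")` only prints the three commands; `stage("book.txt", dry_run=False)` actually runs them and stops at the first failure, for example if /Python already exists.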

Install Anaconda Python

We’ll install Anaconda for Python because it also includes IPython and other tools that make working with Python easy and enjoyable.

Download and install Anaconda Python from

Or the direct link if it works is,

Running iPython notebook

In the terminal, execute:

IPYTHON_OPTS="notebook --pylab inline" pyspark

This starts Python, creates the Spark context (available in the notebook as sc) that we will use to read from HDFS, and automatically opens a new browser window with the Python Notebook. In the top-right corner, click New Notebook.
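To check that the context came up as expected, a small helper like the one below can be run in the first notebook cell. It is a sketch rather than part of the original walkthrough: `describe_context` is a hypothetical name, and it assumes the pyspark shell has bound the SparkContext to `sc`.

```python
def describe_context(sc):
    """Collect a few identifying attributes from a SparkContext-like object."""
    return {
        "version": sc.version,    # Spark release string
        "master": sc.master,      # master URL the context is connected to
        "app_name": sc.appName,   # application name set by the shell
    }
```

In the notebook, evaluating `describe_context(sc)` should show the Spark version and the master you are connected to; a NameError on `sc` would suggest the notebook was not started through pyspark.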

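# Load the book from HDFS; each RDD element is one line of text
words = sc.textFile("hdfs://localhost:9000/Python/book.txt")

# Preview five lines that start with a space
words.filter(lambda w: w.startswith(" ")).take(5)

# Word count: split lines into words, pair each word with 1, sum per word
counts = words.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)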
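To see what the flatMap, map and reduceByKey chain above computes, here is a rough plain-Python equivalent that runs without Spark. It is a sketch for illustration only: `word_count` and `sample` are made-up names, and a dictionary stands in for the distributed per-key reduction.

```python
from itertools import chain


def word_count(lines):
    """Mirror flatMap -> map -> reduceByKey on a plain list of strings."""
    # flatMap: split every line on a single space and flatten the results
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: combine the values of equal keys with a + b
    counts = {}
    for word, n in pairs:
        counts[word] = counts[word] + n if word in counts else n
    return counts


sample = ["the quick brown fox", "the lazy dog", "the end"]
result = word_count(sample)
print(result["the"])
```

Two things to keep in mind. Splitting on a single space produces empty strings wherever a line contains consecutive spaces, so "" can show up as a word; `line.split()` with no argument would avoid that. Also, in Spark the `counts` RDD is lazy and nothing is computed until an action runs, for example `counts.takeOrdered(10, key=lambda kv: -kv[1])` to fetch the ten most frequent words.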



Python Notebook

You can view the Python Notebook here…

