Step 1: Install JDK

Because Apache Spark depends on the Java Runtime Environment, you need to have the JDK installed on your machine. Download the JDK from the Oracle website.

On a Windows machine, you also need to set JAVA_HOME after you install the JDK:

  • Right-click the My Computer icon on your desktop and select Properties
  • Click the Advanced tab
  • Click the Environment Variables button
  • Under System Variables, click New
  • Enter the variable name as JAVA_HOME
  • Enter the variable value as the installation path for the Java Development Kit
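After setting the variable, you can verify it from Python. Below is a minimal sketch (the helper name `check_java_home` is just for illustration):

```python
import os

def check_java_home(env):
    """Return True if JAVA_HOME is set and points at an existing directory."""
    java_home = env.get("JAVA_HOME", "")
    return bool(java_home) and os.path.isdir(java_home)

# Check the current process environment.
if check_java_home(os.environ):
    print("JAVA_HOME looks good:", os.environ["JAVA_HOME"])
else:
    print("JAVA_HOME is missing or does not point to a directory")
```

Note that you may need to open a new terminal (or re-login) for the environment variable to take effect.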

Step 2: Download and test Apache Spark

  • Download Spark from the Apache Spark website
  • Untar the file
  • Test PySpark:
  • cd spark-1.4.1-bin-hadoop2.6
  • ./bin/pyspark
  • If everything goes well, you will see the PySpark interactive shell
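Before launching the shell, you can sanity-check that the untarred directory has the expected layout. A hedged sketch (the directory name matches the 1.4.1 download above; `missing_spark_entries` is a hypothetical helper):

```python
import os

def missing_spark_entries(spark_dir):
    """Return the relative paths that an untarred Spark release should
    contain but that are missing from spark_dir."""
    expected = ["bin", os.path.join("bin", "pyspark"), "python"]
    return [p for p in expected
            if not os.path.exists(os.path.join(spark_dir, p))]

missing = missing_spark_entries("spark-1.4.1-bin-hadoop2.6")
if missing:
    print("Incomplete Spark directory, missing:", missing)
else:
    print("Spark layout looks complete; try ./bin/pyspark")
```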

Step 3: Install Anaconda Scientific Python Distribution

Anaconda is a Python distribution with lots of popular data science packages preinstalled. It also installs IPython Notebook (Jupyter) automatically. Download it from the Anaconda website.

Choose the graphical installer.


Open “Launcher” after installing Anaconda. Choose to update IPython Notebook and then launch it. You should now see the notebook (Jupyter) open in your browser.


Step 4: Run Spark in Notebook

You can follow the steps in the following link to configure the notebook; Spark will then be loaded automatically whenever you open a new notebook.

Alternatively, run a startup script in a notebook cell to start Spark.
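Such a startup cell adds the PySpark sources shipped with Spark to `sys.path` so that `import pyspark` works inside the notebook. A sketch, assuming `SPARK_HOME` points at the untarred directory from Step 2 (the py4j zip name varies by Spark release; the one below matches Spark 1.4.x):

```python
import os
import sys

def add_pyspark_to_path(spark_home):
    """Prepend the PySpark sources shipped with Spark to sys.path so that
    `import pyspark` works inside the notebook."""
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # The py4j version below matches Spark 1.4.1; adjust for other releases.
    sys.path.insert(0, os.path.join(spark_home, "python", "lib",
                                    "py4j-0.8.2.1-src.zip"))

# In a notebook cell you would then run something like:
# add_pyspark_to_path(os.environ["SPARK_HOME"])
# from pyspark import SparkContext
# sc = SparkContext("local[2]", "notebook")
```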


The first cell starts PySpark; the second cell tests that PySpark works.
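The test cell can be as simple as counting a small distributed dataset. A sketch, assuming a live SparkContext named `sc` (as created by the startup cell or the pyspark shell):

```python
def pyspark_smoke_test(sc):
    """Distribute the numbers 0..99 across Spark and count them back.
    A working installation returns 100."""
    return sc.parallelize(range(100)).count()

# In a notebook cell with a live SparkContext:
# print(pyspark_smoke_test(sc))  # expect 100
```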
