Step 1: Install JDK

Because Apache Spark depends on the Java Runtime Environment, you need to have the JDK installed on your machine. Download JDK.

For a Windows machine, you also need to set JAVA_HOME after you install the JDK:

  • Right-click the My Computer icon on your desktop and select Properties
  • Click the Advanced tab
  • Click the Environment Variables button
  • Under System Variables, click New
  • Enter the variable name as JAVA_HOME
  • Enter the variable value as the installation path for the Java Development Kit
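
If you prefer the command line, the same variable can be set from a Command Prompt. The JDK path below is only an example, so substitute your actual install location; note that setx only takes effect in newly opened windows:

  • setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0"

You can then verify the Java install by running java -version in a new Command Prompt.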

Step 2: Download and test Apache Spark

  • Download Spark from the Apache Spark website: http://spark.apache.org/downloads.html
  • Untar the file
  • Test PySpark
  • cd spark-1.4.1-bin-hadoop2.6
  • ./bin/pyspark
  • If everything goes well, you will see the PySpark interactive shell.
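
Once the shell is up, you can run a quick smoke test (a minimal illustrative example; sc is the SparkContext that the PySpark shell creates for you automatically):

  • sc.parallelize([1, 2, 3, 4]).count()

If the job runs and returns 4, Spark is working locally.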

Step 3: Install Anaconda Scientific Python Distribution

Anaconda is a Python distribution with lots of popular data-science packages preinstalled. It also installs IPython Notebook (Jupyter) automatically. Download Anaconda.

Choose the graphical installer.

Open “Launcher” after installing Anaconda Python. Choose to update IPython Notebook and then launch it. You should now see IPython Notebook (Jupyter) open in your browser.
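
If you prefer the terminal over Launcher, the same notebook server can be started with a single command (this was the IPython-era command; newer installs use jupyter notebook instead):

  • ipython notebook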

Step 4: Run Spark in Notebook

You can follow the steps at the link below to configure the notebook; Spark will then be loaded automatically whenever you open a new notebook: http://thepowerofdata.io/configuring-jupyteripython-notebook-to-work-with-pyspark-1-4-0/

Alternatively, run a short script in notebook cells to start Spark. The first cell starts PySpark, and the second cell is a test for PySpark; a sketch of both cells follows.
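
This is a minimal sketch of the two cells, assuming Spark 1.4.1 was unpacked to ~/spark-1.4.1-bin-hadoop2.6; the py4j zip name below matches what that release ships under python/lib, so adjust both paths to your download:

    # Cell 1: start PySpark from inside the notebook
    import os
    import sys

    spark_home = os.path.expanduser("~/spark-1.4.1-bin-hadoop2.6")  # adjust to your install
    os.environ["SPARK_HOME"] = spark_home

    # Make PySpark and its bundled py4j importable
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

    from pyspark import SparkContext
    sc = SparkContext("local[*]", "notebook")

    # Cell 2: test PySpark by counting the even numbers in 0..99 (expect 50)
    sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()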

  • Contact Us

    emails:

    james.shanahan_at_gmail.com

    liangdai16_at_gmail.com