Restoring or Installing Spark Cluster (From Backup)

  • Install Spark on all master and worker nodes using the procedure below
  • Extract the pre-installed and configured folders from the personal backup to the root of the C:\ drive (do not rename them, or the paths below will break)
    • spark-2.0.0-bin-hadoop2.7 for the Spark binaries
    • spark-warehouse for the Spark working folder
    • WinUtils for the Hadoop executables required by Spark on Windows
    • Hadoop_HDFS_HOME for the HDFS storage location on disk
  • Extract the additional pre-installed and configured folders from the personal backup to C:\Tools
    • apache-maven-3.3.9 for Apache Maven
    • sbt for the Scala Build Tool (sbt)
    • scala for the Scala language compiler
  • Create the system environment variables listed below so that the Java/Scala toolchain resolves correctly (make sure all files are in the same folders referenced here); a verification sketch follows the variable list
	ComSpec=C:\Windows\system32\cmd.exe
	GIT_HOME=C:\Program Files\Git
	GIT_SSH=C:\PuTTY\plink.exe
	GRADLE_HOME=C:\Tools\Gradle\v3.1
	GROOVY_HOME=C:\Tools\Groovy\v2.4.7
	HADOOP_HOME=C:\WinUtils
	JAVA_HOME=C:\Program Files\Java\jdk1.8.0_101
	M2_HOME=C:\Tools\apache-maven-3.3.9
	MAVEN_HOME=C:\Tools\apache-maven-3.3.9
	MSBUILD_PATH=C:\Program Files (x86)\MSBuild\14.0\Bin
	MSDEPLOYEXE=C:\Program Files\IIS\Microsoft Web Deploy V3\msdeploy.exe
	MSDEPLOYPATH=C:\Program Files\IIS\Microsoft Web Deploy V3\
	MYSQL_HOME=C:\Tools\WAMPx64\bin\mysql\mysql5.7.14
	SBT_HOME=C:\Tools\sbt\
	SCALA_HOME=C:\Tools\scala
	SPARK_HOME=C:\spark-2.0.0-bin-hadoop2.7
	SSL_CERT_FILE=C:\Tools\Ruby\v2.3.1\lib\ruby\2.3.0\rubygems\ssl_certs\AddTrustExternalCARoot-2048.pem

	Path=C:\ProgramData\Oracle\Java\javapath;C:\Tools\Python\v3.5.2\Scripts\;C:\Tools\Python\v3.5.2\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Skype\Phone\;C:\Users\sagupta\.dnx\bin;C:\Program Files\Microsoft DNX\Dnvm\;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;C:\Program Files\TortoiseGit\bin;C:\Program Files\TortoiseSVN\bin;C:\Program Files\Microsoft SQL Server\120\DTS\Binn\;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\120\Tools\Binn\;C:\Program Files\Microsoft SQL Server\120\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\120\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft SQL Server\120\DTS\Binn\;C:\Tools\NodeJS\;C:\Program Files\Java\jdk1.8.0_101\bin;C:\Program Files\Git\bin;C:\Tools\Gradle\v3.1\bin;C:\Tools\Groovy\v2.4.7\bin;C:\Tools\apache-maven-3.3.9\bin;C:\Tools\sbt\bin;C:\Tools\scala\bin;C:\Program Files (x86)\MSBuild\14.0\Bin;C:\Tools\scala\Bin;C:\spark-2.0.0-bin-hadoop2.7\Bin;C:\WinUtils\Bin;C:\Tools\apache-maven-3.3.9\bin;C:\Tools\WAMPx64\bin\mysql\mysql5.7.14\bin;C:\Tools\MySQL Utilities 1.6\;C:\Tools\Ruby\v2.3.1\bin;C:\Users\sagupta\AppData\Roaming\npm;C:\Tools\Microsoft VS Code\bin;C:\Users\sagupta\AppData\Local\atom\bin
	PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.PY;.PYW;.RB;.RBW;.groovy;.gy;.RB;.RBW
	SystemDrive=C:
	SystemRoot=C:\Windows
	TEMP=C:\Users\sagupta\AppData\Local\Temp
	TMP=C:\Users\sagupta\AppData\Local\Temp
	VBOX_MSI_INSTALL_PATH=C:\Program Files\Oracle\VirtualBox\
	VS140COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\Tools\
	windir=C:\Windows
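
  • To sanity-check the variables above, the following sketch can be pasted into spark-shell or any Scala REPL once the shell starts; the variable names are taken from the list above and can be adjusted as needed
// Print each Spark-related variable, or flag it if it is missing
val required = Seq("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "SCALA_HOME", "SBT_HOME", "MAVEN_HOME")
required.foreach { name =>
  sys.env.get(name) match {
    case Some(value) => println(s"$name = $value")
    case None        => println(s"$name is NOT set")
  }
}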

1A. Starting Distributed Cluster (1 Master, 2 Remote Workers)

  • Start 1 MASTER node and 2 SLAVE nodes to create a Spark cluster
    • Open 3 command prompts in admin mode and execute the following commands
      • MASTER NODE = spark-class.cmd org.apache.spark.deploy.master.Master --port 7077 --webui-port 8080
      • SLAVE NODE = spark-class.cmd org.apache.spark.deploy.worker.Worker spark://{MASTERIPADDR}:7077 --cores 2 --memory 2g
    • In the commands above
      • 7077 is the port on which the Spark master listens for worker nodes
      • 8080 is the port of the Spark master monitoring dashboard (open it in a browser)
      • {MASTERIPADDR} must be replaced with the IP address of the Spark master server
      • --cores 2 --memory 2g allocates 2 cores and 2 GB of RAM to each worker (a quick smoke test is sketched after this list)
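
  • Once the master and workers are up, a small job run from a shell connected to the master (see heading 3) confirms that the workers accept tasks; this is only a sketch, and the element count and partition count are arbitrary
// Distribute a small range across the cluster and count it back;
// completion confirms that the registered workers are executing tasks
val smoke = sc.parallelize(1 to 1000, 4)
println("count = " + smoke.count())
// The running application and its executors are also visible on the master dashboard (port 8080)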

1B. Starting Local Cluster (1 Master, nCore Local Workers)

  • Open a command prompt in admin mode to start a local[*] Spark cluster
  • This creates one local Spark master and as many local workers as there are available CPU cores
  • Run the command spark-shell to start the MASTER with WORKERS equal to the number of local cores; a programmatic sketch follows this list
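
  • For reference, a local[*] master can also be created programmatically; a minimal sketch using the Spark 2.0 SparkSession API (the application name is arbitrary)
import org.apache.spark.sql.SparkSession

// Create a local master with one worker thread per available CPU core
val spark = SparkSession.builder()
  .appName("LocalClusterExample") // arbitrary application name
  .master("local[*]")
  .getOrCreate()

println("Running Spark " + spark.version + " on " + spark.sparkContext.master)
spark.stop()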

2. Executing Code from Scala IDE

  • Start Scala IDE and open your project from the workspace (if not already open)
  • Run your Spark Scala code using "Right Click" -> "Run As" -> "Scala Application"; a minimal application skeleton is sketched below
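
  • A minimal application skeleton that can be run this way, assuming a cluster started as in heading 1A; this is a sketch rather than the project's actual code, and the master URL placeholder and file path must be adapted to your setup
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Point the application at the standalone master from heading 1A;
    // replace {MASTERIPADDR}, or use "local[*]" to run entirely inside the IDE
    val conf = new SparkConf()
      .setAppName("WordCountApp")
      .setMaster("spark://{MASTERIPADDR}:7077")
    val sc = new SparkContext(conf)

    // Same line-counting example as used with the Spark shell in heading 3
    val logData = sc.textFile("C:\\spark-2.0.0-bin-hadoop2.7\\README.md", 2).cache()
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println(s"Total lines with a: $numAs, Lines with b: $numBs")

    sc.stop()
  }
}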

3. Executing Code from Spark Shell

  • Open a command prompt in admin mode to start the Spark shell
  • Start the Spark shell with the command: spark-shell --master spark://{MASTERIPADDR}:7077
  • The shell will connect to the Spark master created under heading 1A
  • Type the following commands into the Spark shell; they will execute on the Spark cluster
// Path to a sample text file that ships with the Spark distribution
val logFile = "C:\\spark-2.0.0-bin-hadoop2.7\\README.md"
// Load the file as an RDD with 2 partitions and cache it in memory
val logData = sc.textFile(logFile, 2).cache()
// Count the lines containing the letter "a" and the letter "b"
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Total lines with a: %s, Lines with b: %s".format(numAs, numBs))

Open Scala Project In Scala IDE

  • A Scala project will not open directly in Scala IDE (or Eclipse with the Scala IDE plugin), because additional project metadata is required
  • Eclipse .project and .classpath files are needed (or the project must be created from the .scala files)
  • Generate them with the command sbt eclipse; first install sbt and add it to the system PATH
  • This approach works with: Eclipse Juno, Scala IDE 4.0, and ScalaTest
  • First add Scala compilation support to Eclipse. The easiest way is to download the Scala IDE bundle from the Scala IDE download page (matching your Scala version).
  • It comes pre-installed with ScalaTest. Alternatively, use the Scala IDE update site or Eclipse Marketplace.
  • Generate eclipse .project and .classpath files for each spark sub project using command: sbt eclipse or sbt eclipse with-source=true
  • The Eclipse plugin for sbt is required, which is available here
  • Import a specific project such as spark-core via File | Import | Existing Projects into Workspace in the IDE. Do not select "Copy projects into workspace".
  • If you want to develop on Scala 2.10 you need to configure a Scala installation for the exact Scala version that’s used to compile Spark.
  • At the time of writing that was Scala 2.10.4. Since Scala IDE bundles later versions (2.10.5 and 2.11.6), the version must be changed
  • To do this, add an installation in Eclipse Preferences -> Scala -> Installations, pointing to the lib/ directory of your Scala 2.10.4 distribution
  • Once this is done, select all Spark projects, right-click, choose Scala -> Set Scala Installation and point to the 2.10.4 installation
  • This should clear all errors about invalid cross-compiled libraries. A clean build should succeed now.
  • ScalaTest can execute unit tests by right-clicking a source file and selecting Run As | Scala Test (a minimal suite is sketched after this list)
  • If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed: --launcher.XXMaxPermSize 256M
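
  • For reference, a minimal ScalaTest suite that can be run this way; the class name and test cases are illustrative only
import org.scalatest.FunSuite

// A tiny suite; right-click the file and choose Run As | Scala Test to execute it
class ExampleSuite extends FunSuite {

  test("string contains the letter a") {
    assert("spark".contains("a"))
  }

  test("simple arithmetic") {
    assert(2 + 2 == 4)
  }
}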

Compiling Apache Spark itself from its Source Code

  • Download the Apache Spark source build v2.0.0 or the GitHub master branch, along with its pom.xml file
  • Note that the current master may break, or you may hit issues related to the Scala compiler version. I am currently using Spark 2.0.0 with Scala 2.11.8
  • Install Apache Maven and configure the system environment variables M2_HOME & MAVEN_HOME
  • Spark is built using apache maven. To build Spark and its example programs, run: build/mvn -DskipTests clean package
  • You can build Spark with more than one thread by passing the -T option to Maven for parallel builds (e.g. build/mvn -T 4 -DskipTests clean package)
  • More detailed documentation is available from the project site, at Building Spark.
  • For developing Spark using an IDE, see Eclipse and IntelliJ.

Install SBT Eclipse

  • The sbteclipse plugin lets you generate an Eclipse project from .scala files at the command prompt
  • It also lets you compile Scala projects from the command prompt without an Eclipse project
  • The steps below explain how to set it up
  • First you need to edit the following file
    • Mac/Linux: edit ~/.sbt/plugins/build.sbt
    • Windows: edit %userprofile%\.sbt\plugins\build.sbt
  • Add the following lines to the file above (the empty line between them is important)
  addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")

  addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0")
  • When adding these lines, make sure plugins such as sbteclipse-plugin use their latest GitHub release version
  • Generate the Eclipse project as needed using the command: sbt eclipse or sbt eclipse with-source=true
  • To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select "Copy projects into workspace". For IntelliJ IDEA use sbt gen-idea instead. This creates .settings, .classpath, and .project entries in your project; a minimal build.sbt that these commands operate on is sketched below
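
  • For orientation, a minimal build.sbt that sbt eclipse (or sbt gen-idea) can work against might look like the following; the project name and versions are illustrative and should match your own setup
// build.sbt -- minimal project definition for sbt eclipse / sbt gen-idea
name := "spark-example"

version := "1.0"

scalaVersion := "2.11.8"

// Spark 2.0.0 core library; "provided" assumes the jar is supplied by the cluster at run time
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"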