We are always in a hurry. Docker lets us run a lot of software without installing it, so no time is wasted on installation. After all, what we need to learn is the end result, not the installation steps. So why not let Docker take care of it?
What Docker provides:
- A closed network subnet where the docker containers can talk to each other.
- A default gateway for the containers for outbound traffic.
- Pre-installed images: you literally need to do nothing.
What you need:
- Docker and docker-compose (1.9.0 or higher) installed.
- A Linux box. I suppose this can also be done on Windows, but that area is unexplored here.
- Create a directory at any location; its name will be used for your project.
- $ mkdir ScalaCluster ; cd ScalaCluster
- Copy the following content to docker-compose.yml:
command: start-spark master
command: start-spark worker master
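A complete docker-compose.yml would look something like the sketch below. The two `command` lines above are the essential parts; the service layout, `links`, worker resource settings, and published ports here are my assumptions, chosen to match the port mappings visible in the `docker ps` output further down:

```yaml
version: "2"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:              # publish the master's UI and submission ports on the host
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"

  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:        # illustrative resource limits per worker
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
```

Note that the worker publishes no host ports, which is what allows it to be scaled out to multiple instances later.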
- Run the following command to start the containers:
- $ docker-compose up -d
- Your 2-node cluster is up and running.
- You should see something like this:
$ docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED          STATUS          PORTS                                                                                                                                                                     NAMES
3f63c7601c93   singularities/spark   "start-spark worke..."   10 minutes ago   Up 10 minutes   6066/tcp, 7077/tcp, 8020/tcp, 8080-8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50070/tcp, 50470/tcp                                                              spark_worker_1
843460507f96   singularities/spark   "start-spark master"     10 minutes ago   Up 10 minutes   0.0.0.0:6066->6066/tcp, 7077/tcp, 0.0.0.0:7070->7070/tcp, 8020/tcp, 8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 0.0.0.0:8080->8080/tcp, 19888/tcp, 0.0.0.0:50070->50070/tcp, 50470/tcp   spark_master_1
- If you wish to expand your cluster to more nodes, you can scale it out using the following command:
- $ docker-compose scale worker=2
- You should be able to see an additional container, as below:
$ docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED          STATUS          PORTS                                                                                                                                                                     NAMES
31488cf276ee   singularities/spark   "start-spark worke..."   7 seconds ago    Up 5 seconds    6066/tcp, 7077/tcp, 8020/tcp, 8080-8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50070/tcp, 50470/tcp                                                              spark_worker_2
3f63c7601c93   singularities/spark   "start-spark worke..."   10 minutes ago   Up 10 minutes   6066/tcp, 7077/tcp, 8020/tcp, 8080-8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50070/tcp, 50470/tcp                                                              spark_worker_1
843460507f96   singularities/spark   "start-spark master"     10 minutes ago   Up 10 minutes   0.0.0.0:6066->6066/tcp, 7077/tcp, 0.0.0.0:7070->7070/tcp, 8020/tcp, 8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 0.0.0.0:8080->8080/tcp, 19888/tcp, 0.0.0.0:50070->50070/tcp, 50470/tcp   spark_master_1
Your 3-node cluster is up and running now.
How to connect to your cluster (Scala):
spark-shell is the primary way to connect to your Spark cluster. The services are up and running, but they live inside the Docker network. Hence, you will have to install a Spark client on your local machine and connect using that. Once installed, you can connect to the dockerized Spark cluster by passing the master connection info to spark-shell.
To connect to your master, you need to figure out the IP address on which the master container is running. To find it, you can use docker inspect:
$ docker inspect 843460507f96 | grep IPAddress
            "SecondaryIPAddresses": null,
            "IPAddress": "",
                    "IPAddress": "172.18.0.2",
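If you prefer to avoid grepping, docker inspect also accepts a Go-template format string. For containers attached to a compose-created network, the address sits under NetworkSettings.Networks rather than the top-level IPAddress field (which is why that field shows up empty above), so something along these lines should print just the IP:

```shell
# Print only the container's IP address; the container ID is the one
# from the docker ps example above.
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 843460507f96
```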
Use this IP address to connect using spark-shell. You get spark-shell by installing a Spark client wherever you wish. I did it by extracting the tarball (Spark 2.1.0) and simply used its spark-shell to connect:
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
$ tar xvzf spark-2.1.0-bin-hadoop2.7.tgz
Go to the bin directory of the freshly extracted tarball and connect to the Spark cluster using the master IP obtained above:
$ cd spark-2.1.0-bin-hadoop2.7/bin
$ ./spark-shell --master spark://172.18.0.2:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/28 19:57:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/28 19:57:07 WARN Utils: Your hostname, mean-machine resolves to a loopback address: 127.0.0.1; using 10.20.4.35 instead (on interface eth0)
17/02/28 19:57:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/28 19:57:28 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/28 19:57:28 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/02/28 19:57:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.20.4.35:4040
Spark context available as 'sc' (master = spark://172.18.0.2:7077, app id = app-20170228142708-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
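Once you reach the scala> prompt, you can sanity-check that jobs actually run on the dockerized workers with a tiny job; sc is pre-bound to the SparkContext by the shell. A minimal sketch (the range and partition count are arbitrary):

```scala
// Distribute a range of numbers across the cluster as an RDD.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// A simple action forces a real job to be scheduled on the workers.
val total = rdd.sum()    // 500500.0

// Confirm the shell is attached to the dockerized master.
println(sc.master)       // spark://172.18.0.2:7077
```

You can also watch the job appear in the master's web UI at http://localhost:8080 (published by the compose setup) while it runs.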
You are good to go ahead … I hope this helped.
Do share your experiences with these steps.
The next step would be running the Spark client inside a Docker container as well. But that has some time to come; until then, I can go ahead with the externally installed Spark client.