Running spark-1.6.0 on Yarn



Table of contents

1. Conventions
2. Install Scala
2.1. Download
2.2. Installation
2.3. Set environment variables
3. Install Spark
3.1. Download
3.2. Installation
3.3. Configuration
3.3.1. Modify conf/spark-env.sh
4. Start Spark
4.1. Run the built-in example
4.2. Spark SQL Cli
5. Integrate with Hive
6. Common errors
6.1. Error 1: unknown queue: thequeue
6.2. SPARK_CLASSPATH was detected
7. Related documents

1. Conventions

This article assumes that Hadoop 2.7.1 is installed in /data/hadoop/current and that Spark 1.6.0 is installed in /data/hadoop/spark, where /data/hadoop/spark is a soft link to /data/hadoop/spark-1.6.0-bin-hadoop2.6.

Spark's official website is: http://spark.apache.org/ (Shark's official website is: http://shark.cs.berkeley.edu/; Shark has become a module of Spark and no longer needs to be installed separately).

This article runs Spark in cluster mode and does not cover client mode.

2. Install Scala

Martin Odersky of the École Polytechnique Fédérale de Lausanne (EPFL) began designing Scala in 2001, building on his earlier work on Funnel.

Scala is a multi-paradigm programming language designed to integrate the features of pure object-oriented programming and functional programming. It runs on the Java Virtual Machine (JVM), is compatible with existing Java programs, and can call Java class libraries. Scala includes a compiler and class libraries and is released under the BSD license.

2.1. Download

Spark is developed in Scala, so Scala needs to be installed on every node before installing Spark. Scala's official website is http://www.scala-lang.org/, and the download URL is http://www.scala-lang.org/download/. This article downloads the binary installation package scala-2.11.7.tgz.

2.2. Installation

This article installs Scala in /data/scala as the root user (a non-root user works too; it is recommended to plan this in advance), where /data/scala is a soft link to /data/scala-2.11.7.

The installation method is very simple, upload scala-2.11.7.tgz to the /data directory, and then decompress scala-2.11.7.tgz in the /data/ directory.

Next, create a soft link: ln -s /data/scala-2.11.7 /data/scala.
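Putting the steps together, a minimal sketch (assuming scala-2.11.7.tgz has already been uploaded to /data):

cd /data
tar xzf scala-2.11.7.tgz                # unpacks to /data/scala-2.11.7
ln -s /data/scala-2.11.7 /data/scala    # version-independent path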

2.3. Set environment variables

After Scala is installed, you need to add it to the PATH environment variable. You can directly modify the /etc/profile file and add the following content:

export SCALA_HOME=/data/scala

export PATH=$SCALA_HOME/bin:$PATH
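To make the change take effect in the current shell and check the result (a minimal verification sketch):

source /etc/profile
scala -version    # should report something like: Scala code runner version 2.11.7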

3. Install Spark

Spark is installed as a non-root user. This article installs it as the hadoop user.

3.1. Download

This article downloads the binary installation package, which is the recommended approach; otherwise you have to deal with compiling from source. The download URL is: http://spark.apache.org/downloads.html. This article downloads spark-1.6.0-bin-hadoop2.6.tgz, which can run directly on YARN.

3.2. Installation

1) Upload spark-1.6.0-bin-hadoop2.6.tgz to the directory /data/hadoop

2) Unzip: tar xzf spark-1.6.0-bin-hadoop2.6.tgz

3) Create a soft link: ln -s spark-1.6.0-bin-hadoop2.6 spark

To run Spark on YARN, you do not need to install Spark on every machine; installing it on only one machine is enough. However, Spark jobs can only be submitted from a machine where it is installed, for the simple reason that the files invoking Spark are needed there.

3.3. Configuration

3.3.1. Modify conf/spark-env.sh

You can make a copy of spark-env.sh.template as conf/spark-env.sh, and then add the following content:

HADOOP_CONF_DIR=/data/hadoop/current/etc/hadoop
YARN_CONF_DIR=/data/hadoop/current/etc/hadoop
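A minimal sketch of this step, run from the Spark installation directory:

cd /data/hadoop/spark
cp conf/spark-env.sh.template conf/spark-env.sh
cat >> conf/spark-env.sh <<'EOF'
HADOOP_CONF_DIR=/data/hadoop/current/etc/hadoop
YARN_CONF_DIR=/data/hadoop/current/etc/hadoop
EOF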

4. Start Spark

Since Spark runs on YARN, there is no separate step for starting Spark. Instead, when the spark-submit command is executed, the Spark job is scheduled and run by YARN.

4.1. Run the built-in example

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue default \
    lib/spark-examples*.jar 10

Running output:

16/02/03 16:08:33 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:34 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:35 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:36 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:37 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:38 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:39 INFO yarn.Client: Application report for application_1454466109748_0007 (state: RUNNING)
16/02/03 16:08:40 INFO yarn.Client: Application report for application_1454466109748_0007 (state: FINISHED)
16/02/03 16:08:40 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.225.168.251
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1454486904755
     final status: SUCCEEDED
     tracking URL: http://hadoop-168-254:8088/proxy/application_1454466109748_0007/
     user: hadoop
16/02/03 16:08:40 INFO util.ShutdownHookManager: Shutdown hook called
16/02/03 16:08:40 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7fc8538c-8f4c-4d8d-8731-64f5c54c5eac
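Note that in cluster mode the result line printed by SparkPi ("Pi is roughly ...") ends up in the driver's container log on YARN, not on the local console. A sketch for retrieving it, assuming YARN log aggregation is enabled and using the application ID from the report above:

yarn logs -applicationId application_1454466109748_0007 | grep "Pi is roughly"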

4.2. Spark SQL Cli

You can enter the Spark SQL Cli interactive interface by running the command below. To run it on YARN, the parameter --master must be set to yarn (note that --deploy-mode cluster is not supported, i.e. the Cli can only run on YARN in client mode):

./bin/spark-sql --master yarn

Why can the Spark SQL Cli only run in client mode? It is easy to understand: the Cli is interactive and you need to see its output locally, which cluster mode cannot provide, because in cluster mode the machine on which the ApplicationMaster (and hence the driver) runs is dynamically determined by YARN.
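As a quick illustration (a sketch; the exact error wording may vary across Spark versions), forcing cluster mode is rejected by spark-submit before anything is launched:

./bin/spark-sql --master yarn --deploy-mode cluster
# Expected to fail with an error along the lines of:
# "Error: Cluster deploy mode is not applicable to Spark SQL shell."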

5. Integrate with Hive

Spark integrating Hive is very simple, just the following steps:

1) Add HIVE_HOME to spark-env.sh, for example: export HIVE_HOME=/data/hadoop/hive

2) Copy Hive’s hive-site.xml and hive-log4j.properties files to Spark’s conf directory.

After that, execute spark-sql again to enter Spark's SQL Cli, and run the command show tables to see the tables created in Hive.

Example:

./spark-sql --master yarn --driver-class-path /data/hadoop/hive/lib/mysql-connector-java-5.1.38-bin.jar
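For a non-interactive check, the same kind of query can be passed on the command line via -e (a minimal sketch, reusing the driver classpath from the example above):

./spark-sql --master yarn \
    --driver-class-path /data/hadoop/hive/lib/mysql-connector-java-5.1.38-bin.jar \
    -e "show tables"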

6. Common Errors

6.1. Error 1: unknown queue: thequeue

Running:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    --driver-memory 4g --executor-memory 2g --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar 10

reports the error below; the fix is simply to change "--queue thequeue" to "--queue default" (the corrected command is shown after the log).

16/02/03 15:57:36 INFO yarn.Client: Application report for application_1454466109748_0004 (state: FAILED)
16/02/03 15:57:36 INFO yarn.Client:
     client token: N/A
     diagnostics: Application application_1454466109748_0004 submitted by user hadoop to unknown queue: thequeue
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: thequeue
     start time: 1454486255907
     final status: FAILED
     tracking URL: http://hadoop-168-254:8088/proxy/application_1454466109748_0004/
     user: hadoop
16/02/03 15:57:36 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1454466109748_0004
Exception in thread "main" org.apache.spark.SparkException: Application application_1454466109748_0004 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1029)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1076)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/02/03 15:57:36 INFO util.ShutdownHookManager: Shutdown hook called
16/02/03 15:57:36 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-54531ae3-4d02-41be-8b9e-92f4b0f05807

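For reference, the corrected submission, identical to the example in section 4.1:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    --driver-memory 4g --executor-memory 2g --executor-cores 1 \
    --queue default \
    lib/spark-examples*.jar 10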

6.2. SPARK_CLASSPATH was detected

SPARK_CLASSPATH was detected (set to '/data/hadoop/hive/lib/mysql-connector-java-5.1.38-bin.jar:').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

This means that setting the environment variable SPARK_CLASSPATH in spark-env.sh is no longer recommended; replace it with the recommended approach below:

./spark-sql --master yarn --driver-class-path /data/hadoop/hive/lib/mysql-connector-java-5.1.38-bin.jar
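Alternatively, the two classpaths named in the warning can be set once in conf/spark-defaults.conf instead of on every command line (a sketch, assuming the connector jar exists at the same path on every node):

# conf/spark-defaults.conf
spark.driver.extraClassPath   /data/hadoop/hive/lib/mysql-connector-java-5.1.38-bin.jar
spark.executor.extraClassPath /data/hadoop/hive/lib/mysql-connector-java-5.1.38-bin.jar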

7. Related documents

"HBase-0.98.0 Distributed Installation Guide"

"Hive0. 12.0 Installation Guide"

"ZooKeeper-3.4.6 Distributed Installation Guide"

"Hadoop2.3.0 Source Code Reverse Engineering"

"Compiling Hadoop-2.4 on Linux .0》

《Accumulo-1.5.1 Installation Guide》

《Drill1.0.0 Installation Guide》

《Shark0.9.1 Installation Guide》

For more, please pay attention to the technology blog: http://aquester.culog.cn.

