Spark on YARN Deployment Example
Published: 2019-06-26


Environment Preparation

1. Server role assignment

ip            hostname   role
10.8.26.197   server1    primary NameNode (NameNode)
10.8.26.196   server2    secondary NameNode (SecondaryNameNode)
10.8.26.196   server2    data node (DataNode)

2. Software

  • jdk1.8.0_102

  • scala-2.11.0

  • hadoop-2.7.0

  • spark-2.0.2-bin-hadoop2.7: the matching Scala version must be from the 2.11.x line

3. HOSTS setup

Add the following entries to the "/etc/hosts" file on every server:

10.8.26.197   server1
10.8.26.196   server2

4. Passwordless SSH login
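
A minimal sketch of one common way to set this up, assuming the root account is used on every node and the hostnames from the table above (repeat from each server that needs to log in to the others):

# ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa    # generate a key pair (accept the defaults)
# ssh-copy-id root@server1
# ssh-copy-id root@server2
# ssh-copy-id root@server3
# ssh server2 hostname                        # should return "server2" without a password prompt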


Hadoop YARN Distributed Cluster Configuration

Note: apply steps 1-8 identically on all nodes.

1. Environment variables

# vim /etc/profile

# hadoop env set
export HADOOP_HOME=/usr/local/hadoop-2.7.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native

# jdk env set
export JAVA_HOME=/usr/local/jdk1.8.0_102
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

# scala env set
export SCALA_HOME=/usr/local/scala-2.11.0

export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH

Make the variables take effect immediately:

# source /etc/profile
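
A quick optional check that the variables resolve to real installations (each command just prints a version):

# java -version
# scala -version
# hadoop version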

2. Create the required directories

# mkdir -p /data/hadoop/{pids,storage}
# mkdir -p /data/hadoop/storage/{hdfs,tmp}
# mkdir -p /data/hadoop/storage/hdfs/{name,data}

3. Configure core-site.xml

File: $HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://server1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/data/hadoop/storage/tmp</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.native.lib</name>
        <value>true</value>
    </property>
</configuration>

4. Configure hdfs-site.xml

File: $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>server2:9000</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/data/hadoop/storage/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/data/hadoop/storage/hdfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

5. Configure mapred-site.xml

File: $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>server1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>server1:19888</value>
    </property>
</configuration>

6. Configure yarn-site.xml

File: $HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>server1:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>server1:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>server1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>server1:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>server1:80</value>
    </property>
</configuration>

7. Configure hadoop-env.sh, mapred-env.sh, yarn-env.sh

Files:

  • $HADOOP_HOME/etc/hadoop/hadoop-env.sh

  • $HADOOP_HOME/etc/hadoop/mapred-env.sh

  • $HADOOP_HOME/etc/hadoop/yarn-env.sh

Add the following at the beginning of each of the three files:

export JAVA_HOME=/usr/local/jdk1.8.0_102
export CLASS_PATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export HADOOP_HOME=/usr/local/hadoop-2.7.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

8. Data node (slaves) configuration

# vim $HADOOP_HOME/etc/hadoop/slaves
server1
server2
server3

9. Basic Hadoop test

Working directory on the master: $HADOOP_HOME

# cd $HADOOP_HOME

When starting the cluster for the first time, run the following (on the primary NameNode):

# hdfs namenode -format
# sbin/start-dfs.sh
# sbin/start-yarn.sh
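
mapred-site.xml above points the JobHistory server at server1:10020/19888, but start-dfs.sh and start-yarn.sh do not launch it. If you want those addresses to be reachable, it can be started separately with the stock Hadoop 2.7 daemon script (run on server1):

# sbin/mr-jobhistory-daemon.sh start historyserver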

Check that the processes started correctly.

Primary NameNode - server1:

# jps
11842 Jps
11363 ResourceManager
10981 NameNode
11113 DataNode
11471 NodeManager

Secondary NameNode - server2:

# jps
7172 SecondaryNameNode
7252 NodeManager
7428 Jps
7063 DataNode

Data node - server3:

# jps
6523 NodeManager
6699 Jps
6412 DataNode

HDFS and MapReduce test

# hdfs dfs -mkdir -p /user/root
# hdfs dfs -put ~/text /user/root
# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount /user/root /user/out
16/12/30 14:01:51 INFO client.RMProxy: Connecting to ResourceManager at server1/10.8.26.197:8032
16/12/30 14:01:55 INFO input.FileInputFormat: Total input paths to process : 1
16/12/30 14:01:55 INFO mapreduce.JobSubmitter: number of splits:1
16/12/30 14:01:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1483076482233_0001
16/12/30 14:01:58 INFO impl.YarnClientImpl: Submitted application application_1483076482233_0001
16/12/30 14:01:58 INFO mapreduce.Job: The url to track the job: http://server1:80/proxy/application_1483076482233_0001/
16/12/30 14:01:58 INFO mapreduce.Job: Running job: job_1483076482233_0001
16/12/30 14:02:23 INFO mapreduce.Job: Job job_1483076482233_0001 running in uber mode : false
16/12/30 14:02:24 INFO mapreduce.Job:  map 0% reduce 0%
16/12/30 14:02:36 INFO mapreduce.Job:  map 100% reduce 0%
16/12/30 14:02:44 INFO mapreduce.Job:  map 100% reduce 100%
16/12/30 14:02:45 INFO mapreduce.Job: Job job_1483076482233_0001 completed successfully
16/12/30 14:02:46 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=242
        FILE: Number of bytes written=230317
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=493
        HDFS: Number of bytes written=172
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=7899
        Total time spent by all reduces in occupied slots (ms)=6754
        Total time spent by all map tasks (ms)=7899
        Total time spent by all reduce tasks (ms)=6754
        Total vcore-seconds taken by all map tasks=7899
        Total vcore-seconds taken by all reduce tasks=6754
        Total megabyte-seconds taken by all map tasks=8088576
        Total megabyte-seconds taken by all reduce tasks=6916096
    Map-Reduce Framework
        Map input records=8
        Map output records=56
        Map output bytes=596
        Map output materialized bytes=242
        Input split bytes=99
        Combine input records=56
        Combine output records=16
        Reduce input groups=16
        Reduce shuffle bytes=242
        Reduce input records=16
        Reduce output records=16
        Spilled Records=32
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=231
        CPU time spent (ms)=1720
        Physical memory (bytes) snapshot=293462016
        Virtual memory (bytes) snapshot=4158427136
        Total committed heap usage (bytes)=139976704
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=394
    File Output Format Counters
        Bytes Written=172

After the job finishes, check the output:

# hdfs dfs -ls /user/out
Found 2 items
-rw-r--r--   3 root supergroup          0 2016-12-30 14:02 /user/out/_SUCCESS
-rw-r--r--   3 root supergroup        172 2016-12-30 14:02 /user/out/part-r-00000
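
To print the actual word counts, cat the part file from the listing above:

# hdfs dfs -cat /user/out/part-r-00000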

The output can also be viewed through the web UI:

[Screenshot: Hadoop web UI]

Inspect HDFS status

# hdfs dfsadmin -report
# hdfs fsck / -files -blocks
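
If the report looks odd right after a restart, the NameNode may still be in safe mode; a quick check with the standard Hadoop 2.x command:

# hdfs dfsadmin -safemode get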

Web UI:

[Screenshot: HDFS web UI]

Routine cluster maintenance

# sbin/start-all.sh
# sbin/stop-all.sh
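
When only a single daemon needs attention, the per-daemon scripts shipped with Hadoop 2.7 avoid restarting the whole cluster; a sketch (run on the node that hosts the daemon):

# sbin/hadoop-daemon.sh stop datanode
# sbin/hadoop-daemon.sh start datanode
# sbin/yarn-daemon.sh stop nodemanager
# sbin/yarn-daemon.sh start nodemanager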

Monitoring page URLs
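
Based on the configuration above and Hadoop 2.7 defaults, the relevant pages should be:

ResourceManager web UI: http://server1:80 (yarn.resourcemanager.webapp.address)
NameNode web UI: http://server1:50070 (Hadoop 2.x default port)
JobHistory web UI: http://server1:19888 (mapreduce.jobhistory.webapp.address)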

Spark Distributed Cluster Configuration

Note: apply the same configuration on all nodes.

1. Spark configuration

Set the Spark environment variables

# vim /etc/profile

# spark env set
export SPARK_HOME=/usr/local/spark-2.0.2-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH
# source /etc/profile
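
Optionally confirm that the Spark binaries are now on the PATH (this prints the Spark and Scala versions):

# spark-submit --version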

Configure spark-env.sh

# cd $SPARK_HOME/conf
# mv spark-env.sh.template spark-env.sh
# vim spark-env.sh

## Add the following
export JAVA_HOME=/usr/local/jdk1.8.0_102
export SCALA_HOME=/usr/local/scala-2.11.0
export HADOOP_HOME=/usr/local/hadoop-2.7.0
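
For the YARN mode used later, Spark has to locate the Hadoop configuration. /etc/profile above already exports HADOOP_CONF_DIR, but it is also common to pin it in spark-env.sh so that non-login shells pick it up; a possible extra line (same path as configured earlier):

export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.0/etc/hadoop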

Configure the list of worker hostnames

# cd $SPARK_HOME/conf
# vim slaves
server1
server2
server3

Other configuration

# cd $SPARK_HOME/conf
# mv log4j.properties.template log4j.properties

Run on the Master node:

# cd $SPARK_HOME && sbin/start-all.sh

Check that the processes have started.

A "Master" process should appear on the master node and a "Worker" process on each slave node.

Master node:

[root@server1 spark-2.0.2-bin-hadoop2.7]# jps
11363 ResourceManager
10981 NameNode
13176 Master
13256 Worker
11113 DataNode
13435 Jps
11471 NodeManager

Slave node:

[root@server2 conf]# jps
7172 SecondaryNameNode
7252 NodeManager
7063 DataNode
8988 Worker
9133 Jps

Tests

Monitoring page URL

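By default the Spark standalone Master web UI listens on port 8080, which for this cluster would be http://server1:8080.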

Switch to the "$SPARK_HOME" directory.

1. Local mode

# ./bin/run-example --master local[*] org.apache.spark.examples.SparkPi 10

2. Standalone cluster mode

# ./bin/run-example --master spark://10.8.26.197:7077 org.apache.spark.examples.SparkPi 10
# ./bin/run-example --master spark://10.8.26.197:7077 org.apache.spark.examples.SparkLR

3. Cluster mode with HDFS

Working directory: $SPARK_HOME
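
Before the interactive session below, upload an input file to HDFS and start spark-shell against the standalone master configured above (using Spark's own README.md as sample input is an assumption, not part of the original post):

# hdfs dfs -put README.md /user/root
# ./bin/spark-shell --master spark://10.8.26.197:7077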

scala> val file = sc.textFile("hdfs://server1:9000/user/root/README.md")
file: org.apache.spark.rdd.RDD[String] = hdfs://server1:9000/user/root/README.md MapPartitionsRDD[5] at textFile at <console>:24

scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:26

scala> count.collect()
res3: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (have,1), (pre-built,1), (YARN,,1), (locally,2), (changed,1), (locally.,1), (sc.parallelize(1,1), (only,1), (Configuration,1), (This,2), (basic,1), (first,1), (learning,,1), ([Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse),1), (documentation,3), (graph,1), (Hive,2), (several,1), (["Specifying,1), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), ([project,2), (prefer,1), (SparkPi,2), (,1), (engine,1), (version,1), (file,1), (documentation...

scala> :quit

4. YARN mode

# ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar \
    10

Execution result (written to the container's stdout log):

$HADOOP_HOME/logs/userlogs/application_*/container*_***/stdout
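
If YARN log aggregation is enabled (it is not turned on in the yarn-site.xml above), the same output can also be fetched from any node with the standard yarn CLI; replace the placeholder with the application id printed by spark-submit:

# yarn logs -applicationId <applicationId>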

Monitoring page

Reposted from: http://ggqwl.baihongyu.com/
