
Integrating Hadoop and Spark on Debian

小樊
2025-09-19 20:13:44
Category: Intelligent O&M

Prerequisites
Before integrating Hadoop and Spark on Debian, ensure you have:

  • Java JDK 8/11 (required by both frameworks): Install via sudo apt update && sudo apt install openjdk-11-jdk, and verify with java -version.
  • SSH Access: Set up passwordless SSH for the Hadoop/Spark user (e.g., hadoop) to enable cluster communication. Generate a key pair with ssh-keygen -t rsa and append the public key to ~/.ssh/authorized_keys (see the sketch after this list).
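
A minimal sketch of the passwordless-SSH setup, assuming a dedicated hadoop user on a single node (for a multi-node cluster you would copy the key to each worker with ssh-copy-id instead):

ssh-keygen -t rsa -b 4096                          # Accept the defaults; leave the passphrase empty
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
ssh localhost true                                 # Should now succeed without a password prompt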

1. Install and Configure Hadoop
Download Hadoop (e.g., 3.3.6) from the Apache website and extract it to /opt:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzvf hadoop-3.3.6.tar.gz -C /opt
ln -s /opt/hadoop-3.3.6 /opt/hadoop   # Create a symbolic link for easy access

Set environment variables in /etc/profile:

echo "export HADOOP_HOME=/opt/hadoop" >> /etc/profile echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> /etc/profile source /etc/profile 

Configure core Hadoop files in $HADOOP_HOME/etc/hadoop:

  • core-site.xml: Define the default file system (HDFS) and temporary directory.
    <configuration>
      <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
      <property><name>hadoop.tmp.dir</name><value>/opt/hadoop/tmp</value></property>
    </configuration>
  • hdfs-site.xml: Set HDFS replication (1 for local dev) and NameNode/data directories.
    <configuration>
      <property><name>dfs.replication</name><value>1</value></property>
      <property><name>dfs.namenode.name.dir</name><value>/opt/hadoop/hdfs/namenode</value></property>
      <property><name>dfs.datanode.data.dir</name><value>/opt/hadoop/hdfs/datanode</value></property>
    </configuration>
  • mapred-site.xml: Specify YARN as the MapReduce framework.
    <configuration>
      <property><name>mapreduce.framework.name</name><value>yarn</value></property>
    </configuration>
  • yarn-site.xml: Enable shuffle service and set ResourceManager hostname.
    <configuration>
      <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
      <property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
      <property><name>yarn.resourcemanager.hostname</name><value>localhost</value></property>
    </configuration>
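
In addition to the XML files above, Hadoop's start scripts typically expect JAVA_HOME to be set in hadoop-env.sh, because the non-interactive SSH sessions they open do not source /etc/profile. A minimal sketch, assuming the Debian amd64 path of the OpenJDK 11 package installed in the prerequisites (adjust to the output of readlink -f $(which java) on your system):

echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh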

Format HDFS (only once) and start services:

hdfs namenode -format
start-dfs.sh    # Start HDFS
start-yarn.sh   # Start YARN

Verify with hdfs dfsadmin -report (check DataNodes) and yarn node -list (check NodeManagers).
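
As a quick smoke test (the file name /user/hadoop/test.txt is just an illustration), create the user's HDFS home directory and write a small file into it:

hdfs dfs -mkdir -p /user/hadoop
echo "hello hadoop" | hdfs dfs -put - /user/hadoop/test.txt
hdfs dfs -ls /user/hadoop   # The file should be listed with replication factor 1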

2. Install and Configure Spark
Download Spark (e.g., 3.3.2) pre-built for Hadoop (e.g., spark-3.3.2-bin-hadoop3.tgz) and extract it to /opt:

wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -xzvf spark-3.3.2-bin-hadoop3.tgz -C /opt
ln -s /opt/spark-3.3.2-bin-hadoop3 /opt/spark   # Symbolic link

Set environment variables in /etc/profile:

echo "export SPARK_HOME=/opt/spark" >> /etc/profile echo "export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin" >> /etc/profile source /etc/profile 

Configure Spark to integrate with Hadoop:

  • spark-env.sh: Link to Hadoop’s config directory and add Hadoop jars to SPARK_DIST_CLASSPATH.
    echo "export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop" >> \$SPARK_HOME/conf/spark-env.sh echo "export SPARK_DIST_CLASSPATH=\$(\$HADOOP_HOME/bin/hadoop classpath)" >> \$SPARK_HOME/conf/spark-env.sh 
  • spark-defaults.conf (copy spark-defaults.conf.template in $SPARK_HOME/conf if the file does not exist yet): Set YARN as the cluster manager, HDFS as the default file system, and enable event logging (the log directory must exist in HDFS; see the note after this list).
    spark.master                yarn
    spark.hadoop.fs.defaultFS   hdfs://localhost:9000
    spark.eventLog.enabled      true
    spark.eventLog.dir          hdfs://localhost:9000/spark-logs
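
Because spark.eventLog.dir points into HDFS, the directory must exist before the first job is submitted, otherwise event logging fails at startup; a one-line sketch:

hdfs dfs -mkdir -p /spark-logs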

Optionally, start Spark’s standalone master and worker (not required when submitting jobs to YARN, but useful for testing the installation):

start-master.sh                          # Start the Spark Master (web UI at http://localhost:8080)
start-worker.sh spark://localhost:7077   # Start a Spark Worker (older Spark releases ship this script as start-slave.sh)
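
With HDFS, YARN, and the standalone daemons running, jps (shipped with the JDK) gives a quick overview of what is actually up on this single-node setup:

jps   # Expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, Master and Worker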

3. Integrate Hadoop and Spark
The key to integration is ensuring Spark can access Hadoop’s resources (HDFS, YARN). The above configurations achieve this by:

  • Pointing Spark to Hadoop’s config files (HADOOP_CONF_DIR).
  • Adding Hadoop’s classpath to Spark (SPARK_DIST_CLASSPATH).
  • Configuring Spark to use YARN as the cluster manager (spark.master yarn) and HDFS as the default FS (spark.hadoop.fs.defaultFS).

To validate the integration, submit a sample Spark job to YARN; a word count that reads from HDFS follows below:

/opt/spark/bin/run-example SparkPi 10   # Run a sample Spark job
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar 10
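
To exercise HDFS as well, the sketch below writes a small file into HDFS (the name /user/hadoop/words.txt and its contents are just an illustration) and runs the bundled JavaWordCount example against it on YARN:

echo "hello spark hello hadoop" | hdfs dfs -put - /user/hadoop/words.txt
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
  hdfs://localhost:9000/user/hadoop/words.txt
# The per-word counts are printed to the driver's stdout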

Troubleshooting Tips

  • Check Logs: If services fail to start, inspect logs in $HADOOP_HOME/logs or $SPARK_HOME/logs.
  • Version Compatibility: Use Spark versions pre-built for your Hadoop version (e.g., spark-3.3.2-bin-hadoop3 for Hadoop 3.x).
  • Firewall: Ensure the required ports are reachable (9000 for HDFS, 8088 for the YARN ResourceManager UI, 8080 for the Spark Master UI); see the quick checks after this list.
  • Permissions: Verify HDFS directories (e.g., /user/hadoop) exist and have correct permissions.
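
A couple of quick checks for the last two points, assuming the port numbers and the hadoop user configured above:

ss -tlnp | grep -E ':9000|:8088|:8080'      # Confirm HDFS, YARN and the Spark UI are listening
hdfs dfs -mkdir -p /user/hadoop             # Create the HDFS home directory if it is missing
hdfs dfs -chown hadoop:hadoop /user/hadoop  # And give the hadoop user ownership of it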
