Prerequisites
Before integrating Hadoop and Spark on Debian, ensure you have:
- Java: install it with sudo apt update && sudo apt install openjdk-11-jdk, and verify with java -version.
- Passwordless SSH (e.g., for a dedicated user named hadoop) to enable cluster communication. Generate keys with ssh-keygen -t rsa and copy the public key to authorized_keys.
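If passwordless SSH is not set up yet, the following is a minimal sketch for a dedicated hadoop user; the user name is an assumption, not a requirement:
sudo adduser hadoop                          # Create a dedicated service user (assumed name)
su - hadoop                                  # Switch to that user before generating keys
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa     # RSA key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit                           # Should succeed without a password prompt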
1. Install and Configure Hadoop
Download Hadoop (e.g., 3.3.6) from the Apache website and extract it to /opt:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzvf hadoop-3.3.6.tar.gz -C /opt
ln -s /opt/hadoop-3.3.6 /opt/hadoop   # Create a symbolic link for easy access
Set environment variables in /etc/profile:
echo "export HADOOP_HOME=/opt/hadoop" >> /etc/profile echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> /etc/profile source /etc/profile Configure core Hadoop files in $HADOOP_HOME/etc/hadoop:
Configure the core Hadoop files in $HADOOP_HOME/etc/hadoop:
core-site.xml:
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/opt/hadoop/tmp</value></property>
</configuration>
hdfs-site.xml:
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/opt/hadoop/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/opt/hadoop/hdfs/datanode</value></property>
</configuration>
mapred-site.xml:
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
yarn-site.xml:
<configuration>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>localhost</value></property>
</configuration>
Format HDFS (only once) and start the services:
hdfs namenode -format
start-dfs.sh    # Start HDFS
start-yarn.sh   # Start YARN
Verify with hdfs dfsadmin -report (check DataNodes) and yarn node -list (check NodeManagers).
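As a quick sanity check of HDFS itself (a sketch; the hadoop user name and file are assumptions), create the user's home directory in HDFS and copy a file into it:
hdfs dfs -mkdir -p /user/hadoop      # HDFS home directory for the assumed hadoop user
echo "hdfs is up" > /tmp/check.txt
hdfs dfs -put -f /tmp/check.txt /user/hadoop/
hdfs dfs -ls /user/hadoop            # check.txt should appear in the listing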
2. Install and Configure Spark
Download Spark (e.g., 3.3.2) pre-built for Hadoop (e.g., spark-3.3.2-bin-hadoop3.tgz) and extract it to /opt:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -xzvf spark-3.3.2-bin-hadoop3.tgz -C /opt
ln -s /opt/spark-3.3.2-bin-hadoop3 /opt/spark   # Symbolic link
Set environment variables in /etc/profile:
echo "export SPARK_HOME=/opt/spark" >> /etc/profile echo "export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin" >> /etc/profile source /etc/profile Configure Spark to integrate with Hadoop:
Next, configure Spark to integrate with Hadoop. In $SPARK_HOME/conf/spark-env.sh, point Spark at Hadoop's configuration and libraries via HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH:
echo "export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop" >> $SPARK_HOME/conf/spark-env.sh
echo "export SPARK_DIST_CLASSPATH=\$(\$HADOOP_HOME/bin/hadoop classpath)" >> $SPARK_HOME/conf/spark-env.sh
In $SPARK_HOME/conf/spark-defaults.conf, set YARN as the cluster manager and HDFS as the default filesystem:
spark.master                yarn
spark.hadoop.fs.defaultFS   hdfs://localhost:9000
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs://localhost:9000/spark-logs
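Because spark.eventLog.dir points at HDFS, that directory must exist before the first job runs; the path below simply mirrors the setting above:
hdfs dfs -mkdir -p /spark-logs     # Event-log directory referenced by spark.eventLog.dir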
Start Spark’s master and worker nodes:
start-master.sh                           # Start the Spark master (web UI at http://localhost:8080)
start-worker.sh spark://localhost:7077    # Start a Spark worker (start-slave.sh in older Spark releases)
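With everything started, jps (included with the JDK) gives a quick overview of the running daemons; the exact list depends on which services you started:
jps
# Expect entries such as NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager, Master and Worker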
3. Integrate Hadoop and Spark
The key to integration is ensuring Spark can access Hadoop’s resources (HDFS and YARN). The configurations above achieve this by:
- Pointing Spark at Hadoop’s configuration files (HADOOP_CONF_DIR).
- Adding Hadoop’s libraries to Spark’s classpath (SPARK_DIST_CLASSPATH).
- Using YARN as the cluster manager (spark.master yarn) and HDFS as the default filesystem (spark.hadoop.fs.defaultFS).
To validate the integration, run a Spark job on YARN and then one that reads data from HDFS (both shown below):
/opt/spark/bin/run-example SparkPi 10     # Run a sample Spark job locally
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar 10     # Submit the same job to YARN
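To validate the HDFS side as well, a word count over a file stored in HDFS works as a smoke test. This sketch uses the JavaWordCount example bundled with Spark; the file name and HDFS path are assumptions:
echo "spark and hadoop and yarn" > /tmp/words.txt
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put -f /tmp/words.txt /user/hadoop/input/
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
  hdfs://localhost:9000/user/hadoop/input/words.txt     # Prints (word, count) pairs to the console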
Troubleshooting Tips
- Check the daemon logs in $HADOOP_HOME/logs or $SPARK_HOME/logs.
- Make sure the Spark build matches your Hadoop version (e.g., spark-3.3.2-bin-hadoop3 for Hadoop 3.x).
- Verify that the HDFS directories your jobs rely on (e.g., /user/hadoop) exist and have correct permissions.
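For jobs that fail on YARN, the aggregated application logs are usually more useful than the daemon logs, assuming log aggregation is enabled (yarn.log-aggregation-enable in yarn-site.xml); the application ID below is a placeholder:
yarn application -list                                    # Find the ID of the failed application
yarn logs -applicationId application_XXXXXXXXXXXXX_0001   # Placeholder ID; substitute your own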