Apache Spark Interview Questions
Spark has become popular among data scientists and big data enthusiasts. If you are looking for the best
collection of Apache Spark interview questions for your data analyst, big data, or machine learning job, you
have come to the right place.
In this Spark Tutorial, we shall go through some of the frequently asked Spark Interview Questions.
Entry Level Spark Interview Questions
Medium Level Spark Interview Questions
Advanced Spark Interview Questions
Entry Level Spark Interview Questions
What is Apache Spark?
Apache Spark is an open-source project from the Apache Software Foundation. It is a distributed data
processing engine used for data processing and data analytics, with built-in libraries for machine learning,
graph processing, stream processing, and SQL querying. Spark is horizontally scalable and, for many
workloads, considerably faster than Hadoop MapReduce.
When should you choose Apache Spark?
Choose Apache Spark when:
The application needs to scale horizontally.
The application needs both batch and real-time processing of records.
The application needs to connect to multiple data stores such as Apache Cassandra, Apache HBase, and SQL databases.
The application should be able to query structured datasets spread across different database platforms.
Which built-in libraries does Spark have?
Spark has four built-in libraries:
SQL and DataFrames
Spark Streaming
MLlib (Machine Learning Library)
GraphX
How fast is Apache Spark when compared to Hadoop? Give us an example.
For in-memory workloads, Apache Spark can be up to about 100 times faster than Hadoop MapReduce. In a
commonly cited logistic regression benchmark, a job that takes Hadoop around 110 seconds completes in
roughly 1 second on Spark.
Why is Spark faster than Hadoop?
Spark is fast because it uses a state-of-the-art DAG scheduler, a query optimizer, and a physical execution
engine, and because it keeps intermediate data in memory rather than writing it to disk between stages as
Hadoop MapReduce does.
Which programming languages could be used for Spark Application Development?
One can use the following programming languages:
Java
Scala
Python
R
SQL
On which platforms can Spark run?
Spark can run in its own standalone cluster mode, or on cluster managers such as:
Hadoop YARN
Apache Mesos
Kubernetes
It can also be deployed in the cloud, for example on Amazon EC2.
Which data sources can Spark access?
Spark can access data from hundreds of sources. Some of them are:
HDFS
Apache Cassandra
Apache HBase
Apache Hive
Can structured data be queried in Spark? If so, how?
Yes, structured data can be queried using:
SQL
DataFrame API
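For illustration, a minimal sketch of both approaches using the Java API; the local master setting and the people.json path are hypothetical:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StructuredQueryExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("StructuredQueryExample")
        .master("local[2]")
        .getOrCreate();

    // Load a JSON file into a DataFrame (Dataset<Row>); the path is hypothetical
    Dataset<Row> people = spark.read().json("data/people.json");

    // 1. Query with the DataFrame API
    people.filter(people.col("age").gt(30)).select("name").show();

    // 2. Query with SQL by registering a temporary view
    people.createOrReplaceTempView("people");
    spark.sql("SELECT name FROM people WHERE age > 30").show();

    spark.stop();
  }
}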
What is Spark MLlib?
MLlib is Spark's built-in, scalable machine learning library. It contains many machine learning algorithms and
utilities to transform data and extract useful information or inferences from it.
How is MLlib scalable?
MLlib's algorithms are built on top of Spark's distributed DataFrame API, so training and prediction run in
parallel across the cluster. Adding nodes to the cluster therefore increases the amount of data MLlib can handle.
What kinds of machine learning use cases does MLlib solve?
MLlib contains common learning algorithms that can solve problems like:
Clustering
Classification
Regression
Recommendation
Topic Modelling
Frequent itemsets
Association rules
Sequential pattern mining
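As a quick illustration, a minimal MLlib clustering sketch using KMeans; the libsvm-formatted input path is hypothetical:
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KMeansSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("KMeansSketch")
        .master("local[2]")
        .getOrCreate();

    // The libsvm data source produces a DataFrame with "label" and "features" columns
    Dataset<Row> data = spark.read().format("libsvm").load("data/sample_kmeans_data.txt");

    // Cluster the data into two groups
    KMeans kmeans = new KMeans().setK(2).setSeed(1L);
    KMeansModel model = kmeans.fit(data);

    // Print the learned cluster centres
    for (Vector center : model.clusterCenters()) {
      System.out.println(center);
    }

    spark.stop();
  }
}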
What is Spark Context?
A SparkContext instance sets up internal services for a Spark application and establishes a connection to the
Spark execution environment. The SparkContext is created by the Spark driver application.
What is an RDD?
RDD, short for Resilient Distributed Dataset, is a collection of elements partitioned across the nodes of the
cluster. An RDD is fault tolerant and can be operated on in parallel. RDDs provide the abstraction for
distributed computing across the nodes in a Spark cluster.
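As a quick illustration, an RDD can be created from an in-memory collection with parallelize(); a minimal Java sketch with arbitrary sample values:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class CreateRDDExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("CreateRDDExample").setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Distribute an in-memory list across the cluster as an RDD
    JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    System.out.println("Number of partitions: " + numbers.getNumPartitions());

    jsc.close();
  }
}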
What are RDD transformations?
Transformations are operations which create a new RDD from an existing RDD.
Some of the RDD transformations are:
map
filter
union
intersection
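A small sketch of chaining transformations using the Java API; the sample data is arbitrary, and collect() at the end is an action that triggers execution:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class TransformationsExample {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("TransformationsExample").setMaster("local[2]"));

    JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // map: produce a new RDD of squares
    JavaRDD<Integer> squares = numbers.map(n -> n * n);

    // filter: keep only the even squares
    JavaRDD<Integer> evenSquares = squares.filter(n -> n % 2 == 0);

    // Nothing has executed yet; collect() is an action that triggers the computation
    System.out.println(evenSquares.collect());

    jsc.close();
  }
}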
What are RDD actions?
Actions are operations that run a computation on an RDD and return a result to the driver program (or write it
to external storage).
Some of the RDD actions are:
reduce
collect
count
countByKey
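A small sketch of a few actions using the Java API; the sample data is arbitrary:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class ActionsExample {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("ActionsExample").setMaster("local[2]"));

    JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    System.out.println("count  : " + numbers.count());              // 5
    System.out.println("reduce : " + numbers.reduce((a, b) -> a + b)); // 15

    JavaPairRDD<String, Integer> pairs = jsc.parallelizePairs(
        Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));
    System.out.println("countByKey: " + pairs.countByKey());        // e.g. {a=2, b=1}

    jsc.close();
  }
}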
What do you mean by ‘RDD Transformations are lazy’?
Transformations on an RDD are not executed until an action is encountered; Spark only records the lineage of
transformations to be applied. Hence, RDD transformations are called lazy.
What do you know about RDD Persistence?
Persistence means caching. When an RDD is persisted, each node in the cluster stores the partitions of the
RDD that it computes in memory (RAM). When the RDD is used in multiple transformations or actions,
persistence cuts down latency by avoiding reloading the data from file storage and recomputing it each time.
What is the difference between cache() and persist() for an RDD?
cache() uses the default storage level, i.e., MEMORY_ONLY.
persist() can be provided with any of the possible storage levels.
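A brief sketch contrasting the two using the Java API; StorageLevel.MEMORY_AND_DISK is just one example of an explicit storage level:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import java.util.Arrays;

public class PersistExample {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("PersistExample").setMaster("local[2]"));

    JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // cache() always uses the default storage level, MEMORY_ONLY
    numbers.cache();

    // persist() accepts an explicit storage level, e.g. spill to disk when memory is full
    JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
    doubled.persist(StorageLevel.MEMORY_AND_DISK());

    System.out.println(doubled.count());
    jsc.close();
  }
}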
What do you mean by the default storage level: MEMORY_ONLY?
The default storage level, MEMORY_ONLY, means the RDD is stored as deserialized Java objects in the JVM. If
the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each
time they are needed.
What is a Sliding Window Operation?
In Spark Streaming, a sliding window operation applies a transformation over a sliding window of data rather
than over a single batch. A window is defined by two parameters: the window length (the duration of the
window) and the sliding interval (how often the windowed computation is performed); both must be multiples
of the batch interval. Examples of windowed operations are window(), countByWindow(), and
reduceByKeyAndWindow().
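A minimal windowed word-count sketch using the Java streaming API; the socket source on localhost:9999 and the batch/window durations are hypothetical:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.Arrays;

public class WindowedWordCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]");

    // Batch interval of 10 seconds
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // Count words over a 30-second window, sliding every 10 seconds
    JavaPairDStream<String, Integer> windowedCounts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));

    windowedCounts.print();
    jssc.start();
    jssc.awaitTermination();
  }
}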
Medium Level Spark Interview Questions
How does the SparkContext in a Spark application pick the value for the Spark master?
That can be done in two ways.
1. Create a new SparkConf object and set the master using its setMaster() method. This SparkConf object is passed as an
argument while creating the new SparkContext, as shown below.
SparkConf conf = new SparkConf().setAppName("JavaKMeansExample")
        .setMaster("local[2]")
        .set("spark.executor.memory", "3g")
        .set("spark.driver.memory", "3g");
JavaSparkContext jsc = new JavaSparkContext(conf);
2. The <apache-installation-directory>/conf/spark-env.sh file, located locally on the machine, contains the Spark
environment configuration. The Spark master is one of the parameters that can be provided in this configuration file.
How do you configure Spark Application?
A Spark application can be configured using properties that are set directly on the SparkConf object passed
during SparkContext initialization.
The following properties can be configured for a Spark application:
Spark Application Name
Number of Spark Driver Cores
Spark Driver’s Maximum Result Size
Spark Driver’s Memory
Spark Executors’ Memory
Spark Extra Listeners
Spark Local Directory
Log Spark Configuration
Spark Master
Deploy Mode of Spark Driver
Log Application Information
Spark Driver Supervise Action
Reference: Configure Spark Application
What is the use of Spark Environment Parameters? How do you configure them?
Spark environment parameters affect the behavior, working, and memory usage of the nodes in a cluster.
These parameters can be configured in the config file spark-env.sh, located at <apache-installation-
directory>/conf/spark-env.sh on each machine.
Reference: Configure Spark Ecosystem
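For illustration, a few typical entries in spark-env.sh (the values shown are hypothetical):
# Hypothetical spark-env.sh entries
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=192.168.0.102
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g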
How do you establish a connection between Apache Spark and Apache Mesos?
A connection to Mesos can be established in two ways.
Configure the Spark driver program to connect to the Mesos master, and place a Spark binary package in a location
accessible by Mesos.
Alternatively, install Spark in the same location on all the Mesos agents and set the property
spark.mesos.executor.home to point to the location where Spark is installed.
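A sketch of the second approach using the Java API; the Mesos master URL and the Spark installation path are hypothetical:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MesosConnectionExample {
  public static void main(String[] args) {
    // mesos://host:port points at the Mesos master; the URL and path below are hypothetical
    SparkConf conf = new SparkConf()
        .setAppName("MesosConnectionExample")
        .setMaster("mesos://192.168.0.10:5050")
        .set("spark.mesos.executor.home", "/opt/spark");

    JavaSparkContext jsc = new JavaSparkContext(conf);
    // ... build and run RDD or DataFrame jobs as usual ...
    jsc.close();
  }
}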
How do you minimize data transfers between nodes in a cluster?
Minimizing shuffles minimizes data transfers between nodes and thus speeds up the job. Some practices that
reduce shuffle operations are: using broadcast variables when joining a small dataset with a large RDD, so that
the small dataset is shipped once to each executor instead of being shuffled, and using accumulators to
aggregate counters and sums on the driver instead of collecting data back from the executors.
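A minimal sketch of a broadcast-based map-side join that avoids shuffling the large RDD; the lookup data is hypothetical:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BroadcastJoinExample {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("BroadcastJoinExample").setMaster("local[2]"));

    // Small lookup table, shipped once to each executor instead of being shuffled
    Map<Integer, String> countryNames = new HashMap<>();
    countryNames.put(1, "India");
    countryNames.put(2, "Germany");
    Broadcast<Map<Integer, String>> lookup = jsc.broadcast(countryNames);

    // Large RDD of country codes; a map-side lookup replaces a shuffle-based join
    JavaRDD<Integer> codes = jsc.parallelize(Arrays.asList(1, 2, 1, 2, 1));
    JavaRDD<String> names = codes.map(code -> lookup.value().getOrDefault(code, "unknown"));

    System.out.println(names.collect());
    jsc.close();
  }
}
Because the lookup table is broadcast once per executor, the large RDD never has to be repartitioned by key.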
To run Spark Applications, should we install Spark on all the nodes of a YARN cluster?
Spark programs can be executed on top of YARN, so there is no need to install Spark on all the nodes of a
YARN cluster to run Spark applications. Spark only needs to be available on the node from which the
application is submitted.
To run Spark Applications, should we install Spark on all the nodes of a Mesos cluster?
Spark programs can be executed on top of Mesos, so there is no need to install Spark on all the nodes of a
Mesos cluster to run Spark applications; the Spark binary package only needs to be in a location accessible by
Mesos.
What is the use of the GraphX module in Spark?
GraphX is Spark's graph processing library. It can be used to build and transform graphs and to run
graph-parallel computations. Many algorithms, such as PageRank, connected components, and triangle
counting, are available in the GraphX library.
How does Spark handle distributed processing?
Spark provides an abstraction for distributed processing through the Spark RDD API. A general user does not
need to worry about how data is processed across a distributed cluster. There are some exceptions, though:
when you optimize an application for performance, you should understand which transformations and actions
require data transfer (a shuffle) between nodes.
Advanced Spark Interview Questions