
Commit eb85b85

Initial commit with Hue, Hive, and Hadoop implemented

File tree

15 files changed (+2,517 −0 lines)

.gitignore

Lines changed: 1 addition & 0 deletions
.idea

LICENSE

Lines changed: 27 additions & 0 deletions
Copyright (c) 2017 - present, Georg Walther
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this
  list of conditions and the following disclaimer in the documentation and/or
  other materials provided with the distribution.

* Neither the name of the {organization} nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 341 additions & 0 deletions
# Distributable Docker SQL on Hadoop

This repository expands on my earlier [Docker-Hadoop repository](https://github.com/waltherg/distributed_docker_hadoop)
where I put together a basic HDFS/YARN/MapReduce system.
In the present repository I explore various SQL-on-Hadoop options by spelling
out the various services explicitly in the accompanying Docker Compose file.
* [Prerequisites](#prerequisites)
* [Setup](#setup)
* [Build Docker images](#build-docker-images)
* [Start cluster](#start-cluster)
* [Cluster services](#services-and-their-components)
* [Hadoop distributed file system](#hadoop-distributed-file-system-hdfs)
* [HBase](#hbase)
* [Yet another resource negotiator (YARN)](#yet-another-resource-negotiator-yarn)
* [MapReduce Job History Server](#mapreduce-job-history-server)
* [ZooKeeper](#zookeeper)
* [Spark](#spark)
* [Tez](#tez)
* [Hive](#hive)
* [Hue](#hue)
* [Impala](#impala)
* [Presto](#presto)
* [Drill](#drill)
* [Spark SQL](#spark-sql)
* [Phoenix](#phoenix)
* [Moving towards production](#moving-towards-production)
## Prerequisites

Ensure you have Python Anaconda (the Python 3 flavor) installed:
[https://www.anaconda.com/download](https://www.anaconda.com/download).
Further ensure you have a recent version of Docker installed.
The Docker version I developed this example on is:

    $ docker --version
    Docker version 17.09.0-ce, build afdb6d4
## Setup

We will use Docker Compose to spin up the various Docker containers constituting
our Hadoop system.
To this end let us create a clean Anaconda Python virtual environment and install
a current version of Docker Compose in it:

    $ conda create --name distributable_docker_sql_on_hadoop python=3.6 --yes
    $ source activate distributable_docker_sql_on_hadoop
    $ pip install -r requirements.txt

Make certain `docker-compose` points to this newly installed version in the virtual
environment:

    $ which docker-compose

In case this does not point to the `docker-compose` binary in your virtual environment,
reload the virtual environment and check again:

    $ source deactivate
    $ source activate distributable_docker_sql_on_hadoop
### Build Docker images

To build all relevant Docker images locally:

    $ source env
    $ ./build_images.sh
### Start cluster

To bring up the entire cluster run:

    $ docker-compose up --force-recreate

In a separate terminal you can now check the state of the cluster containers through:

    $ docker ps
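
Docker Compose can also report status and stream logs per service, which is handy while the
cluster is coming up. The service name `namenode` below is an assumption - substitute the
service names defined in the accompanying `docker-compose.yml`:

    $ docker-compose ps                  # list the compose-managed containers and their state
    $ docker-compose logs -f namenode    # follow the logs of a single service (name assumed)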

## Services and their components

Here we take a closer look at the services we will run and their
respective components.
### Hadoop distributed file system (HDFS)

HDFS is the filesystem component of Hadoop and is optimized
for storing large data sets and for sequential access to these data.
HDFS does not provide random read/write access to files: reading only rows
X through X+Y of a large CSV or editing row Z in place are operations that
HDFS does not support.
For HDFS we need to run two components:

- Namenode
  - The master in HDFS
  - Stores data block metadata and directs the datanodes
  - Run two of these for high availability: one active, one standby
- Datanode
  - The worker in HDFS
  - Stores and retrieves data blocks
  - Run three data nodes in line with the default block replication factor of three

The namenode provides a monitoring GUI at

[http://localhost:50070](http://localhost:50070)
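
As a quick smoke test of HDFS you can run a few `hdfs` commands inside one of the cluster
containers. This is only a sketch: the service name `namenode` and the presence of the `hdfs`
binary on the container's `PATH` are assumptions - adjust them to your images:

    # Open a shell in the namenode container (service name assumed)
    $ docker-compose exec namenode bash
    # Create a directory, upload a small file, and list it
    $ hdfs dfs -mkdir -p /tmp/demo
    $ echo 'hello hdfs' > /tmp/hello.txt && hdfs dfs -put /tmp/hello.txt /tmp/demo/
    $ hdfs dfs -ls /tmp/demo
    # Inspect block placement and replication of the uploaded file
    $ hdfs fsck /tmp/demo/hello.txt -files -blocks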

### HBase

HBase is a distributed, column-family-oriented data storage system
that provides random read/write operations on top of HDFS
and auto-shards large tables across multiple hosts (region servers).
Information about which regions (shards) of a table are stored where is kept in
the META table.

The HBase components are:

- HMaster
  - Coordinates the HBase cluster
  - Load balances data between the region servers
  - Handles region server failure
- Region server
  - Stores data pertaining to a given region (shard / partition of a table)
  - Responds to client requests directly - no need to go through the HMaster for
    data requests
  - Run multiple of these (usually one region server per physical host)
    to scale out the data sizes that can be handled
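
Assuming the images ship the `hbase shell` client (the service name `hbase-master` used here is
an assumption), a quick end-to-end check could look like this:

    # Open the HBase shell in the HMaster container (service name assumed)
    $ docker-compose exec hbase-master hbase shell
    # Inside the shell: create a table with one column family, write a cell, and read it back
    hbase> create 'smoke_test', 'cf'
    hbase> put 'smoke_test', 'row1', 'cf:greeting', 'hello hbase'
    hbase> get 'smoke_test', 'row1'
    hbase> scan 'smoke_test'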

#### References

- [HBase documentation](https://hbase.apache.org/book.html)

### Yet another resource negotiator (YARN)

YARN is the Hadoop cluster resource management system; it allocates cluster resources
to applications built on top of it.
YARN is the compute layer of a Hadoop cluster and sits on top of the
cluster storage layer (HDFS or HBase).

The YARN components are:

- Resource manager
  - Manages the cluster resources
  - Run one per cluster
- Node manager
  - Manages the containers that scheduled application processes are executed in
  - Run one per compute node
  - Run multiple compute nodes to scale out computing power

**Where possible we want jobs that we submit to the cluster (e.g. MapReduce jobs)
to run on data stored locally on the executing host.
To this end we will start up the aforementioned HDFS datanode service and
YARN node manager concurrently on a given host and call these hosts `hadoop-node`s.**

The resource manager offers a management and monitoring GUI at

[http://localhost:8088](http://localhost:8088)
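
To watch YARN schedule containers you can submit one of the stock MapReduce examples. This is a
sketch only: the service name `hadoop-node-1` and the path to the examples jar are assumptions
that depend on the Hadoop distribution inside the images:

    # Estimate pi with 2 map tasks of 100 samples each (service name and jar path assumed)
    $ docker-compose exec hadoop-node-1 \
        yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100

The running and finished application should then show up in the resource manager UI linked above.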

### MapReduce Job History Server

A number of the SQL-on-Hadoop approaches we will try out here translate
SQL queries into MapReduce jobs executed on the data stored in the cluster.
This service helps us keep track of and visualize the MapReduce jobs we generate.

We run one job history server per cluster.

The job history server provides a monitoring GUI at

[http://localhost:19888](http://localhost:19888)
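
Besides the GUI, the job history server exposes a REST API, which gives a quick way to check
that the service is up and to list completed jobs (assuming port 19888 is mapped to your host as
above):

    # Service info of the history server
    $ curl http://localhost:19888/ws/v1/history/info
    # Finished MapReduce jobs known to the history server
    $ curl http://localhost:19888/ws/v1/history/mapreduce/jobs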

### ZooKeeper

ZooKeeper is used as a distributed coordination service between the
different Hadoop services of our system.
At its core, ZooKeeper provides a high-availability filesystem with ordered, atomic
operations.

We will run ZooKeeper in replicated mode, where an ensemble of ZooKeeper servers
decides in a leader/follower fashion what the current consensus state is:

- ZooKeeper server
  - Run three servers so that a majority can be found in replicated mode
183+
#### References
184+
185+
- [Replicated mode quickstart](
186+
https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
187+
)
188+
189+
### Spark
190+
191+
A cluster computing framework that does not translate user applications
192+
to MapReduce operations but rather uses its own execution engine based
193+
around directed acyclic graphs (DAGs).
194+
In MapReduce all output (even intermediary results) is stored to disk
195+
whereas the Spark engine has the ability to cache (intermediate) output
196+
in memory thus avoiding the cost of reading from and writing to disk.
197+
This is great for iterative algorithms and interactive data exploration
198+
where code is executed against against the same set of data multiple times.
199+
200+
We will run Spark in cluster mode where a SparkContext object instantiated
201+
in the user application connects to a cluster manager which
202+
delegates sends the application code to one or multiple executors.
203+
As cluster manager we choose YARN over its possible alternatives since
204+
we already run it for other services of our system (alternatives are Spark's own
205+
cluster manager and Mesos).
206+
207+
The executor processes are run on worker nodes:
208+
209+
- Worker node
210+
- Node in the Spark cluster that can run application code
211+
- Run multiple of these to scale out available computing power
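
A minimal way to exercise Spark on YARN is to submit the bundled SparkPi example. This is a
sketch: the service name `spark-worker-1` and the Spark installation paths are assumptions that
depend on the images:

    # Submit the stock SparkPi example to YARN in cluster mode (service name and jar path assumed)
    $ docker-compose exec spark-worker-1 \
        spark-submit --master yarn --deploy-mode cluster \
        --class org.apache.spark.examples.SparkPi \
        /opt/spark/examples/jars/spark-examples_*.jar 100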

#### References

- [Spark cluster mode](http://spark.apache.org/docs/latest/cluster-overview.html)

### Tez

Tez is an execution engine on top of YARN that translates user requests into directed acyclic
graphs (DAGs).
Tez was developed as a faster execution engine for workloads that would
otherwise be executed as MapReduce jobs (Pig, Hive, etc.).

Tez needs to be installed on each compute node of our aforementioned YARN cluster.
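
Whether Hive actually uses Tez is controlled per session via `hive.execution.engine`. Assuming
Tez has been installed and configured for Hive in these images, you can switch engines from a
Beeline session (see the Hive section below) like so:

    0: jdbc:hive2://hive-hiveserver2:10000> SET hive.execution.engine=tez;
    0: jdbc:hive2://hive-hiveserver2:10000> SET hive.execution.engine=mr;

The first statement routes subsequent queries through Tez, the second falls back to plain
MapReduce.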

#### References

- [Tez installation guide](https://tez.apache.org/install.html)

### Hive

Hive is a framework that allows you to analyze structured data files (stored locally
or in HDFS) using a SQL dialect (HiveQL).
Hive queries can be executed using either MapReduce, Spark, or Tez as the execution engine -
the latter two promise accelerated execution through their DAG-based engines.
Hive stores metadata on directories and files in its metastore, which clients
need access to in order to run Hive queries.
Hive establishes data schemas on read (schema-on-read), thus allowing fast data ingestion.
HDFS does not support in-place file changes, hence row updates are stored in delta
files and later merged into table files.
Hive does not provide locks natively and requires ZooKeeper for them.
Hive does however support indexes to improve query speeds.

To run Hive on our existing Hadoop cluster we need to add the following:

- HiveServer2
  - Service that allows clients to execute queries against Hive
- Metastore database
  - A traditional RDBMS server that persists relevant metadata
- Metastore server
  - Service that queries the metastore database for metadata on behalf of clients
To try out Hive you can connect to your instance of HiveServer2 through a CLI called Beeline.
Once you have brought up the Hadoop cluster as described above, start a container that runs
the Beeline CLI:

    $ source env
    $ docker run -ti --network distributabledockersqlonhadoop_hadoop_net --rm ${hive_image_name}:${image_version} \
        bash -c 'beeline -u jdbc:hive2://hive-hiveserver2:10000 -n root'

To look around:

    0: jdbc:hive2://hive-hiveserver2:10000> show tables;
    +-----------+
    | tab_name  |
    +-----------+
    +-----------+
    No rows selected (0.176 seconds)

To create a table:

    CREATE TABLE IF NOT EXISTS apache_projects (id int, name String, website String)
    COMMENT 'Hadoop-related Apache projects'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
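
To check that the table is usable end to end you can insert a sample row and read it back. This
assumes the Hive version in the image supports `INSERT ... VALUES` (Hive 0.14 and later); the
insert is compiled into a job that you can follow in the YARN resource manager UI:

    0: jdbc:hive2://hive-hiveserver2:10000> INSERT INTO TABLE apache_projects VALUES (1, 'Hive', 'https://hive.apache.org');
    0: jdbc:hive2://hive-hiveserver2:10000> SELECT * FROM apache_projects;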

Quit the Beeline client (and the Docker container) by pressing `Ctrl + D`.

The HiveServer2 instance also offers a web interface accessible at

[http://localhost:10002](http://localhost:10002)

Browse the HiveServer2 web interface to see logs of the commands and queries you executed in the
Beeline client, as well as general service logs.

#### References

- [Running Hive quickstart](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHive)
- [HiveServer2 reference](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview)
- [Metastore reference](https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin)

### Hue

Hue is a graphical analytics workbench on top of Hadoop.
The creators of Hue maintain a [Docker image](https://hub.docker.com/r/gethue/hue/)
which allows for a quick start with this platform.
Here we merely update Hue's settings in line with our Hadoop cluster -
see `images/hue` for details.

Hue runs as a web application accessible at

[http://localhost:8888](http://localhost:8888)

When first opening the web application, create a user account `root` with
an arbitrary password.

Locate the Hive table you created earlier through the Beeline CLI -
you should find it in the `default` Hive database.

### Impala

### Presto

### Drill

### Spark SQL

### Phoenix

## Notes

### Hostnames

Hadoop hostnames are not permitted to contain underscores (`_`), so make certain
to write longer hostnames with dashes (`-`) instead.

### Moving towards production

It should be relatively simple to scale out our test cluster to multiple physical hosts.
Here is a sketch of steps that are likely to get you closer to running this on multiple hosts
(a short command sketch for the first two steps follows the list):

* Create a [Docker swarm](https://docs.docker.com/engine/swarm/)
* Instead of the current Docker bridge network use an
  [overlay network](https://docs.docker.com/engine/userguide/networking/get-started-overlay/)
* Add an [OpenVPN server container](https://github.com/kylemanna/docker-openvpn) to your
  Docker overlay network to grant you continued web interface access from your computer
* You would likely want to use an orchestration framework such as
  [Ansible](https://www.ansible.com/) to tie the different steps and components
  of a more elaborate multi-host deployment together
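
As a minimal sketch of the first two steps (assuming Docker is installed on every host and the
hosts can reach each other; the network name `hadoop_net` is only an example):

    # On the designated manager host: initialize the swarm
    $ docker swarm init --advertise-addr <MANAGER-IP>
    # On every other host: join the swarm with the token printed by the previous command
    $ docker swarm join --token <WORKER-TOKEN> <MANAGER-IP>:2377
    # Back on the manager: create an attachable overlay network for the Hadoop services
    $ docker network create --driver overlay --attachable hadoop_net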
