# Distributable Docker SQL on Hadoop

This repository expands on my earlier [Docker-Hadoop repository](https://github.com/waltherg/distributed_docker_hadoop)
where I put together a basic HDFS/YARN/MapReduce system.
In the present repository I explore various SQL-on-Hadoop options by spelling
out the individual services explicitly in the accompanying Docker Compose file.

* [Prerequisites](#prerequisites)
* [Setup](#setup)
  * [Build Docker images](#build-docker-images)
  * [Start cluster](#start-cluster)
* [Cluster services](#services-and-their-components)
  * [Hadoop distributed file system (HDFS)](#hadoop-distributed-file-system-hdfs)
  * [HBase](#hbase)
  * [Yet another resource negotiator (YARN)](#yet-another-resource-negotiator-yarn)
  * [MapReduce Job History Server](#mapreduce-job-history-server)
  * [ZooKeeper](#zookeeper)
  * [Spark](#spark)
  * [Tez](#tez)
  * [Hive](#hive)
  * [Hue](#hue)
  * [Impala](#impala)
  * [Presto](#presto)
  * [Drill](#drill)
  * [Spark SQL](#spark-sql)
  * [Phoenix](#phoenix)
* [Notes](#notes)
  * [Hostnames](#hostnames)
  * [Moving towards production](#moving-towards-production)

## Prerequisites

Ensure you have Python Anaconda (the Python 3 flavor) installed:
[https://www.anaconda.com/download](https://www.anaconda.com/download).
Further ensure you have a recent version of Docker installed.
The Docker version I developed this example on is:

    $ docker --version
    Docker version 17.09.0-ce, build afdb6d4

## Setup

We will use Docker Compose to spin up the various Docker containers constituting
our Hadoop system.
To this end let us create a clean Anaconda Python virtual environment and install
a current version of Docker Compose in it:

    $ conda create --name distributable_docker_sql_on_hadoop python=3.6 --yes
    $ source activate distributable_docker_sql_on_hadoop
    $ pip install -r requirements.txt

Make certain `docker-compose` points to this newly installed version in the virtual
environment:

    $ which docker-compose

In case this does not point to the `docker-compose` binary in your virtual environment,
reload the virtual environment and check again:

    $ source deactivate
    $ source activate distributable_docker_sql_on_hadoop

### Build Docker images

To build all relevant Docker images locally:

    $ source env
    $ ./build_images.sh

### Start cluster

To bring up the entire cluster run:

    $ docker-compose up --force-recreate

In a separate terminal you can now check the state of the cluster containers through

    $ docker ps

## Services and their components

Here we take a closer look at the services we will run and their
respective components.

### Hadoop distributed file system (HDFS)

HDFS is the filesystem component of Hadoop and is optimized
for storing large data sets and for sequential access to these data.
HDFS does not provide random read/write access to files: reading only rows
X through X+Y of a large CSV or editing row Z in place are operations that
HDFS does not support.
For HDFS we need to run two components:

- Namenode
  - The master in HDFS
  - Stores data block metadata and directs the datanodes
  - Run two of these for high availability: one active, one standby
- Datanode
  - The worker in HDFS
  - Stores and retrieves data blocks
  - Run three datanodes in line with the default block replication factor of three

The namenode provides a monitoring GUI at

[http://localhost:50070](http://localhost:50070)
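
Once the cluster is up you can poke at HDFS with the `hdfs` CLI from inside one of
the running containers. A minimal sketch, assuming you pick the namenode container
(the exact container name is whatever `docker ps` reports on your machine):

    $ docker ps --format '{{.Names}}' | grep namenode
    $ docker exec -ti <namenode container name> \
       bash -c 'hdfs dfs -mkdir -p /user/root && hdfs dfs -ls /'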

### HBase

HBase is a distributed, column(-family)-oriented data storage system
that provides random read/write data operations on top of HDFS
and auto-sharding of large tables across multiple hosts (region servers).
Information about which region server stores which region (shard) of a table is kept
in the META table.

The HBase components are:

- HMaster
  - Coordinates the HBase cluster
  - Load balances data between the region servers
  - Handles region server failure
- Region server
  - Stores data pertaining to a given region (shard / partition of a table)
  - Responds to client requests directly - no need to go through the HMaster for
    data requests
  - Run multiple of these (usually one region server service per physical host)
    to scale out the data sizes that can be handled
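
To try HBase out you can open the interactive HBase shell in one of the HBase
containers. A minimal sketch (the container name is an assumption - use whatever
`docker ps` shows for the HMaster or a region server):

    $ docker exec -ti <an hbase container name> hbase shell
    hbase(main):001:0> create 'apache_projects', 'info'
    hbase(main):002:0> put 'apache_projects', 'hadoop', 'info:website', 'https://hadoop.apache.org'
    hbase(main):003:0> get 'apache_projects', 'hadoop'
    hbase(main):004:0> scan 'apache_projects'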

#### References

- [HBase documentation](https://hbase.apache.org/book.html)

### Yet another resource negotiator (YARN)

YARN is the Hadoop cluster management system that requests and allocates cluster
resources for applications built on top of it.
YARN is the compute layer of a Hadoop cluster and sits on top of the
cluster storage layer (HDFS or HBase).

The YARN components are:

- Resource manager
  - Manages the cluster resources
  - Run one per cluster
- Node manager
  - Manages the containers that scheduled application processes are executed in
  - Run one per compute node
  - Run multiple compute nodes to scale out computing power

**Where possible we want jobs that we submit to the cluster (e.g. MapReduce jobs)
to run on data stored locally on the executing host.
To this end we will start up the aforementioned HDFS datanode service and the
YARN node manager concurrently on a given host and call these hosts `hadoop-node`s.**

The resource manager offers a management and monitoring GUI at

[http://localhost:8088](http://localhost:8088)
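
You can also inspect the cluster from the command line with the `yarn` CLI inside one
of the `hadoop-node` containers. A minimal sketch (the container name is an
assumption - pick any node manager host reported by `docker ps`):

    $ docker exec -ti <a hadoop-node container name> yarn node -list
    $ docker exec -ti <a hadoop-node container name> yarn application -list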

### MapReduce Job History Server

A number of the SQL-on-Hadoop approaches we will try out here translate
SQL queries into MapReduce jobs that are executed on the data stored in the cluster.
This service helps us keep track of and visualize the MapReduce jobs we generate.

We run one job history server per cluster.

The job history server provides a monitoring GUI at

[http://localhost:19888](http://localhost:19888)
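
To see the job history server in action you can submit one of the example MapReduce
jobs that ship with Hadoop. A sketch, assuming `HADOOP_HOME` is set inside the
`hadoop-node` containers and the examples jar sits in its usual location:

    $ docker exec -ti <a hadoop-node container name> \
       bash -c 'yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10'

The finished job should then show up both in the resource manager GUI and in the job
history server GUI linked above.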

### ZooKeeper

ZooKeeper is used as a distributed coordination service between the
different Hadoop services of our system.
At its core, ZooKeeper provides a high-availability filesystem with ordered, atomic
operations.

We will run ZooKeeper in replicated mode where an ensemble of ZooKeeper servers
decides in a leader/follower fashion what the current consensus state is:

- ZooKeeper server
  - Run three servers so that a majority can be found in replicated mode
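
To peek at what the other services store in ZooKeeper you can use the CLI that ships
with it. A minimal sketch (the container name is an assumption, and the client port
2181 is the ZooKeeper default):

    $ docker exec -ti <a zookeeper container name> zkCli.sh -server localhost:2181
    [zk: localhost:2181(CONNECTED) 0] ls /

Depending on which services are already up, the listed znodes should include entries
such as `/hbase`.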

#### References

- [Replicated mode quickstart](https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html#sc_RunningReplicatedZooKeeper)

### Spark

A cluster computing framework that does not translate user applications
into MapReduce operations but rather uses its own execution engine based
around directed acyclic graphs (DAGs).
In MapReduce all output (even intermediary results) is written to disk,
whereas the Spark engine can cache (intermediate) output
in memory, thus avoiding the cost of reading from and writing to disk.
This is great for iterative algorithms and interactive data exploration
where code is executed against the same set of data multiple times.

We will run Spark in cluster mode where a SparkContext object instantiated
in the user application connects to a cluster manager which
sends the application code to one or more executors.
As cluster manager we choose YARN over its possible alternatives since
we already run it for other services of our system (the alternatives are Spark's own
standalone cluster manager and Mesos).

The executor processes are run on worker nodes:

- Worker node
  - Node in the Spark cluster that can run application code
  - Run multiple of these to scale out the available computing power
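
To sanity-check the Spark-on-YARN setup you can submit the bundled SparkPi example.
A sketch, assuming `spark-submit` and `SPARK_HOME` are available inside the Spark
worker (or `hadoop-node`) containers and the YARN configuration is on their classpath:

    $ docker exec -ti <a container with the Spark client installed> \
       bash -c 'spark-submit --master yarn --deploy-mode cluster \
         --class org.apache.spark.examples.SparkPi \
         $SPARK_HOME/examples/jars/spark-examples_*.jar 100'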

#### References

- [Spark cluster mode](http://spark.apache.org/docs/latest/cluster-overview.html)

### Tez

An execution engine on top of YARN that translates user requests into directed acyclic
graphs (DAGs).
Tez was developed as a faster execution engine for workloads that
would otherwise be executed as MapReduce jobs (Pig, Hive, etc.).

Tez needs to be installed on each compute node of our aforementioned YARN cluster.
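
Once Tez is in place, the Hive execution engine can be switched per session from the
Beeline client introduced in the Hive section below. A sketch (assuming the Tez
libraries are visible to Hive on the cluster nodes):

    SET hive.execution.engine=tez;  -- run subsequent queries on Tez
    SET hive.execution.engine=mr;   -- fall back to classic MapReduce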

#### References

- [Tez installation guide](https://tez.apache.org/install.html)

### Hive

Hive is a framework that allows you to analyze structured data files (stored locally
or in HDFS) using HiveQL, a SQL dialect.
As execution engine for Hive queries either MapReduce, Spark, or Tez may be used -
the latter two promise faster execution through their DAG-based engines.
Hive stores metadata about directories and files in its metastore, which clients
need access to in order to run Hive queries.
Hive establishes data schemas on read, thus allowing fast data ingestion.
HDFS does not support in-place file changes, hence row updates are stored in delta
files and later merged into table files.
Hive does not implement locks natively and requires ZooKeeper for these.
Hive does however support indexes to improve query speeds.

To run Hive on our existing Hadoop cluster we need to add the following:

- HiveServer2
  - Service that allows clients to execute queries against Hive
- Metastore database
  - A traditional RDBMS server that persists the relevant metadata
- Metastore server
  - Service that queries the metastore database for metadata on behalf of clients

To try out Hive you can connect to your instance of HiveServer2 through a CLI called Beeline.
Once you have brought up the Hadoop cluster as described above, start a container that runs
the Beeline CLI:

    $ source env
    $ docker run -ti --network distributabledockersqlonhadoop_hadoop_net --rm ${hive_image_name}:${image_version} \
        bash -c 'beeline -u jdbc:hive2://hive-hiveserver2:10000 -n root'

To look around:

    0: jdbc:hive2://hive-hiveserver2:10000> show tables;
    +-----------+
    | tab_name  |
    +-----------+
    +-----------+
    No rows selected (0.176 seconds)

To create a table:

    CREATE TABLE IF NOT EXISTS apache_projects (id int, name String, website String)
    COMMENT 'Hadoop-related Apache projects'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
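
To put a row into the new table and read it back - a quick sketch; `INSERT ... VALUES`
requires Hive 0.14 or later and is itself translated into a cluster job, so it may take
a moment:

    INSERT INTO TABLE apache_projects VALUES (1, 'HBase', 'https://hbase.apache.org');
    SELECT * FROM apache_projects;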

Quit the Beeline client (and the Docker container) by pressing `Ctrl + D`.

The HiveServer2 instance also offers a web interface accessible at

[http://localhost:10002](http://localhost:10002)

Browse the HiveServer2 web interface to see logs of the commands and queries you executed in the
Beeline client, as well as general service logs.

#### References

- [Running Hive quickstart](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHive)
- [HiveServer2 reference](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview)
- [Metastore reference](https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin)

### Hue

Hue is a graphical analytics workbench on top of Hadoop.
The creators of Hue maintain a [Docker image](https://hub.docker.com/r/gethue/hue/)
which allows for a quick start with this platform.
Here we merely update Hue's settings in line with our Hadoop cluster -
see `images/hue` for details.

Hue runs as a web application accessible at

[http://localhost:8888](http://localhost:8888)

When first opening the web application, create a user account `root` with an
arbitrary password.

Locate the Hive table you created earlier through the Beeline CLI -
you should find it in the `default` Hive database.

### Impala

### Presto

### Drill

### Spark SQL

### Phoenix

## Notes

### Hostnames

Hadoop hostnames are not permitted to contain underscores `_`, therefore make certain
to spell out longer hostnames with dashes `-` instead.

### Moving towards production

It should be relatively simple to scale out our test cluster to multiple physical hosts.
Here is a sketch of steps that are likely to get you closer to running this on multiple hosts:

* Create a [Docker swarm](https://docs.docker.com/engine/swarm/)
* Instead of the current Docker bridge network use an
  [overlay network](https://docs.docker.com/engine/userguide/networking/get-started-overlay/)
* Add an [OpenVPN server container](https://github.com/kylemanna/docker-openvpn) to your
  Docker overlay network to grant you continued web interface access from your computer
* You would likely want to use an orchestration framework such as
  [Ansible](https://www.ansible.com/) to tie the different steps and components
  of a more elaborate multi-host deployment together
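
A minimal sketch of the first two bullet points (run on the prospective swarm manager
host; the network name `hadoop_net` is only an example):

    $ docker swarm init
    $ docker network create --driver overlay --attachable hadoop_net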