DESIGNING A SCALABLE DATA PLATFORM USING AKKA, SPARK STREAMING AND KAFKA
Alex Silva
MMM… DATA!!!!
Data Ingestion
INGESTION
BUT HOW DO WE START?
REAL-TIME VIDEO DATA STREAMING AND ANALYSIS
- Distributed
- Elastic
- Location Agnostic
- Open
- Message Driven
- Self-Healing
REACTIVE
The Reactive Manifesto
- Responsive
- Elastic
- Resilient
- Message Driven
Why Akka?
- Reactive
- Scalable
- Fault Tolerant
- Load Management: both up and out
- Location Transparency
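A minimal sketch of the message-driven model Akka provides; the Greeter actor and its message are illustrative, not part of Hydra:

import akka.actor.{Actor, ActorSystem, Props}

case class Greet(name: String)

class Greeter extends Actor {
  def receive = {
    case Greet(name) => println(s"Hello, $name")
  }
}

val system = ActorSystem("demo")
val greeter = system.actorOf(Props[Greeter], "greeter")
greeter ! Greet("Hydra") // fire-and-forget tell: the caller never blocks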
Challenges with Akka
- Learning Curve
- Type Safety
- Debugging
- Dead Letters
Why Kafka?
- Distributed Log
- High Throughput
- Reliable
- Concurrency
Producers vs. Consumers
[Diagram: multiple producers writing to a Kafka cluster; Topic 1 is split into partitions 0, 1, and 3 spread across brokers 1, 2, and 3]
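A minimal producer sketch against the plain Kafka client API; the broker address, topic, key, and value here are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// Records with the same key always land in the same partition.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("topic-1", "session-42", "segment-started"))
producer.close()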
Kafka: A CP or CA System?
[Diagram: the CAP triangle (Consistency, Availability, Partition Tolerance) and its opposites (Inconsistency, Unavailability, Partition Intolerance)]
Why Spark?
- Fast!
- Unified Platform
- Functional Paradigm
- Rich Library Set
- Active Community
MODULES
HYDRA CORE
- Metrics
- HTTP Server
- Bootstrapping Utilities
- System Actors
HYDRA INGESTION
- Actor Hierarchy
- Supervision
- Kafka Gateway
- Message Protocol
MESSAGE HANDLERS
Handler Registry
- Monitors registered handlers for errors/stops
- Broadcasts messages
- Handler Lifecycle
Ingestion Flow
[Diagram: a request with metadata and a JSON payload arrives at /ingest and flows from the Coordinator to the Registry to the Handlers]
Ingestion Actors: Coordinator
- Supervises ingestion at the request level
- Coordinates protocol flow
- Reports errors and metrics
Ingestion Protocol
[Diagram: Publish goes to the message handlers, which reply with Join or stay silent (<<Silence>>); joined handlers receive Validate and answer Valid or Invalid; Valid leads to Ingest, Invalid to STOP]
Ingestion Protocol: Publish
[Diagram: the Handler Registry publishes the request to all message handlers; interested handlers reply with Join, the rest ignore it]
Ingestion Protocol: Validation
[Diagram: the Ingestion Coordinator sends Validate to the joined message handlers, which reply Valid or Invalid]
Ingestion Protocol: Invalid Message
[Diagram: on an Invalid reply, the Ingestion Coordinator sends ReportError to the Error Reporter]
Ingestion Protocol: Ingest
[Diagram: for each joined handler, the Coordinator sends Ingest; the handler encodes and persists the message]
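The protocol as message types, sketched below; the names mirror the slides, but the exact fields are assumptions rather than Hydra's definitions:

sealed trait IngestionProtocol
case class IngestionRequest(metadata: Map[String, String], payload: String)

case class Publish(request: IngestionRequest) extends IngestionProtocol
case object Join extends IngestionProtocol // a handler opts in; silence means no
case class Validate(request: IngestionRequest) extends IngestionProtocol
case object Validated extends IngestionProtocol
case class Invalid(reason: String) extends IngestionProtocol
case class Ingest(request: IngestionRequest) extends IngestionProtocol
case object HandlerCompleted extends IngestionProtocol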
abstract class BaseMessageHandler extends Actor with ActorConfigSupport with ActorLogging
  with IngestionFlow with ProducerSupport with MessageHandler {

  ingest {
    case Initialize =>
      // nothing required by default
    case Publish(request) =>
      // handlers that do not join simply log and stay silent
      log.info(s"Publish message was not handled by ${self}. Will not join.")
    case Validate(request) =>
      sender ! Validated
    case Ingest(request) =>
      log.warning(s"Ingest message was not handled by ${self}.")
      sender ! HandlerCompleted
    case Shutdown =>
      // nothing required by default
    case Heartbeat =>
      Health.get(self).getChecks
  }
}
Elasticity
Elasticity
- Asynchronous
- Share Nothing
- Divide and Conquer
- Location Transparency
akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }
      /services-manager/kafka_producer {
        router = round-robin-pool
        resizer {
          lower-bound = 5
          upper-bound = 50
          messages-per-resize = 500
        }
      }
    }
  }
}
akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }
    }
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  cluster {
    seed-nodes = ["akka.tcp://Hydra@127.0.0.1:2552", "akka.tcp://Hydra@172.0.0.1:2553"]
  }
}
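A sketch of how such a deployment section is consumed; HandlerRegistry and SegmentHandler are stand-ins, and the routed actor must be created at the path the deployment key names:

import akka.actor.{Actor, Props}
import akka.routing.FromConfig

class SegmentHandler extends Actor {
  def receive = { case _ => () }
}

class HandlerRegistry extends Actor {
  // FromConfig makes Akka read the router type and resizer from the
  // /services-manager/handler_registry/segment_handler deployment entry,
  // so pool sizing is tuned in configuration, not code.
  val segmentHandler = context.actorOf(
    FromConfig.props(Props[SegmentHandler]), name = "segment_handler")

  def receive = {
    case msg => segmentHandler forward msg
  }
}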
Message Driven
Messages vs. Events
[Diagram: a message ("SAVE THIS!") is addressed to a specific recipient; an event ("SOMEBODY LOGGED IN!") is not]
Events are not addressed to a specific recipient.
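The distinction in code, sketched with Akka's event stream; the UserLoggedIn event and Auditor actor are illustrative:

import akka.actor.{Actor, ActorSystem, Props}

case class UserLoggedIn(userId: String)

class Auditor extends Actor {
  def receive = {
    case UserLoggedIn(id) => println(s"audit: $id logged in")
  }
}

val system = ActorSystem("demo")
val auditor = system.actorOf(Props[Auditor], "auditor")

// The publisher addresses no one; subscribers opt in by event type.
system.eventStream.subscribe(auditor, classOf[UserLoggedIn])
system.eventStream.publish(UserLoggedIn("u-123"))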
HYDRA PERSISTENCE
- Kafka Producers
- Offset Management
- Message Serialization
Message Serialization
- Happens at the message level
- Binds Messages to Producers
- Pluggable
trait KafkaMessage[K, P] {
  val timestamp = System.currentTimeMillis

  def key: K

  def payload: P

  def retryOnFailure: Boolean = true
}

case class JsonMessage(key: String, payload: JsonNode) extends KafkaMessage[String, JsonNode]

object JsonMessage {
  val mapper = new ObjectMapper()

  def apply(key: String, json: String) = {
    val payload: JsonNode = mapper.readTree(json)
    new JsonMessage(key, payload)
  }
}

case class AvroMessage(schema: SchemaHolder, key: String, json: String)
  extends KafkaMessage[String, GenericRecord] {

  def payload: GenericRecord = {
    val converter: JsonConverter[GenericRecord] = new JsonConverter[GenericRecord](schema.schema)
    converter.convert(json)
  }
}
Why Avro?
- Binary Format
- Space Efficient
- Evolutionary Schemas
- Automatic Tables
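A small sketch of what evolutionary schemas buy you, using Avro's SchemaBuilder; the VideoSegment record and its fields are hypothetical:

import org.apache.avro.{Schema, SchemaBuilder}

val v1: Schema = SchemaBuilder.record("VideoSegment").fields()
  .requiredString("userId")
  .requiredLong("timestamp")
  .endRecord()

// v2 adds a field with a default value, so records written with v1
// still deserialize: the schema evolves without breaking consumers.
val v2: Schema = SchemaBuilder.record("VideoSegment").fields()
  .requiredString("userId")
  .requiredLong("timestamp")
  .name("cdn").`type`().stringType().stringDefault("unknown")
  .endRecord()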
Fault Tolerance
Resiliency
“The ability of something to return to its original shape, after it has been pulled, stretched, pressed, or bent.” (Merriam-Webster)
Resilient Protocols
- Message Loss
- Message Reordering
- Message Duplication
Resilient Protocols
Asynchronous Communication + Eventual Consistency
ACID 2.0
- Associative
- Commutative
- Idempotent
- Distributed
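A minimal sketch of those algebraic properties; GSet is an illustrative grow-only set, not a Hydra type. Because set union is associative, commutative, and idempotent, replicas can merge in any order, any number of times, and still converge:

final case class GSet[A](elems: Set[A]) {
  def add(a: A): GSet[A] = GSet(elems + a)
  def merge(other: GSet[A]): GSet[A] = GSet(elems ++ other.elems)
}

val a = GSet(Set("s1")).add("s2")
val b = GSet(Set("s3"))

assert(a.merge(b) == b.merge(a)) // commutative: order does not matter
assert(a.merge(a) == a)          // idempotent: duplicates do not matter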
Resilient Protocols
- Replication
- Delegation
- Isolation
- Containment
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
    case _: ActorInitializationException => akka.actor.SupervisorStrategy.Stop
    case _: FailedToSendMessageException => Restart
    case _: ProducerClosedException => Restart
    case _: NoBrokersForPartitionException => Escalate
    case _: KafkaException => Escalate
    case _: ConnectException => Escalate
    case _: Exception => Escalate
  }

val kafkaProducerSupervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    kafkaProducerProps,
    childName = actorName[KafkaProducerActor],
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2
  ))
class KafkaProducerActor extends Actor with LoggingAdapter with ActorConfigSupport
  with NotificationSupport[KafkaMessage[Any, Any]] {

  import KafkaProducerActor._

  implicit val ec = context.dispatcher

  override def preRestart(cause: Throwable, message: Option[Any]) = {
    // No ack from Kafka: send the message to ourselves again after an
    // exponentially increasing delay.
    message match {
      case Some(rp: RetryingProduce) =>
        notifyObservers(KafkaMessageNotDelivered(rp.msg))
        val nextBackOff = rp.backOff.nextBackOff
        val retry = RetryingProduce(rp.topic, rp.msg)
        retry.backOff = nextBackOff
        context.system.scheduler.scheduleOnce(nextBackOff.waitTime, self, retry)
      case Some(produce: Produce) =>
        notifyObservers(KafkaMessageNotDelivered(produce.msg))
        if (produce.msg.retryOnFailure) {
          context.system.scheduler.scheduleOnce(initialDelay, self,
            RetryingProduce(produce.topic, produce.msg))
        }
      case _ => // nothing to retry
    }
  }
}
The Error Kernel Pattern
- Error kernel per request
- No processing; delegation
- Ingestion errors
- Timeouts
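A sketch of the pattern with illustrative names: the kernel keeps the valuable state and does no risky work itself, delegating each request to a disposable child whose failure cannot corrupt it:

import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Stop

class IngestionKernel extends Actor {
  // A failed request dies alone; the kernel keeps running.
  override val supervisorStrategy = OneForOneStrategy() {
    case _: Exception => Stop
  }

  def receive = {
    case request: String =>
      // one short-lived worker per request; no processing in the kernel
      context.actorOf(Props(new RequestWorker(request, sender())))
  }
}

class RequestWorker(request: String, replyTo: ActorRef) extends Actor {
  // the risky processing happens here, isolated from the kernel
  replyTo ! s"processed: $request"
  context.stop(self)
  def receive = Actor.emptyBehavior
}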
HYDRA CONDUCTORS
- Event “Pulling”
- HTTP Conductor
Location Transparency: Akka Remoting
- Peer-to-Peer
- Serialization
- Delivery Reliability
- Latency
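Location transparency in a sketch: the caller addresses an actor path, and whether that path resolves to this JVM or a remote node is configuration, not code. The host, port, and path below are assumptions:

import akka.actor.ActorSystem

val system = ActorSystem("Hydra")
val registry = system.actorSelection(
  "akka.tcp://Hydra@10.0.0.12:2552/user/services-manager/handler_registry")
registry ! "status" // same syntax whether the registry is local or remote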
Message Delivery in Hydra
- What does guaranteed delivery mean?
- At-most-once semantics
- Can be made stronger
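One sketch of "made stronger": resend until the receiver acknowledges, turning at-most-once into at-least-once (the receiver must then deduplicate). The message types and retry interval are assumptions:

import akka.actor.{Actor, ActorRef}
import scala.concurrent.duration._

case class Deliver(id: Long, payload: String)
case class Ack(id: Long)

class AtLeastOnceSender(target: ActorRef) extends Actor {
  import context.dispatcher
  private var pending = Map.empty[Long, Deliver]

  // periodically resend anything not yet acknowledged
  context.system.scheduler.schedule(3.seconds, 3.seconds, self, "redeliver")

  def receive = {
    case d: Deliver =>
      pending += (d.id -> d)
      target ! d
    case Ack(id) =>
      pending -= id
    case "redeliver" =>
      pending.values.foreach(target ! _)
  }
}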
Reliable Proxy
@throws(classOf[Exception])
override def init: Future[Boolean] = Future {
  val useProxy = config.getBoolean("message.proxy", false)
  val ingestorPath = config.getRequiredString("ingestor.path")
  // Route through the reliable proxy when configured; otherwise talk to
  // the ingestor directly.
  ingestionActor =
    if (useProxy) context.actorOf(ReliableIngestionProxy.props(ingestorPath))
    else context.actorSelection(ingestorPath)
  val cHeaders = config.getOptionalList("headers")
  topic = config.getRequiredString("kafka.topic")
  headers = cHeaders match {
    case Some(ch) =>
      List(ch.unwrapped.asScala.map { header =>
        val sh = header.toString.split(":")
        RawHeader(sh(0), sh(1))
      }: _*)
    case None => List.empty[HttpHeader]
  }
  true
}
What’s streaming into Hydra today?
- Conductors
- Webhooks
What’s streaming into Hydra in the next few days?
- Internal ETL
- Many others…
HYDRA SPARK
- Kafka Dispatching
- Spark Job Server
- Streaming Analytics
Dispatching
[Diagram: a Kafka topic's offsets (1 through 100) laid out on a timeline; a dispatch job replays a selected range of offsets]
Dispatchers
[Diagram: Kafka feeding dispatchers that write to HDFS, Hive, Impala, RDBMS, and Cassandra]
Dispatching
[Diagram: a JSON job spec posted to /dispatch produces ad hoc tables]
WHAT ABOUT ORDERING?
Kafka orders records only within a partition, so there are two options:
- Post-dispatch ordering
- A single-partition topic
Spark
- Defined via Hydra configuration and bootstrapped
- Both batch and streaming contexts
- Ad hoc via job server
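A minimal streaming sketch against the 2016-era direct Kafka API; this is illustrative, not Hydra's SegmentJob, and the broker list and topic name are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("segment-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Read the topic directly from Kafka, with Spark tracking the offsets.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "localhost:9092"),
  Set("segment"))

stream.map(_._2).count().print() // events per 10-second batch

ssc.start()
ssc.awaitTermination()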
Spark Job Server
- Spark as a REST Service
- Spark SQL, Streaming, etc.
- Async and Sync Job APIs
- Jar Management
- Context Management
Jar Manager Endpoint
- Storage
- Retrieval

GET  /jars            - lists jars and the upload timestamp
POST /jars/<appName>  - uploads a new jar under <appName>
Context Manager Endpoint
- CRUD
- Lifecycle Management

GET    /contexts         - lists all current contexts
POST   /contexts/<name>  - creates a new context
DELETE /contexts/<name>  - stops a context and all jobs running in it
Job Manager Endpoint
- Configuration
- Job repository
- Tracking
- Lifecycle management

GET    /jobs?limit=N         - lists the last N jobs
POST   /jobs                 - starts a new job; 'sync=true' to wait for results
GET    /jobs/<jobId>         - gets the result or status of a job
DELETE /jobs/<jobId>         - kills the job
GET    /jobs/<jobId>/config  - gets the job configuration
Creating Spark Jobs

# Ad-hoc jobs through Hydra: run-once jobs with transient contexts
curl --data-binary @/etc/local/hydra/video-segment-fx.jar localhost:9091/jars/segment

curl -d "kafka.topic=segment" 'localhost:9091/jobs?appName=segment&classPath=hydra.SegmentJob&sync=false'
{
  "status": "STARTED",
  "result": {
    "jobId": "3156120b-f001-56cf-d22a-b40ebf0a9af1",
    "context": "f5ed0ec1-hydra.spark.analytics.segment.SegmentJob"
  }
}
Persistent Context Jobs

# Required for related jobs
# Create a new context
curl -X POST 'localhost:9091/contexts/video-032116-ctx?num-cpu-cores=10&memory-per-node=512m'
OK

# Verify creation
curl localhost:9091/contexts
["video-032116-ctx"]

# Run a job using the context
curl -d "kafka.topic=segment" 'localhost:9091/jobs?appName=segment&classPath=hydra.SegmentJob&sync=true&context=video-032116-ctx'
{
  "result": {
    "active-sessions": 24476221
  }
}
Dispatcher Jobs

curl -X POST localhost:9091/dispatch
{
  "dispatch": {
    "kafka.topic": "youbora",
    "offsets": {
      "start": "2016-03-15T01:18:01",
      "end": "2016-03-15T03:18:01"
    },
    "avro.schema": "http://172.16.10.220:8085/schemas/youbora",
    "hdfs-serializer": {
      "output.path": "/tmp/hydra/youbora/${c:offsets.start}"
    }
  }
}
Dispatcher Jobs

curl -X POST localhost:9091/dispatch
{
  "dispatch": {
    "kafka.topic": "youbora",
    "offsets": {
      "start": "2016-03-15T01:18:01",
      "end": "2016-03-15T03:18:01"
    },
    "avro.schema": "http://172.16.10.220:8085/schemas/youbora",
    "hdfs-serializer": {
      "url": "jdbc:postgresql://localhost/test",
      "driver": "org.postgresql.Driver",
      "username": "${c:dispatchers.yb_pg_username}",
      "password": "${c:dispatchers.yb_pg_password}",
      "sql": "insert into video_raw values ('${total}','${type}','${subtype}','${user_id}','${ip}','${country}','${city}','${start}')"
    }
  }
}
Job Status

# Job results/status
curl localhost:9091/jobs/3156120b-f001-56cf-d22a-b40ebf0a9af1
{
  "duration": "16.264 secs",
  "classPath": "hydra.spark.analytics.segment.SegmentJob",
  "startTime": "2016-03-26T01:18:01.256Z",
  "context": "f5ed0ec1-hydra.spark.analytics.segment.SegmentJob",
  "result": {
    "active-sessions": 24476221
  },
  "status": "FINISHED",
  "jobId": "3156120b-f001-56cf-d22a-b40ebf0a9af1"
}
HOW IT LOOKS IN THE WILD
[Diagram: production topology connecting, over Akka Remoting, Ingestion nodes (Hydra Core, Ingestors, HTTP), Persistence nodes running Kafka (Hydra Core, Persistence, HTTP), Conductor nodes (Hydra Core, Conductors, HTTP), and Spark (batch and streaming) with Dispatcher nodes (Hydra Core, Dispatchers, HTTP) writing to RDBMS and HDFS; node counts on the slide: 3, 2, 2]
AWS

Model       vCPU   Mem (GiB)   SSD Storage (GB)
m3.medium   1      3.75        1 x 4
m3.large    2      7.5         1 x 32
m3.xlarge   4      15          2 x 40
m3.2xlarge  8      30          2 x 80
[Chart: average ingestions per second (requests), December 2015 through March 2016, on a scale of 0 to 2,500]
Some Facts
- 9,730 lines of Scala code
- Production platform since Jan 2016
- C.I. through Jenkins and Salt
What’s next?
Remote Handlers
[Diagram: a handler marked <<REMOTE>> registers with the Handler Registry, which creates a ZNode for it and includes it in the ingestion protocol]
QUESTIONS? Thank You!
roarking
Location Transparency in Hydra
- Distributed by Default
- Akka Remoting
- Configuration-Driven
- Peer-to-Peer
