Designing a reactive data platform: Challenges, patterns, and anti-patterns

Alex Silva

Distributed, Elastic, Location Agnostic, Open, Message Driven, Self-Healing

The Reactive Manifesto: Responsive, Elastic, Resilient, Message Driven

Elasticity: Asynchronous, Share Nothing, Divide and Conquer, Location Transparency

Synchronous Messaging: Inherent ordering introduces implicit back pressure on the sender. [Diagram: messages 1, 2, 3 sent synchronously]

Asynchronous Messaging [Diagram: messages 1, 2, 3 sent asynchronously; message 4 is flagged "Invalid!"]

Resiliency: “The ability of something to return to its original shape, after it has been pulled, stretched, pressed, or bent.” (Merriam-Webster)

WHAT IF I TOLD YOU IT IS COMPLEX BUT NOT THAT COMPLICATED

Software Systems are Complex Systems

“Complex systems run in degraded mode.” “Complex systems run as broken systems.” Richard Cook

Resilient Protocols: Asynchronous Communication + Eventual Consistency

Failures: Contained, Observed, Managed, Reified as messages

Messages vs Events. Messages are addressable and specific ("Save this!"). Events are facts about the past, published to a topic ("Somebody logged in!").
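The distinction can be made concrete with types. The following is a minimal, dependency-free sketch in plain Scala, not Hydra or Akka code; `SaveDocument`, `UserLoggedIn`, and `Bus` are hypothetical names chosen for illustration. A message goes to exactly one named recipient, while an event is a past-tense fact handed to whoever subscribed to its topic.

```scala
object MessagesVsEvents {
  // A message is addressed to one specific recipient and asks for something to happen.
  final case class SaveDocument(recipient: String, body: String)

  // An event is an immutable fact about the past; it names no recipient.
  final case class UserLoggedIn(user: String, atMillis: Long)

  // Hypothetical in-memory bus, for illustration only.
  final class Bus {
    private var mailboxes = Map.empty[String, Vector[SaveDocument]]
    private var subscribers = Map.empty[String, Vector[UserLoggedIn => Unit]]

    // Point-to-point: the message lands in exactly one mailbox.
    def send(msg: SaveDocument): Unit = {
      val current = mailboxes.getOrElse(msg.recipient, Vector.empty[SaveDocument])
      mailboxes = mailboxes.updated(msg.recipient, current :+ msg)
    }

    def mailbox(recipient: String): Vector[SaveDocument] =
      mailboxes.getOrElse(recipient, Vector.empty[SaveDocument])

    def subscribe(topic: String)(handler: UserLoggedIn => Unit): Unit = {
      val current = subscribers.getOrElse(topic, Vector.empty[UserLoggedIn => Unit])
      subscribers = subscribers.updated(topic, current :+ handler)
    }

    // Publish/subscribe: every subscriber of the topic sees the event.
    // Returns the number of subscribers that were notified.
    def publish(topic: String, event: UserLoggedIn): Int = {
      val handlers = subscribers.getOrElse(topic, Vector.empty[UserLoggedIn => Unit])
      handlers.foreach(handler => handler(event))
      handlers.size
    }
  }
}
```

The sender of `SaveDocument` has to know who should act on it; the publisher of `UserLoggedIn` does not know, and does not care, how many listeners there are.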

REAL-TIME DATA INGESTION PLATFORM

Why Akka? Reactive; Elastic; Fault Tolerant; Load Management (both up and out); Location Transparency

Akka Actors: Lightweight, Reactive, Asynchronous, Resilient

Challenges with Akka: Learning Curve, Type Safety, Debugging, Dead Letters

Why Kafka? Distributed Log, High Throughput, Replicated, Concurrency

Kafka [Diagram: two producers write to a Kafka cluster; Broker 1 holds Topic 1 / Partition 0, Broker 2 holds Topic 1 / Partition 1, Broker 3 holds Topic 1 / Partition 3; three clients read from the cluster]

Why Spark? Fast! Unified Platform, Functional Paradigm, Rich Library Set, Active Community

Hydra Topology [Diagram: four node types, each built on Hydra Core with an HTTP endpoint and connected through Akka Remoting: Ingestion (Ingestors), Persistence :: Kafka (Persistence), Conductors (Conductors), and Dispatchers (Spark batch and streaming, RDBMS, HDFS)]

GOOD PRACTICE: DECENTRALIZE THE PROCESSING OF KEY TASKS

HYDRA INGESTION MODULE: Actor Hierarchy, Supervision, Kafka Gateway, Message Protocol

Hydra Ingestion Flow: < META > { } → /ingest → Coordinator → Registry → Handlers

Handler Registry: Monitors registered handlers for errors/stops; broadcasts messages; handler lifecycle

GOOD PRACTICE: DESIGN AN INCREMENTAL COMMUNICATION PROTOCOL

Hydra Ingestion Protocol. Sent to the message handlers: Publish, Validate, Ingest, STOP. Handler replies: Join or <<Silence>> (to Publish); Valid or Invalid (to Validate).

Hydra Ingestion Protocol: Publish. The Handler Registry broadcasts Publish to the message handlers ("Hey guys! Check this out!"). Interested handlers answer with Join ("Nice!!", "Bring it!!"); the others stay silent ("Huh?!", "Nah…").

Hydra Ingestion Protocol: Validation. The Ingestion Coordinator sends Validate to the message handlers ("How does it look?"). Each handler answers Valid ("Good!") or Invalid ("Bad!").

Hydra Ingestion Protocol: Invalid Message. The Ingestion Coordinator sends ReportError to the Error Reporter ("Got a bad one"). Also shown: Ingest.

Hydra Ingestion Protocol: Ingest. For each handler: Ingest ("Ship it!"), then Encode, then Persist.
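The incremental protocol (Publish, then Validate, then Ingest) can be walked through in a few lines of plain Scala. This is a sketch only: the real handlers are Akka actors exchanging asynchronous messages, whereas here each step is a synchronous method call, and the `Handler` trait, `Request`, `Outcome`, and `run` are hypothetical names. It also assumes that a handler replying Invalid is reported and skipped while the remaining valid handlers still ingest; the slides do not state what the real coordinator does in that case.

```scala
object IngestionProtocolSketch {
  final case class Request(topic: String, payload: String)

  sealed trait ValidationReply
  case object Valid extends ValidationReply
  final case class Invalid(reason: String) extends ValidationReply

  // Hypothetical synchronous stand-in for an actor-based message handler.
  trait Handler {
    def name: String
    def joins(req: Request): Boolean            // answer to Publish: Join (true) or silence (false)
    def validate(req: Request): ValidationReply // answer to Validate
    def ingest(req: Request): Unit              // reaction to Ingest: encode and persist
  }

  final case class Outcome(ingestedBy: List[String], errors: List[String])

  def run(req: Request, handlers: List[Handler]): Outcome = {
    // Step 1, Publish: broadcast the request; only interested handlers join.
    val joined = handlers.filter(_.joins(req))

    // Step 2, Validate: ask every joined handler how the request looks.
    val replies = joined.map(h => (h, h.validate(req)))
    val errors = replies.collect { case (h, Invalid(reason)) => s"${h.name}: $reason" }
    val valid = replies.collect { case (h, Valid) => h }

    // Step 3, Ingest: only handlers that validated the request get to persist it.
    valid.foreach(_.ingest(req))
    Outcome(valid.map(_.name), errors)
  }
}
```

Each step narrows the set of participants, so handlers that are not interested in a request never see anything beyond the initial Publish.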

abstract class BaseMessageHandler extends Actor
  with ActorConfigSupport
  with ActorLogging
  with IngestionFlow
  with ProducerSupport
  with MessageHandler {

  // Default reaction to every protocol step; concrete handlers override what they need.
  ingest {
    case Initialize =>
      // nothing required by default

    case Publish(request) =>
      log.info(s"Publish message was not handled by ${self}. Will not join.")

    case Validate(request) =>
      sender ! Validated

    case Ingest(request) =>
      // The s interpolator is required here, otherwise ${self} is logged literally.
      log.warning(s"Ingest message was not handled by ${self}.")
      sender ! HandlerCompleted

    case Shutdown =>
      // nothing required by default

    case Heartbeat =>
      Health.get(self).getChecks
  }
}

GOOD PRACTICE: HIDE AN ELASTIC POOL OF RESOURCES BEHIND ITS OWNER

Less of this… Publisher → Subscriber, with back pressure

More of this! Publisher → Router → Workers
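The idea of hiding a pool behind its owner can be sketched without Akka. In the plain-Scala example below, `Router` and `Worker` are hypothetical classes, not Akka's router types: the publisher only ever calls `route`, work is spread round-robin, and the pool can grow or shrink without the publisher noticing.

```scala
object PoolBehindOwner {
  // Hypothetical worker that simply remembers what it was asked to handle.
  final class Worker(val id: Int) {
    private var processed = Vector.empty[String]
    def handle(msg: String): Unit = processed = processed :+ msg
    def seen: Vector[String] = processed
  }

  // The router owns the workers; callers never hold a reference to any of them.
  final class Router(initialSize: Int) {
    require(initialSize > 0, "the pool needs at least one worker")
    private var workers: Vector[Worker] = Vector.tabulate(initialSize)(i => new Worker(i))
    private var next = 0

    // Round-robin dispatch; returns the id of the worker that got the message.
    def route(msg: String): Int = {
      val worker = workers(next % workers.size)
      next += 1
      worker.handle(msg)
      worker.id
    }

    // Elasticity is an internal concern: resizing does not change the caller's view.
    def resize(newSize: Int): Unit = {
      require(newSize > 0, "the pool needs at least one worker")
      val currentSize = workers.size
      workers =
        if (newSize <= currentSize) workers.take(newSize)
        else workers ++ Vector.tabulate(newSize - currentSize)(i => new Worker(currentSize + i))
    }

    def size: Int = workers.size
  }
}
```

In Akka the same shape comes from a pool router with a resizer, configured through the deployment section rather than written by hand.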

akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }
      /services-manager/kafka_producer {
        router = round-robin-pool
        resizer {
          lower-bound = 5
          upper-bound = 50
          messages-per-resize = 500
        }
      }
    }
  }
}

akka {
  actor {
    deployment {
      /services-manager/handler_registry/segment_handler {
        router = round-robin-pool
        optimal-size-exploring-resizer {
          enabled = on
          action-interval = 5s
          downsize-after-underutilized-for = 2h
        }
      }
    }
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  cluster {
    # The actor system name is case sensitive and must match on every seed node.
    seed-nodes = ["akka.tcp://Hydra@127.0.0.1:2552", "akka.tcp://Hydra@127.0.0.1:2553"]
  }
}

GOOD PRACTICE: USE SELF-DESCRIBING MESSAGES

trait KafkaMessage[K, P] {
  val timestamp = System.currentTimeMillis
  def key: K
  def payload: P
  def retryOnFailure: Boolean = true
}

case class JsonMessage(key: String, payload: JsonNode) extends KafkaMessage[String, JsonNode]

object JsonMessage {
  val mapper = new ObjectMapper()

  // Convenience constructor: parse a raw JSON string into the JsonNode payload.
  def apply(key: String, json: String): JsonMessage = {
    val payload: JsonNode = mapper.readTree(json)
    new JsonMessage(key, payload)
  }
}

case class AvroMessage(schema: SchemaHolder, key: String, json: String) extends KafkaMessage[String, GenericRecord] {
  // The Avro record is built from the JSON using the schema carried by the message itself.
  def payload: GenericRecord = {
    val converter: JsonConverter[GenericRecord] = new JsonConverter[GenericRecord](schema.schema)
    converter.convert(json)
  }
}

GOOD PRACTICE: PREFER BINARY DATA FORMATS FOR COMMUNICATION

Why Avro? Binary Format, Space Efficient, Evolutionary Schemas, Automatic Tables
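To give a feel for the space argument, the sketch below encodes the same record once as JSON text and once as fixed-width binary using only the JDK. It is not Avro's wire format (Avro uses variable-length integers and a schema, among other things), and the `Click` record and its fields are made up for the example. The point is only that a binary layout does not repeat field names or spell numbers out as digits.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object BinaryVsText {
  // Hypothetical record, used only to compare encodings.
  final case class Click(userId: Long, x: Int, y: Int)

  // Text encoding: field names and decimal digits travel with every record.
  def asJson(c: Click): Array[Byte] =
    s"""{"userId":${c.userId},"x":${c.x},"y":${c.y}}""".getBytes(UTF_8)

  // Fixed-width binary encoding: 8 bytes for the Long plus 4 bytes for each Int.
  def asBinary(c: Click): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeLong(c.userId)
    out.writeInt(c.x)
    out.writeInt(c.y)
    out.flush()
    bytes.toByteArray
  }
}
```

For any `Click` the binary form is 16 bytes, while the JSON form is at least 24 bytes and grows with the number of digits in each value.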

GOOD PRACTICE: DELEGATE AND SUPERVISE! REPEAT!

Ingestion Actors: Coordinators. Supervise ingestion at the request level; coordinate protocol flow; report errors and metrics.

Let it Crash: Components where full restarts are always OK; transient failures are hard to find; simplified failure model

// Decide per failure type whether the child is stopped, restarted, or the failure escalated.
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
    case _: ActorInitializationException   => akka.actor.SupervisorStrategy.Stop
    case _: FailedToSendMessageException   => Restart
    case _: ProducerClosedException        => Restart
    case _: NoBrokersForPartitionException => Escalate
    case _: KafkaException                 => Escalate
    case _: ConnectException               => Escalate
    case _: Exception                      => Escalate
  }

// Restart the Kafka producer with an increasing delay between attempts.
val kafkaProducerSupervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    kafkaProducerProps,
    childName = actorName[KafkaProducerActor],
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2
  ))
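To see roughly what schedule `minBackoff = 3.seconds`, `maxBackoff = 30.seconds`, and `randomFactor = 0.2` produce, the sketch below computes a capped exponential delay with jitter. The formula (doubling per restart, capped at the maximum, then stretched by up to `randomFactor`) is an assumption about how such a backoff is typically computed, not a copy of Akka's implementation. `delay` and its `jitter` parameter are hypothetical; jitter is passed in explicitly so the result is deterministic.

```scala
import scala.concurrent.duration._

object BackoffSketch {
  // jitter is a number in [0, 1) that would normally come from a random generator.
  def delay(restartCount: Int,
            minBackoff: FiniteDuration,
            maxBackoff: FiniteDuration,
            randomFactor: Double,
            jitter: Double): FiniteDuration = {
    require(restartCount >= 0, "restartCount must not be negative")
    require(jitter >= 0.0 && jitter < 1.0, "jitter must be in [0, 1)")
    // Double the base delay for every restart that has already happened.
    val exponential = minBackoff.toMillis.toDouble * math.pow(2.0, restartCount.toDouble)
    // Never wait longer than the configured maximum (before jitter is applied).
    val capped = math.min(exponential, maxBackoff.toMillis.toDouble)
    // Stretch the delay by up to randomFactor so restarts do not all fire at once.
    (capped * (1.0 + jitter * randomFactor)).toLong.millis
  }
}
```

With zero jitter the delays for restarts 0 through 4 are 3 s, 6 s, 12 s, 24 s, and 30 s; every later restart stays at the 30 s cap.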

class KafkaProducerActor extends Actor
  with LoggingAdapter
  with ActorConfigSupport
  with NotificationSupport[KafkaMessage[Any, Any]] {

  import KafkaProducerActor._

  implicit val ec = context.dispatcher

  override def preRestart(cause: Throwable, message: Option[Any]): Unit = {
    // Send it to itself again after the exponential delays; no Ack from Kafka.
    message match {
      case Some(rp: RetryingProduce) =>
        notifyObservers(KafkaMessageNotDelivered(rp.msg))
        val nextBackOff = rp.backOff.nextBackOff
        val retry = RetryingProduce(rp.topic, rp.msg)
        retry.backOff = nextBackOff
        context.system.scheduler.scheduleOnce(nextBackOff.waitTime, self, retry)

      case Some(produce: Produce) =>
        notifyObservers(KafkaMessageNotDelivered(produce.msg))
        if (produce.msg.retryOnFailure) {
          context.system.scheduler.scheduleOnce(initialDelay, self, RetryingProduce(produce.topic, produce.msg))
        }

      case _ =>
        // No message, or a message that is not a produce request: nothing to retry.
        // Without this case the match would throw a MatchError during restart.
        ()
    }
  }
}

Monitoring through Death Watches

WHAT ABOUT SOME ANTI-PATTERNS?

NOT SO GOOD PRACTICE: BUILDING NANO SERVICES

NOT SO GOOD PRACTICE: TREATING LOCATION TRANSPARENCY AS A FREE-FOR-ALL

Guaranteed Delivery in Hydra: What does guaranteed delivery mean? At-most-once semantics; can be made stronger.
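One common way to make at-most-once delivery stronger is to resend until an acknowledgement arrives and to deduplicate on the receiving side. The sketch below shows that idea in plain Scala; it is not Hydra's implementation, and `Envelope`, `Channel`, and `Receiver` are hypothetical names. The channel is modelled as a function that returns `true` only when the sender saw an acknowledgement.

```scala
object DeliverySketch {
  final case class Envelope(id: Long, payload: String)

  // Hypothetical lossy channel: true means the sender received an acknowledgement.
  type Channel = Envelope => Boolean

  // At-most-once: fire and forget. A lost message is simply gone.
  def sendAtMostOnce(channel: Channel, envelope: Envelope): Unit = {
    channel(envelope)
    ()
  }

  // At-least-once: resend until acknowledged or until the attempts run out.
  // exists stops at the first successful attempt.
  def sendAtLeastOnce(channel: Channel, envelope: Envelope, maxAttempts: Int): Boolean =
    (1 to maxAttempts).exists(_ => channel(envelope))

  // Resending can duplicate messages, so the receiver drops ids it has already seen.
  final class Receiver {
    private var seenIds = Set.empty[Long]
    private var log = Vector.empty[String]

    def receive(envelope: Envelope): Unit =
      if (!seenIds.contains(envelope.id)) {
        seenIds += envelope.id
        log = log :+ envelope.payload
      }

    def delivered: Vector[String] = log
  }
}
```

The price of the stronger guarantee is extra state on both sides: the sender must keep the message until it is acknowledged, and the receiver must remember which ids it has processed.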

Akka Remoting: Peer-to-Peer, Serialization, Delivery Reliability, Latency

@throws(classOf[Exception])
override def init: Future[Boolean] = Future {
  val useProxy = config.getBoolean("message.proxy", false)
  val ingestorPath = config.getRequiredString("ingestor.path")

  ingestionActor =
    if (useProxy) context.actorSelection(ingestorPath)
    else context.actorOf(ReliableIngestionProxy.props(ingestorPath))

  val cHeaders = config.getOptionalList("headers")
  topic = config.getRequiredString("kafka.topic")

  // Headers are configured as "Name:Value" strings; split on the first colon only,
  // so values that contain a colon themselves are kept whole.
  headers = cHeaders match {
    case Some(ch) =>
      List(
        ch.unwrapped.asScala.map { header =>
          val sh = header.toString.split(":", 2)
          RawHeader(sh(0).trim, sh(1).trim)
        }: _*
      )
    case None => List.empty[HttpHeader]
  }
  true
}

NOT SO GOOD PRACTICE: NOT KEEPING MESSAGE PROTOCOLS BOUND TO THEIR CONTEXTS

// One catch-all object holding the messages of every component.
object Messages {
  case object ServiceStarted
  case class RegisterHandler(info: ActorRef)
  case class RegisteredHandler(name: String, handler: ActorRef)
  case class RemoveHandler(path: ActorPath)
  case object GetHandlers
  case object InitiateIngestion extends HydraMessage
  case class RequestCompleted(s: IngestionSummary) extends HydraMessage
  case class IngestionSummary(name: String)
  case class Produce(topic: String, msg: KafkaMessage[_, _], ack: Option[ActorRef]) extends HydraMessage
  case object HandlerTimeout extends HydraMessage
  case class Validate(req: HydraRequest) extends HydraMessage
  case class Validated(req: HydraRequest) extends HydraMessage
  case class NotValid(req: HydraRequest, reason: String) extends HydraMessage
  case object HandlingCompleted extends HydraMessage
  case class Publish(request: HydraRequest)
  case class Ingest(request: HydraRequest)
  case class Join(r: HydraRequest) extends HydraMessage
}

class HandlerRegistry extends Actor with LoggingAdapter with ActorConfigSupport {

  override def receive: Receive = { ... }

  override val supervisorStrategy = OneForOneStrategy() {
    case e: Exception =>
      report(e)
      Restart
  }
}

// The registry's protocol lives in its companion object, next to the actor that understands it.
object HandlerRegistry {
  case class RegisterHandler(info: HandlerInfo)
  case class RegisteredHandler(name: String, handler: ActorRef)
  case class RemoveHandler(path: ActorPath)
  case object GetHandlers
}

NOT SO GOOD PRACTICE: DEVELOPING OVERLY CHATTY PROTOCOLS

What’s streaming into Hydra today? Conductors, Webhooks

[Chart: Average Ingestions Per Second (requests), December 2015 to March 2016; y-axis from 0 to 2,500]

Some Facts: 9,730 lines of Scala code; production platform since Jan 2016; C.I. through Jenkins and Salt

roarking QUESTIONS? Thank You!
