Selective Data Replication with Geographically Distributed Hadoop
Brett Rudenstein
April 16, 2015, Brussels, Belgium
Dimensions of Scalability
For distributed storage systems
- Ability to support ever-increasing requirements for
  - Space: more data
  - Objects: more files
  - Load: more clients
- RAM is limiting HDFS scale
- Other dimensions of scalability
  - Geographic scalability: scaling across multiple data centers
- Scalability as a universal measure of a distributed system
Geographic Scalability
Scaling the file system across multiple data centers
- Running Hadoop in multiple data centers
  - Distributed across the world
  - As a single cluster
Main Steps
Four main stages to the goal
- STAGE I: The role of the Coordination Engine
- STAGE II: Replicated Virtual Namespace
  - Active-active
- STAGE III: Geographically distributed Hadoop
  - File system running in multiple data centers
  - Disaster recovery, load balancing, self-healing, simultaneous data ingest
- STAGE IV: Selective data replication
  - Heterogeneous storage zones
The Requirements
Requirements
File system with hardware components distributed over the WAN
- Operated and perceived by users as a single system
  - Unified file system view independent of where the data is physically stored
- Strict consistency
  - Everybody sees the same data
  - Seamless file-level replication
- Continuous availability
  - All components are active
  - Disaster recovery
- Geographic scalability: support for multiple data centers
Architecture Principles
Strict consistency of metadata with fast data ingest
1. Synchronous replication of metadata between data centers
   - Using the Coordination Engine
   - Provides strict consistency of the namespace
2. Asynchronous replication of data over the WAN
   - Data replicated in the background
   - Allows fast, LAN-speed data creation
Coordination Engine
For Replicating Consistent State
Coordination Engine
Determines the order of operations in the system
- The Coordination Engine ensures the order of events submitted to the engine by multiple proposers (see the sketch below)
  - Anybody can propose
  - The engine chooses a single agreement every time and guarantees:
    • Learners observe the agreements in the same order they were chosen
    • An agreement triggers a corresponding application action
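The proposer/agreement/learner contract described on this slide can be written down as a few small interfaces. The following is a minimal, hypothetical Java sketch; the deck does not show the real Coordination Engine API, and all names here (Proposal, Agreement, Learner, CoordinationEngine) are illustrative only.

```java
/** Hypothetical sketch of the Coordination Engine contract; all names are illustrative. */
public final class CoordinationSketch {

    /** Any event a node wants to have coordinated. */
    public interface Proposal {
        byte[] payload();
    }

    /** A proposal the engine has placed into the global sequence of agreements. */
    public interface Agreement {
        long gsn();           // Global Sequence Number: position in the agreed order
        Proposal proposal();
    }

    /** Learners observe agreements in the same order they were chosen. */
    public interface Learner {
        void deliver(Agreement agreement); // triggers the corresponding application action
    }

    /** The engine: anybody can propose, every registered learner sees the same order. */
    public interface CoordinationEngine {
        void submit(Proposal proposal);
        void register(Learner learner);
    }

    private CoordinationSketch() {}
}
```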
Central Coordination
Simple coordination without fault tolerance
- Easy to coordinate
  - A single NameNode as an example of a central Coordination Engine (no HA)
  - Performance and availability bottleneck
  - Single point of failure
Distributed Coordination Engine
Fault-tolerant coordination using multiple acceptors
- A distributed Coordination Engine operates on participating nodes
  - Roles: Proposer, Learner, and Acceptor
  - Each node can combine multiple roles
- Distributed coordination
  - Proposing nodes submit events as proposals to a quorum of acceptors
  - Acceptors agree on the order of each event in the global sequence of events
  - Learners learn agreements in the same deterministic order
Consensus Algorithms
Consensus is the process of agreeing on one result among a group of participants
- The Coordination Engine guarantees the same state of the learners at a given GSN (see the sketch below)
  - Each agreement is assigned a unique Global Sequence Number (GSN)
  - GSNs form a monotonically increasing number series, which defines the order of agreements
  - Learners have the same initial state and apply the same deterministic agreements in the same deterministic order
  - The GSN represents "logical" time in coordinated systems
- Paxos is a consensus algorithm proven to tolerate a variety of failures
  - Quorum-based consensus
  - Deterministic state machine
  - Leslie Lamport: The Part-Time Parliament (1990)
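As a concrete illustration of the GSN mechanics, the sketch below shows a learner that applies agreements to a deterministic state machine in strict GSN order, buffering anything that arrives early. This is a simplified, hypothetical Java sketch, not WANdisco's implementation; the StateMachine interface and the class name are invented for illustration. The deterministic apply step is what makes two replicas with the same initial state and the same GSN equivalent.

```java
import java.util.TreeMap;

/** Hypothetical sketch: apply agreements to a deterministic state machine in strict GSN order. */
public class GsnOrderedLearner {

    /** The replicated state; apply() must be deterministic for replicas to stay equivalent. */
    public interface StateMachine {
        void apply(byte[] operation);
    }

    private final StateMachine stateMachine;
    private final TreeMap<Long, byte[]> pending = new TreeMap<>(); // agreements that arrived early
    private long nextGsn = 1;                                      // next expected Global Sequence Number

    public GsnOrderedLearner(StateMachine stateMachine) {
        this.stateMachine = stateMachine;
    }

    /** Called by the coordination layer; agreements may arrive out of order. */
    public synchronized void onAgreement(long gsn, byte[] operation) {
        pending.put(gsn, operation);
        // Apply every agreement whose turn has come, in monotonically increasing GSN order.
        while (pending.containsKey(nextGsn)) {
            stateMachine.apply(pending.remove(nextGsn));
            nextGsn++;
        }
    }

    /** The GSN acts as logical time: two replicas at the same GSN have the same state. */
    public synchronized long currentGsn() {
        return nextGsn - 1;
    }
}
```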
Coordinated Replication of HCFS Namespace
Replicated Virtual Namespace
The Coordination Engine provides equivalence of multiple namespace replicas
- Coordinated virtual namespace controlled by the Fusion Node (see the sketch below)
  - A client that acts as a proxy for other clients' interactions
  - Reads are not coordinated
  - Writes (open, close, append, etc.) are coordinated
- The namespace events are consistent with each other
  - Each Fusion server maintains a log of changes that would occur in the namespace
  - Any Fusion Node can initiate an update, which is propagated to all other Fusion Nodes
- The Coordination Engine establishes the global order of namespace updates
  - Fusion servers apply deterministic updates in the same deterministic order to the underlying file system
  - Systems that start from the same state and apply the same updates are equivalent
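A rough way to picture the Fusion node's proxy role: namespace-mutating calls are routed through the coordination layer before they touch the underlying file system, while reads go straight through at local speed. The sketch below uses the standard Hadoop FileSystem API for the underlying storage; the Coordinator interface is an invented stand-in for the Coordination Engine client, not Fusion's actual API.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of the proxy role of a Fusion node: namespace writes are routed
 * through the coordination layer, reads go straight to the underlying HCFS.
 */
public class CoordinatedNamespaceProxy {

    /** Illustrative coordination hook: blocks until the proposal becomes an agreement. */
    public interface Coordinator {
        void coordinate(String operation, Path path) throws IOException;
    }

    private final FileSystem underlying;   // any HCFS implementation (HDFS, S3, MapR, ...)
    private final Coordinator coordinator;

    public CoordinatedNamespaceProxy(FileSystem underlying, Coordinator coordinator) {
        this.underlying = underlying;
        this.coordinator = coordinator;
    }

    /** Reads are not coordinated: served locally at LAN speed. */
    public FSDataInputStream open(Path path) throws IOException {
        return underlying.open(path);
    }

    /** Writes are coordinated: the open is agreed globally before the local create happens. */
    public FSDataOutputStream create(Path path) throws IOException {
        coordinator.coordinate("CREATE", path);
        return underlying.create(path);
    }

    /** Directory creation mutates the namespace, so it is coordinated as well. */
    public boolean mkdirs(Path path) throws IOException {
        coordinator.coordinate("MKDIRS", path);
        return underlying.mkdirs(path);
    }
}
```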
Strict Consistency Model
One-copy equivalence as known in replicated databases
- The Coordination Engine sequences file open and close proposals into the global sequence of agreements
  - Applied to each replicated folder namespace in the order of its Global Sequence Numbers
- Fusion replicated folders have identical states when they reach the same GSN
- One-copy equivalence
  - Folders may have different states at a given moment of "clock" time, as the rate of consuming agreements may vary
  - Provides the same state in logical time
Fusion
Geographically Distributed HCFS
Scaling Hadoop Across Data Centers
Continuous availability and disaster recovery over the WAN
- The system should appear, act, and be operated as a single cluster
  - Instant and automatic replication of data and metadata
- Parts of the cluster in different data centers should have equal roles
  - Data can be ingested or accessed through any of the centers
- Data creation and access should typically be at LAN speed
  - Running time of a job executed in one data center should be as if there were no other centers
- Failure scenarios: the system should continue to provide service and remain consistent
  - Any Fusion node can fail and replication still proceeds
  - Fusion nodes can fail simultaneously in two or more data centers and replication still proceeds
  - WAN partitioning does not cause a data center outage
  - RPO is as low as possible because replication is continuous rather than periodic
Foreign File Replication
A file is created in the client's data center and replicated to the others asynchronously
- Fusion workflow (see the sketch below)
  1. Client makes a request to create a file
  2. Fusion coordinates the file open with the other clusters involved (membership)
  3. The file is added to the underlying storage
  4. The IHC server pulls data from the local cluster and pushes it to the remote clusters
  5. Fusion coordinates the file close with the other clusters involved (membership)
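The numbered workflow can be lined up with code. The following hypothetical Java sketch maps each step to a call; Coordinator and IhcTransferService are invented stand-ins, and step 4 only schedules the background transfer, which is what keeps file creation at LAN speed.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical walk-through of the five-step Fusion workflow above. */
public class ForeignFileReplicationSketch {

    /** Illustrative stand-in for the coordination layer. */
    public interface Coordinator {
        void coordinate(String operation, Path path) throws IOException;
    }

    /** Illustrative stand-in for the IHC transfer service. */
    public interface IhcTransferService {
        void scheduleTransfer(Path path); // pull from the local cluster, push to remote clusters
    }

    private final FileSystem localCluster;
    private final Coordinator coordinator;
    private final IhcTransferService ihc;

    public ForeignFileReplicationSketch(FileSystem localCluster, Coordinator coordinator,
                                        IhcTransferService ihc) {
        this.localCluster = localCluster;
        this.coordinator = coordinator;
        this.ihc = ihc;
    }

    public void createReplicatedFile(Path path, byte[] contents) throws IOException {
        // 1-2. Client requests a file; Fusion coordinates the file open with the membership.
        coordinator.coordinate("OPEN", path);

        // 3. The file is added to the underlying storage of the local cluster.
        try (FSDataOutputStream out = localCluster.create(path)) {
            out.write(contents);
        }

        // 4. The IHC server pulls the data from the local cluster and pushes it to the
        //    remote clusters asynchronously, in the background.
        ihc.scheduleTransfer(path);

        // 5. Fusion coordinates the file close with the membership.
        coordinator.coordinate("CLOSE", path);
    }
}
```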
Inter-Hadoop Communication Service
- Uses the HCFS API and communicates directly with the underlying storage systems (see the example below)
  - Isilon
  - MapR
  - HDFS
  - S3
- NameNode and DataNode operations are unchanged
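Because the IHC service speaks only the HCFS API, the storage backend is selected by the URI scheme and the transfer logic itself does not change. Below is a small example using the stock Hadoop FileSystem API; the endpoint URIs and the file path are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Copies one file between two HCFS endpoints; the URI scheme picks the implementation. */
public class HcfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder URIs: any HCFS scheme works (hdfs://, s3a://, maprfs://, ...).
        FileSystem source = FileSystem.get(new URI("hdfs://dc1-namenode:8020"), conf);
        FileSystem target = FileSystem.get(new URI("s3a://dc2-bucket"), conf);

        Path file = new Path("/user/brett/cs-2015-01.log");
        try (FSDataInputStream in = source.open(file);
             FSDataOutputStream out = target.create(file)) {
            // Stream the bytes through the generic HCFS interfaces; no NameNode or
            // DataNode changes are required on either side.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```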
Multi-Data Center Installation
Do I need so many replicas?
Features
Active/Active Selective Data Replication
Selective Data Replication
Three main use cases for restricting data replication
- The "Saudi Arabia" case: data must never leave a specific data center
  - Protects data from being replicated outside of a specific geo-location, country, or facility; e.g., customer data from the Saudi Arabian branch of a global bank must never leave the country due to local regulations
  - Virtual namespace: only metadata whose supporting data is replicated gets replicated
- The "/tmp" case: data created in a directory by a native client should remain native
  - Transient data of a job running in one DC does not need to be replicated elsewhere, as it is deleted upon job completion and nobody else needs it
- The "Ingest Only" case: data is ingested directly into the cluster at the data's origin
  - Data replicates to all other data centers
  - A temporarily network-partitioned cluster can still ingest data
SDR Implementation Example
[Diagram: a single virtually replicated namespace (/ with user/, tmp/, and public/, containing cs-2015-01.log, cs-2015-02.log, shv-2015-03.txt, and job-2015-04.xml); the data of each file is selectively replicated to a subset of the data centers dc1, dc2, and dc3]
Heterogeneous Storage Zones
Virtual data centers representing different types of block storage
- Storage types: hard drive, SSD, RAM
- A virtual data center is a zone of similarly configured DataNodes
- Example:
  - Z1, archival zone: DataNodes with dense hard-drive storage
  - Z2, active-data zone: DataNodes with high-performance SSDs
  - Z3, real-time access zone: lots of RAM and cores, short-lived hot data
- The SDR policy defines three directories (see the sketch below):
  - /archive: replicated only on Z1
  - /active-data: replicated on Z2 and Z1
  - /real-time: replicated everywhere
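One way to express such a policy is a table mapping directory prefixes to the zones that should hold replicas. The sketch below is an invented Java illustration of that idea, not an actual Fusion configuration format; the entries mirror the example above, and the default behavior for unlisted paths is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an SDR policy mapping directories to storage zones. */
public class ZoneReplicationPolicy {

    // Longest matching prefix wins; entries mirror the example policy on the slide above.
    private final Map<String, Set<String>> zonesByDirectory = new LinkedHashMap<>();

    public ZoneReplicationPolicy() {
        zonesByDirectory.put("/archive",     Set.of("Z1"));             // dense hard drives only
        zonesByDirectory.put("/active-data", Set.of("Z1", "Z2"));       // SSDs plus archival copy
        zonesByDirectory.put("/real-time",   Set.of("Z1", "Z2", "Z3")); // replicated everywhere
    }

    /** Returns the zones that should hold replicas of the given path. */
    public Set<String> zonesFor(String path) {
        String bestPrefix = "";
        for (String prefix : zonesByDirectory.keySet()) {
            if (path.startsWith(prefix) && prefix.length() > bestPrefix.length()) {
                bestPrefix = prefix;
            }
        }
        // Illustrative default for unlisted paths: no cross-zone replication.
        return bestPrefix.isEmpty() ? Set.of() : zonesByDirectory.get(bestPrefix);
    }
}
```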
Simplified WAN Configuration
Reduced operational complexity
- Fast network protocols can keep up with demanding replication traffic
- Hadoop clusters do not require direct communication with each other
  - No n x m communication among DataNodes across data centers
  - Reduced firewall / SOCKS complexity
- Reduced attack surface
Thank You. Questions?
Come visit WANdisco at Booth 11
Selective Data Replication with Geographically Distributed Hadoop
Brett Rudenstein

Editor's Notes

  • #3: It is no secret that RAM is the limiting factor for NameNode scalability and, as a result, for the entire HDFS.
  • #4: The goal has either been achieved or good progress is being made toward it.
  • #13: Consensus algorithms are the core of a distributed Coordination Engine.
  • #15: Double determinism (the same deterministic updates applied in the same deterministic order) is important for the equivalent evolution of the systems.
  • #18: Unlike a multi-cluster architecture, where independent clusters run in each data center and mirror data between them.
  • #19:
    1. DC1 DataNodes report replicas to native GeoNodes.
    2. A DC1 GeoNode submits a Foreign Replica Report (FRR) proposal.
    3. The FRR agreement is executed by all foreign GeoNodes, which learn about the foreign replica locations.
    4. The DC1 GeoNode schedules a replica transfer from a native DataNode to a foreign DC2 DataNode.
    5. The DC2 DataNode reports the new replica to DC2 GeoNodes.
    6. A DC2 GeoNode schedules replication of the new replica within DC2.