Introduction to Apache ZooKeeper
http://zookeeper.apache.org/
Saurav Haloi
Engineer at Symantec; works on Hadoop and distributed systems; FOSS enthusiast.
A distributed system consists of multiple computers that communicate through a computer network and interact with each other to achieve a common goal. - Wikipedia
The fallacies of distributed computing:
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
Reference: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
Coordination: an act that multiple nodes must perform together. Examples:
Group membership
Locking
Publisher/subscriber
Leader election
Synchronization
Getting node coordination correct is very hard!
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers. - ZooKeeper Wiki
ZooKeeper is much more than a distributed lock server!
An open source, high-performance coordination service for distributed applications. It exposes common services through a simple interface:
naming
configuration management
locks & synchronization
group services
…
so developers don't have to write them from scratch, and you can build your own primitives on top of it for specific needs.
Configuration management: cluster member nodes bootstrap their configuration from a centralized source in an unattended way, for easier, simpler deployment/provisioning.
Distributed cluster management: node join/leave, node status in real time.
Naming service, e.g. DNS.
Distributed synchronization: locks, barriers, queues.
Leader election in a distributed system.
Centralized and highly reliable (simple) data registry.
The ZooKeeper service is replicated over a set of machines; all machines store a copy of the data in memory. A leader is elected on service startup. Each client connects to exactly one ZooKeeper server and maintains a TCP connection to it. A client can read from any ZooKeeper server, but writes go through the leader and need majority consensus.
Image: https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
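A minimal sketch of establishing a session with the standard Java client (the connect string and session timeout are illustrative values, not from the slides):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnect {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // The constructor returns immediately; the session is established
        // asynchronously, so we wait for the SyncConnected event.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        System.out.println("Session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```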
ZooKeeper has a hierarchical name space. Each node in the namespace is called a znode. Every znode has data (given as byte[]) and can optionally have children.

parent : "foo"
|-- child1 : "bar"
|-- child2 : "spam"
`-- child3 : "eggs"
    `-- grandchild1 : "42"

Znode paths are canonical, absolute, and slash-separated; there are no relative references. Names can have Unicode characters.
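A sketch of building the tree above with the Java API, assuming a connected handle zk (see the connection sketch above); default open ACLs are used for brevity, which a production setup would not do:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Paths are absolute; each parent must exist before its children.
void buildTree(ZooKeeper zk) throws Exception {
    zk.create("/parent", "foo".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    zk.create("/parent/child1", "bar".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    zk.create("/parent/child2", "spam".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    zk.create("/parent/child3", "eggs".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    zk.create("/parent/child3/grandchild1", "42".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
}
```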
Each znode maintains a stat structure with version numbers for data changes and ACL changes, plus timestamps. The version numbers increase with every change. Data is read and written in its entirety.
Image: http://helix.incubator.apache.org/Architecture.html
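The data version enables conditional (compare-and-set) updates. A sketch, assuming a connected handle zk and a hypothetical existing znode /app/config:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Read-modify-write guarded by the data version.
void casUpdate(ZooKeeper zk) throws Exception {
    Stat stat = new Stat();
    byte[] old = zk.getData("/app/config", false, stat);
    byte[] updated = (new String(old) + ";updated").getBytes();
    try {
        // Fails with BadVersionException if someone else wrote in between.
        zk.setData("/app/config", updated, stat.getVersion());
    } catch (KeeperException.BadVersionException e) {
        // Concurrent update detected: re-read and retry as appropriate.
    }
}
```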
Persistent nodes: exist until explicitly deleted.
Ephemeral nodes: exist as long as the session that created them is active; can't have children.
Sequence nodes (unique naming): ZooKeeper appends a monotonically increasing counter to the end of the path; applies to both persistent and ephemeral nodes.
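A sketch of the three flavors via CreateMode (paths illustrative, default ACLs for brevity):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

void createModes(ZooKeeper zk) throws Exception {
    // Survives the session; must be deleted explicitly.
    zk.create("/durable", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Deleted automatically when this session ends; cannot have children.
    zk.create("/durable/member", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // The server appends a 10-digit counter, e.g. /durable/task-0000000007.
    String actual = zk.create("/durable/task-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    System.out.println("Created " + actual);
}
```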
Operation     Type
create        write
delete        write
exists        read
getChildren   read
getData       read
setData       write
getACL        read
setACL        write
sync          read

Znodes are the main entity that a programmer accesses.
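A short sketch exercising several of these operations together (the path /demo is illustrative):

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

void basicOps(ZooKeeper zk) throws Exception {
    zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);                           // write
    Stat stat = zk.exists("/demo", false);                    // read: null if absent
    byte[] data = zk.getData("/demo", false, stat);           // read
    zk.setData("/demo", "v2".getBytes(), stat.getVersion());  // write, version-guarded
    List<String> kids = zk.getChildren("/demo", false);       // read
    zk.delete("/demo", -1);                                   // write; -1 skips the version check
}
```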
Using the ZooKeeper command-line client:

[zk: localhost:2181(CONNECTED) 0] help
ZooKeeper -server host:port cmd args
        connect host:port
        get path [watch]
        ls path [watch]
        ls2 path [watch]
        set path data [version]
        create [-s] [-e] path data acl
        delete path [version]
        rmr path
        stat path [watch]
        sync path
        getAcl path
        setAcl path acl
        addauth scheme auth
        listquota path
        setquota -n|-b val path
        delquota [-n|-b] path
        printwatches on|off
        history
        redo cmdno
        close
        quit

[zk: localhost:2181(CONNECTED) 1] ls /
[hbase, zookeeper]
[zk: localhost:2181(CONNECTED) 2] ls2 /zookeeper
[quota]
cZxid = 0x0
ctime = Tue Jan 01 05:30:00 IST 2013
mZxid = 0x0
mtime = Tue Jan 01 05:30:00 IST 2013
pZxid = 0x0
cversion = -1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1
[zk: localhost:2181(CONNECTED) 3] create /test-znode HelloWorld
Created /test-znode
[zk: localhost:2181(CONNECTED) 4] ls /
[test-znode, hbase, zookeeper]
[zk: localhost:2181(CONNECTED) 5] get /test-znode
HelloWorld
Clients can set watches on znodes; the watch event types are NodeCreated, NodeDeleted, NodeDataChanged, and NodeChildrenChanged. A change to a znode triggers the watch, and ZooKeeper sends the client a notification.
Watches are one-time triggers.
Watches are always ordered: the client sees a watch event before it sees the new znode data.
The client should handle the latency between getting an event and sending the new request that sets the next watch; see the sketch below.
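A sketch of a self-re-registering data watch (the path /app/config is illustrative). Note that getData both returns the current data and arms the next watch in a single call, which narrows the event-to-rewatch gap:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Watch /app/config and re-arm the watch on every change.
void watchConfig(ZooKeeper zk) throws Exception {
    Watcher watcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Re-reading with this watcher re-registers the watch and
                    // returns the data as of now, so multiple intermediate
                    // changes are coalesced rather than lost.
                    byte[] data = zk.getData(event.getPath(), this, null);
                    System.out.println("New config: " + new String(data));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    };
    zk.getData("/app/config", watcher, null); // the initial read arms the first watch
}
```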
API methods come in synchronous as well as asynchronous flavors.

Sync:
    Stat stat = zk.exists("/test-cluster/CONFIGS", null);

Async:
    zk.exists("/test-cluster/CONFIGS", null, new AsyncCallback.StatCallback() {
        @Override
        public void processResult(int rc, String path, Object ctx, Stat stat) {
            // process result when called back later
        }
    }, null);
Read requests are processed locally at the ZooKeeper server to which the client is currently connected. Write requests are forwarded to the leader and go through majority consensus before a response is generated.
Image: http://www.slideshare.net/scottleber/apache-zookeeper
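Because reads are served locally, a read can lag the leader slightly; a client that must observe the latest committed state can issue sync() before reading. A sketch, assuming a connected handle zk and an illustrative path:

```java
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.ZooKeeper;

// Bring the connected server up to date with the leader before a fresh read.
void syncedRead(ZooKeeper zk) throws Exception {
    zk.sync("/app/config", new AsyncCallback.VoidCallback() {
        @Override
        public void processResult(int rc, String path, Object ctx) {
            // Nothing to do here; requests on one session are processed in
            // order, so the read issued below is ordered after this sync.
        }
    }, null);
    byte[] data = zk.getData("/app/config", false, null); // sees state as of the sync
}
```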
Sequential consistency: updates are applied in the order in which they were sent.
Atomicity: updates either succeed or fail; there are no partial results.
Single system image: a client sees the same view of the service regardless of the ZK server it connects to.
Reliability: once applied, an update persists until it is overwritten by another client.
Timeliness: a client's view of the system is guaranteed to be up to date within a certain time bound (eventual consistency).
Group membership recipe. Each client host i, i := 1..N:
1. Sets a watch on /members.
2. Creates /members/host-${i} as an ephemeral node.
3. A node join/leave generates an alert to the watchers.
4. Keeps updating /members/host-${i} periodically with node status changes (load, memory, CPU, etc.).
The cluster tree is /members with children host-1, host-2, …, host-N. A sketch of this recipe follows.
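A minimal sketch of the recipe with the Java API (paths and host id are illustrative; assumes /members already exists and error handling is elided):

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Join the group and keep a live view of the membership.
void joinGroup(ZooKeeper zk, String hostId) throws Exception {
    // Ephemeral: removed automatically if this host's session dies.
    zk.create("/members/host-" + hostId, "status".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    watchMembers(zk);
}

void watchMembers(ZooKeeper zk) throws Exception {
    List<String> members = zk.getChildren("/members", new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            try {
                watchMembers(zk); // re-arm the watch and re-read on join/leave
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    System.out.println("Current members: " + members);
}
```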
Leader election recipe:
1. Pick a znode, say "/svc/election-path".
2. All participants of the election process create an ephemeral-sequential node on the same election path.
3. The node with the smallest sequence number is the leader.
4. Each "follower" node watches the node with the next lower sequence number.
5. Upon leader removal, go to the election path and find the new leader, or become the leader if the node has the lowest sequence number.
6. Upon session expiration, check the election state and re-enter the election if needed.
Image: http://techblog.outbrain.com/2011/07/leader-election-with-zookeeper/
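A condensed sketch of steps 2-4 (production code would also handle re-election and session events, per steps 5-6):

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Enter the election; returns true if this participant is now the leader.
boolean runForLeader(ZooKeeper zk) throws Exception {
    String me = zk.create("/svc/election-path/n-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String myName = me.substring(me.lastIndexOf('/') + 1);

    List<String> nodes = zk.getChildren("/svc/election-path", false);
    Collections.sort(nodes); // the sequence suffix sorts lexicographically

    if (myName.equals(nodes.get(0))) {
        return true; // smallest sequence number: we lead
    }
    // Watch only the node immediately below us, not the leader (no herd).
    String predecessor = nodes.get(nodes.indexOf(myName) - 1);
    // A null Stat here would mean the predecessor already vanished; re-check.
    zk.exists("/svc/election-path/" + predecessor, new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDeleted) {
                // Predecessor gone: re-run the election check.
            }
        }
    });
    return false;
}
```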
Distributed lock recipe, assuming there are N clients trying to acquire a lock:
1. Each client creates an ephemeral-sequential znode under the path /Cluster/_locknode_.
2. The client requests the list of children of the lock znode (i.e. _locknode_).
3. The client whose znode has the least ID according to natural ordering holds the lock.
4. Every other client sets a watch on the znode with the ID immediately preceding its own, and re-checks for the lock when notified.
5. To release the lock, the holder deletes its node, which triggers the next client in line to acquire the lock.

|---Cluster
    +---config
    +---memberships
    +---_locknode_
        +---host1-3278451
        +---host2-3278452
        +---host3-3278453
        +--- …
        ---hostN-3278XXX
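A sketch of acquire/release following these steps (the blocking wait is simplified with a latch; paths are illustrative):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Block until the lock under /Cluster/_locknode_ is held; returns our znode path.
String acquireLock(ZooKeeper zk) throws Exception {
    String me = zk.create("/Cluster/_locknode_/lock-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String myName = me.substring(me.lastIndexOf('/') + 1);

    while (true) {
        List<String> children = zk.getChildren("/Cluster/_locknode_", false);
        Collections.sort(children);
        int myIndex = children.indexOf(myName);
        if (myIndex == 0) {
            return me; // lowest sequence number: lock acquired
        }
        // Wait for the node just ahead of us to disappear, then re-check.
        CountDownLatch gone = new CountDownLatch(1);
        Stat stat = zk.exists("/Cluster/_locknode_/" + children.get(myIndex - 1),
                event -> gone.countDown());
        if (stat != null) {
            gone.await();
        }
    }
}

void releaseLock(ZooKeeper zk, String lockPath) throws Exception {
    zk.delete(lockPath, -1); // the next client in line gets notified
}
```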
ZooKeeper ships client libraries in Java, C, Perl, and Python. Community-contributed client bindings are available for Scala, C#, Node.js, Ruby, Erlang, Go, and Haskell.
https://cwiki.apache.org/ZOOKEEPER/zkclientbindings.html
Watches are one-time triggers: continuous watching of znodes requires resetting the watch after every event/trigger.
Too many watches on a single znode create the "herd effect", causing bursts of traffic and limiting scalability.
If a znode changes multiple times between getting the event and setting the watch again, handle it carefully!
Keep session timeouts long enough to handle long garbage-collection pauses in applications.
Set the Java max heap size correctly to avoid swapping.
Use a dedicated disk for the ZooKeeper transaction log (see the configuration sketch below).
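A minimal zoo.cfg sketch for the last point; dataLogDir points the transaction log at its own disk, separate from the snapshot directory (all values and paths are illustrative):

```
# zoo.cfg (illustrative values)
tickTime=2000
clientPort=2181
# Snapshots on one device...
dataDir=/var/lib/zookeeper/data
# ...and the write-ahead transaction log on a dedicated disk,
# so log fsyncs are not delayed by other I/O.
dataLogDir=/zklog/txlog
```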
Companies: Yahoo!, Zynga, Rackspace, LinkedIn, Netflix, and many more.
Projects in FOSS: Apache Map/Reduce (YARN), Apache HBase, Apache Solr, Neo4j, Katta, and many more.
Reference: https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy
Used within Twitter for service discovery. How?
Services register themselves in ZooKeeper.
Clients query the production cluster for service "A" in data center "XYZ".
An up-to-date host list for each service is maintained.
Whenever new capacity is added, clients automatically become aware of it.
It also enables load balancing across all servers.
Reference: http://engineering.twitter.com/
The Chubby lock service for loosely-coupled distributed systems. Google Research, 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
ZooKeeper: Wait-free coordination for Internet-scale systems. Yahoo! Research, USENIX Annual Technical Conference, 2010.
Apache ZooKeeper home: http://zookeeper.apache.org/
Presentations:
http://www.slideshare.net/mumrah/introduction-to-zookeeper-trihug-may-22-2012
http://www.slideshare.net/scottleber/apache-zookeeper
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperPresentations
The Google File System
The Hadoop Distributed File System
MapReduce: Simplified Data Processing on Large Clusters
Bigtable: A Distributed Storage System for Structured Data
PNUTS: Yahoo!'s Hosted Data Serving Platform
Dynamo: Amazon's Highly Available Key-value Store
Spanner: Google's Globally-Distributed Database
Centrifuge: Integrated Lease Management and Partitioning for Cloud Services (Microsoft)
ZAB: A simple totally ordered broadcast protocol (Yahoo!)
Paxos Made Simple, by Leslie Lamport
Eventually Consistent, by Werner Vogels (CTO, Amazon)
http://www.highscalability.com/
Questions?
Thank You!
Saurav Haloi
saurav.haloi@yahoo.com
Twitter: sauravhaloi