Technical tips for secure Apache Hadoop cluster Akira Ajisaka, Kei Kori Yahoo Japan Corporation Big Data
Akira Ajisaka (@ajis_ka) • Software Engineer in Hadoop team @ Yahoo! JAPAN – Upgraded HDFS to 3.3.0 and enabled RBF – R&D for a Hadoop cluster more secure than just enabling Kerberos auth • Apache Hadoop committer/PMC – ~800 commits in various components over 6 years – Handled and announced several CVEs – Manages build and QA environment
Kei KORI (@2k0ri) • Data Platform Engineer in Hadoop team @ Yahoo! JAPAN – Built the upgrade to and continuous delivery for HDFS 3.3.0 – Research on operations for a more secure Hadoop cluster • Kubernetes admin for Hadoop client environment – Migrates users from VM/BM to a cloud-native way – Integrates ML/DL workloads with Hadoop ecosystem
Session Overview
Session Overview Prerequisites: • Hadoop is not secure by default • Kerberos authentication is required This talk is to introduce further details in practice: • Wire encryption in Hadoop ecosystem • HDFS transparent data encryption at rest • Other considerations
Wire encryption in Hadoop ecosystem
Background To make the Hadoop ecosystem more secure than perimeter security alone • Not only authenticate but also encrypt communications • Protection against and mitigation of internal threats such as packet sniffing • Part of security compliance such as NIST SP800-171
Overview: wire encryption types between components • HTTP encryption – HDFS, YARN, MapReduce, KMS, HttpFS, Spark, Hive, Oozie, Livy • RPC encryption – HDFS, YARN, MapReduce, KMS, Spark, Hive, Oozie, ZooKeeper • Block data transfer encryption – HDFS • Shuffle encryption – MapReduce, Spark, Tez
HTTP encryption for Hadoop • dfs.http.policy: HTTPS_ONLY in hdfs-site, yarn.http.policy: HTTPS_ONLY in yarn-site, mapreduce.jobhistory.http.policy: HTTPS_ONLY in mapred-site, etc. – Enables TLS on WebUI/REST API endpoints – Use HTTP_AND_HTTPS while rolling-updating endpoints • yarn.timeline-service.webapp.https.address in yarn-site, mapreduce.jobhistory.webapp.https.address in mapred-site – Set History/Timeline Server endpoints with HTTPS • Store certs and passphrases with the Hadoop Credential Provider via hadoop.security.credential.provider.path – Separates permissions from configs – Prevents exposure beyond hadoop.security.sensitive-config-keys filtering
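A minimal sketch of the HTTPS-only policy in hdfs-site.xml (the yarn/mapreduce policy keys follow the same pattern in their own files), plus a credential provider path, typically set in core-site.xml; the jceks path is a placeholder:
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://file/etc/hadoop/conf/ssl.jceks</value>
</property>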
RPC encryption for Hadoop • hadoop.rpc.protection: privacy in core-site – Encrypts RPC incl. Kerberos authentication on SASL layer – Propagates to hadoop.security.saslproperties.resolver.class, dfs.data.transfer.saslproperties.resolver.class and dfs.data.transfer.protection • hadoop.rpc.protection: privacy,authentication while rolling update whole Hadoop servers/clients – Accepts falling back to non-encrypted RPC
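For reference, a core-site.xml sketch of the privacy-only setting (switch to the comma-separated value only during the rolling update described above):
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>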
Block data transfer encryption for Hadoop • dfs.encrypt.data.transfer: true, dfs.encrypt.data.transfer.cipher.suites: AES/CTR/NoPadding in hdfs-site – Only encrypts the payload between HDFS clients and DataNodes • Rolling update is not supported by configuration alone – Needs managing a list of encrypted nodes, or extending/implementing your own dfs.trustedchannel.resolver.class – Nodes trusted by dfs.trustedchannel.resolver.class are forced to transfer without encryption regardless of their encryption settings
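The corresponding hdfs-site.xml sketch of the keys listed above:
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>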
Encryption for Spark In spark-defaults: • HTTP encryption – spark.ssl.sparkHistory.enabled true • Switches the protocol on a single port, does not support HTTP_AND_HTTPS – spark.yarn.historyServer.address https://... • RPC encryption – spark.authenticate true • Also required in yarn-site – spark.authenticate.enableSaslEncryption true – spark.network.sasl.serverAlwaysEncrypt true • Enable only after all Spark components recognize enableSaslEncryption • Shuffle encryption – spark.network.crypto.enabled true – spark.io.encryption.enabled true • Encrypts spilled caches and RDDs on local disks
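Collected into spark-defaults.conf, the settings above look roughly like this (keys mirror this slide; the history server address and port are placeholders):
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.network.sasl.serverAlwaysEncrypt   true
spark.network.crypto.enabled             true
spark.io.encryption.enabled              true
spark.ssl.sparkHistory.enabled           true
spark.yarn.historyServer.address         https://historyserver.example.com:18480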
Encryption for Hive • hive.server2.thrift.sasl.qop: auth-conf in hive-site – Encrypts JDBC between client and HiveServer2 binary mode – And Thrift between clients and Hive Metastore • hive.server2.use.SSL: true in hive-site – Only for HS2 http mode – HS2 binary mode cannot enable both TLS and SASL • Encryption for JDBC between HS2/Hive Metastore and remote RDBMS • Shuffle encryption – Tez: tez.runtime.shuffle.ssl.enable: true, tez.runtime.shuffle.keep-alive.enabled: true in tez-site – MapReduce: mapreduce.ssl.enabled: true, mapreduce.shuffle.ssl.enabled: true in mapred-site – Requires server certs for all NodeManagers
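For example, the shuffle and QOP keys above in one place (values as listed on this slide):
hive-site.xml:   hive.server2.thrift.sasl.qop = auth-conf
tez-site.xml:    tez.runtime.shuffle.ssl.enable = true
                 tez.runtime.shuffle.keep-alive.enabled = true
mapred-site.xml: mapreduce.ssl.enabled = true
                 mapreduce.shuffle.ssl.enabled = true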
Challenges in HTTP encryption: for Application Master / Spark Driver • Server certs for ApplicationMaster / SparkDriver need to be readable by the user who submitted it – ApplicationMaster and SparkDriver run as the user – WebApplicationProxy between ResourceManager and ApplicationMaster relies on this encryption • Applications support TLS and can bundle certs since – Spark 3.0.0: SPARK-24621 – MapReduce 3.3.0: MAPREDUCE-4669 – Tez: not supported yet
Encryption for ZooKeeper server • Authenticate with SASL, encrypt with TLS – ZooKeeper does not respect SASL QOP • Requires ZooKeeper 3.5.6 or above for servers/quorums – serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory – sslQuorum=true – ssl.clientAuth=NONE – ssl.quorum.clientAuth=NONE • Needs ZOOKEEPER-4276 to follow "Upgrading existing non-TLS cluster with no downtime" – Lets ZK serve only on secureClientPort
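A hedged zoo.cfg sketch of the server-side settings above; the port number and keystore/truststore paths are placeholders:
secureClientPort=2281
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
sslQuorum=true
ssl.clientAuth=NONE
ssl.quorum.clientAuth=NONE
ssl.keyStore.location=/etc/zookeeper/conf/keystore.jks
ssl.trustStore.location=/etc/zookeeper/conf/truststore.jks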
Encryption for ZooKeeper client • Also requires ZooKeeper 3.5.6 or above for clients – -Dzookeeper.client.secure=true -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty in client JVM args – HADOOP_OPTS environment variable – mapreduce.admin.map.child.java.opts, mapreduce.admin.reduce.child.java.opts in mapred-site for Oozie Coordinator MapReduce jobs • Needs to replace and update ZooKeeper jars in all components which communicate with ZooKeeper – ZKFC, ResourceManager, Hive clients incl. HS2, Oozie and Livy – Apache Curator must also be updated to 4.2.0, and Netty from 4.0 to 4.1
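For command-line clients, the flags can be passed via HADOOP_OPTS, for example (a sketch using the properties from this slide):
export HADOOP_OPTS="$HADOOP_OPTS \
  -Dzookeeper.client.secure=true \
  -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty"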
Enforcing Kerberos AuthN/Z for ZooKeeper • Requires ZooKeeper 3.6.0 or above for servers – 3.6.0+: zookeeper.sessionRequireClientSASLAuth=true – 3.7.0+: enforce.auth.enabled=true enforce.auth.schemes=sasl • Oozie Hive action will not work when ZK SASL is enforced – Fails when acquiring the lock for Hive Metastore – Has no mechanism to delegate authentication or impersonation for ZooKeeper – Using HiveServer2 / Oozie Hive2 action solves it
HDFS transparent data encryption (TDE) at rest
Background HDFS blocks are written to the local filesystem of the DataNodes • The data is not encrypted by default • Encryption is required in several use cases Encryption can be done at several layers: • Application: most secure, but hardest to do • Database: most databases have this, but it may incur performance penalties • Filesystem: high performance, transparent, but may not be flexible • Disk: only really protects against physical theft HDFS TDE fits between the database and filesystem levels
Overview: encryption/decryption is transparent to the clients
KeyProvider: Where KEK is saved Implementations of KeyProvider API • Hadoop KMS: JavaKeyStoreProvider – JCEKS files in Hadoop compatible filesystems (localFS, HDFS, cloud storage) – Not recommended • Apache Ranger KMS: RangerKeyStoreProvider – RDBMS – master key can be stored in Luna HSM (optional) – HSM is required in some use cases • PCI-DSS, FIPS 140-2
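To point HDFS and its clients at the chosen KeyProvider, reference it by URI in core-site.xml; the hostname and port below are placeholders:
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://https@kms.example.com:9600/kms</value>
</property>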
Extending KeyProvider API is not difficult • Mandatory methods for HDFS TDE – getKeyVersion, getCurrentKey, getMetadata • Optional methods (nice to have for operation) – getKeys, getKeysMetadata, getKeyVersions, createKey, deleteKey, rollNewVersion – If not implemented, you need to create/delete/list/roll keys in some other way • Use cases: – LinkedIn integrated with its own key management service, LiKMS https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr – Yahoo! JAPAN also integrated with its own credential store in only ~500 LOC (including test code)
KeyProvider is actually stable, can be used safely • KeyProvider is @Public and @Unstable – @Unstable in Hadoop means "incompatible changes are allowed at any time" • Actually, the API is very stable – No incompatible changes – Ranger uses it since 2015: RANGER-247 • Provided a patch to mark it stable – HADOOP-17544
Hadoop KMS: Where the KEK is cached and authorization is performed • KMS interacts with HDFS clients, NameNodes, and the KeyProvider • KMS has its own ACLs, separate from HDFS ACLs – An attacker cannot decrypt data even if HDFS ACLs are compromised – If 'usera' reads/writes data in the encryption zone with 'keya', the configuration in kms-acls.xml will be: <property> <name>key.acl.keya.DECRYPT_EEK</name> <value>usera</value> </property> – The configuration is hot-reloaded • For HA and scalability, multiple KMS instances are supported
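As a usage sketch, the key and encryption zone for the 'keya' example above would be created roughly as follows (the zone path is a placeholder; createZone requires an existing empty directory):
hadoop key create keya
hdfs dfs -mkdir -p /secure/zone
hdfs crypto -createZone -keyName keya -path /secure/zone
hdfs crypto -listZones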
How to deploy multiple KMS instances Two approaches: 1. Behind a load-balancer or VIP 2. Using LoadBalancingKMSClientProvider – Implicitly used when multiple URIs are specified in hadoop.security.key.provider.path If you have a LB or VIP, use it • No configuration change to scale out or decommission • The LB saves clients' retry cost – LoadBalancingKMSClientProvider first tries to connect to one KMS and, if that fails, connects to another
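With LoadBalancingKMSClientProvider, the instances are listed in a single URI with ';'-separated hostnames; a sketch with placeholder hosts:
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://https@kms01.example.com;kms02.example.com:9600/kms</value>
</property>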
How to configure multiple KMS instances • Delegation tokens must be synchronized – Use ZKDelegationTokenSecretManager – An example configuration is documented in HADOOP-17794 • hadoop.security.token.service.use_ip – If true (default), SSL certificate validation fails in a multi-homed environment – Documented in HADOOP-12665
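A hedged kms-site.xml sketch of the ZK-backed delegation token secret manager (key names as in the Hadoop KMS documentation; the connection string is a placeholder):
<property>
  <name>hadoop.kms.authentication.zk-dt-secret-manager.enable</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.kms.authentication.zk-dt-secret-manager.zkConnectionString</name>
  <value>zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181</value>
</property>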
Tuning Hadoop KMS • Documented and discussed in HADOOP-15743 – Reduce SSL session cache size and TTL – Tuning https idle timeout – Increase max file descriptors – etc. • This tuning is effective in HttpFS as well – Both KMS/HttpFS use Jetty via HttpServer2
Recap: HDFS TDE • Careful configuration required – How to save KEK – Running multiple KMS instances – KMS Tuning – Where to create encryption zones – ACLs (including key ACLs and impersonation) • They are not straightforward despite the long time since the feature was developed
Other considerations
Updating SSL certificates • Hadoop >= 3.3.1 allows updating SSL certificates without downtime: HADOOP-16524 – Uses the hot-reload feature in Jetty – Except the DataNode, since DN doesn't rely on Jetty • Useful especially for the NameNode because it takes > 30 minutes to restart in a large cluster
Other considerations • It is important to be ready to upgrade at any time – CVEs are sometimes published and the vendors warn users to upgrade • Security requirements may increase later, so be prepared for that early • Operational considerations are also necessary – Not only the cluster configuration but also the operations will change
Conclusion & Future work We introduced many technical tips for a secure Hadoop cluster • However, they might change in the future • Need to catch up with the OSS community Future work • How to enable SSL/TLS in ApplicationMaster & Spark Driver Web UIs • Impersonation does not work correctly in KMSClientProvider: HDFS-13697
THANK YOU QUESTIONS? @aajisaka @2k0ri
