Hadoop Operations Notes Series (27)

Published: 2020-06-21 17:50:19  Source: Internet  Views: 1711  Author: Slaytanic  Category: Big Data

This post records how I debugged a problem with pyspark2sql reading from an HDFS transparent encryption zone.

The code is shown below. It runs on pyspark 2.1.3 against CDH 5.14.0 with hive 1.1.0 + parquet, and the select statement reads from an HDFS encryption zone.

from pyspark.sql import SQLContext
from pyspark.sql import HiveContext, Row
from pyspark.sql.types import *
import pandas as pd
import pyspark.sql.functions as F

trial_pps_order = spark.read.parquet('/tmp/exia/trial_pps_select')
pps_order = spark.read.parquet('/tmp/exia/orders_pps_wc_member')
member_info = spark.read.parquet('/tmp/exia/member_info')

# newHiveContext=HiveContext(sc)

query_T="""
    select  * from crm.masterdata_hummingbird_product_mst_banner_v1  where brand_name = 'pampers'
"""
product_mst=spark.sql(query_T)

product_mst.show()

Running it in Zeppelin returns the following error:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-7483288776781667654.py", line 367, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-7483288776781667654.py", line 360, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 14, in <module>
  File "/usr/lib/spark-2.1.3-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "/usr/lib/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark-2.1.3-bin-hadoop2.6/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 6, pg-dmp-slave28.hadoop, executor 1): java.io.IOException: No KeyProvider is configured, cannot access an encrypted file
    at org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1338)
    at org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1414)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1455)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1443)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1442)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1941)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1954)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2390)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2792)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2389)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2396)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2131)
    at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2822)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2131)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2346)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No KeyProvider is configured, cannot access an encrypted file
    at org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1338)
    at org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1414)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more


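In short, the log says that no KeyProvider is configured, so the client has no key for the encryption zone and cannot read the encrypted data.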
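One way to see which key provider, if any, a session is configured with is to query its Hadoop configuration for the two key provider properties. The sketch below is only an illustration: `find_key_provider` and `KEY_PROVIDER_PROPERTIES` are hypothetical names, and the commented usage assumes the `spark` session that Zeppelin provides and relies on pyspark's private `_jsc` attribute. It shows only what the driver sees, which may differ from what an executor ends up with.

```python
# Property names that can point an HDFS client at a KMS key provider.
KEY_PROVIDER_PROPERTIES = (
    "hadoop.security.key.provider.path",
    "dfs.encryption.key.provider.uri",
)


def find_key_provider(conf):
    """Return (name, value) for the first key provider property set in conf.

    conf is any object with a get(name) method that returns a string or
    None, such as a dict or a Hadoop Configuration reached through py4j.
    Returns None when neither property has a non-empty value.
    """
    for name in KEY_PROVIDER_PROPERTIES:
        value = conf.get(name)
        if value:
            return name, value
    return None


# Hypothetical usage inside a Zeppelin pyspark paragraph:
# hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# print(find_key_provider(hadoop_conf))
```

If the call returns None on a cluster that uses encryption zones, that matches the "No KeyProvider is configured" message in the log.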

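The root cause is that Spark gives priority to the hive-site.xml in its own conf directory when it accesses Hive (strictly speaking, Spark SQL talks to the Hive metastore and reads the table files from HDFS itself rather than going through the HiveServer2 service), and that hive-site.xml did not contain the settings needed to reach the encryption zone. Adding the following properties to it fixes the problem.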

  <property>
    <name>hadoop.security.key.provider.path</name>
    <value>kms://http@dmp-master2.hadoop:16000/kms</value>
  </property>
  <property>
    <name>dfs.encrypt.data.transfer.algorithm</name>
    <value>3des</value>
  </property>
  <property>
    <name>dfs.encrypt.data.transfer.cipher.suites</name>
    <value>AES/CTR/NoPadding</value>
  </property>
  <property>
    <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
    <value>256</value>
  </property>
  <property>
    <name>dfs.encryption.key.provider.uri</name>
    <value>kms://http@dmp-master2.hadoop:16000/kms</value>
  </property>
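To confirm that a given hive-site.xml really carries the key provider settings, the file can be parsed and checked for them. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and the path in the commented usage is an assumption inferred from the Spark install directory that appears in the traceback.

```python
import xml.etree.ElementTree as ET

# Properties that point the HDFS client at the KMS key provider.
REQUIRED = (
    "hadoop.security.key.provider.path",
    "dfs.encryption.key.provider.uri",
)


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_properties(xml_text, required=REQUIRED):
    """Return the required property names that are absent or empty."""
    props = read_properties(xml_text)
    return [name for name in required if not props.get(name)]


# Hypothetical usage (the conf path is an assumption):
# with open("/usr/lib/spark-2.1.3-bin-hadoop2.6/conf/hive-site.xml", "rb") as f:
#     print(missing_properties(f.read()))
```

An empty list means both properties are present. Any name that is returned still has to be added to the file.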

