Usage

This section covers how to manipulate an encrypted DataFrame in either client mode or insecure mode.

Saving a DataFrame

Save the encrypted DataFrame to local disk. The encrypted data can then be uploaded to cloud storage of your choice for easy access.

Scala:

dfEncrypted.write.format("edu.berkeley.cs.rise.opaque.EncryptedSource").save("dfEncrypted")
// The file dfEncrypted/part-00000 now contains encrypted data

Python:

df_encrypted.write.format("edu.berkeley.cs.rise.opaque.EncryptedSource").save("df_encrypted") 

Using the DataFrame interface

  1. Users can load the previously persisted encrypted DataFrame.

    Scala:

    import org.apache.spark.sql.types._

    val dfEncrypted = (spark.read.format("edu.berkeley.cs.rise.opaque.EncryptedSource")
      .schema(StructType(Seq(
        StructField("word", StringType),
        StructField("count", IntegerType))))
      .load("dfEncrypted"))

    Python:

    df_encrypted = spark.read.format("edu.berkeley.cs.rise.opaque.EncryptedSource").load("df_encrypted") 
  2. Given an encrypted DataFrame, construct a new query. Use explain to inspect the generated query plan.

    Scala:

    val result = dfEncrypted.filter($"count" > lit(3))
    result.explain(true)
    // [...]
    // == Optimized Logical Plan ==
    // EncryptedFilter (count#6 > 3)
    // +- EncryptedLocalRelation [word#5, count#6]
    // [...]

    Python:

    result = df_encrypted.filter(df_encrypted["count"] > 3)
    result.explain(True)
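Since the filtered result is itself an encrypted DataFrame, it can be written back out with the same data source used in the saving step. A minimal Scala sketch, assuming a spark-shell session with Opaque loaded and the dfEncrypted DataFrame from step 1 in scope; the output path "filteredEncrypted" is an example name, not part of the original:

```scala
import org.apache.spark.sql.functions.lit
import spark.implicits._  // brings the $"..." column syntax into scope

// Re-run the filter from above; the data stays encrypted end to end.
val filtered = dfEncrypted.filter($"count" > lit(3))

// Persist the encrypted result; "filteredEncrypted" is an example path.
filtered.write
  .format("edu.berkeley.cs.rise.opaque.EncryptedSource")
  .save("filteredEncrypted")
```

The saved directory can later be reloaded with the same read pattern shown in step 1.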

Using the SQL interface

  1. Users can also load the previously persisted encrypted DataFrame using the SQL interface.

    spark.sql(s"""
      |CREATE TEMPORARY VIEW dfEncrypted
      |USING edu.berkeley.cs.rise.opaque.EncryptedSource
      |OPTIONS (
      |  path "dfEncrypted"
      |)""".stripMargin)
  2. The SQL API can be used to run the same query on the loaded data.

    val result = spark.sql(s"""
      |SELECT * FROM dfEncrypted
      |WHERE count > 3""".stripMargin)
    result.show
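The same temporary view supports further SQL over the encrypted data, for example a projection. A sketch assuming the dfEncrypted view created in step 1 above is still registered:

```scala
// Select a single column from the encrypted view; explain shows the
// plan built from encrypted operators, as in the DataFrame example.
val words = spark.sql(s"""
  |SELECT word FROM dfEncrypted
  |WHERE count > 3""".stripMargin)
words.explain(true)
words.show
```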