Part I of this article will detail how to streamline your data analytics with AWS Athena using Terraform Infrastructure as Code.
## Requirements
- AWS account
- Terraform
## Walkthrough
Configure AWS credentials with `aws configure`.
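Running the command prompts interactively for your access key, secret key, default region, and output format (the values below are placeholders):

```bash
aws configure
# AWS Access Key ID [None]: AKIA...
# AWS Secret Access Key [None]: ...
# Default region name [None]: us-east-1
# Default output format [None]: json
```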
At the root of our project, create a file named `provider.tf`, in which we specify the AWS provider. We are also going to read our table columns dynamically, declaring a variable that describes each column.
```hcl
# provider.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.59.0"
    }
  }
}
```

Next, let's define an S3 bucket in a file named `s3.tf`, which we will use as the source of our data.
```hcl
# s3.tf
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-tf-log-bucket-source-athena"
}
```

Now we will define a file named `athena.tf`. In this file we create an Athena workgroup, a database, a table, and our named query.
For our `aws_glue_catalog_table`, we expect the table to deserialize JSON objects via the OpenX JSON SerDe.
Note: it is important that our JSON files are minified, with each record on a single line and no pretty-printed whitespace; otherwise the SerDe won't parse them correctly. An alternative is `LazySimpleSerDe`, which deserializes records from a raw byte stream.
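If your source file is pretty-printed, one quick way to minify it (assuming `jq` is available; the file names here are just examples) is:

```bash
# -c (compact) rewrites the JSON document as a single line
jq -c . data.pretty.json > data.json
```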
```hcl
# athena.tf
resource "aws_glue_catalog_database" "myservice_athena_db" {
  name = "myservice"
}

resource "aws_glue_catalog_table" "athena_table" {
  name          = "mytable"
  database_name = aws_glue_catalog_database.myservice_athena_db.name
  description   = "Table containing the results stored in S3 as source"
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.log_bucket.bucket}/mydata"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "s3-stream"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "ignore.malformed.json" = "TRUE"
        "dots.in.keys"          = "FALSE"
        "case.insensitive"      = "TRUE"
        "mapping"               = "TRUE"
      }
    }

    dynamic "columns" {
      for_each = var.columns
      iterator = column

      content {
        name = column.value.name
        type = column.value.type
      }
    }
  }
}

resource "aws_athena_workgroup" "test" {
  name = "example123123s"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.log_bucket.bucket}/output/"
    }
  }
}

resource "aws_athena_named_query" "athena_query" {
  name      = "test_query"
  workgroup = aws_athena_workgroup.test.id
  database  = aws_glue_catalog_database.myservice_athena_db.name
  query     = "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"
}
```

Our JSON example should look like this (pretty-printed here for readability; remember the actual file must be minified):
```json
{
  "product": "Live JSON generator",
  "version": 3.1,
  "releaseDate": "2014-06-25T00:00:00.000Z",
  "demo": true,
  "person": {
    "id": 12345,
    "name": "John Doe",
    "phones": {
      "home": "800-123-4567",
      "mobile": "877-123-1234"
    },
    "email": [
      "jd@example.com",
      "jd@example.org"
    ],
    "dateOfBirth": "1980-01-02T00:00:00.000Z",
    "registered": true,
    "emergencyContacts": [
      {
        "name": "Jane Doe",
        "phone": "888-555-1212",
        "relationship": "spouse"
      },
      {
        "name": "Justin Doe",
        "phone": "877-123-1212",
        "relationship": "parent"
      }
    ]
  }
}
```

In our `variables.tf` we declare a variable that lets us pass in all the columns dynamically:
```hcl
# variables.tf
variable "columns" {
  type = list(object({
    name = string,
    type = string,
  }))
  default     = []
  description = "The columns in the table, where the key is the name of the column and the value the type"
}
```

```hcl
# terraform.tfvars
columns = [
  {
    name = "product"
    type = "string"
  },
  {
    # "version" is a decimal (3.1) in the sample data, so "double" fits better than "int"
    name = "version"
    type = "double"
  },
  {
    name = "releaseDate"
    type = "string"
  },
  {
    name = "demo"
    type = "boolean"
  },
  {
    name = "person"
    type = "string"
  }
]
```

After finishing our files, let's use Terraform to create our resources.
```bash
terraform init
terraform plan
terraform apply
```
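As a side note, the sample file could also be uploaded as part of the Terraform run itself. Here is a minimal sketch using the AWS provider's `aws_s3_object` resource (the local file name `data.json` and the resource name are assumptions for illustration, not part of the original setup):

```hcl
# Optional: manage the sample data file with Terraform instead of the CLI
resource "aws_s3_object" "sample_data" {
  bucket = aws_s3_bucket.log_bucket.bucket
  key    = "mydata/data.json"
  source = "${path.module}/data.json"
  etag   = filemd5("${path.module}/data.json") # re-upload when the file changes
}
```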
Alternatively, upload our `data.json` file through S3 CLI commands or directly in the console:
```bash
aws s3 cp data.json s3://<<your_bucket_name>>/mydata/
```

Now, in the Athena service of our AWS account, we can see our database, table, and saved query. Running the query should return a result like this:
| name     | source              |
| -------- | ------------------- |
| John Doe | Live JSON generator |
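If you prefer the terminal, the same query can be executed through the Athena CLI as well (a sketch assuming your default credentials and region; the result location comes from the workgroup configuration):

```bash
# Start the query in our workgroup; this returns a QueryExecutionId
aws athena start-query-execution \
  --work-group example123123s \
  --query-execution-context Database=myservice \
  --query-string "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"

# Fetch the results once the execution finishes
aws athena get-query-results --query-execution-id <<query_execution_id>>
```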