Part I of this article will detail how to streamline your data analytics with AWS Athena using Terraform Infrastructure as Code.
## Requirements
- AWS account
- Terraform
## Walkthrough
Configure AWS credentials with `aws configure`.
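Running the command prompts interactively for your access key, secret key, default region, and output format (the values below are placeholders):

```bash
aws configure
# AWS Access Key ID [None]: AKIA...
# AWS Secret Access Key [None]: ...
# Default region name [None]: us-east-1
# Default output format [None]: json
```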
At the root of our project, create a file named `provider.tf`, in which we specify the AWS provider. We are also going to read our table columns dynamically, declaring a variable that describes each column.
```hcl
# provider.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.59.0"
    }
  }
}
```

Next, let's define an S3 bucket in a file named `s3.tf`, which we will use as the source of our data.
```hcl
# s3.tf
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-tf-log-bucket-source-athena"
}
```

Now we will define a file named `athena.tf`. In this file we create an Athena workgroup, a database, a table, and our named query.
For our `aws_glue_catalog_table`, we expect the table to deserialize JSON objects via the OpenX JSON SerDe.
Note: it is important that our JSON files are minified, with each record on a single line and no pretty-printed whitespace; otherwise the SerDe won't parse them correctly. An alternative is `LazySimpleSerDe`, which deserializes records from a raw byte stream.
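If your source file is pretty-printed, one quick way to minify it (assuming `jq` is available; the file names here are just examples) is:

```bash
# -c (compact) rewrites the JSON document as a single line
jq -c . data.pretty.json > data.json
```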
```hcl
# athena.tf
resource "aws_glue_catalog_database" "myservice_athena_db" {
  name = "myservice"
}

resource "aws_glue_catalog_table" "athena_table" {
  name          = "mytable"
  database_name = aws_glue_catalog_database.myservice_athena_db.name
  description   = "Table containing the results stored in S3 as source"
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.log_bucket.bucket}/mydata"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "s3-stream"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "ignore.malformed.json" = "TRUE"
        "dots.in.keys"          = "FALSE"
        "case.insensitive"      = "TRUE"
        "mapping"               = "TRUE"
      }
    }

    dynamic "columns" {
      for_each = var.columns
      iterator = column

      content {
        name = column.value.name
        type = column.value.type
      }
    }
  }
}

resource "aws_athena_workgroup" "test" {
  name = "example123123s"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.log_bucket.bucket}/output/"
    }
  }
}

resource "aws_athena_named_query" "athena_query" {
  name      = "test_query"
  workgroup = aws_athena_workgroup.test.id
  database  = aws_glue_catalog_database.myservice_athena_db.name
  query     = "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"
}
```

Our JSON example should look like this (pretty-printed here for readability; remember the actual file must be minified):
```json
{
  "product": "Live JSON generator",
  "version": 3.1,
  "releaseDate": "2014-06-25T00:00:00.000Z",
  "demo": true,
  "person": {
    "id": 12345,
    "name": "John Doe",
    "phones": {
      "home": "800-123-4567",
      "mobile": "877-123-1234"
    },
    "email": [
      "jd@example.com",
      "jd@example.org"
    ],
    "dateOfBirth": "1980-01-02T00:00:00.000Z",
    "registered": true,
    "emergencyContacts": [
      {
        "name": "Jane Doe",
        "phone": "888-555-1212",
        "relationship": "spouse"
      },
      {
        "name": "Justin Doe",
        "phone": "877-123-1212",
        "relationship": "parent"
      }
    ]
  }
}
```

In our `variables.tf` we declare a variable that lets us pass in all the columns dynamically:
```hcl
# variables.tf
variable "columns" {
  type = list(object({
    name = string,
    type = string,
  }))
  default     = []
  description = "The columns in the table, where the key is the name of the column and the value the type"
}
```

```hcl
# terraform.tfvars
columns = [
  {
    name = "product"
    type = "string"
  },
  {
    # "version" is a decimal (3.1) in the sample data, so "double" fits better than "int"
    name = "version"
    type = "double"
  },
  {
    name = "releaseDate"
    type = "string"
  },
  {
    name = "demo"
    type = "boolean"
  },
  {
    name = "person"
    type = "string"
  }
]
```

After finishing our files, let's use Terraform to create our resources.
```bash
terraform init
terraform plan
terraform apply
```
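As a side note, the sample file could also be uploaded as part of the Terraform run itself. Here is a minimal sketch using the AWS provider's `aws_s3_object` resource (the local file name `data.json` and the resource name are assumptions for illustration, not part of the original setup):

```hcl
# Optional: manage the sample data file with Terraform instead of the CLI
resource "aws_s3_object" "sample_data" {
  bucket = aws_s3_bucket.log_bucket.bucket
  key    = "mydata/data.json"
  source = "${path.module}/data.json"
  etag   = filemd5("${path.module}/data.json") # re-upload when the file changes
}
```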
Alternatively, upload our `data.json` file through S3 CLI commands or directly in the console:
```bash
aws s3 cp data.json s3://<<your_bucket_name>>/mydata/
```

Now, in the Athena service of our AWS account, we can see our database, table, and saved query. Running the query should return a result like this:
| name     | source              |
| -------- | ------------------- |
| John Doe | Live JSON generator |
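If you prefer the terminal, the same query can be executed through the Athena CLI as well (a sketch assuming your default credentials and region; the result location comes from the workgroup configuration):

```bash
# Start the query in our workgroup; this returns a QueryExecutionId
aws athena start-query-execution \
  --work-group example123123s \
  --query-execution-context Database=myservice \
  --query-string "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"

# Fetch the results once the execution finishes
aws athena get-query-results --query-execution-id <<query_execution_id>>
```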