Using AWS, Terraform, and Ansible for DreamPort Projects - the Splunk Cluster
How we used (and are still using) tools such as AWS, Terraform, and Ansible to automate everything about a Splunk cluster.
On the Agenda Today... • Intro • The Who, the What, the Why, and the How • Hands on Keys – Live Demo • Summary, Questions, Extra Deep Dives
Prerequisites – Terms and Tools • Basic understanding of AWS and cloud computing platforms • Awareness of configuration management/orchestration tools such as Terraform and Ansible • Familiarity with the concepts of Docker • Basic understanding of Splunk and a Splunk cluster • PLEASE ASK QUESTIONS.
The Who – Me, MISI, and DreamPort • Bill Cawthra - Cloud Infrastructure Architect • I play with little fluffy clouds all day (AWS, Google Cloud, Azure) • MISI/DreamPort - Support and help develop various cyber security projects through collaboration with .gov, private industry, community, and .edu • DreamPort projects – over 20 projects/AWS environments, usually 30-90 days long (some are notably longer) • https://misi.tech/#about • https://dreamport.tech/about-us.php
The What and the Why - The Splunk Evaluation • We wanted to build a Splunk cluster to evaluate its machine learning capabilities. • The data set was 9 TB of Zeek data • 20 users accessing this data at a time (so fairly light on the frontend) • But very intense work done on the backend (indexers) • Big beefy i3.8xlarge instances… using the instance store for fast I/O (but ephemeral! Therefore we used Splunk SmartStore) • With the help of many people at Splunk (Bryan Pluta, Tyler Muth, Matt Toth, and others), we came up with a design to fit these requirements • We are going to use AWS, Terraform, and Ansible as our tools of choice
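The instance-store point deserves a concrete illustration. Here is a minimal sketch of how the ephemeral NVMe devices on an i3.8xlarge might be striped into one fast SmartStore cache volume; the device names, RAID level, and mount point are assumptions for illustration, not the project's actual bootstrap:

    # Assumed: the four 1.9 TB NVMe instance-store devices on an i3.8xlarge.
    # Stripe them into one ~7.6 TB volume. SmartStore keeps the master copy
    # of every bucket in S3, so losing this ephemeral cache is recoverable.
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    mkfs.xfs /dev/md0
    mount /dev/md0 /opt/splunk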
The How - AWS • Amazon Web Services; provides an on-demand computing platform • "Elastic" resources • Allows us to rapidly scale out and scale down • Very easy to manage many disparate projects • Best datacenter money can buy
The How - Terraform • Our infrastructure configuration tool of choice • This "frames the house"; creating the AWS resources (VPC, security groups, instances, IAM policies, IAM roles, S3 buckets, etc.) • Enforces configuration from the very start (no GUI. No artisanally crafted architecture)
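In practice, "framing the house" is the standard Terraform workflow; a quick sketch (the plan-file name is arbitrary):

    terraform init                     # fetch the AWS provider and any modules
    terraform plan -out=splunk.tfplan  # preview the VPC, SGs, instances, IAM, S3
    terraform apply splunk.tfplan      # build exactly what the plan showed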
The How - Ansible - Drywall, Paint, and Fixtures • Our automation and configuration management tool of choice • Handles configuration of systems • Handles automation tasks (upgrade and reboot of systems… and ingest orchestration!) • Does everything after the "house is framed"
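A hedged sketch of what that looks like from the operator's seat; the playbook and inventory names here are illustrative, the real ones live in the repo linked later:

    ansible-playbook -i inventory.yml configure-splunk.yml  # drywall and paint
    ansible-playbook -i inventory.yml upgrade-splunk.yml    # pull a new image, rolling restart
    ansible-playbook -i inventory.yml ingest.yml            # kick off a batch of Zeek data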
The How - Docker • Easy binary management (example: to upgrade, just docker pull splunk/splunk:<VERSION>) • The docker-splunk project makes it very easy to assign roles and access variables
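Putting those together, an upgrade amounts to pulling a new image and recreating the container. A minimal sketch using the official splunk/splunk image (the container name and role here are illustrative; <VERSION> and the password are placeholders):

    docker pull splunk/splunk:<VERSION>
    docker stop splunk-indexer && docker rm splunk-indexer
    docker run -d --name splunk-indexer \
        -e SPLUNK_START_ARGS=--accept-license \
        -e SPLUNK_ROLE=splunk_indexer \
        -e SPLUNK_PASSWORD='<password>' \
        splunk/splunk:<VERSION>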
The How - Infrastructure Diagram
Before We Go Live • I will be covering things at a high level • I will be skipping many things • Ask questions if you want to see XYZ • Look at the code on your own too! • It’s tricky to balance being concise in a talk with showing the code in detail • Need to avoid turning this into a code review session… • If something looks confusing or wrong, I probably made a mistake.
Before We Go Live - Resources • https://github.com/TheDreamPort/splunk-infrastructure (sanitized version of this project) • Also great references: • https://splunk.github.io/splunk-ansible/ - Splunk Ansible reference • https://splunk.github.io/docker-splunk/ - Splunk Docker
TO THE TERMINAL AND BROWSER
Conclusion • We automate, automate, automate • Which means we configure/deploy everything programmatically • Ingest is automated • Makes it so easy to redo • Break up the automation into logical pieces • It is not fun having a single mega-script
Extra Notes - Splunk Ingest • Ingested the 9 TB of data in batches (basically a month at a time) and waited for completion • Limited disk space on the ingesters • Minimized the impact of mistakes • Had to be very specific about what was ingested; did not want to duplicate data • The ingest process would attempt to detect if a file had already been ingested • Had to verify data was properly ingested (document count of files vs. document count in Splunk)
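A hedged sketch of the batching idea; the paths, month list, and the sinkhole-style batch input are assumptions, and the real orchestration was driven through Ansible:

    #!/usr/bin/env bash
    # Feed one month of Zeek logs at a time into a Splunk batch directory,
    # skipping files already handed off, and wait for each batch to drain.
    set -euo pipefail

    SRC=/data/zeek                   # assumed location of the raw Zeek logs
    BATCH=/opt/splunk/ingest         # assumed batch (sinkhole) input directory
    DONE=/var/lib/ingest/done.list   # record of files already ingested
    mkdir -p "$(dirname "$DONE")" && touch "$DONE"

    for month in 2019-01 2019-02 2019-03; do   # illustrative month list
      find "$SRC/$month" -type f -name '*.log.gz' | while read -r f; do
        grep -qxF "$f" "$DONE" && continue     # skip anything already ingested
        cp "$f" "$BATCH/"
        echo "$f" >> "$DONE"
      done
      # Crude completion check: a sinkhole input deletes files once indexed,
      # so an empty directory means the month's batch is done.
      while [ -n "$(ls -A "$BATCH")" ]; do sleep 60; done
    done

Verification then comes down to comparing per-file record counts against the counts reported by the target Splunk index.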
Extra Notes - Monitoring and Logging • Delicious dashboards using Grafana • Graphs the Prometheus metric data • Can graph Loki events too (logs)
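The same metrics behind those dashboards can be spot-checked straight from the Prometheus HTTP API. A small sketch (the hostname is a placeholder, and the metric assumes node_exporter is running on the hosts):

    # Ask Prometheus for the per-instance CPU idle rate Grafana graphs.
    curl -sG 'http://<monitor-host>:9090/api/v1/query' \
        --data-urlencode 'query=rate(node_cpu_seconds_total{mode="idle"}[5m])'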
Questions? Comments?


Editor's Notes

  • #11
    • Splunk search-head (1): c5d.12xlarge (48 vCPU, 96 GB)
    • Splunk indexers (9): i3.8xlarge (32 vCPU, 244 GB each), 7,600 GB of instance storage
    • Splunk universal-forwarders (4): i3.2xlarge (8 vCPU, 61 GB each), 1,900 GB of instance storage
    • Splunk master-node (1): i3.large (2 vCPU, 15 GB)
    • Splunk monitor (1): i3.large (2 vCPU, 15 GB)
  • #12 If you want to follow along or poke around the code and find the flaws, go here.