MONITORING AWS INFRASTRUCTURE How DevOps reduced monitoring costs while improving functionality by switching to InfluxDB and Grafana June 7th, 2022 Josh Gitlin – Director of DevOps
PINNACLE 21 OVERVIEW
‣ Flagship software is Pinnacle 21 Enterprise (P21E)
‣ Validates clinical trial data against the CDISC standards
‣ SDTM, SEND, ADaM, etc.
‣ Life sciences customers can be confident that data plays by the rules
‣ Helps ensure submissions are free of errors
‣ Rules-based engine with a web-based UI
‣ Same tool used by the FDA and Japan's PMDA to review the quality of submissions
‣ Clean data pipeline from sponsors to health authorities
‣ Streamlines approvals to bring life-saving medicines to patients faster
‣ Incorporated in 2011 as a privately-owned company
‣ Acquired by Certara, Inc. (Nasdaq: CERT) in 2021
DIRECTOR OF DEVOPS JOSH GITLIN
‣ Principal DevOps Engineer, Pinnacle 21
‣ Senior Systems Development Engineer, amazon.com Website Hosting Operations
‣ CTO, Digital Fruition and sitepalette.com
THE NEED FOR A SOLUTION Background, Goals and Objectives of InfluxDB Migration Project
THE NEED FOR A SOLUTION
‣ Datadog was expensive
‣ Over $65,000 annually
‣ Priced per server
‣ Friction with the user base
‣ Low adoption among the Engineering team
‣ DevOps found it difficult to create graphs
‣ Lacking critical features
‣ Inability to label the Y axis
‣ Limited Metric Math
‣ No units
‣ Limited visualization options
‣ Not suited to our use case
REPLACEMENT CONSIDERATIONS
‣ Existing solution was extremely easy to implement
‣ Automation would be needed for a replacement
‣ Needed to capture metrics and logs
‣ Data would need to be secured and protected from tampering
‣ Existing solution had 10s granularity
‣ APM Metrics were a goal
‣ Active / Synthetic HTTP Monitoring was mandatory
THE DECISION PROCESS
‣ InfluxDB could be significantly cheaper
‣ Both self-hosted and managed hosting available
‣ Pay for what we use
‣ Telegraf plugins provided more than existing agent
‣ Chef automation solved ease of installation
‣ InfluxDB not well suited to logs
‣ Grafana’s log capabilities were underpowered
‣ Went with a hosted ELK stack
‣ Active HTTP Monitoring would be built in-house
TECHNICAL IMPLEMENTATION Deploying Telegraf with Chef and building HTTP Monitoring
ARCHITECTURE DIAGRAM Visualizing InfluxDB Data Flows
TO THE CLOUD!
‣ Started with an InfluxDB cloud account
‣ Extremely easy to set up and start prototyping
‣ Telegraf for Data Collection
‣ Install on a development server
‣ Monitor ALL THE THINGS!
‣ Review Data Usage Dashboard, Fine-tune intervals
‣ Select the metrics we care about
INFLUXDB USAGE DASHBOARD Vital resource for pay-as-you-go cloud accounts https://www.influxdata.com/influxdb-templates/influxdb-cloud-usage-dashboard/
SCALE OUT PROTOTYPE
‣ Monitoring Cookbook
‣ Included from Role-based cookbooks
‣ Policyfile-based workflow
‣ Created a telegraf recipe
‣ Telegraf package loaded into Artifactory
‣ Leverage the /etc/telegraf/telegraf.d/ directory
‣ Node attribute for each thing to be monitored
‣ Allows customized configuration of each Telegraf input
‣ Main cookbook writes InfluxDB output, aggregator plugins, etc.
TELEGRAF CONFIG AS CHEF TEMPLATES Enumerate default plugins Render template for each plugin
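
The "enumerate plugins, render a template for each" idea can be sketched outside of Chef. The talk's cookbook used Chef templates and node attributes; the Python below is only a stand-in for that mechanism, and the `monitored` dictionary, the function names, and the option values are illustrative assumptions rather than the actual cookbook data. It produces one TOML fragment per input plugin, keyed by the path it would occupy under /etc/telegraf/telegraf.d/.

```python
def toml_value(value) -> str:
    """Format a Python value as a TOML literal (strings, bools, numbers, flat lists)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, list):
        return "[" + ", ".join(toml_value(v) for v in value) + "]"
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_input(plugin: str, options: dict) -> str:
    """Render one [[inputs.<plugin>]] block with its options in sorted order."""
    lines = [f"[[inputs.{plugin}]]"]
    for key in sorted(options):
        lines.append(f"  {key} = {toml_value(options[key])}")
    return "\n".join(lines) + "\n"


def render_all(monitored: dict) -> dict:
    """Map each plugin to the drop-in file it would be written to."""
    return {
        f"/etc/telegraf/telegraf.d/{name}.conf": render_input(name, options)
        for name, options in monitored.items()
    }


# Hypothetical stand-in for per-node attributes: one entry per thing to monitor.
monitored = {
    "diskio": {},
    "net": {"interfaces": ["eth0"], "interval": "30s"},
}

files = render_all(monitored)
for path, content in files.items():
    print(f"# {path}\n{content}")
```

Keeping one file per plugin means a role-specific cookbook can add or remove a single input without touching the main configuration.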
BASE MONITORING SET
‣ diskio
‣ ethtool
‣ interrupts
‣ net
‣ telegraf_internals
‣ systemd_units
‣ nstat
‣ influxdb_listener
SELECTING AN INFLUXDB SERVER Output uses tagpass and tagexclude Inputs have custom tags for destination
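
How tagpass and tagexclude combine to select a destination can be modeled in a few lines. This is a simplified model of Telegraf's metric filtering, not Telegraf code: the tag name `influx_destination` and the output names are hypothetical, and Python's `fnmatch` globbing only approximates Telegraf's glob matching. Each output accepts a metric only if a routing tag matches its tagpass patterns, then drops the routing tag before writing.

```python
from fnmatch import fnmatchcase


def passes_tagpass(tags: dict, tagpass: dict) -> bool:
    """A metric passes if any tagpass key is present and its value matches a pattern."""
    return any(
        key in tags and any(fnmatchcase(tags[key], pattern) for pattern in patterns)
        for key, patterns in tagpass.items()
    )


def strip_tags(tags: dict, tagexclude: list) -> dict:
    """Remove tags whose key matches any tagexclude pattern."""
    return {
        key: value
        for key, value in tags.items()
        if not any(fnmatchcase(key, pattern) for pattern in tagexclude)
    }


def route(metric_tags: dict, outputs: dict) -> dict:
    """Return, per accepting output, the tag set that output would write."""
    delivered = {}
    for name, config in outputs.items():
        if passes_tagpass(metric_tags, config["tagpass"]):
            delivered[name] = strip_tags(metric_tags, config["tagexclude"])
    return delivered


# Hypothetical outputs: one cloud account and one internal server.
outputs = {
    "cloud": {
        "tagpass": {"influx_destination": ["cloud"]},
        "tagexclude": ["influx_destination"],
    },
    "internal": {
        "tagpass": {"influx_destination": ["internal"]},
        "tagexclude": ["influx_destination"],
    },
}

# An input tagged for the cloud destination reaches only that output.
result = route({"host": "web01", "influx_destination": "cloud"}, outputs)
print(result)
```

Dropping the routing tag at the output keeps it from being stored as an extra tag on every series.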
ADD KPI METRICS
‣ Create a Telegraf Log Parser for nginx logs
‣ Telegraf tail plugin
‣ In our case the logs were already JSON
‣ Monitor Application KPIs from log files
‣ Use Chef to configure Aspera
‣ Telegraf to ingest Aspera logs using a tail plugin and grok pattern
‣ Procstat plugin to monitor process health
‣ Exec plugin to call monitoring scripts
‣ Scripts hit application API endpoints
‣ Use Ruby InfluxDB gem, write out line protocol
‣ Improved visibility by the Support team
‣ Reduced MTTR on support tickets
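
A monitoring script called by the exec plugin only has to print InfluxDB line protocol on standard output. The talk's scripts used the Ruby InfluxDB gem; the sketch below is a Python stand-in that formats a line by hand, and the measurement, tag, and field names are invented for illustration. It follows the line protocol escaping rules for measurements, tags, and string fields, and marks integers with the `i` suffix.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each of the given special characters."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value) -> str:
    """Format a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns=None) -> str:
    """Build one line-protocol record; tags are sorted by key."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_field_value(value)}" for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical KPI: values a script might read from an application API endpoint.
line = to_line(
    "p21_validations",
    {"env": "prod", "customer": "acme corp"},
    {"queued": 3, "avg_seconds": 12.5},
    1654560000000000000,
)
print(line)
```

With an exec input configured for the `influx` data format, Telegraf parses each printed line and adds its own global tags before forwarding the metric.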
EXPAND INTO APM METRICS
‣ P21E is a Java Application
‣ inspectIT Ocelot
‣ Added as a Java Agent (library) via CLI switch
‣ Engineering integrated code to publish APM metrics
‣ Agent collects JVM metrics
‣ Publishes to Telegraf listening on local socket as InfluxDB Line Protocol
‣ Telegraf tags with system-wide tags via Chef, publishes to InfluxDB
‣ Dashboard helps Engineering optimize the software
‣ Finds high JVM heap usage
‣ Team can correlate parts of the application with customer usage
‣ Results in possible hours saved from the longest validations
‣ Days' worth of savings on things like COVID-19 vaccines
APM DASHBOARD Application Usage Metrics JVM Metrics
ACTIVE HTTP MONITORING
MONITORING SAAS WITH INFLUXDB
‣ Requirements
‣ Wanted HTTP status code, timing information, regex search
‣ Telegraf could have done this, but it needed an EC2 instance
‣ CTO suggested developing an AWS Lambda
‣ Small Node.js application
‣ Able to communicate with Cinc Server and fetch list of customer sites
‣ Asynchronous design ideal for making many HTTP requests
‣ Leveraged @influxdata/influxdb-client to publish to InfluxDB
‣ CloudWatch Events to execute every minute from multiple regions
‣ Grafana dashboard and alerts
‣ Alert on non-200 response, slow response time
‣ Dashboard allows us to demonstrate that we’re meeting SLAs
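
The check described above (status code, timing, regex search, many sites at once) can be sketched as follows. The talk's implementation was a Node.js Lambda; this is a Python analogue built on asyncio, not the team's code. The `fetch` callable, the `fake_fetch` stub, and the example URLs are assumptions so the sketch runs without network access; a real version would plug in an HTTP client and publish the results to InfluxDB.

```python
import asyncio
import re
import time


async def check_site(url: str, pattern: str, fetch) -> dict:
    """Fetch one URL and record status, regex match, and elapsed time."""
    started = time.perf_counter()
    try:
        status, body = await fetch(url)
        matched = re.search(pattern, body) is not None
    except Exception:
        # Treat connection errors and timeouts as a failed check.
        status, matched = 0, False
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    return {
        "url": url,
        "status": status,
        "matched": matched,
        "response_ms": elapsed_ms,
        "healthy": status == 200 and matched,
    }


async def check_all(sites: dict, fetch) -> list:
    """Run every check concurrently; results come back in input order."""
    return await asyncio.gather(
        *(check_site(url, pattern, fetch) for url, pattern in sites.items())
    )


async def fake_fetch(url: str):
    """Stub HTTP client for the example: returns (status, body)."""
    await asyncio.sleep(0)
    if "down" in url:
        return 503, "Service Unavailable"
    return 200, "<title>Pinnacle 21 Enterprise</title>"


# Hypothetical site list; the talk fetched the real list from Cinc Server.
sites = {
    "https://ok.example.com/": r"Pinnacle 21",
    "https://down.example.com/": r"Pinnacle 21",
}

results = asyncio.run(check_all(sites, fake_fetch))
for result in results:
    print(result["url"], result["status"], result["healthy"])
```

Because the checks spend nearly all their time waiting on the network, running them concurrently keeps a one-minute schedule feasible even with a long customer list.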
HTTP MONITORING DASHBOARD Maps and Tables… …Heatmaps and Graphs!
TIPS AND TRICKS
‣ Evaluate your needs
‣ Utilize customized Telegraf intervals
‣ Send data to multiple InfluxDB destinations
‣ Watch your usage!
‣ Use the Usage Dashboard
‣ Add slowly
‣ Don’t use a “file” input when you want a tail input!
‣ Add the StatusPage integration to Slack/Teams/RSS
CONCLUSIONS
NET RESULTS
‣ Saved the business over $40,000 annually
‣ Better control over spend
‣ Improved Developer / Operations Experience
‣ Greater engagement with dashboards
‣ Engineering team is optimizing the software more
‣ Customer Success is better able to troubleshoot issues
‣ Reduced MTTR of issues and better optimized software
‣ P21 customers get better, faster software
‣ Clinical data is processed more efficiently because platform is optimized
‣ Ultimate goal: Patients get treatments faster and more efficiently
WHAT’S NEXT
‣ Release HTTP Monitoring as Open Source
‣ Expand InfluxDB / Telegraf usage within Certara
‣ Leverage Telegraf for Windows
‣ APM for all the products
‣ Possibly integrate into products
‣ Increase Flux usage
‣ Incredibly powerful
‣ We’re barely scratching the surface
QUESTIONS?
THANK YOU ;)
KEEP IN TOUCH! JOSH GITLIN

Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Node.js, AWS and InfluxDB


Editor's Notes

  • #3 First, a brief overview of Pinnacle 21: a software company specializing in life sciences solutions. The flagship software is Pinnacle 21 Enterprise, used by major life sciences and pharmaceutical companies to validate data against data standards. CDISC (Clinical Data Interchange Standards Consortium) publishes various standards that define how data for a clinical trial of a treatment should be organized and submitted to an agency. P21 is a web SaaS rules-based engine that checks data against those rules. What does that mean? Spell check for your data.
  • #4 A little bit about me. Joined Pinnacle 21 at the start of 2020 as Principal DevOps Engineer. Prior experience was not in life sciences. Love for monitoring comes from working at scale and running my own infrastructure. Moved to Director post-acquisition.
  • #6 Pinnacle 21 was using Datadog: expensive, limited functionality, low adoption. I was dissatisfied compared to Grafana.
  • #7 Datadog had its benefits: very easy to implement; log ingestion; offsite storage, protected from alteration (required by auditors); high granularity; HTTP monitoring.
  • #8 I knew I wanted Grafana. I had worked with InfluxDB + Grafana before. InfluxDB Cloud, or self-hosted, would be much cheaper. Telegraf is great. Chef can solve the ease-of-installation requirement. Still needed a log solution and HTTP monitoring.
  • #13 Once I had a PoC, I decided to create an MVP using automation: add to the existing monitoring cookbook.