Improving Clinical Data Accuracy: How to Streamline a Data Pipeline Using Node.js, AWS and InfluxDB

MONITORING AWS INFRASTRUCTURE
How DevOps reduced monitoring costs while improving functionality by switching to InfluxDB and Grafana
June 7th, 2022
Josh Gitlin, Director of DevOps

PINNACLE 21 OVERVIEW
‣ Flagship software is Pinnacle 21 Enterprise (P21E)
‣ Validates clinical trial data against the CDISC standards
‣ SDTM, SEND, ADaM, etc.
‣ Life sciences customers can be confident that data plays by the rules
‣ Helps ensure submissions are free of errors
‣ Rules-based engine with a web-based UI
‣ Same tool used by the FDA and Japan's PMDA to review the quality of submissions
‣ Clean data pipeline from sponsors to health authorities
‣ Streamlines approvals to bring life-saving medicines to patients faster
‣ Incorporated in 2011 as a privately-owned company
‣ Acquired by Certara, Inc. (Nasdaq: CERT) in 2021

DIRECTOR OF DEVOPS JOSH GITLIN
‣ Principal DevOps Engineer, Pinnacle 21
‣ Senior Systems Development Engineer, amazon.com Website Hosting Operations
‣ CTO, Digital Fruition and sitepalette.com

THE NEED FOR A SOLUTION
Background, Goals and Objectives of InfluxDB Migration Project

THE NEED FOR A SOLUTION
‣ Datadog was expensive
‣ Over $65,000 annually
‣ Priced per server
‣ Friction with the user base
‣ Low adoption among the Engineering team
‣ DevOps found it difficult to create graphs
‣ Lacking critical features
‣ Inability to label the Y axis
‣ Limited Metric Math
‣ No units
‣ Limited visualization options
‣ Not suited to our use case

REPLACEMENT CONSIDERATIONS
‣ Existing solution was extremely easy to implement
‣ Automation would be needed for a replacement
‣ Needed to capture metrics and logs
‣ Data would need to be secured and protected from tampering
‣ Existing solution had 10s granularity
‣ APM Metrics were a goal
‣ Active / Synthetic HTTP Monitoring was mandatory

THE DECISION PROCESS
‣ InfluxDB could be significantly cheaper
‣ Both self-hosted and managed hosting available
‣ Pay for what we use
‣ Telegraf plugins provided more than existing agent
‣ Chef automation solved ease of installation
‣ InfluxDB not well suited to logs
‣ Grafana’s log capabilities were underpowered
‣ Went with a hosted ELK stack
‣ Active HTTP Monitoring would be built in-house

TECHNICAL IMPLEMENTATION
Deploying Telegraf with Chef and building HTTP Monitoring

ARCHITECTURE DIAGRAM
Visualizing InfluxDB Data Flows

TO THE CLOUD!
‣ Started with an InfluxDB cloud account
‣ Extremely easy to set up and start prototyping
‣ Telegraf for Data Collection
‣ Install on a development server
‣ Monitor ALL THE THINGS!
‣ Review Data Usage Dashboard, Fine-tune intervals
‣ Select the metrics we care about

INFLUXDB USAGE DASHBOARD
Vital resource for pay-as-you-go cloud accounts
https://www.influxdata.com/influxdb-templates/influxdb-cloud-usage-dashboard/

SCALE OUT PROTOTYPE
‣ Monitoring Cookbook
‣ Included from Role-based cookbooks
‣ Policyfile-based workflow
‣ Created a telegraf recipe
‣ Telegraf package loaded into Artifactory
‣ Leverage the /etc/telegraf/telegraf.d/ directory
‣ Node attribute for each thing to be monitored
‣ Allows customized configuration of each Telegraf input
‣ Main cookbook writes InfluxDB output, aggregator plugins, etc.

TELEGRAF CONFIG AS CHEF TEMPLATES
Enumerate default plugins
Render template for each plugin
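
The slide shows this step as Chef templates, which are not reproduced in this transcript. As a rough illustration of the same idea (enumerate a map of plugins, render one file per plugin into /etc/telegraf/telegraf.d/), here is a minimal TypeScript sketch. The type names, the `renderAll` helper, the per-plugin file naming and the `influxdb_destination` tag are assumptions made for the example, not taken from the cookbook.

```typescript
// Hypothetical stand-in for the Chef node attributes: one entry per plugin.
type TomlValue = string | number | boolean | string[];

interface PluginConfig {
  options: Record<string, TomlValue>; // plugin settings, e.g. interval
  tags?: Record<string, string>;      // custom tags added to every metric
}

// JSON string escaping is close enough to TOML basic strings for this sketch.
function tomlValue(value: TomlValue): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => JSON.stringify(item)).join(", ")}]`;
  }
  return typeof value === "string" ? JSON.stringify(value) : String(value);
}

// Render one [[inputs.<name>]] section; the tags sub-table goes last.
function renderInput(name: string, config: PluginConfig): string {
  const lines = [`[[inputs.${name}]]`];
  for (const key of Object.keys(config.options).sort()) {
    lines.push(`  ${key} = ${tomlValue(config.options[key])}`);
  }
  const tags = config.tags ?? {};
  const tagKeys = Object.keys(tags).sort();
  if (tagKeys.length > 0) {
    lines.push(`  [inputs.${name}.tags]`);
    for (const key of tagKeys) {
      lines.push(`    ${key} = ${tomlValue(tags[key])}`);
    }
  }
  return lines.join("\n") + "\n";
}

// Enumerate the plugins and produce a path -> file content map.
function renderAll(plugins: Record<string, PluginConfig>): Record<string, string> {
  const files: Record<string, string> = {};
  for (const name of Object.keys(plugins).sort()) {
    files[`/etc/telegraf/telegraf.d/${name}.conf`] = renderInput(name, plugins[name]);
  }
  return files;
}

const rendered = renderAll({
  diskio: { options: { interval: "30s" }, tags: { influxdb_destination: "cloud" } },
  net: { options: { interfaces: ["eth0"] } },
});
console.log(rendered["/etc/telegraf/telegraf.d/diskio.conf"]);
```

Keeping one file per plugin means a role-specific cookbook can add or override a single input without touching the main Telegraf configuration.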

BASE MONITORING SET
‣ diskio
‣ ethtool
‣ interrupts
‣ net
‣ telegraf_internals
‣ systemd_units
‣ nstat
‣ influxdb_listener

SELECTING AN INFLUXDB SERVER
Output uses tagpass and tagexclude
Inputs have custom tags for destination
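
To make the routing idea concrete, the following TypeScript sketch models how an output filtered with tagpass and tagexclude treats a metric: the metric is emitted only if one of its tags matches a tagpass pattern, and the routing tag is then stripped before the write. It is a simplified model of Telegraf's documented filter behaviour, not Telegraf code; the `influxdb_destination` tag name and its values are hypothetical.

```typescript
interface Metric {
  name: string;
  tags: Record<string, string>;
  fields: Record<string, number>;
}

interface OutputFilter {
  tagpass: Record<string, string[]>; // tag key -> glob patterns for the value
  tagexclude: string[];              // glob patterns for tag keys to drop
}

// Convert a simple glob (* and ?) into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*")
    .replace(/\?/g, ".");
  return new RegExp(`^${escaped}$`);
}

// Returns the metric as the output would write it, or null if it is dropped.
function applyOutputFilter(metric: Metric, filter: OutputFilter): Metric | null {
  // tagpass: at least one listed tag key must be present with a matching value.
  const passes = Object.keys(filter.tagpass).some((key) => {
    const value = metric.tags[key];
    return (
      value !== undefined &&
      filter.tagpass[key].some((pattern) => globToRegExp(pattern).test(value))
    );
  });
  if (!passes) return null;

  // tagexclude: remove matching tag keys from the metric that is written.
  const excluded = filter.tagexclude.map(globToRegExp);
  const tags: Record<string, string> = {};
  for (const key of Object.keys(metric.tags)) {
    if (!excluded.some((re) => re.test(key))) tags[key] = metric.tags[key];
  }
  return { name: metric.name, tags, fields: metric.fields };
}

// Hypothetical routing tag set on the input by configuration management.
const sample: Metric = {
  name: "diskio",
  tags: { host: "web1", influxdb_destination: "cloud" },
  fields: { reads: 42 },
};
const cloudOutput: OutputFilter = {
  tagpass: { influxdb_destination: ["cloud"] },
  tagexclude: ["influxdb_destination"],
};
const internalOutput: OutputFilter = {
  tagpass: { influxdb_destination: ["internal"] },
  tagexclude: ["influxdb_destination"],
};
const toCloud = applyOutputFilter(sample, cloudOutput);
const toInternal = applyOutputFilter(sample, internalOutput);
console.log(toCloud, toInternal);
```

Dropping the routing tag on output keeps it from being stored as an extra tag on every series in the destination database.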

ADD KPI METRICS
‣ Create a Telegraf Log Parser for nginx logs
‣ Telegraf tail plugin
‣ In our case the logs were already JSON
‣ Monitor Application KPIs from log files
‣ Use Chef to configure Aspera
‣ Telegraf to ingest Aspera logs using a tail plugin and grok pattern
‣ Procstat plugin to monitor process health
‣ Exec plugin to call monitoring scripts
‣ Scripts hit application API endpoints
‣ Use Ruby InfluxDB gem, write out line protocol
‣ Improved visibility by the Support team
‣ Reduced MTTR on support tickets
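
The monitoring scripts in the talk use the Ruby InfluxDB gem; the scripts themselves are not shown. To illustrate what "write out line protocol" means for a script called by the exec plugin, here is a hand-rolled TypeScript sketch of a line protocol formatter. The `Point` shape and the example measurement, tag and field names are assumptions for illustration only.

```typescript
interface Point {
  measurement: string;
  tags: Record<string, string>;
  fields: Record<string, number | string | boolean>;
  timestampNs?: string; // nanoseconds since the epoch, kept as a string
}

// Measurements escape commas and spaces; keys and tag values also escape "=".
const escapeMeasurement = (s: string): string => s.replace(/[, ]/g, "\\$&");
const escapeKey = (s: string): string => s.replace(/[,= ]/g, "\\$&");
// String field values are double-quoted, with quotes and backslashes escaped.
const escapeString = (s: string): string => s.replace(/["\\]/g, "\\$&");

function formatField(value: number | string | boolean): string {
  if (typeof value === "string") return `"${escapeString(value)}"`;
  if (typeof value === "boolean") return value ? "true" : "false";
  if (!Number.isFinite(value)) throw new Error("NaN and Infinity are not valid field values");
  return String(value); // all numbers are written as floats in this sketch
}

function toLineProtocol(point: Point): string {
  const fieldKeys = Object.keys(point.fields);
  if (fieldKeys.length === 0) throw new Error("a point needs at least one field");
  // Tags are sorted by key, which InfluxDB recommends for write performance.
  const tags = Object.keys(point.tags)
    .sort()
    .map((key) => `,${escapeKey(key)}=${escapeKey(point.tags[key])}`)
    .join("");
  const fields = fieldKeys
    .map((key) => `${escapeKey(key)}=${formatField(point.fields[key])}`)
    .join(",");
  const timestamp = point.timestampNs ? ` ${point.timestampNs}` : "";
  return `${escapeMeasurement(point.measurement)}${tags} ${fields}${timestamp}`;
}

// A script run by the exec plugin would print lines like this to stdout.
const line = toLineProtocol({
  measurement: "api check",
  tags: { site: "a,b", env: "prod" },
  fields: { ok: true, latency_ms: 12.5, msg: 'say "hi"' },
  timestampNs: "1654600000000000000",
});
console.log(line);
```

In practice a client library handles this escaping, which is one reason to prefer a gem or package over string concatenation.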

EXPAND INTO APM METRICS
‣ P21E is a Java Application
‣ inspectIT Ocelot
‣ Added as a Java Agent (library) via CLI switch
‣ Engineering integrated code to publish APM metrics
‣ Agent collects JVM metrics
‣ Publishes to Telegraf listening on local socket as InfluxDB Line Protocol
‣ Telegraf tags with system-wide tags via Chef, publishes to InfluxDB
‣ Dashboard helps Engineering optimize the software
‣ Finds high JVM heap usage
‣ Team can correlate parts of the application with customer usage
‣ Results in possible hours saved from the longest validations
‣ Days' worth of savings on things like COVID-19 vaccines

APM DASHBOARD
Application Usage Metrics
JVM Metrics

MONITORING SAAS WITH INFLUXDB
‣ Requirements
‣ Wanted HTTP status code, timing information, regex search
‣ Telegraf could have done this, but it needed an EC2 instance
‣ CTO suggested developing an AWS Lambda
‣ Small Node.js application
‣ Able to communicate with Cinc Server and fetch list of customer sites
‣ Asynchronous design ideal for making many HTTP requests
‣ Leveraged @influxdata/influxdb-client to publish to InfluxDB
‣ CloudWatch Events to execute every minute from multiple regions
‣ Grafana dashboard and alerts
‣ Alert on non-200 response, slow response time
‣ Dashboard allows us to demonstrate that we’re meeting SLAs
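
The Lambda's source is not part of this deck, so the following is only a sketch of the pattern the slide describes: check many sites concurrently and record status code, timing and a regex match for each. The `Fetcher` type, the `checkSites` function, the option names and the example URLs are all hypothetical. The site list is passed in rather than fetched from Cinc Server, and writing the results with @influxdata/influxdb-client is left out to keep the example dependency-free.

```typescript
// Minimal response shape the checker needs.
interface HttpResponse {
  status: number;
  text(): Promise<string>;
}
// Hypothetical injection point for the HTTP client, so the logic is testable.
type Fetcher = (url: string) => Promise<HttpResponse>;

interface CheckOptions {
  pattern: RegExp;   // expected body content (avoid the "g" flag: test() is stateful)
  slowMs: number;    // responses slower than this count as unhealthy
  timeoutMs: number; // give up on a site after this long
}

interface CheckResult {
  url: string;
  status: number; // 0 when the request failed or timed out
  durationMs: number;
  matched: boolean;
  healthy: boolean;
  error?: string;
}

// Reject if the work does not settle within ms milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function checkSite(url: string, fetcher: Fetcher, opts: CheckOptions): Promise<CheckResult> {
  const started = Date.now();
  try {
    const response = await withTimeout(
      fetcher(url).then(async (res) => ({ status: res.status, body: await res.text() })),
      opts.timeoutMs,
    );
    const durationMs = Date.now() - started;
    const matched = opts.pattern.test(response.body);
    const healthy = response.status === 200 && matched && durationMs <= opts.slowMs;
    return { url, status: response.status, durationMs, matched, healthy };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { url, status: 0, durationMs: Date.now() - started, matched: false, healthy: false, error };
  }
}

// All checks run concurrently; one failing site never rejects the whole batch.
function checkSites(urls: string[], fetcher: Fetcher, opts: CheckOptions): Promise<CheckResult[]> {
  return Promise.all(urls.map((url) => checkSite(url, fetcher, opts)));
}

// Stub fetcher standing in for real HTTP calls in this example.
const stubFetcher: Fetcher = async (url) => {
  if (url.includes("down")) throw new Error("connection refused");
  return { status: url.includes("err") ? 500 : 200, text: async () => "<title>Login</title>" };
};
const pending = checkSites(
  ["https://ok.example", "https://err.example", "https://down.example"],
  stubFetcher,
  { pattern: /Login/, slowMs: 1000, timeoutMs: 1000 },
);
pending.then((results) => console.log(results));
```

On a Node.js 18 or later runtime the global `fetch` has a compatible shape and could be passed as the fetcher. Each `CheckResult` maps naturally onto one InfluxDB point, with the URL as a tag and the status, duration and match flag as fields.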

HTTP MONITORING DASHBOARD
Maps and Tables…
…Heatmaps and Graphs!

TIPS AND TRICKS
‣ Evaluate your needs
‣ Utilize customized Telegraf intervals
‣ Send data to multiple InfluxDB destinations
‣ Watch your usage!
‣ Use the Usage Dashboard
‣ Add slowly
‣ Don’t use a “file” input when you want a “tail” input!
‣ Add the StatusPage integration to Slack/Teams/RSS

NET RESULTS
‣ Saved the business over $40,000 annually
‣ Better control over spend
‣ Improved Developer / Operations Experience
‣ Greater engagement with dashboards
‣ Engineering team is optimizing the software more
‣ Customer Success is better able to troubleshoot issues
‣ Reduced MTTR of issues and better optimized software
‣ P21 customers get better, faster software
‣ Clinical data is processed more efficiently because the platform is optimized
‣ Ultimate goal: Patients get treatments faster and more efficiently

WHAT’S NEXT
‣ Release HTTP Monitoring as Open Source
‣ Expand InfluxDB / Telegraf usage within Certara
‣ Leverage Telegraf for Windows
‣ APM for all the products
‣ Possibly integrate into products
‣ Increase Flux usage
‣ Incredibly powerful
‣ We’re barely scratching the surface
