Posted on Dec 2, 2022

Display CockroachDB metrics in Splunk Dashboards

CockroachDB ships with a convenient built-in web monitoring interface, the DB Console. In the DB Console you can visualize all important health and operational metrics using the pre-configured dashboards. Currently, the DB Console offers 12 dashboards, covering everything from Hardware and Storage metrics to SQL and Distribution. For many customers, this is a great monitoring solution.

Larger enterprises, however, usually have a separate team that is responsible for monitoring every component of an application, including databases, so that they have a centralized solution from which they can assess the health of the application more holistically. For these cases, the same metrics that power the DB Console dashboards can be forwarded to the enterprise monitoring solution.

Recently, I worked on such an integration with Splunk. The Splunk dashboard files that emulate the DB Console are now available in our repo for everyone's benefit.

In this post, I demonstrate how I used the OpenTelemetry Collector to send metrics from CockroachDB to a Splunk instance, in a way that you can reproduce on your laptop using Docker.

Setup

The architecture is simple: CockroachDB --> OTEL Collector --> Splunk.
CockroachDB generates detailed time series metrics for each node in the cluster. The collector will pull these metrics from each node endpoint and push them to the Splunk HEC endpoint.
Once the metrics are in Splunk, it's just a matter of making sense of them by building the right charts and grouping them into dashboards.

CockroachDB Cluster

As is customary, we use a load balancer to interact with the CockroachDB cluster.
Create the haproxy.cfg file and save it in the current directory.

# file: haproxy.cfg
global
    maxconn 4096

defaults
    mode tcp
    timeout connect 10s
    timeout client 10m
    timeout server 10m
    option clitcpka

listen psql
    bind :26257
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1
    server cockroach1 cockroach1:26257 check port 8080
    server cockroach2 cockroach2:26257 check port 8080
    server cockroach3 cockroach3:26257 check port 8080
    server cockroach4 cockroach4:26257 check port 8080

listen http
    bind :8080
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1
    server cockroach1 cockroach1:8080 check port 8080
    server cockroach2 cockroach2:8080 check port 8080
    server cockroach3 cockroach3:8080 check port 8080
    server cockroach4 cockroach4:8080 check port 8080

Create the docker network and containers

# create the network bridge
docker network create --driver=bridge --subnet=172.28.0.0/16 --ip-range=172.28.0.0/24 --gateway=172.28.0.1 demo-net

# CockroachDB cluster
docker run -d --name=cockroach1 --hostname=cockroach1 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
docker run -d --name=cockroach2 --hostname=cockroach2 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
docker run -d --name=cockroach3 --hostname=cockroach3 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
docker run -d --name=cockroach4 --hostname=cockroach4 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3

# initialize the cluster
docker exec -it cockroach1 ./cockroach init --insecure

# HAProxy load balancer
docker run -d --name haproxy --net demo-net -p 26257:26257 -p 8080:8080 -v `pwd`/haproxy.cfg:/etc/haproxy.cfg:ro haproxy:latest -f /etc/haproxy.cfg

At this point you should be able to open the CockroachDB Console at http://localhost:8080.
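The same port also serves the Prometheus-format metrics page at /_status/vars, which is the endpoint the collector scrapes later in this post. As a quick sanity check, here is a minimal sketch that pulls the page through HAProxy and keeps only one metric. The function names filter_metric and show_metric are my own, not part of any tool, and the sketch assumes port 8080 is published on localhost as in the docker run command above.

```shell
# Keep only the sample lines for one metric from Prometheus text-format
# input on stdin. Comment lines (# HELP / # TYPE) and metrics that merely
# share the prefix are dropped. (Hypothetical helper name.)
filter_metric() {
  # A sample line is the metric name followed by either a label set "{...}"
  # or a space and the value.
  grep -E "^$1(\{| )" || true
}

# Fetch the metrics page through HAProxy and filter it.
# Assumes the haproxy container publishes port 8080 on localhost.
show_metric() {
  curl -s "http://localhost:8080/_status/vars" | filter_metric "$1"
}
```

For example, `show_metric sql_select_count` should print the current SELECT counter of whichever node HAProxy routed the request to.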

Start a workload against the cluster to generate some metrics

# init
docker run --rm --name workload --net demo-net cockroachdb/cockroach:latest workload init tpcc 'postgres://root@haproxy:26257?sslmode=disable' --warehouses 10

# run the workload - you might want to use a separate terminal
docker run --rm --name workload --net demo-net cockroachdb/cockroach:latest workload run tpcc 'postgres://root@haproxy:26257?sslmode=disable' --warehouses 10 --tolerate-errors

With 4 nodes, you can simulate a node failure (just stop the container) and view the range activity (replication, lease transfers, etc.).
You can optionally set up CDC to a Kafka container, configure Row-Level TTL, add more nodes, etc.
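The node failure mentioned above can be scripted. Below is a small sketch that stops one of the containers, waits, and starts it again. The function name bounce_node, the DOCKER override variable, and the 60-second default are my own choices for illustration; setting DOCKER=echo gives a dry run that only prints the commands.

```shell
# Command used to talk to Docker; override with DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Stop a node container, wait, then start it again.
#   $1: container name, e.g. cockroach4
#   $2: seconds to keep the node down (arbitrary default of 60)
bounce_node() {
  node="$1"
  downtime="${2:-60}"
  $DOCKER stop "$node"
  sleep "$downtime"
  $DOCKER start "$node"
}
```

For example, `bounce_node cockroach4 120` keeps the fourth node down for two minutes while you watch the replication charts.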

Splunk

Start a Splunk container.

docker run -d --name splunk --net demo-net -p 8088:8088 -p 8000:8000 -e "SPLUNK_START_ARGS=--accept-license" -e "SPLUNK_PASSWORD=cockroach" splunk/splunk

Create a data input and token for HEC.
In Splunk, click Settings > Data Inputs.
Under Local Inputs, click HTTP Event Collector:
1. Click Global Settings.
2. For All Tokens, click Enabled if this button is not already selected.
3. Uncheck the SSL checkbox.
4. Click Save.
Configure an HEC token for sending data by clicking New Token.
On the Select Source page, for Name, enter a token name, for example "Metrics token".
Leave the other options blank or unselected.
Click Next.
On the Input Settings page, for Source type, click New.
In Source Type, set value to "otel".
For Source Type Category, select Metrics.
Next to Default Index, click Create a new index. In the New Index dialog box:
1. Set Index Name to "metrics_idx".
2. For Index Data Type, click Metrics.
3. Click Save.
Select the newly created "metrics_idx".
Click Review, and then click Submit.
Copy the Token Value that is displayed. This HEC token is required for sending data.
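Before wiring up the collector, you may want to confirm that the token works by posting a single test metric to HEC from your laptop. The sketch below assumes port 8088 is published on localhost and SSL is disabled, as configured above. The function names and the metric name hec_smoke_test are made up for this example. The payload omits index and sourcetype, so the token's defaults apply.

```shell
# Build a single-metric HEC payload on stdout.
#   $1: metric name, $2: numeric value (both illustrative)
hec_metric_payload() {
  printf '{"event":"metric","fields":{"metric_name:%s":%s}}' "$1" "$2"
}

# Post one test metric to the HEC endpoint.
#   $1: the HEC token value copied from Splunk
# Assumes HEC is reachable over plain HTTP on localhost:8088.
send_test_metric() {
  hec_metric_payload "hec_smoke_test" 1 |
    curl -s "http://localhost:8088/services/collector" \
      -H "Authorization: Splunk $1" \
      -d @-
}
```

Run `send_test_metric <your-token>`. If the token and index are set up correctly, Splunk should answer with a short JSON success message, and hec_smoke_test should then show up in the metrics_idx index.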

OpenTelemetry Collector

The Collector comes in two distributions: core and contrib. I used contrib because it includes the splunk_hec exporter.

Pre-compiled Collector binaries are available for download from the releases repo. Docker images are also available.

Create the file config.yaml and save it in the current directory.
Be sure to replace TOKEN with the HEC token you created in the previous step.

# file: config.yaml
---
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'cockroachdb'
          metrics_path: '/_status/vars'
          scrape_interval: 10s
          scheme: 'http'
          tls_config:
            insecure_skip_verify: true
          static_configs:
            - targets:
                - cockroach1:8080
                - cockroach2:8080
                - cockroach3:8080
                - cockroach4:8080
              labels:
                cluster_id: 'cockroachdb'

exporters:
  splunk_hec:
    source: otel
    sourcetype: otel
    index: metrics_idx
    max_connections: 20
    disable_compression: false
    timeout: 10s
    tls:
      insecure_skip_verify: true
    token: TOKEN
    endpoint: "http://splunk:8088/services/collector"

service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      exporters:
        - splunk_hec
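If you prefer not to edit the file by hand, the token placeholder can be filled in with sed. This is only a sketch: render_config is a hypothetical helper name, and it assumes the token contains no characters that are special to sed, which should hold for the GUID-style tokens Splunk generates.

```shell
# Read a collector config on stdin and write it to stdout with the
# "token: TOKEN" placeholder replaced by the real HEC token.
#   $1: the HEC token value copied from Splunk
render_config() {
  sed "s|token: TOKEN|token: $1|"
}
```

For example, `render_config "<your-token>" < config.yaml > config.rendered.yaml` writes a ready-to-use copy that you can mount into the collector container instead of the original file.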

Start the Collector

docker run -d --name otel --net demo-net -v `pwd`/config.yaml:/etc/config.yaml ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib --config=/etc/config.yaml

CockroachDB metrics should now be scraped by the Prometheus receiver and forwarded to Splunk by the Splunk HEC exporter.

Demo

After a few minutes, you can run the queries below in Splunk as a quick test to make sure data is being received correctly.
In Splunk, click on Apps, then Search & Reporting.
Enter the commands below:

# check what metrics we're receiving
| mcatalog values(metric_name) WHERE index="metrics_idx"

# preview the data in its raw format
| mpreview index="metrics_idx"

# execute the query to show the SQL Statements
| mstats rate_sum(sql_select_count) as select, rate_sum(sql_insert_count) as insert, rate_sum(sql_update_count) as update, rate_sum(sql_delete_count) as delete where index="metrics_idx" span=10s

If Splunk shows data, the pipeline is working correctly and you can load the dashboards.

Click on Dashboards, then on Create New Dashboard.
In the pop-up window:
1. In Dashboard Title, set "CockroachDB Overview".
2. Select Classic Dashboard.
3. Click Create.
The Dashboard is now in edit mode. Click on Source.
Replace the current content with the XML in the overview.xml file in the repo.
Click Save.
Repeat for every Dashboard file in the repo directory.

Here are a few screenshots:

CockroachDB Hardware

CockroachDB SQL

Of course, you can create your very own dashboards, too:

The same Collector can be used for integration with many other sources and targets. It also features processors, so you can pre-process the data (say, filtering) before pushing it to your target solution.

DEV Community