A production-ready, high-performance metrics collector service written in Go that collects system and application metrics and ships them to remote endpoints with enterprise-grade security.
🚀 Features: System metrics (CPU, Memory, Disk, Network) • GPU monitoring (NVIDIA) • Application endpoint scraping • TLS/mTLS support • Prometheus & HTTP JSON shipping • Docker & Kubernetes ready
- Features
- Quick Start
- Architecture
- Installation
- Configuration
- Usage
- Shipper Types
- TLS Configuration
- Collected Metrics
- Security Considerations
- Deployment
- Performance Tuning
- Development
- Troubleshooting
- FAQ
- Contributing
- License
Get metricsd up and running in 5 minutes:
```bash
# Clone and build
git clone https://github.com/0x524A/metricsd.git
cd metricsd
go build -o bin/metricsd cmd/metricsd/main.go

# Create configuration
cp config.example.json config.json
# Edit config.json to set your endpoint
# For example, change endpoint to your Prometheus or metrics collector URL

# Run the service
./bin/metricsd -config config.json

# Check health
curl http://localhost:8080/health
```

With TLS:
```bash
# Generate self-signed certificates (for testing)
mkdir -p certs && cd certs
openssl req -x509 -newkey rsa:4096 -keyout client.key -out client.crt -days 365 -nodes \
  -subj "/CN=metricsd-client"
cd ..

# Update config.json to enable TLS
# Set shipper.tls.enabled to true
# Set certificate paths in the shipper.tls section

# Run with TLS
./bin/metricsd -config config.json
```

With Docker:
```bash
docker build -t metricsd:latest .
docker run -d -p 8080:8080 -v $(pwd)/config.json:/etc/metricsd/config.json:ro metricsd:latest
```
- Comprehensive Metrics Collection
  - CPU usage (per-core and total utilization)
  - Memory usage (RAM and swap statistics)
  - Disk I/O and usage statistics
  - Network I/O statistics
  - GPU metrics via NVIDIA NVML (optional)
  - Custom application endpoint scraping
- Application Metrics Collection
  - HTTP endpoint scraping for application metrics
  - Support for multiple application endpoints
  - JSON-based metrics format
  - Configurable timeout and retry logic
- Flexible Shipping Options
  - Prometheus Remote Write protocol with Snappy compression
  - HTTP JSON POST
  - Advanced TLS/SSL support for secure transmission
  - Configurable request timeouts
- Enterprise-Grade Security
  - Full TLS 1.2/1.3 support with custom configuration
  - Client certificate authentication (mTLS)
  - Custom CA certificate support
  - Configurable cipher suites
  - SNI (Server Name Indication) support
  - TLS version pinning (min/max)
  - Session ticket management
  - Optional certificate verification bypass for testing
- Configurable & Extensible
  - JSON configuration with environment variable overrides
  - Adjustable collection intervals
  - Enable/disable specific metric collectors
  - Health endpoint for monitoring
  - Flexible shipper interface for custom backends
- Production-Ready
  - Structured logging with zerolog
  - Graceful shutdown with cleanup (a sketch follows this list)
  - Error handling and resilience
  - SOLID design principles
  - Resource cleanup and leak prevention
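The graceful shutdown mentioned above follows a common Go pattern: trap SIGINT/SIGTERM, stop the collection loop, and drain the health server before exiting. The sketch below is illustrative only and is not taken from the metricsd codebase; the address and timeout are assumptions.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Stop on SIGINT/SIGTERM so in-flight collection and shipping can finish.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: ":8080"} // placeholder for the health server
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-ctx.Done() // wait for a shutdown signal

	// Give the server a bounded window to drain open connections.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```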
The service follows SOLID principles with a clean architecture:
```
metricsd/
├── cmd/
│   └── metricsd/              # Application entry point
│       └── main.go
├── internal/
│   ├── collector/             # Metric collectors (System, GPU, HTTP)
│   │   ├── collector.go       # Collector interface and registry
│   │   ├── system.go          # OS metrics collector
│   │   ├── gpu.go             # GPU metrics collector
│   │   └── http.go            # HTTP endpoint scraper
│   ├── config/                # Configuration management
│   │   └── config.go
│   ├── shipper/               # Metrics shipping
│   │   ├── shipper.go         # Shipper interface
│   │   ├── prometheus.go      # Prometheus remote write
│   │   └── http_json.go       # HTTP JSON shipper
│   ├── orchestrator/          # Collection orchestration
│   │   └── orchestrator.go
│   └── server/                # HTTP server for health checks
│       └── server.go
├── config.example.json        # Example configuration
├── go.mod
├── go.sum
└── README.md
```

- Go 1.24 or later
- NVIDIA drivers and CUDA (optional, for GPU metrics)
```bash
# Clone the repository
git clone https://github.com/0x524A/metricsd.git
cd metricsd

# Download dependencies
go mod download

# Build the binary
go build -o bin/metricsd cmd/metricsd/main.go
```

Create a config.json file based on the example:
```bash
cp config.example.json config.json
```

```json
{
  "server": {
    "host": "0.0.0.0",
    "port": 8080
  },
  "collector": {
    "interval_seconds": 60,
    "enable_cpu": true,
    "enable_memory": true,
    "enable_disk": true,
    "enable_network": true,
    "enable_gpu": false
  },
  "shipper": {
    "type": "http_json",
    "endpoint": "https://collector.example.com:9090/api/v1/metrics",
    "timeout": 30000000000,
    "tls": {
      "enabled": true,
      "cert_file": "/path/to/client-cert.pem",
      "key_file": "/path/to/client-key.pem",
      "ca_file": "/path/to/ca.pem",
      "insecure_skip_verify": false,
      "server_name": "collector.example.com",
      "min_version": "TLS1.2",
      "max_version": "TLS1.3",
      "cipher_suites": [
        "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
        "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
      ],
      "session_tickets": true
    }
  },
  "endpoints": [
    {
      "name": "app1",
      "url": "http://localhost:3000/metrics"
    }
  ]
}
```

| Field | Description | Default |
|---|---|---|
server.host | HTTP server bind address | 0.0.0.0 |
server.port | HTTP server port | 8080 |
collector.interval_seconds | Collection interval in seconds | 60 |
collector.enable_cpu | Enable CPU metrics collection | true |
collector.enable_memory | Enable memory metrics collection | true |
collector.enable_disk | Enable disk metrics collection | true |
collector.enable_network | Enable network metrics collection | true |
collector.enable_gpu | Enable GPU metrics collection (requires NVIDIA GPU) | false |
shipper.type | Shipper type: prometheus_remote_write or http_json | - |
shipper.endpoint | Remote endpoint URL | - |
shipper.timeout | Request timeout in nanoseconds (see the note after this table) | 30000000000 (30s) |
shipper.tls.enabled | Enable TLS/SSL | false |
shipper.tls.cert_file | Path to client certificate file (PEM) | - |
shipper.tls.key_file | Path to client private key file (PEM) | - |
shipper.tls.ca_file | Path to CA certificate file for server verification | - |
shipper.tls.insecure_skip_verify | Skip server certificate verification (not recommended) | false |
shipper.tls.server_name | Server name for SNI (overrides hostname from endpoint) | - |
shipper.tls.min_version | Minimum TLS version: TLS1.0, TLS1.1, TLS1.2, TLS1.3 | TLS1.2 |
shipper.tls.max_version | Maximum TLS version: TLS1.0, TLS1.1, TLS1.2, TLS1.3 | TLS1.3 |
shipper.tls.cipher_suites | Array of allowed cipher suites (see Cipher Suites section) | System defaults |
shipper.tls.session_tickets | Enable TLS session ticket resumption | true |
endpoints | Array of application HTTP endpoints to scrape | [] |
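A note on `shipper.timeout`: the value is a plain number of nanoseconds, which is consistent with the field being decoded into a Go `time.Duration` (an `int64` counted in nanoseconds). That mapping is an assumption about the implementation; the minimal sketch below only shows why `30000000000` reads back as 30 seconds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Hypothetical shape of the shipper block; field names mirror the table above.
type shipperConfig struct {
	Endpoint string        `json:"endpoint"`
	Timeout  time.Duration `json:"timeout"` // JSON numbers land here as nanoseconds
}

func main() {
	raw := []byte(`{"endpoint": "https://collector.example.com/api/v1/metrics", "timeout": 30000000000}`)
	var cfg shipperConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.Timeout) // prints "30s"
}
```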
You can override configuration values using environment variables:
| Environment Variable | Description | Example |
|---|---|---|
MC_SERVER_HOST | Server bind address | 0.0.0.0 |
MC_SERVER_PORT | Server port number | 8080 |
MC_COLLECTOR_INTERVAL | Collection interval in seconds | 60 |
MC_SHIPPER_TYPE | Shipper type | prometheus_remote_write |
MC_SHIPPER_ENDPOINT | Shipper endpoint URL | https://metrics.example.com/write |
MC_TLS_ENABLED | Enable TLS | true |
MC_TLS_CERT_FILE | Client certificate file path | /etc/metricsd/certs/client.crt |
MC_TLS_KEY_FILE | Client private key file path | /etc/metricsd/certs/client.key |
MC_TLS_CA_FILE | CA certificate file path | /etc/metricsd/certs/ca.crt |
MC_TLS_SERVER_NAME | SNI server name | collector.example.com |
MC_TLS_MIN_VERSION | Minimum TLS version | TLS1.2 |
MC_TLS_INSECURE_SKIP_VERIFY | Skip certificate verification | false |
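Overrides can be supplied inline when starting the service, for example `MC_SERVER_PORT=9090 MC_LOG_LEVEL=debug ./bin/metricsd -config config.json`. The snippet below is only a sketch of how environment overrides are typically layered on top of file-based settings; the real logic lives in `internal/config`, and the fields handled here are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Illustrative only: apply MC_* overrides after loading the config file.
func overrideFromEnv(port int, endpoint string) (int, string) {
	if v, ok := os.LookupEnv("MC_SERVER_PORT"); ok {
		if p, err := strconv.Atoi(v); err == nil {
			port = p
		}
	}
	if v, ok := os.LookupEnv("MC_SHIPPER_ENDPOINT"); ok {
		endpoint = v
	}
	return port, endpoint
}

func main() {
	port, endpoint := overrideFromEnv(8080, "https://collector.example.com/api/v1/metrics")
	fmt.Println(port, endpoint)
}
```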
```bash
# Run with default config.json
./bin/metricsd

# Run with custom config file
./bin/metricsd -config /path/to/config.json

# Set log level
./bin/metricsd -log-level debug
```

Log levels:

- `debug` - Detailed debugging information
- `info` - General informational messages (default)
- `warn` - Warning messages
- `error` - Error messages only
The service exposes a health endpoint:
```bash
curl http://localhost:8080/health
```

Response:
{ "status": "healthy", "timestamp": "2025-11-05T12:34:56Z", "uptime": "1h23m45s" }Ships metrics using the Prometheus remote write protocol with Snappy compression.
{ "shipper": { "type": "prometheus_remote_write", "endpoint": "http://prometheus:9090/api/v1/write" } }Ships metrics as JSON via HTTP POST.
{ "shipper": { "type": "http_json", "endpoint": "http://collector:8080/api/v1/metrics" } }Payload format:
{ "timestamp": 1699185296, "metrics": [ { "name": "system_cpu_usage_percent", "value": 45.2, "type": "gauge", "labels": { "core": "0" } } ] }The service supports advanced TLS configuration for secure communication with remote endpoints. This includes mutual TLS (mTLS), custom cipher suites, and version pinning.
The service supports advanced TLS configuration for secure communication with remote endpoints, including mutual TLS (mTLS), custom cipher suites, and version pinning.

For simple TLS with server certificate verification:
{ "shipper": { "type": "prometheus_remote_write", "endpoint": "https://metrics.example.com/api/v1/write", "tls": { "enabled": true, "ca_file": "/etc/metricsd/certs/ca.pem" } } }For client certificate authentication:
{ "shipper": { "type": "http_json", "endpoint": "https://secure-collector.example.com/metrics", "tls": { "enabled": true, "cert_file": "/etc/metricsd/certs/client.crt", "key_file": "/etc/metricsd/certs/client.key", "ca_file": "/etc/metricsd/certs/ca.crt", "server_name": "secure-collector.example.com" } } }Full control over TLS parameters:
{ "shipper": { "tls": { "enabled": true, "cert_file": "/etc/metricsd/certs/client.crt", "key_file": "/etc/metricsd/certs/client.key", "ca_file": "/etc/metricsd/certs/ca.crt", "server_name": "metrics.internal.example.com", "min_version": "TLS1.2", "max_version": "TLS1.3", "cipher_suites": [ "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" ], "session_tickets": true, "insecure_skip_verify": false } } }| Option | Description | Values |
|---|---|---|
enabled | Enable/disable TLS | true, false |
cert_file | Client certificate for mTLS | Path to PEM file |
key_file | Client private key for mTLS | Path to PEM file |
ca_file | CA certificate for server verification | Path to PEM file |
server_name | SNI hostname override | Domain name |
min_version | Minimum TLS version | TLS1.0, TLS1.1, TLS1.2, TLS1.3 |
max_version | Maximum TLS version | TLS1.0, TLS1.1, TLS1.2, TLS1.3 |
cipher_suites | Allowed cipher suites | Array of suite names |
session_tickets | Enable session resumption | true, false |
insecure_skip_verify | Skip certificate verification | true, false (not recommended for production) |
TLS 1.3 Cipher Suites:
- `TLS_AES_128_GCM_SHA256`
- `TLS_AES_256_GCM_SHA384`
- `TLS_CHACHA20_POLY1305_SHA256`
TLS 1.2 Cipher Suites (Recommended):
- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256`
- `TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256`
Additional TLS 1.2 Cipher Suites:
- `TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA`
- `TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA`
- `TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA`
- `TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA`
- `TLS_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_RSA_WITH_AES_256_GCM_SHA384`
- `TLS_RSA_WITH_AES_128_CBC_SHA256`
- `TLS_RSA_WITH_AES_128_CBC_SHA`
- `TLS_RSA_WITH_AES_256_CBC_SHA`
Note: If cipher suites are not specified, Go's default secure cipher suite list will be used. TLS 1.3 cipher suites cannot be configured in Go and use the protocol's default settings.
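The accepted cipher suite strings are expected to match Go's own suite names. You can list the names and IDs your Go toolchain considers secure with the standard `crypto/tls` package; how metricsd maps these strings to suite IDs internally is not shown here.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// Prints the cipher suites Go currently treats as secure, one per line.
func main() {
	for _, cs := range tls.CipherSuites() {
		fmt.Printf("%s (0x%04x)\n", cs.Name, cs.ID)
	}
	// tls.InsecureCipherSuites() lists the legacy suites (CBC, RSA key exchange).
}
```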
- Use TLS 1.2 or higher - Set `min_version` to `TLS1.2` at minimum
- Enable mTLS - Use client certificates for mutual authentication
- Verify certificates - Keep `insecure_skip_verify` as `false` in production
- Use strong cipher suites - Prefer ECDHE and AEAD ciphers
- Configure SNI - Set `server_name` when using name-based virtual hosting
- Rotate certificates - Implement a certificate rotation strategy
- Secure key storage - Protect private keys with appropriate file permissions
Generate self-signed CA:
```bash
openssl req -x509 -new -nodes -keyout ca.key -sha256 -days 1825 -out ca.crt \
  -subj "/C=US/ST=State/L=City/O=Organization/CN=CA"
```

Generate client certificate:
```bash
# Generate private key
openssl genrsa -out client.key 2048

# Generate certificate signing request
openssl req -new -key client.key -out client.csr \
  -subj "/C=US/ST=State/L=City/O=Organization/CN=metricsd-client"

# Sign with CA
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out client.crt -days 825 -sha256
```

Set secure file permissions:
```bash
chmod 600 /etc/metricsd/certs/*.key
chmod 644 /etc/metricsd/certs/*.crt
chown metricsd:metricsd /etc/metricsd/certs/*
```

Certificate verification failed:
- Ensure CA certificate includes the full chain
- Verify `server_name` matches the certificate CN or SAN
- Check certificate expiration dates
Handshake failure:
- Verify cipher suites are compatible with server
- Check TLS version compatibility (min/max versions)
- Ensure client certificate is valid and trusted by server
Enable debug logging:
```bash
./bin/metricsd -log-level debug
```

CPU:
- `system_cpu_usage_percent` - Per-core CPU usage
- `system_cpu_usage_total_percent` - Overall CPU usage
- `system_cpu_count` - Number of CPU cores
Memory:
- `system_memory_total_bytes` - Total memory
- `system_memory_used_bytes` - Used memory
- `system_memory_available_bytes` - Available memory
- `system_memory_usage_percent` - Memory usage percentage
- `system_swap_total_bytes` - Total swap space
- `system_swap_used_bytes` - Used swap space
- `system_swap_usage_percent` - Swap usage percentage
Disk:
- `system_disk_total_bytes` - Total disk space
- `system_disk_used_bytes` - Used disk space
- `system_disk_free_bytes` - Free disk space
- `system_disk_usage_percent` - Disk usage percentage
- `system_disk_read_bytes_total` - Total bytes read
- `system_disk_write_bytes_total` - Total bytes written
- `system_disk_read_count_total` - Total read operations
- `system_disk_write_count_total` - Total write operations
Network:
- `system_network_bytes_sent_total` - Total bytes sent
- `system_network_bytes_recv_total` - Total bytes received
- `system_network_packets_sent_total` - Total packets sent
- `system_network_packets_recv_total` - Total packets received
- `system_network_errors_in_total` - Total input errors
- `system_network_errors_out_total` - Total output errors
- `system_network_drop_in_total` - Total input drops
- `system_network_drop_out_total` - Total output drops
GPU (NVIDIA):
- `system_gpu_count` - Number of GPUs
- `system_gpu_utilization_percent` - GPU utilization
- `system_gpu_memory_utilization_percent` - GPU memory utilization
- `system_gpu_memory_total_bytes` - Total GPU memory
- `system_gpu_memory_used_bytes` - Used GPU memory
- `system_gpu_memory_free_bytes` - Free GPU memory
- `system_gpu_temperature_celsius` - GPU temperature
- `system_gpu_power_usage_milliwatts` - GPU power usage
- `system_gpu_fan_speed_percent` - Fan speed
- `system_gpu_clock_sm_mhz` - SM clock speed
- `system_gpu_clock_memory_mhz` - Memory clock speed
Application metrics are prefixed with `app_` and include the endpoint name as a label.
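The exact JSON shape the HTTP collector expects from application endpoints is defined in `internal/collector/http.go`. The toy endpoint below is only an assumption modeled on the shipper payload format shown earlier; it illustrates what a scrape target listed in `endpoints` might serve.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Hypothetical application exposing a JSON /metrics endpoint for scraping.
// The name/value/type/labels schema is assumed, not taken from metricsd.
func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		metrics := []map[string]any{
			{"name": "queue_depth", "value": 12, "type": "gauge", "labels": map[string]string{"queue": "default"}},
			{"name": "requests_total", "value": 8431, "type": "counter", "labels": map[string]string{}},
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(metrics)
	})
	log.Fatal(http.ListenAndServe(":3000", nil))
}
```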
Protect sensitive configuration and certificate files:
```bash
# Configuration file
chmod 600 /opt/metricsd/config.json
chown metricsd:metricsd /opt/metricsd/config.json

# Certificate directory
chmod 700 /etc/metricsd/certs
chown -R metricsd:metricsd /etc/metricsd/certs

# Private keys
chmod 600 /etc/metricsd/certs/*.key

# Certificates
chmod 644 /etc/metricsd/certs/*.crt
```

Always run the service as a dedicated non-privileged user:
```bash
# Create dedicated user
sudo useradd -r -s /bin/false -d /opt/metricsd metricsd

# Set ownership
sudo chown -R metricsd:metricsd /opt/metricsd
```

- Use TLS for all remote communications
- Enable mTLS when possible for mutual authentication
- Restrict network access using firewalls
- Use internal/private networks when available
- Regularly update certificates before expiration
- Store sensitive values in environment variables
- Use secrets management tools (HashiCorp Vault, AWS Secrets Manager, etc.)
- Rotate credentials regularly
- Audit configuration changes
- Enable detailed logging for security monitoring
Create /etc/systemd/system/metricsd.service:
```ini
[Unit]
Description=Metrics Collector Service (metricsd)
Documentation=https://github.com/0x524A/metricsd
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=metricsd
Group=metricsd
WorkingDirectory=/opt/metricsd
ExecStart=/opt/metricsd/bin/metricsd -config /opt/metricsd/config.json -log-level info
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=10
KillMode=process
TimeoutStopSec=30

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/metricsd
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true

# Resource limits
LimitNOFILE=65536
LimitNPROC=512

[Install]
WantedBy=multi-user.target
```

Install and enable:
```bash
# Copy binary and config
sudo mkdir -p /opt/metricsd/{bin,certs}
sudo cp bin/metricsd /opt/metricsd/bin/
sudo cp config.json /opt/metricsd/

# Create user
sudo useradd -r -s /bin/false -d /opt/metricsd metricsd

# Set permissions
sudo chown -R metricsd:metricsd /opt/metricsd
sudo chmod 600 /opt/metricsd/config.json
sudo chmod 755 /opt/metricsd/bin/metricsd

# Install and start service
sudo cp metricsd.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable metricsd
sudo systemctl start metricsd

# Check status
sudo systemctl status metricsd
sudo journalctl -u metricsd -f
```

Prerequisites:
- Docker installed (version 20.10+ recommended)
- Docker Compose (optional, for easier deployment)
- At least 500MB free disk space for the image
Step 1: Create the Dockerfile
Create a file named Dockerfile in the project root:
```dockerfile
FROM golang:1.24-bookworm AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    make \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .

# Build with all features including GPU support (NVML)
RUN go build -ldflags '-w -s' -o metricsd cmd/metricsd/main.go

FROM debian:bookworm-slim

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    tzdata \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN groupadd -g 1000 metricsd && \
    useradd -r -u 1000 -g metricsd -s /bin/false metricsd

# Create directories
RUN mkdir -p /etc/metricsd/certs /var/lib/metricsd
RUN chown -R metricsd:metricsd /etc/metricsd /var/lib/metricsd

WORKDIR /home/metricsd

# Copy binary
COPY --from=builder /app/metricsd /usr/local/bin/metricsd
RUN chmod +x /usr/local/bin/metricsd

# Switch to non-root user
USER metricsd

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

EXPOSE 8080

ENTRYPOINT ["/usr/local/bin/metricsd"]
CMD ["-config", "/etc/metricsd/config.json"]
```

Step 2: Build the Image
```bash
# Basic build
docker build -t metricsd:latest .

# Build with custom tag
docker build -t metricsd:v1.0.0 .

# Build for a specific platform (cross-platform builds)
docker build --platform linux/amd64 -t metricsd:latest .

# Build with build arguments (if needed)
docker build --build-arg GO_VERSION=1.24 -t metricsd:latest .

# Build with no cache (clean build)
docker build --no-cache -t metricsd:latest .

# Build and show full build progress
docker build --progress=plain -t metricsd:latest .
```

Step 3: Verify the Build
```bash
# List the image
docker images | grep metricsd

# Check image size
docker images metricsd:latest --format "{{.Size}}"

# Inspect the image
docker inspect metricsd:latest

# Test run (quick check)
docker run --rm metricsd:latest -help
```

Step 4: Tag for Registry (Optional)
```bash
# Tag for Docker Hub
docker tag metricsd:latest 0x524A/metricsd:latest
docker tag metricsd:latest 0x524A/metricsd:v1.0.0

# Tag for private registry
docker tag metricsd:latest registry.example.com/metricsd:latest

# Push to registry
docker push 0x524A/metricsd:latest
```

Optimizing the Build
Create a .dockerignore file to exclude unnecessary files:
```
# .dockerignore
.git
.gitignore
.github
README.md
LICENSE
*.md
.vscode
.idea
bin/
*.log
*.tmp
.env
.DS_Store
Makefile
docker-compose.yml
```

Build Troubleshooting
Common build issues:
# Issue: "cannot find package" # Solution: Ensure go.mod and go.sum are present go mod tidy docker build -t metricsd:latest . # Issue: "no space left on device" # Solution: Clean up Docker docker system prune -a --volumes # Issue: Build is slow # Solution: Use BuildKit (faster builds) DOCKER_BUILDKIT=1 docker build -t metricsd:latest . # Issue: Platform mismatch (M1 Mac, ARM) # Solution: Build for specific platform docker build --platform linux/amd64 -t metricsd:latest . # Issue: Can't connect to Docker daemon # Solution: Start Docker or check permissions sudo systemctl start docker # Linux sudo usermod -aG docker $USER # Add user to docker groupdocker-compose.yml (for container metrics):
```yaml
version: '3.8'

services:
  metricsd:
    build: .
    image: metricsd:latest
    container_name: metricsd
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ./config.json:/etc/metricsd/config.json:ro
      - ./certs:/etc/metricsd/certs:ro
    environment:
      - MC_LOG_LEVEL=info
      - MC_SHIPPER_ENDPOINT=https://prometheus:9090/api/v1/write
      - MC_TLS_ENABLED=true
      - MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt
      - MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key
      - MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt
    networks:
      - metrics
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

networks:
  metrics:
    driver: bridge
```

docker-compose.yml (for HOST metrics - recommended for production):
```yaml
version: '3.8'

services:
  metricsd:
    build: .
    image: metricsd:latest
    container_name: metricsd
    restart: unless-stopped
    # Use host network to access host metrics
    network_mode: host
    # Use host PID namespace to see host processes
    pid: host
    volumes:
      # Mount host filesystems for accurate host metrics
      - /:/rootfs:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./config.json:/etc/metricsd/config.json:ro
      - ./certs:/etc/metricsd/certs:ro
    environment:
      # Tell gopsutil to use host filesystems
      - HOST_PROC=/host/proc
      - HOST_SYS=/host/sys
      - HOST_ROOT=/rootfs
      - MC_LOG_LEVEL=info
      - MC_SHIPPER_ENDPOINT=https://prometheus:9090/api/v1/write
      - MC_TLS_ENABLED=true
      - MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt
      - MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key
      - MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    # Privileged mode may be needed for full system access
    # privileged: true
    # Or use specific capabilities
    cap_add:
      - SYS_PTRACE
      - SYS_ADMIN
```

Prerequisites:
- Built Docker image (see steps above)
- `config.json` file prepared
- TLS certificates (optional, if using TLS)
Option 1: Quick Start (Container Metrics)
```bash
# Prepare configuration
cp config.example.json config.json
# Edit config.json with your settings

# Run container
docker run -d \
  --name metricsd \
  -p 8080:8080 \
  -v $(pwd)/config.json:/etc/metricsd/config.json:ro \
  -e MC_LOG_LEVEL=info \
  metricsd:latest

# Check if it's running
docker ps | grep metricsd

# View logs
docker logs -f metricsd

# Check health
curl http://localhost:8080/health
```

Option 2: With TLS (Secure)
```bash
# Ensure you have certificates
ls -la certs/
# Should contain: client.crt, client.key, ca.crt

# Run with TLS
docker run -d \
  --name metricsd \
  -p 8080:8080 \
  -v $(pwd)/config.json:/etc/metricsd/config.json:ro \
  -v $(pwd)/certs:/etc/metricsd/certs:ro \
  -e MC_LOG_LEVEL=info \
  -e MC_TLS_ENABLED=true \
  -e MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt \
  -e MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key \
  -e MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt \
  metricsd:latest
```

Option 3: Host Metrics Collection (Recommended for Production)
This mounts host filesystems to collect actual host metrics instead of container metrics:
```bash
docker run -d \
  --name metricsd-host \
  --pid=host \
  --network=host \
  --restart=unless-stopped \
  -v /:/rootfs:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v $(pwd)/config.json:/etc/metricsd/config.json:ro \
  -v $(pwd)/certs:/etc/metricsd/certs:ro \
  -e HOST_PROC=/host/proc \
  -e HOST_SYS=/host/sys \
  -e HOST_ROOT=/rootfs \
  -e MC_LOG_LEVEL=info \
  metricsd:latest
```

Option 4: Using Docker Compose (Easiest)
```bash
# Build and start
docker-compose up -d

# View logs
docker-compose logs -f metricsd

# Stop
docker-compose down

# Rebuild and restart
docker-compose up -d --build

# View service status
docker-compose ps
```

Container Management:
```bash
# Stop container
docker stop metricsd

# Start container
docker start metricsd

# Restart container
docker restart metricsd

# Remove container
docker rm -f metricsd

# View logs (last 100 lines)
docker logs --tail 100 metricsd

# Follow logs in real-time
docker logs -f metricsd

# Check container health status
docker inspect --format='{{.State.Health.Status}}' metricsd

# Execute command in container
docker exec -it metricsd sh

# View container resource usage
docker stats metricsd

# Export container logs to file
docker logs metricsd > metricsd.log 2>&1
```

Note: The Deployment below collects pod/container metrics. To collect node/host metrics in Kubernetes, use a DaemonSet instead. See the "Collecting Host Metrics from Docker Container" section for a DaemonSet example.
deployment.yaml (for pod metrics):
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricsd-config
  namespace: monitoring
data:
  config.json: |
    {
      "server": { "host": "0.0.0.0", "port": 8080 },
      "collector": {
        "interval_seconds": 60,
        "enable_cpu": true,
        "enable_memory": true,
        "enable_disk": true,
        "enable_network": true,
        "enable_gpu": false
      },
      "shipper": {
        "type": "prometheus_remote_write",
        "endpoint": "https://prometheus.monitoring.svc.cluster.local:9090/api/v1/write",
        "timeout": 30000000000,
        "tls": {
          "enabled": true,
          "cert_file": "/etc/metricsd/certs/tls.crt",
          "key_file": "/etc/metricsd/certs/tls.key",
          "ca_file": "/etc/metricsd/certs/ca.crt",
          "server_name": "prometheus.monitoring.svc.cluster.local",
          "min_version": "TLS1.2"
        }
      },
      "endpoints": []
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metricsd
  namespace: monitoring
  labels:
    app: metricsd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metricsd
  template:
    metadata:
      labels:
        app: metricsd
    spec:
      serviceAccountName: metricsd
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: metricsd
          image: metricsd:latest
          imagePullPolicy: IfNotPresent
          args:
            - "-config"
            - "/etc/metricsd/config.json"
            - "-log-level"
            - "info"
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/metricsd
              readOnly: true
            - name: certs
              mountPath: /etc/metricsd/certs
              readOnly: true
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
      volumes:
        - name: config
          configMap:
            name: metricsd-config
        - name: certs
          secret:
            secretName: metricsd-tls
---
apiVersion: v1
kind: Service
metadata:
  name: metricsd
  namespace: monitoring
  labels:
    app: metricsd
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: metricsd
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricsd
  namespace: monitoring
```

Create TLS secret:
```bash
kubectl create secret generic metricsd-tls \
  --from-file=tls.crt=certs/client.crt \
  --from-file=tls.key=certs/client.key \
  --from-file=ca.crt=certs/ca.crt \
  -n monitoring
```

Deploy:
```bash
kubectl apply -f deployment.yaml
kubectl get pods -n monitoring
kubectl logs -f -n monitoring deployment/metricsd
```

By default, a containerized application collects metrics from inside the container (container CPU, container memory, etc.). To collect metrics from the host system instead, you need to mount host filesystems into the container.
- Container metrics: Shows resource usage of the container itself (limited by cgroups)
- Host metrics: Shows actual host machine CPU, memory, disk, and network usage
- Use case: Monitoring the physical/virtual machine where Docker is running
Mount these host paths into your container:
| Host Path | Container Mount | Purpose |
|---|---|---|
/proc | /host/proc:ro | Process information, CPU stats |
/sys | /host/sys:ro | System information, block devices |
/ | /rootfs:ro | Root filesystem for disk metrics |
/var/run/docker.sock | /var/run/docker.sock:ro | Docker socket (optional) |
Set these environment variables to tell the gopsutil library to use host paths:
```
HOST_PROC=/host/proc
HOST_SYS=/host/sys
HOST_ROOT=/rootfs
```

Using docker run:

```bash
docker run -d \
  --name metricsd-host-metrics \
  --pid=host \
  --network=host \
  --restart=unless-stopped \
  -v /:/rootfs:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v $(pwd)/config.json:/etc/metricsd/config.json:ro \
  -e HOST_PROC=/host/proc \
  -e HOST_SYS=/host/sys \
  -e HOST_ROOT=/rootfs \
  -e MC_LOG_LEVEL=info \
  metricsd:latest
```

Using Docker Compose:

```yaml
version: '3.8'

services:
  metricsd-host:
    image: metricsd:latest
    container_name: metricsd-host-metrics
    restart: unless-stopped
    network_mode: host   # Access host network interfaces
    pid: host            # Access host processes
    volumes:
      - /:/rootfs:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./config.json:/etc/metricsd/config.json:ro
      - ./certs:/etc/metricsd/certs:ro
    environment:
      - HOST_PROC=/host/proc
      - HOST_SYS=/host/sys
      - HOST_ROOT=/rootfs
    cap_add:
      - SYS_PTRACE   # For process monitoring
```

When collecting host metrics:
- ✅ Use read-only mounts (`:ro`) for host filesystems
- ✅ Minimize capabilities - only add what's needed (SYS_PTRACE, SYS_ADMIN)
- ⚠️ Avoid `privileged: true` unless absolutely necessary
- ✅ Run as non-root user when possible
- ✅ Review mounted paths - only mount what you need
For Kubernetes, use a DaemonSet to run one pod per node:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricsd-host
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: metricsd-host
  template:
    metadata:
      labels:
        app: metricsd-host
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: metricsd
          image: metricsd:latest
          env:
            - name: HOST_PROC
              value: /host/proc
            - name: HOST_SYS
              value: /host/sys
            - name: HOST_ROOT
              value: /rootfs
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /rootfs
              readOnly: true
            - name: config
              mountPath: /etc/metricsd
            - name: certs
              mountPath: /etc/metricsd/certs
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
        - name: config
          configMap:
            name: metricsd-config
        - name: certs
          secret:
            secretName: metricsd-tls
```

Check the logs to ensure host metrics are being collected:
```bash
# Check logs
docker logs metricsd-host-metrics

# You should see metrics for ALL host CPUs, not just container limits
# Example: if the host has 16 cores, you should see metrics for all 16

# Test with debug logging
docker run --rm -it \
  --pid=host \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v $(pwd)/config.json:/etc/metricsd/config.json:ro \
  -e HOST_PROC=/host/proc \
  -e HOST_SYS=/host/sys \
  metricsd:latest -config /etc/metricsd/config.json -log-level debug
```

Adjust based on your needs:
- High-frequency monitoring: 10-30 seconds
- Standard monitoring: 60 seconds (recommended)
- Low-frequency monitoring: 300+ seconds
- Enable session tickets - Reduces TLS handshake overhead
- Use TLS 1.3 - Faster handshake and better performance
- Connection pooling - Automatically handled by the HTTP client
- Keep-alive - Connections are reused between shipments (see the transport sketch below)
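For reference, the pooling and keep-alive behavior described above is what Go's `net/http` transport provides out of the box. The settings below illustrate the relevant knobs; they are generic `net/http` options, not values taken from metricsd.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

// Illustrative Transport settings for connection reuse between shipments.
func newShippingClient(tlsCfg *tls.Config) *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig:     tlsCfg,
			MaxIdleConns:        10,
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     90 * time.Second, // keep-alive window for reuse
			ForceAttemptHTTP2:   true,
		},
	}
}

func main() {
	_ = newShippingClient(&tls.Config{MinVersion: tls.VersionTLS12})
}
```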
Typical resource usage:
- CPU: 50-200m (minimal overhead)
- Memory: 50-150 MB RSS
- Network: Depends on metric volume and shipping frequency
Optimize with:
{ "collector": { "interval_seconds": 60, "enable_cpu": true, "enable_memory": true, "enable_disk": false, "enable_network": false, "enable_gpu": false } }The service exposes its own health endpoint:
- Monitor HTTP response time at `/health` (a probe sketch follows this list)
- Check logs for shipping errors
- Monitor system resource usage
- Set up alerts for service failures
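A minimal external probe against `/health` can feed such alerts; the address and the 5-second timeout below are examples to adapt, not part of metricsd.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Checks /health and reports status plus latency; non-zero exit on failure.
func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	start := time.Now()
	resp, err := client.Get("http://localhost:8080/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Printf("status=%d latency=%s\n", resp.StatusCode, time.Since(start))
	if resp.StatusCode != http.StatusOK {
		os.Exit(1)
	}
}
```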
```bash
# Clone repository
git clone https://github.com/0x524A/metricsd.git
cd metricsd

# Install dependencies
go mod download

# Build
make build

# Run with development config
./bin/metricsd -config config.json -log-level debug
```

```
metricsd/
├── cmd/
│   └── metricsd/              # Main application entry point
│       └── main.go
├── internal/                  # Internal packages
│   ├── collector/             # Metric collectors
│   │   ├── collector.go       # Collector interface & registry
│   │   ├── system.go          # System metrics (CPU, memory, disk, network)
│   │   ├── gpu.go             # GPU metrics (NVIDIA NVML)
│   │   └── http.go            # HTTP endpoint scraper
│   ├── config/                # Configuration management
│   │   └── config.go          # Config structs & validation
│   ├── shipper/               # Metric shipping backends
│   │   ├── shipper.go         # Shipper interface
│   │   ├── prometheus.go      # Prometheus remote write protocol
│   │   └── http_json.go       # HTTP JSON POST
│   ├── orchestrator/          # Collection & shipping coordination
│   │   └── orchestrator.go
│   └── server/                # HTTP server (health checks)
│       └── server.go
├── bin/                       # Compiled binaries
├── config.json                # Runtime configuration
├── config.example.json        # Example configuration
├── Makefile                   # Build automation
├── go.mod                     # Go module definition
└── README.md                  # This file
```

```bash
# Run all tests
go test ./...

# Run with coverage
go test -cover ./...

# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run specific package tests
go test ./internal/collector/...

# Run with verbose output
go test -v ./...

# Run benchmarks
go test -bench=. ./...
```

```bash
# Build for current platform
go build -o bin/metricsd cmd/metricsd/main.go

# Build with optimizations
go build -ldflags="-s -w" -o bin/metricsd cmd/metricsd/main.go

# Build for multiple platforms
GOOS=linux GOARCH=amd64 go build -o bin/metricsd-linux-amd64 cmd/metricsd/main.go
GOOS=darwin GOARCH=amd64 go build -o bin/metricsd-darwin-amd64 cmd/metricsd/main.go
GOOS=windows GOARCH=amd64 go build -o bin/metricsd-windows-amd64.exe cmd/metricsd/main.go

# Using Makefile (if available)
make build
make test
make clean
```

Follow standard Go conventions:
- Use `gofmt` for formatting
- Use `golint` for linting
- Use `go vet` for static analysis
```bash
# Format code
gofmt -w .

# Run linter
golangci-lint run

# Static analysis
go vet ./...
```

- Create a new collector in `internal/collector/`:
```go
package collector

import "context"

type MyCollector struct {
	// fields
}

func NewMyCollector() *MyCollector {
	return &MyCollector{}
}

func (c *MyCollector) Collect(ctx context.Context) ([]Metric, error) {
	var metrics []Metric
	// Implementation
	return metrics, nil
}

func (c *MyCollector) Name() string {
	return "my_collector"
}
```

- Register in `cmd/metricsd/main.go`:
```go
myCollector := collector.NewMyCollector()
registry.Register(myCollector)
```

- Create a new shipper in `internal/shipper/`:
```go
package shipper

import (
	"context"
	"crypto/tls"
	"net/http"
	// also import this module's collector package for collector.Metric
)

type MyShipper struct {
	endpoint string
	client   *http.Client
}

func NewMyShipper(endpoint string, tlsConfig *tls.Config) (*MyShipper, error) {
	// Implementation
	return &MyShipper{endpoint: endpoint}, nil
}

func (s *MyShipper) Ship(ctx context.Context, metrics []collector.Metric) error {
	// Implementation
	return nil
}

func (s *MyShipper) Close() error {
	// Cleanup
	return nil
}
```

- Add the shipper type to config validation in `internal/config/config.go`
- Add initialization in `cmd/metricsd/main.go`
The project adheres to SOLID principles:
- Single Responsibility Principle (SRP)
  - Each collector focuses on one metric source
  - Each shipper handles one protocol
  - Orchestrator only coordinates collection and shipping
- Open/Closed Principle (OCP)
  - New collectors can be added without modifying existing code
  - New shippers can be plugged in via the interface
  - Configuration is extensible
- Liskov Substitution Principle (LSP)
  - All collectors implement the `Collector` interface
  - All shippers implement the `Shipper` interface
  - Components are interchangeable
- Interface Segregation Principle (ISP)
  - Small, focused interfaces (`Collector`, `Shipper`)
  - Clients depend only on methods they use
  - No fat interfaces
- Dependency Inversion Principle (DIP)
  - High-level modules depend on abstractions (interfaces)
  - Concrete implementations are injected
  - Loose coupling throughout the codebase
Service won't start:
```bash
# Check logs
sudo journalctl -u metricsd -n 50

# Verify configuration
./bin/metricsd -config config.json
# Should show validation errors

# Check file permissions
ls -la /opt/metricsd/config.json
ls -la /etc/metricsd/certs/
```

TLS handshake errors:
```bash
# Test TLS connection
openssl s_client -connect metrics.example.com:443 \
  -cert /etc/metricsd/certs/client.crt \
  -key /etc/metricsd/certs/client.key \
  -CAfile /etc/metricsd/certs/ca.crt

# Verify certificate
openssl x509 -in /etc/metricsd/certs/client.crt -text -noout

# Check certificate expiration
openssl x509 -in /etc/metricsd/certs/client.crt -checkend 0
```

Metrics not shipping:
- Check network connectivity to endpoint
- Verify TLS configuration
- Check endpoint authentication requirements
- Review logs for error messages
- Test endpoint manually with curl
High memory usage:
- Reduce collection frequency
- Disable unused collectors
- Check for memory leaks in logs
- Monitor with pprof if needed
Permission denied errors:
```bash
# Fix ownership
sudo chown -R metricsd:metricsd /opt/metricsd
sudo chown -R metricsd:metricsd /etc/metricsd

# Fix permissions
sudo chmod 600 /opt/metricsd/config.json
sudo chmod 600 /etc/metricsd/certs/*.key
sudo chmod 644 /etc/metricsd/certs/*.crt
```

Q: Can I use metricsd without TLS? A: Yes, set `shipper.tls.enabled` to `false`. However, TLS is strongly recommended for production.
Q: Does metricsd support custom metrics? A: Yes, add application endpoints to the endpoints array in the configuration. The HTTP collector will scrape them.
Q: How do I rotate TLS certificates? A: Update the certificate files, then restart the service. Consider implementing a certificate rotation process with minimal downtime.
Q: Can I ship to multiple endpoints? A: Currently, one shipper endpoint is supported per instance. Run multiple instances for multiple destinations.
Q: What's the performance impact? A: Minimal. Typical CPU usage is <1% and memory usage is around 50-150MB depending on enabled collectors.
Q: How do I monitor metricsd itself? A: Use the /health endpoint and monitor the service logs. You can also use process monitoring tools.
Q: Does it work on Windows? A: Yes, but some system metrics may have limited support. GPU metrics require NVIDIA drivers.
Q: Can I use this with Grafana? A: Yes, ship metrics to Prometheus (using remote write) and configure Grafana to query Prometheus.
Q: How do I debug TLS issues? A: Enable debug logging with -log-level debug and review the detailed TLS handshake logs.
Q: Is IPv6 supported? A: Yes, both IPv4 and IPv6 are supported for all network operations.
Q: How do I collect host metrics when running in Docker? A: Mount the host's /proc, /sys, and / into the container and set environment variables. See the "Collecting Host Metrics from Docker Container" section for complete instructions.
Q: Why are my CPU/memory metrics showing container limits instead of host resources? A: Without host filesystem mounts, the container only sees its own cgroup limits. Mount host paths and set HOST_PROC=/host/proc and HOST_SYS=/host/sys to collect host metrics.
- Add support for multiple shipper endpoints
- Implement metric aggregation and buffering
- Add support for metric filtering and transformation
- Implement retry logic with exponential backoff
- Add support for custom labels on system metrics
- Implement metric caching for offline scenarios
- Add Datadog, InfluxDB, and other shipper backends
- Add web UI for configuration and monitoring
- Implement metric sampling for high-volume scenarios
- Add support for Windows-specific metrics
- Implement health check with detailed status information
MIT License - see LICENSE file for details
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Ensure tests pass (`go test ./...`)
- Format your code (`gofmt -w .`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Go best practices and idioms
- Maintain SOLID design principles
- Add tests for new functionality
- Update documentation as needed
- Keep commits atomic and well-described
- Ensure backward compatibility when possible
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code comments
When reporting bugs, please include:
- metricsd version
- Operating system and version
- Go version
- Configuration file (sanitized)
- Relevant log output
- Steps to reproduce
Feature requests are welcome! Please:
- Check existing issues first
- Provide detailed use case
- Explain expected behavior
- Consider contributing the feature
Built with:
- zerolog - Fast structured logging
- gopsutil - System metrics collection
- prometheus/client_golang - Prometheus integration
- NVML - GPU metrics
- Your Name - Initial work
See also the list of contributors who participated in this project.
Made with ❤️ by the metricsd team