The following are the core steps for configuring Kafka monitoring and alerting on Debian, based on the mainstream toolchain (kafka_exporter + Prometheus + Grafana):
```bash
# Install Docker and docker-compose (used to deploy kafka_exporter)
sudo apt update && sudo apt install -y docker.io docker-compose
sudo systemctl start docker && sudo systemctl enable docker

# Install Prometheus (metrics collection)
wget https://github.com/prometheus/prometheus/releases/download/v2.44.0/prometheus-2.44.0.linux-amd64.tar.gz
tar -zxvf prometheus-*.tar.gz
cd prometheus-2.44.0.linux-amd64 && ./prometheus --config.file=prometheus.yml &

# Install Grafana (visualization; requires Grafana's APT repository to be configured beforehand)
sudo apt install -y grafana
sudo systemctl start grafana-server && sudo systemctl enable grafana-server

# Pull the kafka_exporter image and create the docker-compose configuration
docker pull bitnami/kafka-exporter:latest
cat <<EOF > docker-compose.yml
version: '3.1'
services:
  kafka-exporter:
    image: bitnami/kafka-exporter:latest
    command: "--kafka.server=<KAFKA_BROKER_IP>:9092 --kafka.version=3.5.2"
    ports:
      - "9310:9308"
EOF

# Start the service
docker-compose up -d
```

Replace <KAFKA_BROKER_IP> with the actual Broker address; if there are multiple Brokers, list each of them.
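Once the container is running, a quick sanity check confirms the exporter is serving metrics before wiring it into Prometheus. A minimal sketch, assuming the host port mapping 9310:9308 from the compose file above:

```bash
# Fetch the exporter's metrics endpoint and show a few kafka_* series
curl -s http://localhost:9310/metrics | grep '^kafka_' | head -n 5
```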
Edit the Prometheus configuration file prometheus.yml so that it scrapes the exporter:

```yaml
scrape_configs:
  - job_name: 'kafka-exporter'
    metrics_path: '/metrics'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9310']  # add the corresponding IP:port for each additional instance
```
In the same prometheus.yml, add the path to the alert rule file:

```yaml
rule_files:
  - "alert-rules.yml"
```
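For the alerts defined in that rule file to reach a notification channel, Prometheus also needs to know where Alertmanager (mentioned in the notes below) is running. A minimal sketch of the corresponding prometheus.yml block, assuming Alertmanager listens on its default port 9093 on the same host:

```yaml
# Forward fired alerts to Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']  # assumed address; adjust to your Alertmanager deployment
```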
Create the alert-rules.yml file containing the following example rules:

```yaml
groups:
  - name: kafka_alerts
    rules:
      # Broker down alert
      - alert: KafkaBrokerDown
        expr: up{job="kafka-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kafka Broker down"
          description: "Broker {{ $labels.instance }} has been offline for more than 2 minutes"
      # Message backlog alert
      - alert: KafkaMessageBacklog
        expr: sum(kafka_consumergroup_lag_sum) by (group, topic) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Message backlog alert"
          description: "Consumer group {{ $labels.group }} on topic {{ $labels.topic }} has a backlog of more than 5000 messages"
```

Additional notes:

- Disk usage can be monitored via the kafka_disk_usage_percentage metric.
- For more detailed JVM metrics, edit kafka-server-start.sh and add the JMX parameters below, then collect them with jconsole or the Prometheus JMX Exporter:

  ```bash
  export JMX_PORT=9999
  export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
  ```

- For notification delivery, deploy alertmanager and connect it to DingTalk, WeCom (Enterprise WeChat), and other channels via webhooks (see the sketch after this list).
- Verification: restart Prometheus so the new configuration and rules take effect, then open the Prometheus Web UI (http://localhost:9090) and query the kafka_* metrics to confirm data is being collected.
- For long-term retention, metrics can be forwarded to remote storage via remote_write.
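As a starting point for the Alertmanager integration mentioned above, here is a minimal alertmanager.yml sketch that forwards every alert to a single webhook receiver. The URL is a placeholder: DingTalk and WeCom usually sit behind a dedicated webhook adapter rather than accepting Alertmanager's payload directly.

```yaml
# alertmanager.yml: route all alerts to one webhook receiver
route:
  receiver: 'webhook-notify'
  group_by: ['alertname', 'topic']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: 'webhook-notify'
    webhook_configs:
      - url: 'http://<WEBHOOK_ADAPTER>/send'  # placeholder; replace with your DingTalk/WeCom adapter endpoint
        send_resolved: true
```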