Guide to Configuring GitLab Monitoring and Alerting on CentOS
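Before configuring GitLab monitoring and alerting on CentOS, make sure GitLab is installed (version 13.0 or later; the latest stable release is recommended) and that you have root or sudo privileges. Either put SELinux into permissive mode with setenforce 0 (this setting does not survive a reboot), or configure an SELinux policy that allows the monitoring tools to reach the GitLab ports (80/443 by default, plus 9090 and others).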
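Edit the main GitLab configuration file, /etc/gitlab/gitlab.rb, to enable the built-in metrics service:

    gitlab_rails['gitlab_metrics_enabled'] = true
    gitlab_runner['metrics_enabled'] = true
    global['monitoring_enabled'] = true

Setting names differ between GitLab versions, so check these keys against the gitlab.rb template shipped with your installation; in Omnibus packages the bundled monitoring stack is controlled by prometheus_monitoring['enable'].

Save the file and run sudo gitlab-ctl reconfigure to apply the change. GitLab then starts the metrics service automatically (default port 9090).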
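One way to confirm that the metrics service is returning data after the reconfigure is to list the metric names it exposes. The sketch below defines a hypothetical helper, list_metric_names (not part of GitLab or Prometheus), that reads Prometheus text-format output on standard input; the sample input is made up for illustration.

```shell
# list_metric_names: read Prometheus text-format metrics on stdin and
# print the unique metric names, one per line.
# (Hypothetical helper, not part of GitLab or Prometheus.)
list_metric_names() {
  # Skip comment and blank lines, then cut everything from the first
  # "{" or space onward, leaving only the metric name.
  awk '!/^#/ && NF { sub(/[{ ].*/, ""); print }' | sort -u
}

# Illustrative sample; real input would come from the metrics endpoint.
sample='# HELP up Target health
up 1
gitlab_rails_database_queries_seconds_count{db="main"} 42
gitlab_rails_database_queries_seconds_count{db="ci"} 7'

printf '%s\n' "$sample" | list_metric_names
```

Against a live server the same function could be fed with curl -s http://localhost:9090/metrics | list_metric_names, assuming the endpoint address and port stated above apply to your installation.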
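Edit /etc/prometheus/prometheus.yml and add a scrape job for the GitLab metrics:

    scrape_configs:
      - job_name: 'gitlab'
        static_configs:
          - targets: ['localhost:9090']  # GitLab server address

If Prometheus runs on the same host as GitLab, the two cannot both listen on port 9090; move one of them to another port and adjust the target accordingly.

Start Prometheus and enable it at boot with sudo systemctl start prometheus && sudo systemctl enable prometheus. Then open the Prometheus web interface at http://<server IP>:9090 and verify that GitLab metrics are being collected (for example, gitlab_rails_database_queries_seconds).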
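Besides the web interface, the scrape job can be checked from the command line through the Prometheus /api/v1/targets endpoint. The sketch below is a hypothetical helper that counts healthy targets with a plain-text match, on the assumption that Prometheus returns compact JSON without spaces; the sample response is trimmed and made up.

```shell
# count_healthy_targets: read the JSON body of Prometheus'
# /api/v1/targets endpoint on stdin and print how many targets
# report "health":"up". A plain-text match is used so no JSON tool
# is required; this assumes compact JSON without extra whitespace.
count_healthy_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Trimmed, made-up response for illustration only.
sample='{"status":"success","data":{"activeTargets":[{"labels":{"job":"gitlab"},"health":"up"},{"labels":{"job":"node"},"health":"down"}]}}'

printf '%s\n' "$sample" | count_healthy_targets
```

On a running server the input would come from curl -s http://localhost:9090/api/v1/targets. A JSON-aware tool is the more robust choice where one is available.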
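Install and start Grafana: sudo yum install grafana -y && sudo systemctl start grafana-server. Log in to Grafana (default credentials admin/admin), go to "Configuration→Data Sources", and add a Prometheus data source with the URL http://localhost:9090; save it once the connection test succeeds. Then import a dashboard (ID 4379, from the Grafana community), or add your own panels for metrics such as CPU usage (node_cpu_seconds_total{mode!="idle"}), the available-memory ratio (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes), and the GitLab job success rate (gitlab_runner_jobs_success_rate).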
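To make the available-memory ratio concrete: node_exporter takes these two values from /proc/meminfo, so the same quotient can be computed by hand on the server. The helper name and the sample figures below are illustrative assumptions, not taken from the original setup.

```shell
# mem_available_ratio: read /proc/meminfo-style text on stdin and
# print MemAvailable / MemTotal as a fraction with two decimals.
mem_available_ratio() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.2f\n", avail / total
      else           print "unknown"
    }'
}

# Made-up sample values in kB.
sample='MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB'

printf '%s\n' "$sample" | mem_available_ratio
```

Running mem_available_ratio < /proc/meminfo on the GitLab host gives a figure that should roughly match the Grafana panel built from the same expression.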
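Create the file /etc/prometheus/alert.yml and define common alert rules (high CPU, low memory, job failures). The CPU rule averages idle time across all cores, so the 0.8 threshold means 80% of total capacity regardless of core count; summing the non-idle rates instead would count busy cores rather than a fraction. The Runner metric names are reproduced as given and should be confirmed against your Runner's metrics output.

    groups:
      - name: gitlab_alerts
        rules:
          - alert: GitLabHighCPU
            # Fraction of CPU time not spent idle, averaged over all cores
            expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "CPU usage on the GitLab server is too high ({{ $value | humanizePercentage }})"
              description: "CPU usage on {{ $labels.instance }} has been above 80% for 5 minutes; check the process load."
          - alert: GitLabHighMemory
            expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "GitLab server is low on memory ({{ $value | humanizePercentage }} available)"
              description: "Available memory on {{ $labels.instance }} has been below 20% for 10 minutes; services may become sluggish."
          - alert: GitLabJobFailureRate
            expr: rate(gitlab_runner_jobs_failed_total[1h]) / rate(gitlab_runner_jobs_total[1h]) > 0.1
            for: 1h
            labels:
              severity: error
            annotations:
              summary: "GitLab job failure rate is too high ({{ $value | humanizePercentage }})"
              description: "More than 10% of GitLab jobs failed over the past hour; check the Runner configuration and the project pipelines."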
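As a worked example of the job failure rule, the ratio it compares against 0.1 is simply failed jobs divided by all jobs in the window. The helpers and counts below are hypothetical; note that this sketch prints 0.0000 when no jobs ran, whereas the PromQL division would produce no usable value in that case and the alert would not fire.

```shell
# failure_ratio FAILED TOTAL: print FAILED / TOTAL with four decimals,
# or 0.0000 when TOTAL is zero (no jobs ran in the window).
failure_ratio() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total > 0) printf "%.4f\n", failed / total
    else           printf "%.4f\n", 0
  }'
}

# exceeds THRESHOLD VALUE: succeed (exit 0) when VALUE > THRESHOLD.
exceeds() {
  awk -v t="$1" -v v="$2" 'BEGIN { exit !(v > t) }'
}

# Made-up counts: 3 failed jobs out of 20 in the past hour.
ratio=$(failure_ratio 3 20)
if exceeds 0.1 "$ratio"; then
  echo "ratio $ratio is above the 10% threshold"
else
  echo "ratio $ratio is within the 10% threshold"
fi
```

With these numbers the ratio is 0.15, so the condition holds; the alert itself would still wait for the for: duration before firing.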
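Edit /etc/prometheus/prometheus.yml to load the alert rules and point Prometheus at Alertmanager:

    rule_files:
      - "/etc/prometheus/alert.yml"
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['localhost:9093']  # Alertmanager address

Restart Prometheus (sudo systemctl restart prometheus) for the rules to take effect.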
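A syntax error in either file prevents Prometheus from starting, so it can help to validate both before the restart. The wrapper below is a sketch around promtool's check config and check rules subcommands; the function name is hypothetical, and it assumes promtool was installed alongside Prometheus.

```shell
# validate_prometheus CONFIG RULES: run promtool's syntax checks on the
# main config and the rule file; return non-zero if either check fails
# or promtool is not on PATH.
validate_prometheus() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping validation" >&2
    return 1
  fi
  promtool check config "$1" && promtool check rules "$2"
}

# Paths as used in this guide; restart only when both checks pass:
# validate_prometheus /etc/prometheus/prometheus.yml /etc/prometheus/alert.yml \
#   && sudo systemctl restart prometheus
```

Chaining the restart behind the check means a broken edit leaves the running Prometheus untouched.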
Install and start Alertmanager: sudo yum install alertmanager -y && sudo systemctl start alertmanager. Then edit /etc/alertmanager/alertmanager.yml to configure email notifications (SMTP in this example):

    route:
      receiver: 'email-notifications'
      group_by: ['alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'admin@example.com'
            from: 'gitlab-alert@example.com'
            smarthost: 'smtp.example.com:587'
            auth_username: 'gitlab_alert'
            auth_password: 'your_password'
            send_resolved: true  # also notify when the problem is resolved

Restart Alertmanager: sudo systemctl restart alertmanager.
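The email route can be exercised without waiting for a real incident by posting a hand-made alert to Alertmanager's /api/v2/alerts endpoint. The sketch below only builds the JSON payload; the function name, alert name, and summary text are illustrative assumptions.

```shell
# test_alert_payload NAME SEVERITY: print a one-element JSON array in
# the shape accepted by Alertmanager's POST /api/v2/alerts endpoint.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Manual test alert"}}]\n' "$1" "$2"
}

test_alert_payload ManualTest warning

# To send it (assumes Alertmanager on localhost:9093 as configured above):
# test_alert_payload ManualTest warning |
#   curl -s -X POST -H 'Content-Type: application/json' \
#        --data @- http://localhost:9093/api/v2/alerts
```

With the route shown above, the notification should go out after the 30 second group_wait, provided the SMTP settings are valid.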
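GitLab also ships with its own monitoring features, which can be enabled as follows. Open the GitLab admin area (http://<server IP>/admin), click "Monitoring→Metrics", and confirm that the metrics service is enabled. Then, in the project's .gitlab-ci.yml file, add a monitoring job (for example, one that installs prometheus-node-exporter and exports a metric):

    monitoring:
      stage: test
      script:
        - yum install -y prometheus-node-exporter
        # Write the metric inside the project directory, then copy it to
        # the directory read by node_exporter
        - echo "gitlab_custom_metric{project=\"$CI_PROJECT_PATH\"} 1" > custom_metrics.prom
        - cp custom_metrics.prom /var/lib/node_exporter/custom_metrics.prom
      artifacts:
        paths:
          # Artifact paths must be relative to the project directory
          - custom_metrics.prom
        expire_in: 1 week

This job writes a custom metric that Prometheus can scrape, provided the job runs on a host where node_exporter's textfile collector reads /var/lib/node_exporter.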
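When a metric file is written for node_exporter's textfile collector, a common precaution is to write it under a temporary name and rename it, so that a scrape never reads a half-written file. The function below is a hypothetical sketch of that pattern; the demonstration uses a throwaway directory rather than the real one.

```shell
# write_textfile_metric DIR NAME LABELS VALUE
# Write a single sample to DIR/NAME.prom in Prometheus text format.
# The file is written under a temporary name and then renamed, so a
# concurrent reader never sees a half-written file.
write_textfile_metric() {
  dir=$1
  name=$2
  labels=$3
  value=$4
  tmp="$dir/.$name.prom.$$"
  printf '%s{%s} %s\n' "$name" "$labels" "$value" > "$tmp" &&
    mv "$tmp" "$dir/$name.prom"
}

# Demonstration in a throwaway directory; a real setup would use the
# directory passed to node_exporter's --collector.textfile.directory.
demo_dir=$(mktemp -d)
write_textfile_metric "$demo_dir" gitlab_custom_metric 'project="group/app"' 1
cat "$demo_dir/gitlab_custom_metric.prom"
```

In a CI job the labels argument could be built from $CI_PROJECT_PATH, as in the job definition above.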
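To test the CPU alert, use the stress tool to simulate high CPU load (stress --cpu 4 --timeout 300), then check that Prometheus fires the GitLabHighCPU alert, that the Grafana dashboard shows the spike, and that the notification email arrives. Because the rule waits for 5 minutes before firing, a 300 second run may leave the alert pending; a longer timeout makes the test more reliable.

To test the job alert, make a job fail on purpose (for example, by adding exit 1 to a job script in .gitlab-ci.yml) and verify that the GitLabJobFailureRate alert fires. With for: 1h, the alert fires only after the failure rate has stayed above the threshold for an hour.

Restrict access to the monitoring components (for example, with Prometheus basic_auth and Grafana TLS settings). Back up the configuration files regularly (prometheus.yml, alert.yml, alertmanager.yml) so that the configuration is not lost.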
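The backup step can be as simple as a date-stamped archive of the three files. The function below is a minimal sketch; its name and the archive naming scheme are assumptions, and the demonstration uses throwaway files in place of the real configuration.

```shell
# backup_configs DEST_DIR FILE...: pack the given files into a
# date-stamped tar.gz under DEST_DIR and print the archive path.
backup_configs() {
  dest=$1
  shift
  archive="$dest/monitoring-config-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$dest" && tar -czf "$archive" "$@" && echo "$archive"
}

# Demonstration with throwaway files standing in for the real configs.
work=$(mktemp -d)
echo 'global: {}' > "$work/prometheus.yml"
echo 'groups: []' > "$work/alert.yml"
archive=$(backup_configs "$work/backups" "$work/prometheus.yml" "$work/alert.yml")
tar -tzf "$archive"
```

For the setup in this guide, the file arguments would be /etc/prometheus/prometheus.yml, /etc/prometheus/alert.yml, and /etc/alertmanager/alertmanager.yml, with a destination directory of your choosing; the call could then be scheduled from cron.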