
cylon


Envoy: Outlier Detection

outlier detection

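In anomaly detection, one often needs to decide whether a new observation belongs to the same distribution as the existing observations (in which case it is called an inlier) or should be considered different (an outlier). An outlier is an anomalous data point, but not necessarily an erroneous one.

In Envoy, outlier detection is the process of dynamically determining whether some hosts in an upstream cluster are behaving abnormally and then removing them from the healthy load-balancing set. Outlier detection can be enabled together with, or independently of, active health checking, and the two form the basis of an overall upstream health-checking solution.

The concepts are not covered in depth here; see the official documentation or search online for details.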

Detection types

  • Consecutive 5xx
  • Consecutive gateway failure
  • Consecutive local origin failure
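As a rough illustration of how the first two detection types relate, the sketch below maps an upstream HTTP status code to the consecutive-error counters it would count toward. It assumes that gateway errors mean 502, 503, and 504, and that any 5xx counts toward the consecutive 5xx type. The function name `classify` is made up for this example, and locally originated errors (such as connect timeouts or connection resets) are not modeled because they carry no HTTP status code.

```python
from typing import List

# Assumption for this sketch: gateway errors are the 502/503/504 subset of 5xx.
GATEWAY_ERRORS = {502, 503, 504}


def classify(status: int) -> List[str]:
    """Return the consecutive-error counters a response would count toward.

    Illustration only; this is not Envoy's implementation.
    """
    kinds = []
    if 500 <= status <= 599:
        kinds.append("consecutive_5xx")
    if status in GATEWAY_ERRORS:
        kinds.append("consecutive_gateway_failure")
    return kinds


print(classify(200))  # []
print(classify(500))  # ['consecutive_5xx']
print(classify(502))  # ['consecutive_5xx', 'consecutive_gateway_failure']
```

A 502 therefore counts toward both types, while a plain 500 only counts toward the consecutive 5xx type.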

For more details, see the outlier detection section of the official documentation.

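Outlier detection test

Note: this test runs in a single-machine environment only; for anything beyond that, refer to a real deployment.

Environment setup

Use docker-compose to simulate 5 backend nodes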

version: '3'
services:
  envoy:
    image: envoyproxy/envoy-alpine:v1.15-latest
    environment:
      - ENVOY_UID=0
    ports:
      - 80:80
      - 443:443
      - 82:9901
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        aliases:
          - envoy
    depends_on:
      - webserver1
      - webserver2
  webserver1:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver2:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver3:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver4:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver5:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
networks:
  envoymesh: {}

Envoy configuration file

admin:
  access_log_path: /dev/null
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy_http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [ "*" ]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_service }
          http_filters:
          - name: envoy.filters.http.router
  clusters:
  - name: local_service
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webservice, port_value: 90 }
    health_checks:
      timeout: 3s
      interval: 90s
      unhealthy_threshold: 5
      healthy_threshold: 5
      no_traffic_interval: 240s
      http_health_check:
        path: "/ping"
        expected_statuses:
          start: 200
          end: 201
    outlier_detection:
      consecutive_5xx: 2
      base_ejection_time: 30s
      max_ejection_percent: 40
      interval: 20s
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 10

Configuration notes

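outlier_detection:
  consecutive_5xx: 2              # number of consecutive 5xx responses before a host is ejected
  base_ejection_time: 30s         # base ejection time; the actual time is the base time multiplied by the number of times the host has been ejected
  max_ejection_percent: 40        # maximum percentage of the cluster's hosts that can be ejected; the default is 10%, here 40%, i.e. 2 of the 5 nodes
  interval: 20s                   # interval between ejection analysis sweeps
  success_rate_minimum_hosts: 5   # minimum number of hosts in the cluster required for success-rate detection
  success_rate_request_volume: 10 # minimum number of requests a host must receive within one interval to be included in success-rate detection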
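The arithmetic behind `base_ejection_time` and `max_ejection_percent` can be sketched as follows. This is a minimal sketch based only on the descriptions above (ejection time is the base time multiplied by the ejection count, and 40% of 5 hosts is 2 hosts). The helper names are hypothetical, and Envoy's actual enforcement may round or cap these values differently.

```python
def ejection_seconds(base_ejection_time_s: int, times_ejected: int) -> int:
    """Ejection duration: base time multiplied by how often the host was ejected."""
    return base_ejection_time_s * times_ejected


def max_ejectable_hosts(total_hosts: int, max_ejection_percent: int) -> int:
    """Simplified upper bound on ejected hosts (integer floor of the percentage)."""
    return total_hosts * max_ejection_percent // 100


# With base_ejection_time: 30s, repeated ejections of the same host last longer.
for n in (1, 2, 3):
    print(n, ejection_seconds(30, n))  # 1 30, 2 60, 3 90

# With max_ejection_percent: 40 and 5 hosts, at most 2 hosts are ejected.
print(max_ejectable_hosts(5, 40))  # 2
```

With the configuration in this test, a host ejected for the first time is therefore out of rotation for 30 seconds, and a second ejection of the same host would last 60 seconds.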

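To make the effect easier to observe, both the active health check interval and the host ejection time are increased here.

Routes

/502bad simulates a 502 error

Results

Simulate some 5xx requests and some 200 requests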

workers
envoy_1 | [2020-09-13 06:10:01.093][1][warning][main] [source/server/server.cc:537] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
webserver2_1 | [GIN] 2020/09/13 - 06:10:08 | 200 | 63.272µs | 172.22.0.7 | GET "/"
webserver5_1 | [GIN] 2020/09/13 - 06:10:10 | 200 | 46.732µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:11 | 200 | 45.43µs | 172.22.0.7 | GET "/"
webserver3_1 | [GIN] 2020/09/13 - 06:10:13 | 502 | 43.858µs | 172.22.0.7 | GET "/502bad"
webserver4_1 | [GIN] 2020/09/13 - 06:10:14 | 502 | 47.486µs | 172.22.0.7 | GET "/502bad"
webserver2_1 | [GIN] 2020/09/13 - 06:10:15 | 200 | 15.691µs | 172.22.0.7 | GET "/"
webserver5_1 | [GIN] 2020/09/13 - 06:10:16 | 200 | 14.719µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:16 | 200 | 15.758µs | 172.22.0.7 | GET "/"
webserver3_1 | [GIN] 2020/09/13 - 06:10:17 | 502 | 15.697µs | 172.22.0.7 | GET "/502bad"
webserver2_1 | [GIN] 2020/09/13 - 06:10:17 | 502 | 14.002µs | 172.22.0.7 | GET "/502bad"
webserver5_1 | [GIN] 2020/09/13 - 06:10:17 | 502 | 14.913µs | 172.22.0.7 | GET "/502bad"
webserver1_1 | [GIN] 2020/09/13 - 06:10:18 | 502 | 14.911µs | 172.22.0.7 | GET "/502bad"
webserver4_1 | [GIN] 2020/09/13 - 06:10:18 | 502 | 30.429µs | 172.22.0.7 | GET "/502bad"
webserver5_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 14.377µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 14.861µs | 172.22.0.7 | GET "/"
webserver2_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 18.924µs | 172.22.0.7 | GET "/"
webserver5_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 15.899µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 24.849µs | 172.22.0.7 | GET "/"

The cluster has ejected 40% of its nodes (2 of 5); their health check result is failed_outlier_check
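One way to look for the failed_outlier_check flag programmatically is to read the `/clusters` page of the Envoy admin interface, which the compose file above publishes on host port 82. The sketch below assumes the text output contains lines of the form `cluster::host::health_flags::flags`. The `SAMPLE` text and its IP addresses are invented for illustration, and `hosts_with_flag` and `fetch_clusters` are hypothetical helper names.

```python
from typing import Dict, List
from urllib.request import urlopen


def hosts_with_flag(clusters_text: str,
                    flag: str = "failed_outlier_check") -> Dict[str, List[str]]:
    """Map cluster name to the hosts whose health_flags contain `flag`."""
    result: Dict[str, List[str]] = {}
    for line in clusters_text.splitlines():
        parts = line.strip().split("::")
        # Only lines shaped like cluster::host::health_flags::<flags> matter.
        if len(parts) != 4 or parts[2] != "health_flags":
            continue
        cluster, host, _, flags = parts
        # Flags look like "healthy" or "/failed_active_hc/failed_outlier_check".
        if flag in flags.split("/"):
            result.setdefault(cluster, []).append(host)
    return result


def fetch_clusters(admin_url: str = "http://localhost:82/clusters") -> str:
    """Fetch the admin page; the URL assumes the 82:9901 mapping from the compose file."""
    with urlopen(admin_url, timeout=3) as resp:
        return resp.read().decode("utf-8")


# Invented sample output, so the parser can be tried without a running Envoy.
SAMPLE = """\
local_service::172.22.0.2:90::health_flags::healthy
local_service::172.22.0.3:90::health_flags::/failed_outlier_check
local_service::172.22.0.3:90::cx_active::0
local_service::172.22.0.4:90::health_flags::/failed_active_hc/failed_outlier_check
"""

print(hosts_with_flag(SAMPLE))
# {'local_service': ['172.22.0.3:90', '172.22.0.4:90']}
```

Against the running test environment, `hosts_with_flag(fetch_clusters())` would be the equivalent call; the parser does not handle IPv6 host addresses.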

Requests are now distributed to the remaining three nodes
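To see why exactly those nodes drop out, the (host, status) pairs from the log above can be replayed through a simple consecutive-5xx counter. This is a simplified model under stated assumptions, not Envoy's algorithm: it uses the threshold of 2 from `consecutive_5xx`, caps ejections at the integer floor of 40% of 5 hosts, and ignores ejection expiry because the log spans less than the 30-second base ejection time. The `replay` function is a hypothetical helper.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (host, status) pairs transcribed from the log above, in order.
LOG: List[Tuple[str, int]] = [
    ("webserver2", 200), ("webserver5", 200), ("webserver1", 200),
    ("webserver3", 502), ("webserver4", 502),
    ("webserver2", 200), ("webserver5", 200), ("webserver1", 200),
    ("webserver3", 502), ("webserver2", 502), ("webserver5", 502),
    ("webserver1", 502), ("webserver4", 502),
    ("webserver5", 200), ("webserver1", 200), ("webserver2", 200),
    ("webserver5", 200), ("webserver1", 200),
]


def replay(events: Iterable[Tuple[str, int]],
           consecutive_5xx: int = 2,
           total_hosts: int = 5,
           max_ejection_percent: int = 40) -> List[str]:
    """Return hosts in the order they reach the consecutive-5xx threshold."""
    streak = defaultdict(int)          # current run of 5xx responses per host
    ejected: List[str] = []
    limit = total_hosts * max_ejection_percent // 100
    for host, status in events:
        if host in ejected:
            continue
        if 500 <= status <= 599:
            streak[host] += 1
            if streak[host] >= consecutive_5xx and len(ejected) < limit:
                ejected.append(host)
        else:
            streak[host] = 0           # any non-5xx response resets the run
    return ejected


print(replay(LOG))  # ['webserver3', 'webserver4']
```

webserver1, webserver2, and webserver5 each return a single 502, but a following 200 resets their streak, which matches the log: only those three hosts keep receiving requests afterwards.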

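After 30 seconds, the ejected hosts have returned to normal

Simulate requests again

After 30 seconds, if no new requests arrive within the interval, the node still shows failed_outlier_check; it recovers once new requests arrive.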
