Merged
Changes from 1 commit
30 commits
cc1842f
Experiment with dropping metrics/labels
RobertLucian Jun 28, 2021
ee7c31a
Fix the dropping of metrics/labels for the node exporter
RobertLucian Jun 28, 2021
4ed459e
Drop unnecessary metrics/labels from kubelet
RobertLucian Jun 28, 2021
146c4e8
Drop unnecessary kube-state-metrics metrics/labels
RobertLucian Jun 28, 2021
2f26497
Drop unnecessary metrics/labels from DCGM exporter
RobertLucian Jun 28, 2021
39a639a
Remove unnecessary metrics/labels from istio
RobertLucian Jun 29, 2021
d4739a0
Change labeldrop to labelkeep
RobertLucian Jun 29, 2021
488af33
Add development docs
RobertLucian Jun 29, 2021
ccc80ce
Fixes for node-exporter & prom monitoring
RobertLucian Jun 29, 2021
b382a1f
Fixes to the nodes dashboard
RobertLucian Jun 29, 2021
6527b40
Add missing `le` label for `istio_request_duration_milliseconds_bucke…
RobertLucian Jun 30, 2021
35c5bb7
Add required label for kube-state-metrics exporter
RobertLucian Jun 30, 2021
6ba99b9
Fix batch grafana dashboard
RobertLucian Jun 30, 2021
1526346
Merge branch 'master' into fix/prometheus-oom
RobertLucian Jul 2, 2021
416bed9
Keep cortex_* metrics
RobertLucian Jul 2, 2021
f465bc1
Separate prometheus and operator workloads
RobertLucian Jul 2, 2021
4289fc6
Validate operator/prometheus node group quotas
RobertLucian Jul 2, 2021
a5be71f
Address cluster info pricing
RobertLucian Jul 2, 2021
d0b22fc
Let the node exporter run on every node
RobertLucian Jul 2, 2021
dc79cbf
Fix istio hpa
RobertLucian Jul 2, 2021
d2c26e5
Change resource requests/limits
RobertLucian Jul 2, 2021
25f5166
Have the prometheus instance type configurable
RobertLucian Jul 2, 2021
21a5d9c
Nits
RobertLucian Jul 2, 2021
65781cb
Update create.md
deliahu Jul 2, 2021
8fc36b8
Merge branch 'master' into fix/prometheus-oom
RobertLucian Jul 2, 2021
dcff049
Address some PR comments
RobertLucian Jul 2, 2021
f64d0fa
Address PR comments
RobertLucian Jul 2, 2021
5641de5
Merge branch 'master' into fix/prometheus-oom
RobertLucian Jul 3, 2021
b1fd479
Merge branch 'master' into fix/prometheus-oom
RobertLucian Jul 3, 2021
5496bc6
Merge branch 'master' into fix/prometheus-oom
RobertLucian Jul 3, 2021
Separate prometheus and operator workloads
RobertLucian committed Jul 2, 2021
commit f465bc111a66ee3661b3696d1c9754ff3c83c79c
26 changes: 22 additions & 4 deletions manager/generate_eks.py
@@ -305,20 +305,38 @@ def generate_eks(
return

operator_nodegroup = default_nodegroup(cluster_config)
+ # TODO validate requests when clustering up
operator_settings = {
"ami": get_ami(ami_map, "t3.medium"),
"name": "cx-operator",
"instanceType": "t3.medium",
- "minSize": 2,
- "maxSize": 2,
- "desiredCapacity": 2,
+ "minSize": 1,
+ "maxSize": 25,
+ "desiredCapacity": 1,
"volumeType": "gp3",
"volumeSize": 20,
"volumeIOPS": 3000,
"volumeThroughput": 125,
}
operator_nodegroup = merge_override(operator_nodegroup, operator_settings)

+ prometheus_nodegroup = default_nodegroup(cluster_config)
+ prometheus_settings = {
+ "ami": get_ami(ami_map, "t3.xlarge"),
+ "name": "cx-prometheus",
+ "instanceType": "t3.xlarge",
+ "minSize": 1,
+ "maxSize": 1,
+ "desiredCapacity": 1,
+ "volumeType": "gp3",
+ "volumeSize": 20,
+ "volumeIOPS": 3000,
+ "volumeThroughput": 125,
+ "labels": {"prometheus": "true"},
+ "taints": {"prometheus": "true:NoSchedule"},
+ }
+ prometheus_nodegroup = merge_override(prometheus_nodegroup, prometheus_settings)

worker_nodegroups = get_all_worker_nodegroups(ami_map, cluster_config)

nat_gateway = "Disable"
@@ -337,7 +355,7 @@ def generate_eks(
"tags": cluster_config["tags"],
},
"vpc": {"nat": {"gateway": nat_gateway}},
- "nodeGroups": [operator_nodegroup] + worker_nodegroups,
+ "nodeGroups": [operator_nodegroup, prometheus_nodegroup] + worker_nodegroups,
"addons": [
{
"name": "vpc-cni",
6 changes: 3 additions & 3 deletions manager/manifests/cluster-autoscaler.yaml.j2
@@ -169,11 +169,11 @@ spec:
name: cluster-autoscaler
resources:
limits:
- cpu: 100m
- memory: 300Mi
+ cpu: 300m
+ memory: 1Gi
requests:
cpu: 100m
- memory: 300Mi
+ memory: 200Mi
command:
- ./cluster-autoscaler
- --v=4
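The cluster-autoscaler values in this hunk mix CPU millicores (`300m`) with binary memory suffixes (`1Gi`, `200Mi`). A minimal sketch of how such quantities compare, assuming only the common suffixes matter; `parse_cpu`, `parse_memory`, and `requests_within_limits` are hypothetical helpers written for illustration, not part of this repository or of any Kubernetes client library:

```python
from fractions import Fraction

# Binary suffixes are listed first so that "Mi" is matched before "M".
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Fraction:
    """Return CPU cores; a trailing "m" means millicores (300m == 0.3 cores)."""
    if quantity.endswith("m"):
        return Fraction(quantity[:-1]) / 1000
    return Fraction(quantity)


def parse_memory(quantity: str) -> Fraction:
    """Return bytes for a memory quantity such as "200Mi" or "1.75Gi"."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return Fraction(quantity[: -len(suffix)]) * factor
    return Fraction(quantity)  # plain number of bytes


def requests_within_limits(requests: dict, limits: dict) -> bool:
    """Check that each request does not exceed the corresponding limit."""
    return (
        parse_cpu(requests["cpu"]) <= parse_cpu(limits["cpu"])
        and parse_memory(requests["memory"]) <= parse_memory(limits["memory"])
    )


# Values taken from the cluster-autoscaler hunk after this commit.
new_requests = {"cpu": "100m", "memory": "200Mi"}
new_limits = {"cpu": "300m", "memory": "1Gi"}
print(requests_within_limits(new_requests, new_limits))
```

`Fraction` is used instead of `float` so that values such as `1.75Gi` convert to an exact byte count.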
3 changes: 3 additions & 0 deletions manager/manifests/fluent-bit.yaml.j2
@@ -249,3 +249,6 @@ spec:
- key: workload
operator: Exists
effect: NoSchedule
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
6 changes: 6 additions & 0 deletions manager/manifests/grafana/grafana.yaml
@@ -173,6 +173,12 @@ spec:
- name: grafana-dashboard-nodes
configMap:
name: grafana-dashboard-nodes
+ nodeSelector:
+ prometheus: "true"
+ tolerations:
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
17 changes: 10 additions & 7 deletions manager/manifests/istio.yaml.j2
@@ -25,8 +25,8 @@ spec:
k8s:
resources:
requests:
- cpu: 200m # default is 500m
- memory: 1.75Gi # default is 2048Mi == 2Gi
+ cpu: 100m # default is 500m
+ memory: 200Mi # default is 2048Mi == 2Gi
cni:
enabled: false
ingressGateways:
@@ -74,7 +74,7 @@ spec:
cpu: 100m
memory: 128Mi
limits:
- cpu: 2000m
+ cpu: 1000m
memory: 1024Mi
replicaCount: 1
hpaSpec:
@@ -132,20 +132,23 @@ spec:
targetPort: 15443
resources:
requests:
- cpu: 200m
+ cpu: 300m
memory: 128Mi
limits:
- cpu: 2000m
+ cpu: 1500m
memory: 1024Mi
replicaCount: 1
hpaSpec:
minReplicas: 1
- maxReplicas: 1 # edit autoscaleEnabled in values if increasing this
+ maxReplicas: 100 # edit autoscaleEnabled in values if increasing this
metrics:
- type: Resource
resource:
name: cpu
- targetAverageUtilization: 80
+ targetAverageUtilization: 90
+ resource:
+ name: mem
+ targetAverageUtilization: 90
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
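The `hpaSpec` change above raises `maxReplicas` from 1 to 100 and moves the utilization target to 90. A rough sketch of the replica calculation the Horizontal Pod Autoscaler documents, `ceil(currentReplicas * currentValue / targetValue)`, with the default 10% tolerance and the largest proposal winning when several metrics are configured. The function names and sample utilization figures are hypothetical, and utilization is taken as a percentage of the pods' resource requests:

```python
import math
from typing import Iterable, Tuple


def desired_for_metric(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count proposed by a single metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_replicas(
    current_replicas: int,
    metrics: Iterable[Tuple[float, float]],  # (current, target) per metric
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Take the largest per-metric proposal, clamped to [min, max]."""
    proposals = [
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical load: one gateway pod running at 180% of its CPU request.
print(desired_replicas(1, [(180, 90)], min_replicas=1, max_replicas=100))
# Same load under the previous settings (target 80, maxReplicas 1).
print(desired_replicas(1, [(180, 80)], min_replicas=1, max_replicas=1))
```

With the previous `maxReplicas: 1` the clamp always wins, so the gateway could not scale out regardless of load.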
4 changes: 2 additions & 2 deletions manager/manifests/operator.yaml.j2
@@ -58,10 +58,10 @@ spec:
imagePullPolicy: Always
resources:
requests:
- cpu: 200m
+ cpu: 100m
memory: 128Mi
limits:
- cpu: 2000m
+ cpu: 1500m
memory: 1024Mi
ports:
- containerPort: 8888
6 changes: 6 additions & 0 deletions manager/manifests/prometheus-kube-state-metrics.yaml
@@ -234,6 +234,12 @@ spec:
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
+ nodeSelector:
+ prometheus: "true"
+ tolerations:
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
6 changes: 6 additions & 0 deletions manager/manifests/prometheus-monitoring.yaml.j2
@@ -30,6 +30,12 @@ metadata:
spec:
image: {{ config['image_prometheus'] }}
serviceAccountName: prometheus
+ nodeSelector:
+ prometheus: "true"
+ tolerations:
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
podMonitorSelector:
matchExpressions:
- key: "monitoring.cortex.dev"
5 changes: 4 additions & 1 deletion manager/manifests/prometheus-node-exporter.yaml
@@ -153,12 +153,15 @@ spec:
hostPID: true
nodeSelector:
kubernetes.io/os: linux
+ prometheus: "true"
securityContext:
runAsNonRoot: true
runAsUser: 65534
serviceAccountName: node-exporter
tolerations:
- - operator: Exists
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
volumes:
- hostPath:
path: /sys
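This commit pairs a `prometheus: "true"` label and a `prometheus=true:NoSchedule` taint on the `cx-prometheus` node group with a matching `nodeSelector` and toleration in the monitoring manifests. The sketch below models the standard Kubernetes matching rules in simplified form to show how the two interact. The `Taint`, `Toleration`, and `can_schedule` names are hypothetical, and the `workload=true:NoSchedule` worker taint is an assumption (the diff only shows that fluent-bit tolerates a `workload` key):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: Optional[str] = None     # None with operator "Exists" matches any key
    operator: str = "Equal"
    value: Optional[str] = None
    effect: Optional[str] = None  # None matches any effect

    def tolerates(self, taint: Taint) -> bool:
        if self.effect is not None and self.effect != taint.effect:
            return False
        if self.key is None:
            return self.operator == "Exists"
        if self.key != taint.key:
            return False
        return self.operator == "Exists" or self.value == taint.value


def can_schedule(
    node_labels: Dict[str, str],
    node_taints: List[Taint],
    node_selector: Dict[str, str],
    tolerations: List[Toleration],
) -> bool:
    """True if the selector matches and every blocking taint is tolerated."""
    labels_ok = all(node_labels.get(k) == v for k, v in node_selector.items())
    taints_ok = all(
        any(t.tolerates(taint) for t in tolerations)
        for taint in node_taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )
    return labels_ok and taints_ok


prom_labels = {"prometheus": "true"}
prom_taints = [Taint("prometheus", "true", "NoSchedule")]
worker_labels = {"workload": "true"}                    # assumed
worker_taints = [Taint("workload", "true", "NoSchedule")]  # assumed

selector = {"prometheus": "true"}
keyed = [Toleration(key="prometheus", operator="Exists", effect="NoSchedule")]
blanket = [Toleration(operator="Exists")]

print(can_schedule(prom_labels, prom_taints, selector, keyed))
print(can_schedule(worker_labels, worker_taints, selector, keyed))
print(can_schedule(worker_labels, worker_taints, {}, blanket))
```

Under these rules, the node-exporter hunk above narrows where its pods can land: the keyed toleration no longer covers a `workload` taint, and the added `nodeSelector` entry restricts them to nodes labeled `prometheus: "true"`.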
5 changes: 5 additions & 0 deletions manager/manifests/prometheus-operator.yaml
@@ -14199,6 +14199,11 @@ spec:
allowPrivilegeEscalation: false
nodeSelector:
kubernetes.io/os: linux
+ prometheus: "true"
+ tolerations:
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
securityContext:
runAsNonRoot: true
runAsUser: 65534
6 changes: 6 additions & 0 deletions manager/manifests/prometheus-statsd-exporter.yaml
@@ -74,6 +74,12 @@ spec:
volumeMounts:
- name: statsd-mapping-config
mountPath: /etc/prometheus-statsd-exporter
+ nodeSelector:
+ prometheus: "true"
+ tolerations:
+ - key: prometheus
+ operator: Exists
+ effect: NoSchedule
volumes:
- name: statsd-mapping-config
configMap:
8 changes: 4 additions & 4 deletions pkg/crds/config/manager/manager.yaml
@@ -46,11 +46,11 @@ spec:
periodSeconds: 10
resources:
limits:
- cpu: 100m
- memory: 30Mi
+ cpu: 200m
+ memory: 100Mi
requests:
- cpu: 100m
- memory: 20Mi
+ cpu: 200m
+ memory: 80Mi
volumeMounts:
- mountPath: /mnt/cluster.yaml
name: cluster-config