Commit 1d5dddd

DOC-771 | Review and update production checklist (#784)
* review and update production checklist
* review
* add note on enabling RocksDB statistics and other minor changes
* added more best practices for the production checklist via mamoona
1 parent 14e576c commit 1d5dddd

2 files changed

+230
-80
lines changed

site/content/3.12/deploy/production-checklist.md

Lines changed: 115 additions & 40 deletions
@@ -10,34 +10,89 @@ have been performed on your production system before you go live.
1010

1111
## Operating System
1212

13-
- Executed the OS optimization scripts if you run ArangoDB on Linux.
13+
- Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux.
1414
See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages
1515
[Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and
1616
[Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details.
1717

18-
- OS monitoring is in place
19-
(most common metrics, e.g. disk, CPU, RAM utilization).
18+
- Ensure your OS is compatible with your ArangoDB version
19+
and keep it up to date at all times for security and stability.
20+
21+
- OS monitoring is in place with specific alerting thresholds:
22+
- **Disk usage**: Alert when reaching 60% (red line threshold).
23+
- **CPU usage**: Alert when reaching 90% (red line threshold).
24+
- **Memory usage**: Alert when reaching 85% (red line threshold).
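
The red-line thresholds above could be encoded in a monitoring check roughly as follows. This is only a minimal Python sketch: `THRESHOLDS` and `breached` are hypothetical names chosen for illustration and are not part of ArangoDB or any monitoring tool, and the utilization figures are assumed to arrive as percentages from your own collector.

```python
# Red-line thresholds from the checklist, in percent utilization.
THRESHOLDS = {"disk": 60.0, "cpu": 90.0, "memory": 85.0}


def breached(usage):
    """Return the sorted names of resources at or above their red line.

    `usage` maps a resource name to its current utilization in percent.
    Resources without a configured threshold are ignored.
    """
    return sorted(
        name
        for name, value in usage.items()
        if name in THRESHOLDS and value >= THRESHOLDS[name]
    )
```

For example, `breached({"disk": 61.0, "cpu": 50.0, "memory": 85.0})` reports `disk` and `memory`, because "reaching" a threshold is treated here as greater than or equal to it.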
2025

2126
- Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations.
2227

2328
## ArangoDB
2429

25-
- The user _root_ is not used to run any ArangoDB processes
30+
- **Use the latest versions**: Deploy the latest version series
31+
of ArangoDB to benefit from performance improvements and security fixes.
32+
33+
- **Testing environments**: Use QA environments and UAT (User Acceptance Testing)
34+
to test all changes, in particular queries, before going live with production deployments.
35+
36+
### Security
37+
38+
- Create a dedicated system user and group (e.g., "arango")
39+
to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes
2640
(if you run ArangoDB on Linux).
2741

42+
- **Access control**: Restrict access to the deployment to authorized personnel only.
43+
Implement proper authentication and authorization mechanisms.
44+
45+
- **JWT authentication**: Enable JWT authentication
46+
for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details.
47+
48+
- **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md)
49+
for sensitive data. Make sure to safely store any secret keys you create for this.
50+
51+
### Logging and Monitoring
52+
2853
- The _arangod_ (server) process and the _arangodb_ (_Starter_) process
2954
(if in use) have some form of logging enabled and logs can easily be
3055
located and inspected.
31-
32-
- *Memory considerations*
33-
- If you run multiple processes (e.g. DB-Server and Coordinator) on a single
34-
machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
35-
environment variable accordingly.
36-
- For versions prior to 3.8, make sure to change the
37-
[`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit)
38-
query option according to the node size and workload.
39-
- Disable swap space to avoid slowdown which can result in servers being incorrectly
40-
detected as failed.
56+
57+
- **Third-party monitoring**: Configure third-party metrics monitoring tools like
58+
Grafana with Prometheus to monitor ArangoDB metrics comprehensively.
59+
60+
- **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring:
61+
- Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints
62+
- Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics
63+
- Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking
64+
- Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint
65+
- See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information
66+
67+
- **Enable RocksDB statistics**: Consider setting [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics.
68+
69+
- Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines:
70+
- Disk usage: 60% (red line)
71+
- CPU usage: 90% (red line)
72+
- Memory usage: 85% (red line)
73+
74+
### Memory
75+
76+
- For DB-Servers and Coordinators, override the
77+
[`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
78+
environment variable using this rule of thumb:
79+
- Multiply available memory by 0.9 to leave headroom for OS/Kubernetes, client connections, etc.
80+
- Use 3/4 of that value for DB-Servers.
81+
- Use 1/4 of that value for Coordinators.
82+
- Agents typically don't need much memory and can use the remaining 10% headroom.
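
The rule of thumb above can be worked through as a short calculation. The following Python sketch assumes that a DB-Server and a Coordinator share one machine and that the override is expressed as a byte count. The function name `memory_overrides` is hypothetical and only illustrates the arithmetic.

```python
def memory_overrides(available_bytes):
    """Split a machine's memory following the checklist's rule of thumb.

    Integer arithmetic is used so the results are exact byte counts.
    """
    usable = available_bytes * 9 // 10  # keep 10% for OS/Kubernetes, clients, Agents
    return {
        "dbserver": usable * 3 // 4,   # 3/4 of the usable memory
        "coordinator": usable // 4,    # 1/4 of the usable memory
    }
```

With 100 GiB available, this yields 67.5 GiB for the DB-Server and 22.5 GiB for the Coordinator. Each result would then be set as that process's `ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY` value.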
83+
84+
- Note that if ArangoDB "sees" x GB of memory in a pod,
85+
it will try to use those x GB. Memory accounting has been vastly improved in 3.12,
86+
but overshooting in certain cases may still occur.
87+
88+
- Disable swap space to avoid slowdown which can result in servers being incorrectly
89+
detected as failed.
90+
91+
- **Query memory limits**: Configure appropriate memory limits for AQL queries:
92+
- Set [`--query.max-memory-per-query`](../components/arangodb-server/options.md#--querymax-memory-per-query) to limit memory usage per individual query.
93+
- Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries.
94+
95+
### Service Management
4196

4297
- Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically
4398
you would use the Kubernetes operator or use systemd to launch the _Starter_.
@@ -50,36 +105,56 @@ have been performed on your production system before you go live.
50105
update-rc.d -f arangodb3 remove
51106
```
52107

53-
- If you have deployed a Cluster, the _replication factor_ and
54-
_minimal_replication_factor_ of your collections
55-
are set to a value equal or higher than 2, otherwise you run the risk of
56-
losing data in case of a node failure. See
57-
[cluster startup options](../components/arangodb-server/options.md#cluster).
58-
59-
- *Disk Performance considerations*
60-
- Verify that your **storage performance** is at least 100 IOPS for each
61-
volume in production mode. This is the bare minimum and it's recommended to
62-
provide more for performance. It is probably only a concern if you use a
63-
cloud infrastructure. Note that IOPS might be allotted based on a volume size,
64-
so make sure to check your storage provider for details. Furthermore, you should
65-
be careful with burst mode guarantees as ArangoDB requires a sustainable
66-
high IOPS rate.
67-
68-
- The considerations should be given to an IO bandwidth (especially considering
69-
RocksDB write-amplification which can easily be 10x or more).
70-
71-
- Whenever possible use **block storage**. Database data is based on append
72-
operations, so filesystem which support this should be used for best
73-
performance. We would not recommend to use NFS for performance reasons,
108+
### Cluster Configuration
109+
110+
- **Replication configuration**: For production clusters, configure collections with:
111+
- _replication factor_ of 3 for optimal data availability and fault tolerance.
112+
- _minimal_replication_factor_ set to a value equal to or higher than 2.
113+
- _writeConcern_ of 2.
114+
See [cluster startup options](../components/arangodb-server/options.md#cluster).
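
A check along these lines could flag collections that fall short of the recommended values. This is a hedged Python sketch: `replication_issues` is a hypothetical helper, and it assumes the collection properties have already been fetched as a dictionary with `replicationFactor` and `writeConcern` keys. How you obtain that dictionary depends on your driver or HTTP client.

```python
def replication_issues(properties, min_replication=3, min_write_concern=2):
    """Return a list of human-readable findings for one collection.

    `properties` is assumed to be a dict of collection properties.
    Non-integer values (for example a special string marker) are skipped
    rather than guessed at, and missing keys are treated as 1.
    """
    issues = []
    replication = properties.get("replicationFactor", 1)
    if isinstance(replication, int) and replication < min_replication:
        issues.append(f"replicationFactor {replication} is below {min_replication}")
    write_concern = properties.get("writeConcern", 1)
    if isinstance(write_concern, int) and write_concern < min_write_concern:
        issues.append(f"writeConcern {write_concern} is below {min_write_concern}")
    return issues
```

An empty list means the collection meets both recommendations. Treating a missing key as 1 is a conservative assumption of this sketch, not documented server behavior.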
115+
116+
- **Shard limits**: Keep the total number of shards below 10,000 across your cluster
117+
to maintain optimal performance and avoid resource exhaustion.
118+
119+
### Disk Performance
120+
121+
- **Storage performance**: Verify that your storage performance is at least 100 IOPS for each
122+
volume in production mode. This is the bare minimum and it's recommended to
123+
provide more for performance. It is probably only a concern if you use a
124+
cloud infrastructure. Note that IOPS might be allotted based on a volume size,
125+
so make sure to check your storage provider for details. Furthermore, you should
126+
be careful with burst mode guarantees as ArangoDB requires a sustainable
127+
high IOPS rate.
128+
129+
- **DB-Server storage limit**: Keep individual DB-Server storage below 2TB per server to maintain optimal performance.
130+
131+
- **I/O bandwidth**: Take I/O bandwidth into account, especially considering
132+
RocksDB write-amplification which can easily be 10x or more.
133+
134+
- **Block storage**: Whenever possible use block storage. Database data is based on append
135+
operations, so filesystems which support this should be used for best
136+
performance. ArangoDB does not recommend using NFS for performance reasons,
74137
furthermore we experienced some issues with hard links required for
75138
Hot Backup.
76139

77-
- Verify your **Backup** and restore procedures are working.
140+
### Backup and Recovery
141+
142+
- **Test restore procedures**: Verify your backup and restore procedures are working.
143+
**TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures.
144+
145+
- **Hot Backup frequency**: Take Hot Backups with a frequency that matches your
146+
RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements.
147+
148+
- **arangodump backups**: Take backups with arangodump from time to time as an
149+
additional backup strategy alongside Hot Backups.
78150

79-
- Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md).
80-
Make sure to safely store any secret keys you create for this.
151+
- **Secure backup storage**: Store backups in a secure, separate location from your
152+
production systems. Use encrypted storage and ensure backups are geographically
153+
distributed to protect against regional disasters. Implement proper access controls
154+
for backup storage locations.
81155

82-
- Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana).
156+
- **Retry mechanisms**: Implement retries with exponential backoff and jitter in your applications
157+
when connecting to ArangoDB to handle temporary network issues and failovers gracefully.
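
One common way to implement such retries is exponential backoff with "full jitter", where each wait is a random duration between zero and an exponentially growing cap. The Python sketch below is illustrative only: `retry_with_backoff` and its parameters are hypothetical, the default values are arbitrary, and `operation` stands in for whatever call your application makes to ArangoDB.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    After each failed attempt, wait a random time between 0 and
    min(max_delay, base_delay * 2**attempt) seconds before retrying.
    The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The random component spreads out reconnect attempts from many clients, so they do not all hit a freshly failed-over server at the same instant. The `sleep` parameter exists only so the wait can be replaced in tests.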
83158

84159
## Kubernetes Operator (kube-arangodb)
85160

@@ -89,4 +164,4 @@ have been performed on your production system before you go live.
89164
- The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming)
90165
of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion.
91166

92-
- Use native networking whenever possible to reduce delays.
167+
- Use native networking whenever possible to reduce delays.
