@@ -10,34 +10,89 @@ have been performed on your production system before you go live.
1010
1111## Operating System  
1212
13- -  Executed the OS  optimization scripts if you run ArangoDB on Linux.
13+ -  Executed the operating system (OS)  optimization scripts if you run ArangoDB on Linux.
1414 See [ Installing ArangoDB on Linux] ( ../operations/installation/linux/_index.md )  and its sub pages
1515 [ Linux Operating System Configuration] ( ../operations/installation/linux/operating-system-configuration.md )  and
1616 [ Linux OS Tuning Script Examples] ( ../operations/installation/linux/linux-os-tuning-script-examples.md )  for details.
1717
18- -  OS monitoring is in place
19-  (most common metrics, e.g. disk, CPU, RAM utilization).
18+ -  Ensure your OS is compatible with your ArangoDB version
19+  and keep it up to date at all times for security and stability.
20+ 
21+ -  OS monitoring is in place with specific alerting thresholds:
22+  -  ** Disk usage** : Alert when reaching 60% (red line threshold).
23+  -  ** CPU usage** : Alert when reaching 90% (red line threshold).
24+  -  ** Memory usage** : Alert when reaching 85% (red line threshold).
2025
2126-  Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations.
2227
2328## ArangoDB  
2429
25- -  The user _ root_  is not used to run any ArangoDB processes
30+ -  ** Use the latest versions** : Deploy the latest version series
31+  of ArangoDB to benefit from performance improvements and security fixes.
32+ 
33+ -  ** Testing environments** : Use QA environments and UAT (User Acceptance Testing)
34+  to test all changes, in particular queries, before going live with production deployments.
35+ 
36+ ### Security  
37+ 
38+ -  Create a dedicated system user and group (e.g., "arango")
39+  to run ArangoDB processes. Never use the _ root_  user to run any ArangoDB processes
2640 (if you run ArangoDB on Linux).
2741
42+ -  ** Access control** : Restrict access to the deployment to authorized personnel only.
43+  Implement proper authentication and authorization mechanisms.
44+ 
45+ -  ** JWT authentication** : Enable JWT authentication
46+  for production deployments. See [ JWT authentication] ( ../develop/http-api/authentication.md#jwt-user-tokens )  for more details.
47+ 
48+ -  ** Encryption** : Enable [ Encryption at Rest] ( ../operations/security/encryption-at-rest.md ) 
49+  for sensitive data. Make sure to safely store any secret keys you create for this.
50+ 
51+ ### Logging and Monitoring  
52+ 
2853-  The _ arangod_  (server) process and the _ arangodb_  (_ Starter_ ) process
2954 (if in use) have some form of logging enabled and logs can easily be
3055 located and inspected.
31-  
32- -  * Memory considerations* 
33-  -  If you run multiple processes (e.g. DB-Server and Coordinator) on a single
34-  machine, adjust the [ ` ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY ` ] ( ../components/arangodb-server/environment-variables.md ) 
35-  environment variable accordingly.
36-  -  For versions prior to 3.8, make sure to change the
37-  [ ` --query.memory-limit ` ] ( ../components/arangodb-server/options.md#--querymemory-limit ) 
38-  query option according to the node size and workload.
39-  -  Disable swap space to avoid slowdown which can result in servers being incorrectly 
40-  detected as failed.
56+ 
57+ -  ** Third-party monitoring** : Configure third-party metrics monitoring tools like
58+  Grafana with Prometheus to monitor ArangoDB metrics comprehensively.
59+ 
60+ -  ** Configure metrics collection** : Enable the ArangoDB metrics API for production monitoring:
61+  -  Set [ ` --server.export-metrics-api ` ] ( ../components/arangodb-server/options.md#--serverexport-metrics-api )  to ` true `  to enable the metrics endpoints
62+  -  Enable [ ` --server.export-read-write-metrics ` ] ( ../components/arangodb-server/options.md#--serverexport-read-write-metrics )  for additional document read/write metrics
63+  -  Consider enabling [ ` --server.export-shard-usage-metrics ` ] ( ../components/arangodb-server/options.md#--serverexport-shard-usage-metrics )  for detailed shard usage tracking
64+  -  Configure your monitoring system (Prometheus/Grafana) to scrape the ` /_admin/metrics/v2 `  endpoint
65+  -  See [ HTTP interface for server metrics] ( ../develop/http-api/monitoring/metrics.md )  for detailed information
66+ 
67+ -  ** Enable RocksDB statistics** : Consider enabling [ ` --rocksdb.enable-statistics ` ] ( ../components/arangodb-server/options.md#--rocksdbenable-statistics )  to ` true `  for detailed RocksDB performance metrics.
68+ 
69+ -  Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines:
70+  -  Disk usage: 60% (red line)
71+  -  CPU usage: 90% (red line)
72+  -  Memory usage: 85% (red line)
73+ 
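The alerting thresholds listed above can be expressed as a small check. This is only an illustrative sketch: the `THRESHOLDS` table and the `breached_thresholds` helper are hypothetical names, not part of ArangoDB or any monitoring tool, and the usage percentages are assumed to come from whatever monitoring system you already run.

```python
# Red-line thresholds from the checklist, in percent.
THRESHOLDS = {"disk": 60.0, "cpu": 90.0, "memory": 85.0}


def breached_thresholds(usage_percent):
    """Return the sorted names of resources at or above their red line.

    `usage_percent` maps a resource name ("disk", "cpu", "memory") to its
    current utilization in percent. Unknown resource names are ignored.
    """
    return sorted(
        name
        for name, value in usage_percent.items()
        if name in THRESHOLDS and value >= THRESHOLDS[name]
    )


if __name__ == "__main__":
    sample = {"disk": 61.5, "cpu": 42.0, "memory": 85.0}
    for name in breached_thresholds(sample):
        print(f"ALERT: {name} usage is at or above {THRESHOLDS[name]}%")
```

In practice, such rules would more likely live in the alerting configuration of your monitoring stack than in application code.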
74+ ### Memory  
75+ 
76+ -  For DB-Servers and Coordinators, override the
77+  [ ` ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY ` ] ( ../components/arangodb-server/environment-variables.md ) 
78+  environment variable using this rule of thumb:
79+  - Multiply available memory by 0.9 to leave headroom for OS/Kubernetes, client connections, etc.
80+  - Use 3/4 of that value for DB-Servers.
81+  - Use 1/4 of that value for Coordinators.
82+  - Agents typically don't need much memory and can use the remaining 10% headroom.
83+ 
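The rule of thumb above can be worked through as a short calculation. This is a sketch under assumptions: the function name `suggested_memory_overrides` is hypothetical, the values are plain byte counts, and the 3/4 to 1/4 split is read here as applying to a DB-Server and a Coordinator that share one machine or pod.

```python
def suggested_memory_overrides(available_bytes):
    """Apply the 0.9 / (3/4, 1/4) rule of thumb to an amount of memory.

    Integer arithmetic is used so the results are exact byte counts.
    """
    usable = available_bytes * 9 // 10  # keep 10% back as headroom
    return {
        "dbserver": usable * 3 // 4,    # 3/4 of the usable memory
        "coordinator": usable // 4,     # 1/4 of the usable memory
    }


if __name__ == "__main__":
    total = 64 * 1024**3  # assumed example: a machine with 64 GiB of RAM
    for role, value in suggested_memory_overrides(total).items():
        # Each value would be set in the environment of the matching process.
        print(f"{role}: ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY={value}")
```

The remaining 10% is not assigned to any override; per the checklist it covers the OS, client connections, and Agents.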
84+ -  Note that if ArangoDB "sees" x GB of memory in a pod,
85+  it will try to use those x GB. Memory accounting has been vastly improved in 3.12,
86+  but overshooting in certain cases may still occur.
87+ 
88+ -  Disable swap space to avoid slowdown which can result in servers being incorrectly 
89+  detected as failed.
90+ 
91+ -  ** Query memory limits** : Configure appropriate memory limits for AQL queries:
92+  -  Set [ ` --query.max-memory-per-query ` ] ( ../components/arangodb-server/options.md#--querymax-memory-per-query )  to limit memory usage per individual query.
93+  -  Consider setting [ ` --query.global-memory-limit ` ] ( ../components/arangodb-server/options.md#--queryglobal-memory-limit )  to limit total memory used by all concurrent queries.
94+ 
95+ ### Service Management  
4196
4297-  Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically
4398 you would use the Kubernetes operator or use systemd to launch the _ Starter_ .
@@ -50,36 +105,56 @@ have been performed on your production system before you go live.
50105 update-rc.d -f arangodb3 remove 
51106  ``` 
52107
53- -  If you have deployed a Cluster, the _ replication factor_  and 
54-  _ minimal_replication_factor_  of your collections
55-  are set to a value equal or higher than 2, otherwise you run the risk of
56-  losing data in case of a node failure. See
57-  [ cluster startup options] ( ../components/arangodb-server/options.md#cluster ) .
58- 
59- -  * Disk Performance considerations* 
60-  -  Verify that your ** storage performance**  is at least 100 IOPS for each
61-  volume in production mode. This is the bare minimum and it's recommended to
62-  provide more for performance. It is probably only a concern if you use a
63-  cloud infrastructure. Note that IOPS might be allotted based on a volume size,
64-  so make sure to check your storage provider for details. Furthermore, you should
65-  be careful with burst mode guarantees as ArangoDB requires a sustainable
66-  high IOPS rate. 
67- 
68-  -  The considerations should be given to an IO bandwidth (especially considering 
69-  RocksDB write-amplification which can easily be 10x or more).
70- 
71- -  Whenever possible use ** block storage** . Database data is based on append
72-  operations, so filesystem which support this should be used for best
73-  performance. We would not recommend to use NFS for performance reasons,
108+ ### Cluster Configuration  
109+ 
110+ -  ** Replication configuration** : For production clusters, configure collections with:
111+  -  _ replication factor_  of 3 for optimal data availability and fault tolerance.
112+  -  _ minimal_replication_factor_  of a value equal or higher than 2.
113+  -  _ writeConcern_  of 2.
114+  See [ cluster startup options] ( ../components/arangodb-server/options.md#cluster ) .
115+ 
116+ -  ** Shard limits** : Keep the total number of shards below 10,000 across your cluster
117+  to maintain optimal performance and avoid resource exhaustion.
118+ 
119+ ### Disk Performance  
120+ 
121+ -  ** Storage performance** : Verify that your storage performance is at least 100 IOPS for each
122+  volume in production mode. This is the bare minimum and it's recommended to
123+  provide more for performance. It is probably only a concern if you use a
124+  cloud infrastructure. Note that IOPS might be allotted based on a volume size,
125+  so make sure to check your storage provider for details. Furthermore, you should
126+  be careful with burst mode guarantees as ArangoDB requires a sustainable
127+  high IOPS rate.
128+ 
129+ -  ** DB-Server storage limit** : Keep individual DB-Server storage below 2TB per server to maintain optimal performance.
130+ 
131+ - **I/O bandwidth**: Take I/O bandwidth into account, especially given
132+  RocksDB write amplification, which can easily be 10x or more.
133+ 
134+ -  ** Block storage** : Whenever possible use block storage. Database data is based on append
135+  operations, so filesystems which support this should be used for best
136+  performance. ArangoDB does not recommend using NFS for performance reasons,
74137 furthermore we experienced some issues with hard links required for
75138 Hot Backup.
76139
77- -  Verify your ** Backup**  and restore procedures are working.
140+ ### Backup and Recovery  
141+ 
142+ - **Test restore procedures**: Verify that your backup and restore procedures work,
143+  and test the restore procedure regularly to ensure you can recover from failures.
144+ 
145+ -  ** Hot Backup frequency** : Take Hot Backups with a frequency that matches your
146+  RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements.
147+ 
148+ -  ** arangodump backups** : Take backups with arangodump from time to time as an
149+  additional backup strategy alongside Hot Backups.
78150
79- -  Consider enabling [ Encryption at Rest] ( ../operations/security/encryption-at-rest.md ) .
80-  Make sure to safely store any secret keys you create for this.
151+ -  ** Secure backup storage** : Store backups in a secure, separate location from your
152+  production systems. Use encrypted storage and ensure backups are geographically
153+  distributed to protect against regional disasters. Implement proper access controls
154+  for backup storage locations.
81155
82- -  Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana).
156+ - **Retry mechanisms**: Implement retries with exponential backoff and jitter in your applications
157+  when connecting to ArangoDB to handle temporary network issues and failovers gracefully.
83158
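A retry loop with exponentially growing, jittered delays, as recommended in the last item above, might look roughly like the following. This is a generic sketch, not an ArangoDB driver API: `call_with_retry`, its parameters, and the choice of `ConnectionError` and `TimeoutError` as retryable errors are assumptions for illustration. A real application would retry on whatever transient errors its driver raises.

```python
import random
import time


def call_with_retry(operation, attempts=5, base=0.1, cap=10.0,
                    retry_on=(ConnectionError, TimeoutError),
                    sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    After the n-th failure (n starting at 0) the wait is drawn uniformly
    from [0, min(cap, base * 2**n)] seconds ("full jitter"), so many
    clients retrying at once do not hit the server in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Usage would be along the lines of `call_with_retry(lambda: run_query(db))`, where `run_query` stands in for your own function that talks to the database. The `sleep` parameter exists mainly so the delay can be replaced in tests.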
84159## Kubernetes Operator (kube-arangodb)  
85160
@@ -89,4 +164,4 @@ have been performed on your production system before you go live.
89164-  The [ ** ReclaimPolicy** ] ( https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming ) 
90165 of your persistent volumes should be set to ` Retain `  to prevent volumes from premature deletion.
91166
92- -  Use native networking whenever possible to reduce delays.
167+ -  Use native networking whenever possible to reduce delays.