---
name: Formal RFC
about: Submit a formal Request For Comments for consideration by the team.
title: 'Replicator Implementation On FDB'
labels: rfc, discussion
assignees: 'vatamane@apache.org'

---

# Introduction

This document describes the design of the replicator application for CouchDB
4.x. The replicator will rely on `couch_jobs` for centralized scheduling and
monitoring of replication jobs.

## Abstract

Replication jobs can be created from documents in `_replicator` databases, or
by `POST`-ing requests to the HTTP `/_replicate` endpoint. Previously, in
CouchDB <= 3.x, replication jobs were mapped to individual cluster nodes and a
scheduler component would run up to `max_jobs` number of jobs at a time on each
node. The new design proposes using `couch_jobs`, as described in the
[Background Jobs
RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md),
to have a central, FDB-based queue of replication jobs. The `couch_jobs`
application will manage job scheduling and coordination. The new design also
proposes using heterogeneous node types, as defined in the [Node Types
RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md),
such that replication jobs will be created only on `api_frontend` nodes and run
only on `replication` nodes.

## Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in
[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).

## Terminology

`_replicator` databases : A database that is either named `_replicator` or ends
with the `/_replicator` suffix.

`transient` replications : Replication jobs created by `POST`-ing to the
`/_replicate` endpoint.

`persistent` replications : Replication jobs defined in a document in a
`_replicator` database.

`continuous` replications : Replication jobs created with the `"continuous":
true` parameter. These jobs will try to run continuously until the user removes
them. They may be temporarily paused to allow other jobs to make progress.

`one-shot` replications : Replication jobs which are not `continuous`. If the
`"continuous": true` parameter is not specified, by default, replication jobs
will be `one-shot`. These jobs will try to run until they reach the end of the
changes feed, then stop.

`api_frontend node` : Database node which has the `api_frontend` type set to
`true` as described in the [Node Types
RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).
Replication jobs can only be created on these nodes.

`replication node` : Database node which has the `replication` type set to
`true` as described in the [Node Types
RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).
Replication jobs can only be run on these nodes.

`filtered` replications : Replications with a user-defined filter on the source
endpoint to filter its changes feed.

`replication_id` : An ID defined by replication jobs, which is a hash of
replication parameters that affect the result of the replication. These may
include source and target endpoint URLs, as well as a filter function specified
in a design document on the source endpoint.

`job_id` : A replication job ID derived from the database and document IDs for
persistent replications, and from the source and target endpoints, the user
name and some options for transient replications. Computing a `job_id`, unlike
a `replication_id`, doesn't require making any network requests. A filtered
replication with a given `job_id` may change its `replication_id` multiple
times during its lifetime as the filter contents change on the source.
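
The practical difference is that a `job_id` can be computed locally, while a
`replication_id` may require fetching filter code from the source endpoint. The
following is only an illustrative sketch of that distinction; the hashing
scheme and the exact parameters hashed are assumptions, not the actual
implementation:

```erlang
%% Illustrative only: hashing details and parameter choices are assumptions.
-module(rep_ids_sketch).
-export([job_id/2, replication_id/4]).

%% A persistent job's id needs only local information (db name + doc id).
job_id(DbName, DocId) ->
    couch_util:to_hex(crypto:hash(md5, term_to_binary({DbName, DocId}))).

%% The replication_id also covers parameters which affect the result,
%% including filter code fetched from the source endpoint, so it can change
%% over time while the job_id stays the same.
replication_id(Source, Target, FilterCode, Options) ->
    Params = {Source, Target, FilterCode, Options},
    couch_util:to_hex(crypto:hash(md5, term_to_binary(Params))).
```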

`max_jobs` : Configuration parameter which specifies the maximum number of
replication jobs to run on each `replication` node.

`max_churn` : Configuration parameter which limits how many new jobs may be
spawned during each rescheduling interval.

`min_backoff_penalty` : Configuration parameter specifying the minimum (the
base) penalty applied to jobs which crash repeatedly.

`max_backoff_penalty` : Configuration parameter specifying the maximum penalty
applied to jobs which crash repeatedly.

---

# Detailed Description

Replication job creation and scheduling works roughly as follows:

1) `Persistent` and `transient` jobs both start by creating or updating a
`couch_jobs` record in a separate replication key-space on `api_frontend`
nodes. Persistent jobs are driven by the `couch_epi` callback mechanism which
notifies the `couch_replicator` application when documents in `_replicator`
DBs are updated, or when `_replicator` DBs are created and deleted. Transient
jobs are created from the `_replicate` HTTP handler directly. Newly created
jobs are in a `pending` state.

2) Each `replication` node spawns some acceptor processes which wait in a
`couch_jobs:accept/2` call for jobs. They will accept only jobs which are
scheduled to run at a time less than or equal to the current time.

3) After a job is accepted, its state is updated to `running`, and then a
gen_server process monitoring these replication jobs will spawn another
acceptor. That happens until the `max_jobs` limit is reached.

4) The same monitoring gen_server will periodically check if there are any
pending jobs in the queue and, if there are, spawn up to some `max_churn`
number of new acceptors. These acceptors may start new jobs and, if they do,
for each one of them the oldest running job will be stopped and re-enqueued
as `pending`. This largely follows the logic of the replication scheduler in
CouchDB <= 3.x, except that it uses `couch_jobs` as the central queuing and
scheduling mechanism.

5) After the job is marked as `running`, it computes its `replication_id`,
initializes an internal replication state record from the job's data object,
and starts replicating. Below this level the logic is identical to what
already happens in CouchDB <= 3.x, so it is not described further in this
document.

6) As jobs run, they periodically checkpoint, and when they do, they also
recompute their `replication_id`. In the case of filtered replications the
`replication_id` may change, and if so, that job is stopped and re-enqueued as
`pending`. Also, during checkpointing the job's data value is updated with
stats, so that the job stays active and doesn't get re-enqueued by the
`couch_jobs` activity monitor.

7) If the job crashes, it will reschedule itself in `gen_server:terminate/2`
via a `couch_jobs:resubmit/3` call to run again at some future time, defined
roughly as `now + min(min_backoff_penalty * 2^consecutive_errors,
max_backoff_penalty)`; a sketch of this computation follows the list. If a job
starts and successfully runs for some predefined period of time without
crashing, it is considered to be `"healed"` and its `consecutive_errors` count
is reset to 0.

8) If the node where a replication job runs crashes, or the job is manually
killed via `exit(Pid, kill)`, the `couch_jobs` activity monitor will
automatically re-enqueue the job as `pending`.
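
The following is a rough, non-authoritative sketch of the acceptor and backoff
logic from steps 2, 3 and 7. The `couch_jobs:accept/2` and
`couch_jobs:resubmit/3` calls are the ones named above; the job type key, the
`max_sched_time` option, the configuration parameter names and the `undefined`
transaction argument are assumptions made for illustration:

```erlang
%% Sketch only; the real logic lives in couch_replicator_job_server and
%% couch_replicator_job.
-module(rep_sched_sketch).
-export([accept_one/0, reschedule_after_crash/2]).

-define(REP_JOBS, <<"replication_jobs">>).  %% assumed couch_jobs type key

%% Steps 2-3: acceptors block in couch_jobs:accept/2 and only take jobs
%% whose scheduled time is less than or equal to the current time.
accept_one() ->
    Now = erlang:system_time(second),
    couch_jobs:accept(?REP_JOBS, #{max_sched_time => Now}).

%% Step 7: after a crash, re-enqueue the job with an exponentially growing
%% penalty, capped at max_backoff_penalty.
reschedule_after_crash(Job, ConsecutiveErrors) ->
    Min = config:get_integer("replicator", "min_backoff_penalty_sec", 32),
    Max = config:get_integer("replicator", "max_backoff_penalty_sec", 86400),
    Penalty = min(Max, Min * (1 bsl ConsecutiveErrors)),
    couch_jobs:resubmit(undefined, Job, erlang:system_time(second) + Penalty).
```

The monitoring gen_server would keep spawning acceptor processes that call
something like `accept_one/0`, up to the `max_jobs` and `max_churn` limits
described above.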

## Replicator Job States

### Description

The set of replication job states is defined as:

* `pending` : A job is marked as `pending` in these cases:
  - As soon as a job is created from an `api_frontend` node
  - When it is stopped to let other replication jobs run
  - When a filtered replication's `replication_id` changes

* `running` : Set when a job is accepted by the `couch_jobs:accept/2`
  call. This generally means the job is actually running on a node;
  however, in cases when a node crashes, the job may show as
  `running` on that node until the `couch_jobs` activity monitor
  re-enqueues the job and it starts running on another node.

* `crashing` : The job was running, but then crashed with an intermittent
  error. The job's data has an error count which is incremented, a backoff
  penalty is computed, and the job is rescheduled to try again at some point
  in the future.

* `completed` : One-shot replications which have completed.

* `failed` : This can happen when:
  - A replication job could not be parsed from a replication document. For
    example, if the user has not specified a `"source"` field.
  - A transient replication job crashes. Transient jobs don't get rescheduled
    to run again after they crash.
  - There already is another persistent replication job running or pending
    with the same `replication_id`.

### State Differences From CouchDB <= 3.x

The set of states is slightly different from the one used before. There are
now fewer states, as some of them have been combined:

* `initializing` was combined with `pending`

* `error` was combined with `crashing`

### Mapping Between couch_jobs States and Replication States

The `couch_jobs` application has its own set of state definitions, and they
map to replicator states like so:

| Replicator States | `couch_jobs` States |
| ----------------- | :------------------ |
| pending           | pending              |
| running           | running              |
| crashing          | pending              |
| completed         | finished             |
| failed            | finished             |
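
Expressed as code, the mapping might look like the sketch below. Since
`crashing`/`pending` and `failed`/`completed` each share a `couch_jobs` state,
the replicator has to look at the job's data to disambiguate; the data field
names used here are assumptions:

```erlang
%% Sketch: map a couch_jobs state plus job data to a replicator state.
replication_state(running, _JobData) ->
    running;
replication_state(pending, #{<<"last_error">> := Err}) when Err =/= null ->
    crashing;
replication_state(pending, _JobData) ->
    pending;
replication_state(finished, #{<<"finish_state">> := <<"failed">>}) ->
    failed;
replication_state(finished, _JobData) ->
    completed.
```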

### State Transition Diagram

Jobs start in the `pending` state, after either a `_replicator` db doc
update or a `POST` to the `/_replicate` endpoint. Continuous jobs will
normally toggle between the `pending` and `running` states. One-shot jobs
may toggle between `pending` and `running` a few times and then end up
in `completed`.

```
_replicator doc       +-------+
POST /_replicate ---->+pending|
                      +-------+
                          ^
                          |
                          |
                          v
                      +---+---+     +--------+
        +-------------+running+<--->|crashing|
        |             +---+---+     +--------+
        |                 |
        |                 |
        v                 v
    +------+         +---------+
    |failed|         |completed|
    +------+         +---------+
```

## Replication ID Collisions

Multiple replication jobs may specify replications which map to the same
`replication_id`. To handle these collisions there is an FDB subspace `(...,
LayerPrefix, ?REPLICATION_IDS, replication_id) -> job_id` to keep track of
them. After the `replication_id` is computed, each replication job checks if
there is already another job pending or running with the same
`replication_id`. If the other job is transient, then the current job will
reschedule itself as `crashing`. If the other job is persistent, the current
job will fail permanently as `failed`.
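
A minimal sketch of that collision check, assuming direct `erlfdb` access
inside an existing transaction (the actual code would go through the
replicator's FDB helpers, and the `?REPLICATION_IDS` value below is only a
placeholder):

```erlang
%% Sketch: claim or check the (?REPLICATION_IDS, RepId) -> JobId mapping.
-define(REPLICATION_IDS, 19).  %% placeholder subspace tag

check_replication_id(Tx, LayerPrefix, RepId, JobId) ->
    Key = erlfdb_tuple:pack({?REPLICATION_IDS, RepId}, LayerPrefix),
    case erlfdb:wait(erlfdb:get(Tx, Key)) of
        not_found ->
            %% No other job owns this replication_id yet, so claim it.
            erlfdb:set(Tx, Key, JobId),
            ok;
        JobId ->
            %% Already owned by this very job.
            ok;
        OtherJobId ->
            %% Collision: the caller decides between `crashing` and `failed`
            %% depending on whether the other job is transient or persistent.
            {error, {duplicate_job, OtherJobId}}
    end.
```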

## Replication Parameter Validation

`_replicator` documents in CouchDB <= 3.x were parsed and validated in a
two-step process:

1) In a validate-doc-update (VDU) JavaScript function from a programmatically
inserted `_design` document. This validation happened when the document was
updated and performed some rough checks on field names and value types. If
this validation failed, the document update operation was rejected.

2) Inside the replicator's Erlang code, when the document was translated to an
internal record used by the replication application. This validation was more
thorough but didn't have very friendly error messages. If validation failed
here, the job would be marked as `failed`.

For CouchDB 4.x the proposal is to use only the Erlang parser. It would be
called from the `before_doc_update` callback, which runs before every document
update. If validation fails there, the document update operation is rejected.
This should reduce code duplication and also provide better feedback to users
directly when they update `_replicator` documents.
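
A rough sketch of what that could look like; the callback shape mirrors the
existing `before_doc_update/3` convention and throwing `{forbidden, Reason}`
is how document updates are normally rejected, but the validation shown is
only a fragment, not the proposed parser:

```erlang
%% Sketch: reject obviously invalid _replicator docs before they are written.
-include_lib("couch/include/couch_db.hrl").

before_doc_update(#doc{body = {Props}} = Doc, _Db, _UpdateType) ->
    case couch_util:get_value(<<"source">>, Props) of
        undefined ->
            %% Surfaces as a 403 with this reason when updating the doc.
            throw({forbidden, <<"replication documents must have a \"source\"">>});
        _Source ->
            Doc
    end.
```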

## Transient Job Behavior

In CouchDB <= 3.x transient replication jobs ran in memory on a particular
node in the cluster. If the node where the replication job ran crashed, the
job would simply disappear without a trace, and it was up to the user to
periodically monitor the job status and re-create the job. In the current
design, `transient` jobs are persisted to FDB as `couch_jobs` records, and so
they survive node restarts. In addition, transient jobs used to disappear
immediately after they completed or failed; this design proposes keeping them
around for a configurable amount of time to allow users to retrieve their
status via the `_scheduler/jobs/$id` API.

## Monitoring Endpoints

The `_active_tasks`, `_scheduler/jobs` and `_scheduler/docs` endpoints are
handled by traversing the replication jobs' data using a new
`couch_jobs:fold_jobs/4` API function to retrieve each job's data. The
`_active_tasks` implementation already works that way, and the `_scheduler/*`
endpoints will work similarly.
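
For example, a `_scheduler/jobs` response could be assembled by folding over
the replication job type with the new API (a sketch; the exact arguments of
the fold callback, the transaction argument and the accumulator shape are
assumptions):

```erlang
%% Sketch: collect a summary of every replication job for _scheduler/jobs.
scheduler_jobs() ->
    FoldFun = fun(_JTx, JobId, JobState, JobData, Acc) ->
        [#{id => JobId, state => JobState, info => JobData} | Acc]
    end,
    couch_jobs:fold_jobs(undefined, <<"replication_jobs">>, FoldFun, []).
```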

## Replication Documents Not Updated For Transient Errors

The configuration
[option](https://docs.couchdb.org/en/latest/replication/replicator.html?highlight=update_docs#compatibility-mode)
`[replicator] update_docs = false` was introduced with the scheduling
replicator in a 2.x release. It controls whether to update replication
documents with transient states like `triggered` and `error`. It defaulted to
`false` and was mainly kept for compatibility with older monitoring user
scripts. That behavior now becomes hard-coded, such that replication documents
are only updated with the terminal states `failed` and `completed`. Users
should use the `_scheduler/docs` API to check for completion status instead.

# Advantages and Disadvantages

Advantages:

* Simplicity: re-using `couch_jobs` means having a lot less code to maintain
  in `couch_replicator`. In the draft implementation there are about 3000
  lines of code saved compared to the replicator application in CouchDB 3.x.

* Simpler endpoint and monitoring implementation

* Fewer replication job states to keep track of

* Transient replications can survive node crashes and restarts

* Simplified and improved validation logic

* Using node types allows tightening firewall rules such that only
  `replication` nodes may make arbitrary requests outside the cluster, and
  `api_frontend` nodes are the only ones that may accept incoming connections.

Disadvantages:

* Behavior changes for transient jobs

* A centralized job queue might mean handling some number of conflicts
  generated in the FDB backend when jobs are accepted. These are mitigated
  using the `startup_jitter` configuration parameter and a configurable number
  of maximum acceptors per node.

* In monitoring API responses, the `running` job state might not immediately
  reflect the running process state on the replication node. If the node
  crashes, it might take up to a minute or two until the job is re-enqueued by
  the `couch_jobs` activity monitor.

# Key Changes

* Behavior changes for transient jobs

* A delay in the `running` state as reflected in monitoring API responses

* The `[replicator] update_docs = false` configuration option becomes
  hard-coded

## Applications and Modules affected

* couch_jobs : New APIs to fold jobs and to get a pending job count estimate

* fabric2_db : Adding EPI db create/delete callbacks

* couch_replicator :
  - Remove `couch_replicator_scheduler*` modules
  - Remove `couch_replicator_doc_processor_*` modules
  - `couch_replicator` : job creation and a general API entry-point for the
    application
  - `couch_replicator_job` : runs each replication job
  - `couch_replicator_job_server` : replication job monitoring gen_server
  - `couch_replicator_parse` : parses replication documents and HTTP
    `_replicate` POST bodies

## HTTP API additions

N/A

## HTTP API deprecations

N/A

# Security Considerations

The ability to confine replication jobs to run only on `replication` nodes
improves the security posture. It makes it possible to set up firewall rules
which allow egress traffic only from those nodes.

# References

* [Background Jobs RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md)

* [Node Types RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md)

* [CouchDB 3.x replicator implementation](https://github.com/apache/couchdb/blob/3.x/src/couch_replicator/README.md)

# Co-authors

* @davisp

# Acknowledgements

* @davisp

0 commit comments

Comments
 (0)