Commit eaad75f

[rfc] Fair Share Replication Scheduler (#617)
Discussed on the [ML](https://lists.apache.org/thread.html/rebba9a43bfdf9696f2ce974b0fc7550a631c7b835e4c14e51cd27a87%40%3Cdev.couchdb.apache.org%3E). Based on the Fair Share Scheduler paper by [Judy Kay and Piers Lauder](https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf).

rfcs/017-fair-share-scheduling.md

---
name: Formal RFC
about: Submit a formal Request For Comments for consideration by the team.
title: 'Fair Share Job Scheduling for CouchDB 3.x Replicator'
labels: rfc, discussion
assignees: 'vatamane@apache.org'

---

# Introduction

This document describes an improvement to the CouchDB 3.x replicator to
introduce fair resource sharing between replication jobs in different
`_replicator` databases.

## Abstract

Currently the CouchDB 3.x replicator schedules jobs without any regard to the
database they originated from. If there are multiple `_replicator` dbs, then
replication jobs from the dbs with the most jobs will consume most of the
scheduler's resources. The proposal is to implement a fair sharing scheme as
described in [A Fair Share Scheduler][2], a paper by Judy Kay and Piers
Lauder. It would allow sharing replication scheduler resources fairly amongst
`_replicator` dbs.

The idea was originally discussed on the [couchdb-dev][1] mailing list, and the
use of the Fair Share algorithm was suggested by Joan Touzet.

## Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in [RFC
2119](https://www.rfc-editor.org/rfc/rfc2119.txt).

## Terminology

`_replicator` databases : A database that is either named `_replicator` or ends
with the `/_replicator` suffix.

`shares` : An abstract representation of entitlement to run on the replication
scheduler.

`usage` : A measure of resource usage by jobs from a particular `_replicator`
db. For the scheduling replicator this will be the total time spent running.

`continuous` replications : Replication jobs created with the `"continuous":
true` parameter. These jobs will try to run continuously until the user removes
them. They may be temporarily paused to allow other jobs to make progress.

`one-shot` replications : Replication jobs which are not `continuous`. If the
`"continuous":true` parameter is not specified, replication jobs will be
`one-shot` by default. These jobs will try to run until they reach the end of
the changes feed, then stop.

`job priority` : A job attribute which indicates the likelihood of the job
being executed before other jobs. Following the convention in the "Fair Share"
paper, jobs with a lower priority value are at the front of the pending queue
and get executed first.

`max_jobs` : Configuration parameter which specifies up to how many replication
jobs to run on each replicator node.

`max_churn` : Configuration parameter which specifies a limit of how many new
jobs to spawn during each rescheduling interval.

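Both of these parameters already exist in the replicator configuration. For
illustration, they live in the `[replicator]` config section; the values shown
below are the defaults listed in the CouchDB 3.x documentation.

```
[replicator]
; maximum number of replication jobs running on each node
max_jobs = 500
; maximum number of jobs to start or stop during a rescheduling interval
max_churn = 20
```
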
---

# Detailed Description

The general idea behind the algorithm is to continuously monitor
per-`_replicator` db job statistics and to update each job's priority in
proportion to the usage from all the jobs in the same `_replicator` db. To make
sure all jobs eventually get a chance to run and do not starve, all the
priorities are continuously boosted, so that jobs which haven't run for a
while, and may be starving, will eventually get a chance to run.

The algorithm has 3 basic components that can run mostly independently from
each other (an illustrative code sketch of all three follows step 3 below):

1) Keep track of `usage` for each `_replicator` db. In the paper this part is
called "user-level scheduling". As jobs run, they send reports to this
component. Those reports are accumulated for one period, then rolled up when
the period ends. A decay coefficient is also applied to account for recent
historical usage (this is called `K1` in the paper). This ensures that in the
absence of running jobs from a particular `_replicator` db, its usage drops to
0 and the whole entry is eventually removed from the table altogether.

Every `UsageUpdateInterval` seconds (called `t1` in the paper):
For each `Db`:
```
DecayCoeff = get_usage_decay_coefficient(0.5)
AccumulatedUsage = get_accumulated_usage(Db)
update_usage(Db, usage(Db) * DecayCoeff + AccumulatedUsage)
reset_accumulated_usage(Db)
```

2) Uniformly decay all job priorities. Periodically lower the priority values,
and thus boost the priority, of all the pending and running jobs in the system.
In this step the paper also applies a per-process "nice" value, which is
skipped in this initial proposal. It could be added later if needed.

Every `UniformPriorityBoostInterval` seconds (called `t2` in the paper):
For each `Job`:
```
DecayCoeff = get_uniform_decay_coefficient(0.75)
Job#job.priority = Job#job.priority * DecayCoeff
```

[note]: If jobs were scheduled to run at an absolute future time (a deadline),
this step could be avoided. Then, the effect of all the jobs needing to
periodically move to the front of the queue would be accomplished instead by
the current time (i.e. `now()`) moving ahead along the time-line.

3) Adjust each running job's priority in proportion to the usage from all the
jobs in the same db, and in inverse proportion to that db's shares:

Every `RunningPriorityReduceInterval` seconds (called `t3` in the paper):
For each `Job`:
```
Db = Job#job.db
SharesSq = shares(Db) * shares(Db)
Job#job.priority = Job#job.priority + (usage(Db) * pending(Db)) / SharesSq
```
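
To make steps 1 through 3 concrete, below is a minimal Erlang sketch of the
three components operating on an ETS usage table and a list of job records. The
module, table, record, and function names here are illustrative assumptions,
not the actual `couch_replicator` code.

```
%% Illustrative sketch only; names are hypothetical, not couch_replicator APIs.
-module(fair_share_sketch).
-export([init/0, report_usage/2, rollup_usage/1,
         decay_priorities/2, adjust_priorities/3]).

-define(USAGE, fair_share_usage_table).

-record(job, {id, db, priority = 0.0}).

init() ->
    %% One {Db, Usage, AccumulatedUsage} row per `_replicator` db.
    ets:new(?USAGE, [named_table, public, set]).

%% Step 1a: jobs report their elapsed run time (msec) as they run.
report_usage(Db, RunTimeMsec) ->
    ets:update_counter(?USAGE, Db, {3, RunTimeMsec}, {Db, 0, 0}).

%% Step 1b (every t1): decay the historical usage, fold in what accumulated
%% during the period, and drop rows that decayed to zero so idle dbs
%% disappear from the table.
rollup_usage(DecayCoeff) ->
    Rolled = ets:foldl(fun({Db, Usage, Acc}, List) ->
        [{Db, round(Usage * DecayCoeff) + Acc, 0} | List]
    end, [], ?USAGE),
    lists:foreach(fun
        ({Db, 0, _}) -> ets:delete(?USAGE, Db);
        (Row) -> ets:insert(?USAGE, Row)
    end, Rolled).

%% Step 2 (every t2): uniformly decay all pending and running job priorities.
%% Lower values sort to the front of the queue, so this boosts every job and
%% prevents starvation.
decay_priorities(Jobs, DecayCoeff) ->
    [Job#job{priority = Job#job.priority * DecayCoeff} || Job <- Jobs].

%% Step 3 (every t3): penalize running jobs in proportion to their db's usage
%% and pending job count, and in inverse proportion to the square of the db's
%% shares. PendingFun and SharesFun stand in for the scheduler's per-db
%% statistics and configuration lookups.
adjust_priorities(RunningJobs, PendingFun, SharesFun) ->
    lists:map(fun(#job{db = Db, priority = Prio} = Job) ->
        Usage = case ets:lookup(?USAGE, Db) of
            [{Db, U, _}] -> U;
            [] -> 0
        end,
        SharesSq = SharesFun(Db) * SharesFun(Db),
        Job#job{priority = Prio + (Usage * PendingFun(Db)) / SharesSq}
    end, RunningJobs).
```

A scheduler process would call `rollup_usage/1`, `decay_priorities/2`, and
`adjust_priorities/3` from timers firing every `t1`, `t2`, and `t3` seconds,
respectively.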

### How Jobs Start and Stop

During each rescheduling cycle, `max_churn` running jobs from the back of the
queue are stopped and `max_churn` jobs from the front of the pending queue are
started. This part is not modified from the existing scheduling algorithm,
except that now the jobs would be ordered by their `priority` value before
being ordered by their last start time.

In addition, `one-shot` replication jobs would still be skipped when stopping
jobs, and they would be left running in order to maintain traditional
replication semantics, just like before.

When picking the jobs to run, jobs which have been exponentially backed off due
to repeated errors are excluded. This part is also unmodified from the original
scheduler.

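As a rough illustration of that ordering, the sketch below sorts pending jobs
so the lowest `{priority, last_started}` pairs start first, and sorts running
continuous jobs so the ones with the highest priority values (the least
entitled) stop first. The `#job{}` record, field names, and tie-breaking
details are simplifications, not the actual `couch_replicator_scheduler` logic.

```
%% Illustrative sketch only; simplified from the real scheduler logic.
-module(fair_share_churn).
-export([start_candidates/2, stop_candidates/2]).

-record(job, {id, db, priority = 0.0, last_started = 0, continuous = true}).

%% Pick up to MaxChurn pending jobs to start: lowest priority value first,
%% ties broken by the least recently started job.
start_candidates(Pending, MaxChurn) ->
    Sorted = lists:sort(fun(#job{priority = PA, last_started = SA},
                            #job{priority = PB, last_started = SB}) ->
        {PA, SA} =< {PB, SB}
    end, Pending),
    lists:sublist(Sorted, MaxChurn).

%% Pick up to MaxChurn running jobs to stop: highest priority value (least
%% entitled) first. One-shot jobs are excluded so they can run to completion,
%% preserving traditional replication semantics.
stop_candidates(Running, MaxChurn) ->
    Continuous = [Job || Job <- Running, Job#job.continuous =:= true],
    Sorted = lists:sort(fun(#job{priority = PA, last_started = SA},
                            #job{priority = PB, last_started = SB}) ->
        {-PA, SA} =< {-PB, SB}
    end, Continuous),
    lists:sublist(Sorted, MaxChurn).
```
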
### Configuration

The decay coefficients and interval times for each of the 3 parts of the
algorithm would be configurable in the `[replicator]` config section.

Per-`_replicator` db shares would be configurable in the `[replicator.shares]`
section as:

```
[replicator.shares]
$prefix/_replicator = $numshares
```

By default each db is assigned 100 shares. A higher number of shares indicates
a larger proportion of scheduler resources allocated to that db, while a lower
number gets proportionally fewer shares.

For example:

```
[replicator.shares]

; This is the default
; _replicator = 100

high/_replicator = 200
low/_replicator = 50
```

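For illustration, reading the per-db shares with the default of 100 might look
like the sketch below; the `shares/1` helper is hypothetical, while
`config:get_integer/3` is the standard accessor in CouchDB's config
application.

```
%% Illustrative sketch only; shares/1 is a hypothetical helper.
-module(fair_share_config).
-export([shares/1]).

-define(DEFAULT_SHARES, 100).

%% Look up the configured shares for a `_replicator` db name, falling back
%% to the default of 100 when the db has no entry in [replicator.shares].
shares(DbName) when is_binary(DbName) ->
    config:get_integer("replicator.shares", binary_to_list(DbName),
        ?DEFAULT_SHARES).
```

With the example configuration above, `shares(<<"high/_replicator">>)` would
return 200, `shares(<<"low/_replicator">>)` would return 50, and any other
`_replicator` db would get 100.
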
# Advantages and Disadvantages

Advantages:

* Allow a fair share of resources between multiple `_replicator` db instances

* Can boost or lower the priority of some replication jobs by adjusting the
  shares assigned to that database instance.

Disadvantages:

* Adds more complexity to the scheduler

# Key Changes

* Modifies replication scheduler

## Applications and Modules affected

* `couch_replicator` application

## HTTP API additions

N/A

## HTTP API deprecations

N/A

# Security Considerations

None

# References

* [1]: https://lists.apache.org/thread.html/rebba9a43bfdf9696f2ce974b0fc7550a631c7b835e4c14e51cd27a87%40%3Cdev.couchdb.apache.org%3E "couchdb-dev"

* [2]: https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf "Fair Share Scheduler"

# Co-authors

* Joan Touzet (@wohali)

# Acknowledgments

* Joan Touzet (@wohali)
