
Commit 7783b96

doc: configuration example
1 parent 3c42e50 commit 7783b96

3 files changed: +253 -14 lines changed

README.md

Lines changed: 96 additions & 0 deletions
@@ -276,6 +276,102 @@ Or you can try the live editor: https://mermaid-js.github.io/mermaid-live-editor
You can either copy the code into a Markdown viewer or just copy the link into your browser ([link](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiY2xhc3NEaWFncmFtXG5jbGFzcyBNeUZhY3Rvcnkge1xuICA8PEZhY3RvcnlbRGF0YXNldFtUZXN0T2JqZWN0XV0-PlxuICArU3BhcmtSZXBvc2l0b3J5W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIERhdGFzZXRUZXN0T2JqZWN0IHtcbiAgPDxEYXRhc2V0W1Rlc3RPYmplY3RdPj5cbiAgPnBhcnRpdGlvbjE6IEludFxuICA-cGFydGl0aW9uMjogU3RyaW5nXG4gID5jbHVzdGVyaW5nMTogU3RyaW5nXG4gID52YWx1ZTogTG9uZ1xufVxuXG5EYXRhc2V0VGVzdE9iamVjdCA8fC4uIE15RmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgQW5vdGhlckZhY3Rvcnkge1xuICA8PEZhY3RvcnlbU3RyaW5nXT4-XG4gICtEYXRhc2V0W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIFN0cmluZ0ZpbmFsIHtcbiAgPDxTdHJpbmc-PlxuICBcbn1cblxuU3RyaW5nRmluYWwgPHwuLiBBbm90aGVyRmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgU3BhcmtSZXBvc2l0b3J5VGVzdE9iamVjdEV4dGVybmFsIHtcbiAgPDxTcGFya1JlcG9zaXRvcnlbVGVzdE9iamVjdF0-PlxuICBcbn1cblxuQW5vdGhlckZhY3RvcnkgPHwtLSBEYXRhc2V0VGVzdE9iamVjdCA6IElucHV0XG5NeUZhY3RvcnkgPHwtLSBTcGFya1JlcG9zaXRvcnlUZXN0T2JqZWN0RXh0ZXJuYWwgOiBJbnB1dFxuIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifX0=)) 🍻

### App Configuration

The configuration system of SETL lets you run your Spark application in different execution environments by using
environment-specific configurations.

In the `src/main/resources` directory, you should have at least two configuration files, named `application.conf` and
`local.conf` (take a look at this [example](https://github.com/SETL-Developers/setl-template/tree/master/src/main/resources)).
These are all you need if you only run your application in a single environment.

You can also create other configurations (for example `dev.conf` and `prod.conf`), in which environment-specific
parameters can be defined.

##### application.conf

This configuration file should contain universal settings that apply regardless of the execution environment.

##### env.conf (e.g. local.conf, dev.conf)

These files should contain environment-specific parameters. By default, `local.conf` will be used.

##### How to use the configuration

Imagine that we have two environments: a local development environment and a remote production environment. Our
application needs a repository for saving and loading data. In this case, let's prepare `application.conf`, `local.conf`,
`prod.conf` and `storage.conf`:

```hocon
# application.conf
setl.environment = ${app.environment}
setl.config {
  spark.app.name = "my_application"
  # and other general spark configurations
}
```

```hocon
# local.conf
include "application.conf"

setl.config {
  spark.default.parallelism = "200"
  spark.sql.shuffle.partitions = "200"
  # and other local spark configurations
}

app.root.dir = "/some/local/path"

include "storage.conf"
```

```hocon
# prod.conf
setl.config {
  spark.default.parallelism = "1000"
  spark.sql.shuffle.partitions = "1000"
  # and other production spark configurations
}

app.root.dir = "/some/remote/path"

include "storage.conf"
```

```hocon
# storage.conf
myRepository {
  storage = "CSV"
  path = ${app.root.dir} // this path will depend on the execution environment
  inferSchema = "true"
  delimiter = ";"
  header = "true"
  saveMode = "Append"
}
```
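With this setup, the application code itself stays the same across environments; only the resolved configuration changes. As a minimal sketch (reusing the standard `Setl` builder and `setSparkRepository` calls from the quick-start example; `MyObject` is a hypothetical case class standing in for whatever `myRepository` stores), the repository declared in `storage.conf` could be wired in like this:

```scala
import com.jcdecaux.setl.Setl

// Hypothetical case class describing the records stored by myRepository
case class MyObject(col1: String, col2: Long)

val setl: Setl = Setl.builder()
  .withDefaultConfigLoader() // loads application.conf and the environment-specific conf
  .getOrCreate()

// "myRepository" refers to the block defined in storage.conf;
// its path resolves against app.root.dir of the active environment
setl.setSparkRepository[MyObject]("myRepository")
```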
To compile with the local configuration using Maven, just run:

```shell
mvn compile
```

To compile with the production configuration, pass the JVM property `app.environment`:

```shell
mvn compile -Dapp.environment=prod
```

Make sure that resource filtering is enabled for your resources directory in the `pom.xml`:

```xml
<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
  </resource>
</resources>
```
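With filtering enabled, Maven substitutes `${app.environment}` into `application.conf` at build time. If you want a plain `mvn compile` to target the local environment explicitly, one option (a suggestion about your build setup, not something SETL requires) is to declare a default value in the `pom.xml` properties; a `-Dapp.environment=prod` passed on the command line should still override it:

```xml
<properties>
  <!-- default environment used when -Dapp.environment is not given on the command line -->
  <app.environment>local</app.environment>
</properties>
```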
## Dependencies

**SETL** currently supports the following data sources. You won't need to provide these libraries in your project (except the JDBC driver):

docs/Quick-Start.md

Lines changed: 134 additions & 14 deletions
@@ -1,23 +1,33 @@
### Basic concept

With SETL, an ETL application could be represented by a `Pipeline`. A `Pipeline` contains multiple `Stages`. In each
stage, we could find one or several `Factories`.

The class `Factory[T]` is an abstraction of a data transformation that will produce an object of type `T`. It has 4
methods (*read*, *process*, *write* and *get*) that should be implemented by the developer.

The class `SparkRepository[T]` is a data access layer abstraction. It could be used to read/write a `Dataset[T]` from/to
a datastore. It should be defined in a configuration file. You can have as many SparkRepositories as you want.

The entry point of a SETL project is the object `com.jcdecaux.setl.Setl`, which will handle the pipeline and spark
repository instantiation.

### Show me some code

You can find the following tutorial code in [the starter template of SETL](https://github.com/qxzzxq/setl-template). Go
and clone it :)

Here we show a simple example of creating and saving a **Dataset[TestObject]**. The case class **TestObject** is defined
as follows:

```scala
case class TestObject(partition1: Int, partition2: String, clustering1: String, value: Long)
```

#### Context initialization

Suppose that we want to save our output into `src/main/resources/test_csv`. We can create a configuration file
**local.conf** in `src/main/resources` with the following content that defines the target datastore to save our dataset:

```txt
testObjectRepository {
@@ -31,6 +41,7 @@ testObjectRepository {
```

In our `App.scala` file, we build `Setl` and register this data store:

```scala
val setl: Setl = Setl.builder()
  .withDefaultConfigLoader()
@@ -42,19 +53,22 @@ setl.setSparkRepository[TestObject]("testObjectRepository")
```

#### Implementation of Factory

We will create our `Dataset[TestObject]` inside a `Factory[Dataset[TestObject]]`. A `Factory[A]` will always produce an
object of type `A`, and it contains 4 abstract methods that you need to implement:

- read
- process
- write
- get

```scala
class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {

  import spark.implicits._

  // A repository is needed for writing data. It will be delivered by the pipeline
  @Delivery
  private[this] val repo = SparkRepository[TestObject]

  private[this] var output = spark.emptyDataset[TestObject]
@@ -73,7 +87,7 @@ class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {
  }

  override def write(): MyFactory.this.type = {
    repo.save(output) // use the repository to save the output
    this
  }
@@ -83,9 +97,11 @@ class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {
```

#### Define the pipeline

To execute the factory, we should add it into a pipeline.

When we call `setl.newPipeline()`, **Setl** will instantiate a new **Pipeline** and configure all the registered
repositories as inputs of the pipeline. Then we can call `addStage` to add our factory into the pipeline.

```scala
val pipeline = setl
@@ -94,12 +110,15 @@ val pipeline = setl
```

#### Run our pipeline

```scala
pipeline.describe().run()
```

The dataset will be saved into `src/main/resources/test_csv`.
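If you want to double-check the output outside of SETL, you can read the folder back with plain Spark (a quick sketch, assuming a `SparkSession` named `spark` is in scope, like the one `HasSparkSession` exposes inside a factory; adjust the options to whatever `testObjectRepository` declares):

```scala
// Read the CSV folder written by the pipeline back into a DataFrame
val saved = spark.read
  .option("header", "true") // match the options of testObjectRepository
  .csv("src/main/resources/test_csv")

saved.show()
```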
#### What's more?

As our `MyFactory` produces a `Dataset[TestObject]`, it can be used by other factories of the same pipeline.

```scala
@@ -130,12 +149,15 @@ pipeline.addStage[AnotherFactory]()
```

### Generate pipeline diagram (with v0.4.1+)

You can generate a [Mermaid diagram](https://mermaid-js.github.io/mermaid/#/) by doing:

```scala
pipeline.showDiagram()
```

You will see a log like this:

```
--------- MERMAID DIAGRAM ---------
classDiagram
@@ -180,4 +202,102 @@ Or you can try the live editor: https://mermaid-js.github.io/mermaid-live-editor
```
You can either copy the code into a Markdown viewer or just copy the link into your
browser ([link](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiY2xhc3NEaWFncmFtXG5jbGFzcyBNeUZhY3Rvcnkge1xuICA8PEZhY3RvcnlbRGF0YXNldFtUZXN0T2JqZWN0XV0-PlxuICArU3BhcmtSZXBvc2l0b3J5W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIERhdGFzZXRUZXN0T2JqZWN0IHtcbiAgPDxEYXRhc2V0W1Rlc3RPYmplY3RdPj5cbiAgPnBhcnRpdGlvbjE6IEludFxuICA-cGFydGl0aW9uMjogU3RyaW5nXG4gID5jbHVzdGVyaW5nMTogU3RyaW5nXG4gID52YWx1ZTogTG9uZ1xufVxuXG5EYXRhc2V0VGVzdE9iamVjdCA8fC4uIE15RmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgQW5vdGhlckZhY3Rvcnkge1xuICA8PEZhY3RvcnlbU3RyaW5nXT4-XG4gICtEYXRhc2V0W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIFN0cmluZ0ZpbmFsIHtcbiAgPDxTdHJpbmc-PlxuICBcbn1cblxuU3RyaW5nRmluYWwgPHwuLiBBbm90aGVyRmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgU3BhcmtSZXBvc2l0b3J5VGVzdE9iamVjdEV4dGVybmFsIHtcbiAgPDxTcGFya1JlcG9zaXRvcnlbVGVzdE9iamVjdF0-PlxuICBcbn1cblxuQW5vdGhlckZhY3RvcnkgPHwtLSBEYXRhc2V0VGVzdE9iamVjdCA6IElucHV0XG5NeUZhY3RvcnkgPHwtLSBTcGFya1JlcG9zaXRvcnlUZXN0T2JqZWN0RXh0ZXJuYWwgOiBJbnB1dFxuIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifX0=))
🍻
#### App Configuration
The configuration system of SETL lets you run your Spark application in different execution environments by using
environment-specific configurations.

In the `src/main/resources` directory, you should have at least two configuration files, named `application.conf` and
`local.conf` (take a look at this [example](https://github.com/SETL-Developers/setl-template/tree/master/src/main/resources)).
These are all you need if you only run your application in a single environment.

You can also create other configurations (for example `dev.conf` and `prod.conf`), in which environment-specific
parameters can be defined.

##### application.conf

This configuration file should contain universal settings that apply regardless of the execution environment.

##### env.conf (e.g. local.conf, dev.conf)

These files should contain environment-specific parameters. By default, `local.conf` will be used.

##### How to use the configuration

Imagine that we have two environments: a local development environment and a remote production environment. Our
application needs a repository for saving and loading data. In this case, let's prepare `application.conf`, `local.conf`,
`prod.conf` and `storage.conf`:

```hocon
# application.conf
setl.environment = ${app.environment}
setl.config {
  spark.app.name = "my_application"
  # and other general spark configurations
}
```

```hocon
# local.conf
include "application.conf"

setl.config {
  spark.default.parallelism = "200"
  spark.sql.shuffle.partitions = "200"
  # and other local spark configurations
}

app.root.dir = "/some/local/path"

include "storage.conf"
```

```hocon
# prod.conf
setl.config {
  spark.default.parallelism = "1000"
  spark.sql.shuffle.partitions = "1000"
  # and other production spark configurations
}

app.root.dir = "/some/remote/path"

include "storage.conf"
```

```hocon
# storage.conf
myRepository {
  storage = "CSV"
  path = ${app.root.dir} // this path will depend on the execution environment
  inferSchema = "true"
  delimiter = ";"
  header = "true"
  saveMode = "Append"
}
```
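Note that the factory code shown earlier does not change between environments; only the resolved configuration does. As a sketch (reusing the builder from the Context initialization step), the repository above would be registered under its configuration id:

```scala
val setl: Setl = Setl.builder()
  .withDefaultConfigLoader()
  .getOrCreate()

// "myRepository" is the block defined in storage.conf; its path follows app.root.dir,
// so the same code targets the local or the remote location depending on the environment
setl.setSparkRepository[TestObject]("myRepository")
```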
To compile with the local configuration using Maven, just run:

```shell
mvn compile
```

To compile with the production configuration, pass the JVM property `app.environment`:

```shell
mvn compile -Dapp.environment=prod
```

Make sure that resource filtering is enabled for your resources directory in the `pom.xml`:

```xml
<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
  </resource>
</resources>
```

docs/utils/Compressor_Archiver.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Compressor

A [compressor](https://github.com/SETL-Developers/setl/blob/master/src/main/scala/com/jcdecaux/setl/storage/Compressor.scala)
can:

- compress a string to a byte array
- decompress a byte array to a string

## Example

```scala
import com.jcdecaux.setl.storage.GZIPCompressor

val compressor = new GZIPCompressor()

val compressed = compressor.compress("data to be compressed")
val data = compressor.decompress(compressed)
```

# Archiver

An [Archiver](https://github.com/SETL-Developers/setl/blob/master/src/main/scala/com/jcdecaux/setl/storage/Archiver.scala) can
package files and directories into a single data archive file.