### App Configuration
The configuration system of SETL allows users to execute their Spark application in different execution environments by using environment-specific configurations.
In the `src/main/resources` directory, you should have at least two configuration files named `application.conf` and `local.conf` (take a look at this [example](https://github.com/SETL-Developers/setl-template/tree/master/src/main/resources)). These are all you need if you only want to run your application in a single environment.
You can also create other configurations (for example `dev.conf` and `prod.conf`), in which environment-specific parameters can be defined.
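
The resources directory could then look like this:

```txt
src/main/resources
├── application.conf
├── local.conf
├── dev.conf
└── prod.conf
```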
##### application.conf
This configuration file should contain universal configurations that can be used regardless of the execution environment.
##### env.conf (e.g. local.conf, dev.conf)
These files should contain environment-specific parameters. By default, `local.conf` will be used.
##### How to use the configuration
Imagine that we have two environments: a local development environment and a remote production environment. Our application needs a repository for saving and loading data. In this use case, let's prepare `application.conf`, `local.conf`, `prod.conf` and `storage.conf`:
```hocon
# application.conf
setl.environment = ${app.environment}

setl.config {
  spark.app.name = "my_application"
  # and other general spark configurations
}
```
```hocon
# local.conf
include "application.conf"

setl.config {
  spark.default.parallelism = "200"
  spark.sql.shuffle.partitions = "200"
  # and other local spark configurations
}

app.root.dir = "/some/local/path"

include "storage.conf"
```
```hocon
# prod.conf
include "application.conf"

setl.config {
  spark.default.parallelism = "1000"
  spark.sql.shuffle.partitions = "1000"
  # and other production spark configurations
}

app.root.dir = "/some/remote/path"

include "storage.conf"
```
```hocon
# storage.conf
myRepository {
  storage = "CSV"
  path = ${app.root.dir} // this path will depend on the execution environment
  inferSchema = "true"
  delimiter = ";"
  header = "true"
  saveMode = "Append"
}
```
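
In application code, such a repository can then be registered against its configuration key. A minimal sketch, assuming the standard `Setl` builder API and a hypothetical case class `MyObject` matching the CSV schema:

```scala
import com.jcdecaux.setl.Setl

// hypothetical schema of the CSV files stored under ${app.root.dir}
case class MyObject(column1: String, column2: String)

// build the SETL entry point; the default config loader resolves the
// environment-specific configuration described above
val setl: Setl = Setl.builder()
  .withDefaultConfigLoader()
  .getOrCreate()

// register the repository defined under the "myRepository" key
setl.setSparkRepository[MyObject]("myRepository")
```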
To compile with the local configuration using Maven, just run:
```shell
mvn compile
```
To compile with the production configuration, pass the JVM property `app.environment`:
```shell
mvn compile -Dapp.environment=prod
```
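
For a plain `mvn compile` to fall back to the local environment, the POM presumably defines a default value for this property; the `<properties>` entry below is an assumption, not shown above:

```xml
<properties>
  <!-- assumed default so that a plain `mvn compile` resolves to local.conf -->
  <app.environment>local</app.environment>
</properties>
```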
Make sure that your resources directory has filtering enabled, so that Maven substitutes `${app.environment}` in `application.conf` at build time:
```xml
<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
  </resource>
</resources>
```
## Dependencies
**SETL** currently supports the following data sources. You won't need to provide these libraries in your project (except the JDBC driver):
With SETL, an ETL application could be represented by a `Pipeline`. A `Pipeline` contains multiple `Stages`. In each stage, we could find one or several `Factories`.

The class `Factory[T]` is an abstraction of a data transformation that will produce an object of type `T`. It has 4 methods (*read*, *process*, *write* and *get*) that should be implemented by the developer.

The class `SparkRepository[T]` is a data access layer abstraction. It could be used to read/write a `Dataset[T]` from/to a datastore. It should be defined in a configuration file. You can have as many SparkRepositories as you want.

The entry point of a SETL project is the object `com.jcdecaux.setl.Setl`, which will handle the pipeline and spark repository instantiation.
### Show me some code
You can find the following tutorial code in [the starter template of SETL](https://github.com/qxzzxq/setl-template). Go and clone it :)

Here we show a simple example of creating and saving a **Dataset[TestObject]**. The case class **TestObject** is defined as follows:
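
Its fields match the `Dataset[TestObject]` schema shown in the Mermaid diagram at the end of this README:

```scala
case class TestObject(
  partition1: Int,
  partition2: String,
  clustering1: String,
  value: Long
)
```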
Suppose that we want to save our output into `src/main/resources/test_csv`. We can create a configuration file **local.conf** in `src/main/resources` with the following content, which defines the target datastore to save our dataset:
```txt
testObjectRepository {
  storage = "CSV"
  path = "src/main/resources/test_csv"
  inferSchema = "true"
  delimiter = ";"
  header = "true"
  saveMode = "Append"
}
```
In our `App.scala` file, we build `Setl` and register this data store:
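
A minimal sketch of that file, assuming the standard `Setl` builder API (the exact template code may differ):

```scala
import com.jcdecaux.setl.Setl

// build the SETL entry point with the default configuration loader
val setl: Setl = Setl.builder()
  .withDefaultConfigLoader()
  .getOrCreate()

// register the repository defined under the "testObjectRepository" key
setl.setSparkRepository[TestObject]("testObjectRepository")
```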
We will create our `Dataset[TestObject]` inside a `Factory[Dataset[TestObject]]`. A `Factory[A]` will always produce an object of type `A`, and it contains 4 abstract methods that you need to implement:
```scala
class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {

  // ...

  override def write(): MyFactory.this.type = {
    repo.save(output)  // use the repository to save the output
    this
  }

  // ...
}
```
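
Only the `write` method is shown above. Here is a sketch of how the rest of such a factory could look; the `@Delivery`-injected repository, the sample data, and the import paths are assumptions modelled on the SETL starter template:

```scala
import com.jcdecaux.setl.annotation.Delivery
import com.jcdecaux.setl.storage.repository.SparkRepository
import com.jcdecaux.setl.transformation.Factory
import com.jcdecaux.setl.util.HasSparkSession
import org.apache.spark.sql.Dataset

class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {

  import spark.implicits._

  // delivered by the pipeline: the repository registered in App.scala
  @Delivery
  private[this] val repo = SparkRepository[TestObject]

  private[this] var output = spark.emptyDataset[TestObject]

  // nothing to read: the data is generated in process()
  override def read(): MyFactory.this.type = this

  override def process(): MyFactory.this.type = {
    output = Seq(
      TestObject(1, "a", "A", 1L),
      TestObject(2, "b", "B", 2L)
    ).toDS()
    this
  }

  override def write(): MyFactory.this.type = {
    repo.save(output)  // use the repository to save the output
    this
  }

  override def get(): Dataset[TestObject] = output
}
```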
#### Define the pipeline
To execute the factory, we should add it into a pipeline.
When we call `setl.newPipeline()`, **Setl** will instantiate a new **Pipeline** and configure all the registered repositories as inputs of the pipeline. Then we can call `addStage` to add our factory into the pipeline.
```scala
val pipeline = setl
  .newPipeline()
  .addStage(new MyFactory())
```
#### Run our pipeline
```scala
pipeline.describe().run()
```
The dataset will be saved into `src/main/resources/test_csv`.
#### What's more?
As our `MyFactory` produces a `Dataset[TestObject]`, it can be used by other factories of the same pipeline.
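
For instance, a downstream factory can declare a `Dataset[TestObject]` input and let the pipeline deliver `MyFactory`'s output to it. A sketch, assuming the same `@Delivery` mechanism and the `AnotherFactory` / `Factory[String]` shape that appears in the diagram below:

```scala
class AnotherFactory extends Factory[String] with HasSparkSession {

  import spark.implicits._

  // delivered by the pipeline: the Dataset[TestObject] produced by MyFactory
  @Delivery
  private[this] val testObjects: Dataset[TestObject] = spark.emptyDataset[TestObject]

  private[this] var output: String = ""

  override def read(): AnotherFactory.this.type = this

  override def process(): AnotherFactory.this.type = {
    output = s"MyFactory produced ${testObjects.count()} objects"
    this
  }

  // nothing to persist for this factory
  override def write(): AnotherFactory.this.type = this

  override def get(): String = output
}
```

Adding it is just one more `addStage` call on the same pipeline.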
You can generate a [Mermaid diagram](https://mermaid-js.github.io/mermaid/#/) by doing:
```scala
pipeline.showDiagram()
```
You will have some log like this:
```
--------- MERMAID DIAGRAM ---------
classDiagram
class MyFactory {
  <<Factory[Dataset[TestObject]]>>
  +SparkRepository[TestObject]
}

class DatasetTestObject {
  <<Dataset[TestObject]>>
  >partition1: Int
  >partition2: String
  >clustering1: String
  >value: Long
}

DatasetTestObject <|.. MyFactory : Output
class AnotherFactory {
  <<Factory[String]>>
  +Dataset[TestObject]
}

class StringFinal {
  <<String>>
}

StringFinal <|.. AnotherFactory : Output
class SparkRepositoryTestObjectExternal {
  <<SparkRepository[TestObject]>>
}

AnotherFactory <|-- DatasetTestObject : Input
MyFactory <|-- SparkRepositoryTestObjectExternal : Input
```
Or you can try the live editor: https://mermaid-js.github.io/mermaid-live-editor

You can either copy the code into a Markdown viewer or just copy the link into your browser ([link](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiY2xhc3NEaWFncmFtXG5jbGFzcyBNeUZhY3Rvcnkge1xuICA8PEZhY3RvcnlbRGF0YXNldFtUZXN0T2JqZWN0XV0-PlxuICArU3BhcmtSZXBvc2l0b3J5W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIERhdGFzZXRUZXN0T2JqZWN0IHtcbiAgPDxEYXRhc2V0W1Rlc3RPYmplY3RdPj5cbiAgPnBhcnRpdGlvbjE6IEludFxuICA-cGFydGl0aW9uMjogU3RyaW5nXG4gID5jbHVzdGVyaW5nMTogU3RyaW5nXG4gID52YWx1ZTogTG9uZ1xufVxuXG5EYXRhc2V0VGVzdE9iamVjdCA8fC4uIE15RmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgQW5vdGhlckZhY3Rvcnkge1xuICA8PEZhY3RvcnlbU3RyaW5nXT4-XG4gICtEYXRhc2V0W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIFN0cmluZ0ZpbmFsIHtcbiAgPDxTdHJpbmc-PlxuICBcbn1cblxuU3RyaW5nRmluYWwgPHwuLiBBbm90aGVyRmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgU3BhcmtSZXBvc2l0b3J5VGVzdE9iamVjdEV4dGVybmFsIHtcbiAgPDxTcGFya1JlcG9zaXRvcnlbVGVzdE9iamVjdF0-PlxuICBcbn1cblxuQW5vdGhlckZhY3RvcnkgPHwtLSBEYXRhc2V0VGVzdE9iamVjdCA6IElucHV0XG5NeUZhY3RvcnkgPHwtLSBTcGFya1JlcG9zaXRvcnlUZXN0T2JqZWN0RXh0ZXJuYWwgOiBJbnB1dFxuIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifX0=)) 🍻