@@ -17,9 +17,17 @@ The input string can be a file path or URL.

## Read from CSV

+ Before you can read data from CSV, make sure you have the following dependency:
+
+ ```kotlin
+ implementation("org.jetbrains.kotlinx:dataframe-csv:$dataframe_version")
+ ```
+
+ It's included by default if you already have `org.jetbrains.kotlinx:dataframe:$dataframe_version`.
+
To read a CSV file, use the `.readCsv()` function.

- Since DataFrame v0.15, a new experimental CSV integration is available.
+ Since DataFrame v0.15, this new CSV integration is available.
It is faster and more flexible than the old one, as it is now based on
[Deephaven CSV](https://github.com/deephaven/deephaven-csv).

@@ -43,6 +51,21 @@ import java.net.URL
DataFrame.readCsv(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
```

+ Zip and GZip files are supported as well.
+
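For instance, a compressed CSV can be read directly. A minimal sketch (the file name is hypothetical, and compression is assumed to be detected from the `.gz` extension):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCsv

// Hypothetical file; the gzip compression is assumed to be inferred from the extension.
val df = DataFrame.readCsv("data/jetbrains_repositories.csv.gz")
```
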
+ To read CSV from a `String`:
+
+ ```kotlin
+ val csv = """
+ A,B,C,D
+ 12,tuv,0.12,true
+ 41,xyz,3.6,not assigned
+ 89,abc,7.1,false
+ """.trimIndent()
+
+ DataFrame.readCsvStr(csv)
+ ```
+
### Specify delimiter

By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
@@ -60,9 +83,19 @@ val df = DataFrame.readCsv(

<!---END-->

+ Aside from the delimiter, there are many other parameters you can adjust.
+ These include the header, the number of rows to skip, the number of rows to read, the quote character, and more.
+ Check out the KDocs for more information.
+
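A sketch combining a few of these parameters; the exact parameter names (`header`, `skipLines`, `readLines`, `quote`) are assumed to match the `readCsv` KDocs:

```kotlin
val df = DataFrame.readCsv(
    file,
    header = listOf("A", "B", "C", "D"), // provide column names when the file has no header row
    skipLines = 1L,                      // skip a leading comment line before the data
    readLines = 100L,                    // read at most 100 rows
    quote = '\'',                        // use single quotes as the quote character
)
```
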
### Column type inference from CSV

- Column types are inferred from the CSV data. Suppose that the CSV from the previous
+ Column types are inferred from the CSV data.
+
+ We rely on the fast implementation of [Deephaven CSV](https://github.com/deephaven/deephaven-csv) for inferring and
+ parsing to (nullable) `Int`, `Long`, `Double`, and `Boolean` types.
+ For other types, we fall back to [the parse operation](parse.md).
+
+ Suppose that the CSV from the previous
example had the following content:

<table>
@@ -81,15 +114,15 @@ C: Double
D: Boolean?
```

- [`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with JSON object in column D:
+ [`DataFrame`](DataFrame.md) can also [parse](parse.md) columns as JSON, so when reading the following table with a JSON object in column D:

<table>
<tr><th>A</th><th>D</th></tr>
<tr><td>12</td><td>{"B":2,"C":3}</td></tr>
<tr><td>41</td><td>{"B":3,"C":2}</td></tr>
</table>

- We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with 2 children columns:
+ We get this data schema, where D is a [`ColumnGroup`](DataColumn.md#columngroup) with two nested columns:

```text
A: Int
@@ -123,10 +156,10 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
<tr><td>41,111</td></tr>
</table>

- Here a comma can be decimal or thousands separator, thus different values.
- You can deal with it in two ways:
+ Here, a comma can be either a decimal separator or a thousands separator, resulting in different values.
+ You can deal with this in multiple ways, for instance:

- 1) Provide locale as a parser option
+ 1) Provide a locale as a parser option:

<!---FUN readNumbersWithSpecificLocale-->

@@ -168,23 +201,26 @@ columns like this may be recognized as simple `String` values rather than actual

You can fix this whenever you [parse](parse.md) a string-based column (e.g., using [`DataFrame.readCsv()`](read.md#read-from-csv),
[`DataFrame.readTsv()`](read.md#read-from-csv), or [`DataColumn<String>.convertTo<>()`](convert.md)) by providing
- a custom date-time pattern. There are two ways to do this:
+ a custom date-time pattern.
+
+ There are two ways to do this:

1) By providing the date-time pattern as a raw string to the `ParserOptions` argument:

- <!---FUN readNumbersWithSpecificDateTimePattern-->
+ <!---FUN readDatesWithSpecificDateTimePattern-->

```kotlin
val df = DataFrame.readCsv(
    file,
    parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
)
```
+
<!---END-->

2) By providing a `DateTimeFormatter` to the `ParserOptions` argument:

- <!---FUN readNumbersWithSpecificDateTimeFormatter-->
+ <!---FUN readDatesWithSpecificDateTimeFormatter-->

```kotlin
val df = DataFrame.readCsv(
@@ -204,6 +240,50 @@ The result will be a dataframe with properly parsed `DateTime` columns.
>
> For more details, see the [`parse` operation](parse.md).

+ ### Provide a default type for all columns
+
+ While you can provide a `ColType` per column, you might not
+ always know how many columns there are or what their names are.
+ In such cases, you can disable type inference for all columns
+ by providing a default type:
+
+ <!---FUN readDatesWithDefaultType-->
+
+ ```kotlin
+ val df = DataFrame.readCsv(
+     file,
+     colTypes = mapOf(ColType.DEFAULT to ColType.String),
+ )
+ ```
+
+ <!---END-->
+
+ This default can also be combined with specific types for other columns, for example:
+
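A minimal sketch of such a combination, assuming the file has a column named "A"; "A" is read as `Int`, while every other column falls back to `String`:

```kotlin
val df = DataFrame.readCsv(
    file,
    colTypes = mapOf(
        "A" to ColType.Int,                // explicit type for a known column
        ColType.DEFAULT to ColType.String, // default for all remaining columns
    ),
)
```
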
+ ### Unlocking Deephaven CSV features
+
+ For each group of functions (`readCsv`, `readDelim`, `readTsv`, etc.),
+ we provide one overload that has the `adjustCsvSpecs` parameter.
+ This is an advanced option because it exposes the
+ [CsvSpecs.Builder](https://github.com/deephaven/deephaven-csv/blob/main/src/main/java/io/deephaven/csv/CsvSpecs.java)
+ of the underlying Deephaven implementation.
+ Generally, we don't recommend using this feature unless there's no other way to achieve your goal.
+
+ For example, to enable the (unconfigurable, but very fast) [ISO DateTime Parser of Deephaven CSV](https://medium.com/@deephavendatalabs/a-high-performance-csv-reader-with-type-inference-4bf2e4baf2d1):
+
+ <!---FUN readDatesWithDeephavenDateTimeParser-->
+
+ ```kotlin
+ val df = DataFrame.readCsv(
+     inputStream = file.openStream(),
+     adjustCsvSpecs = { // it: CsvSpecs.Builder
+         it.putParserForName("date", Parsers.DATETIME)
+     },
+ )
+ ```
+
+ <!---END-->
+
## Read from JSON

To read a JSON file, use the `.readJson()` function. JSON can be read from a file or a URL.
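
A minimal sketch (the file path here is hypothetical):

```kotlin
// Reads the JSON into a DataFrame; nested objects become column groups,
// similar to the JSON-in-CSV example above.
val df = DataFrame.readJson("data/jetbrains_repositories.json")
```
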
@@ -434,6 +514,8 @@ Before you can read data from Excel, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
```

+ It's included by default if you already have `org.jetbrains.kotlinx:dataframe:$dataframe_version`.
+
To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL.
The supported Excel spreadsheet formats are xls and xlsx.

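A minimal sketch (the file path and sheet name are hypothetical; `sheetName` is optional):

```kotlin
// Reads one sheet of an .xlsx file into a DataFrame.
val df = DataFrame.readExcel("data/sample.xlsx", sheetName = "Sheet1")
```
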
@@ -484,6 +566,8 @@ Before you can read data from Apache Arrow format, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
```

+ It's included by default if you already have `org.jetbrains.kotlinx:dataframe:$dataframe_version`.
+
To read Apache Arrow formats, use the `.readArrowFeather()` function:

<!---FUN readArrowFeather-->