Operations
A data transformation pipeline usually consists of several modification operations, such as filtering, sorting, grouping, pivoting, and adding or removing columns. The Kotlin DataFrame API is designed in a functional style, so the whole processing pipeline can be written as a single statement with a sequential chain of operations. A DataFrame object is immutable: every operation returns a new DataFrame instance, reusing the underlying data structures as much as possible.
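As a rough illustration of such a chain, the sketch below filters, sorts, adds a column and removes a column in one statement. The sample data, the function names and the isSenior column are made up for this example. It assumes the Kotlin DataFrame library is on the classpath and accesses columns by name through the String API.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, invented for this illustration.
fun samplePeople(): DataFrame<*> =
    dataFrameOf("name", "age", "city")(
        "Alice", 15, "London",
        "Bob", 45, "Dubai",
        "Charlie", 20, "Moscow",
        "Mark", 30, "Milan",
    )

// One statement, four chained operations; each step returns a new DataFrame.
fun adultsReport(df: DataFrame<*>): DataFrame<*> =
    df.filter { "age"<Int>() >= 18 }              // keep rows that match a condition
        .sortByDesc("age")                        // sort rows, oldest first
        .add("isSenior") { "age"<Int>() >= 40 }   // add a computed column
        .remove("city")                           // remove a column

fun main() {
    val people = samplePeople()
    val report = adultsReport(people)
    println(report)
    // The source frame is left untouched because DataFrame is immutable.
    println(people.rowsCount())
}
```

Because each call returns a new DataFrame, any prefix of the chain can be assigned to a variable and reused without affecting later steps.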
Multiplex operations
Simple operations (such as filter or select) return a new DataFrame immediately, while more complex operations return an intermediate object that is used to configure the operation further. Let's call such operations multiplex.
Every multiplex operation configuration consists of:
column selector that is used to select target columns for the operation
additional configuration functions
terminal function that returns the modified DataFrame
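Most multiplex operations end with an into or a with function. The following naming convention is used:
into defines the column names where the operation results should be stored
with defines a row-wise data transformation given as a row expression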
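To make the into and with endings concrete, here is a minimal sketch of two multiplex operations. The sample data and the function names are hypothetical. The update chain uses where as its additional configuration step and with as its terminal function, while rename goes straight from the column selector to into.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, invented for this illustration.
fun sampleAges(): DataFrame<*> =
    dataFrameOf("name", "age")(
        "Alice", 15,
        "Bob", 45,
        "Charlie", 20,
    )

// update: column selector -> where (additional configuration) -> with (terminal function)
fun raiseToMinimumAge(df: DataFrame<*>): DataFrame<*> =
    df.update { "age"<Int>() }   // select the target column
        .where { it < 18 }       // restrict which cells are changed
        .with { 18 }             // row expression producing the new value

// rename: column selector -> into (terminal function)
fun renameNameColumn(df: DataFrame<*>): DataFrame<*> =
    df.rename("name").into("fullName")

fun main() {
    val df = sampleAges()
    println(raiseToMinimumAge(df))
    println(renameNameColumn(df))
}
```

Nothing is computed until the terminal function is called: the intermediate objects returned by update, where and rename only collect configuration.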
List of DataRow operations
index(): Int — sequential row number in DataFrame, starts from 0
prev(): DataRow? — previous row (null for the first row)
next(): DataRow? — next row (null for the last row)
diff(T) { rowExpression }: T / diffOrNull { rowExpression }: T? — difference between the results of a row expression calculated for current and previous rows
explode(columns): DataFrame<T> — spread lists and DataFrame objects vertically into new rows
values(): List<Any?> — list of all cell values from the current row
valuesOf<T>(): List<T> — list of values of the given type
columnsCount(): Int — number of columns
columnNames(): List<String> — list of all column names
columnTypes(): List<KType> — list of all column types
namedValues(): List<NameValuePair<Any?>> — list of name-value pairs where name is a column name and value is cell value
namedValuesOf<T>(): List<NameValuePair<T>> — list of name-value pairs where value has given type
transpose(): DataFrame<NameValuePair<*>> — DataFrame of two columns: name: String is column names and value: Any? is cell values
transposeTo<T>(): DataFrame<NameValuePair<T>> — DataFrame of two columns: name: String is column names and value: T is cell values
getRow(Int): DataRow — row from DataFrame by row index
getRows(Iterable<Int>): DataFrame — DataFrame with subset of rows selected by absolute row index
relative(Iterable<Int>): DataFrame — DataFrame with subset of rows selected by relative row index: relative(-1..1) will return previous, current and next row. Requested indices will be coerced to the valid range and invalid indices will be skipped
getValue<T>(columnName) — cell value of type T by this row and given columnName
getValueOrNull<T>(columnName) — cell value of type T? by this row and given columnName or null if there's no such column
get(column): T — cell value by this row and given column
String.invoke<T>(): T — cell value of type T by this row and given this column name
ColumnPath.invoke<T>(): T — cell value of type T by this row and given this column path
ColumnReference.invoke(): T — cell value of type T by this row and given this column
df() — DataFrame that current row belongs to
List of DataRow statistics
The following statistics are available for DataRow:
rowSum
rowMean
rowStd
These statistics will be applied only to values of appropriate types, and incompatible values will be ignored. For example, if a dataframe has columns of types String and Int, rowSum() will compute the sum of the Int values in the row and ignore String values.
To apply statistics only to values of a particular type, use the -Of versions:
rowSumOf<T>
rowMeanOf<T>
rowStdOf<T>
rowMinOf<T>
rowMaxOf<T>
rowMedianOf<T>
rowPercentileOf<T>
List of DataFrame operations
add — add columns
addId — add id column
append — add rows
columns/columnNames/columnTypes — get list of top-level columns, column names or column types
columnsCount — number of top-level columns
convert — change column values and/or column types
corr — pairwise correlation of columns
count — number of rows that match condition
countDistinct — number of unique rows
cumSum — cumulative sum of column values
describe — basic column statistics
distinct/distinctBy — remove duplicated rows
drop/dropLast/dropWhile/dropNulls/dropNA/dropNaNs — remove rows by condition
duplicate — duplicate rows
explode — spread lists and DataFrame objects vertically into new rows
first/firstOrNull — find first row by condition
flatten — remove column groupings recursively
forEachRow/forEachColumn — iterate over rows or columns
format — conditional formatting for cell rendering
gather — convert pairs of column names and values into new columns
getColumn/getColumnOrNull/getColumnGroup/getColumns — get one or several columns
group — group columns into ColumnGroup
groupBy — group rows by key columns
implode — collapse column values into lists grouping by other columns
inferType — infer column type from column values
insert — insert column
joinWith — join two DataFrame objects by an expression that evaluates joined DataRows to Boolean
last/lastOrNull — find last row by condition
map — map columns into new DataFrame or DataColumn
merge — merge several columns into one
move — move columns or change column groupings
parse — try to convert strings into other types
pivot/pivotCounts/pivotMatches — convert values into new columns
remove — remove columns
rename — rename columns
reorder/reorderColumnsBy/reorderColumnsByName — reorder columns
replace — replace columns
reverse — reverse rows
rows/rowsReversed — get rows in direct or reversed order
rowsCount — number of rows
schema — schema of columns: names, types and hierarchy
select — select subset of columns
shuffle — reorder rows randomly
single/singleOrNull — get single row by condition
sortBy/sortByDesc/sortWith — sort rows
split — split column values into new rows/columns or in place into lists
toList/toListOf — export DataFrame into a list of data classes
toMap — export DataFrame into a map from column names to column values
unfold — unfold objects (normal class instances) in columns according to their properties
ungroup — remove column groupings
update — update column values preserving column types
valueCounts — counts for unique values
Shortcut operations
Some operations are shortcuts for more general operations:
valueCounts is a special case of groupBy
pivotCounts, pivotMatches are special cases of pivot
These shortcuts make the most common DataFrame transformations easier to apply, but you can always fall back to the general operations if you need more customization.
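The relationship between valueCounts and groupBy can be sketched as follows. The sample data, the function names and the countsAsPairs helper are hypothetical, and the comparison ignores row order on purpose, since the two forms are not assumed to order their rows the same way. It also assumes both forms store the result in a column named count by default.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, invented for this illustration.
fun sampleCities(): DataFrame<*> =
    dataFrameOf("name", "city")(
        "Alice", "London",
        "Bob", "Dubai",
        "Charlie", "London",
    )

// Shortcut form: count occurrences of each distinct city.
fun cityCountsShortcut(df: DataFrame<*>): DataFrame<*> =
    df.valueCounts { "city"<String>() }

// General form: group rows by city, then count the rows in each group.
fun cityCountsGeneral(df: DataFrame<*>): DataFrame<*> =
    df.groupBy("city").count()

// Hypothetical helper: (city, count) pairs as a set, so row order does not matter.
fun countsAsPairs(df: DataFrame<*>): Set<Pair<Any?, Any?>> =
    df.rows().map { it["city"] to it["count"] }.toSet()

fun main() {
    val df = sampleCities()
    println(cityCountsShortcut(df))
    println(cityCountsGeneral(df))
    println(countsAsPairs(cityCountsShortcut(df)) == countsAsPairs(cityCountsGeneral(df)))
}
```

The general form is the one to reach for when the groups need something other than a plain count, for example a different aggregation per group.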