DataSink

Description

The DataSink framework is a utility framework that helps configuring and writing DataFrames.

This framework provides for reading from a given path with the specified format like avro, parquet, orc, json, csv, jdbc...

The framework is composed of two main traits:

DataSink, which is created based on a FileSourceConfiguration class and provides two main functions:

def writer(data: DataFrame): Try[DataFrameWriter[Row]]
def write(data: DataFrame): Try[DataFrame]

DataSinkConfiguration: a marker trait to define DataSink configuration classes

Usage

The framework provides the following predefined DataSink implementations:

For convenience the DataAwareSinkFactory trait and the default implementation are provided. To create a DataSink out of a given TypeSafe Config instance, one can call

DataAwareSinkFactory( someDataSinkConfigurationInstance )

Also, in order to easily extract the configuration from a given TypeSafe Config instance, the FormatAwareDataSinkConfiguration factory is provided.

FormatAwareDataSinkConfiguration( someTypesafeConfigurationInstance )

There is a convenience implicit decorator for DataFrames, available through the

import org.tupol.spark.io._
import org.tupol.spark.io.implicits._

import statements. The org.tupol.spark.io package contains the implicit factories for data sinks and the org.tupol.spark.implicits contains the actual DataFrame decorator.

This allows us to create the DataSink by calling the sink() function on a DataFrame, passing a DataSinkConfiguration configuration instance.

import org.tupol.spark.io.{pureconf, _}
import org.tupol.spark.io.implicits._

def dataFrame: DataFrame = ???
def dataSinkConfiguration: DataSinkConfiguration = ???
val dataSink: DataSink[_, _, _] = dataFrame.sink(dataSinkConfiguration)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataSink

Description

Usage

Configuration Parameters

FilesExpand file tree

data-sink.md

Latest commit

History

data-sink.md

File metadata and controls

DataSink

Description

Usage

Configuration Parameters