The DataSink framework is a utility framework that helps configuring and writing DataFrames.
This framework provides for reading from a given path with the specified format like avro, parquet, orc, json,
csv, jdbc...
The framework is composed of two main traits:
DataSink, which is created based on aFileSourceConfigurationclass and provides two main functions:def writer(data: DataFrame): Try[DataFrameWriter[Row]] def write(data: DataFrame): Try[DataFrame]
DataSinkConfiguration: a marker trait to defineDataSinkconfiguration classes
The framework provides the following predefined DataSink implementations:
- FileDataSink
- JdbcDataSink
- GenericDataSink
- FileStreamDataSink
- KafkaStreamDataSink
- GenericStreamDataSink
For convenience the DataAwareSinkFactory trait and the default implementation are provided.
To create a DataSink out of a given TypeSafe Config instance, one can call
DataAwareSinkFactory( someDataSinkConfigurationInstance )Also, in order to easily extract the configuration from a given TypeSafe Config instance,
the FormatAwareDataSinkConfiguration factory is provided.
FormatAwareDataSinkConfiguration( someTypesafeConfigurationInstance )There is a convenience implicit decorator for DataFrames, available through the
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._import statements.
The org.tupol.spark.io package contains the implicit factories for data sinks and the org.tupol.spark.implicits
contains the actual DataFrame decorator.
This allows us to create the DataSink by calling the sink() function on a DataFrame,
passing a DataSinkConfiguration configuration instance.
import org.tupol.spark.io.{pureconf, _}
import org.tupol.spark.io.implicits._
def dataFrame: DataFrame = ???
def dataSinkConfiguration: DataSinkConfiguration = ???
val dataSink: DataSink[_, _, _] = dataFrame.sink(dataSinkConfiguration)