CSV Sink

A basic sink to write processed data to a single CSV file.

It's meant to be used mostly for local debugging.

How To Use

To use a CSV sink, you need to create an instance of CSVSink and pass it to the StreamingDataFrame.sink() method:

from quixstreams import Application
from quixstreams.sinks.core.csv import CSVSink

app = Application(broker_address="localhost:9092")
topic = app.topic("input-topic")

# Initialize a CSV sink with a file path 
csv_sink = CSVSink(path="file.csv")

sdf = app.dataframe(topic)
# Do some processing here ...
# Sink data to a CSV file
sdf.sink(csv_sink)

if __name__ == '__main__':
    app.run()

How It Works

CSVSink is a batching sink.
It batches processed records in memory per topic partition, and writes them to the file when a checkpoint is committed.

The output file format is the following:

key,value,timestamp,topic,partition,offset
b'afd7e8ab-4af5-4322-8417-dbfc7a0d7694',"{""number"": 0}",1722945524540,numbers-10k-keys,0,0
b'557bae7f-14b6-46c4-abc3-12f232b54c8e',"{""number"": 1}",1722945524546,numbers-10k-keys,0,1

Serialization Formats

By default, CSVSink serializes record keys by calling str() on them, and message values with json.dumps().

To use your own serializer, pass key_serializer and value_serializer to CSVSink:

import json
from quixstreams.sinks.core.csv import CSVSink

# Initialize a CSVSink with a file path 
csv_sink = CSVSink(
    path="file.csv",
    # Define custom serializers for keys and values here.
    # The callables must accept one argument for key/value, and return a string
    key_serializer=lambda key: json.dumps(key),
    value_serializer=lambda value: str(value),
)

Delivery Guarantees

The CSVSink provides at-least-once guarantees, and the resulting CSV file may contain duplicated rows of data if there were errors during processing.