Missing Data
Handling missing data is a crucial aspect of data processing, especially in streaming environments. This document outlines what is considered missing data within the StreamingDataFrame class and how it is managed.
Types of Missing Data in Streaming
- Missing Column: This occurs when a field is not present in the message at all. In streaming data, schemas can be dynamic, meaning that not all fields are required to be present in every message. This type of missing data is handled by the system's ability to adapt to changes in the schema over time.
- Missing Value: This occurs when a field is present in the message, but its value is None. This indicates that the data for that field is missing, even though the field itself is part of the message schema.
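The two cases above can be told apart with ordinary dictionary checks. The following is a minimal sketch that assumes messages are plain Python dicts; the helper `classify_field` and the sample field names are hypothetical and are not part of the library.

```python
def classify_field(message: dict, field: str) -> str:
    """Classify a field as 'missing column', 'missing value', or 'present'."""
    if field not in message:
        # The key is absent from the message entirely.
        return "missing column"
    if message[field] is None:
        # The key exists, but it carries no data.
        return "missing value"
    return "present"


# Hypothetical sample messages with a dynamic schema.
messages = [
    {"id": 1, "temperature": 21.5},
    {"id": 2, "temperature": None},
    {"id": 3},
]

statuses = [classify_field(m, "temperature") for m in messages]
print(statuses)
```

Checking key membership before checking for None keeps the two categories distinct, which `message.get(field)` alone would not do.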
Handling Missing Data in Aggregations
- Rows with None Values: These rows are ignored during aggregation operations. This means that if a row contains a None value, it will not contribute to the aggregation result. This applies to the following aggregations: Count, Sum, Mean, Min, and Max.
- NaN Values: Unlike None, NaN values are propagated to the aggregation result. This is because NaN is not considered missing data in the same way as None. Instead, it represents a numerical value that is undefined or unrepresentable, and it is treated as such in calculations.
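The difference between the two rules can be illustrated in plain Python. This sketch only mimics the behavior described above for Sum and Mean; it is not the library's aggregation code. The helper names are hypothetical, and returning None for a mean over no values is an assumption made for the example.

```python
import math


def sum_skipping_none(values):
    """Sum the values, ignoring None entries; NaN is kept and propagates."""
    return sum(v for v in values if v is not None)


def mean_skipping_none(values):
    """Average the values, ignoring None entries; NaN is kept and propagates."""
    kept = [v for v in values if v is not None]
    if not kept:
        # Assumption for this sketch: no usable values yields None.
        return None
    return sum(kept) / len(kept)


with_none = [1.0, None, 2.0]
with_nan = [1.0, float("nan"), 2.0]

print(sum_skipping_none(with_none))    # None is skipped
print(mean_skipping_none(with_none))   # averaged over the two remaining values
print(math.isnan(sum_skipping_none(with_nan)))   # NaN reaches the result
print(math.isnan(mean_skipping_none(with_nan)))  # NaN reaches the result
```

Because any arithmetic involving NaN yields NaN, a single NaN in the input is enough to make the sum and the mean NaN, whereas a None simply drops out of the calculation.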
StreamingDataFrame.fill Method
The fill method of the StreamingDataFrame class fills missing columns and missing values in a message with a constant value.
Example Usage
from quixstreams import Application
# Initialize the Application
app = Application(...)
sdf = app.dataframe(...)
Fill missing data for a single column with None:
Fill missing data for multiple columns with None:
Fill missing data with a constant value using a dictionary:
Use a combination of positional and keyword arguments:
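The four call shapes listed above can be mimicked on a plain dict to show their intended effect. This is a sketch under stated assumptions, not the library's implementation: the helper `fill_message` is hypothetical, positional names are assumed to fill with None, a dictionary or keyword arguments are assumed to supply the constant, and only absent or None fields are assumed to be replaced.

```python
from typing import Any


def fill_message(message: dict, *columns: Any, **mapping: Any) -> dict:
    """Return a copy of message with missing columns and None values filled.

    Hypothetical helper for illustration only.
    """
    defaults: dict = {}
    for column in columns:
        if isinstance(column, dict):
            # Dictionary form: column name -> constant fill value.
            defaults.update(column)
        else:
            # Positional column name: fill with None.
            defaults[column] = None
    # Keyword form: column name -> constant fill value.
    defaults.update(mapping)

    filled = dict(message)
    for name, default in defaults.items():
        # Covers both an absent key and a key whose value is None.
        if filled.get(name) is None:
            filled[name] = default
    return filled


message = {"x": 1, "y": None}

single = fill_message(message, "z")                      # one column, None
multiple = fill_message(message, "y", "z")               # several columns, None
constants = fill_message(message, {"y": 0, "z": "n/a"})  # dictionary of constants
combined = fill_message(message, "z", y=0)               # positional plus keyword
print(single, multiple, constants, combined)
```

The helper returns a new dict rather than mutating its input, so the original message stays available for comparison.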