Infilling¶
Missing data happens. Fill the gaps with precision and care.
Why use Time-Stream?¶
Real-world monitoring data inevitably has gaps, whether from communications outages, sensor swaps or power cuts. With Time-Stream, you can fill those missing values with a robust infilling procedure that draws on the time properties of your data.
One-liner¶
With Time-Stream you state intent, not mechanics:
tf.infill("linear", "flow", max_gap_size=3)
That’s it, a single line with clear intent: “I want to use the linear infill method on my flow data, but only for gaps ≤ 3 steps”.
Complex example¶
Let’s take our example 15-minute river flow data that contains a few short outages. You might want to:
Fill only gaps up to 3 consecutive steps (≤45 minutes).
Use linear interpolation to infill the smallest gaps (1 step) and a more complex interpolator (e.g. PCHIP) for gaps of 2 to 3 steps.
Input:
15-minute river flow timeseries, including some missing data.
shape: (110_977, 2)
┌─────────────────────┬───────────┐
│ time ┆ flow │
│ --- ┆ --- │
│ datetime[ns] ┆ f64 │
╞═════════════════════╪═══════════╡
│ 2020-09-01 00:00:00 ┆ 92.860538 │
│ 2020-09-01 00:15:00 ┆ null │
│ 2020-09-01 00:30:00 ┆ 98.103103 │
│ 2020-09-01 00:45:00 ┆ null │
│ 2020-09-01 01:00:00 ┆ null │
│ 2020-09-01 01:15:00 ┆ null │
│ 2020-09-01 01:30:00 ┆ null │
│ 2020-09-01 01:45:00 ┆ 92.085242 │
│ … ┆ … │
│ 2023-10-31 22:15:00 ┆ 84.677897 │
│ 2023-10-31 22:30:00 ┆ 86.0179 │
│ 2023-10-31 22:45:00 ┆ 83.122459 │
│ 2023-10-31 23:00:00 ┆ 76.928613 │
│ 2023-10-31 23:15:00 ┆ 83.320365 │
│ 2023-10-31 23:30:00 ┆ null │
│ 2023-10-31 23:45:00 ┆ null │
│ 2023-11-01 00:00:00 ┆ 84.721752 │
└─────────────────────┴───────────┘
Code:
import time_stream as ts
# Wrap the DataFrame in a TimeFrame object
tf = ts.TimeFrame(df, "time", resolution="PT15M", periodicity="PT15M")
# Infill gaps
tf_infill = tf.infill(
    "linear", "flow", max_gap_size=1
).infill(
    "pchip", "flow", max_gap_size=3
)
Output:
<time_stream.TimeFrame> Size (estimated): 1.71 MB
Time properties:
Time column : time [2020-09-01 00:00:00, ..., 2023-11-01 00:00:00]
Type : Datetime(time_unit='ns', time_zone=None)
Resolution : PT15M
Offset : None
Alignment : PT15M
Periodicity : PT15M
Anchor : TimeAnchor.START
Columns:
flow : Float64 880.56 KB [92.86053821660515, ..., 84.72175228556347]
Key benefits¶
Conservative: You set the rules; the library enforces them.
Time aware: Honours the resolution and periodicity properties of your data.
Simple code: One call conveys the method, scope, and policy.
In more detail¶
The infill() method is the entry point for infilling your timeseries data in Time-Stream. Various infill methods are available, from using alternative data from another source to delegating to well-established methods from the SciPy library. Every method is applied in a way that respects the time integrity of your TimeFrame.
Let’s look at the method in more detail:
- TimeFrame.infill(infill_method, column_name, observation_interval=None, max_gap_size=None, **kwargs)¶
Apply an infilling method to a column in the TimeFrame to fill in missing data.
- Parameters:
infill_method (Union[str, Type[InfillMethod], InfillMethod]) – The method to use for infilling.
column_name (str) – The column to infill.
observation_interval (tuple[datetime, datetime | None] | None) – Optional time interval to limit the infilling to.
max_gap_size (int | None) – The maximum size of consecutive null gaps that should be filled. Any gap larger than this will not be infilled and will remain as null.
**kwargs – Parameters specific to the infill method.
- Return type:
Self
- Returns:
A TimeFrame containing the infilled data.
Infill methods¶
The infill_method parameter lets you choose how missing values are estimated by passing a method name as a string.
Each method has its strengths, depending on your data. The currently available methods are:
Simple infilling techniques¶
"alt_data" - infill using data from an alternative source. Either another column in your TimeFrame, or data from a different DataFrame entirely.
Polynomial interpolation¶
"linear" - straight-line interpolation between neighbouring points. Simple and neutral; best for short gaps.
"quadratic" - second-order polynomial curve. Captures gentle curvature; suitable when changes aren’t linear.
"cubic" - third-order polynomial curve. Smooth transitions; can be useful for variables with cyclical patterns.
"bspline" - B-spline interpolation (configurable order). Flexible piecewise polynomials; you choose the order.
Shape-preserving methods¶
"pchip" - Piecewise Cubic Hermite Interpolating Polynomial. Preserves monotonicity and avoids overshoot; can help to avoid unrealistic fluctuations between values.
"akima" - Akima spline. A smooth curve fit for data with significant local variations and potential outliers.
Note
All methods honour the maximum gap limit: they will only fill runs of missing values up to your chosen length, leaving longer gaps as null.
Note
For infill methods using interpolation techniques, null values at the very beginning and very end of a timeseries will remain null; there is no preceding or following data to constrain the infilling method.
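The two notes above can be pictured as a classification of each run of missing values. The following is a minimal pure-Python sketch of that idea, not Time-Stream's implementation; the function name classify_gaps and the status labels are hypothetical, and missing values are represented as None.

```python
from itertools import groupby


def classify_gaps(values, max_gap_size=None):
    """Return (start_index, length, status) for every run of None in values.

    status is "edge" (the run touches the start or end of the series),
    "too_long" (the run exceeds max_gap_size) or "fillable".
    """
    runs = []
    index = 0
    for is_missing, group in groupby(values, key=lambda v: v is None):
        length = len(list(group))
        if is_missing:
            at_edge = index == 0 or index + length == len(values)
            if at_edge:
                status = "edge"
            elif max_gap_size is not None and length > max_gap_size:
                status = "too_long"
            else:
                status = "fillable"
            runs.append((index, length, status))
        index += length
    return runs


# Illustrative values only: an edge gap, a 1-step gap, a 4-step gap, an edge gap
flow = [None, 92.9, None, 98.1, None, None, None, None, 92.1, None]
gaps = classify_gaps(flow, max_gap_size=3)
```

With max_gap_size=3, only the single-step interior gap is marked as fillable here; the 4-step run and the two edge runs would be left as they are.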
Column selection¶
The column_name parameter lets you specify which column to infill; only this column will be used by the infill
function.
Observation interval¶
The observation_interval parameter restricts infilling to a specific time window. This is useful when:
You only want to work with a subset of data (e.g. one hydrological year).
You want to fill recent gaps without touching the historical record.
You need to use different methods for different parts of your timeseries.
Example:
from datetime import datetime
tf_recent = tf.infill(
    "linear",
    "flow",
    observation_interval=(datetime(2024, 1, 1), datetime(2024, 12, 31)),
)
This will only attempt infilling between 1 January and 31 December 2024; gaps outside that interval remain untouched.
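As a rough illustration of the windowing idea, the sketch below marks which missing values would be candidates for infilling. It is not the library's code: the helper in_window is hypothetical, and it assumes both bounds are inclusive and that an end of None means the window is open-ended (the type hint allows None, but the exact semantics are an assumption here).

```python
from datetime import datetime


def in_window(timestamp, start, end=None):
    """True if timestamp lies in [start, end]; end=None is treated as open-ended."""
    return timestamp >= start and (end is None or timestamp <= end)


times = [
    datetime(2023, 12, 31),
    datetime(2024, 3, 1),
    datetime(2024, 6, 1),
    datetime(2025, 1, 1),
]
flow = [None, 5.0, None, None]
window = (datetime(2024, 1, 1), datetime(2024, 12, 31))

# A value is a candidate only if it is missing AND inside the window
eligible = [
    value is None and in_window(t, *window)
    for t, value in zip(times, flow)
]
```

Only the missing value on 1 June 2024 is a candidate; the missing values before and after the window are not considered.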
Max gap size¶
Use the max_gap_size parameter to prevent over-eager interpolation. Only gaps less than or equal to this size (measured in consecutive missing steps) will be infilled.
Example:
# Fill single-step gaps only (≤ 15 minutes at 15-min resolution)
tf1 = tf.infill("linear", "flow", max_gap_size=1)
# Fill gaps up to 2 steps (≤ 30 minutes)
tf2 = tf.infill("akima", "flow", max_gap_size=2)
Note
The duration that a given “gap size” represents depends on the TimeFrame resolution. At 15-minute resolution, max_gap_size=2 corresponds to 30 minutes; at daily resolution, max_gap_size=2 corresponds to 2 days.
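The arithmetic behind this note can be checked with the standard library. This is a small sketch with hypothetical helper names; the resolutions are written as timedelta objects by hand rather than parsed from ISO 8601 strings such as "PT15M".

```python
from datetime import timedelta


def max_gap_duration(resolution, max_gap_size):
    """Longest gap duration covered by max_gap_size steps at this resolution."""
    return resolution * max_gap_size


def steps_for_duration(resolution, duration):
    """Whole number of steps that fit into the given duration."""
    return duration // resolution


fifteen_min = timedelta(minutes=15)
daily = timedelta(days=1)

print(max_gap_duration(fifteen_min, 2))  # 0:30:00
print(max_gap_duration(daily, 2))  # 2 days, 0:00:00
print(steps_for_duration(fifteen_min, timedelta(minutes=45)))  # 3
```

Working backwards with steps_for_duration can be handy when the tolerance is expressed as a duration (for example "no more than 45 minutes") and needs converting into a step count.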
Examples¶
Alternative data infilling¶
The "alt_data" infill method allows you to fill missing values in a column using data from an alternative source.
You can specify the alternative data in two ways:
From a column within the same TimeFrame: If the alternative data is already present as a column in your current TimeFrame object, you can directly reference it.
From a separate DataFrame: You can provide an entirely separate Polars DataFrame containing the alternative data.
In both cases, you can also apply a correction_factor to the alternative data before it’s used for infilling.
Infilling from a separate DataFrame¶
Let’s say you have a primary dataset with missing “flow” values, and a separate alt_df with an “alt_flow” column that can be used to infill these gaps.
Input:
shape: (110_977, 2)
┌─────────────────────┬───────────┐
│ time ┆ flow │
│ --- ┆ --- │
│ datetime[ns] ┆ f64 │
╞═════════════════════╪═══════════╡
│ 2020-09-01 00:00:00 ┆ 92.860538 │
│ 2020-09-01 00:15:00 ┆ null │
│ 2020-09-01 00:30:00 ┆ 98.103103 │
│ 2020-09-01 00:45:00 ┆ null │
│ 2020-09-01 01:00:00 ┆ null │
│ … ┆ … │
│ 2023-10-31 23:00:00 ┆ 76.928613 │
│ 2023-10-31 23:15:00 ┆ 83.320365 │
│ 2023-10-31 23:30:00 ┆ null │
│ 2023-10-31 23:45:00 ┆ null │
│ 2023-11-01 00:00:00 ┆ 84.721752 │
└─────────────────────┴───────────┘
shape: (110_977, 2)
┌─────────────────────┬────────────┐
│ time ┆ alt_flow │
│ --- ┆ --- │
│ datetime[ns] ┆ f64 │
╞═════════════════════╪════════════╡
│ 2020-09-01 00:00:00 ┆ 116.075673 │
│ 2020-09-01 00:15:00 ┆ 124.726315 │
│ 2020-09-01 00:30:00 ┆ 122.628878 │
│ 2020-09-01 00:45:00 ┆ 125.585763 │
│ 2020-09-01 01:00:00 ┆ 116.101802 │
│ … ┆ … │
│ 2023-10-31 23:00:00 ┆ 96.160767 │
│ 2023-10-31 23:15:00 ┆ 104.150457 │
│ 2023-10-31 23:30:00 ┆ 105.125655 │
│ 2023-10-31 23:45:00 ┆ 100.174939 │
│ 2023-11-01 00:00:00 ┆ 105.90219 │
└─────────────────────┴────────────┘
Code:
tf_infill = tf.infill("alt_data", "flow", alt_df=alt_df, correction_factor=0.75, alt_data_column="alt_flow")
Output:
shape: (110_977, 2)
┌─────────────────────┬───────────┐
│ time ┆ flow │
│ --- ┆ --- │
│ datetime[ns] ┆ f64 │
╞═════════════════════╪═══════════╡
│ 2020-09-01 00:00:00 ┆ 92.860538 │
│ 2020-09-01 00:15:00 ┆ 93.544737 │
│ 2020-09-01 00:30:00 ┆ 98.103103 │
│ 2020-09-01 00:45:00 ┆ 94.189322 │
│ 2020-09-01 01:00:00 ┆ 87.076351 │
│ … ┆ … │
│ 2023-10-31 23:00:00 ┆ 76.928613 │
│ 2023-10-31 23:15:00 ┆ 83.320365 │
│ 2023-10-31 23:30:00 ┆ 78.844241 │
│ 2023-10-31 23:45:00 ┆ 75.131204 │
│ 2023-11-01 00:00:00 ┆ 84.721752 │
└─────────────────────┴───────────┘
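The output values above are consistent with multiplying the alternative value by the correction factor wherever the primary value is missing. The sketch below reproduces the first five rows in plain Python; it is an illustration under that assumption, not the library's implementation, and the function name infill_from_alt is hypothetical. It also assumes the two series are already aligned on the same timestamps.

```python
def infill_from_alt(primary, alt, correction_factor=1.0):
    """Fill None entries in primary with alt * correction_factor.

    Existing primary values are never overwritten; if the alternative
    value is also missing, the gap is left as None.
    """
    return [
        alt_value * correction_factor
        if value is None and alt_value is not None
        else value
        for value, alt_value in zip(primary, alt)
    ]


# First five rows of the input tables shown above
flow = [92.860538, None, 98.103103, None, None]
alt_flow = [116.075673, 124.726315, 122.628878, 125.585763, 116.101802]

filled = infill_from_alt(flow, alt_flow, correction_factor=0.75)
```

The filled values agree with the output table to within the rounding of the displayed inputs.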
Visualisation of interpolation methods¶
A quick visualisation of the results from the different interpolation infill methods is sometimes useful. However, bear in mind that this is a very simplistic example and the correct method to use is dependent on your data. You should do your research into which is most appropriate.
shape: (16, 7)
┌─────────────────────┬──────────┬──────────┬───────────┬──────────┬──────────┬──────────┐
│ time ┆ original ┆ linear ┆ quadratic ┆ cubic ┆ pchip ┆ akima │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════════════════╪══════════╪══════════╪═══════════╪══════════╪══════════╪══════════╡
│ 2024-01-01 00:00:00 ┆ 0.993428 ┆ 0.993428 ┆ 0.993428 ┆ 0.993428 ┆ 0.993428 ┆ 0.993428 │
│ 2024-01-02 00:00:00 ┆ 0.223471 ┆ 0.223471 ┆ 0.223471 ┆ 0.223471 ┆ 0.223471 ┆ 0.223471 │
│ 2024-01-03 00:00:00 ┆ null ┆ 1.259424 ┆ 0.546161 ┆ 0.365671 ┆ 0.889524 ┆ 1.000306 │
│ 2024-01-04 00:00:00 ┆ 2.295377 ┆ 2.295377 ┆ 2.295377 ┆ 2.295377 ┆ 2.295377 ┆ 2.295377 │
│ 2024-01-05 00:00:00 ┆ 4.54606 ┆ 4.54606 ┆ 4.54606 ┆ 4.54606 ┆ 4.54606 ┆ 4.54606 │
│ 2024-01-06 00:00:00 ┆ 1.531693 ┆ 1.531693 ┆ 1.531693 ┆ 1.531693 ┆ 1.531693 ┆ 1.531693 │
│ 2024-01-07 00:00:00 ┆ 2.031726 ┆ 2.031726 ┆ 2.031726 ┆ 2.031726 ┆ 2.031726 ┆ 2.031726 │
│ 2024-01-08 00:00:00 ┆ null ┆ 3.407293 ┆ 3.727706 ┆ 4.038041 ┆ 3.404058 ┆ 3.527325 │
│ 2024-01-09 00:00:00 ┆ null ┆ 4.782859 ┆ 5.658006 ┆ 5.671975 ┆ 5.239764 ┆ 5.265495 │
│ 2024-01-10 00:00:00 ┆ 6.158426 ┆ 6.158426 ┆ 6.158426 ┆ 6.158426 ┆ 6.158426 ┆ 6.158426 │
│ 2024-01-11 00:00:00 ┆ 5.034869 ┆ 5.034869 ┆ 5.034869 ┆ 5.034869 ┆ 5.034869 ┆ 5.034869 │
│ 2024-01-12 00:00:00 ┆ 3.061051 ┆ 3.061051 ┆ 3.061051 ┆ 3.061051 ┆ 3.061051 ┆ 3.061051 │
│ 2024-01-13 00:00:00 ┆ null ┆ 3.692068 ┆ 1.96472 ┆ 1.197094 ┆ 3.10049 ┆ 2.34322 │
│ 2024-01-14 00:00:00 ┆ null ┆ 4.323086 ┆ 2.019954 ┆ 0.375071 ┆ 3.37656 ┆ 2.68997 │
│ 2024-01-15 00:00:00 ┆ null ┆ 4.954103 ┆ 3.226754 ┆ 1.527056 ┆ 4.125893 ┆ 3.853278 │
│ 2024-01-16 00:00:00 ┆ 5.58512 ┆ 5.58512 ┆ 5.58512 ┆ 5.58512 ┆ 5.58512 ┆ 5.58512 │
└─────────────────────┴──────────┴──────────┴───────────┴──────────┴──────────┴──────────┘
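To make the "linear" column concrete, here is a minimal pure-Python sketch of straight-line interpolation across interior gaps, applied to the "original" column above. It assumes equally spaced timestamps (daily, as in the table) and is not the library's implementation, which delegates to established interpolation routines; the function name linear_infill is hypothetical.

```python
def linear_infill(values):
    """Linearly interpolate runs of None that have a valid value on both sides.

    Assumes equally spaced samples. Leading and trailing None values are
    left untouched because there is nothing to interpolate from.
    """
    filled = list(values)
    last_valid = None  # index of the most recent non-missing value
    for i, value in enumerate(values):
        if value is None:
            continue
        if last_valid is not None and i - last_valid > 1:
            # Constant increment per step between the two bounding values
            step = (value - values[last_valid]) / (i - last_valid)
            for j in range(last_valid + 1, i):
                filled[j] = values[last_valid] + step * (j - last_valid)
        last_valid = i
    return filled


original = [
    0.993428, 0.223471, None, 2.295377, 4.54606, 1.531693, 2.031726, None,
    None, 6.158426, 5.034869, 3.061051, None, None, None, 5.58512,
]
linear = linear_infill(original)
```

The interpolated entries match the "linear" column of the table to within the rounding of the displayed values; the other columns rely on higher-order methods and are not reproduced by this sketch.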