Aggregation

Simple code, robust results.

Why use Time-Stream?

Aggregating time series data sounds simple, until you hit real-world edge cases, particularly when working with hydrological and environmental datasets. You have to consider things such as handling leap years, aligning anchor points, working with offset periods (like water years), and tracking the completeness of the data going into each aggregation.

Forget the pain of writing your own custom code; with Time-Stream, you declare the aggregation period you want (daily, monthly, yearly, or any other custom period) and the library handles the rest. Simple code, robust results.

One-liner

Rolling your own aggregation functionality can get complex. With Time-Stream you express the intent directly:

tf.aggregate("P1D", "sum", "precipitation")

That’s it. A single line with clear intent: “I want a daily sum of my precipitation data.”
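In context, a minimal sketch (the synthetic precipitation data here is purely illustrative; the TimeFrame construction mirrors the Time-Stream example further below):

import polars as pl
import time_stream as ts

# A 15-minute precipitation series with constant illustrative values
times = pl.datetime_range(
    pl.datetime(2023, 1, 1), pl.datetime(2023, 1, 3), "15m", eager=True
).alias("time")
df = pl.DataFrame(times).with_columns(pl.lit(0.1).alias("precipitation"))

# Declare the input resolution and periodicity, then aggregate to daily totals
tf = ts.TimeFrame(df, "time", resolution="PT15M", periodicity="PT15M")
tf_daily = tf.aggregate("P1D", "sum", "precipitation")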

Complex example

UK hydrologists often use a “UK water-year” that runs from 09:00 on 1 October through to 09:00 on 1 October of the following year. Writing code to aggregate data (whether that’s 15-minute river level or daily flow values) into water-year statistics can be painful.

Take, for example, generating a water-year “annual maximum” (AMAX) series. You might want to:

  • Generate an output time series whose datetime values mark the start of each water-year

  • Keep hold of the exact timestamp at which each maximum occurred

  • Know exactly how many data points were available in each water year

Here’s how you might have to do this in pandas and in Polars, and then how it looks in Time-Stream:

Input:

A 15-minute river flow time series, including some missing data.

shape: (110_977, 2)
┌─────────────────────┬───────────┐
│ time                ┆ flow      │
│ ---                 ┆ ---       │
│ datetime[ns]        ┆ f64       │
╞═════════════════════╪═══════════╡
│ 2020-09-01 00:00:00 ┆ 35.744844 │
│ 2020-09-01 00:15:00 ┆ null      │
│ 2020-09-01 00:30:00 ┆ 79.728137 │
│ 2020-09-01 00:45:00 ┆ null      │
│ 2020-09-01 01:00:00 ┆ null      │
│ 2020-09-01 01:15:00 ┆ null      │
│ 2020-09-01 01:30:00 ┆ null      │
│ 2020-09-01 01:45:00 ┆ 17.576324 │
│ …                   ┆ …         │
│ 2023-10-31 22:15:00 ┆ 66.126464 │
│ 2023-10-31 22:30:00 ┆ 79.396944 │
│ 2023-10-31 22:45:00 ┆ 54.537908 │
│ 2023-10-31 23:00:00 ┆ -0.017411 │
│ 2023-10-31 23:15:00 ┆ 58.686884 │
│ 2023-10-31 23:30:00 ┆ null      │
│ 2023-10-31 23:45:00 ┆ null      │
│ 2023-11-01 00:00:00 ┆ 74.769198 │
└─────────────────────┴───────────┘

Code:

pandas:

import pandas as pd

df = get_example_df(library="pandas")

# Shift time back by 9 hours to align water year boundaries with midnight Oct 1st
# This makes Oct 1 9am appear as Oct 1 midnight for resampling purposes
df.index = df.index - pd.Timedelta(hours=9)

# Aggregate using resample
# We can use "YS-OCT" (Year Start in October) as the resample period
resampled = df.resample("YS-OCT")

df_amax = pd.DataFrame(
    {
        "max_flow": resampled["flow"].max(),
        "count_flow": resampled["flow"].count(),
    }
)

# Get the datetime that each max value occurred on
max_dates = resampled["flow"].idxmax() + pd.Timedelta(hours=9)
df_amax["time_of_max_flow"] = max_dates

# Adjust the time column back to represent actual water year starts (Oct 1 9am)
df_amax.index = df_amax.index + pd.Timedelta(hours=9)
df_amax.index.name = "time"

# Calculate FULL water year expected counts
# Each water year should have the number of 15-min intervals (900 seconds)
#   from Oct 1 9am to next Oct 1 9am
df_amax["expected_count_time"] = (
    (
        (df_amax.index + pd.DateOffset(years=1)) - df_amax.index
    ).total_seconds() / 900
)

df_amax = df_amax.reset_index()

Polars:

import polars as pl

df = get_example_df(library="polars")

# Shift time back by 9 hours to align water year boundaries with midnight Oct 1st
# This makes Oct 1 9am appear as Oct 1 midnight for resampling purposes
df = df.with_columns(pl.col("time").dt.offset_by("-9h").alias("shifted_time"))
# Assign water year based on shifted time
df = df.with_columns(
    [
        pl.when(pl.col("shifted_time").dt.month() >= 10)
        .then(pl.col("shifted_time").dt.year())
        .otherwise(pl.col("shifted_time").dt.year() - 1)
        .alias("water_year")
    ]
)

df_amax = df.group_by("water_year").agg([
    pl.col("flow").max().alias("max_flow"),
    pl.col("flow").count().alias("count_flow"),

    # Get the datetime that each max value occurred on
    pl.col("time").filter(
        pl.col("flow") == pl.col("flow").max()
    ).first().alias("time_of_max_flow"),
]).sort("water_year")

# Adjust the time column back to represent actual water year starts (Oct 1 9am)
df_amax = df_amax.with_columns(pl.datetime(pl.col("water_year"), 10, 1, 9).alias("time"))

# Calculate FULL water year expected counts
# Each water year should have the number of 15-min intervals (900 seconds)
#   from Oct 1 9am to next Oct 1 9am
df_amax = df_amax.with_columns(
    ((pl.col("time").dt.offset_by("1y") - pl.col("time")).dt.total_seconds() // 900
     ).alias("expected_count_time")
)

Time-Stream:

import time_stream as ts

df = get_example_df(library="polars")

# Wrap the DataFrame in a TimeFrame object
tf = ts.TimeFrame(df, "time", resolution="PT15M", periodicity="PT15M")

# Perform the aggregation to a water-year
tf_amax = tf.aggregate("P1Y+9MT9H", "max", "flow")
# that's it...

Output:

pandas:

shape: (5, 5)
┌─────────────────────┬─────────────────────┬────────────┬────────────┬─────────────────────┐
│ time                ┆ time_of_max_flow    ┆ max_flow   ┆ count_flow ┆ expected_count_time │
│ ---                 ┆ ---                 ┆ ---        ┆ ---        ┆ ---                 │
│ datetime[ns]        ┆ datetime[ns]        ┆ f64        ┆ i64        ┆ f64                 │
╞═════════════════════╪═════════════════════╪════════════╪════════════╪═════════════════════╡
│ 2019-10-01 09:00:00 ┆ 2020-09-22 02:15:00 ┆ 119.348992 ┆ 2759       ┆ 35136.0             │
│ 2020-10-01 09:00:00 ┆ 2021-03-04 20:00:00 ┆ 119.958825 ┆ 33354      ┆ 35040.0             │
│ 2021-10-01 09:00:00 ┆ 2022-08-17 03:30:00 ┆ 119.839728 ┆ 33285      ┆ 35040.0             │
│ 2022-10-01 09:00:00 ┆ 2023-06-14 05:00:00 ┆ 119.902685 ┆ 33271      ┆ 35040.0             │
│ 2023-10-01 09:00:00 ┆ 2023-10-10 02:45:00 ┆ 119.50087  ┆ 2789       ┆ 35136.0             │
└─────────────────────┴─────────────────────┴────────────┴────────────┴─────────────────────┘

Polars:

shape: (5, 5)
┌─────────────────────┬─────────────────────┬────────────┬────────────┬─────────────────────┐
│ time                ┆ time_of_max_flow    ┆ max_flow   ┆ count_flow ┆ expected_count_time │
│ ---                 ┆ ---                 ┆ ---        ┆ ---        ┆ ---                 │
│ datetime[μs]        ┆ datetime[ns]        ┆ f64        ┆ u32        ┆ i64                 │
╞═════════════════════╪═════════════════════╪════════════╪════════════╪═════════════════════╡
│ 2019-10-01 09:00:00 ┆ 2020-09-22 02:15:00 ┆ 119.348992 ┆ 2759       ┆ 35136               │
│ 2020-10-01 09:00:00 ┆ 2021-03-04 20:00:00 ┆ 119.958825 ┆ 33354      ┆ 35040               │
│ 2021-10-01 09:00:00 ┆ 2022-08-17 03:30:00 ┆ 119.839728 ┆ 33285      ┆ 35040               │
│ 2022-10-01 09:00:00 ┆ 2023-06-14 05:00:00 ┆ 119.902685 ┆ 33271      ┆ 35040               │
│ 2023-10-01 09:00:00 ┆ 2023-10-10 02:45:00 ┆ 119.50087  ┆ 2789       ┆ 35136               │
└─────────────────────┴─────────────────────┴────────────┴────────────┴─────────────────────┘

Time-Stream:

shape: (5, 5)
┌─────────────────────┬─────────────────────┬────────────┬────────────┬─────────────────────┐
│ time                ┆ time_of_max_flow    ┆ max_flow   ┆ count_flow ┆ expected_count_time │
│ ---                 ┆ ---                 ┆ ---        ┆ ---        ┆ ---                 │
│ datetime[ns]        ┆ datetime[ns]        ┆ f64        ┆ u32        ┆ u32                 │
╞═════════════════════╪═════════════════════╪════════════╪════════════╪═════════════════════╡
│ 2019-10-01 09:00:00 ┆ 2020-09-22 02:15:00 ┆ 119.348992 ┆ 2759       ┆ 35136               │
│ 2020-10-01 09:00:00 ┆ 2021-03-04 20:00:00 ┆ 119.958825 ┆ 33354      ┆ 35040               │
│ 2021-10-01 09:00:00 ┆ 2022-08-17 03:30:00 ┆ 119.839728 ┆ 33285      ┆ 35040               │
│ 2022-10-01 09:00:00 ┆ 2023-06-14 05:00:00 ┆ 119.902685 ┆ 33271      ┆ 35040               │
│ 2023-10-01 09:00:00 ┆ 2023-10-10 02:45:00 ┆ 119.50087  ┆ 2789       ┆ 35136               │
└─────────────────────┴─────────────────────┴────────────┴────────────┴─────────────────────┘

Key benefits

  • Less boilerplate: No need to wrangle custom datetime columns or write manual offset logic.

  • Fewer mistakes: Periodicity, alignment, and anchor semantics are enforced for you.

  • Domain-ready: Express hydrological conventions directly, such as a day starting at 09:00 or a water year starting on 1 October.

  • Readable & reproducible: Your code is self-explanatory to collaborators and reviewers.

In more detail

The aggregate() method is the entry point for performing aggregations on time series data in Time-Stream. It combines Polars performance with TimeFrame semantics (resolution, periodicity, anchor).
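Putting the pieces described below together, a typical call looks like this (a sketch reusing the example data from above; each argument is explained in the sections that follow):

import time_stream as ts

# Same assumed example-data helper as in the complex example above
df = get_example_df(library="polars")
tf = ts.TimeFrame(df, "time", resolution="PT15M", periodicity="PT15M")

# Aggregation period, method, column(s), and an optional missing-data policy
tf_monthly = tf.aggregate("P1M", "mean", "flow", missing_criteria=("percent", 75))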

Aggregation period

The time window you want to aggregate into. This can be specified as an ISO-8601 duration string, with an optional modifier that applies a custom offset to the period (see the sketch after the list below).

Common examples:

  • "P1D" – calendar day

  • "P1M" – calendar month

  • "P1Y" – calendar year

  • "P1Y+9MT9H"water year starting 1 Oct 09:00

  • "PT15M" – 15-minute buckets

Note

The resulting TimeFrame will have its resolution and periodicity set to this value.

Aggregation methods

Choose how values inside each window are summarised by passing a string corresponding to one of the built-in functions (see the sketch after this list):

  • "sum" - Add up all values in each period.

    Use this for quantities that accumulate over time, such as precipitation.

  • "mean" - Average of all values in each period.

    Useful for variables like temperature or concentration, where the average represents the period well.

  • "min" - Smallest value observed in the period.

    Often used to track minimum daily temperature, or low flows in rivers.

  • "max" - Largest value observed in the period.

    Common in hydrology for annual maxima (AMAX) or flood frequency analysis.
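For example, reusing the 15-minute flow TimeFrame from above:

# Daily summaries of the same 15-minute series, one method per call
tf_daily_mean = tf.aggregate("P1D", "mean", "flow")
tf_daily_min = tf.aggregate("P1D", "min", "flow")
tf_daily_max = tf.aggregate("P1D", "max", "flow")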

Column selection

Specify which columns to aggregate; only these will be used by the aggregation function. This can be a single column name, a list of column names, or, if not provided, the method will use all columns in the time series.
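For example (the "stage" column here is hypothetical, standing in for any second data column):

# A single column
tf_agg = tf.aggregate("P1D", "mean", "flow")

# A list of columns (assumes the TimeFrame also has a "stage" column)
tf_agg = tf.aggregate("P1D", "mean", ["flow", "stage"])

# No columns given: aggregate every data column in the TimeFrame
tf_agg = tf.aggregate("P1D", "mean")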

Missing criteria

Control whether a window is considered “complete enough” to produce a value by specifying a missing-criteria policy and an associated threshold.

The policies you can specify are:

  • available: Requires at least “n” points within the aggregation window.

  • percent: Requires at least “n”% of points within the aggregation window.

  • missing: No more than “n” points can be missing within the aggregation window.

Examples

Using the 15-minute flow example data:

# Require at least 2,400 values present (25 days of 15 minute data)
tf_agg = tf.aggregate("P1M", "mean", "flow", missing_criteria=("available", 25 * 96))

# Require at least 75% of data to be present
tf_agg = tf.aggregate("P1M", "mean", "flow", missing_criteria=("percent", 75))

# Allow at most 150 missing values
tf_agg = tf.aggregate("P1M", "mean", "flow", missing_criteria=("missing", 150))

The resulting TimeFrame object will contain metadata columns that provide detail about the completeness of the aggregation windows, and whether an aggregated data point is considered valid:

  • count_<column>: The number of points found in each aggregation window

  • expected_count_<time>: The number of points expected if the aggregation window was full

  • valid_<column>: Whether the individual aggregated data points are valid or not, based on the missing criteria specified.

For example:

tf.aggregate("P1M", "mean", "flow", missing_criteria=("missing", 150))

shape: (39, 5)
┌─────────────────────┬───────────┬────────────┬─────────────────────┬────────────┐
│ time                ┆ mean_flow ┆ count_flow ┆ expected_count_time ┆ valid_flow │
│ ---                 ┆ ---       ┆ ---        ┆ ---                 ┆ ---        │
│ datetime[ns]        ┆ f64       ┆ u32        ┆ u32                 ┆ bool       │
╞═════════════════════╪═══════════╪════════════╪═════════════════════╪════════════╡
│ 2020-09-01 00:00:00 ┆ 56.327301 ┆ 2730       ┆ 2880                ┆ true       │
│ 2020-10-01 00:00:00 ┆ 55.354524 ┆ 2838       ┆ 2976                ┆ true       │
│ 2020-11-01 00:00:00 ┆ 53.558817 ┆ 2747       ┆ 2880                ┆ true       │
│ 2020-12-01 00:00:00 ┆ 55.746043 ┆ 2833       ┆ 2976                ┆ true       │
│ 2021-01-01 00:00:00 ┆ 54.708216 ┆ 2821       ┆ 2976                ┆ false      │
│ …                   ┆ …         ┆ …          ┆ …                   ┆ …          │
│ 2023-07-01 00:00:00 ┆ 55.706993 ┆ 2835       ┆ 2976                ┆ true       │
│ 2023-08-01 00:00:00 ┆ 54.436446 ┆ 2817       ┆ 2976                ┆ false      │
│ 2023-09-01 00:00:00 ┆ 53.924188 ┆ 2746       ┆ 2880                ┆ true       │
│ 2023-10-01 00:00:00 ┆ 56.500386 ┆ 2822       ┆ 2976                ┆ false      │
│ 2023-11-01 00:00:00 ┆ 74.769198 ┆ 1          ┆ 2880                ┆ false      │
└─────────────────────┴───────────┴────────────┴─────────────────────┴────────────┘

Time anchoring

Choose the time anchor of the aggregated TimeFrame, which you may want to differ from that of the input TimeFrame. For example, meteorological observations are often considered end anchored, where the value is valid up to the given timestamp. When producing a daily mean from such data, it may make more sense for the result to use a start anchor, indicating that the value is valid from the start of the day to the end of the day.

See the concepts page for more information about time anchors.

Note

If omitted, the aggregation uses the input TimeFrame’s time anchor.
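As an illustrative sketch only: assuming the anchor can be overridden via a keyword argument (the time_anchor name and "start" value here are assumptions, not confirmed API; check the API reference for the actual parameter):

# Hypothetical keyword: daily means that are start-anchored rather than
# inheriting the end anchor of the input observations
tf_daily = tf.aggregate("P1D", "mean", "flow", time_anchor="start")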