Quick start

Create and work with time series data using the Time-Stream package.

import time_stream as ts

Create a TimeFrame

Create sample data in a Polars DataFrame:

from datetime import datetime, timedelta

import polars as pl

dates = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(10)]
temperatures = [20.5, 21.0, 19.1, 26.0, 24.2, 26.6, 28.4, 30.9, 31.0, 29.1]
precipitation = [0.0, 0.0, 5.1, 10.2, 2.0, 0.2, 0.0, 3.0, 1.6, 0.0]

df = pl.DataFrame({"time": dates, "temperature": temperatures, "precipitation": precipitation})

Now wrap the Polars DataFrame in a TimeFrame, which adds specialized functionality for time series operations:

tf = ts.TimeFrame(
    df=df,
    time_name="time",  # Specify which column contains the primary datetime values
)
print(tf)

<time_stream.TimeFrame> Size (estimated): 288.00 B
Time properties:
    Time column   : time  [2023-01-01 00:00:00, ..., 2023-01-10 00:00:00]
    Type          : Datetime(time_unit='us', time_zone=None)
    Resolution    : PT0.000001S
    Offset        : None
    Alignment     : PT0.000001S
    Periodicity   : PT0.000001S
    Anchor        : TimeAnchor.START
Columns:
    temperature   : Float64  80.00 B  [20.5, ..., 29.1]
    precipitation : Float64  80.00 B  [0.0, ..., 0.0]

With Time Properties

A TimeFrame can be configured with properties that describe the time aspect of your data. These properties, and the concepts behind them, are described on the concepts page.

Here, we will show some basic usage of these time properties.

Resolution, Offset, Periodicity and Time Anchor

If you do not specify a resolution or periodicity, both default to 1 microsecond, which accommodates any set of datetime values. The offset defaults to None (no offset), and the time_anchor defaults to start:

print(tf.resolution)
print(tf.offset)
print(tf.periodicity)
print(tf.time_anchor)

PT0.000001S
None
PT0.000001S
TimeAnchor.START

Although the default of 1 microsecond will account for any datetime values, for more control over certain time series functionality it is important to specify the actual resolution, offset and periodicity if known. These properties can be provided as an ISO 8601 duration string, e.g. P1D (1 day) or PT15M (15 minutes).
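To make the duration strings concrete, here is a minimal standard-library sketch that converts simple day/time durations into `timedelta` objects. The helper `parse_simple_duration` is hypothetical and is not part of Time-Stream; it only illustrates what strings such as P1D and PT15M mean, and it deliberately ignores month and year components because they have no fixed length.

```python
import re
from datetime import timedelta

# Matches day/time-only ISO 8601 durations such as P1D, PT15M or P1DT6H.
# Month and year components are left out because they have no fixed length.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_simple_duration(text: str) -> timedelta:
    """Convert a day/time-only ISO 8601 duration string to a timedelta."""
    match = _DURATION.match(text)
    # Reject strings with no components at all, or a dangling "T".
    if match is None or text == "P" or text.endswith("T"):
        raise ValueError(f"Unsupported duration: {text!r}")
    parts = match.groupdict(default="0")
    return timedelta(
        days=int(parts["days"]),
        hours=int(parts["hours"]),
        minutes=int(parts["minutes"]),
        seconds=float(parts["seconds"]),
    )


print(parse_simple_duration("P1D"))    # 1 day, 0:00:00
print(parse_simple_duration("PT15M"))  # 0:15:00
```

The default value PT0.000001S parses the same way and comes out as a one-microsecond `timedelta`.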

The time_anchor property can be set to start, end, or point.

The concepts page covers all of these properties in more detail.

Resolution

For most cases, it is sufficient to just specify the resolution. The offset will default to “no offset”, and the periodicity will be set to the same as the resolution.

tf = ts.TimeFrame(
    df=df,
    time_name="time",
    resolution="P1D",  # Sampling interval of 1 day
)

print("resolution=", tf.resolution)
print("offset=", tf.offset)
print("periodicity=", tf.periodicity)

resolution= P1D
offset= None
periodicity= P1D

Offset

The next most common change is to specify an offset. Use this when your data is measured at a point in time that is offset from the “natural boundary” of the resolution (see the concepts page for details). The periodicity is then built automatically from the resolution and the offset, which states that only one value is expected in each such period.

tf = ts.TimeFrame(
    df=df_offset,  # Assumed: a DataFrame of daily values timestamped at 09:00 (not defined on this page)
    time_name="time",
    resolution="P1D",  # Sampling interval of 1 day
    offset="+T9H",  # Values are measured at 09:00 each day
)

print("resolution=", tf.resolution)
print("offset=", tf.offset)
print("periodicity=", tf.periodicity)

resolution= P1D
offset= +T9H
periodicity= P1D+T9H
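The example above uses a `df_offset` DataFrame that this page does not define. As a sketch only, the data below shows one hypothetical shape it could take: ten daily readings stamped at 09:00. It uses the standard library alone, and the `on_daily_boundary` helper is illustrative rather than anything Time-Stream provides.

```python
from datetime import datetime, timedelta

# Hypothetical contents for `df_offset`: ten daily readings taken at 09:00.
offset = timedelta(hours=9)
times = [datetime(2023, 1, 1) + offset + timedelta(days=i) for i in range(10)]
data = {
    "time": times,
    "temperature": [20.5, 21.0, 19.1, 26.0, 24.2, 26.6, 28.4, 30.9, 31.0, 29.1],
}
# With Polars installed, pl.DataFrame(data) would turn this into a DataFrame.


def on_daily_boundary(ts: datetime) -> bool:
    """True if the timestamp sits exactly on midnight."""
    return ts == ts.replace(hour=0, minute=0, second=0, microsecond=0)


# Removing the offset should put every timestamp back on the daily boundary.
aligned = all(on_daily_boundary(ts - offset) for ts in times)
print(times[0], aligned)  # 2023-01-01 09:00:00 True
```

Timestamps that fail this kind of check would not fit a daily resolution with a nine-hour offset.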

Periodicity

Finally, you may have data that is measured at a given resolution, but where you only expect one value in a different period of time. This is when you would set the periodicity explicitly. The classic hydrological example is an annual-maximum (AMAX) time series, where the data is measured at daily resolution but only one value is expected per year.

tf = ts.TimeFrame(
    df=df_amax,  # Assumed: a DataFrame of annual-maximum values (not defined on this page)
    time_name="time",
    resolution="P1D",  # Sampling interval of 1 day
    offset="+T9H",  # Values are measured at 09:00 each day
    periodicity="P1Y+9MT9H",  # We only expect 1 value per "water-year" (1st Oct 09:00)
)

print("resolution=", tf.resolution)
print("offset=", tf.offset)
print("periodicity=", tf.periodicity)

resolution= P1D
offset= +T9H
periodicity= P1Y+9MT9H
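To illustrate what "one value per water-year" means, the sketch below assigns each timestamp to the water year that begins on 1 October at 09:00 and counts the values in each. This follows the comment in the example above; how Time-Stream evaluates periodicity internally is not shown here. The `water_year_start` helper and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def water_year_start(ts: datetime) -> datetime:
    """Return the 1 October 09:00 boundary that opens the water year containing ts."""
    boundary = datetime(ts.year, 10, 1, 9)
    return boundary if ts >= boundary else datetime(ts.year - 1, 10, 1, 9)


# Hypothetical annual-maximum timestamps (invented for this sketch).
amax_times = [
    datetime(2020, 2, 16, 9),
    datetime(2020, 12, 24, 9),
    datetime(2022, 1, 3, 9),
]

# Count how many values fall into each water year.
counts = Counter(water_year_start(ts) for ts in amax_times)
print(all(n == 1 for n in counts.values()))  # True
```

Note that the first two timestamps share a calendar year but fall in different water years, so they do not conflict.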

Duplicate Detection

TimeFrame automatically checks for rows with duplicate values in the specified time column. You control what happens when such rows are detected. Consider this DataFrame with duplicate time values:

shape: (10, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time                ┆ temperature ┆ precipitation │
│ ---                 ┆ ---         ┆ ---           │
│ datetime[μs]        ┆ i64         ┆ i64           │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20          ┆ null          │
│ 2023-01-01 00:00:00 ┆ null        ┆ 0             │
│ 2023-02-01 00:00:00 ┆ 19          ┆ 5             │
│ 2023-03-01 00:00:00 ┆ 26          ┆ 10            │
│ 2023-04-01 00:00:00 ┆ 24          ┆ 2             │
│ 2023-05-01 00:00:00 ┆ 26          ┆ 0             │
│ 2023-06-01 00:00:00 ┆ 28          ┆ null          │
│ 2023-06-01 00:00:00 ┆ 30          ┆ 3             │
│ 2023-06-01 00:00:00 ┆ null        ┆ 4             │
│ 2023-07-01 00:00:00 ┆ 29          ┆ 0             │
└─────────────────────┴─────────────┴───────────────┘
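The page does not show how this DataFrame is built. A minimal sketch of the same values as plain Python columns follows, with None standing in for null. Passing the dictionary to `pl.DataFrame` is left as a comment so that the snippet runs with the standard library alone.

```python
from datetime import datetime

# Column values copied from the table above; None stands for null.
data = {
    "time": [
        datetime(2023, 1, 1), datetime(2023, 1, 1), datetime(2023, 2, 1),
        datetime(2023, 3, 1), datetime(2023, 4, 1), datetime(2023, 5, 1),
        datetime(2023, 6, 1), datetime(2023, 6, 1), datetime(2023, 6, 1),
        datetime(2023, 7, 1),
    ],
    "temperature": [20, None, 19, 26, 24, 26, 28, 30, None, 29],
    "precipitation": [None, 0, 5, 10, 2, 0, None, 3, 4, 0],
}

# With Polars installed: df = pl.DataFrame(data)

# Ten rows, but only seven distinct timestamps.
print(len(data["time"]), len(set(data["time"])))  # 10 7
```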

The following strategies are available to use with the on_duplicates argument:

Error (Default): on_duplicates="error"

Raises an error when duplicate rows are found. This is the default behavior to ensure data integrity.

ts.TimeFrame(df, "time", on_duplicates="error")

Warning: Duplicate time values found. A TimeFrame must have unique time values. Options for dealing with duplicate rows include: ['DROP', 'KEEP_FIRST', 'KEEP_LAST', 'ERROR', 'MERGE'].

Keep First: on_duplicates="keep_first"

For a given group of rows with the same time value, keeps only the first row and discards the others.

tf = ts.TimeFrame(df, "time", on_duplicates="keep_first")

shape: (7, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time                ┆ temperature ┆ precipitation │
│ ---                 ┆ ---         ┆ ---           │
│ datetime[μs]        ┆ i64         ┆ i64           │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20          ┆ null          │
│ 2023-02-01 00:00:00 ┆ 19          ┆ 5             │
│ 2023-03-01 00:00:00 ┆ 26          ┆ 10            │
│ 2023-04-01 00:00:00 ┆ 24          ┆ 2             │
│ 2023-05-01 00:00:00 ┆ 26          ┆ 0             │
│ 2023-06-01 00:00:00 ┆ 28          ┆ null          │
│ 2023-07-01 00:00:00 ┆ 29          ┆ 0             │
└─────────────────────┴─────────────┴───────────────┘

Keep Last: on_duplicates="keep_last"

For a given group of rows with the same time value, keeps only the last row and discards the others.

tf = ts.TimeFrame(df, "time", on_duplicates="keep_last")

shape: (7, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time                ┆ temperature ┆ precipitation │
│ ---                 ┆ ---         ┆ ---           │
│ datetime[μs]        ┆ i64         ┆ i64           │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ null        ┆ 0             │
│ 2023-02-01 00:00:00 ┆ 19          ┆ 5             │
│ 2023-03-01 00:00:00 ┆ 26          ┆ 10            │
│ 2023-04-01 00:00:00 ┆ 24          ┆ 2             │
│ 2023-05-01 00:00:00 ┆ 26          ┆ 0             │
│ 2023-06-01 00:00:00 ┆ null        ┆ 4             │
│ 2023-07-01 00:00:00 ┆ 29          ┆ 0             │
└─────────────────────┴─────────────┴───────────────┘

Drop: on_duplicates="drop"

Removes all rows that have duplicate timestamps. This strategy is appropriate when you are unsure of the integrity of duplicate rows and only want unique, unambiguous data.

tf = ts.TimeFrame(df, "time", on_duplicates="drop")

shape: (5, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time                ┆ temperature ┆ precipitation │
│ ---                 ┆ ---         ┆ ---           │
│ datetime[μs]        ┆ i64         ┆ i64           │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-02-01 00:00:00 ┆ 19          ┆ 5             │
│ 2023-03-01 00:00:00 ┆ 26          ┆ 10            │
│ 2023-04-01 00:00:00 ┆ 24          ┆ 2             │
│ 2023-05-01 00:00:00 ┆ 26          ┆ 0             │
│ 2023-07-01 00:00:00 ┆ 29          ┆ 0             │
└─────────────────────┴─────────────┴───────────────┘

Merge: on_duplicates="merge"

For a given group of rows with the same time value, performs a merge of all rows. This combines values with a top-down approach that preserves the first non-null value for each column.

tf = ts.TimeFrame(df, "time", on_duplicates="merge")

shape: (7, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time                ┆ temperature ┆ precipitation │
│ ---                 ┆ ---         ┆ ---           │
│ datetime[μs]        ┆ i64         ┆ i64           │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20          ┆ 0             │
│ 2023-02-01 00:00:00 ┆ 19          ┆ 5             │
│ 2023-03-01 00:00:00 ┆ 26          ┆ 10            │
│ 2023-04-01 00:00:00 ┆ 24          ┆ 2             │
│ 2023-05-01 00:00:00 ┆ 26          ┆ 0             │
│ 2023-06-01 00:00:00 ┆ 28          ┆ 3             │
│ 2023-07-01 00:00:00 ┆ 29          ┆ 0             │
└─────────────────────┴─────────────┴───────────────┘
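The four non-error strategies can be compared side by side with a small pure-Python sketch. This is not Time-Stream's implementation, and the `resolve` and `first_non_null` helpers are hypothetical; the sketch only reproduces the behaviour described in this section on the same example rows, assuming they are already sorted by time.

```python
from datetime import datetime
from itertools import groupby

# (time, temperature, precipitation) rows from the duplicate example; None is null.
rows = [
    (datetime(2023, 1, 1), 20, None),
    (datetime(2023, 1, 1), None, 0),
    (datetime(2023, 2, 1), 19, 5),
    (datetime(2023, 3, 1), 26, 10),
    (datetime(2023, 4, 1), 24, 2),
    (datetime(2023, 5, 1), 26, 0),
    (datetime(2023, 6, 1), 28, None),
    (datetime(2023, 6, 1), 30, 3),
    (datetime(2023, 6, 1), None, 4),
    (datetime(2023, 7, 1), 29, 0),
]


def first_non_null(values):
    """Return the first value that is not None, or None if all are null."""
    return next((v for v in values if v is not None), None)


def resolve(rows, strategy):
    """Apply one duplicate strategy to rows already sorted by time."""
    result = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        if strategy == "keep_first":
            result.append(group[0])
        elif strategy == "keep_last":
            result.append(group[-1])
        elif strategy == "drop":
            # Keep only timestamps that occur exactly once.
            if len(group) == 1:
                result.append(group[0])
        elif strategy == "merge":
            # Column by column, take the first non-null value, top down.
            result.append(tuple(first_non_null(col) for col in zip(*group)))
        else:
            raise ValueError(f"Unknown strategy: {strategy!r}")
    return result


for strategy in ("keep_first", "keep_last", "drop", "merge"):
    print(strategy, len(resolve(rows, strategy)))
```

The row counts (7, 7, 5 and 7) match the shapes of the tables above, and the merged row for 2023-06-01 holds temperature 28 and precipitation 3, the first non-null value in each column.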

With Metadata

The TimeFrame object can hold metadata to describe your data. This can be metadata about the time series dataset as a whole, or about the individual columns. Keeping the metadata and the data together in one object can simplify downstream processes such as derivation functions, infilling routines and plotting.

Dataset-level metadata can be set with the with_metadata() method:

metadata = {"location": "UKCEH Wallingford", "station_id": "ABC123"}

tf = tf.with_metadata(metadata)

Column-level metadata can be set with the with_column_metadata() method:

column_metadata = {
    "temperature": {"units": "°C", "description": "Average temperature"},
    "precipitation": {
        "units": "mm",
        "description": "Precipitation amount",
        "instrument_type": "Tipping bucket",
        # Note that metadata keys are not required to be the same for all columns
    },
}

tf = tf.with_column_metadata(column_metadata)

Metadata can be accessed via the metadata (dataset-level) and column_metadata (column-level) attributes:

print("Dataset-level metadata:")
print("")
print("All: ", tf.metadata)
print("Specific key: ", tf.metadata["location"])
print("")
print("Column-level metadata:")
print("")
print("All: ", tf.column_metadata)
print("Specific column: ", tf.column_metadata["temperature"])
print("Specific column key: ", tf.column_metadata["temperature"]["units"])

Dataset-level metadata:

All:  {'location': 'UKCEH Wallingford', 'station_id': 'ABC123'}
Specific key:  UKCEH Wallingford

Column-level metadata:

All:  {'temperature': {'units': '°C', 'description': 'Average temperature'}, 'precipitation': {'units': 'mm', 'description': 'Precipitation amount', 'instrument_type': 'Tipping bucket'}, 'time': {}}
Specific column:  {'units': '°C', 'description': 'Average temperature'}
Specific column key:  °C
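Because metadata keys can differ between columns, looking up a key that only some columns define needs a little care. The sketch below assumes that `tf.column_metadata` behaves like the plain nested dictionary shown in the printed output, and works on a copy of that dictionary rather than on a TimeFrame.

```python
# Same structure as the column-level metadata printed above.
column_metadata = {
    "temperature": {"units": "°C", "description": "Average temperature"},
    "precipitation": {
        "units": "mm",
        "description": "Precipitation amount",
        "instrument_type": "Tipping bucket",
    },
    "time": {},
}

# .get() returns None instead of raising KeyError when a column lacks the key.
instruments = {
    column: meta.get("instrument_type") for column, meta in column_metadata.items()
}
print(instruments)
# {'temperature': None, 'precipitation': 'Tipping bucket', 'time': None}
```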

Data Access and Update

Data Selection

The underlying Polars DataFrame is accessed via the df property:

tf.df

You can create a new TimeFrame containing a selection of columns, using the select() method or indexing syntax:

# Select one or more columns as a new TimeFrame
selected_tf = tf.select(["temperature"])
# or
selected_tf = tf[["temperature"]]
print("Type: ", type(selected_tf))
print(selected_tf.df)

Type:  <class 'time_stream.base.TimeFrame'>
shape: (10, 2)
┌─────────────────────┬─────────────┐
│ time                ┆ temperature │
│ ---                 ┆ ---         │
│ datetime[μs]        ┆ f64         │
╞═════════════════════╪═════════════╡
│ 2023-01-01 00:00:00 ┆ 20.5        │
│ 2023-01-02 00:00:00 ┆ 21.0        │
│ 2023-01-03 00:00:00 ┆ 19.1        │
│ 2023-01-04 00:00:00 ┆ 26.0        │
│ 2023-01-05 00:00:00 ┆ 24.2        │
│ 2023-01-06 00:00:00 ┆ 26.6        │
│ 2023-01-07 00:00:00 ┆ 28.4        │
│ 2023-01-08 00:00:00 ┆ 30.9        │
│ 2023-01-09 00:00:00 ┆ 31.0        │
│ 2023-01-10 00:00:00 ┆ 29.1        │
└─────────────────────┴─────────────┘

Note

The primary time column is automatically maintained in any selection.

Data Update

If you need to make changes to the underlying Polars DataFrame, use the with_df() method. This checks that the integrity of the time data has been maintained in the new DataFrame, and returns a new TimeFrame object with the updated data.

# Update the DataFrame by adding a new column
new_df = tf.df.with_columns((pl.col("temperature") * 1.8 + 32).alias("temperature_f"))

tf = tf.with_df(new_df)
print(tf.df)

shape: (10, 4)
┌─────────────────────┬─────────────┬───────────────┬───────────────┐
│ time                ┆ temperature ┆ precipitation ┆ temperature_f │
│ ---                 ┆ ---         ┆ ---           ┆ ---           │
│ datetime[μs]        ┆ f64         ┆ f64           ┆ f64           │
╞═════════════════════╪═════════════╪═══════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20.5        ┆ 0.0           ┆ 68.9          │
│ 2023-01-02 00:00:00 ┆ 21.0        ┆ 0.0           ┆ 69.8          │
│ 2023-01-03 00:00:00 ┆ 19.1        ┆ 5.1           ┆ 66.38         │
│ 2023-01-04 00:00:00 ┆ 26.0        ┆ 10.2          ┆ 78.8          │
│ 2023-01-05 00:00:00 ┆ 24.2        ┆ 2.0           ┆ 75.56         │
│ 2023-01-06 00:00:00 ┆ 26.6        ┆ 0.2           ┆ 79.88         │
│ 2023-01-07 00:00:00 ┆ 28.4        ┆ 0.0           ┆ 83.12         │
│ 2023-01-08 00:00:00 ┆ 30.9        ┆ 3.0           ┆ 87.62         │
│ 2023-01-09 00:00:00 ┆ 31.0        ┆ 1.6           ┆ 87.8          │
│ 2023-01-10 00:00:00 ┆ 29.1        ┆ 0.0           ┆ 84.38         │
└─────────────────────┴─────────────┴───────────────┴───────────────┘