Quick start¶
Create and work with time series data using the Time-Stream package.
import time_stream as ts
Create a TimeFrame¶
Create sample data in a Polars DataFrame:
from datetime import datetime, timedelta
import polars as pl
dates = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(10)]
temperatures = [20.5, 21.0, 19.1, 26.0, 24.2, 26.6, 28.4, 30.9, 31.0, 29.1]
precipitation = [0.0, 0.0, 5.1, 10.2, 2.0, 0.2, 0.0, 3.0, 1.6, 0.0]
df = pl.DataFrame({"time": dates, "temperature": temperatures, "precipitation": precipitation})
Now wrap the Polars DataFrame in a TimeFrame, which adds specialized functionality for time series operations:
tf = ts.TimeFrame(
df=df,
time_name="time", # Specify which column contains the primary datetime values
)
With Time Properties¶
The TimeFrame object lets you configure important properties of the time aspect of your data.
More information about these properties and concepts can be found on the concepts page.
Here, we show some basic usage of these time properties.
Periodicity, Resolution and Time Anchor¶
Without specifying resolution and periodicity, the default initialisation sets these properties to 1 microsecond, to account for any set of datetime values. The time anchor property is set to start:
print(tf.resolution)
print(tf.periodicity)
print(tf.time_anchor)
PT0.000001S
PT0.000001S
TimeAnchor.START
Although the default of 1 microsecond will account for any datetime values, for more control over certain time series functionality it is important to specify the actual resolution and periodicity if known. These properties can be provided as an ISO 8601 duration string like P1D (1 day) or PT15M (15 minutes).
The time anchor property can be set to start, end, or point.
Again, more detail about all these properties can be found on the concepts page.
tf = ts.TimeFrame(
df=df,
time_name="time",
resolution="P1D", # Each timestamp is at day precision
periodicity="P1D", # Data points are spaced 1 day apart
time_anchor="end",
)
print(tf.resolution)
print(tf.periodicity)
print(tf.time_anchor)
P1D
P1D
TimeAnchor.END
shape: (10, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20.5 ┆ 0.0 │
│ 2023-01-02 00:00:00 ┆ 21.0 ┆ 0.0 │
│ 2023-01-03 00:00:00 ┆ 19.1 ┆ 5.1 │
│ 2023-01-04 00:00:00 ┆ 26.0 ┆ 10.2 │
│ 2023-01-05 00:00:00 ┆ 24.2 ┆ 2.0 │
│ 2023-01-06 00:00:00 ┆ 26.6 ┆ 0.2 │
│ 2023-01-07 00:00:00 ┆ 28.4 ┆ 0.0 │
│ 2023-01-08 00:00:00 ┆ 30.9 ┆ 3.0 │
│ 2023-01-09 00:00:00 ┆ 31.0 ┆ 1.6 │
│ 2023-01-10 00:00:00 ┆ 29.1 ┆ 0.0 │
└─────────────────────┴─────────────┴───────────────┘
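The ISO 8601 duration strings used for resolution and periodicity can be related to plain Python timedelta values. The following is a minimal sketch using only the standard library. parse_simple_duration is a hypothetical helper written for illustration, not part of Time-Stream, and it handles only fixed-length components (days, hours, minutes, seconds), not calendar units such as months or years.

```python
import re
from datetime import timedelta
from decimal import Decimal

# Hypothetical helper for illustration only; not part of the Time-Stream API.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_simple_duration(text: str) -> timedelta:
    """Convert a day/time ISO 8601 duration string to a timedelta."""
    match = _DURATION.match(text)
    parts = {k: v for k, v in (match.groupdict() if match else {}).items() if v}
    if not parts:
        raise ValueError(f"Unsupported duration string: {text!r}")
    # Use Decimal so fractional seconds convert to whole microseconds exactly.
    microseconds = int(Decimal(parts.get("seconds", "0")) * 1_000_000)
    return timedelta(
        days=int(parts.get("days", 0)),
        hours=int(parts.get("hours", 0)),
        minutes=int(parts.get("minutes", 0)),
        microseconds=microseconds,
    )


print(parse_simple_duration("P1D"))          # 1 day, 0:00:00
print(parse_simple_duration("PT15M"))        # 0:15:00
print(parse_simple_duration("PT0.000001S"))  # 0:00:00.000001
```

The last call corresponds to the 1 microsecond default shown earlier, which is the finest step a Python datetime can represent.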
Duplicate Detection¶
TimeFrame automatically checks for rows with duplicate values in the specified time column.
You control what the TimeFrame does when it detects rows with duplicate time values.
Consider this DataFrame with duplicate time values:
shape: (10, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20 ┆ null │
│ 2023-01-01 00:00:00 ┆ null ┆ 0 │
│ 2023-02-01 00:00:00 ┆ 19 ┆ 5 │
│ 2023-03-01 00:00:00 ┆ 26 ┆ 10 │
│ 2023-04-01 00:00:00 ┆ 24 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 26 ┆ 0 │
│ 2023-06-01 00:00:00 ┆ 28 ┆ null │
│ 2023-06-01 00:00:00 ┆ 30 ┆ 3 │
│ 2023-06-01 00:00:00 ┆ null ┆ 4 │
│ 2023-07-01 00:00:00 ┆ 29 ┆ 0 │
└─────────────────────┴─────────────┴───────────────┘
The following strategies are available to use with the on_duplicates argument:
Error (Default):
on_duplicates="error"
Raises an error when duplicate rows are found. This is the default behavior to ensure data integrity.
ts.TimeFrame(df, "time", on_duplicates="error")
Warning: Duplicate time values found. A TimeFrame must have unique time values. Options for dealing with duplicate rows include: ['DROP', 'KEEP_FIRST', 'KEEP_LAST', 'ERROR', 'MERGE'].
Keep First:
on_duplicates="keep_first"
For a given group of rows with the same time value, keeps only the first row and discards the others.
tf = ts.TimeFrame(df, "time", on_duplicates="keep_first")
shape: (7, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20 ┆ null │
│ 2023-02-01 00:00:00 ┆ 19 ┆ 5 │
│ 2023-03-01 00:00:00 ┆ 26 ┆ 10 │
│ 2023-04-01 00:00:00 ┆ 24 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 26 ┆ 0 │
│ 2023-06-01 00:00:00 ┆ 28 ┆ null │
│ 2023-07-01 00:00:00 ┆ 29 ┆ 0 │
└─────────────────────┴─────────────┴───────────────┘
Keep Last:
on_duplicates="keep_last"
For a given group of rows with the same time value, keeps only the last row and discards the others.
tf = ts.TimeFrame(df, "time", on_duplicates="keep_last")
shape: (7, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ null ┆ 0 │
│ 2023-02-01 00:00:00 ┆ 19 ┆ 5 │
│ 2023-03-01 00:00:00 ┆ 26 ┆ 10 │
│ 2023-04-01 00:00:00 ┆ 24 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 26 ┆ 0 │
│ 2023-06-01 00:00:00 ┆ null ┆ 4 │
│ 2023-07-01 00:00:00 ┆ 29 ┆ 0 │
└─────────────────────┴─────────────┴───────────────┘
Drop:
on_duplicates="drop"
Removes all rows that have duplicate timestamps. This strategy is appropriate when you are unsure of the integrity of duplicate rows and only want unique, unambiguous data.
tf = ts.TimeFrame(df, "time", on_duplicates="drop")
shape: (5, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-02-01 00:00:00 ┆ 19 ┆ 5 │
│ 2023-03-01 00:00:00 ┆ 26 ┆ 10 │
│ 2023-04-01 00:00:00 ┆ 24 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 26 ┆ 0 │
│ 2023-07-01 00:00:00 ┆ 29 ┆ 0 │
└─────────────────────┴─────────────┴───────────────┘
Merge:
on_duplicates="merge"
For a given group of rows with the same time value, performs a merge of all rows. This combines values with a top-down approach that preserves the first non-null value for each column.
tf = ts.TimeFrame(df, "time", on_duplicates="merge")
shape: (7, 3)
┌─────────────────────┬─────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪═════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20 ┆ 0 │
│ 2023-02-01 00:00:00 ┆ 19 ┆ 5 │
│ 2023-03-01 00:00:00 ┆ 26 ┆ 10 │
│ 2023-04-01 00:00:00 ┆ 24 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 26 ┆ 0 │
│ 2023-06-01 00:00:00 ┆ 28 ┆ 3 │
│ 2023-07-01 00:00:00 ┆ 29 ┆ 0 │
└─────────────────────┴─────────────┴───────────────┘
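To make the semantics of the strategies concrete, here is a minimal plain-Python sketch that reproduces the results shown above on the same example rows. resolve_duplicates is a hypothetical function written for illustration; it is not how Time-Stream implements duplicate handling, and it represents rows as dictionaries and timestamps as strings rather than using Polars.

```python
from typing import Any

STRATEGIES = {"error", "keep_first", "keep_last", "drop", "merge"}


def resolve_duplicates(
    rows: list[dict[str, Any]], time_key: str, strategy: str
) -> list[dict[str, Any]]:
    """Hypothetical illustration of the duplicate strategies described above."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    # Group rows by time value, preserving first-seen order.
    groups: dict[Any, list[dict[str, Any]]] = {}
    for row in rows:
        groups.setdefault(row[time_key], []).append(row)

    result = []
    for time_value, group in groups.items():
        if len(group) == 1:
            result.append(dict(group[0]))
        elif strategy == "error":
            raise ValueError(f"Duplicate time value found: {time_value}")
        elif strategy == "keep_first":
            result.append(dict(group[0]))
        elif strategy == "keep_last":
            result.append(dict(group[-1]))
        elif strategy == "merge":
            # Top-down merge: first non-null value per column.
            result.append({
                col: next((r[col] for r in group if r[col] is not None), None)
                for col in group[0]
            })
        # "drop": every row in a duplicated group is discarded.
    return result


values = [
    ("2023-01-01", 20, None), ("2023-01-01", None, 0), ("2023-02-01", 19, 5),
    ("2023-03-01", 26, 10), ("2023-04-01", 24, 2), ("2023-05-01", 26, 0),
    ("2023-06-01", 28, None), ("2023-06-01", 30, 3), ("2023-06-01", None, 4),
    ("2023-07-01", 29, 0),
]
rows = [{"time": t, "temperature": a, "precipitation": b} for t, a, b in values]

for strategy in ("keep_first", "keep_last", "drop", "merge"):
    print(strategy, len(resolve_duplicates(rows, "time", strategy)))
```

Note that under the merge strategy the 2023-06-01 group yields temperature 28 and precipitation 3: the first row supplies the temperature, and the second row supplies the first non-null precipitation.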
With Metadata¶
The TimeFrame object can hold metadata to describe your data.
This can be metadata about the time series dataset as a whole, or about the individual columns. Keeping the metadata
and the data together in one object like this can help simplify downstream processes,
such as derivation functions, running infilling routines, plotting data, etc.
Dataset-level metadata can be set with the with_metadata() method:
metadata = {"location": "UKCEH Wallingford", "station_id": "ABC123"}
tf = tf.with_metadata(metadata)
Column-level metadata can be set with the with_column_metadata() method:
column_metadata = {
"temperature": {"units": "°C", "description": "Average temperature"},
"precipitation": {
"units": "mm",
"description": "Precipitation amount",
"instrument_type": "Tipping bucket",
# Note that metadata keys are not required to be the same for all columns
},
}
tf = tf.with_column_metadata(column_metadata)
Metadata can be accessed via the metadata (dataset-level) and column_metadata (column-level) attributes:
print("Dataset-level metadata:")
print("")
print("All: ", tf.metadata)
print("Specific key: ", tf.metadata["location"])
print("")
print("Column-level metadata:")
print("")
print("All: ", tf.column_metadata)
print("Specific column: ", tf.column_metadata["temperature"])
print("Specific column key: ", tf.column_metadata["temperature"]["units"])
Dataset-level metadata:
All: {'location': 'UKCEH Wallingford', 'station_id': 'ABC123'}
Specific key: UKCEH Wallingford
Column-level metadata:
All: {'temperature': {'units': '°C', 'description': 'Average temperature'}, 'precipitation': {'units': 'mm', 'description': 'Precipitation amount', 'instrument_type': 'Tipping bucket'}, 'time': {}}
Specific column: {'units': '°C', 'description': 'Average temperature'}
Specific column key: °C
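As one example of how co-located metadata might simplify a downstream step such as plotting, the sketch below builds an axis label from a column-level metadata dictionary shaped like the output above. axis_label is a hypothetical helper, not part of Time-Stream; it assumes only that columns without metadata map to an empty dictionary, as the time column does in the printed output.

```python
# Column-level metadata in the same shape as tf.column_metadata above.
column_metadata = {
    "temperature": {"units": "°C", "description": "Average temperature"},
    "precipitation": {
        "units": "mm",
        "description": "Precipitation amount",
        "instrument_type": "Tipping bucket",
    },
    "time": {},
}


def axis_label(metadata: dict, column: str) -> str:
    """Hypothetical helper: build a plot label, falling back to the column name."""
    info = metadata.get(column, {})
    description = info.get("description", column)
    units = info.get("units")
    return f"{description} ({units})" if units else description


print(axis_label(column_metadata, "temperature"))  # Average temperature (°C)
print(axis_label(column_metadata, "time"))         # time
```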
Data Access and Update¶
Data Selection¶
The underlying Polars DataFrame is accessed via the df property:
tf.df
You can create a new TimeFrame object from a selection of columns, using the select() method or indexing syntax:
# Select one or more columns as a new TimeFrame
selected_tf = tf.select(["temperature"])
# or
selected_tf = tf[["temperature"]]
print("Type: ", type(selected_tf))
print(selected_tf)
Type: <class 'time_stream.base.TimeFrame'>
shape: (10, 2)
┌─────────────────────┬─────────────┐
│ time ┆ temperature │
│ --- ┆ --- │
│ datetime[μs] ┆ f64 │
╞═════════════════════╪═════════════╡
│ 2023-01-01 00:00:00 ┆ 20.5 │
│ 2023-01-02 00:00:00 ┆ 21.0 │
│ 2023-01-03 00:00:00 ┆ 19.1 │
│ 2023-01-04 00:00:00 ┆ 26.0 │
│ 2023-01-05 00:00:00 ┆ 24.2 │
│ 2023-01-06 00:00:00 ┆ 26.6 │
│ 2023-01-07 00:00:00 ┆ 28.4 │
│ 2023-01-08 00:00:00 ┆ 30.9 │
│ 2023-01-09 00:00:00 ┆ 31.0 │
│ 2023-01-10 00:00:00 ┆ 29.1 │
└─────────────────────┴─────────────┘
Note
The primary time column is automatically maintained in any selection.
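The behaviour described in the note can be pictured with a small plain-Python sketch, in which a table is a dictionary of column lists. select_with_time is a hypothetical function for illustration only; it does not reflect how select() is implemented in Time-Stream.

```python
def select_with_time(table: dict, columns: list, time_name: str) -> dict:
    """Hypothetical sketch: return the requested columns, always keeping the time column first."""
    keep = [time_name] + [name for name in columns if name != time_name]
    return {name: table[name] for name in keep}


table = {
    "time": ["2023-01-01", "2023-01-02"],
    "temperature": [20.5, 21.0],
    "precipitation": [0.0, 0.0],
}

selected = select_with_time(table, ["temperature"], time_name="time")
print(list(selected))  # ['time', 'temperature']
```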
Data Update¶
If you need to make changes to the underlying Polars DataFrame, use the with_df() method.
This checks that the integrity of the time data has been maintained in the new DataFrame, and
returns a new TimeFrame object with the updated data.
# Update the DataFrame by adding a new column
new_df = tf.df.with_columns((pl.col("temperature") * 1.8 + 32).alias("temperature_f"))
tf = tf.with_df(new_df)
shape: (10, 4)
┌─────────────────────┬─────────────┬───────────────┬───────────────┐
│ time ┆ temperature ┆ precipitation ┆ temperature_f │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 ┆ f64 │
╞═════════════════════╪═════════════╪═══════════════╪═══════════════╡
│ 2023-01-01 00:00:00 ┆ 20.5 ┆ 0.0 ┆ 68.9 │
│ 2023-01-02 00:00:00 ┆ 21.0 ┆ 0.0 ┆ 69.8 │
│ 2023-01-03 00:00:00 ┆ 19.1 ┆ 5.1 ┆ 66.38 │
│ 2023-01-04 00:00:00 ┆ 26.0 ┆ 10.2 ┆ 78.8 │
│ 2023-01-05 00:00:00 ┆ 24.2 ┆ 2.0 ┆ 75.56 │
│ 2023-01-06 00:00:00 ┆ 26.6 ┆ 0.2 ┆ 79.88 │
│ 2023-01-07 00:00:00 ┆ 28.4 ┆ 0.0 ┆ 83.12 │
│ 2023-01-08 00:00:00 ┆ 30.9 ┆ 3.0 ┆ 87.62 │
│ 2023-01-09 00:00:00 ┆ 31.0 ┆ 1.6 ┆ 87.8 │
│ 2023-01-10 00:00:00 ┆ 29.1 ┆ 0.0 ┆ 84.38 │
└─────────────────────┴─────────────┴───────────────┴───────────────┘
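The temperature_f column above follows the usual Celsius to Fahrenheit conversion, F = C × 1.8 + 32. As a quick cross-check of the table, the same arithmetic can be reproduced in plain Python. The function name is an assumption made for this sketch, and values are rounded to two decimal places only to sidestep floating-point noise when comparing.

```python
# Temperatures in °C from the sample data at the start of this page.
temperatures_c = [20.5, 21.0, 19.1, 26.0, 24.2, 26.6, 28.4, 30.9, 31.0, 29.1]


def celsius_to_fahrenheit(celsius: float) -> float:
    """Same expression as the Polars one above: C * 1.8 + 32."""
    return celsius * 1.8 + 32


temperatures_f = [round(celsius_to_fahrenheit(c), 2) for c in temperatures_c]
print(temperatures_f)
```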