rainfallqc.utils.data_utils

All data operations for polars including datetime and calendar functionality.

Classes and functions ordered alphabetically.

rainfallqc.utils.data_utils.back_propagate_daily_data_flags(data, flag_column, num_days)

Back-fill flags over a number of days.

This will prioritise higher flag values.

Parameters:
  • data (DataFrame) – Daily data with flag_column

  • flag_column (str) – column with flags

  • num_days (int) – Number of days to back-propagate

Return type:

DataFrame

Returns:

data – Data with flags back-propagated
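The back-propagation described above can be illustrated with a plain-Python sketch that uses only the standard library. It is not the library's polars implementation: the helper name `back_propagate_flags` is hypothetical, and the sketch assumes that each day takes the highest flag found on that day or on any of the following `num_days` days.

```python
from typing import List, Optional


def back_propagate_flags(flags: List[Optional[int]], num_days: int) -> List[Optional[int]]:
    """Hypothetical helper: give each day the highest flag seen on that day
    or on any of the following `num_days` days (missing flags are ignored)."""
    result: List[Optional[int]] = []
    for i in range(len(flags)):
        # Window covers the current day plus the next `num_days` days.
        window = [f for f in flags[i : i + num_days + 1] if f is not None]
        result.append(max(window) if window else None)
    return result


daily_flags = [0, 0, 0, 1, 0, 0, 3, 0]
propagated = back_propagate_flags(daily_flags, num_days=2)
# propagated == [0, 1, 1, 1, 3, 3, 3, 0]
```

Taking the maximum over the window is what makes higher flag values win when two flagged days overlap.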

rainfallqc.utils.data_utils.calculate_dry_spell_fraction(data, target_gauge_col, dry_period_days)

Calculate dry spell fraction.

Parameters:
  • data (DataFrame) – Data with time column

  • target_gauge_col (str) – Column with rainfall data

  • dry_period_days (int) – Length of a “dry spell” in days

Return type:

Series

Returns:

rain_daily_dry_day – Data with dry spell fraction
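One possible reading of a dry spell fraction is sketched below in plain Python. The helper name `dry_spell_fraction` is hypothetical, and the sketch assumes that a dry day is a day with exactly zero rainfall and that the fraction is taken over a trailing window of `dry_period_days` days; the library's polars implementation may define both differently.

```python
from typing import List, Optional


def dry_spell_fraction(rain: List[float], dry_period_days: int) -> List[Optional[float]]:
    """Hypothetical helper: fraction of dry days (rain == 0) in a trailing
    window of `dry_period_days` days; None until the window is full."""
    is_dry = [1 if value == 0 else 0 for value in rain]
    fractions: List[Optional[float]] = []
    for i in range(len(is_dry)):
        if i + 1 < dry_period_days:
            fractions.append(None)  # not enough history yet
        else:
            window = is_dry[i + 1 - dry_period_days : i + 1]
            fractions.append(sum(window) / dry_period_days)
    return fractions


daily_rain = [0.0, 0.0, 1.2, 0.0, 0.0, 0.0]
fractions = dry_spell_fraction(daily_rain, dry_period_days=3)
# The last value is 1.0 because the final three days are all dry.
```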

rainfallqc.utils.data_utils.check_data_has_consistent_time_step(data)

Check data has a consistent time step, e.g. ‘1h’.

Parameters:

data (DataFrame) – Data with time column

Raises:

ValueError – If data has more than one time step

Return type:

None
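The idea behind this check can be sketched with the standard library: collect the differences between consecutive timestamps and raise if more than one distinct difference is found. The helper names below are hypothetical and the sketch works on a plain list of `datetime` values rather than a polars DataFrame.

```python
from datetime import datetime, timedelta
from typing import List, Set


def unique_time_steps(times: List[datetime]) -> Set[timedelta]:
    """Hypothetical helper: set of differences between consecutive timestamps."""
    return {later - earlier for earlier, later in zip(times, times[1:])}


def check_consistent_time_step(times: List[datetime]) -> None:
    """Raise ValueError if more than one distinct time step is present."""
    steps = unique_time_steps(times)
    if len(steps) > 1:
        raise ValueError(f"Data has more than one time step: {sorted(steps)}")


start = datetime(2020, 1, 1)
hourly = [start + timedelta(hours=i) for i in range(5)]
check_consistent_time_step(hourly)  # passes silently

# Appending a timestamp three hours later introduces a second time step.
irregular = hourly + [hourly[-1] + timedelta(hours=3)]
try:
    check_consistent_time_step(irregular)
    raised = False
except ValueError:
    raised = True
```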

rainfallqc.utils.data_utils.check_data_is_monthly(data)

Check data is monthly.

Parameters:

data (DataFrame) – Data with time column

Raises:

ValueError – If data does not have monthly time steps

Return type:

None

rainfallqc.utils.data_utils.check_data_is_specific_time_res(data, time_res)

Check data has an hourly or daily time step.

Does not work for monthly data; use ‘check_data_is_monthly’ instead.

Parameters:
  • data (DataFrame) – Data with time column.

  • time_res (str | list) – Time resolution(s), either a single string or a list of strings

Raises:

ValueError – If data is not hourly or daily.

Return type:

None

rainfallqc.utils.data_utils.check_for_negative_values(df, target_gauge_col)

Check if the target column contains any negative values.

Parameters:
  • df (DataFrame) – DataFrame to check.

  • target_gauge_col (str) – Column to check for negative values.

Raises:

ValueError – If negative values are found in the target column.

Return type:

bool

rainfallqc.utils.data_utils.convert_daily_data_to_monthly(daily_data, rain_cols, perc_for_valid_month=95)

Convert daily data to monthly, setting a month to NaN if less than a given percentage of its days is present.

Parameters:
  • daily_data (DataFrame) – Daily data to convert to monthly

  • rain_cols (list) – Columns with rainfall data

  • perc_for_valid_month (int | float) – Percentage of month needed to be classed as a valid month for the monthly group by

Return type:

DataFrame

Returns:

monthly_data – Monthly data
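The validity threshold can be illustrated with a plain-Python sketch. The helper name `daily_to_monthly` is hypothetical, and the sketch assumes that the monthly value is the sum of the daily values and that the percentage is measured against the calendar length of the month; it is not the library's polars implementation.

```python
import calendar
import math
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional, Tuple


def daily_to_monthly(
    days: List[date], rain: List[Optional[float]], perc_for_valid_month: float = 95
) -> Dict[Tuple[int, int], float]:
    """Hypothetical helper: monthly totals keyed by (year, month); a month is
    NaN when fewer than `perc_for_valid_month` percent of its days have data."""
    values_by_month = defaultdict(list)
    for day, value in zip(days, rain):
        values_by_month[(day.year, day.month)].append(value)

    monthly = {}
    for (year, month), values in values_by_month.items():
        present = [v for v in values if v is not None and not math.isnan(v)]
        expected_days = calendar.monthrange(year, month)[1]
        perc_present = 100 * len(present) / expected_days
        is_valid = perc_present >= perc_for_valid_month
        monthly[(year, month)] = sum(present) if is_valid else math.nan
    return monthly


# January 2021 is complete; February 2021 has only 20 of its 28 days.
january = [date(2021, 1, d) for d in range(1, 32)]
february = [date(2021, 2, d) for d in range(1, 21)]
days = january + february
monthly = daily_to_monthly(days, [1.0] * len(days))
```

With the default threshold of 95, January keeps its total of 31.0 while February (about 71% complete) becomes NaN.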

rainfallqc.utils.data_utils.convert_datarray_seconds_to_days(series_seconds)

Convert xarray series from seconds to days. The CDD data from ETCCDI is provided in seconds.

Parameters:

series_seconds (DataArray) – Data in series to convert to days.

Return type:

ndarray

Returns:

series_days – Data array converted to days.

rainfallqc.utils.data_utils.downsample_and_fill_columns(high_res_data, low_res_data, data_cols, fill_limit, fill_method='backward', time_col='time')

Join columns from lower resolution data to higher resolution data and fill gaps.

Parameters:
  • high_res_data (DataFrame) – Higher resolution data (e.g., 15-min)

  • low_res_data (DataFrame) – Lower resolution data with columns to join (e.g., hourly)

  • data_cols (str | list[str]) – Column name(s) to join and fill. Can be a single column name (“rainfall”), a list of columns ([“rain1”, “rain2”]), or a regex pattern (“^rain.*$”)

  • fill_limit (int) – Maximum number of intervals to fill

  • fill_method (str) – “forward”, “backward”, or “none”

  • time_col (str) – Name of time column (default: ‘time’)

Return type:

DataFrame

Returns:

high_res_data_filled – High resolution data with filled columns
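The join-then-fill idea can be sketched in plain Python. The helper name `join_and_fill` is hypothetical, and the sketch assumes an exact-timestamp join followed by a fill that stops after `fill_limit` consecutive gaps; how the library's polars implementation treats edge cases is not covered here.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def join_and_fill(
    high_res_times: List[datetime],
    low_res: Dict[datetime, float],
    fill_limit: int,
    fill_method: str = "backward",
) -> List[Optional[float]]:
    """Hypothetical helper: align low-resolution values to high-resolution
    timestamps, then fill at most `fill_limit` consecutive gaps."""
    joined = [low_res.get(t) for t in high_res_times]  # exact-match join
    if fill_method == "none":
        return joined
    if fill_method not in ("forward", "backward"):
        raise ValueError(f"Unknown fill_method: {fill_method}")

    # Backward fill is a forward fill over the reversed sequence.
    values = joined[::-1] if fill_method == "backward" else joined[:]
    last_value, gap_length = None, 0
    for i, value in enumerate(values):
        if value is not None:
            last_value, gap_length = value, 0
        else:
            if last_value is not None and gap_length < fill_limit:
                values[i] = last_value
            gap_length += 1
    return values[::-1] if fill_method == "backward" else values


start = datetime(2020, 1, 1)
quarter_hours = [start + timedelta(minutes=15 * i) for i in range(9)]  # 00:00 to 02:00
hourly = {start + timedelta(hours=1): 4.0, start + timedelta(hours=2): 8.0}
filled = join_and_fill(quarter_hours, hourly, fill_limit=3)
# filled == [None, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0]
```

With `fill_limit=3`, each hourly value is copied back onto the three preceding 15-minute steps; the 00:00 step is four steps from the nearest value and stays empty.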

rainfallqc.utils.data_utils.downsample_monthly_data(sub_monthly_data, monthly_data, data_cols, time_col='time')

Join monthly data to sub-monthly data (e.g. hourly) and fill only within the same month.

Parameters:
  • sub_monthly_data (DataFrame) – Sub-monthly data (e.g., hourly)

  • monthly_data (DataFrame) – Monthly data with columns to join

  • data_cols (str | list[str]) – Column name(s) to join and fill. Can be a single column name (“rainfall”) or a list of columns ([“rain1”, “rain2”])

  • time_col (str) – Name of time column (default: ‘time’)

Return type:

DataFrame

Returns:

result – Sub-monthly data with monthly columns joined and filled within month
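Filling only within the same month amounts to looking up each timestamp's own (year, month). The plain-Python sketch below shows this; the helper name `join_monthly` and the dictionary-based inputs are illustrative assumptions rather than the library's polars implementation.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


def join_monthly(
    times: List[datetime], monthly: Dict[Tuple[int, int], float]
) -> List[Optional[float]]:
    """Hypothetical helper: give every timestamp the value of its own
    (year, month); months without a monthly value stay None."""
    return [monthly.get((t.year, t.month)) for t in times]


start = datetime(2020, 1, 30)
# Six 12-hourly timestamps running from 30 January to 1 February.
sub_monthly_times = [start + timedelta(hours=12 * i) for i in range(6)]
monthly_values = {(2020, 1): 55.0}  # no value for February
joined = join_monthly(sub_monthly_times, monthly_values)
# joined == [55.0, 55.0, 55.0, 55.0, None, None]
```

Because the lookup is keyed by month, the January value never leaks into February.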

rainfallqc.utils.data_utils.extract_negative_values_from_data(data, cols_to_extract_from)

Extract negative values from data.

Parameters:
  • data (DataFrame) – Rainfall data.

  • cols_to_extract_from (list) – Columns to extract negative values from

Return type:

DataFrame

Returns:

data – Data with only negative values or 0.

rainfallqc.utils.data_utils.extract_positive_values_from_data(data, cols_to_extract_from)

Extract positive values from data.

Parameters:
  • data (DataFrame) – Rainfall data.

  • cols_to_extract_from (list) – Columns to extract positive values from

Return type:

DataFrame

Returns:

data – Data with only positive values or 0.
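The effect of this function and of extract_negative_values_from_data can be illustrated on a plain list. The helper names below are hypothetical, and the sketch assumes that missing values stay missing; the library operates on polars DataFrame columns instead.

```python
from typing import List, Optional


def keep_positive(values: List[Optional[float]]) -> List[Optional[float]]:
    """Hypothetical helper: keep positive values, set everything else to 0."""
    return [None if v is None else (v if v > 0 else 0.0) for v in values]


def keep_negative(values: List[Optional[float]]) -> List[Optional[float]]:
    """Hypothetical helper: keep negative values, set everything else to 0."""
    return [None if v is None else (v if v < 0 else 0.0) for v in values]


differences = [1.5, -0.5, 0.0, None, -2.0]
positives = keep_positive(differences)  # [1.5, 0.0, 0.0, None, 0.0]
negatives = keep_negative(differences)  # [0.0, -0.5, 0.0, None, -2.0]
```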

rainfallqc.utils.data_utils.format_timedelta_duration(td)

Convert timedelta to custom strings.

Parameters:

td (timedelta) – Time delta to convert.

Return type:

str

Returns:

td – Human-readable timedelta string using largest unit (d, h, m, s).
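A minimal sketch of such a formatter follows. The helper name `format_duration` is hypothetical, and the sketch assumes that "largest unit" means the largest of d, h, m, s that divides the duration exactly, which yields strings such as '1d', '1h' and '15m'; the library's exact output format may differ.

```python
from datetime import timedelta


def format_duration(td: timedelta) -> str:
    """Hypothetical helper: express `td` in the largest unit (d, h, m, s)
    that divides it exactly."""
    total_seconds = int(td.total_seconds())
    for unit, seconds_per_unit in (("d", 86400), ("h", 3600), ("m", 60)):
        if total_seconds >= seconds_per_unit and total_seconds % seconds_per_unit == 0:
            return f"{total_seconds // seconds_per_unit}{unit}"
    return f"{total_seconds}s"


labels = [
    format_duration(timedelta(days=1)),      # '1d'
    format_duration(timedelta(hours=1)),     # '1h'
    format_duration(timedelta(minutes=15)),  # '15m'
    format_duration(timedelta(seconds=90)),  # '90s'
]
```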

rainfallqc.utils.data_utils.get_data_timestep_as_str(data)

Get time step of data.

Parameters:

data (DataFrame) – Data with time column

Return type:

str

Returns:

time_step – Time step of data, e.g. ‘1h’, ‘1d’, ‘15m’.

rainfallqc.utils.data_utils.get_data_timesteps(data)

Get data timesteps. Ideally the data should have only one.

Parameters:

data (DataFrame) – Data with time column.

Return type:

Series

Returns:

unique_timesteps – All unique time steps in data (timedelta).

rainfallqc.utils.data_utils.get_dry_period_proportions(dry_period_days)

Get dry period proportions.

Parameters:

dry_period_days (int) – Length of a “dry spell” in days (default: 15 days)

Return type:

dict

Returns:

fraction_dry_days – Dictionary with keys “1”, “2”, “3” with dry spell fractions

rainfallqc.utils.data_utils.get_dry_spells(data, target_gauge_col)

Get dry spell column.

Parameters:
  • data (DataFrame) – Rainfall data

  • target_gauge_col (str) – Column with rainfall data

Return type:

DataFrame

Returns:

data_w_dry_spells – Data with is_dry binary column

rainfallqc.utils.data_utils.get_expected_days_in_month(data)

Get the expected number of days in each month within the data.

Parameters:

data (DataFrame) – Data with ‘year’ and ‘month’ columns

Return type:

DataFrame

Returns:

data – Data with ‘expected_days_in_month’ column
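The expected number of days depends on both the year and the month because of leap years. The plain-Python sketch below uses `calendar.monthrange` from the standard library; the helper name `add_expected_days` and the list-of-dicts input are illustrative assumptions standing in for a polars DataFrame.

```python
import calendar
from typing import Dict, List


def add_expected_days(rows: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Hypothetical helper: add 'expected_days_in_month' to rows that have
    'year' and 'month' keys."""
    return [
        {**row, "expected_days_in_month": calendar.monthrange(row["year"], row["month"])[1]}
        for row in rows
    ]


rows = [{"year": 2020, "month": 2}, {"year": 2021, "month": 2}, {"year": 2021, "month": 4}]
with_days = add_expected_days(rows)
expected = [row["expected_days_in_month"] for row in with_days]  # [29, 28, 30]
```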

rainfallqc.utils.data_utils.get_normalised_diff(data, target_col, other_col, diff_col_name)

Get normalised difference between two columns in data.

Parameters:
  • data (DataFrame) – Data with columns

  • target_col (str) – Target column

  • other_col (str) – Other column.

  • diff_col_name (str) – New column name for difference column

Return type:

DataFrame

Returns:

data_w_norm_diff – Data with normalised diff

rainfallqc.utils.data_utils.make_month_and_year_col(data)

Make year and month columns for polars dataframe.

Parameters:

data (DataFrame) – Data with time column

Return type:

DataFrame

Returns:

data – Data with year and month columns

rainfallqc.utils.data_utils.normalise_data(data)

Normalise data to [0, 1].

Parameters:

data (Series | Expr) – Data to normalise.

Return type:

Series

Returns:

norm_data – Normalised data.
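Normalising to [0, 1] is commonly done with min-max scaling, sketched below in plain Python. The helper name `min_max_normalise` is hypothetical, the use of min-max scaling is an assumption about the method, and raising on constant data is a choice made for this sketch only.

```python
from typing import List


def min_max_normalise(values: List[float]) -> List[float]:
    """Hypothetical helper: rescale values linearly so min -> 0 and max -> 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # The range is zero, so the scaling is undefined.
        raise ValueError("Cannot normalise constant data")
    return [(v - lowest) / (highest - lowest) for v in values]


normalised = min_max_normalise([2.0, 4.0, 6.0, 10.0])
# normalised == [0.0, 0.25, 0.5, 1.0]
```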

rainfallqc.utils.data_utils.offset_data_by_time(data, target_col, offset_in_time, time_res)

Shift/offset data either backwards or forwards in time.

Parameters:
  • data (DataFrame) – Data with column to offset in ‘time’

  • target_col (str) – Column of data to offset

  • offset_in_time (int) – Amount to offset data by, e.g. 1 for 1 day if time_res is set to ‘1d’

  • time_res (str) – Time resolution like ‘hourly’, ‘daily’, ‘1h’ or ‘1d’

Return type:

DataFrame

Returns:

data – Offset data by ‘offset_in_time’ amount
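On a regular time grid, offsetting a column in time is equivalent to shifting its values by a number of rows. The plain-Python sketch below illustrates this; the helper name `shift_values` is hypothetical, and the sign convention (positive moves values later in time) is an assumption that may not match the library.

```python
from typing import List, Optional


def shift_values(values: List[Optional[float]], offset: int) -> List[Optional[float]]:
    """Hypothetical helper: move values `offset` rows later (positive) or
    earlier (negative) on a regular time grid, padding with None."""
    n = len(values)
    shifted: List[Optional[float]] = [None] * n
    for i, value in enumerate(values):
        target = i + offset
        if 0 <= target < n:
            shifted[target] = value
    return shifted


daily = [1.0, 2.0, 3.0, 4.0]
one_day_later = shift_values(daily, 1)     # [None, 1.0, 2.0, 3.0]
one_day_earlier = shift_values(daily, -1)  # [2.0, 3.0, 4.0, None]
```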

rainfallqc.utils.data_utils.replace_missing_vals_with_nan(data, target_gauge_col, missing_val=None)

Replace the no-data value with numpy.nan.

Parameters:
  • data (DataFrame) – Rainfall data

  • target_gauge_col (str) – Column of rainfall

  • missing_val (int) – Missing value identifier

Return type:

DataFrame

Returns:

gsdr_data – GSDR data with missing values replaced
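The replacement can be sketched on a plain list using `math.nan` from the standard library. The helper name `replace_missing` is hypothetical, and the sketch assumes that with `missing_val=None` only null entries are replaced; the library works on a polars DataFrame column and uses numpy.nan.

```python
import math
from typing import List, Optional


def replace_missing(
    values: List[Optional[float]], missing_val: Optional[float] = None
) -> List[float]:
    """Hypothetical helper: replace `missing_val` (and None) with NaN."""
    return [math.nan if v is None or v == missing_val else v for v in values]


cleaned = replace_missing([0.2, -999, 1.4, None], missing_val=-999)
# The -999 sentinel and the None entry are both NaN in `cleaned`.
```

NaN never compares equal to itself, so checks on the result should use `math.isnan` rather than `==`.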

rainfallqc.utils.data_utils.resample_data_by_time_step(data, rain_cols, time_col, time_step, min_count, hour_offset)

Resample data into a given time step (e.g. hourly into daily), requiring a minimum number of time steps per period (e.g. 24 hourly time steps per day).

Parameters:
  • data (DataFrame) – Rainfall data to resample

  • rain_cols (List[str]) – List of columns with rainfall data

  • time_col (str) – Name of time column

  • time_step (str) – Time step to resample into (e.g. ‘1d’ for daily, ‘1h’ for hourly, ‘15m’ for 15 minute)

  • min_count (int) – Minimum number of time steps needed per time period

  • hour_offset (int) – Time offset in hours (needed if data is not aligned to midnight)

Return type:

DataFrame

Returns:

resampled_data – Rainfall data grouped into a given time step
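The roles of `min_count` and `hour_offset` can be illustrated for the hourly-to-daily case with a plain-Python sketch. The helper name `hourly_to_daily` is hypothetical, and the sketch assumes that periods are summed and that days with too few values are dropped; the library's polars implementation may mark them as missing instead.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from typing import Dict, List


def hourly_to_daily(
    times: List[datetime], rain: List[float], min_count: int = 24, hour_offset: int = 0
) -> Dict[date, float]:
    """Hypothetical helper: daily totals, keeping only days with at least
    `min_count` values; each day starts at `hour_offset` o'clock."""
    values_by_day = defaultdict(list)
    for t, value in zip(times, rain):
        # Subtracting the offset assigns each timestamp to the day it belongs to.
        day = (t - timedelta(hours=hour_offset)).date()
        values_by_day[day].append(value)
    return {day: sum(v) for day, v in values_by_day.items() if len(v) >= min_count}


start = datetime(2020, 1, 1)
times = [start + timedelta(hours=i) for i in range(30)]  # one full day plus 6 hours
daily = hourly_to_daily(times, [0.5] * 30)
# daily == {date(2020, 1, 1): 12.0}; 2 January has only 6 values and is dropped.
```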

Functions

back_propagate_daily_data_flags(data, ...)

Back-fill flags over a number of days.

calculate_dry_spell_fraction(data, ...)

Calculate dry spell fraction.

check_data_has_consistent_time_step(data)

Check data has a consistent time step, e.g. '1h'.

check_data_is_monthly(data)

Check data is monthly.

check_data_is_specific_time_res(data, time_res)

Check data has an hourly or daily time step.

check_for_negative_values(df, target_gauge_col)

Check if the target column contains any negative values.

convert_daily_data_to_monthly(daily_data, ...)

Convert daily data to monthly, setting a month to NaN if less than a given percentage of its days is present.

convert_datarray_seconds_to_days(series_seconds)

Convert xarray series from seconds to days.

downsample_and_fill_columns(high_res_data, ...)

Join columns from lower resolution data to higher resolution data and fill gaps.

downsample_monthly_data(sub_monthly_data, ...)

Join monthly data to sub-monthly data (e.g. hourly) and fill only within the same month.

extract_negative_values_from_data(data, ...)

Extract negative values from data.

extract_positive_values_from_data(data, ...)

Extract positive values from data.

format_timedelta_duration(td)

Convert timedelta to custom strings.

get_data_timestep_as_str(data)

Get time step of data.

get_data_timesteps(data)

Get data timesteps.

get_dry_period_proportions(dry_period_days)

Get dry period proportions.

get_dry_spells(data, target_gauge_col)

Get dry spell column.

get_expected_days_in_month(data)

Get the expected number of days in each month within the data.

get_normalised_diff(data, target_col, ...)

Get normalised difference between two columns in data.

make_month_and_year_col(data)

Make year and month columns for polars dataframe.

normalise_data(data)

Normalise data to [0, 1].

offset_data_by_time(data, target_col, ...)

Shift/offset data either backwards or forwards in time.

replace_missing_vals_with_nan(data, ...[, ...])

Replace the no-data value with numpy.nan.

resample_data_by_time_step(data, rain_cols, ...)

Resample data into a given time step (e.g. hourly into daily), requiring a minimum number of time steps per period (e.g. 24 hourly time steps per day).