Data Storage

We currently store all our timeseries data in Parquet files on S3. This has its advantages and works really well for read-only data. For ingestion, however, we have had to add some hacks to avoid horizontal scaling, and it’s not great having to read an entire Parquet file into memory just to append a single row when new data comes in (Parquet files can’t be updated in place).

On the API side we have also seen a number of issues with Parquet, and have had to create a custom DuckDB connection pool in order to use Parquet with DuckDB behind an API.

We now want to start writing processed data, which we already know will require many updates, and it would be nice not to have to force processing to run sequentially just to avoid overwriting data.
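The overwrite risk can be illustrated with a small simulation. This is only a sketch under stated assumptions: a plain dict stands in for the bucket, JSON stands in for Parquet, and the key and the worker names are hypothetical. It shows two processing workers that each read the same object, modify their copy, and write the whole object back.

```python
import json

# Hypothetical stand-in for an object store: key -> immutable blob.
store = {}


def read_rows(key):
    """Read and decode the whole object."""
    return json.loads(store[key])


def write_rows(key, rows):
    """Replace the whole object; there is no partial update."""
    store[key] = json.dumps(rows).encode()


key = "processed/series-1"  # hypothetical key
write_rows(key, [{"ts": 1, "value": 10.0}])

# Both workers read the same snapshot before either has written.
rows_a = read_rows(key)
rows_b = read_rows(key)

rows_a.append({"ts": 2, "value": 20.0})  # worker A's update
rows_b.append({"ts": 3, "value": 30.0})  # worker B's update

write_rows(key, rows_a)
write_rows(key, rows_b)  # replaces A's object, so A's row is lost

final_ts = [row["ts"] for row in read_rows(key)]
```

After both writes, only worker B's row survives alongside the original one. Avoiding this with whole-object rewrites means serialising the workers, which is the sequential processing constraint mentioned above.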

Because of the above, we have been considering alternative data storage solutions for the timeseries product.

Solutions we investigated

[Image: solutions we investigated]