cyto_ml.data package

Submodules

cyto_ml.data.db_config module

cyto_ml.data.flowcam module

class cyto_ml.data.flowcam.FlowCamSession(directory: str, output_directory: str, experiment_name: str)[source]

Bases: object

Bundle up all the logic of the decollage script so it can be run without passing command-line arguments.

do_decollage() None[source]

A single, not very lovely function that replaces the work of the script. See cyto_ml.pipeline.pipeline_decollage, which contains the same code.

output_dir() None[source]
read_metadata() None[source]
cyto_ml.data.flowcam.exif_headers(lon: float, lat: float, date: str, depth: int | None = 0) dict[source]

Given lat, lon, date and an optional depth, build and return a dict with EXIF-standard tags as keys.
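A minimal usage sketch; the coordinates, date and depth below are hypothetical, and the exact EXIF tag names in the returned dict are left to the implementation:

    from cyto_ml.data.flowcam import exif_headers

    # Hypothetical values for illustration only
    headers = exif_headers(lon=-2.97, lat=54.35, date="2023-06-01", depth=5)
    # The result maps EXIF-standard tag names to values, ready for write_headers()
    print(headers)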

cyto_ml.data.flowcam.headers_from_filename(filename: str) dict[source]

Attempt to extract lon/lat, date and an optional depth from the filename. Return a dict of key-value pairs for use as EXIF headers.

cyto_ml.data.flowcam.lst_metadata(filename: str) DataFrame[source]

Read the CSV-like “.lst” file from the FlowCam export. Return a pandas DataFrame.
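A brief usage sketch; the export path is hypothetical and the resulting columns depend on the FlowCam export:

    from cyto_ml.data.flowcam import lst_metadata

    # Hypothetical path to a FlowCam export
    df = lst_metadata("exports/session_01/session_01.lst")
    print(df.columns)
    print(df.head())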

cyto_ml.data.flowcam.parse_filename(filename: str) tuple[source]

Attempt to extract the file prefix, lon, lat, date and depth from the filename.

cyto_ml.data.flowcam.read_headers(filename: str) dict[source]
cyto_ml.data.flowcam.window_slice(image: ndarray, x: int, y: int, height: int, width: int) ndarray[source]
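A minimal numpy sketch of the window slice; treating y as the row index and x as the column index is an assumption here, not something the signature confirms:

    import numpy as np

    def window_slice_sketch(image: np.ndarray, x: int, y: int, height: int, width: int) -> np.ndarray:
        # Cut a height-by-width window whose top-left corner is at (x, y),
        # assuming rows are indexed by y and columns by x
        return image[y : y + height, x : x + width]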
cyto_ml.data.flowcam.write_headers(filename: str, headers: dict) bool[source]

Given a dictionary of EXIF tag keys and their values, write them to filename. Returns True if nothing has obviously gone wrong during the process.
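A hedged end-to-end sketch combining headers_from_filename and write_headers; the filename below is invented, and the real naming convention the parser expects is defined in the source:

    from cyto_ml.data.flowcam import headers_from_filename, write_headers

    # Hypothetical filename; the parser attempts to recover lon/lat, date and depth from it
    filename = "prefix_-2.97_54.35_20230601_5.tif"
    headers = headers_from_filename(filename)
    ok = write_headers(filename, headers)
    if not ok:
        print("EXIF write failed")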

cyto_ml.data.image module

exception cyto_ml.data.image.ImageProcessingError[source]

Bases: Exception

cyto_ml.data.image.base_normalise() Compose[source]

Baseline transform: don’t standardise the values, just convert to a tensor (which automatically scales them to a 0-1 range).
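A sketch of an equivalent transform, assuming torchvision’s ToTensor is what does the tensorising and 0-1 scaling:

    from torchvision import transforms

    # ToTensor converts a PIL image or numpy array (H, W, C) in the 0-255 range
    # to a float tensor (C, H, W) scaled to 0.0-1.0
    base = transforms.Compose([transforms.ToTensor()])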

cyto_ml.data.image.convert_3_band(image: array) array[source]

Given a 1-band image normalised between 0 and 1, convert it to 3 bands: https://stackoverflow.com/a/57723482. This seems brute-force, but PIL does not convert our odd-format greyscale images from the flow cytometer well. Improvements appreciated.
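A sketch of the stacking approach from the linked answer (not necessarily the exact implementation here):

    import numpy as np

    def convert_3_band_sketch(image: np.ndarray) -> np.ndarray:
        # Repeat the single band along a new last axis to produce a 3-band image
        return np.stack((image,) * 3, axis=-1)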

cyto_ml.data.image.load_image(path: str, normalise_func: str | None = 'base_normalise') Tensor[source]

Given an image path, return a tensor suitable to hand to a model. The optional normalise_func defaults to converting values to a 0-1 range.

cyto_ml.data.image.load_image_from_url(url: str, normalise_func: str | None = 'base_normalise') Tensor[source]

Given an image URL, return a tensor suitable to hand to a model. The optional normalise_func defaults to converting values to a 0-1 range.
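A usage sketch; the path and URL are hypothetical, and passing "resize_normalise" by name assumes the loader resolves transform names within this module, as it does for the default:

    from cyto_ml.data.image import load_image, load_image_from_url

    # Hypothetical locations
    tensor = load_image("images/cell_001.tif")
    tensor = load_image_from_url("https://example.org/images/cell_001.tif")

    # Assumed: transforms are passed by name, like the default "base_normalise"
    tensor = load_image("images/cell_001.tif", normalise_func="resize_normalise")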

cyto_ml.data.image.normalise_flowlr(image: PIL.Image) array[source]

Utility function to normalise flow cytometer images. As output from the flow cytometer they are 16-bit greyscale, but all the values sit in a low range (max value 1018 across the set).

As recommended by @Kzra, normalise all values by the maximum, both for display and before handing to a model.

Image.point(lambda…) should do this, but the values stay integers, so round-trip through numpy instead.
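A minimal sketch of that numpy round trip, scaling by the image’s own maximum:

    import numpy as np
    from PIL import Image

    def normalise_flowlr_sketch(image: Image.Image) -> np.ndarray:
        # The 16-bit greyscale values sit in a low range, so divide by the observed maximum
        pixels = np.array(image).astype(np.float32)
        return pixels / pixels.max()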

cyto_ml.data.image.prepare_image(image: PIL.Image, normalise_func: str | None = 'base_normalise') Tensor[source]

Take an xarray of image data and prepare it to pass through the model:
a) converts the image data to a PyTorch tensor
b) accepts a single image or a batch (no need for torch.stack)

cyto_ml.data.image.resize_normalise() Compose[source]

Resize to 256x256, following https://github.com/ukceh-rse/ViT-LASNet/blob/36235f9b992a6c345f1010dab133549d20f181d9/test/test.py#L115
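A sketch assuming torchvision transforms, matching the 256x256 resize in the linked test code; any model-specific normalisation is left out:

    from torchvision import transforms

    # Resize to 256x256, then convert to a 0-1 float tensor
    resize = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
    ])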

cyto_ml.data.labels module

cyto_ml.data.s3 module

Thin wrapper around the S3 object store holding images and metadata

cyto_ml.data.s3.boto3_client() Session[source]
cyto_ml.data.s3.bucket_keys(bucket_name: str, prefix: str = '/', delimiter: str = '/', start_after: str = '') Generator[str, None, None][source]

Efficiently list the contents of a bucket. Lifted from this highly-rated SO answer: https://stackoverflow.com/a/54014862
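A sketch of the listing logic using boto3’s paginator; the linked answer instead loops with StartAfter, and the bucket name here would be supplied by the caller:

    import boto3

    def bucket_keys_sketch(bucket_name: str, prefix: str = ""):
        # Paginate over list_objects_v2 so buckets with more than 1000 keys are handled
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]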

cyto_ml.data.s3.image_index(location: str, suffix: str = '.tif') DataFrame[source]

Find records in a bucket and return a DataFrame serving as an index. Filter by an optional file suffix, which defaults to .tif.
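A sketch of how such an index might be assembled from bucket_keys; the column name and the way the location is passed are assumptions, not the actual implementation:

    import pandas as pd
    from cyto_ml.data.s3 import bucket_keys

    def image_index_sketch(bucket_name: str, suffix: str = ".tif") -> pd.DataFrame:
        # Filter keys by suffix and wrap them in a single-column DataFrame
        keys = [k for k in bucket_keys(bucket_name) if k.endswith(suffix)]
        return pd.DataFrame({"key": keys})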

cyto_ml.data.vectorstore module

class cyto_ml.data.vectorstore.ChromadbStore(db_name: str)[source]

Bases: VectorStore

add(url: str, embeddings: List[float]) None[source]

Add vector to Chromadb

client = <chromadb.api.client.Client object>
closest(url: str, n_results: int = 25) List[source]

Get the N closest identifiers by cosine distance

embeddings() List[List][source]
get(url: str) list[source]

Retrieve vector from Chromadb

ids() List[str][source]
class cyto_ml.data.vectorstore.PostgresStore(db_name: str)[source]

Bases: VectorStore

add(url: str, embeddings: List[float]) None[source]
closest(embeddings: list, n_results: int = 25) List[source]
embeddings() List[List][source]
get(url: str) List[float][source]
ids() List[str][source]
class cyto_ml.data.vectorstore.SQLiteVecStore(db_name: str, embedding_len: int | None = 512, check_same_thread: bool = True)[source]

Bases: VectorStore

add(url: str, embeddings: List[float], classification: str | None = '') None[source]

Add image embeddings to storage. Two tables are used (see the schema sketch after this class):

* one regular table which holds metadata, with embeddings as floats
* one “virtual table” for indexing it by ID with encoded embeddings

classes() List[str][source]
closest(url: str, n_results: int = 25) List[source]

Find and return the N closest examples by distance. Accepts an image URL and returns a list ordered by distance.

embeddings() List[List][source]
get(url: str) List[float][source]
ids() List[str][source]
labelled(label: str, n_results: int = 50) List[str][source]
load_ext(db_name: str) None[source]

Load the sqlite-vec extension into our db if needed

load_schema() None[source]

Load our db schema if needed; the embedding length is set at init (default 512). Consider SQLAlchemy for this, or a CLI-based way of loading from a file; a list of CREATE TABLE statements feels like a kludge.
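A minimal sketch of the two-table layout described in add() and created by load_schema(), assuming the sqlite-vec extension and its Python bindings; the table and column names here are illustrative, not the real schema:

    import sqlite3
    import sqlite_vec

    db = sqlite3.connect("embeddings.db")
    db.enable_load_extension(True)
    sqlite_vec.load(db)
    db.enable_load_extension(False)

    # Regular table holding metadata, with the embedding stored as plain floats
    db.execute(
        "CREATE TABLE IF NOT EXISTS embeddings "
        "(id INTEGER PRIMARY KEY, url TEXT, embedding TEXT, classification TEXT)"
    )
    # Virtual table indexing encoded embeddings by ID for nearest-neighbour queries
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS embeddings_index "
        "USING vec0(embedding float[512])"
    )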

class cyto_ml.data.vectorstore.VectorStore[source]

Bases: object

abstract add(url: str, embeddings: List[float]) None[source]
abstract closest(embeddings: List) List[float][source]
abstract embeddings() List[List][source]
abstract get(url: str) List[float][source]
abstract ids() List[str][source]
cyto_ml.data.vectorstore.deserialize(packed: bytes) List[float][source]

Inverse of the serialize_f32 method below (e.g. for clustering)

cyto_ml.data.vectorstore.serialize_f32(vector: List[float]) bytes[source]

Serializes a list of floats into a compact “raw bytes” format, following https://github.com/asg017/sqlite-vec/blob/main/examples/simple-python/demo.py
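A sketch of the pack/unpack pair, following the linked sqlite-vec demo:

    import struct
    from typing import List

    def serialize_f32_sketch(vector: List[float]) -> bytes:
        # Pack each value as a 32-bit float (4 bytes each, native byte order)
        return struct.pack("%sf" % len(vector), *vector)

    def deserialize_sketch(packed: bytes) -> List[float]:
        # Unpack assuming 4 bytes per 32-bit float
        return list(struct.unpack("%sf" % (len(packed) // 4), packed))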

cyto_ml.data.vectorstore.vector_store(store_type: str | None = 'chromadb', db_name: str | None = 'test_collection', **kwargs) VectorStore[source]
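A usage sketch of the factory with its defaults; the image URL is hypothetical and the 512-float list stands in for a real model embedding:

    from cyto_ml.data.vectorstore import vector_store

    # Defaults from the signature: a chromadb-backed store named "test_collection"
    store = vector_store()

    url = "https://example.org/images/cell_001.tif"
    store.add(url, [0.1] * 512)
    neighbours = store.closest(url)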

Module contents