cyto_ml.data package

Submodules

cyto_ml.data.db_config module

cyto_ml.data.flowcam module

class cyto_ml.data.flowcam.FlowCamSession(directory: str, output_directory: str, experiment_name: str)[source]

Bases: object

Bundle up all the logic of the decollage script so it can be run without passing command-line arguments.

do_decollage() None[source]

A single, not very lovely function that replaces the work of the script. See cyto_ml.pipeline.pipeline_decollage, which contains the same code.

output_dir() None[source]
read_metadata() None[source]
cyto_ml.data.flowcam.exif_headers(lon: float, lat: float, date: str, depth: int | None = 0) dict[source]

Given lat, lon, date and an optional depth, build and return a dict with EXIF-standard tags as keys.
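A minimal usage sketch; the coordinates, date and depth below are hypothetical, and the exact EXIF tag names in the returned dict are left to the implementation:

    from cyto_ml.data.flowcam import exif_headers

    # Hypothetical values for illustration only
    headers = exif_headers(lon=-2.97, lat=54.35, date="2023-06-01", depth=5)
    # The result maps EXIF-standard tag names to values, ready for write_headers()
    print(headers)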

cyto_ml.data.flowcam.headers_from_filename(filename: str) dict[source]

Attempt to extract lon/lat, date and an optional depth from the filename. Return a dict of key-value pairs for use as EXIF headers.

cyto_ml.data.flowcam.lst_metadata(filename: str) DataFrame[source]

Read the CSV-like “.lst” file from the FlowCam export. Return a pandas DataFrame.
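A brief usage sketch; the export path is hypothetical and the resulting columns depend on the FlowCam export:

    from cyto_ml.data.flowcam import lst_metadata

    # Hypothetical path to a FlowCam export
    df = lst_metadata("exports/session_01/session_01.lst")
    print(df.columns)
    print(df.head())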

cyto_ml.data.flowcam.parse_filename(filename: str) tuple[source]

Attempt to extract the file prefix, lon, lat, date and depth from the filename.

cyto_ml.data.flowcam.read_headers(filename: str) dict[source]
cyto_ml.data.flowcam.window_slice(image: ndarray, x: int, y: int, height: int, width: int) ndarray[source]
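A minimal numpy sketch of the window slice; treating y as the row index and x as the column index is an assumption here, not something the signature confirms:

    import numpy as np

    def window_slice_sketch(image: np.ndarray, x: int, y: int, height: int, width: int) -> np.ndarray:
        # Cut a height-by-width window whose top-left corner is at (x, y),
        # assuming rows are indexed by y and columns by x
        return image[y : y + height, x : x + width]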
cyto_ml.data.flowcam.write_headers(filename: str, headers: dict) bool[source]

Given a dictionary of EXIF tag keys and their values, write them to filename. Returns True if nothing has obviously gone wrong during the process.
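A hedged end-to-end sketch combining headers_from_filename and write_headers; the filename below is invented, and the real naming convention the parser expects is defined in the source:

    from cyto_ml.data.flowcam import headers_from_filename, write_headers

    # Hypothetical filename; the parser attempts to recover lon/lat, date and depth from it
    filename = "prefix_-2.97_54.35_20230601_5.tif"
    headers = headers_from_filename(filename)
    ok = write_headers(filename, headers)
    if not ok:
        print("EXIF write failed")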

cyto_ml.data.image module

exception cyto_ml.data.image.ImageProcessingError[source]

Bases: Exception

cyto_ml.data.image.base_normalise() Compose[source]

Baseline transform: don’t standardise the values, just convert to a tensor (which automatically scales them to a 0-1 range).
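A sketch of an equivalent transform, assuming torchvision’s ToTensor is what does the tensorising and 0-1 scaling:

    from torchvision import transforms

    # ToTensor converts a PIL image or numpy array (H, W, C) in the 0-255 range
    # to a float tensor (C, H, W) scaled to 0.0-1.0
    base = transforms.Compose([transforms.ToTensor()])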

cyto_ml.data.image.convert_3_band(image: array) array[source]

Given a 1-band image normalised between 0 and 1, convert it to 3 bands: https://stackoverflow.com/a/57723482. This seems brute-force, but PIL does not convert our odd-format greyscale images from the flow cytometer well. Improvements appreciated.
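A sketch of the stacking approach from the linked answer (not necessarily the exact implementation here):

    import numpy as np

    def convert_3_band_sketch(image: np.ndarray) -> np.ndarray:
        # Repeat the single band along a new last axis to produce a 3-band image
        return np.stack((image,) * 3, axis=-1)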

cyto_ml.data.image.load_image(path: str, normalise_func: str | None = 'base_normalise') Tensor[source]

Given an image path, return a tensor suitable to hand to a model. The optional normalise_func defaults to converting values to a 0-1 range.

cyto_ml.data.image.load_image_from_url(url: str, normalise_func: str | None = 'base_normalise') Tensor[source]

Given an image URL, return a tensor suitable to hand to a model. The optional normalise_func defaults to converting values to a 0-1 range.
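A usage sketch; the path and URL are hypothetical, and passing "resize_normalise" by name assumes the loader resolves transform names within this module, as it does for the default:

    from cyto_ml.data.image import load_image, load_image_from_url

    # Hypothetical locations
    tensor = load_image("images/cell_001.tif")
    tensor = load_image_from_url("https://example.org/images/cell_001.tif")

    # Assumed: transforms are passed by name, like the default "base_normalise"
    tensor = load_image("images/cell_001.tif", normalise_func="resize_normalise")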

cyto_ml.data.image.normalise_flowlr(image: PIL.Image) array[source]

Utility function to normalise flow cytometer images. As output from the flow cytometer they are 16-bit greyscale, but all the values sit in a low range (max value 1018 across the set).

As recommended by @Kzra, normalise all values by the maximum, both for display and before handing to a model.

Image.point(lambda…) should do this, but the values stay integers, so round-trip through numpy instead.
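A minimal sketch of that numpy round trip, scaling by the image’s own maximum:

    import numpy as np
    from PIL import Image

    def normalise_flowlr_sketch(image: Image.Image) -> np.ndarray:
        # The 16-bit greyscale values sit in a low range, so divide by the observed maximum
        pixels = np.array(image).astype(np.float32)
        return pixels / pixels.max()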

cyto_ml.data.image.prepare_image(image: PIL.Image, normalise_func: str | None = 'base_normalise') Tensor[source]

Take an xarray of image data and prepare it to pass through the model:
a) converts the image data to a PyTorch tensor
b) accepts a single image or a batch (no need for torch.stack)

cyto_ml.data.image.resize_normalise() Compose[source]

Resize to 256x256, following https://github.com/ukceh-rse/ViT-LASNet/blob/36235f9b992a6c345f1010dab133549d20f181d9/test/test.py#L115
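A sketch assuming torchvision transforms, matching the 256x256 resize in the linked test code; any model-specific normalisation is left out:

    from torchvision import transforms

    # Resize to 256x256, then convert to a 0-1 float tensor
    resize = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
    ])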

cyto_ml.data.labels module

cyto_ml.data.s3 module

Thin wrapper around the S3 object store holding images and metadata

cyto_ml.data.s3.boto3_client() Session[source]
cyto_ml.data.s3.bucket_keys(bucket_name: str, prefix: str = '/', delimiter: str = '/', start_after: str = '') Generator[str, None, None][source]

Efficiently list the contents of a bucket. Lifted from this highly-rated SO answer: https://stackoverflow.com/a/54014862
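A sketch of the listing logic using boto3’s paginator; the linked answer instead loops with StartAfter, and the bucket name here would be supplied by the caller:

    import boto3

    def bucket_keys_sketch(bucket_name: str, prefix: str = ""):
        # Paginate over list_objects_v2 so buckets with more than 1000 keys are handled
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]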

cyto_ml.data.s3.image_index(location: str, suffix: str = '.tif') DataFrame[source]

Find records in a bucket and return a DataFrame serving as an index. Filter by an optional file suffix, which defaults to .tif.
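A sketch of how such an index might be assembled from bucket_keys; the column name and the way the location is passed are assumptions, not the actual implementation:

    import pandas as pd
    from cyto_ml.data.s3 import bucket_keys

    def image_index_sketch(bucket_name: str, suffix: str = ".tif") -> pd.DataFrame:
        # Filter keys by suffix and wrap them in a single-column DataFrame
        keys = [k for k in bucket_keys(bucket_name) if k.endswith(suffix)]
        return pd.DataFrame({"key": keys})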

cyto_ml.data.vectorstore module

class cyto_ml.data.vectorstore.ChromadbStore(db_name: str)[source]

Bases: VectorStore

add(url: str, embeddings: List[float]) None[source]

Add vector to Chromadb

client = <chromadb.api.client.Client object>
closest(url: str, n_results: int = 25) List[source]

Get the N closest identifiers by cosine distance

embeddings() List[List][source]
get(url: str) list[source]

Retrieve vector from Chromadb

ids() List[str][source]
class cyto_ml.data.vectorstore.PostgresStore(db_name: str)[source]

Bases: VectorStore

add(url: str, embeddings: List[float]) None[source]
closest(embeddings: list, n_results: int = 25) List[source]
embeddings() List[List][source]
get(url: str) List[float][source]
ids() List[str][source]
class cyto_ml.data.vectorstore.SQLiteVecStore(db_name: str, embedding_len: int | None = 512, check_same_thread: bool = True)[source]

Bases: VectorStore

add(url: str, embeddings: List[float], classification: str | None = '') None[source]

Add image embeddings to storage. Two tables are used (see the schema sketch after this class):

* one regular table which holds metadata, with embeddings as floats
* one “virtual table” for indexing it by ID with encoded embeddings

classes() List[str][source]
closest(url: str, n_results: int = 25) List[source]

Find and return the N closest examples by distance. Accepts an image URL and returns a list ordered by distance.

embeddings() List[List][source]
get(url: str) List[float][source]
ids() List[str][source]
labelled(label: str, n_results: int = 50) List[str][source]
load_ext(db_name: str) None[source]

Load the sqlite-vec extension into our db if needed

load_schema() None[source]

Load our db schema if needed; the embedding length is set at init (default 512). Consider SQLAlchemy for this, or a CLI-based way of loading from a file; a list of CREATE TABLE statements feels like a kludge.
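A minimal sketch of the two-table layout described in add() and created by load_schema(), assuming the sqlite-vec extension and its Python bindings; the table and column names here are illustrative, not the real schema:

    import sqlite3
    import sqlite_vec

    db = sqlite3.connect("embeddings.db")
    db.enable_load_extension(True)
    sqlite_vec.load(db)
    db.enable_load_extension(False)

    # Regular table holding metadata, with the embedding stored as plain floats
    db.execute(
        "CREATE TABLE IF NOT EXISTS embeddings "
        "(id INTEGER PRIMARY KEY, url TEXT, embedding TEXT, classification TEXT)"
    )
    # Virtual table indexing encoded embeddings by ID for nearest-neighbour queries
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS embeddings_index "
        "USING vec0(embedding float[512])"
    )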

class cyto_ml.data.vectorstore.VectorStore[source]

Bases: object

abstract add(url: str, embeddings: List[float]) None[source]
abstract closest(embeddings: List) List[float][source]
abstract embeddings() List[List][source]
abstract get(url: str) List[float][source]
abstract ids() List[str][source]
cyto_ml.data.vectorstore.deserialize(packed: bytes) List[float][source]

Inverse of the serialize_f32 method below (e.g. for clustering)

cyto_ml.data.vectorstore.serialize_f32(vector: List[float]) bytes[source]

Serializes a list of floats into a compact “raw bytes” format, following https://github.com/asg017/sqlite-vec/blob/main/examples/simple-python/demo.py
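A sketch of the pack/unpack pair, following the linked sqlite-vec demo:

    import struct
    from typing import List

    def serialize_f32_sketch(vector: List[float]) -> bytes:
        # Pack each value as a 32-bit float (4 bytes each, native byte order)
        return struct.pack("%sf" % len(vector), *vector)

    def deserialize_sketch(packed: bytes) -> List[float]:
        # Unpack assuming 4 bytes per 32-bit float
        return list(struct.unpack("%sf" % (len(packed) // 4), packed))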

cyto_ml.data.vectorstore.vector_store(store_type: str | None = 'chromadb', db_name: str | None = 'test_collection', **kwargs) VectorStore[source]
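A usage sketch of the factory with its defaults; the image URL is hypothetical and the 512-float list stands in for a real model embedding:

    from cyto_ml.data.vectorstore import vector_store

    # Defaults from the signature: a chromadb-backed store named "test_collection"
    store = vector_store()

    url = "https://example.org/images/cell_001.tif"
    store.add(url, [0.1] * 512)
    neighbours = store.closest(url)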

Module contents