cyto_ml.data package
Submodules
cyto_ml.data.db_config module
cyto_ml.data.flowcam module
- class cyto_ml.data.flowcam.FlowCamSession(directory: str, output_directory: str, experiment_name: str)
Bases: object
Bundles up all the logic of the decollage script so that it can be run without passing command-line arguments.
- cyto_ml.data.flowcam.exif_headers(lon: float, lat: float, date: str, depth: int | None = 0) -> dict
Given lat, lon, date and, optionally, depth, write and return a dict with EXIF standard tags as keys.
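To illustrate what a dict keyed by EXIF tags might look like, here is a minimal sketch. It does not reproduce the package's implementation: `exif_headers_sketch` and `to_dms` are hypothetical names, and the choice of tags (the EXIF GPS tags plus `DateTimeOriginal`) and the treatment of depth as an altitude below sea level are assumptions.

```python
def to_dms(decimal_degrees: float) -> tuple:
    """Convert decimal degrees to an unsigned (degrees, minutes, seconds) tuple."""
    value = abs(decimal_degrees)
    degrees = int(value)
    minutes_full = (value - degrees) * 60
    minutes = int(minutes_full)
    seconds = round((minutes_full - minutes) * 60, 4)
    return degrees, minutes, seconds


def exif_headers_sketch(lon: float, lat: float, date: str, depth: int = 0) -> dict:
    """Hypothetical stand-in for exif_headers; the key selection is an assumption."""
    return {
        # EXIF stores the hemisphere separately from the unsigned coordinate
        "GPSLatitudeRef": "N" if lat >= 0 else "S",
        "GPSLatitude": to_dms(lat),
        "GPSLongitudeRef": "E" if lon >= 0 else "W",
        "GPSLongitude": to_dms(lon),
        # In the EXIF GPS tags, an altitude reference of 1 means below sea level
        "GPSAltitudeRef": 1,
        "GPSAltitude": depth or 0,
        # EXIF date-times use the form "YYYY:MM:DD HH:MM:SS"
        "DateTimeOriginal": date,
    }


headers = exif_headers_sketch(lon=-3.25, lat=54.5, date="2023:06:21 00:00:00")
```

The hemisphere letters carry the sign, so the coordinate tuples themselves are always non-negative.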
- cyto_ml.data.flowcam.headers_from_filename(filename: str) -> dict
Attempt to extract lon/lat and date, and optionally depth, from the filename. Return a dict with key-value pairs for use as EXIF headers.
- cyto_ml.data.flowcam.lst_metadata(filename: str) -> DataFrame
Read the csv-ish “.lst” file from the FlowCam export. Return a pandas dataframe.
- cyto_ml.data.flowcam.parse_filename(filename: str) -> tuple
Attempt to extract file prefix, lon, lat, date and depth from the filename.
cyto_ml.data.image module
- cyto_ml.data.image.base_normalise() -> Compose
Baseline: don’t standardise the values, just tensorise (which automatically translates to a 0-1 range).
- cyto_ml.data.image.convert_3_band(image: array) -> array
Given a 1-band image normalised between 0 and 1, convert to 3 band (https://stackoverflow.com/a/57723482). This seems very brute-force, but PIL is not converting our odd-format greyscale images from the Flow Cytometer well. Improvements appreciated.
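One common way to turn a 1-band array into a 3-band one is to repeat the band along a new trailing axis. The sketch below shows that approach with numpy; `to_three_band` is a hypothetical name, and whether the package does exactly this is an assumption.

```python
import numpy as np


def to_three_band(image: np.ndarray) -> np.ndarray:
    """Repeat a 2-D (height, width) array three times along a new last axis."""
    return np.stack((image,) * 3, axis=-1)


# A tiny 2x2 greyscale image, already normalised between 0 and 1
grey = np.array([[0.0, 0.5], [1.0, 0.25]])
rgb = to_three_band(grey)
```

The result has shape (height, width, 3), with identical values in every band.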
- cyto_ml.data.image.load_image(path: str, normalise_func: str | None = 'base_normalise') -> Tensor
Given an image path, return a tensor suitable to hand to a model. The optional normalise_func defaults to converting to a range between 0 and 1.
- cyto_ml.data.image.load_image_from_url(url: str, normalise_func: str | None = 'base_normalise') -> Tensor
Given an image url, return a tensor suitable to hand to a model. The optional normalise_func defaults to converting to a range between 0 and 1.
- cyto_ml.data.image.normalise_flowlr(image: PIL.Image) -> array
Utility function to normalise flow cytometer images. As output by the flow cytometer, they are 16-bit greyscale, but all the values are in a low range (max value 1018 across the set).
As recommended by @Kzra, normalise all values by the maximum, both for display and before handing to a model.
Image.point(lambda…) should do this, but the values stay integers, so roundtrip this through numpy.
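The numpy roundtrip described above can be sketched as follows. This is an illustration only: `normalise_by_max` is a hypothetical name, and dividing by each image's own maximum (rather than by a maximum taken across the whole set) is an assumption.

```python
import numpy as np


def normalise_by_max(pixels: np.ndarray) -> np.ndarray:
    """Scale integer pixel values into the 0-1 range as floats."""
    as_float = pixels.astype(np.float64)
    peak = as_float.max()
    if peak == 0:
        # An all-zero image has nothing to scale; avoid dividing by zero
        return as_float
    return as_float / peak


# 16-bit values in the low range described above
raw = np.array([[0, 509], [1018, 254]], dtype=np.uint16)
scaled = normalise_by_max(raw)
```

Casting to float before dividing is the step that integer-only pixel operations cannot provide.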
- cyto_ml.data.image.prepare_image(image: PIL.Image, normalise_func: str | None = 'base_normalise') -> Tensor
Take an xarray of image data and prepare it to pass through the model.
a) Converts the image data to a PyTorch tensor
b) Accepts a single image or batch (no need for torch.stack)
- cyto_ml.data.image.resize_normalise() -> Compose
Resize to 256x256. See https://github.com/ukceh-rse/ViT-LASNet/blob/36235f9b992a6c345f1010dab133549d20f181d9/test/test.py#L115
cyto_ml.data.labels module
cyto_ml.data.s3 module
Thin wrapper around the s3 object store with images and metadata.
- cyto_ml.data.s3.bucket_keys(bucket_name: str, prefix: str = '/', delimiter: str = '/', start_after: str = '') -> Generator[str, None, None]
Efficiently list the contents of a bucket. Lifted from this highly-rated SO answer: https://stackoverflow.com/a/54014862
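A paginated listing of this kind can be sketched as a generator that walks result pages and yields one key at a time. In the sketch below, `iter_keys` and `FakePaginator` are hypothetical names; the paginator is passed in so the example runs offline. In real use it would presumably come from `boto3.client("s3").get_paginator("list_objects_v2")`, and how closely this matches the package's code is an assumption.

```python
from typing import Any, Generator


def iter_keys(paginator: Any, bucket_name: str, prefix: str = "/",
              delimiter: str = "/", start_after: str = "") -> Generator[str, None, None]:
    """Yield object keys one by one from a list_objects_v2-style paginator."""
    prefix = prefix.lstrip(delimiter)
    # If the prefix names a "folder", start listing after the folder key itself
    if prefix.endswith(delimiter):
        start_after = start_after or prefix
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix, StartAfter=start_after)
    for page in pages:
        # Pages with no matching objects have no "Contents" entry
        for item in page.get("Contents", ()):
            yield item["Key"]


class FakePaginator:
    """Stand-in that returns canned pages, so the sketch needs no network."""

    def paginate(self, **kwargs):
        yield {"Contents": [{"Key": "images/a.tif"}, {"Key": "images/b.tif"}]}
        yield {"Contents": [{"Key": "images/c.tif"}]}
        yield {}  # a page with no matching keys


keys = list(iter_keys(FakePaginator(), "example-bucket", prefix="images/"))
```

Because the function is a generator, callers can stop early without fetching the remaining pages.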
cyto_ml.data.vectorstore module
- class cyto_ml.data.vectorstore.ChromadbStore(db_name: str)
Bases: VectorStore
- client = <chromadb.api.client.Client object>
- class cyto_ml.data.vectorstore.PostgresStore(db_name: str)
Bases: VectorStore
- class cyto_ml.data.vectorstore.SQLiteVecStore(db_name: str, embedding_len: int | None = 512, check_same_thread: bool = True)
Bases: VectorStore
- add(url: str, embeddings: List[float], classification: str | None = '') -> None
Add image embeddings to storage. Two tables:
* one regular one which holds metadata, with embeddings as floats
* one “virtual table” for indexing it by ID with encoded embeddings
- cyto_ml.data.vectorstore.deserialize(packed: bytes) -> List[float]
Inverse of serialize_f32: unpacks the raw bytes back into a list of floats (e.g. for clustering).
- cyto_ml.data.vectorstore.serialize_f32(vector: List[float]) -> bytes
Serializes a list of floats into a compact “raw bytes” format. See https://github.com/asg017/sqlite-vec/blob/main/examples/simple-python/demo.py
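The round trip between a list of floats and raw bytes can be sketched with the standard `struct` module. The names `pack_f32` and `unpack_f32` are hypothetical, and the assumption here is that each value is stored as a 4-byte float32, which is what the "f32" in the function name suggests.

```python
import struct
from typing import List


def pack_f32(vector: List[float]) -> bytes:
    """Pack floats into raw bytes, 4 bytes (one float32) per value."""
    return struct.pack(f"{len(vector)}f", *vector)


def unpack_f32(packed: bytes) -> List[float]:
    """Recover the list of floats from bytes produced by pack_f32."""
    count = len(packed) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"{count}f", packed))


vector = [0.5, -1.25, 3.0]
packed = pack_f32(vector)
restored = unpack_f32(packed)
```

Values that are not exactly representable as float32 come back slightly rounded, so compare with a tolerance rather than for equality in general.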
- cyto_ml.data.vectorstore.vector_store(store_type: str | None = 'chromadb', db_name: str | None = 'test_collection', **kwargs) -> VectorStore