Extracting evidence-linked abundance drivers from species-account text

Extracting evidence-linked abundance drivers from species-account text#

Plant Atlas 2020 | Motivating paper | Ongoing Development

Primary Contact: Lily Gouldsbrough, lilgou@ceh.ac.uk | Notebook Top-to-Bottom Runtime < 10 Minutes

Challenge:
Species accounts contain useful ecological knowledge about habitat, management, biogeography and likely reasons for change. However, that knowledge is usually written as prose, which makes it difficult to reuse directly in structured workflows such as causal diagrams or expert-review tables.

Approach:
This notebook demonstrates a constrained workflow for using a small local instruction-tuned language model to extract candidate ecological driver relationships from species-account text. Each sentence is processed with a short context window, and the model returns a small JSON object containing a driver phrase, driver type, target, effect direction, confidence score and evidence quote. Standard Python code then validates, groups and aggregates these rows into evidence-linked candidate edges for expert review.

Important:
The output is not an automatically generated causal model. The model proposes evidence-linked rows; the analyst decides whether the evidence is useful, whether the direction is plausible and whether the edge belongs in a causal diagram.

Introduction#

Biodiversity records are rarely collected from a perfectly representative sample of places. Some sites are visited more often because they are accessible, well known, close to recorders, or more likely to contain species of interest. These patterns matter when inferring changes in species’ distributions or abundance from opportunistic records.

Causal diagrams are one way to reason about this problem. They make explicit which processes are thought to affect a species, which processes affect where records are collected, and where those processes overlap. Boyd et al. (2025) discuss this in the context of using causal diagrams and superpopulation models to correct geographic biases in biodiversity monitoring data.

In practice, much of the ecological knowledge needed to draft these diagrams is held as prose. Plant Atlas-style species accounts describe habitat, management, broad biogeography, recent trends and likely reasons for change, but those descriptions are not directly usable as candidate graph edges.

This notebook asks a narrow practical question: can a small local LLM help convert species-account prose into an evidence-linked table of candidate abundance drivers for expert review?

The included demo data are synthetic Plant Atlas-style examples. They mimic the structure of local account extracts, but they are not copied Plant Atlas prose and should not be quoted as Plant Atlas content.

Running the notebook#

Clone the accompanying notebook repository, create the conda environment, and point MODEL_PATH to a locally available Hugging Face-compatible instruction model.

conda env create -f environment.yml
conda activate plant-atlas-llm-dag

A typical local-model setup is:

export MODEL_PATH=../models/qwen2.5-3b-instruct
export LLM_LOAD_IN_4BIT=1
export LLM_4BIT_QUANT_TYPE=nf4
export LLM_4BIT_USE_DOUBLE_QUANT=1
export LLM_4BIT_COMPUTE_DTYPE=float16
export TRANSFORMERS_OFFLINE=1
export HF_HUB_OFFLINE=1

# Recommended first run.
export LLM_MAX_SENTENCES=10
export LLM_BATCH_SIZE=1
export LLM_MAX_NEW_TOKENS=128

If the smoke test works, unset LLM_MAX_SENTENCES or increase it gradually. The notebook intentionally does not download model weights during execution, so text, prompts and model outputs remain inside the local analysis environment.

General use#

The example is framed around Plant Atlas-style accounts, but the pattern is more general. It may also be useful for habitat management plans, protected-area reports, species recovery documents, invasive-species risk assessments or literature-screening outputs.

The key constraint is that the model task remains narrow and reviewable. The LLM handles sentence-level JSON extraction; ecological judgement remains with the analyst.

Why a small local LLM?#

A small local model is used because the task has been deliberately narrowed. The model sees one sentence at a time and returns a compact schema: driver phrase, driver type, target, effect direction, confidence and evidence quote.

This is also a proportionate design choice. LLM workflows are not automatically environmentally benign, because training and inference have compute, energy and hardware costs. The aim here is to use a model that is sufficient for a constrained sentence-level extraction step, rather than starting with a much larger hosted model.

Importing required libraries#

The workflow uses standard scientific Python packages plus a required local LLM route through Hugging Face transformers. If a local model is not available, the notebook fails early with a setup error rather than silently using a non-LLM substitute.

Warning and logging filters are applied so that the rendered notebook focuses on the workflow and results. Sentence-level parsing status is still written to an audit CSV for reproducibility and troubleshooting.

from __future__ import annotations

import contextlib
import datetime as dt
import hashlib
import io
import json
import logging
import math
import os
import re
import unicodedata
import warnings
import zipfile
from collections import Counter
from pathlib import Path
from typing import Any, Iterable

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd

try:
    from IPython.display import display
except Exception:  # pragma: no cover
    display = print

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

plt.rcParams["figure.figsize"] = (9, 5)
plt.rcParams["axes.spines.top"] = False
plt.rcParams["axes.spines.right"] = False

_NOTEBOOK_START_TIME = dt.datetime.now()
print("Notebook started:", _NOTEBOOK_START_TIME.isoformat(timespec="seconds"))
# Keep the rendered notebook focused on the workflow rather than library deprecation or generation-configuration noise.
warnings.filterwarnings("ignore")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("TRANSFORMERS_NO_ADVISORY_WARNINGS", "1")

for logger_name in ["transformers", "transformers.generation", "bitsandbytes", "accelerate"]:
    logging.getLogger(logger_name).setLevel(logging.ERROR)

try:
    from transformers.utils import logging as transformers_logging

    transformers_logging.set_verbosity_error()
except Exception:
    pass

Notebook started: 2026-04-28T23:04:34

Configuration#

This cell collects the main run settings. For the built-in synthetic example, leave USE_DEMO_DATA = True. For a real analysis, set USE_DEMO_DATA = False and either place local CSV/ZIP files in PLANT_ATLAS_DATA_DIR or add explicit paths to INPUT_PATHS.

The default MODEL_PATH is the path used during development. On another machine, set MODEL_PATH before starting Jupyter so that it points to a local Hugging Face-compatible instruction model directory.

# ---------------------------------------------------------------------
# Input/output configuration
# ---------------------------------------------------------------------

USE_DEMO_DATA = True

DATA_DIR = Path(os.environ.get("PLANT_ATLAS_DATA_DIR", "data"))
INPUT_PATHS: list[Path] = []

OUTPUT_DIR = Path(os.environ.get("PLANT_DAG_OUTPUT_DIR", "outputs_plant_atlas_llm_to_dag"))
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Batch controls for real datasets.
START_INDEX = 0
END_INDEX = None  # e.g. 10 for a small test run

# Text processing.
CONTEXT_WINDOW_SENTENCES = 1
MAX_SENTENCE_CHARS = 900

# Required local-LLM-assisted extraction.
#
# There is no non-LLM route for the main extraction step. The model must be
# available locally as a Hugging Face Transformers-compatible directory.
# The notebook intentionally does not download weights during execution.
#
# Default used during development:
#   ../models/qwen2.5-3b-instruct
#
# For other machines, set:
#   export MODEL_PATH=/path/to/local/instruction-model
MODEL_PATH = Path(os.environ.get("MODEL_PATH", "../models/qwen2.5-3b-instruct")).expanduser()
LLM_MODEL = str(MODEL_PATH)
LLM_LOCAL_FILES_ONLY = True

# Generation controls. Smaller values make the notebook much faster and are
# usually sufficient because the task asks for a small JSON object per sentence.
LLM_TEMPERATURE = float(os.environ.get("LLM_TEMPERATURE", "0"))
LLM_MAX_NEW_TOKENS = int(os.environ.get("LLM_MAX_NEW_TOKENS", "192"))
LLM_MAX_INPUT_TOKENS = int(os.environ.get("LLM_MAX_INPUT_TOKENS", "1536"))
LLM_BATCH_SIZE = int(os.environ.get("LLM_BATCH_SIZE", "1"))
LLM_DEVICE_MAP = os.environ.get("LLM_DEVICE_MAP", "auto")
# 4-bit loading is the default for this notebook. It is the intended route
# for the Qwen2.5-3B-Instruct local model on 6 GB VRAM hardware. Set
# LLM_LOAD_IN_4BIT=0 only if you explicitly want full/bfloat/float16 loading
# and know the model will fit comfortably on your machine.
LLM_TORCH_DTYPE = os.environ.get("LLM_TORCH_DTYPE", "auto")
LLM_LOAD_IN_4BIT = os.environ.get("LLM_LOAD_IN_4BIT", "1") != "0"
LLM_4BIT_QUANT_TYPE = os.environ.get("LLM_4BIT_QUANT_TYPE", "nf4")
LLM_4BIT_USE_DOUBLE_QUANT = os.environ.get("LLM_4BIT_USE_DOUBLE_QUANT", "1") != "0"
LLM_4BIT_COMPUTE_DTYPE = os.environ.get("LLM_4BIT_COMPUTE_DTYPE", "float16")

# Keep detailed sentence-level parsing diagnostics in CSV outputs, but do not
# print them in the rendered notebook unless explicitly requested.
SHOW_EXTRACTION_DIAGNOSTICS = os.environ.get("SHOW_EXTRACTION_DIAGNOSTICS", "0") == "1"

# For quick debugging, set for example:
#   export LLM_MAX_SENTENCES=10
# Leave unset to process every prepared sentence.
_llm_max_sentences = os.environ.get("LLM_MAX_SENTENCES")
LLM_MAX_SENTENCES = int(_llm_max_sentences) if _llm_max_sentences else None

# Semantic grouping threshold for driver phrases. Higher = stricter grouping.
PHRASE_SIMILARITY_THRESHOLD = 0.55

RUN_CONFIG = {
    "use_demo_data": USE_DEMO_DATA,
    "data_dir": str(DATA_DIR),
    "input_paths": [str(p) for p in INPUT_PATHS],
    "output_dir": str(OUTPUT_DIR),
    "context_window_sentences": CONTEXT_WINDOW_SENTENCES,
    "max_sentence_chars": MAX_SENTENCE_CHARS,
    "model_path": str(MODEL_PATH),
    "llm_local_files_only": LLM_LOCAL_FILES_ONLY,
    "llm_temperature": LLM_TEMPERATURE,
    "llm_max_new_tokens": LLM_MAX_NEW_TOKENS,
    "llm_max_input_tokens": LLM_MAX_INPUT_TOKENS,
    "llm_batch_size": LLM_BATCH_SIZE,
    "llm_device_map": LLM_DEVICE_MAP,
    "llm_torch_dtype": LLM_TORCH_DTYPE,
    "llm_load_in_4bit": LLM_LOAD_IN_4BIT,
    "llm_4bit_quant_type": LLM_4BIT_QUANT_TYPE,
    "llm_4bit_use_double_quant": LLM_4BIT_USE_DOUBLE_QUANT,
    "llm_4bit_compute_dtype": LLM_4BIT_COMPUTE_DTYPE,
    "show_extraction_diagnostics": SHOW_EXTRACTION_DIAGNOSTICS,
    "llm_max_sentences": LLM_MAX_SENTENCES,
    "phrase_similarity_threshold": PHRASE_SIMILARITY_THRESHOLD,
}

RUN_CONFIG_HASH = hashlib.sha256(
    json.dumps(RUN_CONFIG, sort_keys=True, default=str).encode("utf-8")
).hexdigest()[:12]

print("Run configuration hash:", RUN_CONFIG_HASH)

RUN_CONFIG

Run configuration hash: ead3a98f083d

{'use_demo_data': True,
 'data_dir': 'data',
 'input_paths': [],
 'output_dir': 'outputs_plant_atlas_llm_to_dag',
 'context_window_sentences': 1,
 'max_sentence_chars': 900,
 'model_path': '../models/qwen2.5-3b-instruct',
 'llm_local_files_only': True,
 'llm_temperature': 0.0,
 'llm_max_new_tokens': 192,
 'llm_max_input_tokens': 1536,
 'llm_batch_size': 1,
 'llm_device_map': 'auto',
 'llm_torch_dtype': 'auto',
 'llm_load_in_4bit': True,
 'llm_4bit_quant_type': 'nf4',
 'llm_4bit_use_double_quant': True,
 'llm_4bit_compute_dtype': 'float16',
 'show_extraction_diagnostics': False,
 'llm_max_sentences': None,
 'phrase_similarity_threshold': 0.55}

Workflow overview#

The local LLM is used only where language understanding is useful: converting short prose into a structured record. Everything around that step is explicit Python, which keeps the route from source sentence to proposed graph edge inspectable.

The broad workflow is:

workflow_steps = [
    "Plant Atlas\nspecies account text",
    "Sentence splitting\n+ local context",
    "LLM prompt\n+ JSON schema",
    "Schema and quote\nvalidation",
    "Phrase grouping\n+ edge aggregation",
    "Candidate DAGs\n+ expert review tables",
]

fig, ax = plt.subplots(figsize=(12, 2.4))
ax.axis("off")

x_positions = np.linspace(0.06, 0.94, len(workflow_steps))
for i, (x, label) in enumerate(zip(x_positions, workflow_steps)):
    ax.text(
        x, 0.5, label,
        ha="center", va="center",
        bbox={"boxstyle": "round,pad=0.45", "fc": "white", "ec": "black", "lw": 1.2},
        transform=ax.transAxes,
    )
    if i < len(workflow_steps) - 1:
        ax.annotate(
            "", xy=(x_positions[i + 1] - 0.075, 0.5), xytext=(x + 0.075, 0.5),
            arrowprops={"arrowstyle": "->", "lw": 1.2},
            xycoords=ax.transAxes, textcoords=ax.transAxes,
        )

ax.set_title("LLM-assisted text-to-DAG workflow")
plt.tight_layout()
plt.show()

../../../_images/326bee7613fcb124adc0b986d99c8bd45006f889969d46ffd13458c6bdc4c6df.png

Create or load species-account text#

The built-in rows below are synthetic. They mimic the structure of local Plant Atlas-style CSV extracts, including fields such as canonical, atlasSpeciesDescription, atlasSpeciesBiogeography and atlasSpeciesTrends. They are not copied Plant Atlas account text and should not be quoted as Plant Atlas content.

For a real analysis, set USE_DEMO_DATA = False and point PLANT_ATLAS_DATA_DIR at local CSV or ZIP files containing the account text to be analysed.

DEMO_SPECIES = [
    {
        "canonical": "Kickxia spuria",
        "vernacular": "Round-leaved Fluellen",
        "atlasSpeciesDescription": (
            "A small annual of arable fields, open disturbed ground and sunny field edges, "
            "most often on light calcareous soils. It can persist where the crop is open "
            "and where cultivation creates bare ground for germination. Dense competitive "
            "vegetation and intensive herbicide use reduce habitat suitability."
        ),
        "atlasSpeciesBiogeography": (
            "A southern-temperate archaeophyte in Britain and the Channel Islands."
        ),
        "atlasSpeciesTrends": (
            "Local populations are most likely to be maintained where arable margins are managed "
            "for scarce plants. Late-summer germination may help some plants escape early-season "
            "herbicide applications, but loss of open cultivated margins remains a pressure."
        ),
    },
    {
        "canonical": "Koenigia islandica",
        "vernacular": "Iceland-purslane",
        "atlasSpeciesDescription": (
            "A tiny annual of wet gravelly flushes, basaltic seepages and open stony ground in "
            "cool upland situations. It depends on short, open vegetation and reliable moisture. "
            "Drying of flushes, trampling that removes the fine substrate, and heavy competition "
            "from taller vegetation can reduce population performance."
        ),
        "atlasSpeciesBiogeography": (
            "A boreal-montane species with a very restricted British range."
        ),
        "atlasSpeciesTrends": (
            "Small colonies may fluctuate between years. Continued seepage, light disturbance and "
            "limited competition are likely to maintain habitat suitability, whereas warmer drier "
            "summers could increase stress on the shallow wet substrates used by the species."
        ),
    },
    {
        "canonical": "Lagurus ovatus",
        "vernacular": "Hare's-tail Grass",
        "atlasSpeciesDescription": (
            "An annual grass of sandy coastal ground, open dunes, walls, roadsides and other dry "
            "disturbed places. It is favoured by bare sand and warm open microsites, and can also "
            "appear from garden escapes. Closed dune grassland and lack of disturbance reduce the "
            "open ground needed for recruitment."
        ),
        "atlasSpeciesBiogeography": (
            "A Mediterranean-Atlantic neophyte that is naturalised in several coastal areas."
        ),
        "atlasSpeciesTrends": (
            "The species appears to be increasing in some sandy coastal habitats. Continued seed "
            "supply from gardens and disturbed urban ground may support local spread, while severe "
            "winter cold may limit establishment inland."
        ),
    },
    {
        "canonical": "Lamium confertum",
        "vernacular": "Northern Dead-nettle",
        "atlasSpeciesDescription": (
            "An annual of cultivated land, disturbed field corners and waste ground. It requires "
            "open soil for germination and is easily overlooked among other dead-nettles. Modern "
            "weed control, autumn-sown dense crops and reduced small-scale cultivation can lower "
            "the chance of plants reaching maturity."
        ),
        "atlasSpeciesBiogeography": (
            "A northern archaeophyte with scattered and often casual records."
        ),
        "atlasSpeciesTrends": (
            "Declines are consistent with the loss of lightly managed cultivated ground. Recent "
            "records may partly reflect targeted recording, so under-recording should be considered "
            "before treating absence as population loss."
        ),
    },
    {
        "canonical": "Larix decidua",
        "vernacular": "European Larch",
        "atlasSpeciesDescription": (
            "A deciduous conifer planted in forestry, shelter-belts and parkland. It regenerates "
            "from seed on disturbed mineral soil, scree and open rocky ground, especially where "
            "light reaches the ground layer. Very dry sites and waterlogged soils are less suitable "
            "for establishment."
        ),
        "atlasSpeciesBiogeography": (
            "A European mountain tree widely planted outside its native range."
        ),
        "atlasSpeciesTrends": (
            "Natural regeneration can follow soil disturbance near planted trees. Disease pressure "
            "from larch pathogens and felling of infected plantations can reduce local abundance, "
            "although continued planting creates new seed sources in some landscapes."
        ),
    },
]

TEXT_COLUMNS_CANDIDATES = [
    "atlasSpeciesDescription",
    "atlasSpeciesTrends",
    "atlasSpeciesBiogeography",
    "description",
    "trends",
    "biogeography",
    "habitat",
    "ecology",
    "management",
]

NAME_COLUMNS_CANDIDATES = [
    "canonical",
    "canonicalName",
    "scientificName",
    "taxonName",
    "name",
    "species",
]


def clean_text(value: Any) -> str:
    """Convert simple HTML-ish text to readable plain text."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    text = str(value)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"&nbsp;", " ", text)
    text = re.sub(r"&amp;", "&", text)
    text = re.sub(r"&#x2060;", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def find_first_existing(columns: Iterable[str], candidates: list[str]) -> str | None:
    lowered = {str(c).lower(): str(c) for c in columns}
    for candidate in candidates:
        if candidate.lower() in lowered:
            return lowered[candidate.lower()]
    return None


def read_csv_or_zip(path: Path) -> list[tuple[str, pd.DataFrame]]:
    """Read a CSV or every CSV inside a ZIP file."""
    path = Path(path)
    if path.suffix.lower() == ".csv":
        return [(path.name, pd.read_csv(path))]

    if path.suffix.lower() == ".zip":
        out = []
        with zipfile.ZipFile(path, "r") as zf:
            for name in sorted(zf.namelist()):
                if name.lower().endswith(".csv"):
                    with zf.open(name) as f:
                        out.append((f"{path.name}::{name}", pd.read_csv(f)))
        return out

    raise ValueError(f"Unsupported input path: {path}")


def discover_input_paths(data_dir: Path, input_paths: list[Path]) -> list[Path]:
    if input_paths:
        return [Path(p).expanduser().resolve() for p in input_paths]

    data_dir = Path(data_dir).expanduser().resolve()
    if not data_dir.exists():
        return []

    return sorted(list(data_dir.glob("*.csv")) + list(data_dir.glob("*.zip")))


def build_species_text_table() -> pd.DataFrame:
    if USE_DEMO_DATA:
        df = pd.DataFrame(DEMO_SPECIES)
        df["species_id"] = [f"demo_{i:03d}" for i in range(1, len(df) + 1)]
        df["source_file"] = "synthetic_plant_atlas_style_accounts"
        text_columns = [c for c in TEXT_COLUMNS_CANDIDATES if c in df.columns]
        df["source_text"] = df[text_columns].fillna("").agg(" ".join, axis=1).map(clean_text)
        return df[["species_id", "canonical", "source_file", "source_text"]]

    paths = discover_input_paths(DATA_DIR, INPUT_PATHS)
    if not paths:
        raise FileNotFoundError(
            "No input files found. Either set USE_DEMO_DATA=True or place CSV/ZIP files "
            "in PLANT_ATLAS_DATA_DIR."
        )

    rows = []
    for path in paths:
        for source_name, df in read_csv_or_zip(path):
            name_col = find_first_existing(df.columns, NAME_COLUMNS_CANDIDATES)
            text_cols = [c for c in TEXT_COLUMNS_CANDIDATES if c in df.columns]
            if not name_col or not text_cols:
                print(f"Skipping {source_name}: could not identify name/text columns")
                continue

            for idx, row in df.iterrows():
                canonical = clean_text(row.get(name_col, ""))
                parts = [clean_text(row.get(c, "")) for c in text_cols]
                source_text = " ".join(p for p in parts if p)
                if canonical and source_text:
                    rows.append(
                        {
                            "species_id": str(row.get("ddbid", f"{source_name}_{idx}")),
                            "canonical": canonical,
                            "source_file": source_name,
                            "source_text": source_text,
                        }
                    )

    if not rows:
        raise ValueError("No usable species account rows found in the configured input files.")

    out = pd.DataFrame(rows)
    if END_INDEX is not None:
        out = out.iloc[START_INDEX:END_INDEX]
    else:
        out = out.iloc[START_INDEX:]
    return out.reset_index(drop=True)


species_text = build_species_text_table()
species_text.to_csv(OUTPUT_DIR / "species_text_table.csv", index=False)

display(species_text)
print(f"Loaded {len(species_text)} species accounts")

	species_id	canonical	source_file	source_text
0	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	A small annual of arable fields, open disturbe...
1	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	A tiny annual of wet gravelly flushes, basalti...
2	demo_003	Lagurus ovatus	synthetic_plant_atlas_style_accounts	An annual grass of sandy coastal ground, open ...
3	demo_004	Lamium confertum	synthetic_plant_atlas_style_accounts	An annual of cultivated land, disturbed field ...
4	demo_005	Larix decidua	synthetic_plant_atlas_style_accounts	A deciduous conifer planted in forestry, shelt...

Loaded 5 species accounts

Data exploration#

Before running the LLM, check that the input text looks sensible: how many species are present, how many sentences have been created, and which broad terms appear most often. These simple checks catch common ingestion problems such as selecting the wrong text column or duplicating accounts.

def sentence_split(text: str) -> list[str]:
    text = re.sub(r"\s+", " ", str(text)).strip()
    if not text:
        return []
    # Lightweight splitter suitable for short account prose.
    parts = re.split(r"(?<=[.!?])\s+(?=\[?[A-Z0-9])", text)
    return [p.strip() for p in parts if p.strip()]


sentence_rows = []
for _, row in species_text.iterrows():
    sentences = sentence_split(row["source_text"])
    for sentence_id, sentence in enumerate(sentences):
        prev_start = max(0, sentence_id - CONTEXT_WINDOW_SENTENCES)
        next_end = min(len(sentences), sentence_id + CONTEXT_WINDOW_SENTENCES + 1)

        context_parts = []
        for j in range(prev_start, next_end):
            label = "CURRENT" if j == sentence_id else f"context_{j}"
            context_parts.append(f"[{label}] {sentences[j]}")
        context_text = " ".join(context_parts)

        if len(sentence) > MAX_SENTENCE_CHARS:
            sentence = sentence[:MAX_SENTENCE_CHARS].rsplit(" ", 1)[0] + " ..."

        sentence_rows.append(
            {
                "species_id": row["species_id"],
                "canonical": row["canonical"],
                "source_file": row["source_file"],
                "sentence_id": sentence_id,
                "sentence": sentence,
                "context_text": context_text,
            }
        )

sentences_df = pd.DataFrame(sentence_rows)
sentences_df.to_csv(OUTPUT_DIR / "all_sentences.csv", index=False)

print("Sentence rows:", len(sentences_df))
display(sentences_df.head(10))

Sentence rows: 30

	species_id	canonical	source_file	sentence_id	sentence	context_text
0	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	0	A small annual of arable fields, open disturbe...	[CURRENT] A small annual of arable fields, ope...
1	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	1	It can persist where the crop is open and wher...	[context_0] A small annual of arable fields, o...
2	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	2	Dense competitive vegetation and intensive her...	[context_1] It can persist where the crop is o...
3	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	3	Local populations are most likely to be mainta...	[context_2] Dense competitive vegetation and i...
4	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	4	Late-summer germination may help some plants e...	[context_3] Local populations are most likely ...
5	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	5	A southern-temperate archaeophyte in Britain a...	[context_4] Late-summer germination may help s...
6	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	0	A tiny annual of wet gravelly flushes, basalti...	[CURRENT] A tiny annual of wet gravelly flushe...
7	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	1	It depends on short, open vegetation and relia...	[context_0] A tiny annual of wet gravelly flus...
8	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	2	Drying of flushes, trampling that removes the ...	[context_1] It depends on short, open vegetati...
9	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	3	Small colonies may fluctuate between years.	[context_2] Drying of flushes, trampling that ...

sentence_counts = sentences_df.groupby("canonical").size().sort_values(ascending=True)

ax = sentence_counts.plot(kind="barh")
# set title based on if using dummy data or not
if USE_DEMO_DATA == True:
    ax.set_title("Sentence count by species account - all the same as using demo data")
else:
    ax.set_title("Sentence count by species account")
ax.set_xlabel("Number of sentences")
ax.set_ylabel("")
plt.tight_layout()
plt.show()

../../../_images/4ebc6711b42d2609307268058bba98fc7540b0ba2c1b634d55d0086892d324e7.png

STOPWORDS = {
    "and", "the", "with", "where", "from", "that", "this", "have", "has", "but", "are", "for",
    "can", "may", "into", "its", "their", "species", "plant", "plants", "populations", "local",
    "account", "description", "trend", "trends", "atlasspeciesdescription", "atlasspeciestrends",
}

tokens = []
for text in species_text["source_text"]:
    cleaned = re.sub(r"[^A-Za-z\- ]+", " ", text.lower())
    for token in cleaned.split():
        token = token.strip("-")
        if len(token) > 3 and token not in STOPWORDS:
            tokens.append(token)

common_terms = pd.DataFrame(Counter(tokens).most_common(20), columns=["term", "count"])
display(common_terms)

ax = common_terms.sort_values("count").plot(kind="barh", x="term", y="count", legend=False)
ax.set_title("Most frequent non-stopword terms")
ax.set_xlabel("Count")
ax.set_ylabel("")
plt.tight_layout()
plt.show()

	term	count
0	open	10
1	ground	10
2	disturbed	5
3	annual	4
4	reduce	4
5	light	3
6	germination	3
7	vegetation	3
8	some	3
9	loss	3
10	cultivated	3
11	continued	3
12	disturbance	3
13	coastal	3
14	seed	3
15	soil	3
16	planted	3
17	small	2
18	arable	2
19	field	2

../../../_images/886b0228204f3301779402645ff9f22556c547f1359a7aad5b42a8ffbaa9ed7c.png

Rule-based sanity check#

The next cell is a simple rule-based sanity check. It is not the main method and it is not used as a substitute for the LLM extraction. It provides a quick comparator: if a sentence contains an obvious phrase such as drainage or grazing, the keyword pass should usually find it.

This is useful when developing prompts because it separates two problems: whether the input text contains relevant ecological phrases at all, and whether the local model extracts them in the desired schema.

SOURCE_TYPE_KEYWORDS = {
    "climate": [
        "drought", "summer drought", "temperature", "warming", "frost", "exposure", "wind exposure", "salt spray"
    ],
    "hydrology": [
        "drainage", "water table", "water tables", "wet", "mire", "bog", "seepage", "river", "flow", "impounded",
        "hydrological"
    ],
    "soil_substrate": [
        "calcareous", "acidic", "base-rich", "peat", "gravel", "shallow soils", "metal-rich", "nutrient-poor"
    ],
    "habitat_structure": [
        "woodland", "hedgebanks", "grassland", "heath", "saltmarsh", "cliffs", "short turf", "open swards",
        "scrub encroachment", "dense secondary canopy", "rank vegetation"
    ],
    "management": [
        "grazing", "coppice", "traditional grazing", "autumn grazing", "management", "abandonment"
    ],
    "land_use": [
        "arable", "agricultural improvement", "ploughing", "reseeding", "coastal development", "roadside"
    ],
    "pollution": [
        "fertiliser", "fertiliser inputs", "nutrient enrichment", "nitrogen deposition", "turbidity", "fertilised"
    ],
    "disturbance": [
        "trampling", "peat cutting", "channel modification", "engineered", "conversion"
    ],
}

POSITIVE_CUES = [
    "favoured by", "favored by", "associated with", "stable in", "increase after", "tolerates",
    "maintain", "maintains", "persist", "persists"
]
NEGATIVE_CUES = [
    "declines", "decline", "reduced", "reduce", "losses", "threatened", "vulnerable",
    "sensitive to", "damage", "displaced", "abandonment"
]

TARGET_RULES = [
    # Prefer a direct abundance target where the sentence is about change, persistence,
    # losses, rarity, maintenance or decline. Habitat and population-performance targets
    # are kept for sentences that are genuinely about mechanism or condition rather than
    # abundance itself.
    ("abundance", [
        "abundance", "decline", "declines", "increase", "increases", "losses", "lost",
        "reduced", "maintained", "maintain", "maintains", "persist", "persists",
        "local populations", "colonies", "rare", "scarce", "common"
    ]),
    ("habitat suitability", [
        "habitat", "sward", "water table", "substrate", "canopy", "light", "turf",
        "germination", "suitability"
    ]),
    ("population performance", ["flowering", "recruitment", "seed", "cover"]),
]


def normalise_node_label(text: str) -> str:
    text = unicodedata.normalize("NFKD", str(text)).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s\-]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


DRIVER_LABEL_REPLACEMENTS = {
    "bare ground for germination": "bare_ground",
    "bare ground": "bare_ground",
    "open ground": "open_ground",
    "disturbed ground": "disturbed_ground",
    "open disturbed ground": "disturbed_ground",
    "light calcareous soils": "calcareous_soil",
    "calcareous soils": "calcareous_soil",
    "dense competitive vegetation": "dense_vegetation",
    "competitive vegetation": "dense_vegetation",
    "intensive herbicide use": "herbicide_use",
    "herbicide applications": "herbicide_use",
    "arable margins": "arable_margins",
    "open cultivated margins": "arable_margins",
    "loss of open cultivated margins": "margin_loss",
    "short open swards": "open_sward",
    "short turf": "short_turf",
    "autumn grazing": "autumn_grazing",
    "traditional grazing": "traditional_grazing",
    "scrub encroachment": "scrub",
    "dense secondary canopy": "dense_canopy",
    "water table": "water_table",
    "water tables": "water_table",
    "base-rich seepage": "base_rich_seepage",
    "nitrogen deposition": "nitrogen_dep",
    "nutrient enrichment": "nutrient_enrichment",
    "fertiliser inputs": "fertiliser",
    "summer drought": "summer_drought",
    "salt spray": "salt_spray",
    "coastal development": "coastal_development",
    "agricultural improvement": "agri_improvement",
}

DRIVER_LABEL_STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "for", "to", "by", "with", "where", "which",
    "can", "may", "most", "often", "more", "less", "likely", "local", "plant", "plants",
    "species", "habitat", "habitats", "intensive", "dense", "open"
}


def canonicalise_driver_label(text: str, max_tokens: int = 3) -> str:
    """Create short, graph-friendly driver labels from extracted phrases.

    The evidence table keeps the original phrase in ``source_raw``. This helper
    only shortens the graph node identifier so that the plotted DAG is readable.
    """
    label = normalise_node_label(text)
    if not label:
        return ""

    for phrase, replacement in DRIVER_LABEL_REPLACEMENTS.items():
        if phrase in label:
            return replacement

    tokens = [t for t in re.split(r"[\s\-]+", label) if t and t not in DRIVER_LABEL_STOPWORDS]
    if not tokens:
        tokens = [t for t in re.split(r"[\s\-]+", label) if t]

    # Prefer the most specific end of a noun phrase, e.g. "loss of open cultivated
    # margins" -> "cultivated_margins".
    tokens = tokens[-max_tokens:]
    return "_".join(tokens)


def normalise_target_label(text: str) -> str:
    """Normalise model targets into the small target vocabulary used by the DAG."""
    target = normalise_node_label(text)
    if target in {"abundance", "habitat suitability", "population performance"}:
        return target
    if any(token in target for token in ["decline", "increase", "abundance", "population", "populations", "colonies"]):
        return "abundance"
    if any(token in target for token in ["habitat", "suitability", "condition"]):
        return "habitat suitability"
    if any(token in target for token in ["flowering", "recruitment", "performance", "cover"]):
        return "population performance"
    return "abundance"


def infer_effect(sentence: str) -> str:
    lower = sentence.lower()
    pos = any(cue in lower for cue in POSITIVE_CUES)
    neg = any(cue in lower for cue in NEGATIVE_CUES)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    if pos and neg:
        return "mixed"
    return "unknown"


def infer_target(sentence: str) -> str:
    lower = sentence.lower()
    for target, cues in TARGET_RULES:
        if any(cue in lower for cue in cues):
            return target
    return "abundance"


def keyword_mentions(sentence: str) -> list[tuple[str, str]]:
    lower = sentence.lower()
    matches = []
    for source_type, keywords in SOURCE_TYPE_KEYWORDS.items():
        for keyword in keywords:
            if re.search(r"\b" + re.escape(keyword.lower()) + r"\b", lower):
                matches.append((source_type, keyword))
    # Prefer longer phrases first and deduplicate.
    matches = sorted(matches, key=lambda x: len(x[1]), reverse=True)
    seen = set()
    out = []
    for source_type, keyword in matches:
        key = (source_type, keyword.lower())
        if key not in seen:
            seen.add(key)
            out.append((source_type, keyword))
    return out


def deterministic_extract_edges(sentences: pd.DataFrame) -> pd.DataFrame:
    rows = []

    for _, row in sentences.iterrows():
        sentence = row["sentence"]
        matches = keyword_mentions(sentence)
        if not matches:
            continue

        effect = infer_effect(sentence)
        target = infer_target(sentence)

        for source_type, keyword in matches:
            source_raw = keyword
            source_canonical = normalise_node_label(source_raw)

            confidence = 0.55
            if effect != "unknown":
                confidence += 0.15
            if len(keyword.split()) > 1:
                confidence += 0.05

            rows.append(
                {
                    "species_id": row["species_id"],
                    "canonical": row["canonical"],
                    "source_file": row["source_file"],
                    "sentence_id": int(row["sentence_id"]),
                    "source_raw": source_raw,
                    "source_canonical": canonicalise_driver_label(source_raw),
                    "source_type": source_type,
                    "target_canonical": target,
                    "effect": effect,
                    "confidence": round(min(confidence, 0.90), 2),
                    "evidence_quote": sentence,
                    "extraction_method": "deterministic_keyword_demo",
                }
            )

    return pd.DataFrame(rows)


baseline_edges_df = deterministic_extract_edges(sentences_df)
baseline_edges_df.to_csv(OUTPUT_DIR / "baseline_deterministic_candidate_edges.csv", index=False)

print("Baseline deterministic candidate edges:", len(baseline_edges_df))
display(baseline_edges_df.head(15))

Baseline deterministic candidate edges: 8

	species_id	canonical	source_file	sentence_id	source_raw	source_canonical	source_type	target_canonical	effect	confidence	evidence_quote	extraction_method
0	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	0	calcareous	calcareous	soil_substrate	habitat suitability	unknown	0.55	A small annual of arable fields, open disturbe...	deterministic_keyword_demo
1	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	0	arable	arable	land_use	habitat suitability	unknown	0.55	A small annual of arable fields, open disturbe...	deterministic_keyword_demo
2	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	3	arable	arable	land_use	abundance	positive	0.70	Local populations are most likely to be mainta...	deterministic_keyword_demo
3	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	0	wet	wet	hydrology	abundance	unknown	0.55	A tiny annual of wet gravelly flushes, basalti...	deterministic_keyword_demo
4	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	2	trampling	trampling	disturbance	habitat suitability	negative	0.70	Drying of flushes, trampling that removes the ...	deterministic_keyword_demo
5	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	4	seepage	seepage	hydrology	abundance	positive	0.70	Continued seepage, light disturbance and limit...	deterministic_keyword_demo
6	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	4	wet	wet	hydrology	abundance	positive	0.70	Continued seepage, light disturbance and limit...	deterministic_keyword_demo
7	demo_003	Lagurus ovatus	synthetic_plant_atlas_style_accounts	2	grassland	grassland	habitat_structure	population performance	negative	0.70	Closed dune grassland and lack of disturbance ...	deterministic_keyword_demo

Local LLM evidence extraction#

This is the main extraction step. Each sentence is sent to the local model with a short neighbouring context window and a constrained JSON schema. The model is asked for candidate driver relationships only when the sentence contains evidence for a factor that may affect habitat suitability, population performance or abundance.

The prompt is deliberately narrow. The model is not asked to write a causal diagram or summarise a species account. It only proposes sentence-grounded candidate rows that can be validated and reviewed later.

LLM_SCHEMA = {
    "edges": [
        {
            "source_raw": "exact phrase copied from the current sentence",
            "source_type": (
                "climate|hydrology|soil_substrate|habitat_structure|disturbance|"
                "management|land_use|biotic_interaction|disease_pathogen|pollution|"
                "topography|other"
            ),
            "target_canonical": "abundance|habitat suitability|population performance",
            "effect": "positive|negative|mixed|unknown",
            "confidence": 0.70,
            "evidence_quote": "short quote copied from the current sentence",
        }
    ]
}


def make_llm_prompt(canonical: str, sentence_id: int, sentence: str, context_text: str) -> str:
    """Create a narrow, evidence-grounded prompt for one sentence."""
    return f"""
You are helping ecologists convert Plant Atlas species-account prose into
reviewable candidate causal-graph edges.

Species: {canonical}
Sentence ID: {sentence_id}

CURRENT SENTENCE:
{sentence}

NEIGHBOURING CONTEXT FOR DISAMBIGUATION ONLY:
{context_text}

Task:
Extract candidate ecological driver relationships from the CURRENT SENTENCE.
Use only evidence stated in the CURRENT SENTENCE. Do not infer from general
knowledge. Ignore dates, pure locations, taxonomic notes, recording history and
introduction-history statements unless they are directly linked to abundance,
habitat suitability or population performance.

Return one JSON object and nothing else. No markdown. No explanation. No trailing
text. Keep source_raw labels short so that they can be read on a graph. If the current sentence contains no relevant relationship, return exactly:
{{"edges": []}}

Definitions:
- source_raw: an exact, short driver phrase copied from the CURRENT SENTENCE. Prefer compact noun phrases of one to four words, for example "grazing", "bare ground", "calcareous soils" or "herbicide use". Do not use a whole clause as the source.
- source_type: one of climate, hydrology, soil_substrate, habitat_structure,
  disturbance, management, land_use, biotic_interaction, disease_pathogen,
  pollution, topography or other.
- target_canonical: one of abundance, habitat suitability or population performance. Use abundance when the sentence says the driver maintains, increases, reduces, threatens, explains losses of, or is associated with local populations, colonies, rarity, decline, persistence or abundance. Use habitat suitability only when the sentence is mainly about habitat condition. Use population performance for flowering, recruitment, seed set or vigour.
- effect: positive means the source tends to increase or maintain the target;
  negative means it tends to reduce the target; mixed means both are stated;
  unknown means the relationship is stated but direction is unclear.
- confidence: a number from 0 to 1.
- evidence_quote: a short quote copied from the CURRENT SENTENCE.

JSON schema example:
{json.dumps(LLM_SCHEMA, indent=2)}
""".strip()


def preview_prompt(row_number: int = 0) -> None:
    """Print one prompt so reviewers can inspect exactly what the LLM sees."""
    row = sentences_df.iloc[row_number]
    print(make_llm_prompt(row["canonical"], int(row["sentence_id"]), row["sentence"], row["context_text"]))


preview_prompt(0)

You are helping ecologists convert Plant Atlas species-account prose into
reviewable candidate causal-graph edges.

Species: Kickxia spuria
Sentence ID: 0

CURRENT SENTENCE:
A small annual of arable fields, open disturbed ground and sunny field edges, most often on light calcareous soils.

NEIGHBOURING CONTEXT FOR DISAMBIGUATION ONLY:
[CURRENT] A small annual of arable fields, open disturbed ground and sunny field edges, most often on light calcareous soils. [context_1] It can persist where the crop is open and where cultivation creates bare ground for germination.

Task:
Extract candidate ecological driver relationships from the CURRENT SENTENCE.
Use only evidence stated in the CURRENT SENTENCE. Do not infer from general
knowledge. Ignore dates, pure locations, taxonomic notes, recording history and
introduction-history statements unless they are directly linked to abundance,
habitat suitability or population performance.

Return one JSON object and nothing else. No markdown. No explanation. No trailing
text. Keep source_raw labels short so that they can be read on a graph. If the current sentence contains no relevant relationship, return exactly:
{"edges": []}

Definitions:
- source_raw: an exact, short driver phrase copied from the CURRENT SENTENCE. Prefer compact noun phrases of one to four words, for example "grazing", "bare ground", "calcareous soils" or "herbicide use". Do not use a whole clause as the source.
- source_type: one of climate, hydrology, soil_substrate, habitat_structure,
  disturbance, management, land_use, biotic_interaction, disease_pathogen,
  pollution, topography or other.
- target_canonical: one of abundance, habitat suitability or population performance. Use abundance when the sentence says the driver maintains, increases, reduces, threatens, explains losses of, or is associated with local populations, colonies, rarity, decline, persistence or abundance. Use habitat suitability only when the sentence is mainly about habitat condition. Use population performance for flowering, recruitment, seed set or vigour.
- effect: positive means the source tends to increase or maintain the target;
  negative means it tends to reduce the target; mixed means both are stated;
  unknown means the relationship is stated but direction is unclear.
- confidence: a number from 0 to 1.
- evidence_quote: a short quote copied from the CURRENT SENTENCE.

JSON schema example:
{
  "edges": [
    {
      "source_raw": "exact phrase copied from the current sentence",
      "source_type": "climate|hydrology|soil_substrate|habitat_structure|disturbance|management|land_use|biotic_interaction|disease_pathogen|pollution|topography|other",
      "target_canonical": "abundance|habitat suitability|population performance",
      "effect": "positive|negative|mixed|unknown",
      "confidence": 0.7,
      "evidence_quote": "short quote copied from the current sentence"
    }
  ]
}

RAW_EDGE_COLUMNS = [
    "species_id",
    "canonical",
    "source_file",
    "sentence_id",
    "source_raw",
    "source_canonical",
    "source_type",
    "target_canonical",
    "effect",
    "confidence",
    "evidence_quote",
    "extraction_method",
    "llm_error",
]

LLM_AUDIT_COLUMNS = [
    "species_id",
    "canonical",
    "source_file",
    "sentence_id",
    "parse_status",
    "n_candidate_edges",
    "response_hash",
    "diagnostic_note",
]

AGG_EDGE_COLUMNS = [
    "species_id",
    "canonical",
    "source_canonical",
    "target_canonical",
    "effect",
    "source_types",
    "source_raw_examples",
    "source_canonical_examples",
    "semantic_cluster_ids",
    "n_evidence_sentences",
    "sentence_ids",
    "evidence_quotes",
    "mean_confidence",
    "max_confidence",
    "extraction_methods",
    "edge_origin",
    "assumption",
]

DAG_DIAGNOSTIC_COLUMNS = [
    "species_id",
    "canonical",
    "n_nodes",
    "n_edges",
    "is_dag",
    "example_cycles",
]

NODE_COLUMNS = [
    "species_id",
    "canonical",
    "node",
    "node_type",
]


def ensure_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with all expected columns present, preserving row order."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            out[col] = np.nan
    return out[columns + [c for c in out.columns if c not in columns]]


def coerce_json_object(text: str) -> dict[str, Any]:
    """Parse a JSON object from small-model output.

    The prompt asks for JSON only. In practice, small local models can still add
    a short preamble, markdown fence or dangling comma. The parser repairs those
    simple cases, then returns a normalised object with an ``edges`` list.
    """
    cleaned = str(text).strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)

    # Prefer the whole response, but fall back to the outermost JSON object.
    candidate = cleaned
    if "{" in cleaned and "}" in cleaned:
        candidate = cleaned[cleaned.find("{"): cleaned.rfind("}") + 1]

    # A few deterministic repairs that do not invent content.
    candidate = candidate.replace("\u201c", '"').replace("\u201d", '"')
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    obj = json.loads(candidate)

    if not isinstance(obj, dict):
        raise ValueError("LLM response was valid JSON but not a JSON object")
    if "edges" not in obj or obj["edges"] is None:
        obj["edges"] = []
    if not isinstance(obj["edges"], list):
        raise ValueError("LLM response field 'edges' was not a list")
    return obj


def safe_coerce_json_object(text: str) -> tuple[dict[str, Any], str]:
    """Parse an LLM response without creating visible notebook errors.

    If parsing still fails after deterministic repairs, the sentence contributes
    no candidate edge. The diagnostic is written to ``llm_extraction_audit.csv``
    rather than displayed in the main rendered notebook.
    """
    try:
        return coerce_json_object(text), "parsed"
    except Exception:
        return {"edges": []}, "unparsed_empty"


_TRANSFORMERS_PIPELINE = None


def validate_local_model_path(model_path: Path) -> None:
    """Fail early unless a local Hugging Face-compatible model directory is present."""
    model_path = Path(model_path)

    if not model_path.exists():
        raise FileNotFoundError(
            f"MODEL_PATH does not exist: {model_path}. "
            "Set MODEL_PATH to a local Hugging Face-compatible instruction model directory, "
            "for example: export MODEL_PATH=../models/qwen2.5-3b-instruct"
        )

    if not model_path.is_dir():
        raise NotADirectoryError(
            f"MODEL_PATH is not a directory: {model_path}. "
            "Set MODEL_PATH to the directory containing the local model files."
        )

    present = {p.name for p in model_path.iterdir()}
    if "config.json" not in present:
        raise FileNotFoundError(
            f"MODEL_PATH is missing config.json: {model_path}. "
            "This does not look like a Hugging Face model directory."
        )

    tokenizer_files = {"tokenizer.json", "tokenizer.model", "vocab.json", "merges.txt"}
    if not (present & tokenizer_files):
        raise FileNotFoundError(
            f"MODEL_PATH is missing recognisable tokenizer files: {model_path}. "
            "Expected one of tokenizer.json, tokenizer.model, vocab.json or merges.txt."
        )

    weight_patterns = ["*.safetensors", "*.bin", "*.pt"]
    if not any(model_path.glob(pattern) for pattern in weight_patterns):
        raise FileNotFoundError(
            f"MODEL_PATH does not contain local model weights: {model_path}. "
            "Expected files such as *.safetensors or *.bin. "
            "The notebook intentionally does not download weights during execution."
        )


def get_transformers_pipeline():
    """Load a local Hugging Face Transformers text-generation pipeline once."""
    global _TRANSFORMERS_PIPELINE

    if _TRANSFORMERS_PIPELINE is not None:
        return _TRANSFORMERS_PIPELINE

    validate_local_model_path(MODEL_PATH)

    try:
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
    except ImportError as exc:
        raise ImportError(
            "Install torch, transformers, accelerate and bitsandbytes for local inference."
        ) from exc

    print(f"Loading local LLM from: {MODEL_PATH}")
    print("This one-off model-load step can take a few minutes on first run.")
    print(f"4-bit quantised loading: {LLM_LOAD_IN_4BIT}")

    tokenizer = AutoTokenizer.from_pretrained(
        LLM_MODEL,
        local_files_only=True,
        trust_remote_code=False,
    )
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    model_kwargs = {
        "device_map": LLM_DEVICE_MAP,
        "local_files_only": True,
        "trust_remote_code": False,
    }

    if LLM_LOAD_IN_4BIT:
        compute_dtype = getattr(torch, LLM_4BIT_COMPUTE_DTYPE)
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=LLM_4BIT_QUANT_TYPE,
            bnb_4bit_use_double_quant=LLM_4BIT_USE_DOUBLE_QUANT,
            bnb_4bit_compute_dtype=compute_dtype,
        )
    else:
        if LLM_TORCH_DTYPE == "auto":
            model_kwargs["torch_dtype"] = "auto"
        else:
            model_kwargs["torch_dtype"] = getattr(torch, LLM_TORCH_DTYPE)

    model = AutoModelForCausalLM.from_pretrained(LLM_MODEL, **model_kwargs)
    model.eval()

    # Keep generation settings explicit in the extraction call below. Some model
    # repositories define default max_length/min_length values that are harmless
    # but noisy in rendered notebooks when max_new_tokens is also supplied.
    model.generation_config.max_length = None
    model.generation_config.min_length = None
    model.generation_config.pad_token_id = tokenizer.pad_token_id
    model.generation_config.eos_token_id = tokenizer.eos_token_id

    _TRANSFORMERS_PIPELINE = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )
    return _TRANSFORMERS_PIPELINE


def make_model_input(prompt: str, tokenizer) -> str:
    """Apply the model chat template where available, otherwise use a plain prompt."""
    system_message = (
        "You extract sentence-grounded ecological evidence as strict JSON. "
        "Do not add evidence that is not stated in the supplied sentence. "
        "Return one JSON object only. If no relationship is present, return {\"edges\": []}."
    )

    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            [
                {"role": "system", "content": system_message},
                {"role": "user", "content": prompt},
            ],
            tokenize=False,
            add_generation_prompt=True,
        )
    return f"{system_message}\n\n{prompt}\n\nJSON:"


def truncate_model_input(model_input: str, tokenizer) -> str:
    """Keep prompts within a fixed token budget before generation."""
    token_ids = tokenizer.encode(model_input, add_special_tokens=False)
    if len(token_ids) <= LLM_MAX_INPUT_TOKENS:
        return model_input
    token_ids = token_ids[-LLM_MAX_INPUT_TOKENS:]
    return tokenizer.decode(token_ids, skip_special_tokens=True)


def batched(items: list[Any], batch_size: int) -> list[list[Any]]:
    """Yield fixed-size batches."""
    return [items[i: i + batch_size] for i in range(0, len(items), batch_size)]


def unpack_pipeline_output(output: Any) -> str:
    """Handle slightly different pipeline return shapes across transformers versions."""
    if isinstance(output, list) and output and isinstance(output[0], dict):
        return output[0].get("generated_text", "")
    if isinstance(output, dict):
        return output.get("generated_text", "")
    return str(output)


def call_transformers_json_batch(prompts: list[str]) -> list[tuple[dict[str, Any], str, str]]:
    """Run local Transformers generation for a batch and parse responses safely."""
    pipe = get_transformers_pipeline()
    tokenizer = pipe.tokenizer

    model_inputs = [truncate_model_input(make_model_input(prompt, tokenizer), tokenizer) for prompt in prompts]

    generation_kwargs = {
        "max_new_tokens": LLM_MAX_NEW_TOKENS,
        "do_sample": LLM_TEMPERATURE > 0,
        "return_full_text": False,
        "pad_token_id": tokenizer.pad_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }
    if LLM_TEMPERATURE > 0:
        generation_kwargs["temperature"] = LLM_TEMPERATURE

    # Some transformers/bitsandbytes versions emit repeated generation-configuration
    # messages to stderr even when settings are harmless. Suppress that noise
    # in the rendered notebook; the extraction audit records parse status.
    with warnings.catch_warnings(), contextlib.redirect_stderr(io.StringIO()):
        warnings.simplefilter("ignore")
        outputs = pipe(model_inputs, batch_size=LLM_BATCH_SIZE, **generation_kwargs)

    generated_texts = [unpack_pipeline_output(output) for output in outputs]

    parsed = []
    for generated_text in generated_texts:
        payload, parse_status = safe_coerce_json_object(generated_text)
        parsed.append((payload, parse_status, generated_text))
    return parsed


def normalise_llm_edges(payload: dict[str, Any], sentence_row: pd.Series) -> list[dict[str, Any]]:
    """Convert provider JSON into the notebook's dataframe schema."""
    out = []
    for item in payload.get("edges", []):
        if not isinstance(item, dict):
            continue

        source_raw = str(item.get("source_raw", "")).strip()
        if not source_raw:
            continue

        source_type = str(item.get("source_type", "other")).strip()
        target = normalise_target_label(str(item.get("target_canonical", "abundance")).strip())
        effect = str(item.get("effect", "unknown")).strip().lower()
        quote = str(item.get("evidence_quote", "")).strip()
        if not quote:
            quote = str(sentence_row["sentence"]).strip()

        try:
            confidence = float(item.get("confidence", 0.5))
        except (TypeError, ValueError):
            confidence = 0.5

        out.append(
            {
                "species_id": sentence_row["species_id"],
                "canonical": sentence_row["canonical"],
                "source_file": sentence_row["source_file"],
                "sentence_id": int(sentence_row["sentence_id"]),
                "source_raw": source_raw,
                "source_canonical": canonicalise_driver_label(source_raw),
                "source_type": source_type,
                "target_canonical": target,
                "effect": effect,
                "confidence": round(max(0.0, min(confidence, 1.0)), 2),
                "evidence_quote": quote,
                "extraction_method": "llm_transformers",
                "llm_error": np.nan,
            }
        )
    return out


def extract_edges_with_llm(sentences: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Run the required local LLM over sentence rows, with batching and progress reporting."""
    try:
        from tqdm.auto import tqdm
    except ImportError:
        tqdm = lambda x, **kwargs: x

    sentence_iter = sentences.copy()
    if LLM_MAX_SENTENCES is not None:
        sentence_iter = sentence_iter.head(int(LLM_MAX_SENTENCES))

    sentence_rows = [row for _, row in sentence_iter.iterrows()]
    prompts = [
        make_llm_prompt(
            row["canonical"],
            int(row["sentence_id"]),
            row["sentence"],
            row["context_text"],
        )
        for row in sentence_rows
    ]

    print(f"Prepared {len(prompts)} sentence-level LLM calls.")
    print(
        f"Batch size: {LLM_BATCH_SIZE}; max_new_tokens: {LLM_MAX_NEW_TOKENS}; max_input_tokens: {LLM_MAX_INPUT_TOKENS}")

    # Pre-flight model loading is deliberate. If MODEL_PATH or 4-bit setup is
    # wrong, fail here with a setup error rather than writing non-LLM outputs.
    get_transformers_pipeline()

    rows = []
    audit_rows = []
    for row_batch, prompt_batch in tqdm(
            zip(batched(sentence_rows, LLM_BATCH_SIZE), batched(prompts, LLM_BATCH_SIZE)),
            total=(len(prompts) + LLM_BATCH_SIZE - 1) // LLM_BATCH_SIZE,
            desc="LLM extraction batches",
    ):
        parsed_payloads = call_transformers_json_batch(prompt_batch)
        for row, (payload, parse_status, generated_text) in zip(row_batch, parsed_payloads):
            new_edges = normalise_llm_edges(payload, row)
            rows.extend(new_edges)
            audit_rows.append(
                {
                    "species_id": row["species_id"],
                    "canonical": row["canonical"],
                    "source_file": row["source_file"],
                    "sentence_id": int(row["sentence_id"]),
                    "parse_status": parse_status,
                    "n_candidate_edges": len(new_edges),
                    "response_hash": hashlib.sha256(str(generated_text).encode("utf-8")).hexdigest()[:12],
                    "diagnostic_note": "" if parse_status == "parsed" else "response did not parse as JSON after deterministic cleanup; treated as no candidate edge",
                }
            )

    edge_df = ensure_columns(pd.DataFrame(rows), RAW_EDGE_COLUMNS)
    audit_df = ensure_columns(pd.DataFrame(audit_rows), LLM_AUDIT_COLUMNS)
    return edge_df, audit_df


raw_edges_df, llm_extraction_audit_df = extract_edges_with_llm(sentences_df)
raw_edges_df.to_csv(OUTPUT_DIR / "all_raw_candidate_edges.csv", index=False)
llm_extraction_audit_df.to_csv(OUTPUT_DIR / "llm_extraction_audit.csv", index=False)

n_sentences_processed = len(llm_extraction_audit_df)
n_sentences_with_edges = int((llm_extraction_audit_df["n_candidate_edges"] > 0).sum()) if n_sentences_processed else 0

print("LLM-assisted candidate edges:", len(raw_edges_df))
print("Sentences processed:", n_sentences_processed)
print("Sentences returning at least one candidate edge:", n_sentences_with_edges)

if raw_edges_df.empty:
    print("No candidate edges were extracted for this input.")
else:
    display(raw_edges_df.head(10))

if SHOW_EXTRACTION_DIAGNOSTICS:
    display(llm_extraction_audit_df)

Prepared 30 sentence-level LLM calls.
Batch size: 1; max_new_tokens: 192; max_input_tokens: 1536
Loading local LLM from: ../models/qwen2.5-3b-instruct
This one-off model-load step can take a few minutes on first run.
4-bit quantised loading: True

LLM-assisted candidate edges: 37
Sentences processed: 30
Sentences returning at least one candidate edge: 29

	species_id	canonical	source_file	sentence_id	source_raw	source_canonical	source_type	target_canonical	effect	confidence	evidence_quote	extraction_method	llm_error
0	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	0	light calcareous soils	calcareous_soil	soil_substrate	abundance	positive	1.0	most often on light calcareous soils.	llm_transformers	NaN
1	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	1	crop	crop	land_use	habitat suitability	positive	1.0	It can persist where the crop is open and wher...	llm_transformers	NaN
2	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	1	cultivation	cultivation	disturbance	habitat suitability	positive	1.0	It can persist where the crop is open and wher...	llm_transformers	NaN
3	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	2	dense competitive vegetation	dense_vegetation	habitat_structure	habitat suitability	negative	1.0	Dense competitive vegetation and intensive her...	llm_transformers	NaN
4	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	2	intensive herbicide use	herbicide_use	management	habitat suitability	negative	1.0	Dense competitive vegetation and intensive her...	llm_transformers	NaN
5	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	3	arable margins	arable_margins	habitat structure	habitat suitability	positive	1.0	Local populations are most likely to be mainta...	llm_transformers	NaN
6	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	4	loss of open cultivated margins	arable_margins	disturbance	abundance	negative	1.0	but loss of open cultivated margins remains a ...	llm_transformers	NaN
7	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	5	southern-temperate	southern_temperate	climate	abundance	positive	1.0	A southern-temperate archaeophyte in Britain a...	llm_transformers	NaN
8	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	0	cool upland situations	cool_upland_situations	habitat_structure	habitat suitability	positive	1.0	in cool upland situations.	llm_transformers	NaN
9	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	1	short, open vegetation	short_vegetation	habitat_structure	habitat suitability	positive	1.0	It depends on short, open vegetation and relia...	llm_transformers	NaN

Inspect local LLM extraction coverage#

The first check is whether the model produced candidate edges, how many sentences failed parsing or generation, and how many candidate edges were returned per species. This is a diagnostic step, not evidence that the extracted edges are correct.

For a real analysis, inspect the error rate before moving on. A few failures may be acceptable if the review table still contains useful evidence, but a high failure rate usually means the prompt, model or token limits need attention.

if raw_edges_df.empty:
    print("No candidate edges were extracted.")
else:
    fig, ax = plt.subplots(figsize=(8, 3.5))
    species_counts = raw_edges_df["canonical"].value_counts().sort_values()
    species_counts.plot(kind="barh", ax=ax)
    ax.set_title("Candidate edges by species")
    ax.set_xlabel("Number of candidate edges")
    ax.set_ylabel("")
    plt.tight_layout()
    plt.show()

    fig, ax = plt.subplots(figsize=(8, 3.5))
    type_counts = raw_edges_df["source_type"].value_counts().sort_values()
    type_counts.plot(kind="barh", ax=ax)
    ax.set_title("Candidate edges by driver type")
    ax.set_xlabel("Number of candidate edges")
    ax.set_ylabel("")
    plt.tight_layout()
    plt.show()

    confidence_by_type = (
        raw_edges_df.groupby("source_type", dropna=False)["confidence"]
        .mean()
        .sort_values()
        .reset_index()
    )
    display(confidence_by_type)

../../../_images/7c61f07f26b6bd7dbd6d5909f41ff4e88f52ce371ba88f1013503f59768abb09.png

../../../_images/b3f28199cff13d423deebd7826afc9f4f7b5716eabc659a7d07828d17b29676e.png

	source_type	confidence
0	disease_pathogen	0.850000
1	topography	0.900000
2	disturbance	0.985714
3	climate	1.000000
4	habitat_structure	1.000000
5	hydrology	1.000000
6	land_use	1.000000
7	habitat structure	1.000000
8	management	1.000000
9	other	1.000000
10	soil_substrate	1.000000

Validate candidate edges#

The validation step checks that each proposed row has the required fields, allowed values and a quote that can be linked back to the source sentence. Rows that fail validation are retained with flags rather than hidden.

This keeps the workflow auditable: the review table shows both the proposed relationship and the reason it should be treated with caution.

VALID_TARGETS = {"abundance", "habitat suitability", "population performance"}
VALID_EFFECTS = {"positive", "negative", "mixed", "unknown"}
EXTERNAL_DRIVER_TYPES = {
    "climate",
    "hydrology",
    "soil_substrate",
    "habitat_structure",
    "disturbance",
    "management",
    "land_use",
    "biotic_interaction",
    "disease_pathogen",
    "pollution",
    "topography",
    "other",
}

NON_DRIVER_PATTERNS = [
    r"\b\d{4}\b",
    r"\bfirst recorded\b",
    r"\bintroduced\b",
    r"\bnative\b",
    r"\barchaeophyte\b",
    r"\bneophyte\b",
]


def validate_edge(row: pd.Series) -> dict[str, Any]:
    reasons = []
    warnings = []

    source = str(row.get("source_raw", "")).strip()
    quote = str(row.get("evidence_quote", "")).strip()
    target = str(row.get("target_canonical", "")).strip().lower()
    effect = str(row.get("effect", "")).strip().lower()
    source_type = str(row.get("source_type", "")).strip().lower()

    if not source:
        reasons.append("missing_source_phrase")
    if not quote:
        reasons.append("missing_evidence_quote")
    if target not in VALID_TARGETS:
        reasons.append("invalid_target")
    if effect not in VALID_EFFECTS:
        reasons.append("invalid_effect")
    if source_type not in EXTERNAL_DRIVER_TYPES:
        reasons.append("invalid_source_type")

    lower_source = source.lower()
    lower_quote = quote.lower()
    if source and lower_source not in lower_quote:
        warnings.append("source_phrase_not_exactly_in_quote")

    if any(re.search(pattern, lower_source) for pattern in NON_DRIVER_PATTERNS):
        reasons.append("likely_non_driver_source")

    accepted_as_evidence = not reasons
    accepted_for_environmental_dag = accepted_as_evidence and source_type != "other"

    return {
        "accepted_as_evidence": accepted_as_evidence,
        "accepted_for_environmental_dag": accepted_for_environmental_dag,
        "validation_reasons": "|".join(reasons),
        "validation_warnings": "|".join(warnings),
    }


if raw_edges_df.empty:
    validated_edges_df = raw_edges_df.copy()
else:
    validation = raw_edges_df.apply(validate_edge, axis=1, result_type="expand")
    validated_edges_df = pd.concat([raw_edges_df.reset_index(drop=True), validation.reset_index(drop=True)], axis=1)

validated_edges_df.to_csv(OUTPUT_DIR / "all_candidate_edges_validated.csv", index=False)

print("Validated candidate edges:", len(validated_edges_df))
if len(validated_edges_df):
    display(validated_edges_df.head(15))
    display(validated_edges_df["accepted_for_environmental_dag"].value_counts(dropna=False).rename("count"))
else:
    print("No candidate edges were produced.")

Validated candidate edges: 37

	species_id	canonical	source_file	sentence_id	source_raw	source_canonical	source_type	target_canonical	effect	confidence	evidence_quote	extraction_method	llm_error	accepted_as_evidence	accepted_for_environmental_dag	validation_reasons	validation_warnings
0	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	0	light calcareous soils	calcareous_soil	soil_substrate	abundance	positive	1.0	most often on light calcareous soils.	llm_transformers	NaN	True	True
1	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	1	crop	crop	land_use	habitat suitability	positive	1.0	It can persist where the crop is open and wher...	llm_transformers	NaN	True	True
2	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	1	cultivation	cultivation	disturbance	habitat suitability	positive	1.0	It can persist where the crop is open and wher...	llm_transformers	NaN	True	True
3	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	2	dense competitive vegetation	dense_vegetation	habitat_structure	habitat suitability	negative	1.0	Dense competitive vegetation and intensive her...	llm_transformers	NaN	True	True
4	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	2	intensive herbicide use	herbicide_use	management	habitat suitability	negative	1.0	Dense competitive vegetation and intensive her...	llm_transformers	NaN	True	True
5	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	3	arable margins	arable_margins	habitat structure	habitat suitability	positive	1.0	Local populations are most likely to be mainta...	llm_transformers	NaN	False	False	invalid_source_type
6	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	4	loss of open cultivated margins	arable_margins	disturbance	abundance	negative	1.0	but loss of open cultivated margins remains a ...	llm_transformers	NaN	True	True
7	demo_001	Kickxia spuria	synthetic_plant_atlas_style_accounts	5	southern-temperate	southern_temperate	climate	abundance	positive	1.0	A southern-temperate archaeophyte in Britain a...	llm_transformers	NaN	True	True
8	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	0	cool upland situations	cool_upland_situations	habitat_structure	habitat suitability	positive	1.0	in cool upland situations.	llm_transformers	NaN	True	True
9	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	1	short, open vegetation	short_vegetation	habitat_structure	habitat suitability	positive	1.0	It depends on short, open vegetation and relia...	llm_transformers	NaN	True	True
10	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	1	reliable moisture	reliable_moisture	hydrology	habitat suitability	positive	1.0	It depends on short, open vegetation and relia...	llm_transformers	NaN	True	True
11	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	3	flourishing conditions	flourishing_conditions	topography	habitat suitability	positive	0.9	Small colonies may fluctuate between years.	llm_transformers	NaN	True	True		source_phrase_not_exactly_in_quote
12	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	4	warmer drier summers	warmer_drier_summers	climate	habitat suitability	negative	1.0	warmer drier summers could increase stress on ...	llm_transformers	NaN	True	True
13	demo_002	Koenigia islandica	synthetic_plant_atlas_style_accounts	5	warmer drier summers	warmer_drier_summers	climate	habitat suitability	negative	1.0	warmer drier summers could increase stress on ...	llm_transformers	NaN	True	True
14	demo_003	Lagurus ovatus	synthetic_plant_atlas_style_accounts	0	bare sand	bare_sand	habitat_structure	habitat suitability	positive	1.0	It is favored by bare sand and warm open micro...	llm_transformers	NaN	True	True

accepted_for_environmental_dag
True     33
False     4
Name: count, dtype: int64

Group similar driver phrases#

The LLM extracts phrases as they appear in the sentence, so the same ecological idea may appear in several wordings. For example, fertiliser inputs, high fertiliser inputs and fertilised grassland should often be reviewed together. This section uses a lightweight token-overlap approach to create provisional phrase clusters.

For a larger analysis, this step could be replaced by embedding-based clustering or an expert-curated vocabulary. A transparent string-based method is used here because it is easy to inspect.

def phrase_tokens(text: str) -> set[str]:
    text = normalise_node_label(text)
    return {t for t in text.split() if len(t) > 2}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def connected_components_from_pairs(labels: list[str], threshold: float) -> dict[str, int]:
    graph = nx.Graph()
    graph.add_nodes_from(labels)
    token_map = {label: phrase_tokens(label) for label in labels}

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if jaccard(token_map[a], token_map[b]) >= threshold:
                graph.add_edge(a, b)

    component_id = {}
    for idx, component in enumerate(nx.connected_components(graph), start=1):
        for label in component:
            component_id[label] = idx
    return component_id


dag_edges = validated_edges_df[validated_edges_df.get("accepted_for_environmental_dag", False).astype(bool)].copy()

if dag_edges.empty:
    clustered_edges_df = dag_edges.copy()
    cluster_table = pd.DataFrame(columns=["source_type", "semantic_cluster_id", "semantic_cluster_label", "members"])
else:
    clustered_parts = []
    cluster_rows = []

    for source_type, group in dag_edges.groupby("source_type"):
        labels = sorted(group["source_canonical"].dropna().unique())
        component_ids = connected_components_from_pairs(labels, PHRASE_SIMILARITY_THRESHOLD)

        label_to_cluster = {}
        for cluster_id in sorted(set(component_ids.values())):
            members = sorted([label for label, cid in component_ids.items() if cid == cluster_id])
            # Use the shortest member as a readable provisional label.
            cluster_label = sorted(members, key=lambda x: (len(x), x))[0]
            semantic_cluster_id = f"{source_type}_{cluster_id:03d}"
            for member in members:
                label_to_cluster[member] = (semantic_cluster_id, cluster_label)
            cluster_rows.append(
                {
                    "source_type": source_type,
                    "semantic_cluster_id": semantic_cluster_id,
                    "semantic_cluster_label": cluster_label,
                    "members": "; ".join(members),
                }
            )

        tmp = group.copy()
        tmp["semantic_cluster_id"] = tmp["source_canonical"].map(lambda x: label_to_cluster[x][0])
        tmp["semantic_cluster_label"] = tmp["source_canonical"].map(lambda x: label_to_cluster[x][1])
        clustered_parts.append(tmp)

    clustered_edges_df = pd.concat(clustered_parts, ignore_index=True)
    cluster_table = pd.DataFrame(cluster_rows).sort_values(["source_type", "semantic_cluster_id"]).reset_index(
        drop=True)

clustered_edges_df.to_csv(OUTPUT_DIR / "all_candidate_edges_clustered.csv", index=False)
cluster_table.to_csv(OUTPUT_DIR / "driver_semantic_clusters.csv", index=False)

print("Semantic clusters:", len(cluster_table))
display(cluster_table.head(20))

Semantic clusters: 30

	source_type	semantic_cluster_id	semantic_cluster_label	members
0	climate	climate_001	severe_winter_cold	severe_winter_cold
1	climate	climate_002	southern_temperate	southern_temperate
2	climate	climate_003	warmer_drier_summers	warmer_drier_summers
3	disease_pathogen	disease_pathogen_001	larch_pathogens	larch_pathogens
4	disturbance	disturbance_001	arable_margins	arable_margins
5	disturbance	disturbance_002	cultivation	cultivation
6	disturbance	disturbance_003	disturbed_mineral_soil	disturbed_mineral_soil
7	disturbance	disturbance_004	seed_supply	seed_supply
8	disturbance	disturbance_005	severe_winter_cold	severe_winter_cold
9	disturbance	disturbance_006	soil_disturbance	soil_disturbance
10	habitat_structure	habitat_structure_001	bare_sand	bare_sand
11	habitat_structure	habitat_structure_002	closed_dune_grassland	closed_dune_grassland
12	habitat_structure	habitat_structure_003	cool_upland_situations	cool_upland_situations
13	habitat_structure	habitat_structure_004	dense_vegetation	dense_vegetation
14	habitat_structure	habitat_structure_005	managed_cultivated_ground	managed_cultivated_ground
15	habitat_structure	habitat_structure_006	short_vegetation	short_vegetation
16	habitat_structure	habitat_structure_007	soil	soil
17	habitat_structure	habitat_structure_008	warm_microsites	warm_microsites
18	hydrology	hydrology_001	reliable_moisture	reliable_moisture
19	hydrology	hydrology_002	sites_waterlogged_soils	sites_waterlogged_soils

Visualise extracted evidence#

The plots below summarise the extracted candidate rows before they are aggregated into graph edges. They are intended as quick checks: which driver types dominate, which effect directions appear, and whether confidence scores look plausible.

These plots do not validate the ecology. They help identify issues worth checking in the review table.

if clustered_edges_df.empty:
    print("No clustered edges to visualise.")
else:
    source_counts = clustered_edges_df["source_type"].value_counts().sort_values()

    ax = source_counts.plot(kind="barh")
    ax.set_title("Candidate DAG edges by broad driver type")
    ax.set_xlabel("Number of candidate edges")
    ax.set_ylabel("")
    plt.tight_layout()
    plt.show()

../../../_images/90deaba9dea5615e8e0235b6deac9a8346b9ced8004c37649092c16441fb1d4c.png

if clustered_edges_df.empty:
    print("No clustered edges to visualise.")
else:
    species_driver_counts = (
        clustered_edges_df.groupby(["canonical", "source_type"])
        .size()
        .unstack(fill_value=0)
        .sort_index()
    )

    ax = species_driver_counts.plot(kind="bar", stacked=True)
    ax.set_title("Driver-type evidence by species")
    ax.set_xlabel("")
    ax.set_ylabel("Candidate edge count")
    plt.xticks(rotation=45, ha="right")
    ax.legend(title="Driver type", loc="center left", bbox_to_anchor=(1.05, 1))
    plt.tight_layout()
    plt.show()

../../../_images/b93b22e070fc492bf82ef1c8a0e75c160fb746617ea5f5a2eacdf9f27c1190d4.png

Aggregate sentence-level evidence into graph edges#

The model works at sentence level, but a graph edge should usually aggregate repeated evidence for the same species, source and target. This section combines compatible candidate rows, keeps examples of the raw phrases and quotes, and records the number of supporting sentences.

Aggregation is deliberately conservative. It should make review easier without hiding where the evidence came from.

def join_unique(values: Iterable[Any], sep: str = " | ") -> str:
    out = []
    seen = set()
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text and text not in seen:
            seen.add(text)
            out.append(text)
    return sep.join(out)


if clustered_edges_df.empty:
    agg_edges_df = pd.DataFrame()
else:
    agg_edges_df = (
        clustered_edges_df
        .groupby(
            [
                "species_id",
                "canonical",
                "semantic_cluster_label",
                "target_canonical",
                "effect",
            ],
            dropna=False,
        )
        .agg(
            source_types=("source_type", lambda x: join_unique(x, "; ")),
            source_raw_examples=("source_raw", lambda x: join_unique(x, "; ")),
            source_canonical_examples=("source_canonical", lambda x: join_unique(x, "; ")),
            semantic_cluster_ids=("semantic_cluster_id", lambda x: join_unique(x, "; ")),
            n_evidence_sentences=("sentence_id", "nunique"),
            sentence_ids=("sentence_id", lambda x: join_unique([int(v) for v in x], "; ")),
            evidence_quotes=("evidence_quote", lambda x: join_unique(x, " || ")),
            mean_confidence=("confidence", "mean"),
            max_confidence=("confidence", "max"),
            extraction_methods=("extraction_method", lambda x: join_unique(x, "; ")),
        )
        .reset_index()
        .rename(columns={"semantic_cluster_label": "source_canonical"})
    )

    agg_edges_df["edge_origin"] = "text_extracted"
    agg_edges_df["assumption"] = ""
    agg_edges_df["mean_confidence"] = agg_edges_df["mean_confidence"].round(3)
    agg_edges_df["max_confidence"] = agg_edges_df["max_confidence"].round(3)

agg_edges_df.to_csv(OUTPUT_DIR / "all_dag_edges.csv", index=False)

print("Aggregated DAG edges:", len(agg_edges_df))
display(agg_edges_df.head(20))

Aggregated DAG edges: 29

	species_id	canonical	source_canonical	target_canonical	effect	source_types	source_raw_examples	source_canonical_examples	semantic_cluster_ids	n_evidence_sentences	sentence_ids	evidence_quotes	mean_confidence	max_confidence	extraction_methods	edge_origin
0	demo_001	Kickxia spuria	arable_margins	abundance	negative	disturbance	loss of open cultivated margins	arable_margins	disturbance_001	1	4	but loss of open cultivated margins remains a ...	1.0	1.0	llm_transformers	text_extracted
1	demo_001	Kickxia spuria	calcareous_soil	abundance	positive	soil_substrate	light calcareous soils	calcareous_soil	soil_substrate_002	1	0	most often on light calcareous soils.	1.0	1.0	llm_transformers	text_extracted
2	demo_001	Kickxia spuria	crop	habitat suitability	positive	land_use	crop	crop	land_use_001	1	1	It can persist where the crop is open and wher...	1.0	1.0	llm_transformers	text_extracted
3	demo_001	Kickxia spuria	cultivation	habitat suitability	positive	disturbance	cultivation	cultivation	disturbance_002	1	1	It can persist where the crop is open and wher...	1.0	1.0	llm_transformers	text_extracted
4	demo_001	Kickxia spuria	dense_vegetation	habitat suitability	negative	habitat_structure	dense competitive vegetation	dense_vegetation	habitat_structure_004	1	2	Dense competitive vegetation and intensive her...	1.0	1.0	llm_transformers	text_extracted
5	demo_001	Kickxia spuria	herbicide_use	habitat suitability	negative	management	intensive herbicide use	herbicide_use	management_002	1	2	Dense competitive vegetation and intensive her...	1.0	1.0	llm_transformers	text_extracted
6	demo_001	Kickxia spuria	southern_temperate	abundance	positive	climate	southern-temperate	southern_temperate	climate_002	1	5	A southern-temperate archaeophyte in Britain a...	1.0	1.0	llm_transformers	text_extracted
7	demo_002	Koenigia islandica	cool_upland_situations	habitat suitability	positive	habitat_structure	cool upland situations	cool_upland_situations	habitat_structure_003	1	0	in cool upland situations.	1.0	1.0	llm_transformers	text_extracted
8	demo_002	Koenigia islandica	flourishing_conditions	habitat suitability	positive	topography	flourishing conditions	flourishing_conditions	topography_001	1	3	Small colonies may fluctuate between years.	0.9	0.9	llm_transformers	text_extracted
9	demo_002	Koenigia islandica	reliable_moisture	habitat suitability	positive	hydrology	reliable moisture	reliable_moisture	hydrology_001	1	1	It depends on short, open vegetation and relia...	1.0	1.0	llm_transformers	text_extracted
10	demo_002	Koenigia islandica	short_vegetation	habitat suitability	positive	habitat_structure	short, open vegetation	short_vegetation	habitat_structure_006	1	1	It depends on short, open vegetation and relia...	1.0	1.0	llm_transformers	text_extracted
11	demo_002	Koenigia islandica	warmer_drier_summers	habitat suitability	negative	climate	warmer drier summers	warmer_drier_summers	climate_003	2	4; 5	warmer drier summers could increase stress on ...	1.0	1.0	llm_transformers	text_extracted
12	demo_003	Lagurus ovatus	bare_sand	habitat suitability	positive	habitat_structure; soil_substrate	bare sand	bare_sand	habitat_structure_001; soil_substrate_001	2	0; 1	It is favored by bare sand and warm open micro...	1.0	1.0	llm_transformers	text_extracted
13	demo_003	Lagurus ovatus	closed_dune_grassland	habitat suitability	negative	habitat_structure	closed dune grassland	closed_dune_grassland	habitat_structure_002	1	2	Closed dune grassland and lack of disturbance ...	1.0	1.0	llm_transformers	text_extracted
14	demo_003	Lagurus ovatus	seed_supply	population performance	positive	disturbance	seed supply	seed_supply	disturbance_004	1	5	Continued seed supply from gardens and disturb...	1.0	1.0	llm_transformers	text_extracted
15	demo_003	Lagurus ovatus	severe_winter_cold	habitat suitability	negative	climate	severe winter cold	severe_winter_cold	climate_001	1	4	while severe winter cold may limit establishme...	1.0	1.0	llm_transformers	text_extracted
16	demo_003	Lagurus ovatus	severe_winter_cold	population performance	negative	disturbance	severe winter cold	severe_winter_cold	disturbance_005	2	3; 5	severe winter cold may limit establishment inl...	1.0	1.0	llm_transformers	text_extracted
17	demo_003	Lagurus ovatus	warm_microsites	habitat suitability	positive	habitat_structure	warm open microsites	warm_microsites	habitat_structure_008	1	1	is favoured by bare sand and warm open microsites	1.0	1.0	llm_transformers	text_extracted
18	demo_004	Lamium confertum	managed_cultivated_ground	abundance	negative	habitat_structure	lightly managed cultivated ground	managed_cultivated_ground	habitat_structure_005	1	3	Declines are consistent with the loss of light...	1.0	1.0	llm_transformers	text_extracted
19	demo_004	Lamium confertum	modern_weed_control	abundance	negative	management	modern weed control	modern_weed_control	management_003	1	2	Modern weed control, autumn-sown dense crops a...	1.0	1.0	llm_transformers	text_extracted

Add transparent bridge edges#

Some extracted relationships point to intermediate concepts such as habitat suitability or population performance rather than directly to abundance. This section adds simple bridge edges from those intermediate nodes to abundance.

These bridge edges are marked as assumptions. They make the graph easier to inspect, but they should still be checked during expert review.

def add_bridge_edges(edges: pd.DataFrame) -> pd.DataFrame:
    if edges.empty:
        return edges

    bridge_rows = []
    for (species_id, canonical, target), group in edges.groupby(["species_id", "canonical", "target_canonical"]):
        if target in {"habitat suitability", "population performance"}:
            bridge_rows.append(
                {
                    "species_id": species_id,
                    "canonical": canonical,
                    "source_canonical": target,
                    "target_canonical": "abundance",
                    "effect": "positive",
                    "source_types": "model_assumption",
                    "source_raw_examples": "",
                    "source_canonical_examples": target,
                    "semantic_cluster_ids": "",
                    "n_evidence_sentences": 0,
                    "sentence_ids": "",
                    "evidence_quotes": "",
                    "mean_confidence": np.nan,
                    "max_confidence": np.nan,
                    "extraction_methods": "",
                    "edge_origin": "model_bridge",
                    "assumption": f"Assumes that higher {target} tends to support higher abundance.",
                }
            )

    if not bridge_rows:
        return edges

    return pd.concat([edges, pd.DataFrame(bridge_rows)], ignore_index=True)


model_augmented_edges_df = add_bridge_edges(agg_edges_df)
model_augmented_edges_df.to_csv(OUTPUT_DIR / "all_dag_edges_with_bridge_edges.csv", index=False)

print("DAG edges after optional bridge edges:", len(model_augmented_edges_df))
display(model_augmented_edges_df.tail(10))

DAG edges after optional bridge edges: 35

	species_id	canonical	source_canonical	target_canonical	effect	source_types	source_raw_examples	source_canonical_examples	semantic_cluster_ids	n_evidence_sentences	sentence_ids	evidence_quotes	mean_confidence	max_confidence	extraction_methods	edge_origin	assumption
25	demo_005	Larix decidua	larch_pathogens	abundance	negative	disease_pathogen	larch pathogens	larch_pathogens	disease_pathogen_001	2	4; 5	Disease pressure from larch pathogens and fell...	0.85	1.0	llm_transformers	text_extracted
26	demo_005	Larix decidua	rocky_ground	habitat suitability	positive	soil_substrate	open rocky ground	rocky_ground	soil_substrate_004	1	1	regenerates from seed on disturbed mineral soi...	1.00	1.0	llm_transformers	text_extracted
27	demo_005	Larix decidua	sites_waterlogged_soils	habitat suitability	negative	hydrology	dry sites and waterlogged soils	sites_waterlogged_soils	hydrology_002	1	2	Very dry sites and waterlogged soils are less ...	1.00	1.0	llm_transformers	text_extracted
28	demo_005	Larix decidua	soil_disturbance	abundance	positive	disturbance	soil disturbance	soil_disturbance	disturbance_006	1	3	Natural regeneration can follow soil disturban...	1.00	1.0	llm_transformers	text_extracted
29	demo_001	Kickxia spuria	habitat suitability	abundance	positive	model_assumption		habitat suitability		0			NaN	NaN		model_bridge	Assumes that higher habitat suitability tends ...
30	demo_002	Koenigia islandica	habitat suitability	abundance	positive	model_assumption		habitat suitability		0			NaN	NaN		model_bridge	Assumes that higher habitat suitability tends ...
31	demo_003	Lagurus ovatus	habitat suitability	abundance	positive	model_assumption		habitat suitability		0			NaN	NaN		model_bridge	Assumes that higher habitat suitability tends ...
32	demo_003	Lagurus ovatus	population performance	abundance	positive	model_assumption		population performance		0			NaN	NaN		model_bridge	Assumes that higher population performance ten...
33	demo_004	Lamium confertum	habitat suitability	abundance	positive	model_assumption		habitat suitability		0			NaN	NaN		model_bridge	Assumes that higher habitat suitability tends ...
34	demo_005	Larix decidua	habitat suitability	abundance	positive	model_assumption		habitat suitability		0			NaN	NaN		model_bridge	Assumes that higher habitat suitability tends ...

Build graph objects and check for cycles#

This section turns the aggregated edge table into one graph per species and checks whether the graph is acyclic. The graph representation is mainly a way to organise review.

Acyclicity is useful because the intended downstream use is a candidate DAG. However, an automatically produced cycle is not necessarily an ecological failure. It may show that the schema needs refinement or that a bidirectional process has been flattened into directed edges.

def build_species_graph(edges: pd.DataFrame, species_id: str) -> nx.DiGraph:
    graph = nx.DiGraph()
    subset = edges[edges["species_id"] == species_id]

    for _, row in subset.iterrows():
        source = row["source_canonical"]
        target = row["target_canonical"]
        if not source or not target or source == target:
            continue

        graph.add_node(source, node_type="driver" if source not in {"habitat suitability", "population performance",
                                                                    "abundance"} else "intermediate")
        graph.add_node(target, node_type="outcome" if target == "abundance" else "intermediate")

        graph.add_edge(
            source,
            target,
            effect=row["effect"],
            edge_origin=row["edge_origin"],
            weight=row["mean_confidence"] if pd.notna(row["mean_confidence"]) else 0.5,
            n_evidence_sentences=int(row.get("n_evidence_sentences", 0)),
            evidence_quotes=row.get("evidence_quotes", ""),
        )

    return graph


cycle_rows = []
node_rows = []

for species_id, group in model_augmented_edges_df.groupby("species_id"):
    canonical = group["canonical"].iloc[0]
    graph = build_species_graph(model_augmented_edges_df, species_id)

    is_dag = nx.is_directed_acyclic_graph(graph)
    cycles = []
    if not is_dag:
        try:
            cycles = list(nx.simple_cycles(graph))[:5]
        except Exception:
            cycles = []

    cycle_rows.append(
        {
            "species_id": species_id,
            "canonical": canonical,
            "n_nodes": graph.number_of_nodes(),
            "n_edges": graph.number_of_edges(),
            "is_dag": is_dag,
            "example_cycles": json.dumps(cycles),
        }
    )

    for node, attrs in graph.nodes(data=True):
        node_rows.append(
            {
                "species_id": species_id,
                "canonical": canonical,
                "node": node,
                "node_type": attrs.get("node_type", ""),
            }
        )

dag_diagnostics_df = pd.DataFrame(cycle_rows)
dag_nodes_df = pd.DataFrame(node_rows)

dag_diagnostics_df.to_csv(OUTPUT_DIR / "dag_diagnostics.csv", index=False)
dag_nodes_df.to_csv(OUTPUT_DIR / "all_dag_nodes.csv", index=False)

display(dag_diagnostics_df)

	species_id	canonical	n_nodes	n_edges	is_dag	example_cycles
0	demo_001	Kickxia spuria	9	8	True	[]
1	demo_002	Koenigia islandica	7	6	True	[]
2	demo_003	Lagurus ovatus	8	8	True	[]
3	demo_004	Lamium confertum	5	5	True	[]
4	demo_005	Larix decidua	8	8	True	[]

Visualise one species graph#

The plot below is deliberately closer to a hand-drawn causal diagram than to a generic network plot. abundance is placed as the focal outcome and short driver labels are arranged around it. Positive and negative relationships are shown differently because the sign is one of the most useful review fields.

The full evidence phrases remain in the review table. The shortened node labels are only used to make the figure readable.

VIS_DIR = OUTPUT_DIR / "visualisations"
VIS_DIR.mkdir(parents=True, exist_ok=True)


def display_label(node: str) -> str:
    """Short label used only for graph plotting."""
    if node == "abundance":
        return "abundance"
    if node == "habitat suitability":
        return "habitat\nsuitability"
    if node == "population performance":
        return "population\nperformance"

    text = str(node).replace(" ", "_")
    parts = [p for p in text.split("_") if p]
    if len(parts) <= 3:
        return "_".join(parts)

    # Keep labels compact. The evidence table keeps the full phrase.
    return "_".join(parts[:2]) + "\n" + "_".join(parts[2:4])


def outcome_focused_layout(graph: nx.DiGraph) -> dict[str, tuple[float, float]]:
    """Place abundance as the focal outcome and arrange drivers around it."""
    pos: dict[str, tuple[float, float]] = {}

    outcomes = [n for n in ["abundance"] if n in graph]
    intermediates = [n for n in ["habitat suitability", "population performance"] if n in graph]
    drivers = [n for n in graph.nodes if n not in set(outcomes + intermediates)]

    if "abundance" in graph:
        pos["abundance"] = (0.0, 0.0)

    if intermediates:
        y_values = np.linspace(1.0, -1.0, len(intermediates))
        for node, y in zip(intermediates, y_values):
            pos[node] = (-1.6, float(y))

    direct_drivers = [n for n in drivers if graph.has_edge(n, "abundance")]
    indirect_drivers = [n for n in drivers if n not in direct_drivers]

    def place_nodes(nodes: list[str], x: float, y_top: float, y_bottom: float) -> None:
        if not nodes:
            return
        y_values = np.linspace(y_top, y_bottom, len(nodes))
        for node, y in zip(nodes, y_values):
            pos[node] = (x, float(y))

    # Direct abundance drivers are closest to the focal outcome. Indirect drivers
    # sit further left, feeding through habitat or population-performance nodes.
    place_nodes(sorted(direct_drivers), -1.9, 1.9, -1.9)
    place_nodes(sorted(indirect_drivers), -3.6, 2.2, -2.2)

    # Fall back for any unexpected isolated node.
    missing = [n for n in graph.nodes if n not in pos]
    if missing:
        fallback = nx.spring_layout(graph.subgraph(missing), seed=RANDOM_SEED)
        for node, xy in fallback.items():
            pos[node] = (float(xy[0]) - 3.0, float(xy[1]))

    return pos


def draw_effect_edges(graph: nx.DiGraph, pos: dict[str, tuple[float, float]]) -> None:
    positive_edges = [(u, v) for u, v, a in graph.edges(data=True) if a.get("effect") == "positive"]
    negative_edges = [(u, v) for u, v, a in graph.edges(data=True) if a.get("effect") == "negative"]
    other_edges = [(u, v) for u, v, a in graph.edges(data=True) if a.get("effect") not in {"positive", "negative"}]

    nx.draw_networkx_edges(
        graph, pos, edgelist=positive_edges, arrows=True, arrowstyle="-|>",
        width=2.0, connectionstyle="arc3,rad=0.04"
    )
    nx.draw_networkx_edges(
        graph, pos, edgelist=negative_edges, arrows=True, arrowstyle="-|>",
        width=2.0, style="dashed", connectionstyle="arc3,rad=0.04"
    )
    nx.draw_networkx_edges(
        graph, pos, edgelist=other_edges, arrows=True, arrowstyle="-|>",
        width=1.5, style="dotted", connectionstyle="arc3,rad=0.04"
    )

    # Use + and - instead of full words; it keeps the graph closer to a causal
    # diagram while retaining the direction of effect.
    edge_labels = {}
    for u, v, attrs in graph.edges(data=True):
        effect = attrs.get("effect", "")
        edge_labels[(u, v)] = {"positive": "+", "negative": "-", "mixed": "+/-"}.get(effect, "?")
    nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels, font_size=9, label_pos=0.55)


def plot_species_graph(species_id: str, save: bool = True) -> None:
    graph = build_species_graph(model_augmented_edges_df, species_id)
    if graph.number_of_edges() == 0:
        print("No graph edges for", species_id)
        return

    canonical = model_augmented_edges_df.loc[model_augmented_edges_df["species_id"] == species_id, "canonical"].iloc[0]
    pos = outcome_focused_layout(graph)

    plt.figure(figsize=(10, 6.5))

    outcome_nodes = [n for n, a in graph.nodes(data=True) if a.get("node_type") == "outcome"]
    intermediate_nodes = [n for n, a in graph.nodes(data=True) if a.get("node_type") == "intermediate"]
    driver_nodes = [n for n, a in graph.nodes(data=True) if a.get("node_type") == "driver"]

    nx.draw_networkx_nodes(graph, pos, nodelist=driver_nodes, node_size=700, alpha=0.18)
    nx.draw_networkx_nodes(graph, pos, nodelist=intermediate_nodes, node_size=1600, alpha=0.25)
    nx.draw_networkx_nodes(graph, pos, nodelist=outcome_nodes, node_size=2100, alpha=0.35)

    draw_effect_edges(graph, pos)

    labels = {node: display_label(node) for node in graph.nodes}
    nx.draw_networkx_labels(
        graph,
        pos,
        labels=labels,
        font_size=10,
        bbox={"boxstyle": "round,pad=0.25", "facecolor": "white", "edgecolor": "none", "alpha": 0.85},
    )

    plt.title(f"Candidate evidence-linked DAG: {canonical}")
    plt.text(
        0.01, 0.01,
        "solid = positive, dashed = negative, dotted = mixed/unknown",
        transform=plt.gca().transAxes,
        fontsize=9,
        va="bottom",
    )
    plt.axis("off")
    plt.tight_layout()

    if save:
        safe_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", canonical)
        path = VIS_DIR / f"{safe_name}_candidate_dag.png"
        plt.savefig(path, dpi=180, bbox_inches="tight")
        print("Saved:", path)

    plt.show()


species_with_edges = dag_diagnostics_df.loc[dag_diagnostics_df["n_edges"] > 0, "species_id"].tolist()
if species_with_edges:
    plot_species_graph(species_with_edges[2])
else:
    print("No species graphs to plot.")

Saved: outputs_plant_atlas_llm_to_dag/visualisations/Lagurus_ovatus_candidate_dag.png

../../../_images/09b950e0b7d63925706139029342c93205459ccde7b4e91bdd7f013ebcf299a7.png

# Save all species DAG plots for review using the same outcome-focused layout.
saved = []
for species_id in species_with_edges:
    graph = build_species_graph(model_augmented_edges_df, species_id)
    if graph.number_of_edges() == 0:
        continue

    canonical = model_augmented_edges_df.loc[model_augmented_edges_df["species_id"] == species_id, "canonical"].iloc[0]
    safe_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", canonical)
    path = VIS_DIR / f"{safe_name}_candidate_dag.png"

    pos = outcome_focused_layout(graph)
    plt.figure(figsize=(10, 6.5))

    outcome_nodes = [n for n, a in graph.nodes(data=True) if a.get("node_type") == "outcome"]
    intermediate_nodes = [n for n, a in graph.nodes(data=True) if a.get("node_type") == "intermediate"]
    driver_nodes = [n for n, a in graph.nodes(data=True) if a.get("node_type") == "driver"]

    nx.draw_networkx_nodes(graph, pos, nodelist=driver_nodes, node_size=700, alpha=0.18)
    nx.draw_networkx_nodes(graph, pos, nodelist=intermediate_nodes, node_size=1600, alpha=0.25)
    nx.draw_networkx_nodes(graph, pos, nodelist=outcome_nodes, node_size=2100, alpha=0.35)
    draw_effect_edges(graph, pos)
    nx.draw_networkx_labels(
        graph,
        pos,
        labels={node: display_label(node) for node in graph.nodes},
        font_size=10,
        bbox={"boxstyle": "round,pad=0.25", "facecolor": "white", "edgecolor": "none", "alpha": 0.85},
    )
    plt.title(f"Candidate evidence-linked DAG: {canonical}")
    plt.axis("off")
    plt.tight_layout()
    plt.savefig(path, dpi=180, bbox_inches="tight")
    plt.close()
    saved.append(path)

print(f"Saved {len(saved)} DAG plot(s) to {VIS_DIR}")

Saved 5 DAG plot(s) to outputs_plant_atlas_llm_to_dag/visualisations

Expert-review table#

This table is the main hand-off from the automated workflow to ecological review. It keeps the species, source node, target node, effect direction, evidence quotes, confidence summaries and assumption flags in one place.

The LLM output is not treated as final. The table is designed so that a reviewer can accept, edit, merge or reject candidate edges while seeing the source evidence.

review_cols = [
    "species_id",
    "canonical",
    "edge_origin",
    "source_canonical",
    "target_canonical",
    "effect",
    "source_types",
    "mean_confidence",
    "max_confidence",
    "n_evidence_sentences",
    "source_raw_examples",
    "source_canonical_examples",
    "semantic_cluster_ids",
    "sentence_ids",
    "evidence_quotes",
    "assumption",
    "expert_decision",
    "expert_source_edit",
    "expert_target_edit",
    "expert_effect_edit",
    "expert_notes",
]

expert_review_df = model_augmented_edges_df.copy()
for col in ["expert_decision", "expert_source_edit", "expert_target_edit", "expert_effect_edit", "expert_notes"]:
    expert_review_df[col] = ""

expert_review_df = expert_review_df[[c for c in review_cols if c in expert_review_df.columns]]
expert_review_df.to_csv(OUTPUT_DIR / "expert_review_edges.csv", index=False)

display(expert_review_df.head(20))
print("Expert-review CSV:", OUTPUT_DIR / "expert_review_edges.csv")

	species_id	canonical	edge_origin	source_canonical	target_canonical	effect	source_types	mean_confidence	max_confidence	n_evidence_sentences	...	source_canonical_examples	semantic_cluster_ids	sentence_ids	evidence_quotes
0	demo_001	Kickxia spuria	text_extracted	arable_margins	abundance	negative	disturbance	1.0	1.0	1	...	arable_margins	disturbance_001	4	but loss of open cultivated margins remains a ...
1	demo_001	Kickxia spuria	text_extracted	calcareous_soil	abundance	positive	soil_substrate	1.0	1.0	1	...	calcareous_soil	soil_substrate_002	0	most often on light calcareous soils.
2	demo_001	Kickxia spuria	text_extracted	crop	habitat suitability	positive	land_use	1.0	1.0	1	...	crop	land_use_001	1	It can persist where the crop is open and wher...
3	demo_001	Kickxia spuria	text_extracted	cultivation	habitat suitability	positive	disturbance	1.0	1.0	1	...	cultivation	disturbance_002	1	It can persist where the crop is open and wher...
4	demo_001	Kickxia spuria	text_extracted	dense_vegetation	habitat suitability	negative	habitat_structure	1.0	1.0	1	...	dense_vegetation	habitat_structure_004	2	Dense competitive vegetation and intensive her...
5	demo_001	Kickxia spuria	text_extracted	herbicide_use	habitat suitability	negative	management	1.0	1.0	1	...	herbicide_use	management_002	2	Dense competitive vegetation and intensive her...
6	demo_001	Kickxia spuria	text_extracted	southern_temperate	abundance	positive	climate	1.0	1.0	1	...	southern_temperate	climate_002	5	A southern-temperate archaeophyte in Britain a...
7	demo_002	Koenigia islandica	text_extracted	cool_upland_situations	habitat suitability	positive	habitat_structure	1.0	1.0	1	...	cool_upland_situations	habitat_structure_003	0	in cool upland situations.
8	demo_002	Koenigia islandica	text_extracted	flourishing_conditions	habitat suitability	positive	topography	0.9	0.9	1	...	flourishing_conditions	topography_001	3	Small colonies may fluctuate between years.
9	demo_002	Koenigia islandica	text_extracted	reliable_moisture	habitat suitability	positive	hydrology	1.0	1.0	1	...	reliable_moisture	hydrology_001	1	It depends on short, open vegetation and relia...
10	demo_002	Koenigia islandica	text_extracted	short_vegetation	habitat suitability	positive	habitat_structure	1.0	1.0	1	...	short_vegetation	habitat_structure_006	1	It depends on short, open vegetation and relia...
11	demo_002	Koenigia islandica	text_extracted	warmer_drier_summers	habitat suitability	negative	climate	1.0	1.0	2	...	warmer_drier_summers	climate_003	4; 5	warmer drier summers could increase stress on ...
12	demo_003	Lagurus ovatus	text_extracted	bare_sand	habitat suitability	positive	habitat_structure; soil_substrate	1.0	1.0	2	...	bare_sand	habitat_structure_001; soil_substrate_001	0; 1	It is favored by bare sand and warm open micro...
13	demo_003	Lagurus ovatus	text_extracted	closed_dune_grassland	habitat suitability	negative	habitat_structure	1.0	1.0	1	...	closed_dune_grassland	habitat_structure_002	2	Closed dune grassland and lack of disturbance ...
14	demo_003	Lagurus ovatus	text_extracted	seed_supply	population performance	positive	disturbance	1.0	1.0	1	...	seed_supply	disturbance_004	5	Continued seed supply from gardens and disturb...
15	demo_003	Lagurus ovatus	text_extracted	severe_winter_cold	habitat suitability	negative	climate	1.0	1.0	1	...	severe_winter_cold	climate_001	4	while severe winter cold may limit establishme...
16	demo_003	Lagurus ovatus	text_extracted	severe_winter_cold	population performance	negative	disturbance	1.0	1.0	2	...	severe_winter_cold	disturbance_005	3; 5	severe winter cold may limit establishment inl...
17	demo_003	Lagurus ovatus	text_extracted	warm_microsites	habitat suitability	positive	habitat_structure	1.0	1.0	1	...	warm_microsites	habitat_structure_008	1	is favoured by bare sand and warm open microsites
18	demo_004	Lamium confertum	text_extracted	managed_cultivated_ground	abundance	negative	habitat_structure	1.0	1.0	1	...	managed_cultivated_ground	habitat_structure_005	3	Declines are consistent with the loss of light...
19	demo_004	Lamium confertum	text_extracted	modern_weed_control	abundance	negative	management	1.0	1.0	1	...	modern_weed_control	management_003	2	Modern weed control, autumn-sown dense crops a...

20 rows × 21 columns

Expert-review CSV: outputs_plant_atlas_llm_to_dag/expert_review_edges.csv

Run summary and metadata#

The summary table gives one row per species, and the metadata JSON records the key settings used in the run. This is important for LLM-assisted workflows because changes in the model, prompt, token limit or batch settings can change the extracted records.

summary_rows = []

for _, species in species_text.iterrows():
    species_id = species["species_id"]
    summary_rows.append(
        {
            "species_id": species_id,
            "canonical": species["canonical"],
            "n_sentences": int((sentences_df["species_id"] == species_id).sum()),
            "n_raw_candidate_edges": int((raw_edges_df["species_id"] == species_id).sum()) if len(raw_edges_df) else 0,
            "n_validated_dag_edges": int((clustered_edges_df["species_id"] == species_id).sum()) if len(
                clustered_edges_df) else 0,
            "n_aggregated_text_edges": int((agg_edges_df["species_id"] == species_id).sum()) if len(
                agg_edges_df) else 0,
            "n_augmented_dag_edges": int((model_augmented_edges_df["species_id"] == species_id).sum()) if len(
                model_augmented_edges_df) else 0,
        }
    )

summary_df = pd.DataFrame(summary_rows).merge(
    dag_diagnostics_df[["species_id", "n_nodes", "n_edges", "is_dag"]],
    on="species_id",
    how="left",
)

summary_df.to_csv(OUTPUT_DIR / "run_summary.csv", index=False)

_NOTEBOOK_END_TIME = dt.datetime.now()
_NOTEBOOK_ELAPSED = _NOTEBOOK_END_TIME - _NOTEBOOK_START_TIME
elapsed_minutes = round(_NOTEBOOK_ELAPSED.total_seconds() / 60, 2)

run_metadata = {
    "created_at": _NOTEBOOK_END_TIME.isoformat(timespec="seconds"),
    "elapsed_minutes": elapsed_minutes,
    "run_config_hash": RUN_CONFIG_HASH,
    "run_config": RUN_CONFIG,
    "python_version": os.sys.version,
    "pandas_version": pd.__version__,
    "numpy_version": np.__version__,
    "networkx_version": nx.__version__,
    "output_dir": str(OUTPUT_DIR),
    "development_hardware": {
        "gpu": "NVIDIA GeForce RTX 3060 Laptop GPU",
        "gpu_memory_mib": 6144,
        "nvidia_smi": "580.126.09",
        "driver_version": "580.126.09",
        "cuda_version_reported_by_nvidia_smi": "13.0",
    },
}

with open(OUTPUT_DIR / "run_metadata.json", "w", encoding="utf-8") as f:
    json.dump(run_metadata, f, indent=2)

display(summary_df)
print(f"Notebook elapsed time: {elapsed_minutes} minutes")
print("Run metadata:", OUTPUT_DIR / "run_metadata.json")

	species_id	canonical	n_sentences	n_raw_candidate_edges	n_validated_dag_edges	n_aggregated_text_edges	n_augmented_dag_edges	n_nodes	n_edges	is_dag
0	demo_001	Kickxia spuria	6	8	7	7	8	9	8	True
1	demo_002	Koenigia islandica	6	6	6	5	6	7	6	True
2	demo_003	Lagurus ovatus	6	9	8	6	8	8	8	True
3	demo_004	Lamium confertum	6	6	4	4	5	5	5	True
4	demo_005	Larix decidua	6	8	8	7	8	8	8	True

Notebook elapsed time: 7.87 minutes
Run metadata: outputs_plant_atlas_llm_to_dag/run_metadata.json

Output files#

The notebook writes the following files:

species_text_table.csv: species-level source text used in the run;
all_sentences.csv: sentence-level evidence units with neighbouring context;
baseline_deterministic_candidate_edges.csv: simple rule-based extraction used only as a sanity check;
all_raw_candidate_edges.csv: parsed LLM candidate edges;
llm_extraction_audit.csv: sentence-level LLM parsing audit;
all_candidate_edges_validated.csv: candidate rows with validation flags;
all_candidate_edges_clustered.csv: candidate rows with simple phrase-cluster identifiers;
driver_semantic_clusters.csv: one row per simple driver phrase cluster;
all_dag_edges.csv: aggregated sentence-level evidence as graph edges;
all_dag_edges_with_bridge_edges.csv: aggregated edges plus transparent bridge edges to abundance;
dag_diagnostics.csv: node and edge counts and cycle checks;
all_dag_nodes.csv: graph nodes and node types;
expert_review_edges.csv: the main review hand-off table;
run_summary.csv: one-row-per-species summary;
run_metadata.json: model path, generation settings, run hash and development hardware metadata;
visualisations/*.png: simple graph plots for species with extracted edges.

The review table is the main product. The graph plots and diagnostics help spot obvious problems before detailed ecological review.

Conclusions#

This notebook shows a practical route from species-account prose to evidence-linked candidate graph edges using a small local LLM. The design keeps the model task narrow: one sentence, one prompt, one JSON schema and explicit validation afterwards.

For the synthetic Plant Atlas-style example, the output should be read as a candidate review table rather than accepted ecological knowledge. The model proposes rows; the analyst decides whether the evidence is useful, whether the direction is plausible and whether the proposed edge belongs in a causal diagram.

The workflow is intentionally modular. If a different local model is used, only the model-loading and prompt-tuning parts should need attention. If a different text source is used, the same sentence-grounded extraction and validation pattern may still be useful.

References#

Boyd, R. J., Botham, M., Dennis, E., Fox, R., Harrower, C., Middlebrook, I., Roy, D. B., & Pescott, O. L. (2025). Using causal diagrams and superpopulation models to correct geographic biases in biodiversity monitoring data. Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210X.14492

Botanical Society of Britain and Ireland. (2023). Plant Atlas 2020. https://plantatlas2020.org/

Hugging Face. Transformers documentation: bitsandbytes quantization. https://huggingface.co/docs/transformers/quantization/bitsandbytes

Hugging Face. Transformers installation/offline mode documentation. https://huggingface.co/docs/transformers/installation#offline-mode

Torrance, A. W. (2024). The environmental impacts of large language models. Scientific Reports, 14, 25020. https://doi.org/10.1038/s41598-024-76682-6