hra-ftu-rag-supporting-information

AI-Supported Extraction of Functional Tissue Unit Properties for Human Reference Atlas Construction


The Human Reference Atlas (HRA) effort brings together experts from more than 25 international consortia to capture the multiscale organization of the human body, from large anatomical organ systems (macro) to the single-cell level (micro). At the meso scale, experts have detailed functional tissue units (FTUs) in 10 organs and created 2D illustrations for them. Comprehensive FTU property characterization is essential for the HRA, but manual review of the vast number of scholarly publications is impractical. Here, we introduce Large-Model Retrieval-Augmented Generation for HRA FTUs (HRAftu-LM-RAG), an AI-driven framework for scalable, automated extraction of FTU-relevant properties from scholarly publications. This validated framework integrates Large Language Models for textual reasoning, Large Vision Models for visual interpretation, and Retrieval-Augmented Generation for knowledge grounding, offering a balanced trade-off between accuracy and processing efficiency. We retrieved 244,640 PubMed Central publications containing 1,389,168 figures for 22 FTUs and identified 617,237 figures with microscopy and schematic images. From these images and their associated text, we automatically extracted 331,189 scale bars and 1,719,138 biological entity mentions, along with donor metadata such as sex and age.

Introduction

This repository provides the supporting code and data for the paper “AI-Supported Extraction of Functional Tissue Unit Properties for Human Reference Atlas Construction”. It details the robust and scalable computation of scholarly evidence for the size, structure, and demographic differences of FTUs, which facilitates the design and approval of future FTU illustrations during HRA construction.
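
The core retrieval-augmented pattern, retrieving relevant publication text and using it to ground a language-model prompt, can be illustrated with a toy example. The sketch below is illustrative only and not the paper's implementation: the captions, the TF-IDF retriever, and the prompt wording are all placeholder assumptions.

    # Toy sketch of retrieval-augmented prompting: ground an FTU question in the
    # most similar caption before asking an LLM. Illustrative only -- the corpus,
    # retriever (TF-IDF here), and prompt are placeholders, not the paper's setup.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    captions = [
        "Renal corpuscle, PAS stain, scale bar 50 um.",
        "Schematic of the hepatic lobule with portal triads.",
        "Western blot of whole-kidney lysates.",
    ]
    query = "What is the scale bar for the renal corpuscle image?"

    vectorizer = TfidfVectorizer().fit(captions + [query])
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(captions))[0]
    context = captions[scores.argmax()]  # top-1 retrieved caption

    prompt = f"Context: {context}\nQuestion: {query}\nAnswer with value and units."
    print(prompt)  # this grounded prompt is what would be sent to the LLM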

Workflow Diagram


Workflow Summary

| Workflow | Input | Algorithm/Script | Output |
|----------|-------|------------------|--------|
| WF1: Image-type categorization | ftu_pub_pmc table: pmcid, graphic, caption, label, file_path; image_refs table: ref_text | 2-3-itype-run.py | Raw outputs from vision models (llama, llava, phi3, phi35, pixtral); final image-type labels stored in vision_llm table: micro, statis, schema, 3d, chem, math |
| WF2: In-image text-term extraction | image_node_lvm_total table: pmcid, graphic, file_path | 3-2-lvm-entity-run.py | Extracted in-image text terms stored in image_node_lvm_total table under the nodes field |
| WF3: Scale-bar extraction | ftu_pub_pmc table: pmcid, graphic, caption; image_refs table: ref_text; filter: vision_llm.micro = "Yes" | 4-2-sb-run.py | Scale-bar information stored in scale_bar_all_info table: Descriptor Type, Value, Units, Notes, Panel |
| WF4: Donor-metadata extraction | ftu_pub_pmc table: pmcid, graphic, caption; publication_summary table: abstract; image_node_lvm_total table: nodes; filter: vision_llm.micro = "Yes" or vision_llm.schema = "Yes" | 5-1-donor-run.py | Donor metadata stored in donor_meta_all_info table: species, sex, age, BMI, height, weight |
| WF5: AS+CT+B extraction | ftu_pub_pmc table: pmcid, graphic, caption; image_node_lvm_total table: nodes; filter: vision_llm.micro = "Yes" or vision_llm.schema = "Yes" | 6-1-bio-onto-run.py | Biological entity terms stored in bio_onto_all_info table: entity (AS, CT, B) |
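
For example, WF3 consumes only figures that WF1 flagged as microscopy. The snippet below is a hypothetical sketch of how those inputs could be assembled, assuming the tables live in a SQLite file; the file name hra_ftu.db and the pmcid/graphic join keys are placeholders, and the actual storage layer is described in docs/.

    # Hypothetical assembly of WF3 inputs. Assumes a SQLite backend; the file
    # name and join keys are placeholders (see docs/ for the real storage layer).
    import sqlite3

    conn = sqlite3.connect("hra_ftu.db")  # placeholder path
    rows = conn.execute("""
        SELECT p.pmcid, p.graphic, p.caption, r.ref_text
        FROM ftu_pub_pmc AS p
        JOIN image_refs AS r ON r.pmcid = p.pmcid
        JOIN vision_llm AS v ON v.pmcid = p.pmcid AND v.graphic = p.graphic
        WHERE v.micro = 'Yes'  -- WF3 filter from the table above
    """).fetchall()
    print(f"{len(rows)} microscopy figures queued for scale-bar extraction")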

Repository Structure

├── data      # Input and output datasets, plus test data for validation
├── docs      # Detailed documentation (architecture, deployment, usage)
├── src       # Source code for data fetching, LLM/Vision pipelines, similarity scripts
└── vis       # Generated SVG figures used in the paper

Installation

For full installation instructions, environment setup, and dependency management, see docs/installation.md.

Quick Start:

  1. Clone the repository:

    git clone https://github.com/cns-iu/hra-ftu-rag-supporting-information.git
    cd hra-ftu-rag-supporting-information
    
  2. Follow the steps in docs/installation.md to install system dependencies (Docker, Python, CUDA), set up containers, and verify services.
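
As a quick sanity check before running the pipelines, you can confirm the system dependencies are on your PATH. This is a minimal sketch only; the authoritative verification steps are in docs/installation.md.

    # Minimal pre-flight check for the system dependencies named above;
    # authoritative setup and verification steps are in docs/installation.md.
    import shutil

    for tool in ("docker", "python3", "nvidia-smi"):  # nvidia-smi implies a CUDA driver
        path = shutil.which(tool)
        print(f"{tool:>10}: {path or 'NOT FOUND'}")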

Usage

Usage examples, command-line options, and configuration details are provided in docs/usage.md.

Common Workflows:

The five extraction workflows (WF1–WF5) summarized in the Workflow Summary table can be run individually or chained end to end; a hypothetical driver is sketched below.
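
The sketch chains the five workflow scripts in dependency order. It assumes the scripts live under src/ and run with default arguments from the repository root; the real command-line options are documented in docs/usage.md.

    # Hypothetical end-to-end driver for WF1-WF5, run from the repository root.
    # Assumes each script accepts default arguments; see docs/usage.md for the
    # actual options.
    import subprocess

    workflow_scripts = [
        "src/2-3-itype-run.py",       # WF1: image-type categorization
        "src/3-2-lvm-entity-run.py",  # WF2: in-image text-term extraction
        "src/4-2-sb-run.py",          # WF3: scale-bar extraction
        "src/5-1-donor-run.py",       # WF4: donor-metadata extraction
        "src/6-1-bio-onto-run.py",    # WF5: AS+CT+B extraction
    ]
    for script in workflow_scripts:
        subprocess.run(["python", script], check=True)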

Documentation

Access architectural diagrams, deployment guides, and model-specific instructions in the docs/ folder.

Data

The data/ directory contains the input and output datasets for the workflows, plus test data for validation.

Note: Due to size constraints, raw data files are not stored in this repository. Please follow the data preparation steps in docs/installation.md to download and extract the required archives.

Visualization

The vis/ directory holds SVG images generated from analysis results, matching the figures in the paper. These include scale bar overlays, FTU schematics, and demographic plots.

License

This project is licensed under the MIT License. See LICENSE for details.