The Human Reference Atlas (HRA) effort brings together experts from more than 25 international consortia to capture the multiscale organization of the human body—from large anatomical organ systems (macro) to the single-cell level (micro). Functional tissue units (FTUs, meso) in 10 organs have been detailed, and 2D illustrations have been created by experts. Comprehensive characterization of FTU properties is essential for the HRA, but manual review of the vast number of scholarly publications is impractical. Here, we introduce Large-Model Retrieval-Augmented Generation for HRA FTUs (HRAftu-LM-RAG), an AI-driven framework for scalable and automated extraction of FTU-relevant properties from scholarly publications. This validated framework integrates Large Language Models for textual reasoning, Large Vision Models for visual interpretation, and Retrieval-Augmented Generation for knowledge grounding, offering a balanced trade-off between accuracy and processing efficiency. We retrieved 244,640 PubMed Central publications containing 1,389,168 figures for 22 FTUs and identified 617,237 figures with microscopy and schematic images. From these images and associated text, we automatically extracted 331,189 scale bars and 1,719,138 biological entity mentions, along with donor metadata such as sex and age.
This repository provides the supporting code and data for the paper “AI-Supported Extraction of Functional Tissue Unit Properties for Human Reference Atlas Construction.” It details the robust and scalable computation of scholarly evidence for the size, structure, and demographic differences of FTUs, facilitating the design and approval of future FTU illustrations during HRA construction.
| Workflow | Input | Algorithm/Script | Output |
|---|---|---|---|
| WF1: Image-type categorization | ftu_pub_pmc table: pmcid, graphic, caption, label, file_path; image_refs table: ref_text | 2-3-itype-run.py | Raw outputs from the vision models (llama, llava, phi3, phi35, pixtral); final image-type labels stored in the vision_llm table: micro, statis, schema, 3d, chem, math |
| WF2: In-image text-term extraction | image_node_lvm_total table: pmcid, graphic, file_path | 3-2-lvm-entity-run.py | Extracted in-image text terms stored in the image_node_lvm_total table under the nodes field |
| WF3: Scale-bar extraction | ftu_pub_pmc table: pmcid, graphic, caption; image_refs table: ref_text; filter: vision_llm.micro = "Yes" | 4-2-sb-run.py | Scale-bar information stored in the scale_bar_all_info table: Descriptor Type, Value, Units, Notes, Panel |
| WF4: Donor-metadata extraction | ftu_pub_pmc table: pmcid, graphic, caption; publication_summary table: abstract; image_node_lvm_total table: nodes; filter: vision_llm.micro = "Yes" or vision_llm.schema = "Yes" | 5-1-donor-run.py | Donor metadata stored in the donor_meta_all_info table: species, sex, age, BMI, height, weight |
| WF5: AS+CT+B extraction | ftu_pub_pmc table: pmcid, graphic, caption; image_node_lvm_total table: nodes; filter: vision_llm.micro = "Yes" or vision_llm.schema = "Yes" | 6-1-bio-onto-run.py | Biological entity terms stored in the bio_onto_all_info table: entity (AS, CT, B) |
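The workflow inputs above are, in effect, table joins with yes/no filters. As a concrete illustration, the snippet below sketches the WF3 input selection (microscopy figures only, `vision_llm.micro = "Yes"`) against an in-memory SQLite database with toy rows. This is a minimal sketch under assumptions: the repository's actual storage backend may differ, and the table definitions here are abbreviated from the columns listed above.

```python
import sqlite3

# Minimal in-memory stand-ins for the ftu_pub_pmc and vision_llm tables
# (illustrative schemas only, abbreviated from the workflow table above).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ftu_pub_pmc (pmcid TEXT, graphic TEXT, caption TEXT);
CREATE TABLE vision_llm (pmcid TEXT, graphic TEXT, micro TEXT, schema TEXT);
INSERT INTO ftu_pub_pmc VALUES
  ('PMC0000001', 'fig1', 'H&E stained glomerulus, scale bar 50 um'),
  ('PMC0000002', 'fig2', 'Bar chart of cell counts');
INSERT INTO vision_llm VALUES
  ('PMC0000001', 'fig1', 'Yes', 'No'),
  ('PMC0000002', 'fig2', 'No', 'No');
""")

# WF3 input: keep only figures the vision models labeled as microscopy.
rows = conn.execute("""
    SELECT p.pmcid, p.graphic, p.caption
    FROM ftu_pub_pmc AS p
    JOIN vision_llm AS v ON v.pmcid = p.pmcid AND v.graphic = p.graphic
    WHERE v.micro = 'Yes'
""").fetchall()

print(rows)  # only the microscopy figure survives the filter
```

WF4 and WF5 follow the same pattern with an `OR` condition over the micro and schema labels.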
├── data # Input and output datasets, plus test data for validation
├── docs # Detailed documentation (architecture, deployment, usage)
├── src # Source code for data fetching, LLM/Vision pipelines, similarity scripts
├── vis # Generated SVG figures used in the paper
src/: Organized into submodules for data fetching, the LLM/Vision pipelines, and similarity scripts.
For full installation instructions, environment setup, and dependency management, see docs/installation.md.
Quick Start:
Clone the repository:
git clone https://github.com/cns-iu/hra-ftu-rag-supporting-information.git
cd hra-ftu-rag-supporting-information
Follow the steps in docs/installation.md to install system dependencies (Docker, Python, CUDA), set up containers, and verify services.
Usage examples, command-line options, configuration details, and common workflows are provided in docs/usage.md.
Access architectural diagrams, deployment guides, and model-specific instructions in the docs/ folder:
- docs/architecture.md: Overview of the system architecture, including module interactions, data flow diagrams, and component responsibilities.
- docs/installation.md: Step-by-step instructions for environment setup, dependency installation, Docker and CUDA configuration, and verification tests.
- docs/llm_deployment.md: Deployment of Large Language Models, including model selection criteria, API configuration, and performance optimization strategies.
- docs/lvm_deployment.md: The vision-language model pipeline setup, detailing image preprocessing, model integration, and inference procedures.
- docs/usage.md: Common workflows with example commands, configuration parameters, and expected outputs for core functionalities such as data import, FTU extraction, and similarity evaluation.

The data/ directory contains the following subdirectories and files:
- bio-onto/
  - bio-onto-prompt.csv: Prompts for biological ontology extraction.
  - bio-onto-test-answer.csv: Expected answers for ontology tests.
  - selected_prompt.txt: The chosen prompt template.
- donor-meta/
  - prompt-donor.csv: Prompts for donor metadata extraction.
  - donor-test-answer.csv: Expected donor metadata answers.
  - selected_prompt.txt: The chosen prompt template.
  - age/, age_yearold/, bmi/, sex/, species/: Each contains <metric>_1.csv (sample data) and round.csv (processing scripts for rounding values).
- emb/
  - test_questions.csv: Embedding-based similarity test questions.
- img-entity/
  - lvm-entity-prompt.csv: Prompts for LVM-based entity extraction.
  - lvm-entity-testdata.tar.gz: Compressed test images for entity extraction.
  - lvm-test-answer.csv: Expected outputs for image-entity tasks.
  - selected-prompt.txt: The chosen image-entity prompt template.
- img-type/
  - prompt.txt: Prompt template for image-type classification.
  - test_img_info.json.gz: JSON test file with image metadata.
  - img-type-test-answer.csv: Expected classification results.
- input-data/
  - 0-0-ftu-pmc-total.tar.gz: Complete FTU–PMC dataset archive.
  - 0-1-oa-comm-ftu-pmcid-filepath.tar.gz: Subset with FTU–PMC ID to filepath mappings.
  - ftu-description-from-bioportal.csv: FTU descriptions imported from BioPortal.
  - organ-ftu-uberon.csv: Mapping of organs to UBERON FTU terms.
  - pmc_result_<FTU name>.txt
- scale-bar/
  - scale-bar-prompts.csv: Prompts for scale bar detection and extraction.
  - scale-bar-sample.csv: Sample output file illustrating extracted scale bar values.
  - selected_prompt.txt: The chosen scale-bar prompt template.
  - Unit files (um.csv, cm.csv, m.csv, etc.).
- vis-source-data/
  - Source data for the figures in vis/. Follow the processing pipelines in docs/usage.md to regenerate visualizations.

Note: Due to size constraints, raw data files are not stored in this repository. Please follow the data preparation steps in docs/installation.md to download and extract the required archives.
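The per-unit files in scale-bar/ (um.csv, cm.csv, m.csv, etc.) indicate that extracted scale-bar values arrive in mixed units. Below is a minimal sketch of normalizing such Value/Units pairs to micrometers for comparison across figures; the function name `to_micrometers` and the conversion table are illustrative, not part of the repository's API.

```python
# Conversion factors to micrometers for common scale-bar units.
# The unit spellings mirror the per-unit files (um.csv, cm.csv, m.csv, ...).
UM_PER_UNIT = {
    "nm": 1e-3,
    "um": 1.0,
    "mm": 1e3,
    "cm": 1e4,
    "m": 1e6,
}

def to_micrometers(value: float, units: str) -> float:
    """Convert a scale-bar length to micrometers (hypothetical helper)."""
    try:
        return value * UM_PER_UNIT[units.strip().lower()]
    except KeyError:
        raise ValueError(f"unrecognized scale-bar unit: {units!r}")

print(to_micrometers(50, "um"))  # 50.0
print(to_micrometers(2, "mm"))   # 2000.0
```

A real pipeline would also need to handle unit spellings as they appear in figures (e.g., "µm" vs. "um"), which is why per-unit files are kept separate here.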
The vis/ directory holds SVG images generated from analysis results, matching the figures in the paper. These include scale bar overlays, FTU schematics, and demographic plots.
This project is licensed under the MIT License. See LICENSE for details.