Supporting Information for "Human BioMolecular Atlas Program (HuBMAP): 3D Human Reference Atlas Construction and Usage"

Katy Börner1,2*, Philip D. Blood3, Jonathan C. Silverstein4, Matthew Ruffalo5, Rahul Satija6, Sarah A. Teichmann2,7, Gloria Pryhuber8, Ravi Misra8, Jeffrey Purkerson8, Jean Fan9, John W. Hickey10, Gesmira Molla6, Chuan Xu7, Yun Zhang11, Griffin Weber12, Yashvardhan Jain1, Danial Qaurooni1, Yongxin Kong1, HRA Team, Andreas Bueckle1*, Bruce W. Herr II1*

1 Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
2 CIFAR MacMillan Multiscale Human program, CIFAR, Toronto, Canada
3 Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, PA, USA
4 Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
5 Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
6 New York Genome Center, New York, NY, USA
7 Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
8 University of Rochester Medical Center, Rochester, NY, USA
9 Department of Biomedical Engineering, Johns Hopkins University, Baltimore MD, USA
10 Department of Biomedical Engineering, Duke University, Durham, NC, USA
11 J. Craig Venter Institute, La Jolla, CA, USA
12 Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

* Corresponding authors:
Katy Börner, katy@iu.edu
Andreas Bueckle, abueckle@iu.edu
Bruce W. Herr II, bherr@iu.edu


Link to Preprint
Link to HuBMAP Consortium Website
Link to HuBMAP Portal
Link to HRA Portal
Link to GitHub Repository


Flexible hybrid cloud microservices architecture

HuBMAP Flexible Hybrid Cloud Microservices Architecture:

Systems, data download, tools, containers, and APIs operate via Globus tokens passed through the API Gateway (HuBMAP Gateway) on every call.

Legend: Components are classified as Resource, API, or Application. Resources that run on Amazon Web Services (AWS) are in yellow. Resources that run on-prem at the Pittsburgh Supercomputing Center (PSC) are in blue. Resources that are run by Globus are in red.

Globus Auth (Authentication and Authorization): Globus Auth is the OAuth2 authentication and authorization service published by Globus (not hosted by HuBMAP). It is used for login, relying on the identity provider of the user’s home institution for authentication to retrieve user tokens, and then to tie users to HuBMAP-maintained groups for authorization. In a future integration with the NIH Researcher Auth Service (RAS), users will also, via single sign-on, be associated with their dbGaP authorizations.
On-Prem File Store and Compute Resources: The File Store and Compute Resources are hosted on dedicated hardware at the Pittsburgh Supercomputing Center, including raw and processed data managed in Globus endpoints with distinct security for public, consortium, and protected data. Databases are in AWS or at PSC, as optimal for the use case.
Globus Transfer UI: The Globus Transfer Application is a web application hosted by Globus that allows users to initiate and track file transfers.
Globus Transfer API: The Globus Transfer Application and API are used by HuBMAP to enable authorized users to securely upload and download data.
MySQL: MySQL is used for relational data, including the UUID API.
Postgres: PostgreSQL relational database.
OpenSearch: OpenSearch search engine.
Neo4j: The open and free version of the Neo4j graph database, deployed on AWS, is used for the Provenance (Entity API) and Knowledge (Ontology API) backends.
Entity: The Entity API is the main interface to the HuBMAP provenance store/database. It is a standard HTTP RESTful web service providing POST/PUT/GET services for the metadata associated with Donors, Organs, Tissue Samples, and Datasets.
UUID: The UUID API is used to create and translate HuBMAP-specific IDs (UUIDs, HuBMAP IDs, and Submission IDs). These are used to codify Donors, Tissue Samples (including organs), Datasets, and other miscellaneous entities used by the provenance graph data store.
Search: The Search API is a search-oriented service backed by Elasticsearch holding configurable views (configured via a modular transform plugin) of HuBMAP provenance data.
Workspaces: The Workspaces API enables the creation of user workspaces used to run analyses against HuBMAP data using on-prem compute resources.
Ingest: The Ingest API is used mainly by the Ingest UI to provide application-specific functionality for data ingest and provenance. One of its main functions is to interact with the local PSC HIVE file system, so it is installed at the PSC instead of on AWS.
Cells: The Cells API provides the capability to search for data from indexed cell molecular information.
Ontology: The Ontology API provides concept, code, and term traversal within a unified knowledge graph derived from standard ontologies and application-specific terminologies. Its model schema enables efficient intra-ontology navigation and cross-ontology translation.
Ingest UI: The Ingest UI is a web application where Donors, Organs, Tissue Samples, and Datasets are submitted. Information registered via the Ingest UI is stored in the provenance database (Entity API). To upload/ingest data, users are directed to the Globus Transfer application.
RUI: The RUI (Registration User Interface) is used to spatially register tissue samples within their organ of origin.
EUI: The EUI (Exploration User Interface) is used to search and view tissue samples in the location registered via the RUI.
Ingest Pipeline: The Ingest Pipeline is the main pipeline wrapper called within Airflow to execute validation and analysis pipelines based on information drawn from the Entity and Ingest APIs. It also coordinates dataset status updates and the creation of new datasets with the Ingest API.
Analysis Pipeline: Analysis Pipelines analyze data from the supported assays. Each pipeline has its own GitHub repository and associated Common Workflow Language (CWL) definition and Docker container(s).
Azimuth: Azimuth is an analysis tool that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.
Apache Airflow Pipeline Manager: Apache Airflow is a workflow management application deployed at the PSC and used for running, monitoring, and returning responses from analysis and validation pipelines.
Portal: The Data Portal is where both public and Consortium users search for data and associated provenance information. Dataset information pages include provenance, metadata, and Vitessce visualizations for the data. Public users (no login) only see published data and associated provenance information, while Consortium users (with login) can also view yet-to-be-published data for validation.
Vitessce: Vitessce is a visual integration tool for exploration of spatial single-cell experiments, deployed in HuBMAP as an embedded web tool.
Components labeled in the diagram without a distinct description: Assets, HuBMAP Gateway.

Click on architecture components to explore resources, APIs, and applications.
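
As a minimal sketch of how a client might call one of these services, the snippet below builds an authenticated GET request to the Entity API with a Globus token in the Authorization header. The base URL, the /entities/<id> path layout, the Bearer scheme, and the placeholder identifier and token are assumptions for illustration only and should be checked against the HuBMAP API documentation.

```python
import json
import urllib.request

# Assumed base URL for the Entity API; verify against the HuBMAP API documentation.
ENTITY_API_BASE = "https://entity.api.hubmapconsortium.org"


def build_entity_request(entity_id: str, globus_token: str,
                         base_url: str = ENTITY_API_BASE) -> urllib.request.Request:
    """Build a GET request for one entity's metadata, carrying the Globus token."""
    return urllib.request.Request(
        f"{base_url}/entities/{entity_id}",  # hypothetical path layout
        headers={
            "Authorization": f"Bearer {globus_token}",  # token sent on every call
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_entity(entity_id: str, globus_token: str) -> dict:
    """Send the request and decode the JSON body (requires network access)."""
    request = build_entity_request(entity_id, globus_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholders only; no network call is made here.
    request = build_entity_request("EXAMPLE-ID", "EXAMPLE-TOKEN")
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from the network call makes it possible to inspect the URL and headers without a valid token.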


Atlas construction and publication

Crosswalk tables for 3D Reference Objects:

Crosswalk tables for cell type annotation tools:


Atlas use case preview: Facilitating atlas construction by aligning new tissue blocks with existing data

User stories US#1-2 have been partially implemented and can be explored online via the HRA Portal at https://humanatlas.io/overview-use-the-hra

This paper uses the HRApop v0.10.2 run and all data is available via

The LOD server supports SPARQL queries. For easy access to data that is of general utility, pre-made SPARQL queries are provided as web API endpoints via grlc. For example, HRApop users might be interested to examine the biomarker expression values for one cell type across HRApop datasets for specific anatomical structures (Fig. 1) or explore similarity of the 553 datasets used in HRApop construction based on shared cell type populations (Fig. 2) or shared anatomical structures based on mesh-level collision detection (Fig. 3).
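
The following is a minimal sketch of how a SPARQL query could be sent to such a server from Python using only the standard library, following the SPARQL 1.1 Protocol. The endpoint URL and the example query are assumptions and not taken from the HRA documentation; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; replace with the SPARQL endpoint documented for the LOD server.
SPARQL_ENDPOINT = "https://lod.humanatlas.io/sparql"


def build_sparql_request(query: str, endpoint: str = SPARQL_ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def bindings_to_rows(results: dict) -> list:
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


def run_query(query: str, endpoint: str = SPARQL_ENDPOINT) -> list:
    """Execute the query over the network and return flattened rows."""
    with urllib.request.urlopen(build_sparql_request(query, endpoint)) as response:
        return bindings_to_rows(json.load(response))


if __name__ == "__main__":
    # Generic example query; it does not reflect the HRA graph layout.
    example = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
    print(build_sparql_request(example).full_url)
```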


SI Figure 1. Dot plot for biomarker expression of one cell type across HRApop datasets. Use the /datasets-with-ct SPARQL query to retrieve all atlas datasets with a given cell type. For cell type ‘adipocyte’, the query returns 109 datasets with that cell type, with a total of 420 biomarkers characterizing it. All 109 datasets were annotated by Azimuth; no other cell type annotation tool assigns an adipocyte cell type. The query is documented here. The Jupyter Notebook to render the visualization is here.


SI Figure 2. Heatmaps for prevalence of cell types across organs in HRApop datasets. a. Azimuth can be run over four organs. b. CellTypist is available for six organs. c. popV was run for 10 organs. Each heatmap represents a scaled mean value (z-score) for the percentage of cells identified in each dataset registered into an organ, by tool. The percentage values are scaled using R’s scale() function, which centers the values for a given variable on their mean and divides them by their standard deviation, yielding a z-score. A z-score of 0 means the value is close to the variable’s mean. A color corresponding to a score of 1 indicates that the cell type percentage is 1 standard deviation higher than the mean for that cell type, a score of 2 indicates 2 standard deviations higher, etc. Full versions of all three plots are provided here (Azimuth), here (CellTypist), and here (popV). The R Markdown document to generate these visualizations is here.
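
To make the scaling concrete, here is a small Python sketch of the same z-score computation that R’s scale() performs on one variable by default (centering on the mean, then dividing by the sample standard deviation with an n - 1 denominator). The input percentages are made up for illustration, and the function name merely mirrors the R function.

```python
from statistics import mean, stdev


def scale(values):
    """Center values on their mean and divide by the sample standard deviation (n - 1)."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation, matching R's sd()
    return [(value - center) / spread for value in values]


# Made-up percentages of one cell type in three datasets.
percentages = [10.0, 20.0, 30.0]
print(scale(percentages))  # [-1.0, 0.0, 1.0]
```

Here the mean is 20 and the sample standard deviation is 10, so the first dataset lies one standard deviation below the mean and the third lies one above.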


SI Figure 3. UMAP plot of dataset similarity based on shared anatomical structures. a. The similarity of the 553 atlas-level datasets is plotted based on the percentage of shared anatomical structures, determined using mesh-level collision detection. Weighted cosine similarity is used here and in US#2, available via the HRA Portal at https://humanatlas.io/user-story/2. Datasets cluster by organ (see legend on the right). b. A UMAP zoom into four subclusters for the small intestine reveals the four major extraction sites. Full versions of the UMAP plots are provided here.
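
The sketch below illustrates one way a weighted cosine similarity between two datasets could be computed, assuming each dataset is represented as a mapping from anatomical structure to the percentage of the tissue block that collides with it. The exact weighting scheme used for the figure is not specified here, so the optional per-structure weights (defaulting to 1, which gives plain cosine similarity), the function name, and the example values are all assumptions.

```python
import math
from typing import Mapping, Optional


def weighted_cosine(a: Mapping[str, float], b: Mapping[str, float],
                    weights: Optional[Mapping[str, float]] = None) -> float:
    """Cosine similarity of two sparse vectors with optional per-key weights."""
    keys = set(a) | set(b)
    w = weights or {}
    dot = sum(w.get(k, 1.0) * a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(w.get(k, 1.0) * a.get(k, 0.0) ** 2 for k in keys))
    norm_b = math.sqrt(sum(w.get(k, 1.0) * b.get(k, 0.0) ** 2 for k in keys))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_a * norm_b)


# Hypothetical collision percentages per anatomical structure.
dataset_1 = {"structure A": 0.8, "structure B": 0.2}
dataset_2 = {"structure A": 0.6, "structure C": 0.4}
print(round(weighted_cosine(dataset_1, dataset_2), 3))
```

Datasets that collide with the same structures in similar proportions score close to 1, and datasets with no structures in common score 0.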


Atlas use case preview: Perivascular immune cells in lung

This use case features an example application that uses the Vitessce visualization tool to compare similar locations in healthy adult lung and in pediatric lung with bronchopulmonary dysplasia (BPD). It demonstrates an assessment of multiple cell types relative to the nearest endothelial cell nuclei using single-cell spatial protein biomarker data.
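
As a rough illustration of the underlying measurement, the sketch below computes, for each cell, the Euclidean distance to the nearest endothelial cell nucleus from 2D centroid coordinates. The coordinates, variable names, and brute-force approach are assumptions for illustration; the analysis code for this use case is in the GitHub repository linked below.

```python
import math


def distance_to_nearest(cells, targets):
    """For each (x, y) cell centroid, return the distance to the closest target centroid."""
    return [min(math.dist(cell, target) for target in targets) for cell in cells]


# Made-up centroids in arbitrary units.
immune_cells = [(0.0, 0.0), (9.0, 10.0)]
endothelial_nuclei = [(3.0, 4.0), (10.0, 10.0)]
print(distance_to_nearest(immune_cells, endothelial_nuclei))  # [5.0, 1.0]
```

For large images, a spatial index such as a k-d tree would replace the brute-force minimum, but the measured quantity is the same.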

Link to data on Google Drive: https://drive.google.com/drive/folders/1LX4PHzohrK5l_2G5szZEdxz8iTvIztx2

*Disclaimer: The datasets for this analysis are still in preparation for upload to the HuBMAP Portal. As a result, the Tissue Datasets field in the Exploration User Interface linked below will show 0. We provide these two datasets via Google Drive for the time being. Once the datasets are on the HuBMAP Portal, this field will be updated.*

Link to code on GitHub: https://github.com/cns-iu/hra-construction-usage-supporting-information/tree/main/perivascular-immune-cells-in-lung



Atlas use case preview: Hierarchical cell type populations within FTUs

This use case features a code template for hierarchical cell neighborhood analysis. The code was developed to analyze cell type neighborhoods across scales, and we have named some of these scales cellular “neighborhoods,” “communities,” and “functional tissue units.” Calculating similar cellular neighborhoods, communities, and tissue units across scales is analogous to how people form neighborhoods, cities, and states.
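
The sketch below shows one building block that such an analysis could rest on: for every cell, the composition of labels among its k nearest cells. It assumes that a hierarchy is built by clustering these composition vectors into neighborhoods and then repeating the step with the resulting labels and a larger window; that reading, the function name, and the toy data are assumptions, and the actual code template is in the worksheet linked below.

```python
import math
from collections import Counter


def neighborhood_composition(coords, labels, k):
    """For each cell, return the label fractions among its k nearest cells (itself included)."""
    compositions = []
    for center in coords:
        # Indices of the k cells closest to this cell (brute force for clarity).
        nearest = sorted(range(len(coords)),
                         key=lambda j: math.dist(center, coords[j]))[:k]
        counts = Counter(labels[j] for j in nearest)
        compositions.append({label: n / len(nearest) for label, n in counts.items()})
    return compositions


# Toy data: two cells on the left, two endothelial cells on the right.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
cell_types = ["T cell", "B cell", "endothelial", "endothelial"]
for composition in neighborhood_composition(coords, cell_types, k=2):
    print(composition)
```

Passing neighborhood labels instead of cell types, with a larger k, would yield the composition vectors for the next scale up.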

Link to Nature paper and data: https://portal.hubmapconsortium.org/browse/publication/77ab35880329b5932380104aa58795a4
Link to worksheet on GitHub: https://github.com/HickeyLab/Hierarchical-Tissue-Unit-Annotation


The cell type predictions for the same dataset, generated with the current version of the cell type model by the Van Valen lab, are also available at https://drive.google.com/drive/folders/1W0MVcc4Zx1pPHmshSohhzYIcvFxyFBDi. We compare the original STELLAR predictions (SI Fig. 4, left) with the predictions from the development version of the cell type model (SI Fig. 4, right) for one dataset. We also show a confusion matrix of the cell type categories for the same dataset (SI Fig. 5).
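
For readers who want to reproduce this kind of comparison on their own predictions, the following sketch tallies a confusion matrix from two per-cell label lists. The label values and function name are hypothetical, and this is not the code used to produce SI Fig. 5.

```python
from collections import Counter


def confusion_counts(labels_a, labels_b):
    """Count how often each (label from A, label from B) pair occurs for the same cell."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells in the same order")
    return Counter(zip(labels_a, labels_b))


# Hypothetical per-cell predictions from two methods.
method_a = ["B cell", "T cell", "T cell", "Endothelial"]
method_b = ["B cell", "T cell", "B cell", "Endothelial"]
for (label_a, label_b), count in sorted(confusion_counts(method_a, method_b).items()):
    print(f"{label_a} -> {label_b}: {count}")
```

Pairs where both labels match correspond to the diagonal of the matrix, and all other pairs are disagreements between the two methods.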


SI Figure 4. Comparison between cell type predictions from STELLAR (left) and the development version of the cell type model (right).


SI Figure 5. Confusion matrix for B009_Trans_CL_reg001 dataset.