Tag: Data integration

A spatial human thymus cell atlas mapped to a continuous tissue axis
[ad_1]
Data generation by institute

Metadata about scRNA-seq and CITE-seq samples, including information on source study, cell enrichment and donor age, are provided in Supplementary Table 1. Information about spatial data, including Visium, IBEX, RareCyte and RNAscope, is provided in Supplementary Table 2. In brief, all CITE-seq data were generated at Ghent University and mapped to the human genome (GRCh38) at the Wellcome Sanger Institute (WSI). All other original scRNA-seq data and 10x Visium data were generated and mapped to the human genome (GRCh38) at WSI. All IBEX imaging was performed at the National Institute of Allergy and Infectious Diseases (NIAID), NIH. All non-IBEX imaging datasets (RareCyte, RNAscope, Visium H&E) were generated at WSI. No fetal work was performed at the NIH and at Ghent University.

Human fetal and paediatric data from several previous studies^3,7,15,16 were included and reanalysed from raw fastq files. Details on sample processing, ethics and funding are available in the respective publications; details on the origin of each sample are provided in Fig. 1c and Supplementary Tables 1 and 2.

For samples processed at WSI, paediatric samples were obtained from cardiac corrective surgeries and provided by Newcastle University collected under REC approved study 18/EM/0314 and Great Ormond Street Hospital under REC approved study 07/Q0508/43. Human embryonic and fetal material was provided by the joint MRC & Wellcome Trust (grant MR/006237/1) Human Developmental Biology Resource (http://www.hdbr.org) with written consent and approval from the Newcastle and North Tyneside NHS Health Authority Joint Ethics Committee (08/H0906/21+5). Samples processed at Ghent University were obtained according to and used with the approval of the Medical Ethical Commission of Ghent University Hospital, Belgium (EC/2019-0826) through the haematopoietic cell biobank (EC-Bio/1-2018). Samples processed at the NIH were obtained under NIAID MTA 2016-250 from the pathology department of the Children’s National Medical Center in Washington, DC, following cardiothoracic surgery from children with congenital heart disease. Use of these thymus samples for this study was determined to be exempt from review by the NIH Institutional Review Board in accordance with the guidelines issued by the Office of Human Research Protections. Informed consent was obtained from all donors or their legal guardians.

Sample processing and library preparation for 10x scRNA-seq at WSI

Surgically removed paediatric thymi were directly moved to HypoThermosol (Sigma-Aldrich, H4416-100ML), shipped by courier with ice packs, and processed within 24 h from time of surgery. For scRNA-seq experiments, we performed single-cell dissociation as described in the protocol available online (https://doi.org/10.17504/protocols.io.bx8sprwe). In brief, thymic tissue was finely minced and cell dissociation was performed using a mixture of liberase TH (Roche, 05401135001) and DNase I (Roche, 4716728001) for ~30–60 min in two rounds. Digested tissue was filtered through a 70 μm strainer and digestion was stopped with 2% FBS in RPMI medium. Red blood cell lysis was then performed on the cell pellet using RBC lysis buffer (eBioscience, 00-4333-57), after which cells were washed and counted. Magnetic and/or FACS sorting were performed to enrich stromal populations. Magnetic sorting to enrich for EPCAM⁺ or to deplete CD45⁺ or CD3⁺ cells was performed for U09, U48 and Z11 samples (Supplementary Table 1) using a magnetic sorting kit from Miltenyi Biotec including LS Columns (130-042-401) and the following bead-tagged antibodies: human CD45 MicroBeads (130-045-801), human CD326 (EPCAM) MicroBeads (130-061-101), human CD3 MicroBeads (130-050-101). Z11 and U40 samples were FACS-sorted to enrich for CD45⁻ stromal cells (Z11) or to obtain both CD45⁻ cells and total TECs (U40) (Supplementary Table 1). To perform FACS sorting, cells were resuspended in FACS buffer (0.5% FBS and 2 mM EDTA in PBS), blocked with TruStainFcX (BioLegend, 422302) for 10 min and were stained with antibodies against EPCAM (anti-CD326 PE, 9C4, BioLegend, 324206), CD45 (anti-CD45 BV785, HI30, BioLegend, 304048), CD205 (anti-CD205 (DEC-205) APC, BioLegend, 342207) and CD3 (anti-CD3 FITC, OKT3, BioLegend, 317306) and DAPI for 30 min. All antibodies were diluted 1:50 for staining. After staining, cells were washed and sorted using Sony SH800 or Sony MA900 sorters with a 130 μm nozzle. Samples were first gated to remove debris and dead cells, but no singlet gate was applied to ensure large TECs would not be excluded. For samples stained with anti-CD45, total CD45⁻ cells were sorted to enrich for stroma. For EPCAM/CD205-stained cells, EPCAM⁺CD205⁻ and EPCAM⁻CD205⁺ cells were sorted to obtain the total TEC fraction including cortical and medullary epithelial cells but exclude autofluorescent cells (Supplementary Fig. 19a). FCS Express v.7.18.0025 was used for data analysis. After sorting/enrichment, cells were resuspended in collection buffer to the recommended concentration (10⁶ cells per ml) and loaded to capture around 8,000–10,000 cells on the 10x Genomics chromium controller to generate an emulsion of cells in droplets using Chromium Next GEM Single Cell 5′ Kit v2 (1000263). GEX and TCR-seq libraries were further prepared using the Library Construction Kit (1000190) and Chromium Single Cell Human TCR Amplification Kit (1000252) according to the manufacturer’s instructions. Sequencing was performed on the NovaSeq 6000 sequencer (Illumina). Additional details are provided in Supplementary Table 1.

Curation of published datasets

Data from several previous studies^3,7,15,16 are included in this Article. For public datasets deposited on ArrayExpress, paired-end fastq files were downloaded from ENA, and the .sdhf file was used to determine the type of experiment (3′/5′ and the version of 10x Genomics kit). For public datasets deposited on GEO, if the data were deposited as a Cell Ranger .bam file, URLs for the bam files were obtained using srapath v.2.11.0. The .bam files were downloaded and converted to .fastq files using 10x bamtofastq v.1.3.2. If GEO data were deposited as paired-end fastq files, sra files were located using the search utility from NCBI entrez-direct v.15.6, downloaded and converted to fastq files using fastq-dump v.2.11.0. Sample metadata were curated from the abstracts deposited on GEO. Finally, published datasets that had been generated at WSI were downloaded from iRODs v.4.2.7 in the form of cram files and converted to fastq files using samtools v.1.12. using the command ‘samtools collate -O -u -@16 $CRAM $TAG.tmp | samtools fastq -N -F 0×900 -@16 −1 $TAG.R1.fastq.gz −2 $TAG.R2.fastq.gz -’. Sample metadata were obtained using the imeta command from iRODs.

Processing of published and newly generated scRNA-seq and TCR-seq datasets at WSI

After fastq file generation, 10x Genomics scRNA-seq experiments were processed using the STARsolo pipeline detailed on GitHub (https://github.com/cellgeni/STARsolo). A STAR human genome reference matching Cell Ranger GRCh38-2020-A was prepared as per instructions from 10x Genomics. Using STAR v.2.7.9a and the previously collected data about sample type (3′/5′, 10x Genomics kit version), we applied the STARsolo command to specify UMI collapsing, barcode collapsing, and read clipping algorithms to generate results maximally similar to the default parameters of the “cellranger count” command in Cell Ranger v.6: “–soloUMIdedup 1MM_CR –soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts –soloUMIfiltering MultiGeneUMI_CR –clipAdapterType CellRanger4 –outFilterScoreMin 30”. For cell filtering, the EmptyDrops algorithm of Cell Ranger v.4 and above was invoked using “–soloCellFilter EmptyDrops_CR”. The option “–soloFeatures Gene GeneFull Velocyto” was used to generate both exon-only and full length (pre-mRNA) gene counts, as well as RNA velocity output matrices. TCR-seq samples were processed using Cell Ranger v.6.1.1 with VDJ reference vdj-GRCh38-alts-5.0.0. The default settings of the reference-based “cellranger vdj” command were used. Fastq files were converted to _S1_L001_R1_001.fastq.gz format to be compatible with Cell Ranger.

scRNA-seq quality control, data integration and annotation

Jupyter notebooks used for data quality control, preprocessing, integration and annotations are available in the GitHub repository for this manuscript (Code availability). Scanpy v.1.9.1 with anndata v.0.10.7 and the statistics and plotting libraries pandas v.2.2.2, numpy v.1.26.4, scipy v.1.13.0, seaborn v.0.13.2 and matplotlib v.3.8.4 were used for data analysis and visualization. Mapped libraries were subjected to computational removal of ambient RNA using CellBender⁴² v.0.1.0. Next, all datasets underwent cell quality-control filtering and cells with <400 or > 6500 genes, >6% mitochondrial reads or <5% ribosomal counts were removed. Doublets were annotated using Scrublet⁴³ v.0.2.3. Next, datasets were integrated with scVI from scvi-tools⁴⁴ v.0.19.0, for which mitochondrial, TCR and cell cycle genes were removed, and cells were annotated into major lineages (cell_type_level_0: T_DN, T_DP, T_SP, Epithelial, Stroma, Myeloid, RBC, B and Schwann) by Leiden clustering. Individual cell lineages were then separated and integrated with scVI to perform fine-grained annotation and remove remaining doublets picked up by manual annotation. Cell annotations were assigned based on four sequential steps: (1) high-resolution Leiden clustering was performed to find all potential cell clusters. (2) Annotations of new cells generated in this study were predicted (i) based on KNN graph majority voting of neighbouring cells with annotations from previous studies by adaptation of the weighted KNN transfer solution from scArches⁴⁵, or (ii) automatic label transfer using CellTypist⁴⁶ v.1.6.2 with the Developing_Human_Thymus or Pan_Fetal_Human models. (3) Calling the cell type annotation for a given cluster was then informed by a combination of the predicted labels and by marker genes reported in the literature. (4) Additional QC was performed and newly detected doublet clusters were removed where applicable. Steps 1–4 were repeated until final, fine-grained annotations were reached for the highest resolution (cell_type_level_4). Importantly, cell clusters that passed quality control but that we could not confidently assign to a defined cell type either in the literature or by cell markers by the strategy described above, were kept in the integrated object as ‘cell_type_level_4_explore’, which we recommend for future exploration and validation of these cell states. Annotations in ‘cell_type_level_4’ were grouped into five hierarchical levels from the finest (cell_type_level_4) to the broadest (cell_type_level_0) (Supplementary Table 7).

Single-cell TCR-sequencing data were processed using Dandelion⁴⁷ v.0.3.1, and a detailed notebook can be found in the GitHub repository for this Article (see Code availability). In brief, a Dandelion class object with n_obs was constructed from all combined TCR libraries. Cells were then further subset to only include DP and SP subtypes that contained V and J rearrangements for both TRB and TRA loci. Next, milo neighbourhoods were constructed based on scVI neighbourhood graphs (n_neighbors = 100) and VDJ genes and frequency feature space was calculated.

mTEC trajectory analysis

The scFates package⁴⁸ v.1.0.7 was used for trajectory analysis on the combined mTECII and mTECIII cells. mTECII and mTECIII cells from the paediatric scRNA-seq dataset were reintegrated and batch-corrected using scVI. Next, the object was preprocessed according to recommendations from Palantir⁴⁹ and the scVI latent embedding was used as an input to ‘palantir.utils.run_diffusion_maps’ (Palantir v.1.3.3). Tree learning was performed using scf.tl.tree by using multiscale diffusion space from Palantir as recommended by scFates. Next, the node that was characterized by expression of some mTECI markers (ASCL1, CCL21) together with mTECII markers (AIRE, FEZF2) was used as a root to compute pseudotime using scf.tl.pseudotime. Finally, milestones obtained after running the pseudotime were used to adjust the annotations. To derive differentially expressed genes across pseudotime branches, we applied the scf.tl.test_fork function, followed by scf.tl.branch-specific as described in the scFates article⁴⁸. In brief, it fits a generalized additive model for each gene using pseudotime, branch and interaction between pseudotime and branch as covariates; two-sided P values were extracted for the interaction term (pseudotime:branch) and corrected using FDR to obtain significant differentially expressed genes. Next, these genes were tested for upregulation in each branch and assigned to different branches based on a cut-off of 1.3-fold upregulation.

mcTEC differentiation potential analysis

The STEMNET package⁵⁰ (v.0.1) was used to determine the differentiation potential of mcTEC progenitor cells. For this purpose, only mTEC, cTEC and mcTEC(-Prolif) cells were retained in the dataset. mTECI/II/III and cTECI/II/III were set as maturation endpoints and the probability of each cell to adopt any of these six possible fates was calculated. To identify priming of mcTEC(-Prolif) towards mTECI versus cTECI fate, we derived a priming score by calculating the difference between the posterior probabilities for these two fates for each cell. Cells with priming score <−0.5 or >0.5 were labelled as mTECI-primed and cTECI-primed, respectively, whereas all other cells with comparable mTECI and cTECI potential were deemed to be unprimed.

Formulation of the CMA with OrganAxis

OrganAxis is a mathematical model aimed to derive the relative, signed position of a point in space in respect to two morphological landmarks. The OrganAxis base function H is highly flexible and tuneable with respect to the research question, spatial resolution and sampling frequency. In Supplementary Notes 1 and 2, we provide a detailed guide on tissue annotations with TissueTag v.0.1.1 and the details of OrganAxis derivation. The CMA is an extrapolation of OrganAxis, which is defined by a weighted linear combination of two H functions (CMA = 0.2 × H(edge-to-cortex) + 0.8 × H(cortex-to-medulla)). All IBEX and Visium images were annotated with TissueTag v.0.1.1 at a resolution of 2 um per pixel (ppm = 0.5). Then, annotations were transferred to a quasi-hexagonal grid that was generated by placing points with r-microns spacing in the x and y directions and staggering every other row by r/2. Throughout this study we used r = 15. L2 distances to broad level annotations (annotation_level_0) and the corresponding CMA values were calculated with k = 10 for all spots in the hexagonal grid. L2 distances to fine annotations (annotation_level_1) were calculated with k = 1. CMA and L2 distances were then transferred to spots in Visium datasets or nuclei segmentations for IBEX by nearest-neighbour mapping and were therefore spatially homogeneous for these two spatial technologies.

To provide a common reference, we also binned the axis to a sequential discrete space for the entire thymus by ten levels (capsular, subcapsular, cortical level 1, cortical level 2, cortical level 3, cortical CMJ, medullary CMJ, medullary level 1, medullary level 2, medullary level 3; details are provided in Supplementary Note 2 and Supplementary Table 8).

10x Visium spatial transcriptomics sample processing and sequencing

Resected fetal and paediatric thymi were directly moved to HypoThermosol (Sigma-Aldrich, H4416-100ML) (paediatric samples) or cold PBS (fetal samples), shipped by courier with ice packs and processed within 24 h from time of surgery. For embedding, fetal or paediatric thymus tissue was first transferred to PBS and then placed onto ice for a few minutes to clear away any excess medium and preservation liquid (such as HypoThermosol). Next, as much liquid as possible was removed from the sample and, if necessary, the tissue was trimmed to fit into a cryomold. The sample was placed into a cryomold (Tissue-Tek, AGG4581) filled with OCT (Leica biosystems, 14020108926) and positioned according to the desired orientation. The cryomold was then placed in isopentane that had been equilibrated to −60 °C and left to fully freeze for 2 min. The sample was then rested on dry ice to allow draining of the isopentane. Finally, the cryomolds were wrapped in foil and stored at −80 °C. On the day of sectioning, the samples were removed 1 h before sectioning and placed into the cryostat (Leica biosystems) at −18 °C to equilibrate. Tissue was sectioned (section thicknesses are provided in Supplementary Table 2) and sections were placed onto Visium slides according to the manufacturer’s protocol. The sections were stained with H&E and imaged at ×20 magnification (Hamamatsu Nanozoomer 2.0 HT). Libraries were further processed according to the manufacturer’s protocol (Visium Spatial Gene Expression Slide & Reagent Kit, 10x Genomics, PN-1000184) (permeabilization times are shown in Supplementary Table 2). The samples were sequenced on the NovaSeq 6000 sequencer (Illumina) and the obtained fastq files were mapped with Space Ranger (10x Genomics; version numbers are provided in Supplementary Table 2).

Visium preprocessing, image registration and annotation

To process the Visium histology image data in higher resolution than the SpaceRanger defaults, we built a custom pipeline to extract an additional layer of image resolution at up to 5,000 pixels (hires5K), which we found to be more suitable for morphological analysis. We also developed our own fiducial image registration pipeline for increased accuracy where the fiducials are detected with cellpose v.2.1.1 and RANSAC from scikit-image v.0.22.0 is used for affine registration of reference fiducial frame (information provided by 10x Genomics). Lastly, for flexible tissue detection, we used Otsu thresholding with an adjustable threshold.

We subsequently used TissueTag v.0.1.1 for semiautomated image annotation (Supplementary Note 1). Cortical and medullary pixels were predicted with a pixel random forest classifier by generating training annotations based on spots with the highest gene expression of AIRE (for medulla) and ARPP21 (for cortex). Automatic cortex/medulla annotations were then adjusted manually where necessary. Moreover, we manually annotated individual thymic lobes and specific structures, such as capsule/edge, freezing/sectioning artefacts, HCs, PVS and fetal thymus-associated lymphoid aggregates (as defined previously¹⁶). The morphological annotation and evaluations were done in consultation with expert human thymic pathologists. A full example of the Visium processing pipeline and annotation is provided on the GitHub repository for this Article (see Code availability).

Spatial Visium mapping with cell2location

To ensure the best possible matching between Visium and single-cell profiles, we performed spatial mapping using cell2location⁵¹ v.0.1.3 separately for fetal and paediatric datasets. We therefore subset our single-cell reference datasets according to either fetal or paediatric stage and removed rare cell types that were predominantly found in one of these stages. We then further removed cell types that showed stress signatures (which we believed to originate from technical factors) and cell types with the total number of cells < 40. A list of cell types that were excluded and the exclusion criteria are provided in Supplementary Table 9. Before cell2location deconvolution, we removed cell cycle genes (Supplementary Table 10) and mitochondrial genes, as well as TCR genes using the regex expression ‘^TR[AB][VDJ]|^IG[HKL][VDJC]’. We then further calculated highly variable genes and used relevant metadata cofactors (sample, chemistry, study, donor, age) to correct for batch effects in the cell2location model.

Quality control and batch correction of Visium data

After deconvolution, all Visium data were subjected to filtering based on read coverage and predicted cell abundance. Spots with fewer than 1,000 genes per spot or fewer than 25 predicted cells were omitted. Furthermore, annotated tissue artefacts and areas not assigned to a specific structure were removed. Next, to generate a common embedding we performed scVI integration after removing cell cycle, mitochondrial and TCR genes from the highly variable gene selection. scVI training was performed with ‘SampleID’ as the batch key, ‘SlideID, Spaceranger, section_thickness(um)’ as categorical covariates, and ‘Age(numeric), n_genes_by_counts’ as continuous covariates.

Before performing any association analysis with the CMA, we further removed lobules (based on ‘annotations_lobules_0’) that had no or small medullar or cortical regions, as we expected our CMA model to be less accurate in these cases.

PCA cumulative contribution of CMA

To estimate the dependency of the axis on spot gene variance across samples, we first normalized to a target sum of 2,500 counts and performed log transformation followed by combat regression (https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.combat.html) by sample to adjust for the batch effect of individual samples. We then computed the PCA for batch-corrected gene expression and calculated the Spearman correlation between the first ten PCs and CMA or the number of genes detected per spot (n_gene_by_counts). Note that number of genes per spot, in our hands, was mostly influenced by inconsistent permeabilization during Visium library preparation and constituted the largest technical source of within-sample variance that we found in both fetal and paediatric Visium samples. To estimate the cumulative contribution of either CMA or the number of genes per spot, we multiplied the Spearman’s R with the percentage cumulative explained variance of the first 10 PCs.

Cytokine clustering by expression along the binned axis in Visium data

To analyse cytokine gradients based on the spatial distribution across CMA bins, we first selected a group of 65 cytokines that were broadly expressed from the CellphoneDB database (v.4.1.0, genes annotated as ‘Cytokine’, ‘growthfactor | cytokine’, ‘cytokine’, ‘cytokine | hormone’). We excluded cytokines that were expressed in less than 5% of the spots in all CMA bins. We then performed hierarchical clustering on gene expression batch-corrected (combat) fetal Visium samples of the standardized mean expression of genes across bins using the Ward linkage method with the linkage function from the scipy.cluster.hierarchy module. A heat map was generated using the matrixplot function from the scanpy package⁵².

Cytokine two-way ANOVA analysis and cosine similarity on binned axis Visium data

To compare the distribution of cytokines across developmental groups (fetal versus paediatric) and identify differentially distributed genes, we implemented a two-way ANOVA approach. We initially log-normalized Visium gene expression, then removed lobules for which not a single cortex or medulla Visium spot was detected to increase CMA confidence in both datasets. Data were grouped by mean expression per sample and CMA bin, such that each sample had a single datapoint per CMA bin; n = 16 (paediatric) and n = 12 (fetal) samples. Cosine similarity was calculated based on the median values of the pooled sample bins between fetal and paediatric gene profiles with sklearn.metrics.pairwise.cosine_similarity from scikit-learn v.0.22.0. Two-way ANOVA for age group (fetal versus paediatric) and CMA bin was calculated with statsmodels.api.stats.anova_lm(model, type=2). P values for main and interaction effects were Bonferroni corrected with statsmodels.stats.multitest.multipletests (pvals, alpha=0.05, method=‘bonferroni’). For the full report of the results refer to Supplementary Table 5.

Cell- and gene-weighted mean CMA location on Visium data

To estimate the average position of a cell or gene distribution along CMA and HC axes (L2 distance to the nearest HC), spots with low gene expression were filtered out by using appropriate thresholds (0.2 for scVI corrected gene expression and 0.5 for predicted cell abundances). The position of a gene or cell was then calculated according to the following formula: for every gene/cell and axis positions, the weighted mean was calculated as a dot product of spot cell abundance values and CMA position divided by the sum of the cell abundance values across spots.

Identification of cell-type-specific SGs

To identify genes exclusively expressed in a specific cell type or subset thereof (‘specialization genes’, SGs), we developed custom Python functions. Starting from raw read count, gene expression was scaled with scipy.stats.zscore(). Cells that showed expression below a cut-off of 0.05 and genes that had expression below 1.5 mean counts were excluded from further steps. Next, a quantile threshold (>95%) was used to select cells with the highest expression level of a specific gene. A χ² test (scipy chi2_contingency) was performed per gene to identify if the selected cells were over-represented in a specific cell type (cell_type_level_4_explore), indicating the gene to be a marker gene. Genes that were predicted to be expressed only in a single cell type (χ² α = 1 × 10⁻⁵⁰) were considered to be SGs and used for further analyses.

IBEX clinical cohort details and sample preparation

Human thymus samples were obtained from the pathology department of the Children’s National Medical Center in Washington, DC, after cardiothoracic surgery from children with congenital heart disease, as thymic tissue is routinely removed and discarded to gain adequate exposure of the retrosternal operative field. Use of these thymus samples for this study was determined to be exempt from review by the NIH Institutional Review Board in accordance with the guidelines issued by the Office of Human Research Protections. There were no genetic concerns for the patients in this cohort. Details about the cohort can be found in Supplementary Table 2. Human thymi were placed in PBS on receipt and processed within 24 h after surgery. Excess fat and connective tissue were trimmed and sectioned into <5 mm cubes. For IBEX imaging, human thymi were fixed with BD CytoFix/CytoPerm (BD Biosciences) diluted in PBS (1:4) for 2 days. After fixation, all tissues were washed briefly (5 min per wash) in PBS and incubated in 30% sucrose for 2 days before embedding in OCT compound (Tissue-Tek) as described previously^17,53.

IBEX sample imaging preparation

IBEX imaging was performed on fixed frozen sections as described previously^17,53. In brief, 20 μm sections were cut on a CM1950 cryostat (Leica) and adhered to two-well chambered cover glasses (Lab-tek) coated with 15 μl of chrome alum gelatin (Newcomer Supply) per well. Frozen sections were permeabilized, blocked and stained in PBS containing 0.3% Triton X-100 (Sigma-Aldrich), 1% bovine serum albumin (Sigma-Aldrich) and 1% human Fc block (BD Biosciences). Immunolabelling was performed with the PELCO BioWave Pro 36500-230 microwave equipped with a PELCO SteadyTemp Pro 50062 Thermoelectric Recirculating Chiller (Ted Pella) using a 2-1-2-1-2-1-2-1-2 program. The IBEX thymus antibody panel can be found in Supplementary Table 3 and has been formatted as an Organ Mapping Antibody Panel⁵⁴ (OMAP-17) accessible online (https://humanatlas.io/omap). Custom antibodies were purchased from BioLegend or conjugated in house using labelling kits for Lumican (AAT Bioquest, 1230) and LYVE-1 (Thermo Fisher Scientific, A20182). A biotin avidin kit (Abcam, ab64212) was used to block endogenous avidin, biotin and biotin-binding proteins before streptavidin application. Cell nuclei were visualized with Hoechst 33342 (Biotium) and the sections were mounted using Fluoromount G (Southern Biotech). Mounting medium was thoroughly removed by washing with PBS after image acquisition and before chemical bleaching of fluorophores. After each staining and imaging cycle, the samples were treated with two 15 min treatments of 1 mg ml⁻¹ of LiBH₄ (STREM Chemicals) prepared in deionized H₂O to bleach all fluorophores except Hoechst.

IBEX image acquisition and alignment

Representative sections from different tissues were acquired using the inverted Leica TCS SP8 X confocal microscope with a ×40 objective (NA 1.3), 4 HyD and 1 PMT detectors, a white-light laser that produces a continuous spectral output between 470 and 670 nm as well as 405, 685 and 730 nm lasers. Panels consisted of antibodies conjugated to the following fluorophores and dyes: Hoechst, Alexa Fluor (AF)488, FITC, AF532, phycoerythrin (PE), eF570, AF555, iFluor 594, AF647, eF660, AF680 and AF700. All images were captured at an 8-bit depth, with a line average of 3 and 1024 × 1024 format with the following pixel dimensions: x (0.284 μm), y (0.284 μm) and z (1 μm). Images were tiled and merged using the LAS X Navigator software (LAS X v.3.5.7.23225). For IBEX tissue imaging, multiple tissue sections were examined before selecting a representative tissue section that contained several distinct lobules with multiple functional units, often resulting in an unusually shaped region of interest. Fluorophore emission was collected on separate detectors with sequential laser excitation of compatible fluorophores (3–4 per sequential) used to minimize spectral spillover. The Channel Dye Separation module within LAS X v.3.5.7.23225 (Leica Microsystems) was then used to correct for any residual spillover. For publication-quality images, Gaussian filters, brightness/contrast adjustments and channel masks were applied uniformly to all images. Image alignment of all IBEX panels was performed as described previously using SimpleITK^55,56. Additional details on antibodies, protocols and software can be found on the IBEX Knowledge-Base (https://doi.org/10.5281/zenodo.7693279).

IBEX 3D single nuclei segmentation with cellpose

IBEX images were converted from .ims format (Imaris, Oxford Instruments, v.9.5.0 and v.10.0.0) to 3D stacks (TIFF) by individual channels with FIJI v.1.54j. We then applied a custom pipeline for 3D single nuclei segmentation with cellpose v.2.1.1:
1. (1)
  
  Image preparation: individual TIFF images were separated by z plane and channel and the sample channel metadata were extracted as a .csv file. The nuclear staining channel (Hoechst) was located, and a set of tiled 3D image arrays was generated.
2. (2)
  
  Segmentation: the image arrays were sequentially segmented with cellpose using specific parameters, for example, diameter, resampling, anisotropy, thresholds and batch size. We used tiles to overcome restrictions on GPU and RAM resources.
3. (3)
  
  Restitching and formatting: after segmentation, tiles were stitched back together, and all segmentation masks were stored in a uint32-bit image format to hold more than 65,535 mask labels, as a typical sample produces 200,000–700,000 cell masks.
4. (4)
  
  Handling high nuclear density: the high nuclear density, especially in the thymic cortex, produced scenarios in which segmentation masks were often bordered by neighbouring cells, resulting in substantial inter-cell signal bleed. To overcome this issue, we excluded pixels in the interface between two cells such that the mask boundary was preserved if no neighbouring cell was present.
5. (5)
  
  Mask filtering: we removed small cell masks to avoid noise and cell fragments using skimage.morphology.remove_small_objects.
6. (6)
  
  Signal intensity extraction: for each cell mask post-filtering, we extracted the mean and maximum signal intensity for each channel in the IBEX image. This produced a cellXprotein file for each sample.
7. (7)
  
  Data merging: once all sample single-cell segmentations were collected, we merged all samples (n = 8) and stored metadata, channel and spatial information in a unified AnnData object, totalling more than 1.1 million cells.
Label transfer to IBEX cells from scRNA-seq and CITE-seq reference atlas with ISS patcher

We used a KNN algorithm to compare the annotated cell types in our scRNA-seq reference atlas with the IBEX single-cell segmentations. For this purpose, the protein expression in the IBEX query cells was matched with the RNA expression of the corresponding genes in the scRNA-seq reference. Protein and gene names were matched according to Supplementary Table 11. Batch effects were removed from IBEX with scanpy.pp.combat(ibex_gene, key=‘sample’, inplace=True). We next subsetted each IBEX sample and ran the KNN prediction algorithm per sample with k = 30, including the following steps:
1. (1)
  
  The shared feature space between the two objects was identified, log-normalized and z-scored on a per-object basis.
2. (2)
  
  The KNN of the low-dimensional observations (IBEX) in the high-dimensional space (GEX) were identified. The counts of absent features were imputed as the mean of the high-dimensional neighbours.
3. (3)
  
  On the basis of majority voting, the most frequent cell annotation in the GEX reference was assigned to the IBEX query cell. The proportion of KNNs that contributed to the majority voting was recorded as the evidence fraction (KNNf). For example, if out of the 30 nearest neighbours, 13 were labelled as A, 10 as B and 7 as C in GEX, then the IBEX cell received the label A. The fraction in this case refers to the proportion of the 30 nearest neighbours that contributed to the majority label, which would be KNNf = 13/30 = 0.43 for label A.
In some of our IBEX samples AIRE staining in particular produced a relatively high level of non-specific signal and low signal-to-noise ratio. As a consequence, we identify some predicted AIRE⁺ mTECII cells that are not in the medulla and have capsular/cortical localization (Fig. 4g). We had cases of individual samples for which a specific antibody did not perform well or was missing (for example, for IBEX_01 the KRT10 antibody was out of stock). However, as ISS patcher was run on each sample separately, this did not affect proper scaling of that marker and its use of it for mapping in other samples. Moreover, by selecting cells with a higher proportion of matched KNN cells from the single-cell data (KKNf) these effects are reconciled through the removal of cells with low-confidence mapping. We would like to flag this point for future researchers who want to reuse our datasets.

The general code for the ISS patcher KNN mapping algorithm can be found in the dedicated GitHub repository (https://github.com/Teichlab/iss_patcher/tree/main) and the full example for KNN mapping using IBEX is reported in the GitHub repository for this Article (see Code availability).

To annotate T cell types in the IBEX data with high accuracy, we applied the ISS patcher with the CITE-seq T lineage data as a reference as follows: we first applied the KNN algorithm to IBEX data using the scRNA-seq reference atlas to identify and subset IBEX T lineage cells. We then used the CITE-seq data as a reference to repeat the KNN-based annotation on these selected cells. This KNN-based reannotation was performed on a hybrid RNA/protein reference, which included protein measurements for the 19 markers assayed in both CITE-seq and IBEX in addition to RNA measurements for the remaining 23 genes as described in Supplementary Table 11. We used the same KNN implementation as described above but with k = 7, while also imputing CD4 and CD8 pseudotime.

Human thymus FFPE processing

Human paediatric samples were obtained from cardiac corrective surgeries. Removed thymi were directly moved to HypoThermosol (Sigma-Aldrich, H4416-100ML), shipped with a courier with ice packs and processed in under 24 h after surgery. After arrival, for FFPE processing, the samples were cut to approximately 1 cm² pieces with sharp scissors in 1× DPBS. Tissue pieces were then rinsed in clean DPBS to remove any excess HypoThermosol, patted dry with wipes (Kimtech) and placed into 10% formalin (cellpath, BAF-6000-08A) for 16–24 h at room temperature. The next day, tissues were dehydrated and embedded in wax, then kept at 4 °C.

RareCyte immunostaining and 14-plex imaging

Multiplex immunofluorescence and single round imaging was performed as described previously⁵⁷. All steps were performed at room temperature unless stated otherwise. In brief, FFPE blocks were sectioned using a microtome (Leica, RM2235) at 3.5–5 µm thickness and placed onto a superfrost slide (Thermo Fisher Scientific, 12312148). Slides were dried at 60 °C for 60 min to ensure that tissue sections had adhered to the slides. Tissue sections were deparaffinized and subjected to antigen retrieval using the BioGenex EZ-Retriever system (95 °C for 5 min followed by 107 °C 5 min). For OCT sections, 7 µm sections were taken using a cryostat (Leica CM3050S), placed onto SuperFrost Plus slides (VWR) and immediately submerged in 10% buffered saline formalin (Cellpath, BAF-6000-08A) for 1 h at room temperature. The samples were then subjected to the following steps similarly to the FFPE samples. To remove autofluorescence, slides were bleached with AF quench buffer (4.5% H₂O₂, 24 mM NaOH in PBS). The slides were quenched for 60 min using the ‘high’ setting with a strong white-light exposure followed by further quenching for 30 min using the 365 nm ‘high’ setting using a UV transilluminator. The slides were rinsed with 1× PBS and incubated in 300 µl of Image-iT FX Signal Enhancer (Thermo Fisher Scientific, I36933) for 15 min. The slides were rinsed again and 300 µl of labelled primary antibody staining cocktail was added to the tissue, which was subsequently incubated for 120 min in the dark within a humidity tray. All antibodies were prediluted according to company recommendations and were not adjusted further (Supplementary Table 4). The slides were washed with a surfactant wash buffer and 300 µl of nuclear staining in goat diluent was added to the slide. The slides were then incubated in the dark for 30 min in a humidity tray. The slides were then washed and placed in 1× PBS. Finally the slides were coverslipped using ArgoFluor mount medium and left in the dark at room temperature overnight to dry. The slides were imaged the next day using the RareCyte Orion microscope with a ×20 objective and relevant acquisition settings were applied using the software Artemis v.4.

RNAscope processing and imaging

For RNAscope analysis, thymus tissue was processed as described above for Visium sectioning. Sections were cut from the fresh frozen OCT-embedded (OCT, Leica) samples at a thickness of 10 μm using a cryostat (Leica, CM3050S) and placed onto SuperFrost Plus slides (VWR). Sections were stored at −80 °C until staining. The sections were removed from the −80 °C storage and submerged in chilled (4 °C) 4% PFA for 15 min, then acclimatized to room temperature 4% PFA over 120 min. The sections were then briefly washed in 1× PBS to remove any remaining OCT. Then, the sections were dehydrated in a series of 50%, 70%, 100% and 100% ethanol (5 min each) and air-dried before performing automated 4-plex RNAscope.

Using the automated Leica BOND RX, RNAscope staining was performed on the fresh frozen sections using the RNAscope LS multiplex fluorescent Reagent Kit v2 Assay and RNAscope LS 4-Plex Ancillary Kit for LS Multiplex Fluorescent (Advanced Cell Diagnostics (ACD), Bio-Techne) according to the manufacturer’s instructions. All of the sections were subjected to 15 min of protease III treatment before staining protocols were performed. Before running RNAscope probe panels, the RNA quality of fresh frozen samples was assessed using multiplex positive (RNAscope LS 2.5 4-plex Positive Control Probe, ACD Bio-Techne, 321808) and negative (RNAscope 4-plex LS Multiplex Negative Control Probe, ACD Bio-Techne, 321838) controls.

The probes were labelled using Opal 520, 570 and 650 fluorophores (Akoya Biosciences, 1:1,000) and one probe channel was labelled using Atto 425-streptavidin fluorophore (Sigma-Aldrich, 1:500), which was first incubated with TSA–biotin (Akoya Biosciences, 1:400). The following RNAscope 2.5 LS probes were used for this study: Hs-AIRE (ACD Bio-Techne, 551248), Hs-LY75-C2 (ACD Bio-Techne, 481438-C2), Hs-CAMP-C3 (ACD Bio-Techne, 446248-C3), Hs-EPCAM-C4 (ACD Bio-Techne, 310288-C4), Hs-IGFBP6-C1 (ACD Bio-Techne, 496068) and Hs-DLK2-C3 (ACD Bio-Techne, 425088-C3). All nuclei were DAPI stained (Life Technologies, D1306). Details are provided in Supplementary Table 12.

Confocal imaging was performed on the Perkin Elmer Operetta CLS High Content Analysis System using a ×20 (NA 0.16, 0.299 μm px⁻¹) water-immersion objective with 9-11 z-stacks with 2 µm step. Channels: DAPI (excitation, 355–385 nm; emission, 430–500 nm), Atto 425 (excitation, 435–460 nm; emission, 470–515 nm), Opal 520 (excitation, 460–490 nm; emission, 500–550 nm), Opal 570 (excitation, 530–560 nm; emission, 570–620 nm), Opal 650 (excitation, 615–645 nm; emission, 655–760 nm). Confocal image stacks were stitched as individual z stacks using proprietary Acapella scripts provided by Perkin Elmer, and visualized using OMERO Plus (Glencoe Software).

The contrast used for Extended Data Fig. 6e was as follows: DLK2 (magenta 150–500), IGFBP6 (yellow 200–1500), LY75 (green 200–4000), EPCAM (red 300–2500).

CITE-seq tissue processing

Paediatric thymus samples from children undergoing cardiac surgery were obtained according to and used with the approval of the Medical Ethical Commission of Ghent University Hospital, Belgium (EC/2019-0826) through the haematopoietic cell biobank (EC-Bio/1-2018). Thymus tissue was cut into small pieces using scalpels and digested with 1.6 mg ml⁻¹ collagenase (Gibco, 17104-019) in IMDM medium for 30 min at 37 °C with regular agitation to generate a single-cell suspension. The reaction was quenched with 10% FBS and the thymocyte suspension was passed through a 70 μm strainer to remove undigested tissue. Cells were frozen in FBS containing 10% DMSO and stored in liquid nitrogen until needed.

CITE-seq antibody preparation

The TotalSeq-C Human Universal Cocktail 1.0 (BioLegend, 399905) was resuspended according to the manufacturer’s instructions. In brief, the lyophilized cocktail was equilibrated to room temperature for 5 min and then centrifuged at 10,000g for 30 s. Then, 27.5 µl cell staining buffer (BioLegend, 420201) was added and the tube was vortexed for 10 s, incubated for 5 min at room temperature and vortexed again. The resuspended cocktail was centrifuged for 30 s at room temperature at 10,000g and the entire volume was transferred to a low-bind tube. Finally, the tube was centrifuged again at 14,000g for 10 min at 4 °C and 25 µl of the supernatant was used per sample (2 × 10⁶ cells in 200 µl).

In total, 13 TotalSeq-C antibodies (BioLegend) were titrated individually by flow cytometry using PE-conjugated versions of the same antibody clone as recommended by BioLegend. After choosing a suitable concentration for each antibody, a master mix was prepared for cell staining. For this, all antibodies were initially diluted in the cell staining buffer to obtain a concentration 100-fold higher than the desired final staining concentration. Then, 2 µl of each diluted antibody substock was combined in a master mix, which was added to the cells for labelling in a total volume of 200 µl. Details on the TotalSeq-C antibodies are provided in Supplementary Table 6.

CITE-seq sample preparation

Cells were thawed slowly by gradually adding 15 volumes of pre-warmed IMDM medium and pelleted at 1,700 rpm for 6 min at 4 °C. After resuspending in PBS, cells were passed through a 70 μm strainer to remove clumps. Enrichment for viable cells was achieved using a magnetic bead-based dead cell removal kit (Miltenyi, 130-090-101). For this, cells were pelleted as before, washed with 1× binding buffer (part of the kit, prepared with sterile distilled water) and resuspended in dead cell removal microbeads (part of the kit) at a concentration of 10⁷ cells per 100 µl beads. After incubation at room temperature for 15 min, cells were applied to an LS column (Miltenyi, 130-122-729), which was prerinsed with 3 ml 1× binding buffer. The column was washed four times with 3 ml binding buffer and the flow-through containing viable cells was collected. Cells in the flow-through were pelleted and viability was confirmed using trypan blue. A total of 2 × 10⁶ viable cells was used for TotalSeq-C and anti-CD3-PE antibody staining. For this purpose, cells were washed with cell staining buffer (BioLegend, 420201), pelleted at 600g for 10 min at 4 °C and resuspended in 90 µl cell staining buffer. Then, 10 µl Human TruStain FcX blocking solution (BioLegend, 422301) was added and cells were incubated for 10 min at 4 °C. The TotalSeq-C Human Universal Cocktail 1.0 (BioLegend, 399905) was resuspended as described above, centrifuged at 14,000g for 10 min at 4 °C and 25 µl of the supernatant was added to the blocked cells. Individual TotalSeq-C antibodies were prepared as described above and 26 µl of the master mix was added to each sample. To facilitate enrichment of immature and mature thymocytes by FACS, 10 µl anti-CD3-PE (SK7, BioLegend, 344805) was added and the samples were topped up with 40 µl cell staining buffer resulting in a total staining volume of 200 µl. The samples were incubated for 30 min at 4 °C in the dark. To wash off unbound antibodies, cell staining buffer was added to the samples, and cells were pelleted for 10 min at 600g at 4 °C. All supernatant was removed, cells were resuspended in cell staining buffer, transferred to a new tube and pelleted as before. Cells were again resuspended in cell staining buffer and pelleted and this wash step was repeated once more before cells were resuspended in 200 µl MACS buffer (2% FCS, 2 mM EDTA in PBS) in preparation for sorting. Then, 1 µl propidium iodide (Invitrogen, 230111) was added for detection of dead cells and samples were sorted on the BD FACSAria III or BD FACSAria Fusion cell sorter using a 100 μm nozzle and a maximum flow rate of 4 (FACSDiva v.8.0.2, reanalysis with FlowJo v.10.8.2). Cells were gated using forward/side scatter to remove doublets and debris, then dead cells were excluded based on PI staining. CD3⁻ and CD3⁺ cells were collected separately in cooled IMDM + 50% FCS (Supplementary Fig. 19b). After completion of the sort, collection tubes were topped up with DPBS and cells were pelleted at 400g for 10 min at 4 °C. The supernatant was removed and cells were resuspended at an estimated concentration of 1,500 cells per µl in PBS + 0.04% BSA (Miltenyi, 130-091-376), of which 16.5 µl was used in the GEM generation step. The Next GEM Single Cell 5′ Kit v2 (10x Genomics, 1000265) was used to prepare the reaction master mix, and load cells, gel beads and partitioning oil on a Chip K (10x Genomics, 1000286) according to the manufacturer’s protocol CG000330 Rev A. GEMs were generated using a Chromium Controller (10x Genomics), transferred to a tube strip and reverse transcription was performed in a BioRad C1000 Touch Thermal Cycler according to the protocol. The samples were stored at 4 °C overnight and the library preparation was carried out the next day.

CITE-seq library preparation and sequencing

Feature barcode (FB), gene expression (GEX) and TCR libraries were prepared according to protocol CG000330 Rev A (10x Genomics) using the Chromium Next GEM Single Cell 5′GEM Kit v2 (10x Genomics, 1000244), Library Construction Kit (10x Genomics, 1000190), 5′ Feature Barcode Kit (10x Genomics, 1000256), Human TCR Amplification Kit (10x Genomics, 1000252), Dual Index Kit TT set A (10x Genomics, 1000215) and Dual Index Kit TN set A (10x Genomics, 1000250). The protocol version for >6,000 cells was followed and libraries were amplified for 13 cycles (cDNA), 14 cycles (GEX), 8 cycles (FB) or 12 + 10 cycles (TCR libraries). Library quality and quantity were checked on the Bioanalyzer instrument (Agilent) using a high-sensitivity DNA assay. Libraries were pooled and sequenced on the NovaSeq 6000 instrument (Illumina) to a minimum of 25,000 reads per cell for GEX, 10,000 reads per cell for FB and 5,000 reads per cell for TCR libraries.

CITE-seq quality control and denoising

CITE-seq data were processed using the R packages Seurat⁵⁸ (v.4.3.0), SeuratObject (v.4.1.4), SeuratDisk (v.0.0.0.9021), SingleCellExperiment (v.1.24.0), Matrix (v.1.6-4), matrixStats (v.1.2.0), dplyr (v.1.1.4), tidyr (v.1.3.1), reshape2 (v.1.4.4), BiocNeighbors (v.1.20.2), BiocParallel (v.1.36.0), stringr (V.1.5.1), reticulate (v.1.35.0) and sceasy (v.0.0.7). Data were visualized using ggplot2 (v.3.5.0), ggrastr (v.1.0.2), ggridges (v.0.5.6) and RColorBrewer (v.1.1-3).

Fastq files from FB libraries were mapped with Cell Ranger v.7.0.0. GEX libraries were mapped with STARsolo as described above for scRNA-seq data. CITE-seq data were first subjected to quality-control processing based on RNA properties as described above. For the retained high-quality cells, ADT data were then denoised using dsb (v.1.0.3)⁵⁹. For this purpose, empty droplets were identified in the unfiltered mapping output and used as a reference for estimating noise and antibody background levels. Approximately 1.4 million droplets were selected on the basis of RNA count < 240 reads, ADT count between 120 and 350 reads, and <5% mitochondrial reads to ensure that damaged cells were not included in the subset. During denoising and normalization with DSBNormalizeProtein, ADT data from droplets were used for background correction and the seven isotype control antibodies included in the TotalSeq-C Human Universal Cocktail were used to determine technical variation. Data were scaled based on the background by subtracting the mean and then dividing by the standard deviation of the empty droplets. Negative values after denoising, which correspond to very low expression, were set to zero to improve interpretation and visualization. In an additional quality-control step, cells in which less than 100 of the antibodies were detected were removed from the dataset as technical artefact. Moreover, a subset of cells affected by aggregates of the antibodies against TCRγδ, CD199 (CCR9), CD370, CD357 and XCR1 was removed from the dataset due to unreliable surface-staining properties.

CITE-seq annotation

Annotation of the CITE-seq data was carried out on integrated RNA and denoised ADT modalities. For this purpose, both data modalities were log-normalized and the top 2,000 highly variable genes were identified, followed by PCA on the scaled highly variable genes using the standard functions in the Seurat package. MNN correction was applied to the PCA loadings matrix using the reducedMNN function from the Batchelor package⁶⁰ (v.1.18.1) to reduce batch and donor effects between samples. To integrate both modalities, multimodal neighbours and modality weights were identified and a weighted nearest neighbours (WNN) graph⁵⁸ was constructed using FindMultimodalNeighbors based on the PCA. The number of PCs to be used was determined based on the difference in variation of consecutive PCs being higher than 0.1%. A UMAP visualization was generated based on the WNN graph to represent the weighted combination of both modalities. Furthermore, a supervised PCA (sPCA) was performed on transcriptome data using the RunSPCA function of the Seurat package to obtain dimensionality reduction for the RNA modality that represents the structure of the WNN graph⁵⁸.

To identify cell types and developmental stages, we performed Leiden clustering at a low resolution based on the sPCA using the Seurat functions FindNeighbors and FindClusters. The obtained clusters were then subsetted and analysed individually starting with the most immature cluster, which was identified based on high expression of the surface marker CD34. For each subset, normalization and scaling for RNA and ADT data were repeated as described above and a new WNN UMAP and sPCA were constructed. A combination of Leiden clustering on the sPCA and thresholding of protein levels was used to identify cell types. Moreover, to identify proliferating cells, cell cycle scoring was performed using the G2M and S phase markers supplied in the Seurat package. The FindMarkers function (Seurat) and the package singleCellHaystack⁶¹ (v.1.0.0) were used to identify differentially expressed genes and surface markers in a cluster-based and cluster-independent manner, respectively. Distinction of CD4 and CD8 lineage maturation stages was based on CD45RA, CD45RO and CD1a and identical expression cut-offs were used for both lineages to ensure that subsets would be directly comparable.

Paired TCR-seq data were processed with Dandelion⁴⁷ v.0.3.1 and information about productive or non-productive TRA and TRB rearrangements was extracted for each cell to validate cell type annotations after clustering.

CITE-seq pseudotime analysis

To carry out trajectory inference for αβT lineage cells, CITE-seq data were subsetted to contain DP_pos_sel, DP_4hi8lo, SP_CD4_immature, SP_CD4_semi-mature, SP_CD4_mature, SP_CD8_immature, SP_CD8_semi-mature and SP_CD8_mature cells. A new WNN UMAP was constructed based on surface protein and RNA. Slingshot⁶² (v.2.6.0) was used to establish a minimum spanning tree on the WNN UMAP using the getLineages function based on mutual nearest neighbour-based distance with DP_pos_sel set as start point and SP_CD4_mature and SP_CD8_mature specified as end points. Smooth lineages were obtained using the getCurves function and the derived pseudotime orderings were used to assess transcript and surface marker expression throughout differentiation.

CITE-seq cell2location mapping, integration and processing

For the focused multimodal analysis, we subsetted the full scRNA-seq dataset to include only paediatric data and further removed T lineage cells without CITE-seq protein information. We then performed cell2location as described above to obtain deconvolved cell type mapping based on CITE-seq annotations. In addition, as cell2location deconvolves spots by unique pseudobulk expression profiles from a single-cell reference based on discrete cell annotations, the mapping resolution is limited to that of the annotated cell subsets. To investigate the continuous spatiotemporal nature of CD4/CD8 cell lineages at increased resolution, we performed high-resolution Leiden clustering on the CITE-seq data using scanpy (resolution=35), resulting in 567 cell clusters. These cell clusters were then mapped to our paediatric Visium data with cell2location as described above. To measure the position of each cell cluster across the CMA, we selected Visium spots with the highest cluster label abundance (above percentile 95%) and calculated the weighted mean CMA values for these spots. This value was then assigned this to the cells comprising the associated cluster in the single-cell object. Further details are provided in the GitHub repository for this Article (see Code availability).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
[ad_2]

Source link
November 20, 2024
Single-cell integration reveals metaplasia in inflammatory gut diseases

[ad_1]

Patient samples and tissue processing

Healthy tissue from adults

Healthy adult gastrointestinal tissue was obtained by the Cambridge Biorepository of Translational Medicine (CBTM) from deceased transplant organ donors (n = 2) after ethical approval (REC 15/EE/0152, East of England–Cambridge South Research Ethics Committee) and informed consent from the donor families. Details of the gastrointestinal regions processed and donor information are compiled in Supplementary Table 5. Donors were perfused with cold University of Wisconsin (UW) solution, fresh tissue was collected from the distal stomach (antrum/pylorus), duodenum and terminal ileum within 1 h of circulatory arrest, and tissue was stored in HypoThermosol FRS preservation solution (H4416, Sigma) at 4 °C until processing. Intestinal tissue was open longitudinally and rinsed with D-PBS and then processed to single-cell suspensions following standard protocols^5,58. For tissues from donor A68/759B (D105), epithelium and lamina propria were separated into different fractions by dissection. Epithelial cells were removed by washing the intestinal mucosa twice in Hank’s balanced salt solution (HBSS) medium (Sigma-Aldrich) containing 5 mM EDTA (15575020, Thermo Fisher), 10 mM HEPES (42401042, Gibco), 2% (v/v) FCS supplemented with 10 mM ROCK inhibitor (Y-27632 (Y0503, Merck)) while shaking at 4 °C for 20 min. Epithelial wash-offs were centrifuged at 300g for 7 min at 4 °C and incubated at 37 °C with TrypLE (Thermo Fisher) supplemented with 0.1 mg ml⁻¹ DNase I (11284932001, Sigma) for 5 min. Cells were pelleted and filtered through a 40-μm cell strainer and resuspended in Advanced DMEM F12 (12634028, Thermo Fisher) with 10% (v/v) FCS. The remaining epithelium-depleted tissue was minced and incubated in digestion media (HBSS medium, 0.25 mg ml⁻¹ Liberase TL (5401020001, Roche) and 0.1 mg ml⁻¹ DNase I (11284932001, Sigma)) on a shaker at 37 °C for up to 45 min. The tissue was gently homogenized using a P1000 pipette every 15 min. For tissues from donor A68/770C (D99), full-thickness tissue was diced with a scalpel and digested in digestion media, as described above. Cells were pelleted and filtered through a 70-μm strainer before proceeding to Chromium 10x Genomics single cell 5′ v2 protocol as per the manufacturer’s instructions. Libraries were prepared according to the manufacturer’s protocol and sequenced on an Illumina NovaSeq 6000 S2 flow cell with 50-bp paired-end reads.

Control tissue from preterm infants

Uninvolved tissue from preterm infants, between 23 and 31 post-conception weeks (pcw), with necrotizing enterocolitis (NEC), focal intestinal perforation or intestinal fistula (n = 4) were collected at the Neonatal Department of Newcastle upon Tyne Hospitals NHS Foundation Trust with consent and ethical approval as part of the SERVIS study (REC 10/H0908/39). Tissue was resected from the infant and placed immediately into ice-cold PBS. Within 3 h, samples were enzymatically dissociated into a single-cell suspension using collagenase type IV (Worthington) for 30 min at 37 °C. Cells were filtered with 100-µm cell strainer, treated with red blood cell lysis and filtered through a 35-µm strainer. Cells were stained with DAPI before FACS sorting, selecting only for live, single cells and separating CD45-positive and CD45-negative cells. Sorted cells were then loaded onto the Chromium Controller (10x Genomics) using the Single Cell Immune Profiling kits and subsequently sequenced as per the manufacturer’s protocol.

Disease tissue from patients with Crohn’s disease, ulcerative colitis and coeliac disease

Crohn’s disease tissue used for validations was obtained from multiple sites. Adult Crohn’s disease surgical resections were collected from patients in the IBSEN III (Inflammatory Bowel Disease in South Eastern Norway) at Oslo University Hospital (n = 4) or Hospital Clinic Barcelona (n = 9), and biopsy material was collected from patients undergoing colonoscopy at Addenbrookes Hospital Cambridge (n = 4); all patients gave informed written consent. Fresh tissue was fixed in formalin and embedded in paraffin for subsequent immunostaining. Ulcerative colitis tissue was also collected from Hospital Clinic Barcelona (n = 3) during colonic resections, with the same consent and tissue processing procedure. Coeliac disease tissue was obtained from Oslo University Hospital (n = 2) or the Oxford University Hospitals NHS Foundation Trust (OUHFT) coeliac disease clinic (n = 2 treated coeliac, n = 3 untreated coeliac). As controls, healthy tissue was also collected at Oslo University Hospital from the proximal duodenum (during pancreaticoduodenectomy for patients with pancreatic cancer, n = 2) and the terminal ileum (n = 4).

Duodenal biopsies from Oslo University Hospital were collected from newly diagnosed untreated patients with coeliac disease (n = 2) and subsequently fixed in formalin and embedded in paraffin for immunostaining. Mucosal pinch biopsies from the second part of the duodenum from the OUHFT were obtained during gastroscopy of untreated patients with coeliac disease (n = 3) and treated patients with coeliac on a gluten-free diet (n = 2). Equivalent healthy control samples from the OUHFT (n = 3) were obtained from patients undergoing gastroscopy with gastrointestinal symptoms without coeliac disease. Biopsies were stored in MACS tissue storage solution (Miltenyi Biotec) before cryopreservation in freezing medium (Cryostor Cs10, Sigma-Aldrich). Samples were later recovered by thawing in a 37 °C water bath and washed in 20 ml R10 (90% RPMI (Sigma-Aldrich) and 10% FBS) before tissue dissociation. Epithelial cells were isolated using v1.11 of the published protocol (https://doi.org/10.17504/protocols.io.bcb6isre)⁶⁹. After isolation, epithelial cells proceeded to single-cell sequencing (10x Genomics Next GEM 5′ v1.1) as per the manufacturer’s protocol. Details of samples and metadata are available in Supplementary Table 4.

Ethical approval for collection of disease tissue

Tissue collected at Oslo University Hospital was approved by the Regional Committee for Medical Research Ethics (REK 20521/6544, REK 2015/946 and REK 2018/703, Health Region South-East, Norway) and comply with the Declaration of Helsinki. Tissue collected at Hospital Clinic Barcelona was approved by the Ethics Committee of Hospital Clinic Barcelona (HCB/2016/0389). Tissue from Addenbrookes Hospital was collected through the Addenbrookes–Human Research Tissue Bank HTA research licence no: 12315 (Cambridge University Hospitals Trust). Tissue collected at the OUHFT was collected under the Oxford Gastrointestinal Illnesses Biobank (REC 21/TH/0206).

Single-molecule fluorescence in situ hybridization

Intestinal tissue was embedded in OCT and frozen on an isopentane-dry ice slurry at −60 °C, and then cryosectioned onto SuperFrost Plus slides at a thickness of 10 μm. Before staining, tissue sections were post-fixed in 4% paraformaldehyde in PBS for 15 min at 4 °C, then dehydrated through a series of 50%, 70% and 100% ethanol, for 5 min each. Staining with the RNAscope Multiplex Fluorescent Reagent Kit v2 Assay (Bio-Techne, Advanced Cell Diagnostics) was automated using a Leica BOND RX, according to the manufacturers’ instructions. After manual pre-treatment, automated processing included epitope retrieval by protease digestion with Protease IV for 30 min before RNAscope probe hybridization and channel development with Opal 520, Opal 570 and Opal 650 dyes (Akoya Biosciences). Stained sections were imaged with a Perkin Elmer Opera Phenix High-Content Screening System, in confocal mode with 1-μm z-step size, using a 20× water-immersion objective (NA 0.16, 0.299 μm per pixel). Channels were: DAPI (excitation 375 nm, emission 435–480 nm), Opal 520 (excitation 488 nm, emission 500–550 nm), Opal 570 (excitation 561 nm, emission 570–630 nm) and Opal 650 (excitation 640 nm, emission 650–760 nm). The fourth channel was developed using TSA-biotin (TSA Plus Biotin Kit, Perkin Elmer) and streptavidin-conjugated Atto 425 (Sigma-Aldrich).

Immunohistochemistry

For samples collected at Oslo University Hospital, sections of formalin-fixed, paraffin-embedded tissue were cut in series at 4 µm and mounted on Superfrost Plus object glasses (Thermo Fisher Scientific). Haematoxylin–eosin staining was performed on the first sections and reviewed by an expert pathologist (F.L.J.) and the following sections were used for immunohistochemical studies. AB-PAS staining was performed by dewaxing formalin-fixed, paraffin-embedded samples and staining with Alcian blue (8GX) (AB) at pH 2.5 for acidic mucins and periodic acid-Schiff reagent (PAS) staining for neutral mucins, as previously described⁷⁰.

Multiplex immunostaining was performed sequentially using a Ventana Discovery Ultra automated slide stainer (Ventana Medical System, 750-601, Roche). After deparaffinization of the sections, heat-induced epitope retrieval was performed by boiling the sections for 48 min with cell conditioning 1 buffer (DISC CC1 RUO, 6414575001, Roche) followed by incubation with DISC inhibitor (7017944001, Roche) for 8 min. The following primary antibodies were used: anti-human MUC6 clone CLH5 dilution 1:400 (RA0224-C.1, Scytek), anti-human MUC5AC clone CLH2 dilution 1:100 (MAB2011, Sigma), anti-human CD3 rabbit polyclonal dilution 1:50 (A0452, Dako), anti-human CD8 clone 4B11 dilution 1:30 (MA1-80231, Leica Biosystems, Invitrogen), anti-human CD4 clone SP35 dilution 1:30 (MA5-16338, Thermo Fisher), anti-TCRδ clone H-41 dilution 1:100 (sc-100289, Santa Cruz Biotechnology), anti-human FOXP3 clone 236A/E7 dilution 1:1,000 (NBP-43316, Novus Biologicals), anti-human HLA-DRα-chain clone TAL.1B5 dilution 1:200 (M0746, Dako), anti-human CD68 clone PG-M1 dilution 1:100 (M0876, Dako), anti-human CD20 clone L26 dilution 1:200 (M0755, Dako), anti-human TFF2 clone #366508 dilution 1:1,000 (MAB4077, RnD), anti-human TFF3 clone BSB-181 dilution 1:1,000 (BSB-3820-01, BioSB) and anti-human pan-CK clone AE1/AE3/PCK26, ready to use reagent (RTU) (Ventana Medical System, 760–2595, Roche).

Each primary antibody was diluted in antibody diluent (5266319001, Roche), incubated for 32 min and then washed in a 1× reaction buffer (Concentrate (10X), 5353955001, Roche). OmniMap anti-mouse horseradish peroxidase (HRP) RTU (5269652001, Roche) secondary antibody was incubated for 16 min followed by 12-min incubation with diluted opal fluorophores (Opal 6-Plex Detection Kit for Whole Slide Imaging formerly Opal Polaris 7 Color IHC Automated Detection Kit NEL871001KT) following the manufacturer’s instructions. After that, bound antibodies were denatured and HRP was quenched using Ribo CC solution (DISC CC2, 5266297001, Roche) and DISC inhibitor (7017944001, Roche). Sections were then counterstained with DAPI (DISC QD DAPI RUO, 5268826001, Roche) for 8 min and mounted with ProLong Glass Antifade mountant (Molecular Probes). Imaging was performed using a Vectra Polaris multispectral whole-slide scanner (PerkinElmer). Irrelevant, concentration-matched primary antibodies were used as negative controls. For some tissue sections, bound anti-CD3, anti-CD20, anti-MUC6 and anti-MUC5AC primary antibodies were detected with secondary antibodies conjugated with peroxidase, using the automated Ventana Discovery Ultra system and DAB, purple-responsive, yellow-responsive or teal-responsive chromogens (ChromoMap DAB Detection Kit, 5266645001; DISCOVERY Purple Kit, 07053983001; DISCOVERY Yellow Kit, 07698445001; and Discovery Teal-HRP detection kit) all from Ventana Medical System.

For samples collected at Hospital Clinic Barcelona, sections of formalin-fixed, paraffin-embedded tissue were cut into 3.5-µm sections. Immunohistochemistry was conducted for the following commercially available antibodies: anti-human MUC5AC (1:4,000; MAB2011, Sigma-Aldrich) and anti-human MUC6 (1:4,000; RA0224-C.1, ScyTek). Deparaffinization, rehydration and epitope retrieval of the sections were automatedly performed with PT link (Agilent) using Envision Flex Target Retrieval Solution Low pH (Dako). Samples were blocked with 20% of goat serum (Vector) in a PBS and 0.5% BSA solution. Biotinylated anti-mouse secondary antibodies were used (1:200; Vector). Positivity was detected with the DAB Substrate kit (K3468, Dako). Image acquisition was performed on a Nikon Ti microscope (Japan) using Nis-Elements Basic Research Software (v5.30.05).

Image quantification

For quantification of T cell density in MUC6⁺ and neighbouring control epithelium, tissue sections from patients with Crohn’s disease (n = 5 sections, 3 donors) and patients with coeliac disease (n = 2 sections, 2 donors) stained with antibodies to MUC6, CD3, CD4, CD8 and TCRδ (see above) were used. Individual glands/epithelium (either MUC6⁺ or MUC6⁻) were annotated manually using PathViewer v3.4.0 freehand region-of-interest tool outlining the entire gland cross-section. We subtracted 3 × 3 pixel averages of autofluorescence measurement per channel with subtraction coefficients of: DAPI (1.5), TCRγδ (0.5), MUC6 (1.0), CD4 (0.25), CD3 (0.25) and CD8 (0.25). We next used QuPath⁷¹ v0.5 with the cellpose⁷² v2.2.3 extension to segment T cells with the ‘cyto2’ model from maximum projection of CD3, CD4, CD8 and TCRγδ, with DAPI as the nuclear marker, an expected median diameter of 10 μm and excluding cells with diameters of less than 5 μm. Segmented cells were thresholded for mean intensity expression of T cell markers by manual inspection with cut-offs of more than 25 (CD3), more than 20 (CD4), more than 10 (CD8) and more than 10 (TCRδ) and classified into subsets based on positive and negative marker expression as indicated. Using the centroid position of cells, we counted T cells per gland if the majority of the cell area was within the region of interest and quantified the T cell density per gland area comparing MUC6⁺ and control epithelium.

Data curation and mapping

Datasets (Supplementary Table 1) were chosen from a literature search of scRNA-seq studies^{5,6,7,9,19,22,23,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67}. Studies were included when there was raw scRNA-seq data (FASTQ) from human gastrointestinal tract tissues (oral cavity (excluding tongue), salivary glands, oesophagus, stomach, and small and large intestine).

Available metadata from each sample were collated from various data repositories and harmonized for consistent nomenclature. Metadata related to sample retrieval methods, tissue processing and cell enrichment methods were retrieved from the methods section of the original study. Where possible, the suggestions of sample metadata from the Gut Cell Atlas Roadmap manuscript were considered³. An explanation and overview of metadata included and harmonized in the atlas are available in Supplementary Table 2.

For public datasets deposited to ArrayExpress, archived paired-end FASTQ files were downloaded from the European Nucleotide Archive (ENA) or ArrayExpress. For public datasets deposited to the Gene Expression Omnibus (GEO), if the Sequence Read Archive (SRA) archive did not contain the barcode read, URLs for the submitted 10X bam files were obtained using srapath v2.11.0. The bam files were then downloaded and converted to FASTQ files using 10x bamtofastq v1.3.2. If the SRA archive did contain the barcode read, the SRA archives were downloaded from the ENA and converted to FASTQ files using fastq-dump v2.11.0. Sample metadata were gathered from the abstracts deposited to the GEO or ArrayExpress, and supplementary files from publications.

Following the FASTQ file generation, 10X Chromium scRNA-seq experiments were processed using the STARsolo pipeline v1.0 detailed in https://github.com/cellgeni/STARsolo. In brief, STAR v2.7.9a was used. Transcriptome reference exactly matching Cell Ranger 2020-A for human was prepared as described in the 10X online protocol (https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#header). Automated script ‘starsolo_10x_auto.sh’ was used to automatically infer sample type (3′ or 5′, 10X kit version, among others). STARsolo command optimized to generate the results maximally similar to Cell Ranger v6 was used. To this end, the following parameters were used to specify unique molecular identifiers (UMI) collapsing, barcode collapsing and read clipping algorithms: ‘–soloUMIdedup 1MM_CR –soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts –soloUMIfiltering MultiGeneUMI_CR –clipAdapterType CellRanger4 –outFilterScoreMin 30’. For cell filtering, the EmptyDrops algorithm used in Cell Ranger v4 and above was invoked using ‘–soloCellFilter EmptyDrops_CR’ options. Options ‘–soloFeatures Gene GeneFull Velocyto’ were used to generate both exon-only and full-length (pre-mRNA) gene counts, as well as RNA velocity output matrices.

Following read alignment and quantification, Cellbender v0.2.0 with default parameters was used to remove ambient RNA (soup). In cases where the model learning curve did not indicate convergence, the script was re-run with ‘–learning-rate 0.00005 –epochs 300’ parameters. For certain large datasets or datasets with low UMI counts, ‘–expected-cells’ and ‘–low-count-threshold’ parameters had to be adjusted individually for each sample.

scAutoQC

On a per sample basis, scAutoQC calculated the following metrics: logarithmized numbers of counts per cell (log1p_n_counts), logarithmized numbers of genes per cell (log1p_n_genes) and the percentages of total genes expressed that are mitochondrial genes (percent_mito), ribosomal genes (percent_ribo), haemoglobin genes (percent_hb), within the top 50 genes expressed in a given cell (percent_top50), classified as soup by CellBender (percent_soup) and spliced genes (percent_spliced) (Extended Data Fig. 2). The dimensions of these eight metrics were reduced to generate a neighbourhood graph and UMAP for each sample, which was then clustered at low resolution; these clusters are referred to as quality control (QC) clusters. Classification of cells/droplets as passing or failing QC was then performed in a two-step process, first by classifying each cell as passing or failing QC based on four-metric parameters and thresholds set by a Gaussian mixture model (GMM). For the atlas, the number of GMM components was set to 10 for an overfit model. scAutoQC was subsequently improved to automate the best model fit between 1 and 10 components based on the Bayesian information criterion. Then, whole clusters were classified as passing QC if 50% or more of individual cells within the cluster passed QC. The benefits of the approach include the automated nature, removing most manually set thresholds and limiting hands-on analysis. Our unbiased approach exploits both the distribution of individual metrics and their correlations. Although there are some parameters that are set up-front, they only serve as guidance for the final flagging of low quality cells and are not sensitive to small changes in the starting points (for example, setting an initial per cent of mitochondrial genes to 15% or 20% is likely to flag the same clusters). An overview of the pipeline is in Extended Data Fig. 2, and the code (https://github.com/Teichlab/sctk/blob/master/sctk/_pipeline.py v0.1.1) and example workflow (https://teichlab.github.io/sctk/index.html) can be found in GitHub.

Assembly of the healthy reference

After samples were run through scAutoQC, they were pooled and cells were flagged as failing QC, along with samples where less than 10% of cells or 100 cells total passed QC (18 samples). In total, we removed 596,449 (31.22%) low-quality cells during this initial filtering step. Cells were further filtered through automated doublet removal based on scrublet scores, removing a further 67,846 from the healthy reference (Extended Data Fig. 2). Cells from healthy/control samples were integrated using scVI^73,74,75 (v0.16.4) with donorID_unified as batch key, log1p_n_counts and percent_mito as continuous covariates, cell cycle genes removed and 7,500 highly variable genes. For comparison, we integrated with Harmony⁷⁶ (v0.1.7) and BBKNN⁷⁷ (v1.4.1) using donorID_unified as the batch key and ran through the standard scIB benchmarking pipeline⁷⁸ (v1.1.4), assessing batch correction metrics based on donorID_unified as batch key.

Annotations of the healthy reference

Cells from the core atlas were grouped by Scanpy (v1.8.0) leiden clustering into seven broad lineages based on marker gene expression (annotation level 1; Extended Data Fig. 1a). Each lineage was split, and reintegrated with scVI (using the settings above but selecting for 5,000 highly variable genes with lineage-dependent gene list exclusions: cell cycle genes removed for all non-epithelial subsets, ribosomal genes removed for all epithelial subsets and variable immunoglobulin genes removed for B/B plasma cells) to annotate cells at fine resolution (annotation level 3). Mesenchymal populations were further split by developmental age group (first trimester fetal, second trimester fetal/preterm and adult/paediatric). Epithelial cells were further split by gastrointestinal region and/or developmental age group (oral all ages, oesophagus all ages, stomach all ages, small intestine first trimester fetal, small intestine second trimester fetal/preterm, small intestine adult/paediatric, large intestine first trimester, large intestine second trimester fetal/preterm, large intestine adult/paediatric). For fine-grained annotations of objects by broad compartment (and age/region if applicable), a combined approach including automated annotation with leiden clustering and marker gene analysis was used. Celltypist⁵³ predicted labels were calculated for the entire core atlas using various relevant models (Cells_Intestinal_Tract v2, Immune_All_Low v2 and Pan_Fetal_Human v2 based on studies^5,15,53) and custom-label transfer models based on intestinal⁶ and salivary gland⁷⁹ datasets. During annotation, further doublets were manually removed based on a combinatorial approach considering factors such as coexpression of different cell-type marker genes, scrublet scores, gene counts, positioning relative to other cells and CellTypist predictions. Notebooks for all annotations are available via our GitHub (https://github.com/Teichlab/PanGIAtlas). MGN cells (MUC6⁺) in the healthy reference in the small intestine were identified in the healthy duodenum with leiden clustering resolution 0.5, and further refined to remove any residual doublets or MUC6⁻ cells by subclustering.

Data projection and label prediction for diseased data

To include the disease data, we started from the raw data, remapped and applied scAutoQC to the disease data, ensuring that the healthy and disease references are comparable. Models for disease projection were made on the full healthy reference dataset (without doublets) using scANVI⁸⁰ incorporating broad (level 1) annotations, based on the healthy reference scVI model. We projected disease data using scArches⁸¹ with the scANVI model. To annotate at fine resolution, we first predicted broad (level 1) lineages in the projected disease data using a label transfer method based on majority voting from k-nearest neighbour (kNN). Broad lineages were then split as for the healthy reference. For all lineages except epithelial, lineage-specific disease cells were projected onto the respective healthy reference lineage-specific latent space and fine-grained annotations predicted using the same method as for broad lineage predictions. Owing to an underrepresentation of epithelial cells, we added additional epithelial cell data from coeliac disease duodenum (unpublished data from the Klenerman laboratory (M.E.B.F., unpublished) and Crohn’s disease ileum and colon²², increasing the amount of diseased epithelial cells from 57,406 to 92,342 cells plus an additional 219,472 cells from healthy controls/non-inflamed tissue. These additional datasets were not remapped, instead these studies were added based on the raw counts matrix. Split epithelial cells from the original disease set (remapped data) and the additional disease sets (from count matrices) were concatenated and reduced to a common gene set of 18,485 genes. The resulting epithelial dataset was further split by region (stomach, small intestine and large intestine), prepared for projection using scANVI_prepare_anndata function (fills 0s for non-overlapping genes) and projected onto the respective healthy reference epithelial region-specific latent space embeddings.

To refine level 3 annotations on disease cells, we utilized the scArches weighted kNN uncertainty metric. We labelled cells as unknown if they had an ‘uncertainty score’ greater than the 90th quantile for each lineage. For epithelial cells, the 90th quantile was calculated separately for cancer cells and non-cancer cells to account for high uncertainty labelling of tumour cells. To refine the labels of these unknown cells, we performed leiden clustering (resolution = 1) and reassigned the label based on both majority voting of the higher certainty cells (above the cut-off) and marker genes. In stomach epithelium, there was one cluster of unknown cells, likely to be cancer cells, which could not be assigned a label and was therefore left annotated as unknown. In large intestinal epithelium, we found a cluster that corresponded to metaplastic Paneth cells (a cell type not present in the healthy reference), which were reannotated based on the distinct marker genes (Extended Data Fig. 4).

Technical and biological variation

To determine the contribution of different metadata covariates to the integrated embedding of the healthy reference data, we performed linear regression for each latent component of the embedding with each covariate as previously described⁸². We performed the analysis per cell type based on level_1_annot (broad level) and level_2_annot (medium level) annotations, and for all ages or adult/paediatric only (excluding developing and preterm samples). It should be noted that although this analysis can be informative, many of the covariates included in our atlas are correlated, for example, specific studies with tissue processing methods, diseases, ages or organs. Therefore, multiple covariates can explain the same variance in the data.

Differential abundance analysis

To identify differentially abundant cell populations, we used Milo⁸³ (Milopy v0.0.999), which tests for differentially abundant neighbourhoods from kNN graphs. For comparisons between healthy developing (6–31 pcw, including preterm infants ex utero) and adult/paediatric gut, Milo was run separately per tissue with more than two donors for each group (stomach, duodenum, ileum and colon) using default parameters. For comparisons between organs in the healthy adult gut (18 years of age or older), Milo was run for each organ (oral mucosa, salivary gland, oesophagus, stomach, small intestine, large intestine and mesenteric lymph node) versus the others combined with the covariates of tissue_fraction and cell_fraction_unified and otherwise default parameters. For comparisons between disease and healthy adult samples, Milo was run comparing disease and controls from an individual study, rather than all disease and controls in the atlas, on the kNN graph from joint embedding, which has been shown to have greater sensitivity for detecting disease-associated cell states⁸⁴. We focused comparing inflamed with neighbouring inflamed tissue from the Martin (2019)⁶ dataset.

Differential gene expression analysis

Differential gene expression (DGE) analysis was performed using Scanpy rank gene groups function (Wilcoxon rank-sum test with default parameters) and/or by pseudobulking (decoupler⁸⁵) and DESeq2 (ref. ⁸⁶) analysis. For Scanpy DGE analysis, samples were preprocessed by downsampling to 200 cells per cell type per donor and removing ribosomal-related and mitochondrial-related genes to limit unwanted batch and technical effects. Decoupler pseudobulking (v1.5.0) was performed combining donor-cell-type combinations, summing raw counts per gene across cells for each combination. DGE analysis was then performed with DESeq2 (v1.38.0), with log₂(fold change) (log₂FC) shrinkage calculated using the ashr (v2.2_63) estimator. Genes were classified as differentially expressed when log₂FC ≥ 0.5 or log₂FC ≤ −0.5 and adjusted P ≤ 0.05. For comparison of metaplastic Paneth cells, INFLAREs and oral mucosa fibroblasts with healthy counterparts, minimum cells per donor-cell-type combination for pseudobulking was 2 and DESeq2 was run without covariates. For oral mucosa fibroblasts, comparison was between oral mucosa fibroblasts in healthy oral mucosa versus inflammatory fibroblasts annotated as oral mucosa fibroblasts in diseased ileum. For all other comparisons, minimum cells were 10 and study was included as a covariate, and comparison was between small intestinal cells in IBD versus healthy controls. DESeq2 run on bulk data from the GSE126299 LCM dataset compared metaplastic glands and inflamed epithelium from patients with IBD using default settings, without covariates. For gene set analysis, the output from Scanpy rank gene groups was filtered to contain genes with a minimum log fold change of 0.25 and a P value cut-off of 0.05. The resulting gene list was used for gene set analysis using the GSEApy (v1.0.4) enrichr function with relevant gene sets such as MSigDB, KEGG and GO Biological Process examined. Gene scores for epithelial cells were calculated using Drug2Cell⁸⁷ score function with default parameters. Gene scores for fibroblasts were calculated using the Scanpy score_genes function with default parameters. Full gene lists used for gene scores are available in Supplementary Table 6. Odds ratio and P value of gene overlap for MGN and INFLARE marker genes in different gastrointestinal regions were calculated using GeneOverlap⁸⁸ (v0.99.0), with the genomic background set to 18,485 genes as the total number of genes used in the marker gene analysis.

Cell–cell interaction analysis

Cell–cell interaction analysis was performed using LIANA+ (v1.0.4)⁸⁹, CellChat (v1.1.1)⁹⁰ and CellPhoneDB v3 (statistical_method)⁹¹ to determine cell–cell interactions occurring in the small intestine during Crohn’s disease. Interaction analysis was performed on remapped data, to avoid loss of genes or interactions lost when merging additional count matrices (see ‘Data projection and label prediction for diseased data’ for more detail). Before analysis, data were preprocessed by downsampling to 50 cells per cell type per donor. Normalized count matrix with cell annotation metadata were processed through the standard CellChat and CellPhoneDB pipeline, with the communication probability truncated mean/threshold set to 0.1. Output of LIANA+ analysis was further analysed using NMF with ligand–receptor mean expression and considering only interactions expressed in at least 5% of cells. This analysis resulted in 10 interaction programmes by an automatic elbow selection procedure. Pathway enrichment analysis on the resultant ligand–receptor loadings was performed using decoupler’s univariate linear model method with pathway prior knowledge from PROGENy⁹²; only factors in which at least one pathway was significantly enriched (false discovery rate ≤ 0.05) were included for analysis. Using the differential analysis statistics from DESeq2, as described above, we generated a list of deregulated ligand–receptor interactions in IBD versus healthy, or for INFLAREs and oral mucosa fibroblasts, comparing the disease cells to the appropriate healthy counterparts (see above).

cNMF analysis

To identify shared activity and cell identity gene programs cells from diseased small intestine (Crohn’s disease, paediatric IBD and coeliac disease with a total of 99,465 cells), we analysed raw counts with cNMF (v1.3.4)⁹³. We used the default processing and normalization of cNMF, which considers 2,000 highly variable genes along with 100 iterations of NMF. All other parameters were set at default values. We tested hyperparameter values of K, the number of factors, ranging in steps of 1 from 5 to 80, and picked on inspection a favourable tradeoff between factor stability and overall model error at K = 44. For determining consensus clusters, we excluded 6% of fitted cNMF spectra with a mean distance to kNNs above 0.3. The resulting per-cell gene program usage was compared across fine-grained cell annotations, identifying gene programs corresponding to the identity of MGN cells and other relevant cell types (goblet, stem and surface foveolar cells). To assess programs specific to health or disease, we performed analysis on all cells from small and large intestines using identical parameters, downsampled randomly to 200 cells per cell type per donor (resulting in 313,879 cells). In this case, we tested for values of K in steps of 2 from 10 to 80, choosing an optimal K = 64.

Trajectory analysis

Monocle3

To infer the developmental trajectory giving rise to MGN or INFLAREs in the ileum IBD, we used monocle3 (v1.3.1)⁹⁴ on a subset of data containing cells in the ileum from studies^5,6,22. We performed Scanpy Louvain clustering on the original UMAP representation generated from the scANVI latent space to account for batch effects and inferred developmental trajectories along pseudotime by choosing the node assigned the highest number of epithelial stem cells as the root node. We then extracted the MGN or INFLARE-specific trajectory by selecting the nodes assigned the highest number of MGN or INFLAREs as the final nodes. Finally, we determined genes whose expression changes along pseudotime by using ‘monocle3::graph_test’, which leverages a Moran’s I-test considering gene expression changes within groups of k = 25 neighbouring cells on the principle trajectory graph.

Palantir

We analysed epithelial cell trajectories in the ileum from patients with IBD from studies^5,6 by running transcriptome-based pseudotime estimation using Palantir (v1.3.1)⁹⁵. Before running Palantir, we reintegrated the datasets using scVI with settings described above in ‘Assembly of the healthy reference’.

We used the default Palantir parameters with 500 waypoints specifying the root cell with the maximum gene score (using Scanpy rank genes function) of LGR5, ASCL2, RGMB and OLFM4. We then computed a CellRank⁹⁶ (v2.0.1) kernel (Markov transition probability matrix) for Palantir pseudotime to allow projection of directional cell-state transitions onto the UMAP. To predict macrostates (potential terminal cell states), we ran CellRank’s Generalized Perron Cluster Analysis on the Markov matrix and then computed the fate probability for each cell under each terminal-state lineage. We calculated the top lineage driver genes along the stem → TA → INFLARE lineage using CellRank inference and generalized additive models. All corresponding visualizations were made using the plotting functions available in the CellRank package.

Genes2Genes trajectory alignment

We used Genes2Genes (G2G)⁶⁸ to compare the INFLARE trajectory (stem → TA → INFLARE) in the diseased IBD to three other different trajectories: (1) the stem → MGN trajectory in the healthy duodenum, (2) the stem → enterocyte trajectory in the diseased ileum, and (3) the stem → goblet trajectory in the diseased ileum.

Preparing trajectories for comparison. For comparison 1, we ran scVI integration and Palantir pseudotime analysis as above for healthy small intestinal epithelial cells to facilitate reconstruction of the stem → MGN trajectory in the healthy duodenum. To be more confident, we also took only the stem and TA cells that have a pseudotime estimate less than the mean pseudotime of the INFLARE population (as there were some outlier stem/TA cells with higher pseudotime values in the INFLARE pseudotime range). For comparisons 2 and 3, we used the already estimated Palantir pseudotime. To extract lineage-specific cells with high confidence, we assessed the fate probability distribution (estimated by Palantir) for the INFLARE lineage across all the cells annotated under the non-lineage-specific cell types (that is, a negative control under the cells not annotated as either stem, TA or INFLARE), and removed the stem and TA cells if their fate probability was less than the 75th percentile of the negative control.

Trajectory alignment. G2G aligns genes along reference and query trajectories by running a dynamic programming algorithm that optimizes matching and mismatching of gene expression distributions between timepoints. This function formulates an alignment cost based on a minimum message length inference framework. As per the G2G workflow, we first discretized each pseudotime trajectory into interpolation timepoints at equal-length intervals based on the optimal number of bins inferred using the optbinning package. We then ran G2G (under its default settings) for each of the three trajectory comparisons to align transcription factors⁹⁷ using log₁p normalized gene expression and pseudotime estimates for each cell. For comparison 1, we considered 1,171 transcription factors common between the healthy and disease datasets, whereas 1,262 common transcription factors were aligned for comparisons 2 and 3. Interrogating the output of G2G alignment, we considered mismatches between trajectories when transcription factors had an alignment similarity ≤ 50% and optimal alignment cost ≥ 30 nits (in the unit of Shannon information).

Bulk RNA-seq deconvolution

For bulk deconvolution analysis, we first downloaded published bulk RNA-seq datasets of adult IBD from the GEO database (GSE111889), paediatric IBD from the ArrayExpress database (E-MTAB-5464) and the Expression Atlas (E-GEOD-101794), The Cancer Genome Atlas colon adenocarcinoma using R package TCGAbiolinks (v2.18.0), coeliac disease data from the GEO (GSE131705 and GSE145358) and RNA-seq from laser capture microdissected pyloric metaplasia, inflamed and control epithelium (GSE126299). A single-cell reference for deconvolution analysis was then prepared by subsetting the overall object to only include cells from the small intestine in IBD and downsampling to 200 cells for each fine-grained cell-type annotation. BayesPrism⁹⁸ (v2.0) was used for deconvolution analysis with raw counts for both single-cell and bulk RNA-seq data as inputs. Both the ‘cell-type labels’ and the ‘cell-state labels’ were set to fine-grained annotations. Ribosomal protein genes and mitochondrial genes were removed from single-cell data as they are not informative in distinguishing cell types and can be a source of large spurious variance. We also excluded genes from sex chromosomes and lowly transcribed as recommended by the BayesPrism tutorial. For further analysis, we applied a pairwise Welch t-test to select differentially expressed genes with the ‘pval.max’ being set to 0.05 and ‘lfc.min’ to 0.1. Finally, a prism object containing all data required for running BayesPrism was created using the new.prism() function, and the deconvolution was performed using the run.prism() function. For correlation analysis, we calculated the Pearson correlation between (1) the estimated abundance of INFLAREs and other cell types, and (2) the estimated INFLAREs abundance and gene expression in bulk RNA-seq datasets. For the later calculation, we first normalized raw counts in the expression matrix from each bulk dataset using R package DESeq2. To estimate the number of patients with INFLAREs in bulk RNA-seq data, we categorized samples by MUC6 expression with a cut-off higher than the mean + 2× the standard deviation, stratifying patients as MUC6-high above this cut-off.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

[ad_2]

Source link

November 20, 2024
An integrated transcriptomic cell atlas of human neural organoids

[ad_1]

Metadata curation and harmonization of human neural organoid scRNA-seq datasets

We included 33 human neural organoid data from a total of 25 publications^{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26} plus three unpublished datasets in our atlas (Supplementary Table 1). We curated all neural organoid datasets used in this study through the sfaira⁷⁸ framework (GitHub dev branch, 18 April 2023). For this, we obtained scRNA-seq count matrices and associated metadata from the location provided in the data availability section for every included publication or directly from the authors in case of unpublished data. We harmonized metadata according to the sfaira standards (https://sfaira.readthedocs.io/en/latest/adding_datasets.html) and manually curated an extra metadata column organoid_age_days, which described the number of days the organoid had been in culture before collection.

We next removed any non-applicable subsets of the published datasets: diseased samples or samples expressing disease-associated mutations (refs. ^{14,15,16,18,19,26}), fused organoids (ref. ¹), primary fetal data (refs. ^10,23), hormone-treated samples (ref. ²²), data collected before neural induction (refs. ^3,20) and share-seq data (ref. ²³). We harmonized all remaining datasets to a common feature space using any genes of the biotype ‘protein_coding’ or ‘lncRNA’ from ensembl⁷⁹ release 104 while filling any genes missing in a dataset with zero counts. On average, 50% of the full gene space (36,842 genes) was reported in each of the constituent datasets. We then concatenated all remaining datasets to create a single AnnData⁸⁰ object.

Preprocessing of the HNOCA scRNA-seq data

All processing and analyses were carried out using scanpy⁸¹ (v.1.9.3) unless indicated otherwise. For quality control and filtering of HNOCA, we removed any cells with fewer than 200 genes expressed. We next removed outlier cells in terms of two quality control metrics: the number of expressed genes and percentage mitochondrial counts. To define outlier cells on the basis of each quality control metric, z-transformation is first applied to values across all cells. Cells with any z-transformed metric less than −1.96 or greater than 1.96 are defined as outliers. For any dataset collected using the v.3 chemistry by 10X Genomics, which contains more than 500 cells after the filtering, we fitted a Gaussian distribution to the histogram denoting the number of expressed genes per cell. If a bimodal distribution was detected, we removed any cell with fewer genes expressed than defined by the valley between the two maxima of the distribution. We then normalized the raw read counts for all Smart-seq2 data by dividing it by the maximum gene length for each gene obtained from BioMart. We next multiplied these normalized read counts by the median gene length across all genes in the datasets and treated those length-normalized counts equivalently to raw counts from the datasets obtained with the help of unique molecular identifiers in our downstream analyses.

As a next step we generated a log-normalized expression matrix by first dividing the counts for each cell by the total counts in that cell and multiplying by a factor of 1,000,000 before taking the natural logarithm of each count + 1. We computed 3,000 highly variable features in a batch-aware manner using the scanpy highly_variable_genes function (flavor = ‘seurat_v3’, batch_key = ‘bio_sample’). Here, bio_sample represents biological samples as provided in the original metadata of the datasets. On average, 72% of the 3,000 highly variable genes were reported in each of the constituent HNOCA datasets. We used these 3,000 features to compute a 50-dimensional representation of the data using principal component analysis (PCA), which in turn we used to compute a k-nearest-neighbour (kNN) graph (n_neighbors = 30, metric = ‘cosine’). Using the neighbour graph we computed a two-dimensional representation of the data using UMAP⁸² and a coarse (resolution 1) and fine (resolution 80) clustering of the unintegrated data using Leiden⁸³ clustering.

Hierarchical auto-annotation with snapseed

Snapseed is a scalable auto-annotation strategy, which annotates cells on the basis of a provided hierarchy of cell types and the corresponding cell type markers. It is based on enrichment of marker gene expression in cell clusters (high-resolution clustering is preferred), and data integration is not necessarily required.

In this study, we used snapseed to obtain initial annotations for label-aware integration. First, we constructed a hierarchy of cell types including progenitor, neuron and non-neural types, each defined by a set of marker genes (Supplementary Data 1). Next, we represented the data by the RSS³ to average expression profiles of cell clusters in the recently published human developing brain cell atlas²⁷. We then constructed a kNN graph (k = 30) in the RSS space and clustered the dataset using the Leiden algorithm⁸³ (resolution 80). For both steps, we used the graphical processing unit (GPU)-accelerated RAPIDS implementation that is provided through scanpy^81,84.

For all cell type marker genes on a given level in the hierarchy, we computed the area under the receiver operating characteristic curve (AUROC) as well as the detection rate across clusters. For each cell type, a score was computed by multiplying the maximum AUROC with the maximum detection rate among its marker genes. Each cluster was then assigned to the cell type with the highest score. This procedure was performed recursively for all levels of the hierarchy. The same procedure was carried out using the fine (resolution 80) clustering of the unintegrated data to obtain cell type labels for the unintegrated dataset that were used downstream as a ground-truth input for benchmarking integration methods.

This auto-annotation strategy was implemented in the snapseed Python package and is available on GitHub (https://github.com/devsystemslab/snapseed). Snapseed is a light-weight package to enable scalable marker-based annotation for atlas-level datasets in which manual annotation is not readily feasible. The package implements three main functions: annotate() for non-hierarchical annotation of a list of cell types with defined marker genes, annotate_hierarchy() for annotating more complex, manually defined cell type hierarchies and find_markers() for fast discovery of cluster-specific features. All functions are based on a GPU-accelerated implementation of AUROC scores using JAX (https://github.com/google/jax).

Label-aware data integration with scPoli

We performed integration of the organoid datasets for HNOCA using the scPoli⁴⁵ model from the scArches⁵¹ package. We defined the batch covariate for integration as a concatenation of the dataset identifier (annotation column ‘id’), the annotation of biological replicates (annotation column ‘bio_sample’) as well as technical replicates (annotation column ‘tech_sample’). This resulted in 396 individual batches. The batch covariate is represented in the model as a learned vector of size five. We used the top three levels of the RSS-based snapseed cell type annotation as the cell type label input for the scPoli prototype loss. We chose the hidden layer size of the one-layer scPoli encoder and decoder as 1,024, and the latent embedding dimension as ten. We used a value of 100 for the ‘alpha_epoch_anneal’ parameter. We did not use the unlabelled prototype pretraining. We trained the model for a total of seven epochs, five of which were pretraining epochs.

Benchmark of data integration methods

To quantitatively compare the organoid atlas integration results from several tools, we used the GPU-accelerated scib-metrics^46,85 Python package (v.0.3.3) and used the embedding with the highest overall performance for all downstream analyses. We compared the data integration performance across the following latent representations of the data: unintegrated PCA, RSS³ integration, scVI⁴⁹ (default parameters except for using two layers, latent space of size 30 and negative binomial likelihood) integration, scANVI⁵⁰ (default parameters) integrations using snapseed level 1, 2 or 3 annotation as cell type label input, scPoli⁴⁵ (parameters shown above) integrations using either snapseed level 1, 2 or 3 annotation or all three annotation levels at once as the cell type label input, scPoli⁴⁵ integrations of metacells aggregated with the aggrecell algorithm (first used as ‘pseudocell’³) using either snapseed level 1 or 3 annotation as the cell type label input to scPoli. We used the following scores for determining integration quality (each described in ref. ⁴⁶): Leiden normalized mutual information score, Leiden adjusted rand index, average silhouette width per cell type label, isolated label score (average silhouette width-scored) and cell type local inverse Simpson’s index to quantify conservation of biological variability. To quantify batch-effect removal, we used average silhouette width per batch label, integration local inverse Simpson’s index, kNN batch-effect test score and graph connectivity. Integration approaches were then ranked by an aggregate total score of individually normalized (into the range of [0,1]) metrics. Before we carried out the benchmarking, we iteratively removed any cells from the dataset that had an identical latent representation to another cell in the dataset until no latent representation contained any more duplicate rows. This procedure removed a total of 3,293 duplicate cells (0.002% of the whole dataset) and was required for the benchmarking algorithm to complete without errors. We used the snapseed level 3 annotation computed on the unintegrated PCA embedding as ground-truth cell type labels in the integration.

Pseudotime inference

To infer a global ordering of differentiation state, we sought to infer a real-time-informed pseudotime on the basis of neural optimal transport⁴⁷ in the scPoli latent space. We first grouped organoid age in days into seven bins ((0, 15], (15, 30], (30,60], (60, 90], (90, 120], (120, 150], (150, 450]). Next, we used moscot⁴⁸ to solve a temporal neural problem. To score the marginal distributions on the basis of expected proliferation rates, we obtained proliferation and apoptosis scores for each cell with the method score_genes_for_marginals(). Marginal weights were then computed with

$$\exp (4\times ({\rm{prolif\_score}}-{\rm{apoptosis\_score}}))$$

The optimal transport problem was solved using the following parameters: iterations = 25,000, compute_wasserstein_baseline = False, batch_size = 1,024, patience = 100, pretrain = True, train_size = 1. To compute displacement vectors for each cell in age bin i, we used the subproblem corresponding to the [i, i + 1] transport map, except for the last age bin, where we used the subproblem [i − 1,i]. Displacement vectors were obtained by subtracting the original cell distribution from the transported distribution. Using the velocity kernel from CellRank⁸⁶ we computed a transition matrix from displacement vectors and used it as an input for computing diffusion maps⁸⁷. Ranks on negative diffusion component 1 were used as a pseudotemporal ordering.

Preprocessing of the human developing brain cell atlas scRNA-seq data

The cell ranger-processed scRNA-seq data for the primary atlas²⁷ were obtained from the link provided on its GitHub page (https://storage.googleapis.com/linnarsson-lab-human/human_dev_GRCh38-3.0.0.h5ad). For further quality control, cells with fewer than 300 detected genes were filtered out. Transcript counts were normalized by the total number of counts for that cell, multiplied by a scaling factor of 10,000 and subsequently natural-log transformed. The feature set was intersected with all genes detected in the organoid atlas and the 2,000 most highly variable genes were selected with the scanpy function highly_variable_genes using ‘Donor’ as the batch key. An extra column of ‘neuron_ntt_label’ was created to represent the automatic classified neural transmitter transporter subtype labels derived from the ‘AutoAnnotation’ column of the cell cluster metadata (https://github.com/linnarsson-lab/developing-human-brain/files/9755350/table_S2.xlsx).

Reference mapping of the organoid atlas to the primary atlas

To compare our organoid atlas with data from the primary developing human brain, we used scArches⁵¹ to project it to the above mentioned primary human brain scRNA-seq atlas²⁷. We first pretrained a scVI model⁴⁹ on the primary atlas with ‘Donor’ as the batch key. The model was constructed with following parameters: n_latent = 20, n_layers = 2, n_hidden = 256, use_layer_norm = ‘both’, use_batch_norm = ‘none’, encode_covariates = True, dropout_rate = 0.2 and trained with a batch size of 1,024 for a maximum or 500 epochs with early stopping criterion. Next, the model was fine-tuned with scANVI⁵⁰ using ‘Subregion’ and ‘CellClass’ as cell type labels with a batch size of 1,024 for a maximum of 100 epochs with early stopping criterion and n_samples_per_label = 100. To project the organoids atlas to the primary atlas, we used the scArches⁵¹ implementation provided by scvi-tools^88,89. The query model was fine-tuned with a batch size of 1,024 for a maximum of 100 epochs with early stopping criterion and weight_decay = 0.0.

Bipartite weighted kNN graph reconstruction

With the primary reference²⁷ and query (HNOCA) data projected to the same latent space, an unweighted bipartite kNN graph was constructed by identifying 100 nearest neighbours of each query cell in the reference data with either PyNNDescent or RAPIDS-cuML (https://github.com/rapidsai/cuml) in Python, depending on availability of GPU acceleration. Similarly, a reference kNN graph was also built by identifying 100 nearest neighbours of each reference cell in the reference data. For each edge in the reference-query bipartite graph, the similarity between the reference neighbours of the two linked cells, defined as A and B, respectively, is represented by the Jaccard index:

$$J(A,B)=\frac{| A\cap B| }{| A\cup B| }.$$

The square of Jaccard index was then assigned as the weight of the edge, to get the bipartite weighted kNN graph between the reference and query datasets.

wkNN-based primary developing brain atlas label transfer to HNOCA cells

Given the wkNN estimated between primary reference²⁷ and query (HNOCA), any categorical metadata label of reference can be transferred to query cells by means of majority voting. In brief, for each category, its support was calculated for each query cell as the sum of weights of edges that link to reference cells in this category. The category with the largest support was assigned to the query cell.

To get the final regional labels for the non-telencephalic NPCs and neurons, as well as the NTT labels for non-telencephalic neurons, constraints were added to the transfer procedure. For regional labels, only the non-telencephalic regions, namely diencephalon, hypothalamus, thalamus, midbrain, midbrain dorsal, midbrain ventral, hindbrain, cerebellum, pons and medulla, were considered valid categories to be transferred. The label-transfer procedure was only applied to the non-telencephalic NPCs and neurons in HNOCA. Before any majority voting was done, the support scores of each valid category across all non-telencephalic NPCs and neurons in HNOCA were smoothed with a random-walk-with-restart procedure (restart probability alpha, 85%). Next, a hierarchical label transfer, which takes into account the structure hierarchy, was applied. First, the considered regions were grouped into diencephalon, midbrain and hindbrain, with a support score of each structure as its score summed up with scores of its substructures. Majority voting was applied to assign each cell to one of the three structures. Next, a second majority voting was applied to only consider the substructures under the assigned structure (for example, hypothalamus and thalamus for diencephalon).

For NTT labels, we first identified valid region-NTT label pairs in the reference on the basis of the provided NTT labels in the reference neuroblast and neuron clusters and their most common regions. Here, the most common regions were re-estimated in a hierarchical manner to the finest resolution mentioned above. Next, when transferring NTT labels, for each non-telencephalic neuron with the regional label transferred, only NTT labels that were considered valid for the region were considered during majority voting.

Stage-matching analysis

To match telencephalic NPCs and neurons in HNOCA to developmental stages, we used the recently published human neocortical development atlas³⁰ as the reference. The processed single nucleus RNA-seq data were obtained from its data portal (https://cell.ucsf.edu/snMultiome/). Given the ‘class’, ‘subclass’ and ‘type’ labels in the provided metadata as annotations, and ‘individual’ as the batch label, scPoli was applied for label-aware data integration. Next, data representing different developmental stages were split. For each stage, Louvain clustering based on the scPoli latent representation (resolution, 5) was applied. Clusters of all stages were pooled, and highly variable genes were identified on the basis of coefficient of variations as described in this page: https://pklab.med.harvard.edu/scw2014/subpop_tutorial.html. Finally, every one of HNOCA telencephalic NPCs and neurons were correlated to each cluster across the identified highly variable genes. The stage label of the best-correlated cluster was assigned to the query HNOCA cell.

To extend the analysis to other neuronal cell types, the second-trimester multiple-region human brain atlas²⁹ was also introduced. The processed count matrices and metadata were obtained from the NeMO data portal (https://data.nemoarchive.org/biccn/grant/u01_devhu/kriegstein/transcriptome/scell/10x_v2/human/processed/counts/). Given the ‘cell_type’ label of the provided metadata as the annotation and ‘individual’ as the batch label, scPoli was run for label-aware data integration. Louvain clustering was applied to the scPoli latent representation to identify clusters (resolution, 20). Similarly, Louvain clustering with a resolution of 20 was also applied to the first-trimester multiple-region human brain atlas²⁷ on the basis of the scANVI latent representation we generated earlier. Average expression profiles were calculated for all the clusters, and highly variable genes were identified using the same procedure as above for clusters of the two primary atlases combined. Next, every NPC and neuron in HNOCA was correlated to the average expression profiles of those clusters. The best-correlated first- and second-trimester clusters, as well as the correlations, were identified. The differences between the two correlations were used as the metrics to indicate the stage-matching preferences of NPCs and neurons in HNOCA.

Presence scores and max presence scores of cells in the primary developing brain atlas

Given a reference dataset and a query dataset, the presence score is a score assigned to each cell in the reference, which describes the frequency or likelihood of the cell type or state of that reference cell appearing in the query data. In this study, we calculated the presence scores of primary atlas cells in each HNOCA dataset to quantify how frequently we saw a cell type or state represented by each primary cell in each of the HNOCA datasets.

Specifically, for each HNOCA dataset, we first subset the wkNN graph to only HNOCA cells in that dataset. Next, the raw weighted degree was calculated for each cell in the primary atlas, as the sum of weights of the remaining edges linked to the cell. A random-walk-with-restart procedure was then applied to smooth the raw scores across the kNN graph of the primary atlas. In brief, we first represented the primary atlas kNN graph as its adjacency matrix (A), followed by row normalization to convert it into a transition probability matrix (P). With the raw scores represented as a vector s₀, in each iteration t, we generated s_t as

$${s}_{t}=\alpha {{\bf{s}}}_{0}+(1-\alpha ){P}^{T}{s}_{t-1}$$

This procedure was performed 100 times to get the smooth presence scores that were subsequently log transformed. Scores lower than the 5th percentile or higher than the 95th percentile were trimmed. The trimmed scores were normalized into the range of [0,1] as the final presence scores in the HNOCA dataset.

Given the final presence scores in each of the HNOCA datasets, the max presence scores in the whole HNOCA data were then easily calculated as the maximum of all the presence scores for each cell in the primary atlas. A large (close to one) max presence score indicates a high frequency of appearance for the cell type or state in at least one HNOCA dataset whereas a small (close to zero) max presence score suggests under-representation in all the HNOCA datasets.

Cell type composition comparison among morphogen usage using scCODA

To test the cell type compositional changes on admission of certain morphogens from different organoid differentiation protocols, we used the pertpy⁹⁰ implementation of the scCODA algorithm⁹¹. scCODA is a Bayesian model for detecting compositional changes in scRNA-seq data. For this, we have extracted the information about the added morphogens from each differentiation protocol and grouped them into 15 broad molecule groups on the basis of their role in neural differentiation (Supplementary Table 1). These molecule groups were used as a covariate in the model. The region labels transferred from the primary atlas were used as labels in the analysis (cell_type_identifier). For cell types without regional identity, the cell type labels presented in Fig. 1c were used. Pluripotent stem cells and neuroepithelium cells were removed from the analysis because they are mainly present in the early organoid stages. We used bio_sample as the sample_identifier. We ran scCODA sequentially with default parameters, using No-U-turn sampling (run_nuts function) and selecting each cell type once as a reference. We used a majority vote-based system to find the cell types that were credibly changing in more than half of the iterations.

Cell type composition comparison among morphogen usage using regularized linear regression

To complement the composition analysis conducted with scCODA, we devised an alternative approach to test for differential composition using regularized linear regression. We fit a generalized linear model with the region composition matrix as the response Y and molecule usage as independent variables X:

$$Y \sim X{\boldsymbol{\beta }}$$

The model was fit with lasso regularization (alpha = 1) using Gaussian noise and an identity link function. The regularization parameter lambda was automatically determined through cross-validation as implemented in the function cv.glmnet() from the glmnet⁹² R package. All non-zero coefficients β were considered as indications of enrichment and depletion.

DE analysis between HNOCA neural cell types and their primary counterparts and functional enrichment analysis

To study the transcriptomic differences between organoid and primary cells, we subset HNOCA using the final level 1 annotation to cells labelled ‘Neuron’. We furthermore subset the human developing brain atlas to cells that had been assigned a valid label in the neuron_ntt_label annotation column. We added an extra two datasets of fetal cortical cells from ref. ³⁹ and ref. ²⁸. For the data from ref. ³⁹, we subset the data to cells labelled ‘fetal’ and estimated transcripts per million reads for each gene in each cell using RSEM⁹³ given the STAR⁹⁴ mapping results. We then computed a PCA, a kNN graph, UMAP and Leiden clustering (resolution 0.2) using scanpy. We then selected the cluster with the highest STMN2 and NEUROD6 expression as the cortical neuron cluster and used only those cells. For the data from ref. ²⁸ we subset the datasets to cells annotated as ‘Neuronal’ in Supplementary Table 5 (‘Cortex annotations’) of their publication and computed a PCA, neighbourhood graph and UMAP to visualize the dataset. We found that only samples from the individuals CS14_3, CS20, CS22 and CS20 contained detectable expression of STMN2 and NEUROD6 so we subset the dataset further to only cells from those individuals.

To compute DE between HNOCA cells and their primary counterparts, we first aggregated cells of the same regional neural cell type into pseudobulk samples by summing the counts for every sample (annotation columns, ‘batch’ for HNOCA; ‘SampleID’ for the human developing brain atlas; ‘sample’ for ref. ³⁹ and ‘individual’ for ref. ²⁸) using the Python implementation of decoupler⁹⁵ (v.1.4.0) while discarding any samples with fewer than ten cells or 1,000 total counts. We then subsetted the feature space to the intersection of features of all datasets and removed any cells with fewer than 200 genes expressed. We further removed any genes expressed in less than 1% of neurons in HNOCA and any genes located on the X and Y chromosomes. Out of the remaining 11,636 genes, on average, 99% were reported in each of the constituent HNOCA datasets. For each regional neural cell type, we removed any sample from the pseudobulk data that was associated with an organoid differentiation assay with fewer than two total samples or fewer than 100 total cells. We next used edgeR⁹⁶ to iteratively compute DE genes between each organoid differentiation protocol and primary cells of the matching regional neural cell types for every regional neural cell type while correcting for organoid age in days, number of cells per pseudobulk sample, median and standard deviation of the number of detected genes per pseudobulk sample. We used the data from ref. ²⁷ (the human developing brain atlas mentioned above), ref. ²⁸ and ref. ³⁹ as primary data for the DE comparison in the cell type ‘Dorsal Telencephalic Neuron NT-VGLUT’, whereas for all other cell types we used the human developing brain atlas as the fetal dataset. We used the edgeR genewise negative binomial generalized linear model with quasi-likelihood F-tests. We deemed a gene significantly DE if its false-discovery rate (Benjamini–Hochberg) corrected P value was smaller than 0.05 and it had an absolute log₂-fold change above 0.5. We used the GSEApy⁹⁷ Python package to carry out functional enrichment analysis in our DE results using the ‘GO_Biological_Process_2021’ gene set.

To evaluate the effect of different primary datasets on the DE results, we computed the DE between Dorsal Telencephalic Neuron NT-VGLUT from the HNOCA subset generated with the protocol from ref. ⁶ and the matching cell type from the Braun et al.²⁷ primary dataset as well as the data from ref. ²⁸. To prevent technology effects to affect this analysis, we only used cells generated with the 10X Genomics 3′ v.2 protocol in this comparison. We generate pseudobulk samples as described above and corrected organoid age in days and number of cells per pseudobulk sample in the DE comparison. We used the same edgeR-based procedure and cut-offs as described above. We used the scipy fcluster method to cluster genes on the basis of their log-fold changes in the two primary datasets. We grouped clusters to represent consistently upregulated, consistently downregulated and three different inconsistently regulated groups of genes. We computed functional enrichment of each gene group as described above.

To evaluate the effect of different organoid datasets on the protocol-based DE analysis, we computed DE between Dorsal Telencephalic Neuron NT-VGLUT of every organoid publication (further split by protocol, where more than one protocol was used in a publication) and the matching cell type in the dataset from ref. ²⁷. We computed pseudobulk samples and carried out the DE analysis using the same procedure and cut-offs as in the protocol-based DE analysis.

Transcriptomic similarity between HNOCA neural cell types and their primary counterparts in the human developing brain atlas

To estimate the transcriptomic similarity between neurons in HNOCA and the human developing brain atlas²⁷, we first summarized the average expression of each neural cell type in the primary reference, as well as in each dataset of HNOCA. For each HNOCA dataset, only neural cell types with at least 20 cells were considered. Highly variable genes were identified across the neural cell types in the primary reference using a Chi-squared test-based variance ratio test on the generalized linear model with Gamma distribution (identity link), given coefficient of variance of transcript counts across neural cell types as the response and the reciprocal of average transcript count across neural cell types as the independent variable. Genes with Benjamini–Hochberg adjusted P values less than 0.01 were considered as highly variable genes. Similarity between one neural cell type in the primary atlas and its counterpart in each HNOCA dataset was then calculated as the Spearman correlation coefficient across the identified highly variable genes.

To estimate the similarity of the core transcriptomic identity, which is defined by the coexpression of transcription factors, the highly variable genes were subset to only transcription factorsfor calculating Spearman correlations. The list of transcription factors was retrieved from the AnimalTFDB v.4.0 database⁹⁸.

To identify metabolically stressed cells in the datasets, we used the scanpy score_genes function with default parameters to score the ‘canonical glycolysis’ gene set obtained from the enrichR GO_Biological_Process_2021 database across all neuronal cells from HNOCA and refs. ^27,28,39.

To estimate the significance of the difference between the correlation of glycolysis scores and whole transcriptomic similarities, and the correlation of glycolysis scores and core transcriptomic identity similarities, we generated 100 subsets of highly variable genes, each with the same size as the highly variable transcription factor. Transcriptomic similarities were calculated on the basis of those subsets, and then correlated with the glycolysis scores.

Heterogeneity of the telencephalic trajectories

To characterize heterogeneity of telencephalic NPCs and neurons in HNOCA, we first transferred the cell type labels (as indicated as the ‘type’ label in the given metadata) from the human neocortical development atlas to the HNOCA telencephalic NPCs, intermediate progenitor cells and neurons, on the basis of transcriptomic correlation. In brief, each primary atlas cluster we obtained as mentioned above was assigned to a cell type as the most abundant cell type among cells in the cluster. The label of the best-correlated primary cluster was then transferred to every query cell. Given the transferred label, together with the level 2 cell type annotation shown in Fig. 1c, as the annotation label, scPoli was applied to the telencephalic subset of HNOCA for data integration.

To benchmark how well different integration strategies recover the neuron subcell type heterogeneity, we generated four different clustering labels: (1) Louvain clustering (resolution, 2) with the original scPoli latent representation; (2) Louvain clustering (resolution, 2) with the updated scPoli representation; (3) Louvain clustering (resolution, 2) with PCA of HNOCA telencephalic subset (based on scaled expression of 3,000 highly variable genes of the telencephalic subset with flavor = ‘seurat’) and (4) Louvain clustering (resolution, 1) for each sample separately (each with 3,000 highly variable genes identified with flavor = ‘seurat’, followed by data scale and PCA). Next, for each sample with at least 500 dorsal telencephalic neurons, the adjusted mutual information scores were calculated between each of those four clustering labels with the transferred cell type label mentioned above as the gold standard, across the dorsal telencephalic neurons as annotated as the level 2 annotation.

To create a comprehensive primary atlas of dorsal telencephalic neurons for DE analysis between neural organoids and primary tissues, we subset dorsal telencephalic neurons or neocortical neurons from four different primary atlases^27,28,29,30. For ref. ²⁸, cells in five author-defined clusters (60, 57, 79, 45, 65) with high expression of MAP2, DCX and NEUROD6 were selected. For ref. ²⁹, cells with the following ‘clusterv2 – final’ labels were selected: ‘Neuron_28’, ‘Neuron_34’, ‘GW19_2_29NeuronNeuron’, ‘Neuron_30’, ‘Neuron_66Neuron’, ‘GW18_2_42NeuronNeuron’, ‘Neuron_33’, ‘Neuron_39Neuron’, ‘Neuron_35’, ‘Neuron_63Neuron’, ‘Neuron_9’, ‘Neuron_11’, ‘Neuron_20’, ‘Neuron_22’, ‘Neuron_5Neuron’, ‘Neuron_21’, ‘Neuron_18’, ‘Neuron_101Neuron’, ‘Neuron_17’, ‘Neuron_19’, ‘Neuron_16’, ‘Neuron_50Neuron’, ‘Neuron_12’, ‘Neuron_13’, ‘Neuron_68Neuron’, ‘Neuron_100Neuron’, ‘Neuron_25’, ‘Neuron_27’, ‘Neuron_53Neuron’, ‘Neuron_23’, ‘Neuron_26’, ‘Neuron_24’, ‘Neuron_102Neuron’, ‘Neuron_72Neuron’, ‘Neuron_15’, ‘Neuron_29’ and ‘Neuron_35Neuron’ on the basis of their high expression of NEUROD6 and FOXG1. For ref. ²⁷, cells dissected from dorsal telencephalon that were annotated as neurons with and only with the VGLUT NTT label were selected. For ref. ³⁰, cells annotated as excitatory neurons were selected. The curated clusters of the Wang et al. primary atlas, as described earlier, were also subset to those with excitatory neuron labels. The selected dorsal telencephalic neuron subsets of the atlases were merged into the joint neocortical neuron atlas.

Next, cells in the joint neocortical neuron atlas were correlated with the average expression profile of each excitatory neuron cluster of the Wang et al. atlas³⁰. The cluster label of the best-correlated cluster was assigned to each cell in the joined neocortical neuron atlas, so that cell cluster labels were harmonized for all cells in the atlas. Label-aware data integration was then performed using scPoli⁴⁵. On the basis of the scPoli latent representation, Louvain clustering was performed on the joint neocortical neuron atlas (resolution, 1). This cluster label was transferred to the dorsal telencephalic neurons in HNOCA with max-correlation manner across highly variable genes defined on average transcriptomic profiles of clusters in the joint neocortical neuron atlas.

Reference mapping of the neural organoid morphogen screen scRNA-seq data to the human developing brain atlas and HNOCA

We used scArches to map scRNA-seq data from the neural organoid morphogen screen to both the scANVI model of the human developing brain atlas²⁷ and the scPoli model of the HNOCA. In both cases, the ‘dataset’ field of the screen data was used as the batch covariate, which indicates belonging to one of the three categories: ‘organoid screen’, ‘secondary organoid screen’ or ‘fetal striatum 21pcw’. For mapping to the primary reference, we used the scvi-tools implementation of scArches without the use of cell type annotations and trained the model for 500 epochs with weight_decay of 0 and otherwise default parameters. For mapping to HNOCA we used scArches through scPoli and trained the model for 500 epochs without unlabelled prototype training.

Retrieval and harmonization of disease-modelling human neural organoid scRNA-seq datasets

We included 11 scRNA-seq datasets of neural organoids, which were designed to model 10 different neural diseases including microcephaly⁵⁶, amyotrophic lateral sclerosis⁴³, Alzheimer’s disease⁵⁷, autism⁴², FXS⁵⁸, schizophrenia⁵⁹, neuronal heterotopia^60,61, Pitt–Hopkins syndrome⁶², myotonic dystrophy⁶³ and glioblastoma⁶⁴. Count matrices and metadata were directly downloaded for the ten datasets with processed data provided in the Gene Expression Omnibus or ArrayExpress. For the dataset with only FASTQ files available⁵⁶, we downloaded the FASTQ files and used Cell Ranger (v.4.0) to map reads to the human reference genome and transcriptome retrieved from Cell Ranger website (GRCh38 v.3.0.0) for gene expression quantification. All datasets were concatenated together with anndata in Python (join = ‘inner’). For each dataset, samples were grouped into either ‘disease’ or ‘control’ as their disease status, with ‘disease’ representing data from patient cell lines, mutant cell lines with disease-related alleles, cells carrying targeting guide RNAs (gRNAs) in CRISPR-based screen and tumour-derived organoids. and ‘control’ representing data from healthy cell lines, mutation-corrected cell lines and cells carrying only non-targeting gRNAs in a CRISPR-based screen.

Projection and label transfer-based annotation of the disease-modelling dataset

To compare the disease-modelling atlas with the integrated HNOCA, we used scArches⁵¹ to project it to the HNOCA as well as the first-trimester primary human brain scRNA-seq atlas²⁷. For projecting to the primary atlas, the same implementation as mentioned above to map HNOCA to the atlas was used. For projecting to HNOCA, the query model was based on the scPoli model pretrained with the HNOCA data, and fineturned with a batch size of 16,384 for a maximum of 30 epochs with 20 pretraining epochs. A nearest neighbour graph was created for the disease-modelling atlas on the basis of the projected latent representation to HNOCA with scanpy (default parameters), with which a UMAP embedding was created with scanpy (default parameters).

Next, for both HNOCA and the disease-modelling atlas, cells were represented by the concatenated representation of HNOCA-scPoli and primary-scANVI models. A bipartite wkNN graph was then reconstructed as mentioned above, by identifying 50 nearest neighbours in HNOCA for each disease-modelling atlas cell. On the basis of the bipartite wkNN, the majority voting-based label transfer was applied to transfer the four levels of hierarchical cell type annotation and regional identity to the disease-modelling atlas.

Reconstruction of matched HNOCA metacells

For each cell in the disease-modelling atlas, a matched HNOCA metacell was reconstructed on the basis of the above mentioned bipartite wkNN. In brief, for a query cell i and a gene j measured in HNOCA, its matched metacell expression of j, denoted as ${{e}}_{{ij}}^{{\prime} }$, is calculated as:

$${e}_{{ij}}^{{\prime} }=\frac{{\sum }_{k\subseteq {N}_{i}}{w}_{{ik}}{e}_{{kj}}}{{\sum }_{k\subseteq {N}_{i}}{w}_{{ik}}}$$

Here, N_i represents all HNOCA nearest neighbours of the query cell c_i, w_ik represents the edge weight between query cell i and reference cell k, and e_kj represents expression level of gene j in reference cell k.

Given the matched HNOCA metacell transcriptomic profile, the similarity between a query cell and its matched cell state in HNOCA is then calculated as the Spearman correlation between the query cell transcriptomic profile and its matched HNOCA metacell transcriptomic profile.

Re-analysis of GBM-2019 and FXS-2021 datasets

To analyse the glioblastoma organoid dataset (GBM-2019), cells from the publication were subset from the integrated disease-modelling atlas. Using scanpy, highly variable genes were identified with default parameters. The log-normalized expression values of the highly variable genes were then scaled across cells, the truncated PCA was performed with the top 20 principal components used for the following analysis. Next, harmonypy, the Python implementation of harmony⁹⁹, was applied to integrate cells from different samples. On the basis of the harmony-integrated embeddings, the neighbour graph was reconstructed. UMAP embeddings and Louvain clusters (resolution, 0.5) were created on the basis of the nearest neighbour graph. Among the 12 identified clusters, cluster-7 and cluster-0, the two clusters with the highest AQP4 expression, were selected for the following DE analysis.

To analyse the FXS dataset (FXS-2021), cells from the publication were subset from the integrated disease-modelling atlas. The same procedure of highly variable gene identification, data scaling and PCA as the GBM-2019 dataset was applied. Next, the nearest neighbour graph was created directly on the basis of the top 20 principal components. UMAP embeddings and Louvain clusters (resolution, 1) were then created on the basis of the reconstructed nearest neighbour graph. Among the 30 clusters, cluster-17 and cluster-23, which express EMX1 and FOXG1 and were largely predicted to be dorsal telencephalic NPCs and neurons according to the transferred labels from HNOCA, were selected for the following DE analysis.

F-test-based DE analysis for paired transcriptome

To compare expression levels of two groups of paired cells, the expression difference per gene of each cell pair is first calculated on the basis of the log-normalized expression values. Next, for each gene to test for DE, its variance over the calculated expression difference per cell pair (σ²) is compared with the sum of squared of expression differences (d_i for gene i) normalized by the number of cell pairs:

$${s}_{0}^{2}=\frac{{\sum }_{i=1}^{n}{d}_{i}}{n}.$$

Here, an F-test is applied for the comparison, with f = σ²/s²₀, d.f.₁ = n − 1 and d.f.₂ = n.

Construction of the HNOCA Community Edition by query-to-reference mapping

To construct the HNOCA-CE, we first collected raw count matrices and associated metadata of five more neural organoid studies. For two publications^71,75, we obtained them from the sources listed in the ‘Data availability’ section of the paper. For the remaining three publications^72,73,74, count matrices and associated metadata were provided directly by the authors. We subset each dataset to the healthy control cells and removed any cells with fewer than 200 genes expressed. We subset the gene space of every dataset to the 3,000 HVGs of HNOCA while filling the expression of missing genes in the community datasets with zeros. On average, 23% of genes with zero expression were added per dataset. We instantiated a mapping object from the HNOCA-tools package (at commit fe38c52) using the saved scPoli⁴⁵ model weights from the HNOCA integration. Using the map_query method of the mapper instance, we projected the community datasets to HNOCA. We used the following training hyperparameters: retrain = ‘partial’, batch_size = 256, unlabeled_prototype_training = False, n_epochs = 10, pretraining_epochs = 9, early_stopping_kwargs = early_stopping_kwargs, eta = 10, alpha_epoch_anneal = 10. We computed the wkNN graph using the compute_wknn method of the mapper instance with k = 100. We transferred the final level_2 cell type labels from HNOCA to the community datasets using this neighbour graph. To obtain the combined representation of HNOCA-CE, we projected HNOCA together with the added community datasets through the trained model and computed a neighbour graph and UMAP from the resulting latent representation.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

[ad_2]

Source link

November 20, 2024
Global marine microbial diversity and its potential in bioprospecting

[ad_1]
Rusch, D. B. et al. The Sorcerer II Global Ocean Sampling Expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 5, e77 (2007).

Article
PubMed
PubMed Central

Google Scholar
Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).

Article
PubMed

Google Scholar
Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).

Article
CAS
PubMed

Google Scholar
Paoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111–118 (2022).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Overmann, J. & Lepleux, C. in The Marine Microbiome (ed. Stal, L. J. & Cretoiu, M. S.) 21–55 (2016).
Nishimura, Y. & Yoshizawa, S. The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments. Sci. Data 9, 305 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Fuhrman, J. A. et al. A latitudinal diversity gradient in planktonic marine bacteria. Proc. Natl Acad. Sci. USA 105, 7774–7778 (2008).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Auladell, A. et al. Seasonal niche differentiation among closely related marine bacteria. ISME J. 16, 178–189 (2022).

Article
CAS
PubMed

Google Scholar
Ghiglione, J. F. et al. Pole-to-pole biogeography of surface and deep marine bacterial communities. Proc. Natl Acad. Sci. USA 109, 17633–17638 (2012).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Richter, D. J. et al. Genomic evidence for global ocean plankton biogeography shaped by large-scale current systems. eLife 11, e78129 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. J. Open Source Softw. 3, 29 (2018).
Jönsson, B. F. & Watson, J. R. The timescales of global surface-ocean connectivity. Nat. Commun. 7, 11239 (2016).

Article
ADS
PubMed
PubMed Central

Google Scholar
Bentkowski, P., Van Oosterhout, C. & Mock, T. A model of genome size evolution for prokaryotes in stable and fluctuating environments. Genome Biol. Evol. 7, 2344–2351 (2015).

Article
CAS
PubMed
PubMed Central

Google Scholar
Rodríguez-Gijón, A. et al. A genomic perspective across Earth’s microbiomes reveals that genome size in archaea and bacteria is linked to ecosystem type and trophic strategy. Front. Microbiol. 12, 761869 (2022).

Article
PubMed
PubMed Central

Google Scholar
Mara, P. et al. Viral elements and their potential influence on microbial processes along the permanently stratified Cariaco Basin redoxcline. ISME J. 14, 3079–3092 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
Cabello-Yeves, P. J. et al. The microbiome of the Black Sea water column analyzed by shotgun and genome centric metagenomics. Environ. Microbiome 16, 5 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar
Mende, D. R. et al. Environmental drivers of a microbial genomic transition zone in the ocean’s interior. Nat. Microbiol. 2, 1367–1373 (2017).

Article
CAS
PubMed

Google Scholar
Musto, H. et al. Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochem. Biophys. Res. Commun. 347, 1–3 (2006).

Article
CAS
PubMed

Google Scholar
Almpanis, A., Swain, M., Gatherer, D. & McEwan, N. Correlation between bacterial G+C content, genome size and the G+C content of associated plasmids and bacteriophages. Microb. Genom. 4, e000168 (2018).

PubMed
PubMed Central

Google Scholar
Sánchez-Romero, M. A. & Casadesús, J. The bacterial epigenome. Nat. Rev. Microbiol. 18, 7–20 (2020).

Article
PubMed

Google Scholar
Hampton, H. G., Watson, B. N. & Fineran, P. C. The arms race between bacteria and their phage foes. Nature 577, 327–336 (2020).

Article
ADS
CAS
PubMed

Google Scholar
Whittaker, C. A. & Hynes, R. O. Distribution and evolution of von Willebrand/integrin A domains: widely dispersed domains with roles in cell adhesion and elsewhere. Mol. Biol. Cell 13, 3369–3387 (2002).

Article
CAS
PubMed
PubMed Central

Google Scholar
Pasternak, Z. et al. By their genes ye shall know them: genomic signatures of predatory bacteria. ISME J. 7, 756–769 (2013).

Article
CAS
PubMed

Google Scholar
Guo, M., Wang, J., Zhang, Y. & Zhang, L. Increased WD40 motifs in Planctomycete bacteria and their evolutionary relevance. Mol. Phylogenet. Evol. 155, 107018 (2021).

Article
CAS
PubMed

Google Scholar
Hu, X. J. et al. Prokaryotic and highly-repetitive WD40 proteins: a systematic study. Sci. Rep. 7, 10585 (2017).

Article
ADS
PubMed
PubMed Central

Google Scholar
Neer, E. J., Schmidt, C. J., Nambudripad, R. & Smith, T. F. The ancient regulatory-protein family of WD-repeat proteins. Nature 371, 297–300 (1994).

Article
ADS
CAS
PubMed

Google Scholar
Fuerst, J. A. & Sagulenko, E. Beyond the bacterium: planctomycetes challenge our concepts of microbial structure and function. Nat. Rev. Microbiol. 9, 403–413 (2011).

Article
CAS
PubMed

Google Scholar
Meaden, S. et al. High viral abundance and low diversity are associated with increased CRISPR–Cas prevalence across microbial ecosystems. Curr. Biol. 32, 220–227.e225 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Burstein, D. et al. Major bacterial lineages are essentially devoid of CRISPR–Cas viral defence systems. Nat. Commun. 7, 10613 (2016).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Weissman, J. L., Laljani, R. M. R., Fagan, W. F. & Johnson, P. L. F. Visualization and prediction of CRISPR incidence in microbial trait-space to identify drivers of antiviral immune strategy. ISME J. 13, 2589–2602 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Gophna, U. et al. No evidence of inhibition of horizontal gene transfer by CRISPR–Cas on evolutionary timescales. ISME J. 9, 2021–2027 (2015).

Article
PubMed
PubMed Central

Google Scholar
Zeng, X., Alain, K. & Shao, Z. Microorganisms from deep-sea hydrothermal vents. Mar. Life Sci. Tech. 3, 204–230 (2021).

Article
ADS
CAS

Google Scholar
Wheatley, R. M. & MacLean, R. C. CRISPR–Cas systems restrict horizontal gene transfer in Pseudomonas aeruginosa. ISME J. 15, 1420–1433 (2021).

Article
CAS
PubMed

Google Scholar
Shehreen, S., Chyou, T. Y., Fineran, P. C. & Brown, C. M. Genome-wide correlation analysis suggests different roles of CRISPR–Cas systems in the acquisition of antibiotic resistance genes in diverse species. Philos. Trans. R Soc. B 374, 20180384 (2019).

Article
CAS

Google Scholar
Wilkinson, R. A., Martin, C., Nemudryi, A. A. & Wiedenheft, B. CRISPR RNA-guided autonomous delivery of Cas9. Nat. Struct. Mol. Biol. 26, 14–24 (2019).

Article
CAS
PubMed

Google Scholar
Gavriilidou, A. et al. Compendium of specialized metabolite biosynthetic diversity encoded in bacterial genomes. Nat. Microbiol. 7, 726–735 (2022).

Article
CAS
PubMed

Google Scholar
Ayikpoe, R. S. et al. A scalable platform to discover antimicrobials of ribosomal origin. Nat. Commun. 13, 6135 (2022).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Wei, B. et al. Global analysis of the biosynthetic chemical space of marine prokaryotes. Microbiome 11, 144 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar
Fjell, C. D., Hiss, J. A., Hancock, R. E. & Schneider, G. Designing antimicrobial peptides: form follows function. Nat. Rev. Drug Discov. 11, 37–51 (2011).

Article
PubMed

Google Scholar
Salazar, G. et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell 179, 1068–1083 e1021 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Kawai, F., Kawabata, T. & Oda, M. Current state and perspectives related to the polyethylene terephthalate hydrolases available for biorecycling. ACS Sustain. Chem. Eng. 8, 8894–8908 (2020).

Article
CAS

Google Scholar
DeFrancesco, L. Closing the recycling circle. Nat. Biotechnol. 38, 665–668 (2020).

Article
CAS
PubMed

Google Scholar
Zhu, B., Wang, D. & Wei, N. Enzyme discovery and engineering for sustainable plastic recycling. Trends Biotechnol. 40, 22–37 (2022).

Article
CAS
PubMed

Google Scholar
Lu, H. et al. Machine learning-aided engineering of hydrolases for PET depolymerization. Nature 604, 662–667 (2022).

Article
ADS
CAS
PubMed

Google Scholar
Tournier, V. et al. An engineered PET depolymerase to break down and recycle plastic bottles. Nature 580, 216–219 (2020).

Article
ADS
CAS
PubMed

Google Scholar
Joo, S. et al. Structural insight into molecular mechanism of poly(ethylene terephthalate) degradation. Nat. Commun. 9, 382 (2018).

Article
ADS
PubMed
PubMed Central

Google Scholar
Jin, M., Gai, Y., Guo, X., Hou, Y. & Zeng, R. Properties and applications of extremozymes from deep-sea extremophilic microorganisms: a mini review. Mar. Drugs 17, 656 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Schmidt, T. S. B. et al. SPIRE: a Searchable, Planetary-scale mIcrobiome REsource. Nucleic Acids Res. 52, D777–D783 (2023).

Article
PubMed Central

Google Scholar
Collins, S., Boyd, P. W. & Doblin, M. A. Evolution, microbes, and changing ocean conditions. Annu. Rev. Mar. Sci. 12, 181–208 (2020).

Article
ADS

Google Scholar
Cordero, O. X. & Polz, M. F. Explaining microbial genomic diversity in light of evolutionary ecology. Nat. Rev. Microbiol. 12, 263–273 (2014).

Article
CAS
PubMed

Google Scholar
Pursey, E., Dimitriu, T., Paganelli, F. L., Westra, E. R. & van Houte, S. CRISPR–Cas is associated with fewer antibiotic resistance genes in bacterial pathogens. Philos. Trans. R Soc. B 377, 20200464 (2022).

Article
CAS

Google Scholar
Kaminski, M. M., Abudayyeh, O. O., Gootenberg, J. S., Zhang, F. & Collins, J. J. CRISPR-based diagnostics. Nat. Biomed. Eng. 5, 643–656 (2021).

Article
CAS
PubMed

Google Scholar
Jacinto, F. V., Link, W. & Ferreira, B. I. CRISPR/Cas9-mediated genome editing: from basic research to translational medicine. J. Cell. Mol. Med. 24, 3766–3778 (2020).

Article
PubMed
PubMed Central

Google Scholar
Saati-Santamaria, Z., Selem-Mojica, N., Peral-Aranega, E., Rivas, R. & Garcia-Fraile, P. Unveiling the genomic potential of Pseudomonas type strains for discovering new natural products. Microb. Genom. 8, 000758 (2022).

CAS
PubMed
PubMed Central

Google Scholar
Belknap, K. C., Park, C. J., Barth, B. M. & Andam, C. P. Genome mining of biosynthetic and chemotherapeutic gene clusters in Streptomyces bacteria. Sci. Rep. 10, 2003 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Yan, S., Zeng, M., Wang, H. & Zhang, H. Micromonospora: a prolific source of bioactive secondary netabolites with therapeutic potential. J. Med. Chem. 65, 8735–8771 (2022).

Article
CAS
PubMed

Google Scholar
Szymczak, P. et al. Discovering highly potent antimicrobial peptides with deep generative model HydrAMP. Nat. Commun. 14, 1453 (2023).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Tully, B. J., Wheat, C. G., Glazer, B. T. & Huber, J. A. A dynamic microbial community with high functional redundancy inhabits the cold, oxic subseafloor aquifer. ISME J. 12, 1–16 (2018).

Article
CAS
PubMed

Google Scholar
Galambos, D., Anderson, R. E., Reveillaud, J. & Huber, J. A. Genome-resolved metagenomics and metatranscriptomics reveal niche differentiation in functionally redundant microbial communities at deep-sea hydrothermal vents. Environ. Microbiol. 21, 4395–4410 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Dombrowski, N., Seitz, K. W., Teske, A. P. & Baker, B. J. Genomic insights into potential interdependencies in microbial hydrocarbon and nutrient cycling in hydrothermal sediments. Microbiome 5, 106 (2017).

Article
PubMed
PubMed Central

Google Scholar
Reysenbach, A. L. et al. Complex subsurface hydrothermal fluid mixing at a submarine arc volcano supports distinct and highly diverse microbial communities. Proc. Natl Acad. Sci. USA 117, 32627–32638 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Konstantinidis, K. T., Braff, J., Karl, D. M. & DeLong, E. F. Comparative metagenomic analysis of a microbial community residing at a depth of 4,000 meters at station ALOHA in the North Pacific subtropical gyre. Appl. Environ. Microbiol. 75, 5345–5355 (2009).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Pelve, E. A., Fontanez, K. M. & DeLong, E. F. Bacterial succession on sinking particles in the ocean’s interior. Front. Microbiol. 8, 2269 (2017).

Article
PubMed
PubMed Central

Google Scholar
Kato, S., Hirai, M., Ohkuma, M. & Suzuki, K. Microbial metabolisms in an abyssal ferromanganese crust from the Takuyo-Daigo Seamount as revealed by metagenomics. PLoS ONE 14, e0224888 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Buongiorno, J., Sipes, K., Wasmund, K., Loy, A. & Lloyd, K. G. Woeseiales transcriptional response to shallow burial in Arctic fjord surface sediment. PLoS ONE 15, e0234839 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
Robbins, S. J. et al. A genomic view of the reef-building coral Porites lutea and its microbial symbionts. Nat. Microbiol. 4, 2090–2100 (2019).

Article
PubMed

Google Scholar
De Corte, D. et al. Viral communities in the global deep ocean conveyor belt assessed by targeted viromics. Front. Microbiol. 10, 1801 (2019).

Article
PubMed
PubMed Central

Google Scholar
Rinke, C. et al. A phylogenomic and ecological analysis of the globally abundant Marine Group II archaea (Ca. Poseidoniales ord. nov.). ISME J. 13, 663–675 (2019).

Article
CAS
PubMed

Google Scholar
Martin-Cuadrado, A. B. et al. A new class of marine Euryarchaeota group II from the Mediterranean deep chlorophyll maximum. ISME J. 9, 1619–1634 (2015).

Article
CAS
PubMed

Google Scholar
Fuchsman, C. A., Devol, A. H., Saunders, J. K., McKay, C. & Rocap, G. Niche partitioning of the N cycling microbial community of an offshore oxygen deficient zone. Front. Microbiol. 8, 2384 (2017).

Article
PubMed
PubMed Central

Google Scholar
Haro-Moreno, J. M., Rodriguez-Valera, F. & Lopez-Perez, M. Prokaryotic population dynamics and viral predation in a marine succession experiment using metagenomics. Front. Microbiol. 10, 2926 (2019).

Article
PubMed
PubMed Central

Google Scholar
Pascoal, F. et al. Inter-comparison of marine microbiome sampling protocols. ISME Commun. 3, 84 (2023).

Article
PubMed
PubMed Central

Google Scholar
Raes, E. J., Bodrossy, L., van de Kamp, J., Bissett, A. & Waite, A. M. Marine bacterial richness increases towards higher latitudes in the eastern Indian Ocean. Limnol. Oceanogr. Lett. 3, 10–19 (2017).

Article

Google Scholar
Schreiber, L. et al. Potential for microbially mediated natural attenuation of diluted bitumen on the coast of British Columbia (Canada). Appl. Environ. Microbiol. 85, e00086-19 (2019).

Article
PubMed
PubMed Central

Google Scholar
Biller, S. J. et al. Marine microbial metagenomes sampled across space and time. Sci. Data 5, 180176 (2018).

Article
CAS
PubMed
PubMed Central

Google Scholar
Cao, S. et al. Structure and function of the Arctic and Antarctic marine microbiota as revealed by metagenomics. Microbiome 8, 47 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
Tremblay, J. et al. Metagenomic and metatranscriptomic responses of natural oil degrading bacteria in the presence of dispersants. Environ. Microbiol. 21, 2307–2319 (2019).

Article
CAS
PubMed

Google Scholar
Anstett, J. et al. A compendium of bacterial and archaeal single-cell amplified genomes from oxygen deficient marine waters. Sci. Data 10, 332 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar
Diez, B. et al. Metagenomic analysis of the Indian Ocean picocyanobacterial community: structure, potential function and evolution. PLoS ONE 11, e0155757 (2016).

Article
PubMed
PubMed Central

Google Scholar
Orsi, W. D. et al. Climate oscillations reflected within the microbiome of Arabian Sea sediments. Sci. Rep. 7, 6040 (2017).

Article
ADS
PubMed
PubMed Central

Google Scholar
Murray, A. E. et al. Discovery of an Antarctic ascidian-associated uncultivated Verrucomicrobia with antimelanoma palmerolide biosynthetic potential. mSphere 6, e0075921 (2021).

Article
PubMed

Google Scholar
Boeuf, D. et al. Biological composition and microbial dynamics of sinking particulate organic matter at abyssal depths in the oligotrophic open ocean. Proc. Natl Acad. Sci. USA 116, 11824–11832 (2019).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Zheng, T. et al. Mining, analyzing, and integrating viral signals from metagenomic data. Microbiome 7, 42 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Aylward, F. O. et al. Diel cycling and long-term persistence of viruses in the ocean’s euphotic zone. Proc. Natl Acad. Sci. USA 114, 11446–11451 (2017).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Fernandes, S. et al. Enhanced carbon-sulfur cycling in the sediments of Arabian Sea oxygen minimum zone center. Sci. Rep. 8, 8665 (2018).

Article
ADS
PubMed
PubMed Central

Google Scholar
Markussen, T. et al. Coupling biogeochemical process rates and metagenomic blueprints of coastal bacterial assemblages in the context of environmental change. Environ. Microbiol. 20, 3083–3099 (2018).

Article
CAS
PubMed

Google Scholar
Duarte, C. M. et al. Sequencing effort dictates gene discovery in marine microbial metagenomes. Environ. Microbiol. 22, 4589–4603 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
Yoshitake, K. et al. Development of a time-series shotgun metagenomics database for monitoring microbial communities at the Pacific coast of Japan. Sci. Rep. 11, 12222 (2021).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Abdel-Ghaffar, F. et al. Morphological and molecular biological characterization of Pleistophora aegyptiaca sp. nov. infecting the Red Sea fish Saurida tumbil. Parasitol. Res. 110, 741–752 (2012).

Article
PubMed

Google Scholar
Atlas, R. M. et al. Oil biodegradation and oil-degrading microbial populations in marsh sediments impacted by oil from the Deepwater Horizon well blowout. Environ. Sci. Technol. 49, 8356–8366 (2015).

Article
ADS
CAS
PubMed

Google Scholar
Hauptmann, A. L. et al. Contamination of the Arctic reflected in microbial metagenomes from the Greenland ice sheet. Environ. Res. Lett. 12, 074019 (2017).

Article
ADS

Google Scholar
Botte, E. S. et al. Future ocean conditions induce necrosis, microbial dysbiosis and nutrient cycling imbalance in the reef sponge Stylissa flabelliformis. ISME Commun. 3, 53 (2023).

Article
PubMed
PubMed Central

Google Scholar
Thompson, L. R. et al. Metagenomic covariation along densely sampled environmental gradients in the Red Sea. ISME J. 11, 138–151 (2017).

Article
CAS
PubMed

Google Scholar
Hilton, J. A., Satinsky, B. M., Doherty, M., Zielinski, B. & Zehr, J. P. Metatranscriptomics of N₂-fixing cyanobacteria in the Amazon River plume. ISME J. 9, 1557–1569 (2015).

Article
CAS
PubMed

Google Scholar
Nilsson, E. et al. Genomic and seasonal variations among aquatic phages infecting the Baltic Sea Gammaproteobacterium Rheinheimera sp. Strain BAL341. Appl. Environ. Microbiol. 85, e01003–e01019 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Glasl, B. et al. Comparative genome-centric analysis reveals seasonal variation in the function of coral reef microbiomes. ISME J. 14, 1435–1450 (2020).

Article
PubMed
PubMed Central

Google Scholar
Abdou, Y. T. et al. Characterization of a novel peptide mined from the Red Sea brine pools and modified to enhance its anticancer activity. BMC Cancer 23, 699 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar
Romero Picazo, D. et al. Horizontally transmitted symbiont populations in deep-sea mussels are genetically isolated. ISME J. 13, 2954–2968 (2019).

Article
PubMed
PubMed Central

Google Scholar
Saw, J. H. W. et al. Pangenomics analysis reveals diversification of enzyme families and niche specialization in globally abundant SAR202 bacteria. mBio 11, e02975–19 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
St John, E., Flores, G. E., Meneghin, J. & Reysenbach, A. L. Deep-sea hydrothermal vent metagenome-assembled genomes provide insight into the phylum Nanoarchaeota. Environ. Microbiol. Rep. 11, 262–270 (2019).

Article

Google Scholar
Anantharaman, K. et al. Sulfur oxidation genes in diverse deep-sea viruses. Science 344, 757–760 (2014).

Article
ADS
CAS
PubMed

Google Scholar
Niemann, H. et al. Novel microbial communities of the Haakon Mosby mud volcano and their role as a methane sink. Nature 443, 854–858 (2006).

Article
ADS
CAS
PubMed

Google Scholar
Jungbluth, S. P., Bowers, R. M., Lin, H. T., Cowen, J. P. & Rappe, M. S. Novel microbial assemblages inhabiting crustal fluids within mid-ocean ridge flank subsurface basalt. ISME J. 10, 2033–2047 (2016).

Article
CAS
PubMed
PubMed Central

Google Scholar
Lopez-Perez, M., Haro-Moreno, J. M., Gonzalez-Serrano, R., Parras-Molto, M. & Rodriguez-Valera, F. Genome diversity of marine phages recovered from Mediterranean metagenomes: Size matters. PLoS Genet. 13, e1007018 (2017).

Article
PubMed
PubMed Central

Google Scholar
Dombrowski, N., Teske, A. P. & Baker, B. J. Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments. Nat. Commun. 9, 4999 (2018).

Article
ADS
PubMed
PubMed Central

Google Scholar
Liu, J. et al. Proliferation of hydrocarbon-degrading microbes at the bottom of the Mariana Trench. Microbiome 7, 47 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Yu, H. et al. Comparative genomics and proteomic analysis of assimilatory sulfate reduction pathways in anaerobic methanotrophic archaea. Front. Microbiol. 9, 2917 (2018).

Article
PubMed
PubMed Central

Google Scholar
Backstrom, D. et al. Virus genomes from deep sea sediments expand the ocean megavirome and support independent origins of viral gigantism. mBio 10, e02497-18 (2019).

Article
PubMed
PubMed Central

Google Scholar
Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).

Article
CAS
PubMed

Google Scholar
Kang, D. D., Froula, J., Egan, R. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).

Article
PubMed
PubMed Central

Google Scholar
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).

Article
PubMed
PubMed Central

Google Scholar
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).

Article
CAS
PubMed
PubMed Central

Google Scholar
Mattock, J. & Watson, M. A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination. Nat. Methods 20, 1170–1173 (2023).

Article
CAS
PubMed

Google Scholar
Kitts, P. A. et al. Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res. 44, D73–D80 (2016).

Article
CAS
PubMed

Google Scholar
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).

Article
CAS
PubMed
PubMed Central

Google Scholar
Parks, D. H. et al. A complete domain-to-species taxonomy for bacteria and archaea. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0501-8 (2020).

Article
PubMed
PubMed Central

Google Scholar
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

Article
ADS
PubMed
PubMed Central

Google Scholar
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).

Article
CAS
PubMed

Google Scholar
Russel, J., Pinilla-Redondo, R., Mayo-Munoz, D., Shah, S. A. & Sorensen, S. J. CRISPRCasTyper: automated identification, annotation, and classification of CRISPR–Cas loci. CRISPR J. 3, 462–469 (2020).

Article
CAS
PubMed

Google Scholar
Yang, B., Zheng, J. & Yin, Y. AcaFinder: genome mining for anti-CRISPR-associated genes. mSystems 7, e0081722 (2022).

Article
PubMed

Google Scholar
Sauer, D. B. & Wang, D. N. Predicting the optimal growth temperatures of prokaryotes using only genome derived features. Bioinformatics 35, 3224–3231 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Liu, Y. et al. A genome and gene catalog of glacier microbiomes. Nat. Biotechnol. 40, 1341–1348 (2022).

Article
PubMed

Google Scholar
Johansson, M. H. K. et al. Detection of mobile genetic elements associated with antibiotic resistance in Salmonella enterica using a newly developed web tool: MobileElementFinder. J. Antimicrob. Chemother. 76, 101–109 (2021).

Article
CAS
PubMed

Google Scholar
Pinilla-Redondo, R. et al. Discovery of multiple anti-CRISPRs highlights anti-defense gene clustering in mobile genetic elements. Nat. Commun. 11, 5652 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Mahendra, C. et al. Broad-spectrum anti-CRISPR proteins facilitate horizontal gene transfer. Nat Microbiol 5, 620–629 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
Mohanraju, P. et al. Alternative functions of CRISPR–Cas systems in the evolutionary arms race. Nat. Rev. Microbiol. 20, 351–364 (2022).

Article
CAS
PubMed

Google Scholar
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Li, Z. et al. DNB-based on-chip motif finding: a high-throughput method to profile different types of protein–DNA interactions. Sci. Adv. 6, eabb3350 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).

Article
CAS
PubMed

Google Scholar
Canver, M. C. et al. BCL11A enhancer dissection by Cas9-mediated in situ saturating mutagenesis. Nature 527, 192–197 (2015).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Weber, L. et al. Editing a γ-globin repressor binding site restores fetal hemoglobin synthesis and corrects the sickle cell disease phenotype. Sci. Adv. 6, eaay9392 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Medema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–W346 (2011).

Article
CAS
PubMed
PubMed Central

Google Scholar
Kautsar, S. A., van der Hooft, J. J. J., de Ridder, D. & Medema, M. H. BiG-SLiCE: a highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters. Gigascience 10, giaa154 (2021).

Article
PubMed
PubMed Central

Google Scholar
Kautsar, S. A., Blin, K., Shaw, S., Weber, T. & Medema, M. H. BiG-FAM: the biosynthetic gene cluster families database. Nucleic Acids Res. 49, D490–D497 (2021).

Article
CAS
PubMed

Google Scholar
Ma, Y. et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921–931 (2022).

Article
CAS
PubMed

Google Scholar
Humphries, R. M. et al. CLSI methods development and standardization working group best practices for evaluation of antimicrobial susceptibility tests. J. Clin. Microbiol. 56, e01934–17 (2018).

Article
CAS
PubMed
PubMed Central

Google Scholar
Zhu, W., Lomsadze, A. & Borodovsky, M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 38, e132 (2010).

Article
PubMed
PubMed Central

Google Scholar
Steinegger, M. & Soding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).

Article
ADS
PubMed
PubMed Central

Google Scholar
Coelho, L. P. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).

Article
ADS
CAS
PubMed

Google Scholar
Liao, S. et al. Deciphering the microbial taxonomy and functionality of two diverse mangrove ecosystems and their potential abilities to produce bioactive compounds. mSystems 5, e00851–19 (2020).

Article
CAS
PubMed
PubMed Central

Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2014).

Article
PubMed

Google Scholar
Yoshida, S. et al. A bacterium that degrades and assimilates poly(ethylene terephthalate). Science 351, 1196–1199 (2016).

Article
ADS
CAS
PubMed

Google Scholar
Han, X. et al. Structural insight into catalytic mechanism of PET hydrolase. Nat. Commun. 8, 2106 (2017).

Article
ADS
PubMed
PubMed Central

Google Scholar
Danso, D. et al. New insights into the function and global distribution of polyethylene terephthalate (PET)-degrading bacteria and enzymes in marine and terrestrial metagenomes. Appl. Environ. Microbiol. 84, e02773-17 (2018).

Article
PubMed
PubMed Central

Google Scholar
Almagro Armenteros, J. J. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol. 37, 420–423 (2019).

Article
CAS
PubMed

Google Scholar
Erickson, E. et al. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity. Nat. Commun. 13, 7850 (2022).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Kumar, S., Stecher, G., Li, M., Knyaz, C. & Tamura, K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549 (2018).

Article
CAS
PubMed
PubMed Central

Google Scholar
Robert, X. & Gouet, P. Deciphering key features in protein structures with the new ENDscript server. Nucleic Acids Res. 42, W320–W324 (2014).

Article
CAS
PubMed
PubMed Central

Google Scholar
Liu, K. et al. A dual fluorescence assay enables high-throughput screening for poly(ethylene terephthalate) hydrolases. ChemSusChem 16, e202202019 (2022).

Article

Google Scholar
Cui, Y. et al. Computational redesign of a PETase for plastic biodegradation under ambient condition by the GRAPE strategy. ACS Catal. 11, 1340–1350 (2021).

Article
CAS

Google Scholar

[ad_2]

Source link

Tag: Data integration

A spatial human thymus cell atlas mapped to a continuous tissue axis

Data generation by institute

Sample processing and library preparation for 10x scRNA-seq at WSI

Curation of published datasets

Processing of published and newly generated scRNA-seq and TCR-seq datasets at WSI

scRNA-seq quality control, data integration and annotation

mTEC trajectory analysis

mcTEC differentiation potential analysis

Formulation of the CMA with OrganAxis

10x Visium spatial transcriptomics sample processing and sequencing

Visium preprocessing, image registration and annotation

Spatial Visium mapping with cell2location

Quality control and batch correction of Visium data

PCA cumulative contribution of CMA

Cytokine clustering by expression along the binned axis in Visium data

Cytokine two-way ANOVA analysis and cosine similarity on binned axis Visium data

Cell- and gene-weighted mean CMA location on Visium data

Identification of cell-type-specific SGs

IBEX clinical cohort details and sample preparation

IBEX sample imaging preparation

IBEX image acquisition and alignment

IBEX 3D single nuclei segmentation with cellpose

Label transfer to IBEX cells from scRNA-seq and CITE-seq reference atlas with ISS patcher

Human thymus FFPE processing

RareCyte immunostaining and 14-plex imaging

RNAscope processing and imaging

CITE-seq tissue processing

CITE-seq antibody preparation

CITE-seq sample preparation

CITE-seq library preparation and sequencing

CITE-seq quality control and denoising

CITE-seq annotation

CITE-seq pseudotime analysis

CITE-seq cell2location mapping, integration and processing

Reporting summary

Single-cell integration reveals metaplasia in inflammatory gut diseases

Patient samples and tissue processing

Healthy tissue from adults

Control tissue from preterm infants

Disease tissue from patients with Crohn’s disease, ulcerative colitis and coeliac disease

Ethical approval for collection of disease tissue

Single-molecule fluorescence in situ hybridization

Immunohistochemistry

Image quantification

Data curation and mapping

scAutoQC

Assembly of the healthy reference

Annotations of the healthy reference

Data projection and label prediction for diseased data

Technical and biological variation

Differential abundance analysis

Differential gene expression analysis

Cell–cell interaction analysis

cNMF analysis

Trajectory analysis

Monocle3

Palantir

Genes2Genes trajectory alignment

Bulk RNA-seq deconvolution

Reporting summary

An integrated transcriptomic cell atlas of human neural organoids

Metadata curation and harmonization of human neural organoid scRNA-seq datasets

Preprocessing of the HNOCA scRNA-seq data

Hierarchical auto-annotation with snapseed

Label-aware data integration with scPoli

Benchmark of data integration methods

Pseudotime inference

Preprocessing of the human developing brain cell atlas scRNA-seq data

Reference mapping of the organoid atlas to the primary atlas

Bipartite weighted kNN graph reconstruction

wkNN-based primary developing brain atlas label transfer to HNOCA cells

Stage-matching analysis

Presence scores and max presence scores of cells in the primary developing brain atlas

Cell type composition comparison among morphogen usage using scCODA

Cell type composition comparison among morphogen usage using regularized linear regression

DE analysis between HNOCA neural cell types and their primary counterparts and functional enrichment analysis

Transcriptomic similarity between HNOCA neural cell types and their primary counterparts in the human developing brain atlas

Heterogeneity of the telencephalic trajectories

Reference mapping of the neural organoid morphogen screen scRNA-seq data to the human developing brain atlas and HNOCA