Host genetic regulation of human gut microbial structural variation

Cohort description


The DMP consists of 8,719 individuals and is part of the Lifelines study, a multidisciplinary prospective population-based cohort study that utilizes a unique three-generation design to examine health and health-related behaviours in 167,729 people living in the northern Netherlands. Lifelines uses a broad range of investigative procedures to assess the biomedical, socio-demographic, behavioural, physical and psychological factors that contribute to health and disease, with a special focus on multi-morbidity and complex genetics48.

Microbiome data generation for the DMP was described elsewhere8. In brief, fresh-frozen faecal samples were collected from participants of the DMP study. Microbial DNA was isolated using the QIAamp Fast DNA Stool Mini Kit (Qiagen) by the QIAcube automated sample preparation system (Qiagen). Metagenomic sequencing was carried out at Novogene, China using the Illumina HiSeq 2000 sequencer. After filtering, 8,534 DMP samples were used for SV calling.

DMP genotype data generation was described previously2. In brief, genotyping was carried out using the Infinium Global Screening Array MultiEthnic Diseases version. Missing genotypes were imputed using Haplotype Reference Consortium (HRC) panel v.1.1 (ref. 49). Only bi-allelic SNPs with imputation quality >0.4, minor allele frequency (MAF) > 0.05, call rate >0.95 and Hardy–Weinberg equilibrium P-value > 10−6 were retained. A total of 7,738 samples had both metagenomic and genotype data after quality control (QC)2. We further removed 349 samples overlapping with the LLD cohort. This resulted in phenotype, metagenomic and genotype data being available for 7,389 DMP samples.


The LLD cohort is another part of the Lifelines cohort consisting of 1,539 individuals. Microbiome data generation for LLD was described elsewhere25. Fresh-frozen faecal samples were collected, and DNA was isolated with the AllPrep DNA/RNA Mini Kit (Qiagen, catalogue number 80204). Sequencing was carried out using the Illumina HiSeq platform at the Broad Institute, Boston. A total of 1,135 metagenomic samples passed QC.

Genotyping was carried out using the CytoSNP and ImmunoChip assays, as previously described50, and missing genotypes were imputed using the HRC v.1.1 reference panel49. A total of 984 samples had phenotype, metagenomic and genotype data.


The 500FG cohort is part of the Dutch Human Functional Genomics Project (DHFGP) and consists of 534 individuals. The metagenomic data generation was described previously26,51. Briefly, DNA was isolated from faecal samples with the AllPrep DNA/RNA Mini Kit, and libraries were sequenced on the Illumina HiSeq 2000 platform. A total of 450 metagenomic samples passed QC and were included in SV calling.

500FG genotype data generation was described previously52. Briefly, genotyping was carried out using the Illumina HumanOmniExpressExome-8 v.1.0 SNP chip. Missing genotypes were imputed using the Genome of the Netherlands as a reference panel53. After QC, 396 samples had phenotype, metagenomic and genotype data.


300OB is also part of the DHFGP and consists of 302 individuals with body mass index > 27 kg m−2. Metagenomic data generation was described previously26,54 and was carried out using a similar protocol and analysis pipeline to those of LLD. A total of 302 samples had metagenomic data available for SV calling.

300OB genotype data generation was described previously55. In brief, samples were genotyped on the Illumina HumanCoreExome-24 BeadChip Kit or the Illumina Infinium Omni-express chip. Imputation was carried out using the HRC v.1.1 reference panel49. After genotype QC, 274 samples had phenotype, genotype and metagenomic data available.


For replication in non-European individuals, we included 300TZFG, a population cohort of 323 individuals from both rural and urban areas of Tanzania. This study is part of the DHFGP. Metagenomic data generation has been described previously28. Briefly, bacterial DNA was isolated using the AllPrep 96 PowerFecal DNA/RNA kit (Qiagen), and libraries were sequenced on the Illumina NovaSeq 6000 platform. A total of 320 samples passed QC and were available for SV calling.

Host genotype data generation was described previously56. In brief, samples were genotyped on the Global Screening Array SNP chip, and genotype imputation was carried out using Minimac4 with the HRC v.1.1 reference panel. After genotype QC, phenotype, genotype and metagenomic data were available for 279 samples.

QC of metagenomic sequencing data

We removed host-genome-contaminated reads and low-quality reads from the raw metagenomic sequencing data using KneadData (v.0.7.4), Bowtie2 (v. and Trimmomatic (v.0.39)58. In brief, the data-cleaning procedure included two main steps: raw reads mapped to the human reference genome GRCh37 (hg19) were filtered out; and adapter sequences and low-quality reads were filtered out using Trimmomatic with default settings (SLIDINGWINDOW:4:20 MINLEN:50).

Taxonomic abundance

We estimated the relative abundance of gut microbial species from the cleaned metagenomic reads using Kraken2 (v.2.1.2)59 in conjunction with Bracken (v.2.6.2)60 based on the same reference genomes included in the database of SGV-Finder, and MetaPhlAn 3 (ref. 61) based on the MetaPhlAn database of clade-specific marker genes (mpa_v30). The first of these was used in the GWAS analysis to remove the confounding effect of species abundance, and the last of these was used for the gut microbiome diversity and richness calculation.

Metagenomic SV detection

SVs are highly variable genomic segments within bacterial genomes that can be absent from the metagenomes of some individuals and present with variable abundance in other individuals. On the basis of the cleaned metagenomic reads, we detected microbial SVs using SGV-Finder with default parameters. SGV-Finder (v.1) was developed and described previously20 and can detect two types of SV—vSVs and dSVs.

In brief, the SV-calling procedure includes two main steps: resolving ambiguous reads with multiple alignments according to the mapping quality and genomic coverage using the iterative-coverage-based read assignment algorithm and reassigning ambiguous reads to the most likely reference with high accuracy; and splitting the reference genomes of each microbial species into genomic bins and examining the coverage of genomic bins across all samples. For the determination of dSVs within each species, the genomic bins are classified as deleted (coverage close to 0) or retained (coverage close to median coverage of the genome) bins in each sample, and those that are deleted in 25–75% of samples are kept in the analysis as raw dSVs. The raw dSVs that are highly correlated in co-occurrence are further merged into larger SV regions to produce the final dSV profile. For the determination of vSVs within each species, the coverage of genomic bins within each sample is standardized using the Z-score approach. Each bin is then assessed across all samples, and those that are highly variable on the basis of a β′ distribution are kept as raw vSVs. The raw vSVs that are highly correlated in standardized coverage are further merged into large SV regions to produce the final vSV profile.

To define the genes that belong to the SV region, we expanded the genomic coordinates of SVs 1 kb upstream and downstream, with the genes that overlap with the expanded genomic region considered genes that belong to the corresponding SV.

To identify highly variable genomic segments and detect SVs, we used the reference database provided by SGV-Finder, which is based on the proGenomes database ( We called SVs using default parameters in a larger panel of 13,195 samples from 10 datasets: 7 population cohorts (HMP1 (ref. 63), HMP2 (refs. 64,65), DMP8, LLD baseline25,48, LLD follow-up22, 500FG66 and 300TZFG28) and 3 disease cohorts (300OB67, IBD68 and HIV69). This resulted in 10,265 dSVs and 3,931 vSVs. All bacterial species with SV calling were present in at least 75 samples. For the current study, we focused on the four Dutch cohorts for which host genetic data were also available: DMP, LLD baseline, 500FG and 300OB. We removed samples with <5% of SVs called. After sample removal, SV and genotype data were available for 9,015 samples from the four cohorts: DMP (n = 7,372), LLD baseline (n = 981), 500FG (n = 396) and 300OB (n = 266).

SV filtering and normalization

First, we carried out filtering per cohort. Only SVs that were called in >10% of samples were used in the analyses. In addition, we removed dSVs with a MAF (frequency of either deletion or its absence) <5% and with both reference and alternative allele count ≤80 (this number was determined on the basis of the recommendation that the number of cases and controls is >10× the number of predictors in the generalized linear model association test70; see below). Next, we kept only SVs that were present in at least two cohorts. vSV data were normalized using inverse normal rank transformation for the heritability and association analyses.

Heritability estimation

We estimated SV heritability using the GREML software from the GCTA toolbox (v.1.94.1). We applied the family-based approach71 implemented in GREML on the SV data from the DMP cohort because this cohort has the largest sample size and contains relatives. A total of 7,389 samples with genotype and microbiome data were used for the analysis. To estimate heritability, we used default settings correcting for age, sex, total metagenomic sequencing read number and species abundance. Heritability estimates for species abundance and the corresponding confidence intervals were obtained from ref. 8, which estimated heritability on the basis of family relations in the same DMP cohort.

GWAS and meta-analysis

The manipulation of human genotype datasets was conducted using PLINK (version alpha 2.1). Association analysis was carried out using fastGWA from the GCTA toolbox (v.1.94.1)72, per cohort per SV. For dSVs, we used the generalized linear mixed model-based version of the tool (–fastGWA-mlm-binary)73. In the association analyses, we used a sparse genetic relationship matrix (GRM) created from the full GRM built on genotyped (non-imputed) SNPs with MAF > 5% using GCTA with default options (–make-grm and –make-bK-sparse 0.05). The following covariates were added to the model: age, sex, total metagenomic sequencing read number and centred log ratio (CLR)-transformed species abundance. The total read count was standardized to have a mean of zero and a variance of one. Meta-analysis was carried out using the Metal software (version 2020-05-05)74 with default options (weighting cohort-based P values according to sample size). To control for multiple testing, we applied the Bonferroni-corrected genome-wide significance threshold (5 × 10−8/SV number) and considered association results with P values below this threshold as statistically significant. For dSVs, the P-value threshold was 5 × 10−8/1,666 = 3.00 × 10−11. For vSVs, it was 5 × 10−8/1,886 = 2.65 × 10−11.

Association with ABO blood group

We used two approaches to determine the ABO blood group. In the DMP cohort, we determined the blood group on the basis of three variants (rs8176719, rs41302905 and rs8176746), as described previously2. For LLD and 500FG, in which some of these variants were not genotyped, we used a less sensitive approach based on two SNPs, rs8176693 (T allele determines blood group B) and rs505922 (T allele determines blood group O), as reported in previously published papers75,76. Association of blood groups with F. prausnitzii SVs was carried out in R (v.4.1.0) using (generalized) linear mixed models using the R package lme4qtl (v.0.2.2). This package allows a kinship matrix to be included as a random effect to account for sample relatedness. For each cohort, we created a kinship matrix based on a GRM built by GCTA using the function kinship from the R package kinship2 (v.1.9.6). We corrected for the same covariates as in the GWAS as described above. Meta-analysis was carried out using Metal74.

Population genetic structure of F. prausnitzii

We calculated an SV-based between-sample microbial genetic dissimilarity based on Canberra distance for each microbial species separately using the vegdist() function of the R package vegan (v.2.6-2) to generate species-specific genetic distance matrices (MSV). We then carried out a principal coordinate analysis based on MSV using the pcoa() function of the R package ape (v.5.6-2), with the negative eigenvalues corrected with Cailliez’s method53.

Phylogenetic tree construction

For the F. prausnitzii strains with SVs containing the GalNAc utilization gene cluster, we first constructed a phylogenetic tree using the RAxML approach based on 81 accurately selected single-copy marker genes77. We then constructed another phylogenetic tree using RAxML (v.8) based on the GalNAc utilization genes located in the SV region78. The phylogenetic trees were converted to between-strain cophenetic distances using the cophenetic() function from the R package stats (v.4.3.0).

The phylogenetic tree shown in Fig. 3c was constructed using CSI Phylogeny 1.4 on the basis of SNPs of whole-genome sequences of the 12 isolates79 and was visualized using the R packages ggtree (v.3.2.1) and gggenomes (v.

Cohousing and SV sharing

Cohousing information at the time of faecal sampling is known for 8,880 individuals from the DMP cohort. For this cohort, we removed individuals not cohousing with any other participant and those with no microbial or genetic information. For 2,631 participants, we assessed whether any individual cohousing with them at the time of sampling had F. prausnitzii 577–579. We then used a logistic regression using the presence or absence of 577–579 as a dependent variable and the secretion of A-antigens and the presence of household SV as independent variables to estimate the effect of the presence of SV in the household on SV presence in an individual. We also assessed the possible gain or loss of F. prausnitzii in 338 individuals whose gut microbiome was profiled again after 4 years22. For 119 individuals, F. prausnitzii SV profiles were generated at both time points.

Genomic island prediction

Genomic islands were predicted by SIGI-HMM81 and IslandPath-DIMOB82 as integrated into IslandViewer 4, a computational tool that integrates multiple genomic island prediction methods83. Both SIGI-HMM and IslandPath-DIMOB have been shown to have high overall accuracy, with IslandPath-DIMOB having a slightly higher recall and SIGI-HMM having a slightly higher precision.

Microbial gene annotation

The genes of F. prausnitzii strains and reference genomes used for gut microbial SV calling were annotated using MicrobeAnnotator (v.2.0.5)84 and Bakta (v.1.8.1)85. For the annotation of genes encoding glycoside hydrolase family 109 (GH109) in F. prausnitzii and C. aerofaciens strains, we first obtained 2,113 GH109 protein sequences from CAZy ( and then conducted a homologue search of GH109 genes in the genomes of F. prausnitzii and C. aerofaciens strains using tblastn (v.2.5.0+)87 with the following parameters: -outfmt 7 -evalue 1e-10.

Homologue search in genes involved in the GalNAc pathway

We downloaded 10,487 assembled genomes of ABO-associated species from the Unified Human Gastrointestinal Genome collection33, including 1,103 assemblies of C. aerofaciens, 484 of F. lactaris, 1,109 of B. bifidum and 7,791 of F. prausnitzii. We then used the sequences of genes located in SV 577–579 as queries and carried out a homologue search in the assemblies using tblastn (v.2.5.0+)87 with the following parameters: -outfmt 7 -evalue 1e-10.

Protein family search and profiling with shortBRED

We searched the metagenomes for 27 bacterial proteins identified in the SV segment of F. prausnitzii (excluding dinB and HTF-238_02530, which were used as SV region markers and are not located within the SV), including the genes known to be involved in GalNAc metabolism, using the shortBRED toolkit (v.0.9.5)88. We extracted the genes located in the SV and converted the gene sequences to protein sequences, as required by shortBRED. We used the shortBRED tool (v.0.9.5) to identify unique markers for the query genes, using the UniRef90 database (downloaded on 1 November 2021) as a negative control.

Next, the tool (v.0.9.5) was used to quantify these markers in metagenomes. First, we assessed the association of these gene abundances with the ABO blood group. We log-transformed the RPKM values provided by shortBRED and carried out a linear mixed model analysis using shortBRED gene abundances as outcomes and ABO A or AB blood group as a predictor accounting for sample relatedness using random effects in the lme4qtl package. We also included other covariates as predictors, including age, sex, total metagenomic sequencing read number and CLR-transformed F. prausnitzii abundance, together with four F. prausnitzii dSVs and one vSV found to be associated with ABO in the primary GWAS analysis.

Next, we estimated the association of gene abundance with the α-diversity (Shannon index and richness) of the gut microbiome in DMP using linear regression using the following formula:

αdiversity = SV 577–579 + F. prausnitzii taxonomic abundance + C. aerofaciens taxonomic abundance + gene abundance.

Bacterial strains and growth

The Faecalibacterium and Collinsella strains used in this study were from culture collections (ATCC and DSMZ) and our local strain collection (Department of Medical Microbiology, University Medical Center Groningen, Groningen, the Netherlands). On the basis of the presence or absence of SVs, the following Faecalibacterium strains were selected: F. prausnitzii A2-165 (DSM 17677), F. prausnitzii ATCC 27768, F. prausnitzii HTF-F (DSM 26943), F. prausnitzii HTF-112, F. prausnitzii HTF-495, F. prausnitzii HTF-238, F. prausnitzii HTF-383, F. prausnitzii 60C2, F. prausnitzii HTF-121, F. prausnitzii HTF-133, F. prausnitzii HTF−441 and F. prausnitzii FM4. Two strains of C. aerofaciens were selected on the basis of the presence of the GalNAc genes: C. aerofaciens 4PBA and C. aerofaciens HTF-129.

Strains were cultured in a modified YCFA medium supplemented with different carbohydrates (glucose, galactose, GalNAc, mannose, lactose, fructose, N-acetylglucosamine, 2-fucosyllactose and N-acetylneuraminic acid). YCFA medium was prepared as for YCFA–glucose (YCFAG) medium described before89 without the addition of glucose. YCFA medium was composed of (g l−1) 10 casitone, 2.5 yeast extract, 4 sodium bicarbonate, 0.45 dipotassium hydrogen phosphate, 0.45 potassium dihydrogen phosphate, 0.9 sodium chloride, 0.09 magnesium (II) sulfate heptahydrate, 0.12 calcium chloride dihydrate, 2.7 sodium acetate, 1 cysteine, 5 ml 0.02% resazurin and 0.2% haemin, 1 ml pink vitamin mixture and yellow vitamin mixture, and the liquid medium. The pink vitamin mixture (per 100 ml) contains 1 mg biotin, 1 mg cobalamin, 3 mg p-aminobenzoic acid, 5 mg folic acid and 15 mg pyridoxamine. The yellow vitamin mixture (per 100 ml) contains 5 mg thiamine and 5 mg riboflavin. The liquid medium includes 600 µl l−1 propionate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isobutyrate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isovalerate (≥99% purity, Sigma-Aldrich) and 100 µl l−1 valerate (≥99% purity, Sigma-Aldrich). The medium is adjusted to a final pH of 6.5.

Growth experiments were carried out in a Bactron 600 anaerobic incubator (Kentron Microbiome BV) using a 24-well flat-bottom-plate with total volume of 1 ml per well YCFA medium supplemented with 4.5 g l−1 of the desired carbohydrate source. Cultures were started at an initial OD600nm range of 0.10–0.15 by the addition of an overnight glucose-grown pre-culture, and growth was monitored anaerobically at 600 nm over 24 h at 37 °C. Readings were taken every 2 h, after 10 s shaking, using Epoch 2 (Agilent BioTek Instruments), and growth curves were generated using Gen5 software. Each growth condition was carried out in triplicate using three independent pre-cultures. Data of growth curves are reported as means ± s.d.

Gene expression analysis of GalNAc induction

Sample collection

The F. prausnitzii strains HTF-495, HTF-441 and ATCC 27768 were selected to test the mRNA expression level of genes on the basis of the shortest distance within the phylogenetic tree. The F. prausnitzii strains were pre-cultured individually in YCFAG medium overnight anaerobically at 37 °C in triplicate. To get enough biomass, these pre-cultures were used to inoculate fresh triplicates of each strain in a ratio of 1:20 (20 ml) and incubated for 24 h anaerobically at 37 °C in YCFAG medium. Each culture was then split into two tubes (10 ml per tube) and centrifuged at 3,000 r.p.m. for 10 min. The supernatants were removed and resuspended with 10 ml YCFAG or YCFA-GalNAc, separately for each culture, in a total of 18 samples. After 6 h of incubation, a 1:1 ratio (10 ml) of ice-cold killing buffer (20 mM Tris-HCl pH 7.5, 5 mM MgCl2, 20 mM NaN3) was added to the cultures. Samples were centrifuged at 3,000 r.p.m. for 10 min at 4 °C, and the supernatants were removed. The pellets were resuspended in 1 ml TRIzol (Invitrogen) and stored at −80 °C until further RNA isolation.

RNA isolation and cDNA synthesis

For RNA isolation, 200 µl of RNAse-free chloroform was added to each sample and incubated at room temperature for 5 min. After incubation, the samples were centrifuged at 12,000g at 4 °C, and the aqueous phase was recovered into a new tube. To precipitate RNA, 500 µl of RNAse-free isopropanol was added to each sample and mixed briefly. Samples were incubated for 10 min at room temperature and centrifuged for 10 min at 12,000g and 4 °C. The supernatant was removed, and the pellets were washed in 1 ml of 75% RNAse-free ethanol, vortexed briefly and centrifuged for 5 min at 7,500g at 4 °C. The supernatant was removed, and the pellets were air-dried at room temperature for 10 min. Afterward, the samples were resuspended with RNAse-free water.

Finally, DNA contamination was removed from 10 µg of the sample using TURBO DNA-free Kit (Invitrogen). cDNA was generated using the TaqMan Reverse Transcription Reagents (Invitrogen) with random hexamers.

Quantitative PCR

Samples were diluted to working concentration and used as a template for quantitative PCR (qPCR) amplification of the target genes (for primers, see Supplementary Table 20). Each reaction contained 10 μl of GoTaq qPCR Master Mix (Promega), 9 μl of DNA template (10 ng) and two times 0.5 μl primer solution (20 µM) in a total reaction volume of 20 μl. The amplification was carried out in a 7500 Real-Time PCR System (Applied Biosystems). The amplification program comprised two stages: an initial denaturation step at 95 °C for 2 min, followed by 40 two-step cycles at 95 °C for 15 s and at 60 °C for 1 min. At the end of the run, a melting curve analysis was carried out. The cycle threshold (Ct) value was first determined using the 7500 Real-Time PCR System detection system and then adjusted manually to set the threshold within the exponential phase of the curves. All qPCR reactions were carried out in triplicate. TheΔCt values of the genes of interest were obtained by correction for the Ct value of rpoA as the housekeeping gene. Afterward, the different \({2}^{-\Delta {C}_{{\rm{t}}}}\) values of each strain were calculated per condition. These values were used to determine the relative fold change expression of the genes after GalNAc induction compared to growth in glucose.

Ethical approval

The Lifelines study was approved by the ethics committee of the University Medical Center Groningen (METc2007/152). All participants signed an informed consent form before enrolment. Additional written consents were signed by the DMP participants or legal representatives for children aged under 18 years. The LLD study was approved by the Institutional Ethics Review Board of the University Medical Center Groningen (ref. M12.113965), the Netherlands. The 300OB study was approved by the IRB CMO Regio Arnhem-Nijmegen (number 46846.091.13). The 500FG study was approved by the Ethical Committee of Radboud University Nijmegen (NL42561.091.12, 2012/550). The inclusion of volunteers and experiments was conducted according to the principles expressed in the Declaration of Helsinki. All volunteers gave written informed consent before any material was taken. The 300FGTZ study was approved by the Ethical Committees of the Kilimanjaro Christian Medical University College (CRERC; number 936) and the National Institute for Medical Research (NIMR/HQ/R.8a/Vol. IX/2290) in Tanzania. The Tanzanian cohort provided consent for the use of their data for the purposes of this analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Posts