Cancer genomics – Page 2

Geographic variation of mutagenic exposures in kidney cancer genomes

[ad_1]

Recruitment of cases and informed consent

The International Agency for Research on Cancer (IARC/WHO) coordinated case recruitment through an international network of more than 40 collaborators from the 11 participating countries (Table 1 and Supplementary Table 16). The inclusion criteria for patients were ≥18 years of age (range from 23 to 87, with a mean of 60 and a s.d. of 12), confirmed diagnosis of primary ccRCC and no prior cancer treatment. Informed consent was obtained for all participants. Patients were excluded if they had any condition that could interfere with their ability to provide informed consent or if there were no means of obtaining adequate tissues or associated data as per the protocol requirements. Ethical approvals were first obtained from each Local Research Ethics Committee and Federal Ethics Committee when applicable, as well as from the IARC Ethics Committee.

Bio-samples, data collection and expert pathology review

Dedicated standard operating procedures, following guidelines from the International Cancer Genome Consortium (ICGC), were designed by IARC/WHO to select appropriate case series with complete biological samples and exposure information as described previously³⁴ (Supplementary Table 16). In brief, for all case series included, anthropometric measures were taken, together with relevant information regarding medical and familial history. Comparable smoking and alcohol history was available from all centres. Detailed epidemiological information on residential history was collected in Czech Republic, Romania, and Serbia. Potential limitations of using retrospective clinical data collected using different protocols from different populations were addressed by a central data harmonization to ensure a comparable group of exposure variables (Supplementary Table 16). All patient-related data as well as clinical, demographical, lifestyle, pathological and outcome data were pseudonymized locally through the use a dedicated alpha-numerical identifier system before being transferred to IARC/WHO central database.

Original diagnostic pathology departments provided diagnostic histological details of contributing cases through standard abstract forms. IARC/WHO centralized the entire pathology workflow and coordinated a centralized digital pathology examination of the frozen tumour tissues collected for the study as well as formalin-fixed, paraffin-embedded sections when available, via a web-based report approach and dedicated expert panel following standardized procedures as described previously³⁴. A minimum of 50% viable tumour cells was required for eligibility to whole-genome sequencing.

In summary, frozen tumour tissues were first examined to confirm the morphological type and the percentage of viable tumour cells. A random selection of tumour tissues was independently evaluated by a second pathologist. Enrichment of tumour component was performed by dissection of non-tumoral part, if necessary. 90 cases overlapped with a previously published cohort recruited under the Cancer Genomics of the Kidney (CAGEKID) project⁵, which were also part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project⁷.

DNA extraction

Extraction of DNA from fresh frozen tumour and matched blood samples was centrally conducted at IARC/WHO except for Japan, which performed DNA extractions at the local centre following a similarly standardized DNA extraction procedure. Of the cases which proceeded to the final analysis (n = 962), germline DNA was extracted from either buffy-coat, whole blood, or from adjacent normal tissue (samples from Japan) using previously described protocols and methods³⁴.

Whole-genome sequencing

In total, 1,583 renal cell carcinoma cases were evaluated, with 1,267 confirmed as ccRCC cases. One hundred and sixteen cases (9%) were excluded due to insufficient viable tumour cells (pathology level), or inadequate DNA (tumour or germline). DNA from 1,151 cases was received at the Wellcome Sanger Institute for whole-genome sequencing. Fluidigm single nucleotide polymorphism (SNP) genotyping with a custom panel was performed to ensure that each pair of tumour and matched normal samples originated from the same individual. Whole-genome sequencing (150 bp paired end) was performed on the Illumina NovaSeq 6000 platform with target coverage of 40X for tumours and 20X for matched normal tissues. All sequencing reads were aligned to the GRCh38 human reference genome using Burrows-Wheeler-MEM (v0.7.16a and v0.7.17). Post-sequencing quality control metrics were applied for total coverage, evenness of coverage and contamination. Cases were excluded if coverage was below 30X for tumour or 15X for normal tissue. For evenness of coverage, the median over mean coverage score was calculated. Tumours with median over mean coverage scores outside the range of values determined by previous studies to be appropriate for whole-genome sequencing (0.92–1.09) were excluded. Conpair³⁹ (https://github.com/nygenome/Conpair) was used to detect contamination, cases were excluded if the result was greater than 3%⁴⁰. A total of 962 cases passed all criteria and were included in subsequent analysis.

Somatic variant calling

Variant calling was performed using the standard Sanger bioinformatics analysis pipeline (https://github.com/cancerit). Copy number profiles were determined first using the algorithms ASCAT⁴¹ and BATTENBERG²⁸, where tumour purity allowed. Single nucleotide variants (SNVs) were called with cgpCaVEMan⁴², indels were called with cgpPINDEL⁴³, and structural rearrangements were called using BRASS. CaVEMan and BRASS were run using the copy number profile and purity values determined from ASCAT where possible (complete pipeline, n = 857). Where tumour purity was insufficient to determine an accurate copy number profile (partial pipeline, n = 105), CaVEMan and BRASS were run using copy number defaults and an estimate of purity obtained from ASCAT/BATTENBERG. For SNV additional filters on ASRD (read length-adjusted alignment score of reads showing the variant allele) and CLPM (median number of soft-clipped bases in variant supporting reads) (ASRD ≥ 140 and CLPM = 0) were applied to remove potential false positive calls. A second variant caller, Strelka2, was run for SNVs and indels as consensus variant calling was previously shown to eliminate algorithm specific artefacts and to generate highly reproducible mutational spectra compared to using a single variant calling algorithm^34,44. Only variants called by both the Sanger variant calling pipeline and Strelka2 were included in subsequent analysis.

Validation of sequencing for Japanese cases

The matched normal tissue sequenced was blood for all countries with the exception of Japan, where adjacent normal kidney was used. As Japan displayed an enrichment of SBS12, matched blood was obtained from 28 of the 36 patients to confirm that the source of the matched normal tissue was not influencing the result. In all cases, the mutational spectra of Japanese ccRCC generated using either blood or adjacent normal kidney matched each other with a cosine similarity of >0.99.

Generation of mutational matrices

Mutational matrices for SBS, DBS and indels were generated using SigProfilerMatrixGenerator (https://github.com/AlexandrovLab/SigProfilerMatrixGenerator) with default options (v1.2.12)⁴⁵.

Mutational signature analysis

Mutational signatures were extracted using two algorithms, SigProfilerExtractor (https://github.com/AlexandrovLab/SigProfilerExtractor), based on non-negative matrix factorization, and mSigHdp⁴⁶ (https://github.com/steverozen/mSigHdp), based on the Bayesian hierarchical Dirichlet process. For SigProfilerExtractor, de novo mutational signatures were extracted from each mutational matrix using SigProfilerExtractor with nndsvd_min initialization (NMF_init = “nndsvd_min”) and default parameters (v1.1.9)⁴⁷. Briefly, SigProfilerExtractor deciphers mutational signatures by first performing Poisson resampling of the original matrix with additional renormalization (based on a generalized mixture model approach) of hypermutators to reduce their effect on the overall factorization⁴⁷. Non-negative matrix factorization (NMF) was performed using initialization with non-negative singular value decomposition and by applying the multiplicative update algorithm using the Kullback–Leibler divergence as an objective function⁴⁷. NMF was applied with factorizations between k = 1 and k = 20 signatures; each factorization was repeated 500 times⁴⁷. De novo SBS mutational signatures were extracted with SigProfilerExtractor for both SBS-288 and SBS-1536 contexts⁴⁵. The results were largely concordant with the SBS-1536 de novo signatures allowing additional separation of mutational processes, therefore the SBS-1536 de novo signatures were taken forward for further analysis (Supplementary Table 2). Mutational signatures for DBS and indels were extracted in DBS-78 and ID-83 contexts respectively (Supplementary Tables 3 and 4). Where possible, SigProfilerExtractor matched each de novo extracted mutational signature to a set of previously identified COSMIC signatures⁴, for SBS-1536 signatures this requires collapsing the 1536 classification into the standard 96 substitution type classification with six mutation classes having single 3’ and 5’ sequence contexts (Supplementary Table 7). This step makes it possible to distinguish between de novo signatures which can be explained by a combination of the known catalogue of mutational process (which have not been completely separated during the extraction), and those which have not been previously identified. mSigHdp extraction of SBS-96 and ID-83 signatures was performed using the suggested parameters and using the country of origin to construct the hierarchy. SigProfilerExtractor’s decomposition module was subsequently used to match mSigHdp de novo signatures to previously identified COSMIC signatures⁴. Further details on the comparison of results between SigProfilerExtractor and mSigHdp, decomposition of de novo signatures into COSMIC reference signatures and support for the splitting of SBS40 components can be found in the Supplementary Note.

Attribution of activities of mutational signatures

The de novo (SigProfiler) and COSMIC signature (SigProfiler and mSigHdp) activities were attributed for each sample using the MSA signature attribution tool (v2.0, https://gitlab.com/s.senkin/MSA)⁴⁸. For COSMIC attributions, only COSMIC reference signatures, which were identified in the decomposition of de novo signatures, were included in the panel for attribution, in addition to de novo signatures which could not be decomposed into COSMIC reference. At its core, the tool utilizes the non-negative least squares (NNLS) approach minimizing the L2 distance between the input sample and the one reconstructed using available signatures. To limit false positive attributions, an automated optimization procedure was applied by repeated removal of all signatures that do not increase the L2 similarity of a sample by >0.008 for SBS, >0.014 for DBS, and >0.03 for ID mutation types, as suggested by simulations. These optimal penalties were derived using an optional parameter (params.no_CI_for_penalties = false) utilizing a conservative approach in calculation of penalties. Finally, a parametric bootstrap approach was applied to extract 95% confidence intervals for each attributed mutational signature activity.

Driver mutations

A dNdS approach was used to identify genes under positive selection in ccRCC⁴⁹. The analysis was performed both for the whole genome (q value < 0.01), and with restricted hypothesis testing for a panel of 369 known cancer genes⁴⁹. Variants in any gene identified as under positive selection in global dNdS or in the 369-cancer gene panel were assessed as potential drivers⁴⁹. Candidate driver mutations were annotated with the mode of action using the Cancer Gene Census (https://cancer.sanger.ac.uk/census) and the Cancer Genome Interpreter tool⁵⁰ (https://www.cancergenomeinterpreter.org). Missense mutations were assessed using the MutationMapper tool⁵¹ (http://www.cbioportal.org/mutation_mapper). Variants were considered likely drivers if they met any of the following criteria: (1) Truncating mutations in genes annotated as tumour suppressors; (2) mutations annotated as probably or known oncogenic in MutationMapper; (3) truncating variants in genes with selection (q value < 0.05) for truncating mutations assumed to be tumour suppressors and thus likely drivers; (4) missense variants in all genes under positive selection and with dN/dS ratios for missense mutations above 5 (assuming 4 of every 5 missense mutations are drivers) labelled as likely drivers; or (5) in-frame indels in genes under significant positive selection for in-frame indels.

Evolutionary analysis

Subclonal architecture reconstruction was performed using the DPClust R package v2.2.8 (refs. ^28,29), after obtaining cancer cell fraction (CCF) estimates by dpclust3p v1.0.8 (https://github.com/Wedge-lab/dpclust3p) based on the variant allele frequency provided by the somatic variant callers and the copy number profiles determined by the BATTENBERG algorithm. Only tumours with at least 40% purity according to BATTENBERG were considered for further evolutionary analysis. For each tumour with at least one subclone, the respective somatic mutations were split into clonal and subclonal mutations using the most probable cluster assignment for each mutation as per the DPClust output. Mutations not assigned to a cluster by DPClust were removed from further analysis. Clusters centred at a CCF > 1.5 and ones where chromosome X contributed the highest number of mutations were deemed artifactual, and the respective mutations were removed. Samples with a total number of clonal or subclonal mutations below 256 were also removed. Additionally, samples with poor separation between the clonal and subclonal distributions (e.g., subclone centred at a CCF > 0.80) were removed. Finally, only samples that had both a clone and at least one subclone post-filtering were retained for further analysis. This yielded a total of 223 samples, each with clonal and subclonal mutations. SigProfilerAssignment (v0.0.13)⁵² (https://github.com/AlexandrovLab/SigProfilerAssignment) was used to identify the activity of each mutational signature in each clone/subclone, and these activities were then normalized by the total number of mutations belonging to the clone or subclone (that is, clonal mutations were not included in the subclone). A two-sided Wilcoxon signed-rank test⁵³ was used to assess the differences in the relative activity of each mutational signature between the clones and their respective subclones. P values were corrected using the Benjamini–Hochberg procedure⁵⁴ and reported as q values in this Article.

Regressions

Signature attributions were dichotomized into presence and absence using confidence intervals, with presence defined as both lower and upper limits being positive, and absence as the lower limit being zero. If a signature was present in at least 75% of cases (SBS1, SBS40a, SBS40b, ID1 and ID5), it was dichotomized into above and below the median of attributed mutation counts. The binary attributions served as dependent variables in logistic regressions, and relevant risk factors were used as factorized independent variables. To adjust for confounding factors, sex, age of diagnosis, country, and tobacco status were added as covariates in regressions. The Bonferroni method was used to test for significant P values (that is, a total of 224 comparisons for regressions with signatures, and a total of 24 comparisons for regressions with mutation burden). P values reported are raw (not corrected). Regressions with incidence of renal cancer were performed as linear regressions with mutation burdens or signature attributions (those present in at least 75% of cases) with confidence intervals not consistent with zero as a dependent variable, and ASR of renal cancer obtained from Global Cancer Observatory (GLOBOCAN)¹¹, sex and age of diagnosis as independent variables. ASR of renal cancer for regions of Czech Republic were obtained from SVOD web portal⁵⁵. Signatures present in less than 75% of cases were dichotomized into presence and absence as previously mentioned and analysed using the logistic regressions. All regressions were performed on a sample basis.

Polygenic risk score analysis of lifestyle risk factors

In this analysis, we used the genome-wide association studies (GWAS) summary statistics estimated in European populations for well-established risk factors for ccRCC. For tobacco smoking status, we used results from the GSCAN consortium meta-analysis of smoking initiation (ever vs never status)⁵⁶. For BMI, the results of UK Biobank and GIANT consortium meta-analysis of continuous BMI were used⁵⁷. GWAS summary statistics related to hypertension, namely systolic blood pressure and diastolic blood pressure, as well as the ones related to diabetes⁵⁸, such as fasting glucose and fasting insulin were also obtained using UK Biobank studies⁵⁹.

Since all the GWAS summary statistics used in the current work were based on European populations, we used ADMIXTURE tool (v1.3.0)⁶⁰ and PCA to infer the unsupervised cluster of individuals with European genetic background within ccRCC cases. Hapmap SNPs (n = 1,176,821 variants) were extracted from the ccRCC whole-genome sequence genotype data. After basic quality control using PLINK (v1.9b, www.cog-genomics.org/plink/1.9/), 333 variants were removed due to missing genotype rate > 5%, 1,236 variants failed Hardy–Weinberg equilibrium test (P values < 10⁻⁸), and 18,702 variants had MAF < 1% in our cohort. Additionally, 3 ambiguous variants and 21,358 variants within regions of long-range, high linkage disequilibrium in the human genome (hg38) were excluded. After pruning for linkage disequilibrium, 143,727 variants remained in ccRCC genotype data. The 1000 genome reference population genotype data (phase 3) for Europeans (N = 489), Africans (Yoruba in Ibadan, Nigeria, N = 108) and East Asians (N = 103 from China and 104 from Japan) (https://www.internationalgenome.org/data/) were filtered and merged with ccRCC genotype data based on the pruned set of variants present in both datasets. ADMIXTURE was run on the merged genotype data with k = 3, which would correspond to the 3 ancestral continental population groups that probably reflect the participants of our study. The ccRCC cases with European genetic fraction greater than 80% by the ADMIXTURE analysis were selected for the PRS analyses. To complement the ADMIXTURE analysis, PCA was run on the same samples.

The initial genotype data based on whole-genome sequence from 849 ccRCC cases with European genetic background consisted of biallelic SNPs with MAF > 0.01% (to exclude ultra-rare variants; N ≈ 30 million variants). After basic quality control, variants with missing genotype rate of greater than 5% (N = 7,519,196 variants) with strong deviation from Hardy–Weinberg equilibrium (P values < 10⁻⁸, N = 220,862) were excluded. For each GWAS trait, we restricted our analyses to the biallelic SNPs with minor allele frequency (MAF) greater than 1% in the 1000 Genomes reference for European populations. For the selection of the independent genome-wide significant hits (P values < 5 × 10⁻⁸) of each GWAS summary statistic used to generate the PRS, SNPs were clumped (r² = 0.1 within a linkage disequilibrium window of 10 MB) using PLINK v1.9b⁶¹ (www.cog-genomics.org/plink/1.9/) based on the 1000 Genomes European reference population genotype data (N = 489; ~10 million variants). Where a selected GWAS hit was not found in ccRCC genotype data, we extracted proxies (r² > 0.8 in 1000 Genomes) also present in ccRCC dataset where possible (Supplementary Table 17). The variance of each genetic trait explained by the genetic variants were calculated as previously suggested⁶². PRS was subsequently calculated as the sum of the individual’s beta-weighted genotypes using PRSice-2 software⁶³. Associations were estimated per standard deviation increase in the PRS, which was normalized to have a mean of zero across ccRCC cases of European genetic ancestry.

Untargeted metabolomics association with signatures

Of the 962 subjects from the main analysis, 901 subjects were included in this sub-study—all Japanese samples (n = 36) as well as few cases from Czech Republic (n = 13), Romania (n = 5) and Russia (n = 3) were not included due to lack of available plasma samples. Samples were randomized and analysed as two independent analytical batches. Analysis was performed with a UHPLC-QTOF-MS system that consisted of a 1,290 Binary LC and a 6,550 QTOF mass spectrometer equipped with Jet Stream electrospray ionization source (Agilent Technologies), using previously described methods⁶⁴. Pre-processing was performed using Profinder 10.0.2.162 and Mass Profiler Professional B.14.9.1 software (Agilent Technologies, https://www.agilent.com/). A ‘batch recursive feature extraction (small molecules)’ process was employed for samples and blanks to find [M + H]⁺ ions. The two batches were processes separately and the resulting features were aligned in Mass Profiler Professional. Chromatographic peak areas were used as a measurement of intensity. No normalization or transformation of raw data was performed prior to the downstream data analysis.

A total of 2,392 features were detectable in at least one of the 901 samples. Features present in only one of the two batches were filtered out. Recursive filtering elimination was applied to decrease redundancy from highly correlated variables (r ≥ 0.85, Pearson’s r calculated before any transformation/imputation) by selecting the features with least missing data within clusters of features. A total of 944 features were included in the statistical analysis. Features were pre-processed: missing values were replaced with one-fifth of the minimal value of the feature before applying mean centering and Pareto scaling. Each feature was regressed against both de novo and COSMIC signatures, adjusting for sex and age of diagnosis, as well as BMI and technical factors (batch, acquisition order) that could impact chromatographic peak area. Models for SBS22a and SBS22b were restricted to Romanian and Serbian samples to find potential pathways of aristolochic acid exposure in the Balkan region. Logistic models were used for zero-inflated signatures (≥30% zeros) while quasi-Poisson regressions were used for the least zero-inflated signatures (SBS1, SBS40a, and SBS40b). To derive specific false detection rates, random variables were created from permutations of the initial features and regressed against signatures in the same fashion as true features. Maximum P value thresholds from regressions with random features were compared to adjusted P value thresholds according to Bonferroni’s procedure. The more conservative approach was used in selecting features of interest. Random forest models were also used as cross-checking multivariate models to assess the relative importance of each feature in explaining the signature attribution. As with univariate models, regression models were used for the least zero-inflated signatures (<30% of zeros) while classification models were used for all other signatures, with restriction to Romanian and Serbian samples for SBS22a and SBS22b. Importance was estimated from the total decrease in node impurities from splitting on the variable, averaged over all trees. Node impurity was measured by the Gini index for classification, and by residual sum of squares for regression. The significance of importance metrics for Random forest models were estimated by permuting the response variable (https://github.com/EricArcher/rfPermute).

Features considered for identification, along with their highly correlated counterparts, were searched in Human Metabolome Database (HMDB), LipidMaps, Metlin and KEGG. Compound identity was confirmed by comparison of retention times and MS/MS fragmentation against chemical standards when available, or otherwise against reference MS/MS spectra. Since the feature [email protected] was strongly correlated with several features identified as TMAP (Supplementary Table 10), the integration of these features was inspected and corrected manually, and regressed against SBS40b using the same model applied to features selected for analysis. Creatinine was identified among the features by matching its retention time and MS/MS spectra against a reference standard and also regressed against SBS40b in the same fashion as other metabolites. Estimation of correlation between metabolic features was done using linear regression adjusting for batch and acquisition order.

Targeted metabolomics analyses

Circulating levels of per- and polyfluorinated substances (PFAS) and cystatin C compounds were investigated using targeted mass spectrometry-based methods as described previously^30,65.

Out of the 962 subjects from the main analysis, plasma samples from 909 subjects (from all countries except Japan) were randomized and sent frozen in dry ice to each respective laboratory for analyses. Measurement of cystatin C from 906 subjects included its native form and isoforms (3Pro-OH cystatin C, cystatin C-desS, 3Pro-OH cystatin C-desS and cystatin C-desSSP) that were modelled individually and for the total concentration of cystatin C isoforms. Measurements of PFAS compounds included PFOA (total, branch, linear), PFOS (perfluorooctanoic acid; total, branch, linear), PFHxS (perfluorohexane sulfonate), PFNA (perfluorononanoic acid), PFDA (perfluorodecanoic acid), MePFOSAA (n-methylperfluoro-1 octanesulfonamido acetic acid) and EtPFOSAA (2-(N-ethyl-perfluorooctane. sulfonamido) acetic acid).

Multivariable quasi-Poisson (for the least sparse signatures SBS1, SBS40a and SBS40b) and logistic regression were used to estimate the association between plasma concentrations of the aforementioned substances and mutational signatures. All compounds were modelled continuously (log₂-transformed) and categorically, with adjustments made by sex, age, date of recruitment, country, BMI, tobacco and alcohol status in the case of PFAS molecules and by sex, age and BMI, in the case of cystatin C.

Geospatial analyses

Geospatial analyses were performed to estimate the regional effect for signature attribution, particularly for signatures thought to be from exogenous exposure (SBS40b, unknown and SBS22a/SB22b, aristolochic acid). Residential history information was available for a large proportion of cases from the countries of interest: Czech Republic for SBS40b and Romania and Serbia for SBS22a and SBS22b, respectively. The 259 cases from Czech Republic within this study were recruited from four separate regions including Prague, České Budějovice (in Southern Bohemia), as well as Brno and Olomouc in the east of the country. Each individual residence was geocoded to its administrative region. All locations outside the country of recruitment were labelled as ‘abroad’. A multi-membership mixed model was used to account for the full list of regions in which each subject resided, as well as the proportion of life spent in that region before diagnosis. As dependent variable, signatures were inverse-normal transformed. Models were adjusted for sex and age of diagnosis (fixed effects). The regional effect was treated as random effect.

Statistics and reproducibility

Analyses were conducted using R version 4.1 (ref. ⁶⁶) and python version 3.9.13 (ref. ⁶⁷). Handling of geospatial and other data was conducted using the R packages lme4, matrixStats, Matrix, geojsonio, raster, rgeos, sf, sp, tmaptools, patchwork, leaflet, data.table, dplyr, haven, Hmisc, openxlsx, rgdal, scales, stringr, tidyr, tibble, xlsx, rfPermute, randomForest, forcats, and in python using the packages pandas, numpy, scipy, statsmodels, firthlogist, patsy and jupyter^{68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97}. Figures were created using ggplot, ggnewscale, ggpattern, ggrepel, ggsflabel, ggspatial, ggpubr, cowplot, matplotlib, plotly (https://plot.ly), seaborn and TMB_plotter^{98,99,100,101,102,103,104,105,106,107,108}. Open-source maps of Czech Republic, Romania and Serbia were obtained from the Global Administrative Areas project¹⁰⁹ (https://gadm.org), with the surrounding borders obtained from the Natural Earth project¹¹⁰ (https://www.naturalearthdata.com/). Signature extraction was replicated two times independently at both Wellcome Sanger Institute and UCSD, with similar results. Signature attribution was replicated two times independently at both Wellcome Sanger Institute and IARC, with similar results. All attempts at replication were successful. No other experiments other than those mentioned here were replicated independently due to limited resources. Additional details relating to the methods used in this study can be found in Supplementary Figs. 1–27 and Supplementary Note Tables 1–10.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

[ad_2]

Source link

May 1, 2024
3D genomic mapping reveals multifocality of human pancreatic precancers

[ad_1]
Hruban, R. H., Goggins, M., Parsons, J. & Kern, S. E. Progression model for pancreatic cancer. Clin. Cancer Res. 6, 2969–2972 (2000).

CAS
PubMed

Google Scholar
Siegel, R. L., Miller, K. D., Hannah, F. E. & Jemal, A. Cancer statistics, 2022. CA 72, 7–33 (2022).

PubMed

Google Scholar
Ryan, D. P., Hong, T. S. & Bardeesy, N. Pancreatic adenocarcinoma. N. Engl. J. Med. 371, 1039–1049 (2014).

Article
CAS
PubMed

Google Scholar
Takaori, K., Kobashi, Y., Matsusue, S., Matsui, K. & Yamamoto, T. Clinicopathological features of pancreatic intraepithelial neoplasias and their relationship to intraductal papillary-mucinous tumors. J. Hepatobiliary Pancreat. Surg. 10, 125–136 (2003).

Article
PubMed

Google Scholar
Hruban, R. H. et al. An illustrated consensus on the classification of pancreatic intraepithelial neoplasia and intraductal papillary mucinous neoplasms. Am. J. Surg. Pathol. 28, 977–987 (2004).

Article
PubMed

Google Scholar
Kanda, M. et al. Presence of somatic mutations in most early-stage pancreatic intraepithelial neoplasia. Gastroenterology 142, 730–733.e739 (2012).

Article
CAS
PubMed

Google Scholar
Hong, S. M. et al. Genome-wide somatic copy number alterations in low-grade PanINs and IPMNs from individuals with a family history of pancreatic cancer. Clin. Cancer Res. 18, 4303–4312 (2012).

Article
CAS
PubMed
PubMed Central

Google Scholar
Andea, A., Sarkar, F. & Adsay, V. N. Clinicopathological correlates of pancreatic intraepithelial neoplasia: a comparative analysis of 82 cases with and 152 cases without pancreatic ductal adenocarcinoma. Mod. Pathol. 16, 996–1006 (2003).

Article
PubMed

Google Scholar
Makohon-Moore, A. P. et al. Precancerous neoplastic cells can move through the pancreatic ductal system. Nature 561, 201–205 (2018).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Kiemen, A. L. et al. CODA: quantitative 3D reconstruction of large tissues at cellular resolution. Nat. Methods 19, 1490–1499 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Hosoda, W. et al. Genetic analyses of isolated high-grade pancreatic intraepithelial neoplasia (HG-PanIN) reveal paucity of alterations in TP53 and SMAD4. J. Pathol. 242, 16–23 (2017).

Article
CAS
PubMed
PubMed Central

Google Scholar
Opitz, F. V., Haeberle, L., Daum, A. & Esposito, I. Tumor microenvironment in pancreatic intraepithelial neoplasia. Cancers 13, 6188 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar
Hata, T. et al. Genome-wide somatic copy number alterations and mutations in high-grade pancreatic intraepithelial neoplasia. Am. J. Pathol. 188, 1723–1733 (2018).

Article
CAS
PubMed
PubMed Central

Google Scholar
Chhoda, A., Lu, L., Clerkin, B. M., Risch, H. & Farrell, J. J. Current approaches to pancreatic cancer screening. Am. J. Pathol. 189, 22–35 (2019).

Article
PubMed

Google Scholar
Fischer, C. G. et al. Intraductal papillary mucinous neoplasms arise from multiple independent clones, each with distinct mutations. Gastroenterology 157, 1123–1137.e1122 (2019).

Article
CAS
PubMed

Google Scholar
Wu, J. et al. Recurrent GNAS mutations define an unexpected pathway for pancreatic cyst development. Sci. Transl. Med. 3, 92ra66 (2011).

Article
CAS
PubMed
PubMed Central

Google Scholar
Felsenstein, M. et al. IPMNs with co-occurring invasive cancers: neighbours but not always relatives. Gut 67, 1652–1662 (2018).

Article
CAS
PubMed

Google Scholar
Connor, A. A. et al. Integration of genomic and transcriptional features in pancreatic cancer reveals increased cell cycle progression in metastases. Cancer Cell 35, 267–282.e267 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar
Makohon-Moore, A. P. et al. Limited heterogeneity of known driver gene mutations among the metastases of individual patients with pancreatic cancer. Nat. Genet. 49, 358–366 (2017).

Article
CAS
PubMed
PubMed Central

Google Scholar
Shi, C. et al. KRAS2 mutations in human pancreatic acinar-ductal metaplastic lesions are limited to those with PanIN: implications for the human pancreatic cancer cell of origin. Mol. Cancer Res. 7, 230–236 (2009).

Article
CAS
PubMed
PubMed Central

Google Scholar
Qu, C. et al. Detection of early-stage hepatocellular carcinoma in asymptomatic HBsAg-seropositive individuals by liquid biopsy. Proc. Natl Acad. Sci. USA 116, 6308–6312 (2019).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Wang, P. et al. Simultaneous analysis of mutations and methylations in circulating cell-free DNA for hepatocellular carcinoma detection. Sci. Transl. Med. 14, eabp8704 (2022).

Article
CAS
PubMed

Google Scholar
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Moore, L. et al. The mutational landscape of human somatic and germline cells. Nature 597, 381–386 (2021).

Article
ADS
CAS
PubMed

Google Scholar
Aguirre, A. J. et al. High-resolution characterization of the pancreatic adenocarcinoma genome. Proc. Natl Acad. Sci. USA 101, 9067–9072 (2004).

Article
ADS
CAS
PubMed
PubMed Central

Google Scholar
Murphy, S. J. et al. Integrated genomic analysis of pancreatic ductal adenocarcinomas reveals genomic rearrangement events as significant drivers of disease. Cancer Res. 76, 749–761 (2016).

Article
CAS
PubMed

Google Scholar
Waddell, N. et al. Whole genomes redefine the mutational landscape of pancreatic cancer. Nature 518, 495–501 (2015).

Article
CAS
PubMed
PubMed Central

Google Scholar
Zheng, L., Niknafs, N., Wood, L. D., Karchin, R. & Scharpf, R. B. Estimation of cancer cell fractions and clone trees from multi-region sequencing of tumors. Bioinformatics 38, 3677–3683 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Baker, A.-M. et al. Robust RNA-based in situ mutation detection delineates colorectal cancer subclonal evolution. Nat. Commun. 8, 1998 (2017).

Article
ADS
PubMed
PubMed Central

Google Scholar
Amano, T. et al. Number of polyps detected is a useful indicator of quality of clinical colonoscopy. Endosc. Int. Open 6, E878–E884 (2018).

Article
PubMed
PubMed Central

Google Scholar
Müller, A. D. & Sonnenberg, A. Prevention of colorectal cancer by flexible endoscopy and polypectomy. A case-control study of 32,702 veterans. Ann. Intern. Med. 123, 904–910 (1995).

Article
PubMed

Google Scholar
Rohan, T. E., Henson, D. E., Franco, E. L. & Albores-Saavedra, J. in Cancer Epidemiology and Prevention (eds Schottenfeld, D. & Fraumeni, J. F.) 21–46 (Oxford Univ. Press, 2006).
Williams, A. R., Balasooriya, B. A. & Day, D. W. Polyps and cancer of the large bowel: a necropsy study in Liverpool. Gut 23, 835–842 (1982).

Article
CAS
PubMed
PubMed Central

Google Scholar
Pollock, P. M. et al. High frequency of BRAF mutations in nevi. Nat. Genet. 33, 19–20 (2003).

Article
CAS
PubMed

Google Scholar
Kumar, R., Angelini, S., Snellman, E. & Hemminki, K. BRAF mutations are common somatic events in melanocytic nevi. J. Invest. Dermatol. 122, 342–348 (2004).

Article
CAS
PubMed

Google Scholar
Ichii-Nakato, N. et al. High frequency of BRAFV600E mutation in acquired nevi and small congenital nevi, but low frequency of mutation in medium-sized congenital nevi. J. Invest. Dermatol. 126, 2111–2118 (2006).

Article
CAS
PubMed

Google Scholar
Cooke, K. R., Spears, G. F. & Skegg, D. C. Frequency of moles in a defined population. J. Epidemiol. Community Health 39, 48–52 (1985).

Article
CAS
PubMed
PubMed Central

Google Scholar
Schäfer, T., Merkl, J., Klemm, E., Wichmann, H. E. & Ring, J. The epidemiology of nevi and signs of skin aging in the adult general population: results of the KORA-survey 2000. J. Invest. Dermatol. 126, 1490–1496 (2006).

Article
PubMed

Google Scholar
Bryant, K. L., Mancias, J. D., Kimmelman, A. C. & Der, C. J. KRAS: feeding pancreatic cancer proliferation. Trends Biochem. Sci 39, 91–100 (2014).

Article
CAS
PubMed
PubMed Central

Google Scholar
Chen, Z., Chen, M., Fu, Y. & Zhang, J. The KRAS signaling pathway’s impact on the characteristics of pancreatic cancer cells. Pathol. Res. Pract. 248, 154603 (2023).

Article
CAS
PubMed

Google Scholar
Matsuda, Y. et al. The prevalence and clinicopathological characteristics of high-grade pancreatic intraepithelial neoplasia: autopsy study evaluating the entire pancreatic parenchyma. Pancreas 46, 658–664 (2017).

Article
PubMed

Google Scholar
Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).

Article
CAS
PubMed

Google Scholar
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).

Article
CAS
PubMed
PubMed Central

Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

Article
CAS
PubMed
PubMed Central

Google Scholar
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

Article
CAS
PubMed
PubMed Central

Google Scholar
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

Article
CAS
PubMed
PubMed Central

Google Scholar
Gabow, H. N. & Myers, E. W. Finding all spanning trees of directed and undirected graphs. SIAM J. Comput. 7, 280–287 (1978).

Article
MathSciNet

Google Scholar
Niknafs, N., Beleva-Guthrie, V., Naiman, D. Q. & Karchin, R. SubClonal hierarchy inference from somatic mutations: automatic reconstruction of cancer evolutionary trees from multi-region next generation sequencing. PLoS Comput. Biol. 11, e1004416 (2015).

Article
ADS
PubMed
PubMed Central

Google Scholar
Talevich, E., Shain, A. H., Botton, T. & Bastian, B. C. CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Comput. Biol. 12, e1004873 (2016).

Article
ADS
PubMed
PubMed Central

Google Scholar
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

Article
CAS
PubMed
PubMed Central

Google Scholar
Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 20, 685 (2019).

Article
PubMed
PubMed Central

Google Scholar
Islam, S. M. A. et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genomics 2, 100179 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).

Article
PubMed

Google Scholar
Fujikura, K. et al. Multiregion whole-exome sequencing of intraductal papillary mucinous neoplasms reveals frequent somatic KLF4 mutations predominantly in low-grade regions. Gut 70, 928–939 (2021).

Article
CAS
PubMed

Google Scholar
Zhao, D. et al. Personalized analysis of minimal residual cancer cells in peritoneal lavage fluid predicts peritoneal dissemination of gastric cancer. J. Hematol. Oncol. 14, 164 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar
Zhao, L. et al. Integrated analysis of circulating tumour cells and circulating tumour DNA to detect minimal residual disease in hepatocellular carcinoma. Clin. Transl. Med. 12, e793 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

Article
PubMed
PubMed Central

Google Scholar
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).

Article
PubMed
PubMed Central

Google Scholar
Ramos, A. H. et al. Oncotator: cancer variant annotation tool. Hum. Mutat. 36, E2423–E2429 (2015).

Article
PubMed
PubMed Central

Google Scholar

[ad_1]

Human lung tumour specimens

This study was approved by the institutional review board of the University of Cologne. We analysed 160 tumours and patient-matched blood samples from 65 patients with SCLC (Fig. 1a). The samples were collected from multiple collaborating hospitals and clinical facilities under institutional review board-approved protocols, and all patients provided written informed consent. For some patients the material was collected as part of an ongoing clinical trial (BIOLUMA, study no. NCT03083691), and those patients received as second- or third-line treatment anti-PD-1 either alone or in combination with anti-CTLA-4 immune checkpoint inhibitors. The course of treatment for all patients and information on all samples are detailed below and summarized in Extended Data Table 1 and Supplementary Tables 1 and 2.

All tumour samples were pathologically reviewed by at least two independent expert pathologists who inspected the histomorphology based on haematoxylin and eosin and immunohistochemical staining. All tumours were confirmed with SCLC histology; tumours from three patients were diagnosed with additional morphological components of LCNEC or adenocarcinoma (Extended Data Table 1 and Supplementary Table 1). All patient-matched, multiregional tumour and normal blood samples were confirmed as belonging to the same patient by short tandem repeat (STR) analysis conducted at the Institute of Legal Medicine at the University of Cologne, Germany, and further confirmed by genome sequencing data.

In the majority of cases we analysed at least two tumour samples per patient, which were acquired at either single or multiple timepoints throughout the clinical course of treatment (Supplementary Table 2). More than two tumour samples were acquired for 37% of patients (n = 24 of 65). For five patients we analysed tumour samples at three distinct time points (n = 5 of 65, 8%; Extended Data Table 1 and Supplementary Table 2). Samples were acquired as biopsies and lung resections, and we additionally engrafted tumour tissue from fine-needle biopsies (n = 2, one pleural and one lymph node metastasis) and CTCs (n = 29 of 160, 18%) onto immune-compromised mice (NSG mice) to establish PDX (in total n = 31 of 160, 19%; Fig. 1a); this approach allowed for enrichment of limited tumour material for in-depth genomic studies. Samples analysed as PDX are listed in Supplementary Tables 2 and 3 and are highlighted in Fig. 2d. As previously described^12,13, sampling a patient’s blood for CTCs provides a minimally invasive approach towards analysis of tumour cells under therapy, and xenotransplant models have been shown to recapitulate the genomic profiles of the patient’s tumour. Xenotransplant models were established following an approach previously described¹²; tumour cells were engrafted subcutaneously into the flanks of 7–14-week-old NSG mice (male and female, NOD.Cg-Prkdcscid Il2rgtm1Wjl/SzJ; Jackson Laboratories), and tumours were harvested at a maximum volume of 1.5 mm³. Tumour histology was confirmed by pathological review, and STR profiling with patient-matched normal and tumour samples confirmed the identity of the engrafted patient-derived material. All animals were housed in a specific-pathogen-free facility under ambient temperature and humidity while maintaining a 12/12 h light/dark cycle. Animal experiments were approved by, and conducted in accordance with, the regulations of the local animal welfare authorities (State Agency for Nature, Environment and Consumer Protection of the State of North Rhine-Westphalia, nos. AZ: 84-02.04.2012.A281, 84-02.04.2015.A172 and 84-02.04.2018.A002).

Samples were categorized by location: we referred to the primary lung tumour and grouped metastatic sites as intrapulmonary metastases, including pulmonary and lung and mediastinal lymph node metastases; tumour sites grouped as extrapulmonary metastases include intrathoracic distant metastases of the pleura and extrathoracic distant metastases affecting abdominal sites, the brain or other less common metastatic sites (breast, skin, sternum), as well as CTCs propagated as CTC-derived xenotransplant models, which represent cells that spread to the bloodstream with the potential to seed distant metastases. In patients with highly metastatic disease we furthermore assessed whether, based on radiological images, tumour sites sampled throughout therapy were pre-existing at the time of first diagnosis or before treatment, and whether these sites were exposed to any given therapy (chemotherapy, radiation or immune checkpoint blockade). Furthermore, we assessed whether any samples were taken from a newly formed metastatic site which, according to radiological imaging, was not pre-existing at the time of first diagnosis or before any other treatment exposure. For CTC-derived models, because we had no information regarding whether the tumour site may have shed cells to the bloodstream, we classified any CTC-derived sample as tumour cells that may have been exposed to any given treatment. A schematic overview of the acquired samples and affected organ sites is depicted for each patient in the Supplementary Appendix.

Clinical characteristics

The clinical characteristics of the patients in our cohort are in line with those typically found in SCLC (Fig. 1b, Extended Data Table 1 and Supplementary Table 1). Median age at the time of first diagnosis was 64 years, and patients were predominantly male (n = 43 of 65, 66%) with a history of heavy smoking and a median number of 40 pack years (smoking history was known for 89% of patients, n = 58 of 65; the number of pack years was determined for 85% of patients, n = 55 of 65). For clinical correlations the following categories were defined: age groups of 65 years or more and under 65. Smoking status was classified as ‘current smoker’, ‘former smoker’ or ‘never smoker’.

The majority of the patients presented with a highly metastatic tumour classified as stages III and IV (n = 57 of 65, 88%; additional information on tumour, node and metastasis staging is provided in Supplementary Table 1). Seven patients were diagnosed with limited-stage disease or with tumour stage I, II or IIIA, and were therefore amenable to surgical lung resection.

Although one patient declined further therapy, all other patients in our cohort received systemic treatment with platinum-based chemotherapy. The majority of patients were treated with a combination of cisplatin/carboplatin and etoposide (n = 61 of 65; 94%); with regard to recent changes in the treatment of SCLC², additional PD-L1 inhibition was administered to five of these patients. Due to the initial diagnosis with histological components of non-SCLC (adenocarcinoma or LCNEC), two patients were treated with cisplatin/carboplatin combined with vinorelbine (patients S02814 and S02917). Furthermore, one patient received only monotherapy with carboplatin. Throughout the course of treatment 72% of patients (n = 47 of 65) received additional radiation, mainly of the chest/lung/mediastinum (n = 35 or 47) or brain (n = 38 of 47); four patients underwent stereotactic surgery of brain metastases.

The clinical response to treatment was assessed by radiological imaging and classified as either complete response (CR), partial response (PR), stable disease (SD), progressive disease (PD) or mixed response (PR/PD). The clinical response to systemic first-line platinum-based chemotherapy was analysed for n = 55 patients; these patients receiving treatment with only systemic chemotherapy and were therefore considered for subsequent correlations of genomic and molecular phenotypes with clinical response. Genomic and molecular correlations with clinical response to chemotherapy were not considered for n = 10 patients in our cohort, because these patients were either lost to follow-up (n = 2), declined further treatment (n = 1) or received a lung resection resulting in differences in the dynamics of disease progression (n = 7).

Of the 55 patients who received only first-line systemic platinum-based chemotherapy, 60% (n = 33 of 55) responded to treatment with PR (n = 32) or CR (n = 1), 9% had stable disease (n = 5 of 55), 11% showed mixed response (n = 6 of 55) and 20% (n = 11 of 55) experienced a progressive disease, of which three succumbed to the disease during first-line treatment. Following treatment, two patients experienced fatal sepsis (patient S02608 while receiving treatment and experiencing disease progression; and patient S02658 following completion of chemotherapy; Supplementary Table 1); both patients were consequently censored when performing correlations with relapse-free survival, and the therapy response of patient S02608 was not evaluated. Median progression-free survival was 6.3 months. In addition we determined CTFI as an independent measure of sensitivity and duration of response to first-line platinum-based chemotherapy; median CTFI was 88 days. Fifty-three per cent of patients (n = 28 of 53) either did not respond, relapsed or succumbed to the tumour disease within 90 days following completion of first-line chemotherapy (following the guidelines of NCCN)¹⁴, and these patients were thus clinically classified as either chemorefractory or -resistant). Of the remaining patients who, based on NCCN guidelines, were considered as ‘platinum-sensitive’, 30% (n = 16 of 53) relapsed within 6 months following completion of chemotherapy and 17% were relapse-free for more than 6 months (n = 9 of 53). At relapse, 83% of patients (n = 44 of 53) received second-line systemic therapies that included treatment with anti-PD-1 and/or anti-CTLA-4 immune checkpoint inhibitors (n = 27) or other chemotherapeutics, including topotecan (n = 8), rechallenge with carboplatin and etoposide (n = 2) or combinations of adriamycin, cyclophosphamide and vincristine (n = 7) (Fig. 1 and Supplementary Table 1). Following tumour progression, ten patients were amenable to additional lines of therapy including immune checkpoint inhibitors (n = 6) or chemotherapeutics (n = 4).

The analysis of multiregional and longitudinal tumour sites from 65 patients with SCLC focused on distinct clinical scenarios. For interpatient comparisons we focused on studies of tumour pairs (‘Analysis of clonal architecture from multiregional and longitudinal tumour samples’; Fig. 1c). We focused on distinct clinical scenarios: (1) analysis of tumour samples from spatially distinct sites obtained from treatment-naive patients at the time of first diagnosis (n = 16); (2) analysis of temporally distinct tumour sites referring to samples acquired before treatment and during therapy, including those from patients undergoing neo-adjuvant treatment (n = 5); and (3) samples acquired before treatment and at relapse following completion of first-line platinum-based chemotherapy (that is, either following an initial response or disease progression despite treatment, n = 42). The analysis further focused on (4) spatially, but not temporally, separate tumours analysed solely at the time of relapse (n = 14), and (5) tumour sites acquired at the time of relapse from platinum-based therapy and following subsequent lines of treatment with immune checkpoint inhibitors (pre- and post-treatment with ICI, n = 7). We thus performed in total n = 84 paired analyses of tumour sites in 65 patients with SCLC (Supplementary Table 4).

In addition we performed clinical correlations in an independent cohort of patients with SCLC, who all received first-line systemic treatment with platinum-based chemotherapy; we performed whole-exome sequencing of the tumour samples and identified key genome alterations (n = 64 patients; Supplementary Table 12). This cohort was analysed to validate findings described in Fig. 5b; at least 56 samples are required to validate the findings at a significance level of 5% and a power of 80%; thus, we validated our findings at a power of greater than 80%.

DNA and RNA extraction

Nucleic acids were extracted from fresh-frozen blood or tissue or from formalin-fixed, paraffin-embedded (FFPE) tissue specimens (Supplementary Table 3). Tumour tissues were analysed by haematoxylin and eosin staining and nucleic acids were extracted from regions with a tumour content of at least 70%. All tumour samples derived from murine xenotransplant models showed a tumour content of at least 95% with no discernible innervation of murine cells, which was similarly observed in previous studies^12,13. Fresh-frozen samples were processed by preparation of tissue sections, each of 20 μm thickness, on a cryostat (Leica) while maintaining a temperature of −20 °C. In the case of FFPE samples, sections of 20 μm thickness were prepared on slides on a microtome. DNA was extracted from both fresh-frozen tissues and EDTA blood with the Gentra Puregene DNA extraction kit (Qiagen) according to the protocol of the manufacturer.

To allow for high-quality sequencing data of FFPE material we applied ultrasonic acoustic energy, using the adaptive focused acoustics technology from Covaris and following the protocol of the manufacturer. DNA isolation was then performed with a bead-based approach (AMPure XP Beads, Beckman) and any fractions containing paraffin material were excluded from subsequent DNA isolation steps.

For samples with limited tumour material we further adjusted protocols, which included repeated rounds of protein and nucleic acid precipitation, to increase the DNA yield for subsequent sequencing studies. All DNA isolates were hydrated in TE buffer and molecular weight was assessed using the Agilent TapeStation system (Genomic DNA ScreenTape no. 5067-5365, Agilent Technologies). DNA isolates from fresh-frozen samples were confirmed as being of high molecular weight (above 10 kb), and samples with evident signs of degradation were excluded from further sequencing studies.

For RNA extraction, tissue sections were first lysed and homogenized with the Tissue Lyzer (Qiagen). Subsequent RNA extraction was performed with the Qiagen RNAeasy Mini Kit according to the instructions of the manufacturer. Alternatively we used the RNAeasy Micro Kit to extract RNA from small tissue biopsies. RNA quality was assessed with RNA Screen Tape (no. 5067-5576, Agilent Technologies) at the TapeStation. Samples with RNA integrity number above 7 were further analysed by RNA sequencing (RNA-seq).

Next-generation sequencing

All sequencing reactions were performed on either the Illumina HiSeq or NovaSeq sequencing platform. Details on genome sequencing data and quality metrics are provided in Supplementary Table 3. Sequencing data are deposited in the European Genome-Phenome Archive (accession no. EGAS50000000169).

Whole-exome sequencing

We performed whole-exome sequencing for all patient samples with the SureSelect Human All Exon V6 Kit (Agilent) following the protocol of the manufacturer. Exon-enriched libraries were subjected to paired-end sequencing on either the Illumina NovaSeq or Illumina HiSeq platform. For the former, libraries were prepared to reach a mean insert size of 200 base pairs (bp) for sequencing with a read length of 2× 100 bp. For the latter, DNA was prepared with a mean insert size of 160 bp for 2× 75 bp paired-end sequencing. Both tumour and normal DNA material were sequenced aiming for a coverage of at least 150× which, following filtering of PCR-duplicated reads and alignment to the annotated human genome (hg19), resulted in an average coverage of 127×. Tumour samples showed a median purity of 88% (interquartile range 78–96%), thus minimizing problems in the assessment of tumour-specific mutations. This allowed for sufficient sequencing depth for reliable analysis for allelic fractions and clonality, as described below. Median genome ploidy was determined at 2.5 (interquartile range 1.9–3.2; Supplementary Table 3).

WGS

Whole-genome sequencing (WGS) was performed for samples with sufficient DNA material and quality, additionally providing information on genomic rearrangements not identified by WES. Short-insert DNA libraries from fresh-frozen samples were prepared with the TruSeq DNA Nano PCRfree sample preparation kit (Illumina), and FFPE samples were prepared with the Aceel-NGS 2S Plus DNA library Kit. Paired-end sequencing at a minimum read length of 2× 150 bp was performed, and human DNA libraries were sequenced to an average coverage of 31× for both tumour and matched normal tissue (Supplementary Table 3).

RNA-seq

Whole-transcriptome sequencing was performed to determine expression profiles for SCLC tumours in this cohort. RNA-seq was performed with RNA extracted from fresh-frozen human tumour tissue samples. Complementary DNA libraries were prepared from poly-A-selected RNA, applying the Illumina TruSeq protocol for messenger RNA. Libraries were then sequenced with a 2× 100 bp paired-end protocol, generating 50 Mio reads and thus accounting for a minimum mean coverage of 30× of the annotated transcriptome. Samples analysed by transcriptome sequencing are shown in Supplementary Table 2.

Dideoxynucleotide sequencing for validation of somatic alterations

If available, transcriptome or additional genome sequencing data were used to validate somatic mutations determined by genome sequencing. In cases without additional sequencing data, dideoxynucleotide chain termination sequencing (Sanger sequencing) was performed to validate key mutations, genomic rearrangements and chimeric fusion transcripts. Specifically, shared clonal mutations of key mutated genome alterations were confirmed by Sanger sequencing as being present in all tumour samples from a patient. For genomic rearrangements determined by WGS in a subset of samples per patient (Supplementary Tables 2 and 3), PCR reactions were performed and the genomic breakpoint was probed and analysed in that subset of samples. Complex genome alterations affecting TP53, RB1 and TP73 were thus confirmed in all samples of the respective patient (annotation provided in Extended Data Fig. 4). Clonal assessment of genomic rearrangement affecting key genes was determined with SVclone⁴⁰ (see below). For subclonal and private mutations of key gene alterations, Sanger sequencing was performed to confirm both the mutation call and absence of these alterations in matching tumour samples. Primer pairs were designed to amplify the target region encompassing the somatic alteration. PCR reactions were performed with either genomic DNA, whole-genome-amplified DNA or cDNA. Amplified products were subjected to Sanger sequencing and the respective electropherogram was analysed with Geneious v.8 (www.geneious.com).

Data processing of transcriptome sequencing data

As previously described^5,41, transcriptome sequencing data were processed with TRUP (tumour-specimen suited RNA-seq unified pipeline). Paired-end reads were mapped to the human reference genome (GRCh37/hg19). Samples obtained from patient-derived xenotransplant models were mapped to a combined human and murine reference genome (GRCh37/hg19 and GRCm38/mm10). Expression levels were determined for uniquely mapped paired-end reads using Cufflinks referring to the human reference genome, and expression levels were quantified as fragments per kilobase exon per million mapped reads (Supplementary Table 10).

Data processing of genome sequencing data

Raw sequencing reads were processed as previously described^5,6,15. Reads were aligned to the human reference genome (GRCh37/hg19). Our cohort additionally included patient tumours expanded in immune-compromised mice (n = 32 samples; Fig. 1a and Supplementary Table 2). In these cases, sequencing reads of all samples from a given patient (including the normal reference sample and tumour samples obtained directly from the patient and derived from murine xenotransplant models) were aligned to a combined human and murine reference genome (GRCh37/hg19 and GRCm38/mm10), to exclude sequencing reads from murine cells and to allow for uniform processing of all samples from a given patient. Concordant read-pairs were identified as potential PCR duplicates and were subsequently masked in the alignment file and annotated as the number of masked reads. The quality of the sequencing data is summarized in Supplementary Table 3.

Human sequencing reads (mapped to the human reference genome) were analysed for tumour purity, tumour ploidy, somatic mutations and copy number alterations¹⁵. In addition, WGS data were analysed for genomic rearrangements with the previously described analysis pipeline^5,6,15,42. Mutation calling was performed as previously described^5,6,43. In brief, variant counts were assessed for tumour and matching normal samples, corrected for sequencing noise and compared with a database of 300 whole-exome and genome sequenced normal samples to filter and determine somatic mutation calls. Variants at low allelic fractions are often prone to result from sequencing artefacts, which occur as a consequence of sequencing noise arising from high-coverage WES due to either fragmented DNA as part of FFPE material or low-level contamination with murine reads in tumours derived from murine xenograft models. We therefore implemented strict filtering criteria for mutations occurring at allelic fractions of less than 0.2. Mutations were then filtered out if (1) the forward–reverse score was below 0.2 (forward–reverse score is 1.0 if 50% of variant reads are found on the forward or reverse read, and 0 if all variant reads are on one orientation); and (2) the allelic fraction of the variant v in consideration of minimal coverage C of the normal or matching tumour sample at position i (C_i^{min(tumour/normal)}) did not exceed the read count (rc) threshold with a default value of 10. This was calculated as C_i^{min(tumour/normal)} × v_i < rc. We thus introduced a decision boundary that filters out mutations at relatively low allelic fractions and low sequencing coverage; mutations with low allelic fractions but high coverage were retained for further analyses. In addition we adjusted the stringency of this cut-off for individual samples. Although this stringent cut-off limits the identification of subclonal mutations, we have thus controlled for potential sequencing noise and false-positive mutation calls. As described below, multiregional studies may suggest mutations at very low allele fractions in one tumour that might be more abundant at another tumour site. In this instance, truly subclonal mutations at low allelic fractions that were filtered out in one sample at this step of the analysis were reintroduced as somatic mutation calls if the same mutation passed all stringent filtering criteria in another matched tumour sample.

Analysis of clonal architecture from multiregional and longitudinal tumour samples

We have developed a computational approach to identify individual clones from tumour sequencing data by applying a model that assigns an expected allelic fraction to each mutation under the assumption of clonality (that is, all tumour cells carry this mutation). The expected allelic fraction is corrected for tumour purity, average tumour ploidy and copy number state at the respective genomic coordinates of the said mutation. Relating the observed to the expected allelic fraction results in an estimated CCF that is a specific metric pertinent to each mutation^15,44,45. Subsequent clustering of CCFs enabled identification of cell clones represented by subsets of individual mutations. The CCFs and associated clones present in a given tumour thus define the overall clonal composition at the time point of sampling. Through a one-dimensional approach to CCF clustering, we determined for each single tumour its clonal composition (one-dimensional mutation clustering¹⁵), a method benchmarked in pan-cancer studies for tumour heterogeneity^44,45.

To study tumour evolution from multiregional or longitudinal tumour samples from a given patient, we further developed a two-dimensional approach to analysing pairs of samples from the same patient (two-dimensional clustering) and thus to the reconstruction of clonal dynamics^15,43 (manuscript in preparation). Information on tumour phylogenies, subclonal mutations, subclones and clonal composition of sites is summarized in Fig. 2 and detailed information is provided in Supplementary Table 4. In addition, tumour phylogenies determined for each patient are provided in the Supplementary Appendix.

The sequencing data of tumour samples in our cohort showed an average purity of 85% (Supplementary Table 3). Thus WES at an average coverage of 127× provided the required sequencing depth to determine subclones in our data. The analysis of tumour subclones focused on mutation calls as determined by exome sequencing in each sample to track individual tumour clones.

The computational method for tumour phylogeny reconstruction starts by executing an extensive set of comparisons and quality controls of copy number states, and a set of mutations and their respective CCFs for each sample. Rather than working with mutation calls and copy number states assessed individually for each tumour sample, we first performed comparisons and adjustments across all samples of a given patient. This included generating a unified copy number segmentation for all samples, which is critical for assigning within each chromosomal segment allele-specific mutation calls, and subsequently to compute CCFs for each mutation. We furthermore created for each patient a unified list of all somatic single-nucleotide mutations (SNMs) determined from each sample, and in all samples we reprobed the presence of somatic mutations of the unified list with relaxed filter criteria for calling somatic mutations at low allelic fractions from sequencing data. This approach allowed us to confirm whether high-confidence mutation calls from one sample were either private events or also present in other patient-matched samples but occurring at lower allelic fractions. Following the refined assessment of copy number states and somatic mutation calls, we determined for each mutation both the observed and expected allele frequency under the assumption of clonality (that is, a cancer cell fraction of 1), so that the CCF of the mutation can be calculated as the ratio between observed and expected allele frequency¹⁵. We applied additional filter criteria to mark somatic mutations calls occurring near telomers (that is, located in the tails determined by 1.5% of chromosome length) or centromeric regions on the chromosome, where copy number estimations are frequently error prone and therefore lead to a potentially incorrect calculation of CCF.

Somatic insertions and deletions (indel calls) can lead to additional false-positive calls of SNMs as a consequence of improper mapping of reads with inserted or deleted bases. To reduce the number of false-positive SNM calls resulting from indels, we filtered out all SNMs in close proximity (less than 10 bp away) to any mutation call for insertions and deletions.

We applied filtering criteria for mutation calls present on chromosomal areas and which, in multiregional analyses, were found to undergo loss of heterozygosity (LOH) in at least one, but not all, of the samples of a given patient^45,46,47. Samples with LOH may not harbour certain mutational calls due to the LOH event, whereas patient-matched samples without LOH may show those mutations. Consequently observed private, or almost private (CCF < 0.2), mutations in one sample lacking LOH events (whereas other patient-matched samples show the LOH event) may indicate a shared clone that undergoes copy number losses, and argue against the subclonal private acquisition of these mutations in this chromosomal area. A clear phylogenetic reconstruction in these cases is not straightforward: due to the inherent uncertainty if the mutations were not present in the other sample (that is, truly private) or lost via the LOH event, these mutations were excluded from phylogenetic tumour clone reconstructions. Following the same criteria, mutations in areas with subclonal copy number events in which one of the copy number clones was hit by an LOH event were also filtered out to avoid further uncertainty in the reconstruction of tumour phylogenies. As previously described¹⁵, our method also considered subclonal copy number changes in single-tumour samples. In consideration of copy number status and the observed allele frequency, the number of mutated copies was estimated and the CCF of the mutation determined. Somatic mutations that were found as clonal and that were the subject of subclonal copy number changes within single samples were filtered out.

In addition, we used the mapping qualities of the aligner (bwa mem, v.0.7.13-r1126) to filter out mutations in regions where more than 10% of uniquely mapped reads had a mapping quality below 10 (that is, less than 90% probability of having identified the correct mapping position).

With regard to potentially shared mutations, we also performed a power analysis to compare the CCFs of a given mutation between two samples with regard to their sequencing depth: we calculated a score per sample to consider the contribution of a single mutated read to the CCF. Per sample, the distribution of these scores could be estimated by a log-normal distribution whose 2.5% tails (z-score = 1.96) were cut off to filter out subsets of over- and underpowered mutations.

Last, to check further whether mutations observed as being private to one of the samples were truly private or simply not detected in the other sample (for example, due to insufficient coverage), we applied this statistical test: under the null hypothesis, the mutation is shared with an allelic fraction at least as high as that observed in one of the samples, and the probability (P value) of not detecting it within the given number of sequencing reads can be estimated using a binomial model. If the null hypothesis is rejected, the mutation is considered as being truly private, or otherwise is being filtered out. To determine those mutations that are rejected we apply the false discovery rate control at 5% by Benjamini–Hochberg correction.

Subsequent two-dimensional cluster analyses were performed with the set of mutations that passed all filters. This set was binned into a two-dimensional histogram of CCFs representing the observed data, which were modelled as a surface using two-dimensional smoothing splines with a common smoothing parameter. Based on an error estimate of the samples’ CCFs, this method deconvolutes part of the sequencing noise from the data. Subsequently the peaks of the surface were identified and interpreted as cluster centres (marked as red triangles in the cluster images for each patient; Supplementary Appendix), and all mutations were assigned to their nearest cluster centre by Euclidean distance. During the assignment procedure we require that shared mutations are assigned only to shared clusters whereas private mutations (that is, those exclusively called in one of the two samples) are assigned only to private clusters. Moreover, we set a minimum threshold of four mutations per cluster and disregarded identified surface peaks otherwise. Considering the cluster centre’s CCF as being representative of the corresponding cell clone, we applied the infinite sites hypothesis assuming that mutations appear once in the evolutionary history, and then determined the CCF sum rule^46,47 to infer the most probable phylogenetic tree and, in particular, clonal composition per sample at the time point when sampling was derived. In the rare event that tumour phylogenetic rules allow for multiple solutions of tumour phylogeny, we assume maximum parsimony and prefer linear evolution over branched evolution within one sample.

In the case of CCF clusters that conflicted with phylogenetic rules we reanalysed somatic mutation calls initially computed with expected allele frequencies under the assumption of clonality. However, chromosomal segments with polyploidy allow for multiple values of absolute numbers of mutated copies (the so-called mutation multiplicity of each mutation call^15,48). We therefore accounted for all potential solutions for mutation multiplicity of a given somatic mutation call and computed CCFs that rejected the assumption of clonality (null hypothesis) within the sample and which, in subsequent paired two-dimensional cluster analyses, resolved conflicts in phylogenetic tumour clone reconstruction.

Analysis of tumour phylogenies

Our approach thus enabled us to assign tumour phylogenies for all 65 patients, and to track individual clones from multiregional and longitudinal data. We assigned mutations to the most recent common ancestor (C0) if they were shared and found to be clonal across all tumour sites sampled (that is, having CCFs of approximately 1.0). Alterations with lower CCFs, or those found to be private to single-tumour sites, were determined as subclones. Clusters of at least n = 5 subclonal mutations were defined and labelled as subclone C1, C2 or C3, and derivates of these subclones were assigned accordingly (Fig. 2a). The resulting tumour phylogenies for all 65 patients are provided in the Supplementary Appendix, detailing all spatially and temporally distinct sites analysed and depicting the clinical treatment history for each patient. Additional information is provided in Supplementary Table 4.

To study patterns of tumour evolution we assigned tumour phylogenetic trees to the following classes (Fig. 2a): class A if no subclones were identified; class B if one subclone was identified, allowing only for linear evolvement of this subclone; class B if at least two subclones were found with linear phylogenies; class D, phylogenies with one branching event from C1 subclones; class E, phylogenies with one branching event from the most recent common ancestor clone C0; and class F, tumour phylogenies showing two or more branching events.

In this regard, increasing the number of tumour samples per patient will enhance the ability to determine subclonal mutations and subclones¹⁶. Because we analysed various numbers of samples for each patient (in 37% of cases, more than two samples per patient) we additionally downscaled our analyses to only two samples per patient to permit interpatient comparisons (Fig. 2d); we thus performed a total of n = 84 paired analyses (Fig. 1c and Supplementary Table 4). In the paired analysis for each patient we chose as representative the analysis showing the highest level of subclonal complexity, defined by the number of subclones and subclonal mutations identified. Downscaling the number of tumour samples per patient did not show any significant change in the absolute number of subclonal mutations but led to reduced numbers of assigned subclones with phylogenetic complexity of classes A–E only (Extended Data Fig. 1a,b). Downscaling the analysis to two samples per patient for interpatient comparison enabled the study of distinct scenarios throughout the clinical course of the patients (Fig. 1b). To study the full complexity of a patient’s tumour, all available samples were taken into consideration (Supplementary Appendix).

Analysis of cancer cell fractions for structural rearrangements

The analysis of the clonal architecture from multiregional and longitudinal tumour samples focused on the study of CCFs assigned to SNMs. In addition, to assess the clonality of structural rearrangement we applied SVclone (with default settings) to the whole-genome sequencing data of cases harbouring genomic rearrangements in key genes including RB1, TP53, TP73 and CREBBP/EP300. We first performed local remapping to the human genome for genomic rearrangements identified by our in-house pipeline⁴² and assigned CCFs for both chromosomal pairs of a given rearrangement with SVclone⁴⁰ (Supplementary Table 7). The data are presented in Extended Data Fig. 5c; the gene alterations identified were found to be part of the clonal proportion of the respective sample.

Analysis of mutational signatures

We analysed our data for the activity of mutational signatures available in COSMIC, referring to SBS (COSMIC_v3.3_SBS_GRChr37_exome¹⁷).

Mutational signatures were analysed for the following categories: (1) the clonal proportion of all treatment-naive tumours, (2) the subclonal proportion of all treatment-naive tumours and (3) the subclonal proportion of all post-treatment tumours acquired following first-line platinum-based chemotherapy (Fig. 3a). The analysis of treatment-naive tumours refers to all naive samples available in this cohort (n = 58); signatures assigned to post-treatment tumours included all patients who received first-line platinum-based chemotherapy (n = 45), and we further distinguished whether tumour sites were exposed to chemotherapy alone (n = 20) or were potentially exposed to additional ionizing radiation (n = 25; Supplementary Table 2). Due to the high tumour mutational burden, signature assignments to clonal mutations were performed in cases with a median of over 300 mutations. To avoid overfitting and noise, assignments for subclonal mutations were performed only for cases with at least n > 20 mutations.

To fit mutational signatures to our samples we applied SigProfilerAssignment (that is, Analyze.cosmic_fit function^17,49) to identify a representative subset of signatures. We initially fitted SBS mutational signatures to the mutation catalogue of each sample assigned to the categories. Selecting mutational signatures found in at least n = 5 cases, we thus identified the most prevalent subset of signatures in the clonal and subclonal proportions of treatment-naive and post-treatment tumours (SBS1, SBS2, SBS3, SBS4, SBS5, SBS13, SBS15, SBS16, SBS24, SBS29, SBS39, SBS40 and SBS92), to which all mutations were then fitted. Post-treatment samples additionally showed platinum-based signatures (SBS31 and SBS35), which were therefore included for the assignment of signatures for the subclonal proportion of post-treatment tumours. In addition we applied the in-house-developed computational tool CaMuS⁵⁰ to confirm signature assignments. With CaMuS we first linearly fitted the COSMIC signatures to all mutations for each sample (including clonal and subclonal mutations) using a backward selection procedure. We next selected only those signatures that markedly reduced the cost of the model calculated over the whole dataset. Both tools generated similar results. The results of SigProfilerAssignment are provided in Fig. 3 and Extended Data Figs. 2 and 3. Comparisons with CaMuS are provided in Extended Data Fig. 2h and the data are summarized in Supplementary Table 5.

To track the dynamic activity of mutational signatures in patient-matched tumour samples over the course of the disease, we specifically assigned the subset of signatures identified with SigProfilerAssignment to patient-matched clonal and subclonal mutations pre- and post-treatment, including SBS31 and SBS35 (both related to platinum chemotherapy treatment) for all assignments of signatures. We thus confirmed the presence of platinum-based signatures only in post-treatment subclonal mutations of tumour samples but not in the patient-matched treatment-naive clonal or subclonal proportion of the tumour. In addition we analysed tumour samples from a cohort of patients undergoing subsequent second- or third-line treatment with immune checkpoint inhibition (n = 7). Tumour samples acquired before treatment with ICI were analysed in the categories above (corresponding to samples acquired at the time of relapse following first-line platinum-based chemotherapy). Samples pre- and post-treatment with ICI were analysed with the subset described above (Supplementary Table 5).

We furthermore tested our whole-genome and whole-exome sequencing data for mutational processes related to ionizing radiation. Following previous studies in this field²⁵, we determined the ratio of insertions to deletions (indels) versus substitution burden and the ratio of deletions versus insertions based on exome- and genome-wide data (Extended Data Fig. 2f).

Analysis of significant mutations, copy number alterations and genome ploidy

To assess the relevance of key gene alterations in our cohort we referred to our previous study of significant gene alterations determined for 110 human SCLC samples⁵ (Supplementary Table 8). In addition we expanded this analysis to our present cohort of 65 patients. We determined the mutational landscape for each patient by creating the union of all mutations identified in multiple samples—this refers to the sum of mutually inclusive and private events (Supplementary Table 6). We combined the data from our current cohort of 65 patients with mutational data for 110 human SCLC samples⁵ (n = 175 patients) and determined significant gene alterations at a significance threshold of Q < 0.05 following our previously described method⁵. In brief, our approach estimates the background mutation rate for each gene and corrects for both synonymous mutations and the expression in human SCLC, referring to the transcriptional data of human SCLC⁵. The analysis included genes with fragments per kilobase exon per million mapped reads values of over 1 in at least 50 samples. Furthermore we analysed the data for significant mutational hotspots and significant enrichment of gene-damaging mutations. Mutations that significantly cluster within a gene were determined at Q < 0.05 (mutational hotspots). The analysis of gene-damaging mutations refers to (1) nonsense mutations resulting in early stop codons, (2) splice site mutations resulting in aberrant splicing, intron retention or in-frame losses of larger regions within the protein product and (3) frameshift mutations leading to early stop codons and thus resulting in greater changes in the gene and encoded transcript, presumably leading to either no protein product, to proteins with larger deletions within the protein structure or to truncated proteins. The enrichment of gene-damaging alterations was determined at Q < 0.05. We focused our studies on genes recurrently mutated in at least 8% of cases (affecting at least n = 14 patients in the combined analysis of this cohort and the previous cohort⁵); this allowed us to perform interpatient comparisons and to study a sufficient number of cases in our present cohort of n = 65 patients. To complement our analytical approach we also used other computational tools to study significant gene alterations, including MutSig2CV⁵¹, dNdSCV⁵² and OncodriveFML⁵³. In brief, MutSig2CV and dNdSCV were run using their default configuration; for OncodriveFML we used the ‘complement’ method for the signature and ‘amean’ as statistics. Taking into account different levels of stringency, all computational models showed a high degree of overlap. All relevant and significant gene alterations are listed in Supplementary Table 8. In addition we studied gene alterations previously reported for targeted sequencing data from larger cohorts of patients with SCLC⁴; we scored the frequency and significance level of reported alterations for the samples in our cohort. Comparison of these data is provided in Extended Data Fig. 4e and Supplementary Table 8.

With regard to frequent alterations affecting TP53, RB1 and TP73 (Supplementary Table 8), which also included larger genomic rearrangements of these genes (Supplementary Table 7), we further analysed the gene-damaging effect of alterations. The impact of any genome alterations was evaluated in combination with the transcriptome sequencing data of these tumours, thus further informing on the presumed damage to the gene transcript and resulting protein product (Supplementary Table 11).

Significant copy number alterations were determined from uncorrected unsegmented copy number signals obtained from whole-exome sequencing data by applying the method CGARS⁵⁴. We determined the analysis separately for pre- and post-treatment tumour samples, referring to one sample per patient case in both scenarios. Significant amplifications were determined with the upper quantiles 0.30, 0.10 and 0.05; deletions were computed in reference to lower quartiles 0.30, 0.15 and 0.05. Significance threshold was set at Q = 0.05. Significant copy number alterations are listed in Supplementary Table 9.

Overall genome ploidy was assigned for all patient tumours (Supplementary Table 3), with a threshold of 2.8 or above set to define those with higher genome ploidy³³. Higher ploidy in cancer genomes can result either from multiple successive and independent copy number gains or through events of whole-genome doubling. To further determine events of genome duplication (or whole-genome doubling), tumours found to undergo ploidy changes were further analysed for the fraction of the genome with LOH to assign an event of genome doubling⁴⁵ (Extended Data Fig. 5g,h).

Clinical correlations with chemotherapy relapse-free survival

We studied correlations of genomic subsets with relapse-free survival in patients receiving first-line systemic treatment with platinum-based chemotherapy. The analysis focused on the study of n = 55 patients for whom the clinical response to first-line platinum-based chemotherapy was determined. Ten patients from our cohort were not considered for this analysis because of either loss to follow-up (n = 2), declined further treatment and no longer in clinical care (n = 1) or received a lung resection resulting in longer disease-free survival and differences in the dynamics of disease progression (n = 7). We determined relapse-free survival by referring to CTFI, defined as the time between the end of chemotherapy and tumour recurrence, including for patients with disease progression resulting in death. Two patients in our cohort were reported with sepsis-related mortality and were censored in the analysis for recurrence-free survival, leaving a final total of n = 53 patients. All survival analyses were performed with SPSS. Survival distributions were plotted as Kaplan–Meier curves, with P values determined by log-rank test (Extended Data Fig. 8a). Hazard ratios with a 95% confidence interval and P values were further derived from Cox proportional hazard models. We performed correlations with key genomic parameters referring to significant gene mutations identified in Extended Data Fig. 4 and, in addition, we stratified patients according to genome ploidy (information available for n = 53 patients). We included in our analysis as clinical characteristics information on sex, age and tumour stage. We performed additional analyses on both smoking status and pack years of patients (available for n = 50 and n = 47 patients, respectively). Furthermore we included in our analyses the gene expression of key lineage transcription factors ASCL1, NEUROD1 and POU2F3 (available for n = 45 patients).

We checked that the assumption of proportional hazards was provided by log-minus-log survival plots and by the addition of time-dependent covariates to models. We performed multicollinearity assessment of predictors. We identified relevant gene alterations by performing regressions with backward elimination of insignificant predictors (backwards Wald, at a retention threshold of P < 0.05). The results of the Cox proportional hazard model are shown as forest plots.

Clinical correlations of genomic alterations with relapse-free survival were additionally analysed in an independent cohort of patients with SCLC (n = 64) who all received first-line systemic treatment with platinum-based chemotherapy. Note that we used WES and WGS to determine the full spectrum of alterations in key genes in our discovery cohort. By contrast, data for the independent cohort refer to WES data, which limits the detection of complex gene rearrangements that frequently affect CREBBP, EP300, TP73 and, to some extent, TP53 (ref. ⁵) (Extended Data Fig. 4a). The somatic alteration status for TP53, TP73, CREBBP, EP300 and FMN2 as determined by WES is provided in Supplementary Table 12.

Immunoblot analysis

Immunoblots were performed to probe tumour cell lysates for the expression of p53 (Extended Data Fig. 9a). Tissue samples from this cohort containing sufficient material were processed to 5 μm sections on a cryostat maintained at −20 °C. The non-SCLC cell line A549 served as control for the expression of wild-type p53 (ref. ⁵⁵); we confirmed the identity of this cell line by STR profiling and performed tests to ensure no contamination with mycoplasma. Between 40 and 50 tissue sections per sample were sonicated for 3× 10 min and incubated for an additional 30 min in RIPA buffer supplemented with protease inhibitors (cOmplete Mini Protease Inhibitor Cocktail, Roche) and nuclease (benzonase, Millipore) at 4 °C. A549 cells were incubated in RIPA for 30 min at 4 °C. Supernatants were collected following centrifugation at 4 °C for 10 min at 20,000g and protein concentrations determined by bicinchoninic acid assay (Pierce). Either 15 μg (tissue samples) or 90 μg (A549) of protein in 3× Laemmli buffer was separated on 4–12% Tris-glycine SDS–polyacrylamide gel electrophoresis gels (Thermo Fisher Scientific) and transferred to polyvinylidene difluoride membranes (Millipore). PageRuler 10–180 kDa (Thermo Scientific) served as the protein ladder for size determination. Membranes were blocked with Tris buffered saline with 5% milk powder for 1 h at room temperature and incubated overnight with a 1:1,000 dilution of anti-p53 (clone D07, mouse monoclonal antibody, abcam, no. ab80644) and anti-HSP90 (clone C45G5, rabbit monoclonal antibody, Cell Signaling, no. 4877) at 4 °C, washed in Tris buffered saline with Tween 20 and incubated for 1 h with a 1:10,000 dilution of fluorescence-labelled secondary anti-mouse (IRDye 800CW goat anti-mouse, LI-COR, no. 926-32210) and anti-rabbit (IRDye 800CW goat anti-rabbit, LI-COR, no. 926-32211) antibodies. Blots were analysed with the Odyssey CLx imaging system (LI-COR).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

[ad_2]

Source link

An atlas of epithelial cell states and plasticity in lung adenocarcinoma

[ad_1]

Multi-regional sampling of human surgically resected LUADs and NL tissues

Study participants were evaluated at the MD Anderson Cancer Center and underwent standard-of-care surgical resection of early-stage LUAD (I–IIIA). Samples from all patients were obtained from banked or residual tissues under informed consent and approved by MD Anderson institutional review board protocols. Residual surgical specimens were then used for derivation of multi-regional samples for single-cell analysis (Supplementary Table 1). Immediately following surgery, resected tissues were processed by an experienced pathologist assistant. One side of the specimen was documented and measured, followed by tumour margin identification. Based on the placement of the tumour within the specimen, incisions were made at defined collection sites in one direction along the length of the specimen and spanning the entire lobe: tumour-adjacent and tumour-distant normal parenchyma at 0.5 cm from the tumour edge and from the periphery of the overall specimen or lobe, respectively. An additional tumour-intermediate normal tissue sample was selected for patients P2–P16 and ranged between 3 and 5 cm from the edge of the tumour. Sample collection was initiated with NL tissues that are farthest from the tumour moving inward towards the tumour to minimize cross-contamination during collection.

Single-cell isolation from tissue samples

Fresh tissues from human donors and mouse lungs were collected in RPMI medium supplemented with 2% FBS and maintained on ice for immediate processing. Tissues were placed in a cell culture dish containing Hank’s balanced salt solution (HBSS) on ice, and extra-pulmonary airways and connective tissue were removed with scissors. Samples were transferred to a new dish on ice and minced into about 1 mm³ pieces followed by enzymatic digestion. For human tissues, the enzymatic solution was composed of collagenase A (10103578001, Sigma Aldrich), collagenase IV (NC9836075, Thermo Fisher Scientific), DNase I (11284932001, Sigma Aldrich), dispase II (4942078001, Sigma Aldrich), elastase (NC9301601, Thermo Fisher Scientific) and pronase (10165921001, Sigma Aldrich) as previously described⁴⁰. For mouse lung digestion, the enzymatic solution was composed of collagenase type I (CLS-1 LS004197, Worthington), elastase (ESL LS002294, Worthington) and DNase I (D LS002007, Worthington). Samples were transferred to 5 ml LoBind Eppendorf tubes and incubated in a 37 °C oven for 20 min with gentle rotation. Samples were then filtered through 70 μm strainers (Miltenyi Biotech, 130-098-462) and washed with ice-cold HBSS. Filtrates were then centrifuged and resuspended in ice-cold ACK lysis buffer (A1049201, Thermo Fisher Scientific) for red blood cell lysis. Following red blood cell lysis, samples were centrifuged and resuspended in ice-cold FBS, filtered (using 40 μm FlowMi tip filters; H13680-0040, Millipore) and an aliquot was taken to count cells and check for viability by Trypan blue (T8154, Sigma Aldrich) exclusion analysis.

Sorting and enrichment of viable lung epithelial singlets

Single cells from patient P1 were stained with Sytox Blue viability dye (S34857, Life Technologies) and processed on a FACS Aria I instrument. Cells from P2–P16 were stained with anti-EPCAM-PE (347198, BD Biosciences; 1:50 dilution in ice-cold PBS containing 2% FBS) for 30 min with gentle rotation at 4 °C. Mouse lung single cells were similarly stained but with a cocktail of antibodies (1:250 each) against CD45-PE/Cy7 (103114, BioLegend), ICAM2-A647 (A15452, Life Technologies), EPCAM-BV421 (118225, BioLegend) and ECAD-A488 (53-3249-80, eBioscience). Stained cells were then washed, filtered using 40 μm filters, stained with Sytox Blue (human) or Sytox Green (mouse) and processed on a FACS Aria I instrument (gating strategies for epithelial cell sorting are shown in Supplementary Figs. 1 and 4 for human and mouse cells, respectively). Doublets and dead cells were eliminated, and viable (Sytox-negative) epithelial singlets were collected in PBS containing 2% FBS. Cells were washed again to eliminate ambient RNA, and a sample was taken for counting by Trypan Blue exclusion before loading on 10X Genomics Chromium microfluidic chips.

Preparation of single-cell 5′ gene expression libraries

Up to 10,000 cells per sample were partitioned into nanolitre-scale Gel beads-in-emulsion (GEMs) using a Chromium Next GEM Single Cell 5′ Gel Bead kit v.1.1 (1000169, 10X Genomics) and by loading onto Chromium Next GEM Chips G (1000127, 10X Genomics). GEMs were then recovered to construct single-cell gene expression libraries using a Chromium Next GEM Single Cell 5′ Library kit (1000166, 10X Genomics) according to the manufacturer’s protocol. In brief, recovered barcoded GEMs were broken and pooled, followed by magnetic bead clean-up (Dynabeads MyOne Silane, 37002D, Thermo Fisher Scientific). 10X-barcoded full-length cDNA was then amplified by PCR and analysed using a Bioanalyzer High Sensitivity DNA kit (5067-4626, Agilent). Up to 50 ng of cDNA was carried over to construct gene expression libraries and was enzymatically fragmented and size-selected to optimize the cDNA amplicon size before 5′ gene expression library construction. Samples were then subjected to end-repair, A-tailing, adaptor ligation and sample index PCR using Single Index kit T Set A (2000240, 10X Genomics) to generate Illumina-ready barcoded gene expression libraries. Library quality and yield were measured using a Bioanalyzer High Sensitivity DNA kit (5067-4626, Agilent) and a Qubit dsDNA High Sensitivity Assay kit (Q32854, Thermo Fisher Scientific). Indexed libraries were normalized by adjusting for the ratio of the targeted cells per library as well as individual library concentration and then pooled to a final concentration of 10 nM. Library pools were then denatured and diluted as recommended for sequencing on an Illumina NovaSeq 6000 platform.

scRNA-seq data processing and quality control

Raw scRNA-seq data were pre-processed (demultiplex cellular barcodes, read alignment and generation of gene count matrix) using Cell Ranger Single Cell Software Suite (v.3.0.1) provided by 10X Genomics. For read alignment of human and mouse scRNA-seq data, human reference GRCh38 (hg38) and mouse reference GRCm38 (mm10) genomes were used, respectively. Detailed quality control metrics were generated and evaluated, and cells were carefully and rigorously filtered to obtain high-quality data for downstream analyses¹⁵. In brief, for basic quality filtering, cells with low-complexity libraries (in which detected transcripts were aligned to <200 genes such as cell debris, empty drops and low-quality cells) were filtered out and excluded from subsequent analyses. Probable dying or apoptotic cells in which >15% of transcripts derived from the mitochondrial genome were also excluded. For scRNA-seq analysis of Gprc5a^−/−;Sftpc^creER/+;Rosa^Sun1GFP/+ mice, cells with ≤500 detected genes or with a mitochondrial gene fraction that is ≥15% were filtered out using Seurat⁴¹.

Doublet detection and removal, and batch effect evaluation and correction

Probable doublets or multiplets were identified and carefully removed through a multi-step approach as described in previous studies^15,42. In brief, doublets or multiplets were identified based on library complexity, whereby cells with high-complexity libraries in which detected transcripts are aligned to >6,500 genes were removed and, based on cluster distribution and marker gene expression, whereby doublets or multiplets forming distinct clusters with hybrid expression features and/or exhibiting an aberrantly high gene count were also removed. Expression levels and proportions of canonical lineage-related marker genes in each identified cluster were carefully reviewed. Clusters co-expressing discrepant lineage markers were identified and removed. Doublets or multiplets were also identified using the doublet detection algorithm DoubletFinder⁴³. The proportion of expected doublets was estimated based on cell counts obtained before scRNA-seq library construction. Data normalization was then performed using Seurat⁴¹ on the filtered gene–cell matrix. Statistical assessment of possible batch effects was performed on non-malignant epithelial cells using the R package ROGUE³⁶, an entropy-based statistic, as described in previous studies^15,42 and Harmony⁴⁴ was run with default parameters to remove batch effects present in the PCA space.

Unsupervised clustering and subclustering analysis

The function FindVariableFeatures of Seurat⁴¹ was applied to identify highly variable genes for unsupervised cell clustering. PCA was performed on the top 2,000 highly variable genes. The elbow plot was generated with the ElbowPlot function of Seurat and, based on which, the number of significant principal components (PCs) was determined. The FindNeighbors function of Seurat was used to construct the shared nearest neighbour (SNN) graph based on unsupervised clustering performed using the Seurat function FindClusters. Multiple rounds of clustering and subclustering analyses were performed to identify major epithelial cell types and distinct cell transcriptional states. Dimensionality reduction and 2D visualization of cell clusters was performed using UMAP⁴⁵ and the Seurat function RunUMAP. The number of PCs used to calculate the embedding was the same as that used for clustering. For analysis of human epithelial cells, ROGUE was used to quantify cellular transcriptional heterogeneity of each cluster. Subclustering analysis was then performed for low-purity clusters identified by ROGUE. Hierarchical clustering of major epithelial subsets was performed on the Harmony batch-corrected PCA dimension reduction space. For malignant cells, except for global UMAP visualization, downstream analyses, including identification of large-scale CNVs, inference of cancer cell differentiation states, quantification of meta-program expression, trajectory analysis and mutation analysis, were performed without Harmony batch correction. The hierarchical tree of human epithelial cell lineages was computed based on Euclidean distance using the Ward linkage method, and the dendrogram was generated using the R function plot.hc. For scRNA-seq analysis of Gprc5a^−/− mice, the top-ranked ten PCs were selected using the elbowplot function. SNN graph construction was performed with resolution parameter = 0.4, and UMAP visualization was performed with default parameters. For scRNA-seq analysis of Gprc5a^−/−;Sftpc^creER/+;Rosa^Sun1GFP/+ mice, the top-ranked 20 Harmony-corrected PCs were used for SNN graph construction, and unsupervised clustering was performed with resolution parameter = 0.4. UMAP visualization was performed with the RunUMAP function with min.dist = 0.1. Differentially expressed genes (DEGs) of clusters were identified using the FindAllMarkers function with FDR-adjusted P value < 0.05 and log₂(fold change) > 1.2.

Identification of malignant cells and mapping KRAS codon 12 mutations

Malignant cells were distinguished from non-malignant subsets based on information integrated from multiple sources as described in previous studies^15,42. The following strategies were used to identify malignant cells. (1) Cluster distribution: owing to the high degree of inter-patient tumour heterogeneity, malignant cells often exhibit distinct cluster distribution compared with normal epithelial cells. Although non-malignant cells derived from different patients are often clustered together by cell type, malignant cells from different patients probably form separate clusters. (2) CNVs: we applied inferCNV¹⁶ (v.1.3.2) to infer large-scale CNVs in each individual cell with T cells as the reference control. To quantify CNVs at the cell level, CNV scores were aggregated using a previously described strategy¹⁶. In brief, arm-level CNV scores were computed based on the mean of the squares of CNV values across each chromosomal arm. Arm-level CNV scores were further aggregated across all chromosomal arms by calculating the arithmetic mean value of the arm-level scores using the R function mean. (3) Marker gene expression: expression of lung epithelial lineage-specific genes and LUAD-related oncogenes was determined in epithelial cell clusters. (4) Cell-level expression of KRAS^G12D mutations: as we previously described¹⁵, BAM files were queried for KRAS^G12D mutant alleles, which were then mapped to specific cells. KRAS^G12D mutations, along with cluster distribution, marker gene expression and inferred CNVs as described above, were used to distinguish malignant cells from non-malignant cells. Following clustering of malignant cells from all patients, an absence of malignant cells that were identified from P12 or P16 was noted. This can be possibly attributed to the low number of epithelial cells captured in tumour samples from these patients (Supplementary Table 2).

Mapping KRAS codon 12 mutations

To map somatic KRAS mutations at single-cell resolution, alignment records were extracted from the corresponding BAM files using mutation location information. Unique mapping alignments (MAPQ = 255) labelled as either PCR duplication or secondary mapping were filtered out. The resulting somatic variant carrying reads were evaluated using Integrative Genomics Viewer (IGV)⁴⁶ and the CB tags were used to identify cell identities of mutation-carrying reads. To estimate the VAF of KRAS^G12D mutation and cell fraction of KRAS^G12D-carrying cells within malignant and non-malignant epithelial cell subpopulations (for example, malignant cells from all LUADs, malignant cells from KM-LUADs, KACs from KM-LUADs), reads were first extracted based on their unique cell barcodes and BAM files were generated for each subpopulation using samtools (v.1.15). Mutations were then visualized using IGV, and VAFs were calculated by dividing the number of KRAS^G12D-carrying reads by the total number of uniquely aligned reads for each subpopulation. A similar approach was used to visualize KRAS^G12C–carrying reads and to calculate the VAF of KRAS^G12C in KACs of normal tissues from KM-LUAD cases. To calculate the mutation-carrying cell fraction, extracted reads were mapped to the KRAS^G12D locus from BAM files using AlignmentFile and fetch functions in pysam package. Extracted reads were further filtered using the ‘Duplicate’ and ‘Quality’ tags to remove PCR duplicates and low-quality mappings. The number of reads with or without KRAS^G12D mutation in each cell was summarized using the CB tag in read barcodes. Mutation-carrying cell fractions were then calculated as the ratio of the number of cells with at least one KRAS^G12D read over the number of cells with at least one high-quality read mapped to the locus.

PCA analysis of malignant cells and quantification of transcriptome similarity

Raw unique molecular identifier counts of identified malignant cells were log-normalized and used for PCA analysis using Seurat (RunPCA function). PCA dimension reduction data were extracted using the Embeddings function. The top three most highly ranked PCs were exported for visualization using JMP (v.15). 3D scatterplots of PCA data were generated using the scatterplot 3D tool in JMP (v.15). Bhattacharyya distances were calculated using the bhattacharyya.dist function from the R package fpc (v.2.2-9). The top 25 highly ranked PCs were used for both patient-level and cell-level distance calculations. For Bhattacharyya distance quantification at the cell level, 100 cells were randomly sampled for each patient group defined by driver mutations (for example, KM-LUADs). The random sampling process was repeated 100 times, and pairwise Bhattacharyya distances were then calculated between patient groups. Differences in Bhattacharyya distances between patient groups were tested using Wilcoxon rank-sum tests, and boxplots were generated using the geom_boxplot function from the R package ggplot2 (v.3.2.0).

Determination of non-malignant cell types and states

Non-malignant cell types and states were determined based on unsupervised clustering analysis following batch effect correction using Harmony⁴⁴. Two rounds of clustering analysis were performed on non-malignant cells to identify major cell types and cell transcriptional states within major cell types. Clustering and UMAP visualization of human normal epithelial cells (Extended Data Fig. 1a) were performed using Seurat with default parameters. Specifically, the parameters k.param = 20 and resolution = 0.4 were used for SNN graph construction and cluster identification, respectively. UMAP visualization was performed with default parameters (min.dist = 0.3). For clustering analysis of airway and alveolar epithelial cells, the RunPCA function was used to determine the most contributing top PCs for each subpopulation and similar clustering parameters (k.param = 20 and resolution = 0.4) were used for SNN graph construction and cluster identification. UMAP plots were generated with min.dist = 0.3 using the RunUMAP function in Seurat. Density plots of alveolar intermediate cells were generated using the stat_densit_2d function in the R package ggplot2 (v.3.3.5) with the first two UMAP dimension reduction data as the input. DEGs for each cluster were identified using the FindMarkers function in Seurat with a FDR-adjusted P < 0.05 and a fold change cut-off > 1.2. Canonical epithelial marker genes from previously published studies by our group and others^15,47,48 were used to annotate normal epithelial cell types and states. Bubble plots were generated for select DEGs and canonical markers to define AT1 cells (AGER1⁺ETV5⁺PDPN⁺), AT2 cells (SFTPB⁺SFTPC⁺ETV5⁺), SCGB1A1⁺SFTPC⁺ dual-positive cells, AICs (AGER1⁺ETV5⁺PDPN⁺ and SFTPB⁺SFTPC⁺), club and secretory cells (SCGB1A1⁺SCGB3A1⁺CYP2F1⁺), basal cells (KRT5⁺TP63⁺), ciliated cells (CAPS⁺PIFO⁺FOXJ1⁺), ionocytes (ASCL3⁺FOXI⁺), neuroendocrine cells (CALCA⁺ASCL1⁺) and tuft cells (ASCL2⁺MGST2⁺PTGS1⁺). KACs were identified by unsupervised clustering of AICs and defined based on previously reported marker genes^25,26,49, including significant upregulation of the following genes relative to other alveolar cells: KRT8, CLDN4, PLAUR, CDKN1A and CDKN2A.

Scoring of curated gene signatures

Genes in previously defined ITH MPs¹⁹ were downloaded from the original study. Among a total of 41 consensus ITH MPs identified, MPs with unassigned functional annotations (unassigned MPs 38–41; n = 4), neural and haematopoietic lineage-specific MPs (MPs 25–29, MPs 33–37; n = 10) and cell-type-specific MPs irrelevant to LUAD (MPs 22–24 secreted/cilia, MP 32 skin-pigmentation; n = 4) were filtered out, resulting in 23 MPs that closely correlated with hallmarks of cancer and that were used for further analysis. Signature scores were computed using the AddModuleScore function in Seurat as previously described^15,42. The KRAS signature used in this study was derived by calculating DEGs between the KRAS mutant malignant-cell-enriched cluster and other malignant cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2; Extended Data Fig. 2i). Human and mouse KAC signatures and the human ‘other AIC’ signature were derived by calculating DEGs using FindAllMarkers among alveolar cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2). Mouse genes in the p53 pathway were downloaded from the Molecular Signature Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_P53_PATHWAY; MM3896). Signature scores for KACs, other AICs, KRAS and p53 were calculated using the AddModuleScore function in Seurat.

Analysis of alveolar cell differentiation states and trajectories

Analysis of differentiation trajectories of lung alveolar and malignant cells was performed using Monocle 2 (ref. ⁵⁰) by inferring the pseudotemporal ordering of cells according to their transcriptome similarity. Monocle 2 analysis of malignant cells from P14 was performed using default parameters with the detectGenes function. Detected genes were further required to be expressed by at least 50 cells. For construction of the differentiation trajectory of lineage-labelled epithelial cells (GFP⁺), the top 150 DEGs (FDR-adjusted P value < 0.05, log(fold change) > 1.5, expressed in ≥50 cells) ranked by fold-change of each cell population from NNK-treated samples were used for ordering cells with the setOrderingFilter function. Trajectories were generated using the reduceDimension function with the method set to ‘DDRTree’. Trajectory roots were selected based on the following criteria: (1) inferred pseudotemporal gradient; (2) CytoTRACE score prediction; and (3) careful manual review of the DEGs along the trajectory. To depict expression dynamics of ITH MPs¹⁹, ITH MP scores were plotted along the pseudotime axis and smoothed lines were generated using the smoother tool in JMP Pro (v.15). Using the raw counts without normalization as input, CytoTRACE¹⁸ was applied with default parameters to infer cellular differentiation states to complement trajectory analysis and further understand cellular differentiation hierarchies. The normalmixEM function from the R package mixtools was used to determine the CytoTRACE score threshold in AICs with k = 2. A final threshold of 0.58 was used to dichotomize AICs into high-differentiation and low-differentiation groups. The Wasserstein distance metric was applied using R package transport (v.0.13) to quantify the variability of distribution of CytoTRACE scores. The function wasserstein1d was used to calculate the distance between the distribution of actual CytoTRACE scores of one patient and the distribution of simulated data with identical mean and standard deviation. The robustness of Monocle 2-based pseudotemporal ordering prediction was validated by independent pseudotime prediction tools including Palantir⁵¹, Slingshot⁵² and Cellrank⁵³. Slingshot (v.2.6.0) pseudotime prediction was performed using slingshot function with reduceDim parameter set to ‘PCA’ and other parameters set to defaults. Cellrank prediction was performed using the CytoTRACEKernel function with default parameters from Cellrank python package (v.1.5.1). Palantir prediction was performed using Palantir python package (v.1.0.1). A diffusion map was generated using run_diffusion_maps function with n_components = 5. Palantir prediction was generated using run_palantir function with num_waypoints = 500 and other parameters set to defaults. Inferred pseudotime by the three independent methods was then integrated with that generated by Monocle 2 for each single cell, followed by pairwise mapping and correlation analysis. Cell density plots were generated using Contour tool in JMP (v.15) with n = 10 gradient levels and contour type parameter set to ‘Nonpar Density’. To assess the pseudotime prediction consistency between Monocle 2 and the three independent methods, Spearman’s correlation coefficients were calculated and statistically tested using cor.test function in R.

ST data generation and analysis

ST profiling of formalin fixation and paraffin-embedding (FFPE) tissues from P14 with LUAD and of three lung tissues from two Gprc5a^−/− mice was performed using the Visium platform from 10X Genomics according to the manufacturer’s recommendations and as previously reported⁵⁴. P14 FFPE tissues were collected from areas adjacent to the tissues analysed by scRNA-seq. Regions of interest per tissue or sample, each comprising a 6.5 × 6.5 mm capture area, were selected based on careful annotation of H&E-stained slides that were digitally acquired using an Aperio ScanScope Turbo slide scanner (Leica Microsystems). HALO software (Indica Labs) was used for pathological annotation (tumour areas, blood vessels, bronchioles, lymphoid cell aggregates, macrophages, muscle tissue, normal parenchyma and reactive pneumocytes) of H&E histology images. Spot-level histopathological annotation and visualization was generated using loupe browser (v.6.3.0). In brief, cloupe files generated from Space Ranger were loaded into the loupe browser. Visualization of annotation was then generated in svg formats using the export plot tool. ST RNA-seq libraries were generated according to the manufacturer’s instructions, each with up to about 3,600 uniquely barcoded spots. Libraries were sequenced on an Illumina NovaSeq 6000 platform to achieve a depth of at least 50,000 mean read pairs per spot and at least 2,000 median genes per spot.

Demultiplexed raw sequencing data were aligned, and gene level expression quantification was generated with analysis pipelines as previously described⁵⁴. In brief, demultiplexed clean reads were aligned against the UCSC human GRCh38 (hg38) or the GRCm38 (mm10) mouse reference genomes by Spaceranger (v.1.3.0 for human ST data and v.2.0.0 for mouse ST data) and using default settings. Generated ST gene expression count matrices were then analysed using Seurat (v.4.1.0) to perform unsupervised clustering analysis. Using default parameters, the top-ranked 30 PCA components were used for SNN graph construction and clustering and for UMAP low-dimension space embedding with default parameters. UMAP analysis was performed using the RunUMAP function. The SpatialDimPlot function was used to visualize unsupervised clustering. The R package inferCNV¹⁶ was used for copy number analysis. Reference spots used in CNV analysis were selected on the basis of careful review of cluster marker genes using the DotPlot function from Seurat and inspection of pathological annotation. CNV scores were calculated by computing the standard deviations of CNVs inferred across 22 autosomes. Loupe browser (v.6.3.0) was used for visualization of pathological annotation results. Expression levels of genes of interest (for example, KRT8) as well as signatures of interest (for example, KAC and KRAS) were measured and directly annotated on histology images with pixel-level resolution using the TESLA (v.1.2.2) machine learning framework⁵⁵ (https://github.com/jianhuupenn/TESLA). TESLA can compute superpixel-level gene expression and detect unique structures within and surrounding tumours by integrating information from high-resolution histology images. The annotation and visualize_annotation functions were used to annotate regions with high signature signals. KRT8, PLAUR, CLDN4, CDKN1A and CDKN2A were used for ‘KAC markers’ signature annotation in the human ST analysis. For mouse ST data, Krt8, Plaur, Cldn4, Cdkn1a and Cdkn2a were used for ‘KAC signature’ annotation. Gene level expression visualization of Krt8 and Plaur was generated using the scatter function from scanpy (v.1.9.1). Deconvolution analysis was conducted using CytoSPACE⁵⁶ (https://github.com/digitalcytometry/cytospace). Annotated scRNA-seq data were first transformed into a compatible format using function generate_cytospace_from_scRNA_seurat_object. Visium spatial data were prepared using the function generate_cytospace_from_ST_seurat_object. Deconvolution was performed using CytoSpace function (v.1.0.4) with default parameters. To determine neighbouring cell composition for a specific cell population in Visium data, CytoSPACE was first applied to annotate every spot with the most probable cell type. Neighbouring spots were defined as the six spots surrounding each spot and, accordingly, the neighbouring cell composition for specific cell types were computed. Trajectory construction of ST data was performed using Monocle 2 (ref. ¹⁸) with the DDRTree method using DEGs with FDR-adjusted P value < 0.05.

Bulk DNA extraction and WES

Total DNA was isolated from homogenized cryosections of human lung tissues and, when available, from frozen peripheral blood mononuclear cells (PBMCs) using a Qiagen AllPrep mini kit (80204) or a DNeasy Blood and Tissue kit (69504), respectively (both from Qiagen) according to the manufacturer’s recommendations. Qubit 4 Fluorometer (Thermo Fisher Scientific) was used for measurement of DNA yield. TWIST-WES was performed on a NovaSeq 6000 platform at a depth of 200× for tumour samples and 100× for NL and PBMCs to analyse recurrent driver mutations and using either PBMCs or distant NL tissues when blood draw was not consented, as germline control. WES data were processed and mapped to the human reference genome, and somatic mutations were identified and annotated as previously described^57,58 with further filtration steps. In brief, only MuTect⁵⁹ calls marked as ‘KEEP’ were selected and taken into the next step. Mutations with a low VAF (<0.02) or low alt allele read coverage (<4) were removed. Then, common variants reported by ExAc (the Exome Aggregation Consortium, http://exac.broadinstitute.org), Phase-3 1000 Genome Project (http://phase3browser.1000genomes.org/Homo_sapiens/Info/Index) or the NHLBI GO Exome Sequencing Project (ESP6500) (http://evs.gs.washington.edu/EVS/) with minor allele frequencies greater than 0.5% were further removed. Intronic mutations, mutations at 3′ or 5′ UTR or UTR-flanking regions, and silent mutations were also removed. The mutation load in each tumour was calculated as the number of nonsynonymous somatic mutations (nonsense, missense, splicing, stop gain, stop loss substitutions as well as frameshift insertions and deletions).

Survival analysis

Analysis of OS in the TCGA LUAD and PROSPECT⁶⁰ cohorts was performed as previously described¹⁵. KRAS mutation status in TCGA LUAD samples was downloaded from cBioPortal (https://www.cbioportal.org, study ID: luad_tcga_pan_can_atlas_2018). For TCGA dataset, clinical data were downloaded from the PanCanAtlas study¹⁸. The logrank test and Kaplan–Meier methods were used to calculate P values between groups and to generate survival curves, respectively. Statistical significance testing for all survival analyses was two-sided. To control for multiple hypothesis testing, Benjamini–Hochberg method was applied to correct P values, and FDR q values were calculated where applicable. Results were considered significant at P value or FDR q value of <0.05. Multivariate survival analysis was performed using a Cox proportional hazards regression model that calculated the hazard ratio, the 95% confidence interval and P values when using pathologic stage, age, KAC and ‘other AIC’ signatures as covariables.

Analysis of public datasets

Publicly available datasets were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE149813, GSE154989, GSE150263, GSE102511 and GSE219124. Details of the studies^28,29 analysed are as follows: GSE149813 investigated single lung cells from Kras^{LSL-G12D;LSL-YFP} mice with Ad5CMV-Cre infection²⁹; GSE154989 studied AT2 lineage-labelled cells from lungs of Kras^LSL-G12D/+;Rosa26^{LSL-tdTomato/+} mice²⁸. Gene expression count matrices of dataset interrogating Kras^G12D-driven mouse model from GSE149813 were pre-processed using Seurat following the same filtering steps in that original report. For the GSE154989 dataset²⁸, cells used for analysis were the ones labelled as “PASSED_QC” in supplementary table S7 in that study. For the GSE149813 dataset²⁹, cells with >500 median number of genes detected and <10% fraction of mitochondrial genome derived reads, and according to the pre-processing methods described in their original report²⁹, were retained for analysis. Cells with >7,500 number of genes detected were further filtered to remove potential doublets or multiplets, resulting in 8,304 cells in total for downstream analysis. Both datasets were integrated with mouse cell data generated in this study using Harmony¹⁸ with default parameters settings. The top ranked 20 Harmony-corrected PCs were used for clustering with the FindClusters function using resolution = 0.4. UMAP dimension reduction embedding was performed using the RunUMAP function with the same set of Harmony-corrected PCs. Gene expression levels and frequencies of representative cluster marker genes were visualized using DotPlot function from Seurat. The KAC signature score was calculated using the AddModuleScore function from Seurat. The mouse KAC signature was also studied in human AT2 cells with and without inducible KRAS^G12D (dataset GSE150263) also from ref. ²⁹. Cell filtration criteria described in the original report²⁹ were followed to filter out potential dead cells and doublets (number of detected genes > 800 and the percent of mitochondrial gene reads fraction < 25%). The 20 top-ranked PCs were used for clustering using the FindClusters function with resolution = 0.1. UMAP dimension reduction embeddings were computed using the same SNN graph. The KAC signature score was calculated using AddModuleScore function from Seurat package.

The bulk RNA-seq dataset GSE102511 was a previously published dataset by our group and comprised normal lung tissues, precursor AAHs and matched LUADs (n = 15, each)⁶¹. The previously published⁶² bulk RNA-seq data GSE219124 were generated on cancer stem cell and stem cell-like progenitor cells, in the form of spheres, and their parental MDA-F471 counterparts (a cell line we had developed and cultured from a KM-LUAD of an NNK-exposed Gprc5a^−/− mouse)⁶². To interrogate the association of KACs with tumour formation, gene expression matrices of bulk RNA-seq data GSE102511 (TPM count matrix) and GSE219124 (FPKM count matrix) were extracted and used for quantification of KAC signature expression using MCPcounter (v.1.2.0) R package. Heatmaps were generated using pheatmap (v.1.0.12) R package.

Mouse KACs from this study were compared to mouse Krt8⁺ transitional cells involved in alveolar regeneration post-acute lung injury from a previous study²⁵. Overlapping marker genes between mouse KACs and the previously reported Krt8⁺ transitional cells were statistically evaluated using the ggvenn (v.0.1.9) R package using the top-ranked 50 marker genes based on fold change from each study.

Digital spatial profiling of human tissues

The following antibodies were used for digital spatial profiling (DSP): claudin 4 (clone 3E2C1, AF594, LSBio, LS-C354893, concentration 0.5 µg ml^–1) and keratin 8 (clone EP1628Y, AF647, Abcam, ab192468, concentration 0.25 µg ml^–1). Optimization of antibodies was performed with different dilutions using colorectal carcinoma and LUAD tissues. IF staining was performed on three cases of matched LUAD and NL using the standard GeoMx DSP protocol for morphology markers only (PanCk: clone AE1/AE3, AF532, concentration 0.25 µg ml^–1, from GeoMx Solid Tumour Morp kit HsP, 121300301, Novus Biologicals). Slides were scanned at ×20 using the GeoMx DSP platform (NanoString Technologies). Following scanning, multiplex IF image slides were visualized, adjusting channel thresholds for each fluorophore. Expression of KRT8, PanCK and CLDN4 was assessed in adenocarcinoma cells, adjacent reactive lung tissue and distant non-reactive lung tissue.

Animal housing and tobacco carcinogen exposure experiments

Animal experiments were conducted according to Institutional Animal Care and Use Committee (IACUC)-approved protocols at the University of Texas MD Anderson Cancer Center. Mice were maintained in a pathogen-free animal facility. No statistical methods were used to predetermine sample sizes. In all animal experiments, sex-matched and age-matched mice were randomized to treatment groups. For all experiments and until end points were reached (up to 7 months after exposure to saline or NNK), mice were monitored for signs of ill health and their body weight was measured to ensure weight loss did not exceed 20% of body weight over 72 h. None of the mice developed these symptoms; therefore, they were all euthanized after reaching IACUC-approved end points. End points permitted by our IACUC protocols were not exceeded in any of the experiments. Analysis of data from animal experiments was performed in a blinded fashion. To study KACs in the context of KM-LUAD pathogenesis in vivo, Gprc5a^−/− mice were interrogated because they form LUADs that are accelerated by tobacco carcinogen exposure and acquire somatic Kras^G12D mutations—features that are highly pertinent to KM-LUAD development^21,63,64 and therefore to exploring KACs in this setting. Gprc5a^−/− mice were generated as previously described^21,65. Sex-matched and age-matched Gprc5a^−/− mice were divided into starting groups of 4 mice per exposure (NNK or saline control) and time point (EOE or 7 months after exposure, n = 16 mice in total). Eight-week-old mice were intraperitoneally injected with 75 mg kg^–1 of body weight NNK or vehicle 0.9% saline (control), 3 times per week for 8 weeks. At EOE or at 7 months after exposure, lungs were collected for derivation of live single cells for scRNA-seq. Whole lungs from additional mice treated as described above were processed by FFPE and for analysis by IF (n = 2 mice per treatment group at EOE and 7 months after exposure, 8 mice in total) and ST (3 lung tissues from n = 2 mice at 7 months after NNK exposure).

Sftpc^creER/+;Rosa^Sun1GFP/+ mice were provided by H. Chapman (University of California, San Francisco) and were crossed to Gprc5a^−/− mice to generate Gprc5a^−/−;Sftpc^creER/+;Rosa^Sun1GFP/+ mice for analysis of lineage-labelled AT2 cells. Gprc5a^−/−;Sftpc^creER/+;Rosa^Sun1GFP/+ mice were treated with 75 mg kg^–1 NNK or control saline (intraperitoneally), 3 times per week for 8 weeks. At week 6 of treatment (2 weeks before EOE), mice from both groups received 250 µg (intraperitoneally) tamoxifen dissolved in corn oil for four consecutive days. At EOE or 3 months after exposure to saline or NNK, lungs were digested to derive live (Sytox Blue-negative) GFP⁺ single cells by flow cytometry using a FACS Aria I instrument as previously described⁶⁶ (the gating strategy for GFP cell sorting is shown in Supplementary Fig. 6). Sorted single cells were analysed by scRNA-seq (GFP⁺ and GFP⁻ fractions from n = 2 mice per treatment at 3 months after exposure to saline and NNK) or used to derive organoids (GFP⁺ cells from n = 4 or 5 mice at EOE to saline or NNK, respectively, and from n = 10 or 13 mice at 3 months after saline or NNK, respectively). Whole lungs from additional mice treated with saline or NNK and tamoxifen as described above (n = 2 per treatment group) were collected (FFPE) at 3 months after NNK and analysed by IF.

Krt8-creER;Rosa^tdT/+ animals were used to generate Gprc5a^−/−;Krt8-creER;Rosa^tdT/+ mice for analysis of lineage-labelled KRT8⁺ cells. Krt8-creER (stock number 017947) and Rosa^tdT/+ (Ai14; stock number 007914) mice were obtained from the Jackson Laboratory. Mice harbouring Krt8-creER;Rosa^tdT/+ were first used for pilot studies to examine labelling of KRT8⁺ cells. Mice were exposed to control saline (n = 2 mice) or to 8 weeks of NNK (n = 3 mice) as described above followed by 1 mg tamoxifen for 6 continuous days, after which lungs were analysed at the end of tamoxifen exposure. To examine the relevance of labelled KRT8⁺ cells to tumour development, Gprc5a^−/−;Krt8-creER;Rosa^tdT/+ mice were similarly exposed to NNK for 8 weeks followed by tamoxifen, and lungs were then analysed at 8–12 weeks after NNK exposure (n = 3 mice). All lungs were collected and processed for formalin fixation, OCT embedding and IF analysis.

Histopathological and IF analysis of mouse lung tissues

Lungs of Gprc5a^−/− mice (n = 2 per treatment and time point) were inflated with formalin by gravity drip inflation, excised, examined for lung surface lesions by macroscopic observation and processed for FFPE, sectioning and H&E staining. Stained slides were digitally scanned using an Aperio ScanScope Turbo slide scanner (Leica Microsystems) at ×200 magnification, and visualized using ImageScope software (Leica Microsystems). Unstained lung tissue sections were obtained for IF analysis of LAMP3 (clone 391005, Synaptic Systems), KRT8 (TROMA-I clone from the University of Iowa DSHB) and PDPN (clone 8.1.1, from the University of Iowa DSHB). Lung FFPE tissue samples were obtained in the same manner from Gprc5a^−/−;Sftpc^creER/+;Rosa^Sun1GFP/+ mice at 3 months after exposure to saline or NNK (n = 2 mice per condition) and following injection with tamoxifen. Tissue sections were obtained for H&E staining and assessment of tumour development, and unstained sections were used for IF analysis using antibodies against GFP (AB13970, Abcam, 1:5000), LAMP3 (391005, Synaptic Systems, 1:10,000), KRT8 (TROMA-I, University of Iowa Developmental Studies Hybridoma Bank, 1:100), PDPN (clone 8.1.1, University of Iowa Developmental Studies Hybridoma Bank, 1:100), claudin 4 (ZMD.306, Invitrogen, 1:250), and PRKCDBP (cavin 3, Proteintech, 1:250). Slides were then stained with fluorophore-conjugated secondary antibodies and 4′,6′-diamidino-2-phenylindole (DAPI). Sections were mounted with Aquapolymount (18606, Polysciences), cover slipped, imaged using an Andor Revolution XDi WD spinning disk confocal microscope and analysed using Imaris software (Oxford Instruments).

Formalin-inflated lung lobes from Krt8-creER;Rosa^tdT/+ mice were cryoprotected in 20% sucrose in PBS containing 10% OCT compound (4583, Tissue-Tek) overnight on a rocker at 4 °C and embedded in OCT. The next day, 10 µm cryosections were blocked in PBS with 0.3% Triton X-100 and 5% normal donkey serum (017-000-121, Jackson ImmunoResearch) and incubated overnight in a humidified chamber at 4 °C with primary antibodies diluted in PBS with 0.3% Triton X-100 and raised against NKX2-1 (sc-13040, Santa Cruz, 1:1000), LAMP3 (same as above) and KRT8 (same as above). The next morning, sections were washed followed by incubation with secondary antibodies (Jackson ImmunoResearch) and DAPI. Slides were then washed, cover slipped as described above and imaged using a Nikon A1plus confocal microscope. Cell counter ImageJ plugin was used to count tdT⁺ cells within lesions and cells in normal-appearing areas, namely: AT2 cells (LAMP3⁺), tdT⁺ AT2 cells (tdT⁺LAMP3⁺), AT1 cells (LAMP3^–NKX2-1⁺, avoiding noticeable airways) and tdT⁺ AT1 cells (tdT⁺NKX2-1⁺LAMP3^–). Percentages of tdT⁺LAMP3⁺ and tdT⁺NKX2-1⁺LAMP3^– cells out of total tdT⁺ cells were computed. Counts were averages of triplicate images taken at ×20 magnification for each time point. The percent regional surface area covered by tdT⁺ cells in normal-appearing regions was estimated by examining the tdT expression across entire lobe sections for each replicate.

3D culture and analysis of AT2-derived organoids

Gprc5a^−/−;Sftpc^creER/+;Rosa^Sun1GFP/+ were treated with NNK or saline and tamoxifen as described above, and they were euthanized at EOE (4 saline-treated and 5 NNK-treated mice) or at 3 months after exposure (10 saline-treated and 13 NNK-treated mice). Lungs were collected, dissociated into single cells (see mouse single-cell derivation in the Methods section ‘Single-cell isolation from tissue samples’), and live (Sytox Blue-negative) GFP⁺ single cells were collected by flow cytometry using a FACS Aria I instrument as previously described⁶⁶. GFP⁺ AT2 cells from NNK-treated or saline-treated groups were immediately washed and resuspended at a concentration of 5,000 cells per 50 µl of 3D medium (F12 medium supplemented with insulin, transferrin and selenium, 10% FBS, penicillin–streptomycin and l-glutamine). GFP⁺ cells were mixed at a 1:1 ratio (by volume) with 50,000 mouse endothelial cells (collected from mouse lungs by CD31 selection and expanded in vitro as previously described⁶⁷) and resuspended in 50 µl of Geltrex reduced growth factor basement membrane matrix (A1413301, Gibco). Next, 100 µl of 1:1 GFP⁺:endothelial cell mixture was plated on Transwell inserts with 0.4 µm pores and allowed to solidify for 30 min in a humidified CO₂ incubator (EOE: n = 3 wells per condition; 3 months after exposure: n = 4 wells for saline-derived organoids and n = 12 wells for NNK-derived organoids). Each well was then supplemented with 3D medium containing ROCK inhibitor (Y-27632, Millipore) and recombinant mouse FGF-10 (6224-FG, R&D Systems), and plates were incubated at 37 °C in a humidified CO₂ incubator. Wells were replenished with 3D medium every other day. For GFP⁺ organoids derived from mice exposed to NNK, 200 nM KRAS(G12D)-specific inhibitor MRTX1133 or DMSO vehicle was added to the medium and replenished 3 times a week (n = 6 wells per condition). Organoids were monitored and analysed twice a week using an EVOS M7000 imaging system (Thermo Fisher Scientific), whereby the numbers and sizes of organoids greater than 100 µm in diameter were recorded. At end point, 3D organoids were collected from the basement membrane matrix using Gentle Cell Dissociation reagent (100-0485, StemCell Technologies), fixed with 4% paraformaldehyde, permeabilized, blocked and stained overnight at 4 °C with a mixture of IF primary antibodies raised against LAMP3, GFP, KRT8 and cavin 3. The next day, organoids were washed and stained with fluorophore-conjugated secondary antibodies overnight at 4 °C while being protected from light. Organoids were washed and stained with DAPI nuclear stain for 30 min, after which they were collected in Aqua-Poly/Mount (18606-20, Polysciences) and transferred to slides. Images of organoids were captured using an Andor Revolution XDi WD spinning disk confocal microscope and analysed using Imaris software (Oxford Instruments).

2D viability assays

Mouse mycoplasma-free LUAD cell lines LKR13 (mutant Kras^G12D-driven³¹) and MDA-F471 (Gprc5a^−/− and Kras^G12D mutant²⁷) were plated on 96-well plates (10³ cells per well) and grown in DMEM (Gibco) supplemented with 10% FBS, 1% antibiotic antimycotic solution (A5955, Sigma-Aldrich) and 1% l-glutamine (G7513, Sigma-Aldrich). The next day, cells were cultured for up to 4 days with medium containing 0.5% FBS, 0.5% FBS with 50 ng ml^–1 epidermal growth factor (EGF) (E5160, Sigma-Aldrich), or 0.5% FBS with EGF and varying concentrations of MRTX1133 (Mirati Therapeutics). alamarBlue Cell Viability reagent (25 µl; DAL1025, ThermoFisher) was added to each well. At 4 days after treatment, viability was assessed by fluorescence spectrophotometry at 570 nm (and 600 nm as a reference). For the wells showing net positive absorbances relative to blank wells (at least 3 wells per cell line and condition), the percent differences in reduction between treated and control wells were calculated.

Western blot analysis

LKR13 and MDA-F471 cells were plated in 6-well plates (10⁶ cells per well) and grown under different conditions as described above. Protein lysates were extracted at 3 h after treatment and analysed by western blotting following overnight incubation with antibodies to the following primary proteins: vinculin (E1E9V, rabbit, Cell Signaling Technology, 13901; 1:1,000); phosphorylated p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9101; 1:2,000); phosphorylated S6 ribosomal protein (Ser 235/236, rabbit, Cell Signaling Technology, 4858; 1:2,000); p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9102; 1:2,000); or S6 (E.573.4, rabbit, Invitrogen, MA5-15164; 1:1,000). This was followed by 1 h of incubation with diluted secondary antibody (1706515 goat anti-rabbit IgG-HRP conjugate, Bio-Rad). Protein lysates from each cell line were analysed on multiple gels (four per cell line) with Precision Plus Protein Dual Color Standard (1610394, Bio-Rad) as the ladder and blotted to membranes to separately probe for phosphorylated and total forms of the same proteins, which have highly similar molecular weights (using phospho-specific antibodies or antibodies targeting total version of same protein). Vinculin protein levels were evaluated as loading control on each of the blots. Four blots (phospho-ERK, total ERK, phospho-S6 and total S6) for each of LKR13 and MDA-F471 are shown in Supplementary Fig. 9, each with its own analysis of equal protein loading (vinculin blot) and whereby only the ones indicated with green rectangles are presented in Extended Data Fig. 12c. Membranes were cut horizontally using molecular weight marker as a guide, and cut membranes were incubated with the specified antibodies (see Supplementary Fig. 9 for site of cutting and for overlay of colorimetric and chemiluminescent images of the same blot to display ladder and the analysed protein, respectively). Blots were imaged using the ChemiDoc Touch Imaging System (Bio-Rad) with Chemiluminescence and Colorimetric (for protein ladder) applications and auto expose or manual settings.

Chemicals and reagents

Tobacco-specific carcinogen (NNK) with a purity of 99.96% by HPLC was purchased from TargetMol. Tamoxifen and H&E staining reagents were purchased from Sigma Aldrich. The KRAS(G12D) inhibitor MRTX1133 was provided by J. Christensen (Mirati Therapeutics).

Statistical analyses

In addition to the algorithms and statistical analyses described above, all other basic statistical analyses were performed in the R statistical environment (v.4.0.0). The Kruskal–Wallis H-test was used to compare variables of interests across three or more groups. Wilcoxon rank-sum test was used for paired comparisons among matched samples from the same patients. Wilcoxon rank-sum test was used to compare other continuous variables such as gene expression levels and signature scores between groups. Spearman’s correlation coefficient was calculated to assess associations between two continuous variables (for example, cellular proportions and gene signature scores). Fisher’s exact test was used to identify differences in frequencies of groups based on two categorical variables. Ordinal logistic regression was performed using the polr function in the built-in R package MASS (v.7.3). Benjamin–Hochberg method was used to control for multiple hypothesis testing. All statistical tests performed in this study were two-sided. Results were considered significant at P values or FDR q values < 0.05. When a P value reported by R was smaller than 2.2e-16, it was reported as P < 2.2 × 10⁻¹⁶.

Ethics declarations

All human LUAD and normal lung tissues were obtained from patients who provided informed consent and under institutional review board-approved protocols at The University of Texas MD Anderson Cancer Center. All human data in this manuscript are deidentified to ensure patient privacy. All animal studies were conducted under IACUC-approved protocols at the University of Texas MD Anderson Cancer Center.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

[ad_2]

Source link

February 28, 2024

Tag: Cancer genomics

Geographic variation of mutagenic exposures in kidney cancer genomes

Recruitment of cases and informed consent

Bio-samples, data collection and expert pathology review

DNA extraction

Whole-genome sequencing

Somatic variant calling

Validation of sequencing for Japanese cases

Generation of mutational matrices

Mutational signature analysis

Attribution of activities of mutational signatures

Driver mutations

Evolutionary analysis

Regressions

Polygenic risk score analysis of lifestyle risk factors

Untargeted metabolomics association with signatures

Targeted metabolomics analyses

Geospatial analyses

Statistics and reproducibility

Reporting summary

3D genomic mapping reveals multifocality of human pancreatic precancers

Evolutionary trajectories of small cell lung cancer under therapy

Human lung tumour specimens

Clinical characteristics

DNA and RNA extraction

Next-generation sequencing

Whole-exome sequencing

WGS

RNA-seq

Dideoxynucleotide sequencing for validation of somatic alterations

Data processing of transcriptome sequencing data

Data processing of genome sequencing data

Analysis of clonal architecture from multiregional and longitudinal tumour samples

Analysis of tumour phylogenies

Analysis of cancer cell fractions for structural rearrangements

Analysis of mutational signatures

Analysis of significant mutations, copy number alterations and genome ploidy

Clinical correlations with chemotherapy relapse-free survival

Immunoblot analysis

Reporting summary

An atlas of epithelial cell states and plasticity in lung adenocarcinoma

Multi-regional sampling of human surgically resected LUADs and NL tissues

Single-cell isolation from tissue samples

Sorting and enrichment of viable lung epithelial singlets

Preparation of single-cell 5′ gene expression libraries

scRNA-seq data processing and quality control

Doublet detection and removal, and batch effect evaluation and correction

Unsupervised clustering and subclustering analysis

Identification of malignant cells and mapping KRAS codon 12 mutations

Mapping KRAS codon 12 mutations

PCA analysis of malignant cells and quantification of transcriptome similarity

Determination of non-malignant cell types and states

Scoring of curated gene signatures

Analysis of alveolar cell differentiation states and trajectories

ST data generation and analysis

Bulk DNA extraction and WES

Survival analysis

Analysis of public datasets

Digital spatial profiling of human tissues

Animal housing and tobacco carcinogen exposure experiments

Histopathological and IF analysis of mouse lung tissues

3D culture and analysis of AT2-derived organoids

2D viability assays

Western blot analysis

Chemicals and reagents

Statistical analyses

Ethics declarations

Reporting summary

Deep whole-genome analysis of 494 hepatocellular carcinomas

Naturally occurring T cell mutations enhance engineered T cell therapies