Tag: Databases

  • ‘All of Us’ genetics chart stirs unease over controversial depiction of race

    A conceptual photograph of a scientist extracting a DNA sample from a vial ready for automated analysis in front of a screen displaying a DNA profile.

    Geneticists have become concerned about how the research they publish could be interpreted incorrectly by the public. Credit: Tek Image/Science Photo Library

    Some geneticists have expressed their unease about a figure in a high-profile Nature paper that was published earlier this week (ref. 1), noting that it could be misinterpreted as reinforcing racist beliefs. The figure has reignited a long-standing debate among geneticists about how best to discuss and depict race, ethnicity and genomic ancestry, given how these terms can be misinterpreted and weaponized by extremists.

    “The problem is, a lot of people will see figures like this as supporting a viewpoint” that race and ethnicity are closely aligned with genetics, says Ewan Birney, deputy director-general of the European Molecular Biology Laboratory in Cambridgeshire, UK. “And then they build castles in the air from all this.”

    Alexander Bick, a physician and geneticist at Vanderbilt University Medical Center in Nashville, Tennessee, who co-authored the paper in which the figure appears, acknowledged in an e-mail to Nature’s news team that “it’s clear that the figure fell short of our intended goal for this paper”. (Nature’s news team is editorially independent of its journal team.) But Bick defended the analysis, noting that it is a “faithful representation of the patterns that exist in the data that is consistent with representations in other similar studies” and that he is not planning to submit a correction to remove the plot.

    Stirring debate

    The paper is part of a larger package of articles published on 19 February that detail the progress and initial analyses of the All of Us programme, run by the US National Institutes of Health in Bethesda, Maryland. The programme aims to tackle inequities in genetics research by assembling detailed genomic and health profiles for a diverse group of one million people in the United States by the end of 2026.

    Within hours of the papers’ publication, population geneticist Jonathan Pritchard at Stanford University in California posted on the social-media platform X (formerly Twitter) to share his concerns about the figure, which is intended to showcase the diversity of the first 250,000 genomes included in the All of Us database. The chart uses an algorithm called UMAP to visualize genetic relationships and participants’ self-described race and ethnicity. But a problem with using UMAP, Pritchard wrote, is that it can exaggerate the distinctiveness of populations and fail to represent their intermixing properly.

    In reality, “genetic variation is a continuum, and thus genetic ancestry cannot be objectively carved out into discrete groups”, says Roshni Patel, a statistical geneticist who works with Pritchard at Stanford University.

    Scientific figure showing a chart with axes labelled 'UMAP 1' and 'UMAP 2' and coloured areas which correspond to race.

    An excerpt from a figure in a Nature paper that some geneticists say could be misinterpreted to reinforce racist beliefs. Source: Ref 1.

    To a layperson, the chart shows several distinct colourful blobs that could be misinterpreted as supporting genetic essentialism — the pseudoscientific belief that racial or ethnic groups are distinct genetic categories, and that individuals of the same group are genetically similar, Birney says.

    That is the opposite of what the data show, Bick says. “Our analysis reaffirms that race and ethnicity are social constructs that do not have a basis in genetics”.

    A few researchers have publicly called on Nature’s journal team to retract the paper because of the potentially misleading nature of the figure. Most geneticists contacted by Nature’s news team did not think retraction was necessary, but said they hoped that the authors would acknowledge the concerns. A spokesperson for the journal says that Nature’s editors are “aware of the discussions that are taking place and are in contact with the authors”.

    Public interpretation

    Geneticists have become particularly sensitive to how analyses they publish are used or interpreted by the public. In 2022, an 18-year-old gunman in Buffalo, New York, attempted to justify killing 10 Black people at a grocery shop with a 180-page manifesto that included several citations and figures from genetics papers.

    This ignited a debate among geneticists about their responsibility to ensure that their research does not spread in pseudoscientific circles and is not used to justify violence. “The language of some of this work showing up in other people’s violent manifestos is a very sobering reality that geneticists are having to reconcile with,” says Nicole Iturriaga, a political sociologist at the University of California, Irvine, who has studied how genetics research is co-opted by far-right groups.

    But Birney says he isn’t convinced there is a “neat and pretty way of representing relationships between humans”. Human genetic relationships are best described by family trees, and any effort to squash the hundreds of thousands of people included in a data set into a single chart will leave out important context, he says. It doesn’t help that every country uses the terms ‘race’ and ‘ethnicity’ in its own way, which makes the concepts harder for the public to understand, he adds.

    That doesn’t mean that geneticists should stop using charts in their manuscripts; it means that extra care is necessary to ensure that data are communicated responsibly, says Anna Lewis, a specialist in the ethical implications of genomics at Brigham and Women’s Hospital in Boston, Massachusetts. Patel adds that the All of Us researchers could have more closely followed the recommendations of a 2023 report issued by the US National Academies of Sciences, Engineering, and Medicine (NASEM) about best practices for using population descriptors in genetics research. Bick responds that the manuscript was written in 2022, before the NASEM report was published.

    Despite the charged and public nature of the criticism, Iturriaga says that this is a healthy debate for the field to have. And it’s important that geneticists think about the potential for misinterpretation before it’s too late, she adds.



  • Genomic data in the All of Us Research Program

    The All of Us cohort

    All of Us aims to engage a longitudinal cohort of one million or more US participants, with a focus on including populations that have historically been under-represented in biomedical research. Details of the All of Us cohort have been described previously (ref. 5). Briefly, the primary objective is to build a robust research resource that can facilitate the exploration of biological, clinical, social and environmental determinants of health and disease. The programme will collect and curate health-related data and biospecimens, and these data and biospecimens will be made broadly available for research uses. Health data are obtained through the electronic medical record and through participant surveys. Survey templates can be found on our public website: https://www.researchallofus.org/data-tools/survey-explorer/. Adults 18 years and older who have the capacity to consent and reside in the USA or a US territory at present are eligible. Informed consent for all participants is conducted in person or through an eConsent platform that includes primary consent, HIPAA Authorization for Research use of EHRs and other external health data, and Consent for Return of Genomic Results. The protocol was reviewed by the Institutional Review Board (IRB) of the All of Us Research Program. The All of Us IRB follows the regulations and guidance of the NIH Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly.

    Data accessibility through a ‘data passport’

    Authorization for access to participant-level data in All of Us is based on a ‘data passport’ model, through which authorized researchers do not need IRB review for each research project. The data passport is required for gaining data access to the Researcher Workbench and for creating workspaces to carry out research projects using All of Us data. At present, data passports are authorized through a six-step process that includes affiliation with an institution that has signed a Data Use and Registration Agreement, account creation, identity verification, completion of ethics training, and attestation to a data user code of conduct. Results reported follow the All of Us Data and Statistics Dissemination Policy, which disallows disclosure of group counts under 20 without prior approval, to protect participant privacy (ref. 40).

    EHR data

    At present, All of Us gathers EHR data from about 50 health care organizations that are funded to recruit and enrol participants as well as transfer EHR data for those participants who have consented to provide them. Data stewards at each provider organization harmonize their local data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, and then submit it to the All of Us Data and Research Center (DRC) so that it can be linked with other participant data and further curated for research use. OMOP is a common data model standardizing health information from disparate EHRs to common vocabularies and organized into tables according to data domains. EHR data are updated from the recruitment sites and sent to the DRC quarterly. Updated data releases to the research community occur approximately once a year. Supplementary Table 6 outlines the OMOP concepts collected by the DRC quarterly from the recruitment sites.

    Biospecimen collection and processing

    Participants who consented to participate in All of Us donated fresh whole blood (4 ml EDTA and 10 ml EDTA) as a primary source of DNA. The All of Us Biobank managed by the Mayo Clinic extracted DNA from 4 ml EDTA whole blood, and DNA was stored at −80 °C at an average concentration of 150 ng µl⁻¹. The buffy coat isolated from 10 ml EDTA whole blood has been used for extracting DNA in the case of initial extraction failure or absence of 4 ml EDTA whole blood. The Biobank plated 2.4 µg DNA with a concentration of 60 ng µl⁻¹ in duplicate for array and WGS samples. The samples are distributed to All of Us Genome Centers weekly, and a negative (empty well) control and National Institute of Standards and Technology controls are incorporated every two months for QC purposes.

    Genome sequencing

    Genome Center sample receipt, accession and QC

    On receipt of DNA sample shipments, the All of Us Genome Centers carry out an inspection of the packaging and sample containers to ensure that sample integrity has not been compromised during transport and to verify that the sample containers correspond to the shipping manifest. QC of the submitted samples also includes DNA quantification, using routine procedures to confirm volume and concentration (Supplementary Table 7). Any issues or discrepancies are recorded, and affected samples are put on hold until resolved. Samples that meet quality thresholds are accessioned in the Laboratory Information Management System, and sample aliquots are prepared for library construction processing (for example, normalized with respect to concentration and volume).

    WGS library construction, sequencing and primary data QC

    The DNA sample is first sheared using a Covaris sonicator and is then size-selected using AMPure XP beads to restrict the range of library insert sizes. Using the PCR Free Kapa HyperPrep library construction kit, enzymatic steps are completed to repair the jagged ends of DNA fragments, add proper A-base segments, and ligate indexed adapter barcode sequences onto samples. Excess adaptors are removed using AMPure XP beads for a final clean-up. Libraries are quantified using quantitative PCR with the Illumina Kapa DNA Quantification Kit and then normalized and pooled for sequencing (Supplementary Table 7).

    Pooled libraries are loaded on the Illumina NovaSeq 6000 instrument. The data from the initial sequencing run are used to QC individual libraries and to remove non-conforming samples from the pipeline. The data are also used to calibrate the pooling volume of each individual library and re-pool the libraries for additional NovaSeq sequencing to reach an average coverage of 30×.

    After demultiplexing, WGS analysis occurs on the Illumina DRAGEN platform. The DRAGEN pipeline consists of highly optimized algorithms for mapping, aligning, sorting, duplicate marking and haplotype variant calling and makes use of platform features such as compression and BCL conversion. Alignment uses the GRCh38dh reference genome. QC data are collected at every stage of the analysis protocol, providing high-resolution metrics required to ensure data consistency for large-scale multiplexing. The DRAGEN pipeline produces a large number of metrics that cover lane, library, flow cell, barcode and sample-level metrics for all runs as well as assessing contamination and mapping quality. The All of Us Genome Centers use these metrics to determine pass or fail for each sample before submitting the CRAM files to the All of Us DRC. For mapping and variant calling, all Genome Centers have harmonized on a set of DRAGEN parameters, which ensures consistency in processing (Supplementary Table 2).

    Every step through the WGS procedure is rigorously controlled by predefined QC measures. Various control mechanisms and acceptance criteria were established during WGS assay validation. Specific metrics for reviewing and releasing genome data are: mean coverage (threshold of ≥30×), genome coverage (threshold of ≥90% at 20×), coverage of hereditary disease risk genes (threshold of ≥95% at 20×), aligned Q30 bases (threshold of ≥8 × 10¹⁰), contamination (threshold of ≤1%) and concordance to independently processed array data.
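    As a rough illustration only, the release thresholds listed above can be written as a simple per-sample check. This is a minimal sketch, not the Genome Centers' actual code: the class name, field names and the idea of returning a list of failed checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WgsMetrics:
    """Per-sample metrics. Field names are hypothetical, not DRAGEN output names."""
    mean_coverage: float          # mean depth, e.g. 31.0 for 31x
    frac_genome_at_20x: float     # fraction of the genome covered at >=20x
    frac_hdr_genes_at_20x: float  # fraction of hereditary disease risk genes at >=20x
    aligned_q30_bases: float      # number of aligned bases with quality >= Q30
    contamination: float          # estimated contamination fraction
    array_concordant: bool        # agrees with independently processed array data


def failed_checks(m: WgsMetrics) -> list[str]:
    """Return the names of the thresholds (as stated in the text) that a sample misses."""
    checks = {
        "mean_coverage": m.mean_coverage >= 30,
        "genome_coverage": m.frac_genome_at_20x >= 0.90,
        "hdr_gene_coverage": m.frac_hdr_genes_at_20x >= 0.95,
        "aligned_q30_bases": m.aligned_q30_bases >= 8e10,
        "contamination": m.contamination <= 0.01,
        "array_concordance": m.array_concordant,
    }
    return [name for name, ok in checks.items() if not ok]
```

    A sample would be releasable under this sketch only when failed_checks returns an empty list.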

    Array genotyping

    Samples are processed for genotyping at three All of Us Genome Centers (Broad, Johns Hopkins University and University of Washington). DNA samples are received from the Biobank and the process is facilitated by the All of Us genomics workflow described above. All three centres used an identical array product, scanners, resource files and genotype calling software for array processing to reduce batch effects. Each centre has its own Laboratory Information Management System that manages workflow control, sample and reagent tracking, and centre-specific liquid handling robotics.

    Samples are processed using the Illumina Global Diversity Array (GDA) with Illumina Infinium LCG chemistry using the automated protocol and scanned on Illumina iSCANs with Automated Array Loaders. Illumina IAAP software converts raw data (IDAT files; 2 per sample) into a single GTC file per sample using the BPM file (defines strand, probe sequences and illumicode address) and the EGT file (defines the relationship between intensities and genotype calls). Files used for this data release are: GDA-8v1-0_A5.bpm, GDA-8v1-0_A1_ClusterFile.egt, gentrain v3, reference hg19 and gencall cutoff 0.15. The GDA array assays a total of 1,914,935 variant positions including 1,790,654 single-nucleotide variants, 44,172 indels, 9,935 intensity-only probes for CNV calling, and 70,174 duplicates (same position, different probes). Picard GtcToVcf is used to convert the GTC files to VCF format. Resulting VCF and IDAT files are submitted to the DRC for ingestion and further processing. The VCF file contains assay name, chromosome, position, genotype calls, quality score, raw and normalized intensities, B allele frequency and log R ratio values. Each genome centre is running the GDA array under Clinical Laboratory Improvement Amendments-compliant protocols. The GTC files are parsed and metrics are uploaded to in-house Laboratory Information Management System systems for QC review.

    At batch level (each set of 96-well plates run together in the laboratory at one time), each genome centre includes positive control samples that are required to have >98% call rate and >99% concordance to existing data to approve release of the batch of data. At the sample level, the call rate and sex are the key QC determinants (ref. 41). Contamination is also measured using BAFRegress (ref. 42) and reported out as metadata. Any sample with a call rate below 98% is repeated one time in the laboratory. Genotyped sex is determined by plotting normalized x versus normalized y intensity values for a batch of samples. Any sample discordant with ‘sex at birth’ reported by the All of Us participant is flagged for further detailed review and repeated one time in the laboratory. If several sex-discordant samples are clustered on an array or on a 96-well plate, the entire array or plate will have data production repeated. Samples identified with sex chromosome aneuploidies are also reported back as metadata (XXX, XXY, XYY and so on). A final processing status of ‘pass’, ‘fail’ or ‘abandon’ is determined before release of data to the All of Us DRC. An array sample will pass if the call rate is >98% and the genotyped sex and sex at birth are concordant (or the sex at birth is not applicable). An array sample will fail if the genotyped sex and the sex at birth are discordant. An array sample will have the status of abandon if the call rate is <98% after at least two attempts at the genome centre.
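    The pass, fail and abandon rules above can be sketched as a small decision function. This is an illustration under stated assumptions rather than the genome centres' implementation: the function name and arguments are hypothetical, the order in which the two rules are checked is assumed, a call rate of exactly 98% is treated as low, and a discordant sample is assumed to fail only after its one repeat.

```python
from typing import Optional


def array_status(call_rate: float, genotyped_sex: str,
                 sex_at_birth: Optional[str], attempts: int) -> Optional[str]:
    """Return 'pass', 'fail' or 'abandon', or None if a repeat is still due.

    sex_at_birth=None stands for 'not applicable'. Checking call rate before
    sex concordance is an assumption; the text does not order the two rules.
    """
    if call_rate <= 0.98:
        # Low call rate: repeated once, abandoned after at least two attempts.
        # Treating exactly 98% as low is an assumption (the text says >98% and <98%).
        return "abandon" if attempts >= 2 else None
    if sex_at_birth is not None and genotyped_sex != sex_at_birth:
        # Discordant samples are flagged and repeated once; failing only
        # after that repeat is an assumption of this sketch.
        return "fail" if attempts >= 2 else None
    return "pass"
```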

    Data from the arrays are used for participant return of genetic ancestry and non-health-related traits for those who consent, and they are also used to facilitate additional QC of the matched WGS data. Contamination is assessed in the array data to determine whether DNA re-extraction is required before WGS. Re-extraction is prompted by level of contamination combined with consent status for return of results. The arrays are also used to confirm sample identity between the WGS data and the matched array data by assessing concordance at 100 unique sites. To establish concordance, a fingerprint file of these 100 sites is provided to the Genome Centers to assess concordance with the same sites in the WGS data before CRAM submission.

    Genomic data curation

    As seen in Extended Data Fig. 2, we generate a joint call set for all WGS samples and make these data available in their entirety and by sample subsets to researchers. A breakdown of the frequencies, stratified by computed ancestries for which we had more than 10,000 participants, can be found in Extended Data Fig. 3. The joint call set process allows us to leverage information across samples to improve QC and increase accuracy.

    Single-sample QC

    If a sample fails single-sample QC, it is excluded from the release and is not reported in this document. These tests detect sample swaps, cross-individual contamination and sample preparation errors. In some cases, we carry out these tests twice (at both the Genome Center and the DRC), for two reasons: to confirm internal consistency between sites; and to mark samples as passing (or failing) QC on the basis of the research pipeline criteria. The single-sample QC process accepts a higher contamination rate than the clinical pipeline (0.03 for the research pipeline versus 0.01 for the clinical pipeline), but otherwise uses identical thresholds. The list of specific QC processes, passing criteria, error modes addressed and an overview of the results can be found in Supplementary Table 3.

    Joint call set QC

    During joint calling, we carry out additional QC steps using information that is available across samples including hard thresholds, population outliers, allele-specific filters, and sensitivity and precision evaluation. Supplementary Table 4 summarizes both the steps that we took and the results obtained for the WGS data. More detailed information about the methods and specific parameters can be found in the All of Us Genomic Research Data Quality Report (ref. 36).

    Batch effect analysis

    We analysed cross-sequencing centre batch effects in the joint call set. To quantify the batch effect, we calculated Cohen’s d (ref. 43) for four metrics (insertion/deletion ratio, single-nucleotide polymorphism count, indel count and single-nucleotide polymorphism transition/transversion ratio) across the three genome sequencing centres (Baylor College of Medicine, Broad Institute and University of Washington), stratified by computed ancestry and seven regions of the genome (whole genome, high-confidence calling, repetitive, GC content of >0.85, GC content of <0.15, low mappability, the ACMG59 genes and regions of large duplications (>1 kb)). Using random batches as a control set, all comparisons had a Cohen’s d of <0.35. Here we report any Cohen’s d results >0.5, a cut-off that we chose before this analysis and that is conventionally the threshold of a medium effect size (ref. 44).

    We found that there was an effect size in indel counts (Cohen’s d of 0.53) in the entire genome, between Broad Institute and University of Washington, but this was being driven by repetitive and low-mappability regions. We found no batch effects with Cohen’s d of >0.5 in the ratio metrics or in any metrics in the high-confidence calling, low or high GC content, or ACMG59 regions. A complete list of the batch effects with Cohen’s d of >0.5 is given in Supplementary Table 8.
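    For readers unfamiliar with the statistic, Cohen’s d is the difference between two group means divided by a pooled standard deviation. The sketch below uses the common pooled-variance form; which variant the analysis above used is not stated, so that choice, along with the function names and the example numbers, is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def exceeds_medium_effect(a, b, threshold=0.5):
    """Flag a comparison whose absolute effect size is above the reporting threshold."""
    return abs(cohens_d(a, b)) > threshold


# Made-up per-sample metric values for two hypothetical centres.
centre_1 = [2.0, 4.0, 6.0]
centre_2 = [1.0, 3.0, 5.0]
d = cohens_d(centre_1, centre_2)  # mean difference 1.0, pooled SD 2.0
```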

    Sensitivity and precision evaluation

    To determine sensitivity and precision, we included four well-characterized control samples: the National Institute of Standards and Technology Genome in a Bottle samples HG-001, HG-003, HG-004 and HG-005. The samples were sequenced with the same protocol as All of Us. Of note, these samples were not included in data released to researchers. We used the corresponding published set of variant calls for each sample as the ground truth in our sensitivity and precision calculations. We use the high-confidence calling region, defined by Genome in a Bottle v4.2.1, as the source of ground truth. To be called a true positive, a variant must match the chromosome, position, reference allele, alternate allele and zygosity. In cases of sites with multiple alternative alleles, each alternative allele is considered separately. Sensitivity and precision results are reported in Supplementary Table 5.
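    The matching rule described above can be illustrated with plain set arithmetic. In this hedged sketch, each variant is a hypothetical tuple of (chromosome, position, reference allele, alternate allele, zygosity), with multi-allelic sites assumed to be already split into one tuple per alternate allele; restriction to the high-confidence region is assumed to have happened upstream. The variants shown are invented for the example.

```python
def sensitivity_precision(truth, calls):
    """Compare call tuples against truth tuples; all five fields must match exactly."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    fp = len(calls - truth)   # called but absent from the truth set
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, precision


# Invented example: the last call has the right allele but the wrong zygosity,
# so it counts as both a false positive and a false negative.
truth = [
    ("chr1", 100, "A", "G", "het"),
    ("chr1", 200, "C", "T", "hom"),
    ("chr2", 300, "G", "A", "het"),
    ("chr2", 400, "T", "C", "het"),
]
calls = [
    ("chr1", 100, "A", "G", "het"),
    ("chr1", 200, "C", "T", "hom"),
    ("chr2", 300, "G", "A", "het"),
    ("chr2", 400, "T", "C", "hom"),
]
```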

    Genetic ancestry inference

    We computed categorical ancestry for all WGS samples in All of Us and made these available to researchers. These predictions are also the basis for population allele frequency calculations in the Genomic Variants section of the public Data Browser. We used the high-quality set of sites to determine an ancestry label for each sample. The ancestry categories are based on the same labels used in gnomAD (ref. 18), the Human Genome Diversity Project (HGDP; ref. 45) and 1000 Genomes (ref. 1): African (AFR); Latino/admixed American (AMR); East Asian (EAS); Middle Eastern (MID); European (EUR), composed of Finnish (FIN) and Non-Finnish European (NFE); Other (OTH), for samples that do not belong to one of the other ancestries or are admixed; and South Asian (SAS).

    We trained a random forest classifier (ref. 46) on a training set of autosomal variants from the HGDP and 1000 Genomes samples, obtained from gnomAD (ref. 11). We generated the first 16 principal components (PCs) of the training sample genotypes (using the hwe_normalized_pca in Hail) at the high-quality variant sites for use as the feature vector for each training sample. We used the truth labels from the sample metadata, which can be found alongside the VCFs. Note that we do not train the classifier on the samples labelled as Other. We use the label probabilities (‘confidence’) of the classifier on the other ancestries to determine ancestry of Other.

    To determine the ancestry of All of Us samples, we project the All of Us samples into the PCA space of the training data and apply the classifier. As a proxy for the accuracy of our All of Us predictions, we look at the concordance between the survey results and the predicted ancestry. The concordance between self-reported ethnicity and the ancestry predictions was 87.7%.

    PC data from All of Us samples and the HGDP and 1000 Genomes samples were used to compute individual participant genetic ancestry fractions for All of Us samples using the Rye program. Rye uses PC data to carry out rapid and accurate genetic ancestry inference on biobank-scale datasets (ref. 47). HGDP and 1000 Genomes reference samples were used to define a set of six distinct and coherent ancestry groups (African, East Asian, European, Middle Eastern, Latino/admixed American and South Asian) corresponding to participant self-identified race and ethnicity groups. Rye was run on the first 16 PCs, using the defined reference ancestry groups to assign ancestry group fractions to individual All of Us participant samples.

    Relatedness

    We calculated the kinship score using the Hail pc_relate function and reported any pairs with a kinship score above 0.1. The kinship score is half of the fraction of the genetic material shared (ranges from 0.0 to 0.5). We determined the maximal independent set (ref. 41) for related samples. We identified a maximally unrelated set of 231,442 samples (94%) using a kinship score threshold of 0.1.
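    To make the idea of an unrelated subset concrete, the sketch below treats samples as nodes and pairs with kinship above 0.1 as edges, then greedily drops the most-connected sample until no related pair remains. This is a simple pure-Python approximation written for illustration; it is not the Hail routine used in the analysis, it is not guaranteed to find the largest possible set, and the function name and inputs are hypothetical.

```python
from collections import defaultdict


def unrelated_subset(samples, kinship_pairs, threshold=0.1):
    """Greedy approximation of a maximal set with no pair above the kinship threshold.

    kinship_pairs is an iterable of (sample_a, sample_b, kinship_score).
    """
    neighbours = defaultdict(set)
    for a, b, kinship in kinship_pairs:
        if kinship > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept = set(samples)

    def degree(s):
        # Number of relatives of s that are still in the kept set.
        return len(neighbours[s] & kept)

    while kept:
        # Pick the sample with the most remaining relatives (ties broken by name).
        worst = max(kept, key=lambda s: (degree(s), str(s)))
        if degree(worst) == 0:
            break  # no related pairs remain
        kept.remove(worst)
    return kept
```

    The repeated scan makes this quadratic in the number of samples, which is fine for a demonstration but would need a smarter data structure at biobank scale.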

    LDL-C common variant GWAS

    The phenotypic data were extracted from the Curated Data Repository (CDR, Controlled Tier Dataset v7) in the All of Us Researcher Workbench. The All of Us Cohort Builder and Dataset Builder were used to extract all LDL cholesterol measurements from the Lab and Measurements criteria in EHR data for all participants who have WGS data. The most recent measurements were selected as the phenotype and adjusted for statin use (ref. 19), age and sex. A rank-based inverse normal transformation was applied for this continuous trait to increase power and deflate type I error. Analysis was carried out on the Hail MatrixTable representation of the All of Us WGS joint-called data including removing monomorphic variants, variants with a call rate of <95% and variants with extreme Hardy–Weinberg equilibrium values (P < 10⁻¹⁵). A linear regression was carried out with REGENIE (ref. 48) on variants with a minor allele frequency >5%, further adjusting for relatedness to the first five ancestry PCs. The final analysis included 34,924 participants and 8,589,520 variants.
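    A rank-based inverse normal transformation replaces each value by the standard-normal quantile of its rank. The sketch below is one common formulation, using the Blom offset c = 3/8 and average ranks for ties; the offset and tie handling used in the analysis above are not stated, so both are assumptions, as are the function name and example values.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Map values to normal quantiles of their ranks: Phi^-1((rank - c) / (n - 2c + 1))."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Made-up LDL-like values; the middle-ranked value maps to 0.
transformed = inverse_normal_transform([3.1, 1.2, 2.0])
```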

    Genotype-by-phenotype replication

    We tested replication rates of known phenotype–genotype associations in three of the four largest populations: EUR, AFR and EAS. The AMR population was not included because they have no registered GWAS. This method is a conceptual extension of the original GWAS × phenome-wide association study, which replicated 66% of powered associations in a single EHR-linked biobank (ref. 49). The PGRM is an expansion of this work by Bastarache et al., based on associations in the GWAS catalogue (ref. 50) in June 2020 (ref. 51). After directly matching the Experimental Factor Ontology terms to phecodes, the authors identified 8,085 unique loci and 170 unique phecodes that compose the PGRM. They showed replication rates in several EHR-linked biobanks ranging from 76% to 85%. For this analysis, we used the EUR- and AFR-based maps, considering only catalogue associations that were P < 5 × 10⁻⁸ significant.

    The main tools used were the Python package Hail for data extraction, plink for genomic associations, and the R packages PheWAS and pgrm for further analysis and visualization. The phenotypes, participant-reported sex at birth, and year of birth were extracted from the All of Us CDR (Controlled Tier Dataset v7). These phenotypes were then loaded into a plink-compatible format using the PheWAS package, and related samples were removed by sub-setting to the maximally unrelated dataset (n = 231,442). Only samples with EHR data were kept, filtered by selected loci, annotated with demographic and phenotypic information extracted from the CDR and ancestry prediction information provided by All of Us, ultimately resulting in 181,345 participants for downstream analysis. The variants in the PGRM were filtered by a minimum population-specific allele frequency of >1% or population-specific allele count of >100, leaving 4,986 variants. Results for which there were at least 20 cases in the ancestry group were included. Then, a series of Firth logistic regression tests with phecodes as the outcome and variants as the predictor were carried out, adjusting for age, sex (for non-sex-specific phenotypes) and the first three genomic PC features as covariates. The PGRM was annotated with power calculations based on the case counts and reported allele frequencies. Power of 80% or greater was considered powered for this analysis.

    Reporting summary

    Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

  • your results are only as good as your data

    A portrait of Hunter Moseley

    Hunter Moseley says that good reproducibility practices are essential to fully harness the potential of big data. Credit: Hunter N.B. Moseley

    We are in the middle of a data-driven science boom. Huge, complex data sets, often with large numbers of individually measured and annotated ‘features’, are fodder for voracious artificial intelligence (AI) and machine-learning systems, with details of new applications being published almost daily.

    But publication in itself is not synonymous with factuality. Just because a paper, method or data set is published does not mean that it is correct and free from mistakes. Without checking for accuracy and validity before using these resources, scientists will surely encounter errors. In fact, they already have.

    In the past few months, members of our bioinformatics and systems-biology laboratory have reviewed state-of-the-art machine-learning methods for predicting the metabolic pathways that metabolites belong to, on the basis of the molecules’ chemical structures (ref. 1). We wanted to find, implement and potentially improve the best methods for identifying how metabolic pathways are perturbed under different conditions: for instance, in diseased versus normal tissues.

    We found several papers, published between 2011 and 2022, that demonstrated the application of different machine-learning methods to a gold-standard metabolite data set derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG), which is maintained at Kyoto University in Japan. We expected the algorithms to improve over time, and saw just that: newer methods performed better than older ones did. But were those improvements real?

    Data leaks

    Scientific reproducibility enables careful vetting of data and results by peer reviewers as well as by other research groups, especially when the data set is used in new applications. Fortunately, in keeping with best practices for computational reproducibility, two of the papers (refs 2, 3) in our analysis included everything that is needed to put their observations to the test: the data set they used, the computer code they wrote to implement their methods and the results generated from that code. Three of the papers (refs 2–4) used the same data set, which allowed us to make direct comparisons. When we did so, we found something unexpected.

    It is common practice in machine learning to split a data set in two and to use one subset to train a model and another to evaluate its performance. If there is no overlap between the training and testing subsets, performance in the testing phase will reflect how well the model learns and performs. But in the papers we analysed, we identified a catastrophic ‘data leakage’ problem: the two subsets were cross-contaminated, muddying the ideal separation. More than 1,700 of 6,648 entries from the KEGG COMPOUND database — about one-quarter of the total data set — were represented more than once, corrupting the cross-validation steps.
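    The leakage described above can be checked for mechanically. The following is a minimal sketch, not the code used in the review: it assumes each entry carries a hypothetical "id" field that identifies duplicates, removes repeats before splitting, and reports any identifiers that appear on both sides of a split.

```python
import random


def deduplicate(records, key):
    """Keep the first occurrence of each key, preserving order."""
    seen, unique = set(), []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_keys(train, test, key):
    """Return the keys that occur in both the training and the testing subset."""
    return {key(r) for r in train} & {key(r) for r in test}


# Invented toy data set: 10 entries, but only 6 distinct identifiers.
records = [{"id": f"C{i % 6}"} for i in range(10)]
by_id = lambda r: r["id"]
unique = deduplicate(records, key=by_id)
train, test = train_test_split(unique, test_fraction=0.5, seed=1)
```

    Running leaked_keys on a split of the raw records, rather than the deduplicated ones, is a quick way to see whether a published train/test partition is contaminated.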

    When we removed the duplicates in the data set and applied the published methods again, the observed performance was less impressive than it had first seemed. There was a substantial drop in the F1 score — a machine-learning evaluation metric that is similar to accuracy but is calculated in terms of precision and recall — from 0.94 to 0.82. A score of 0.94 is reasonably high and indicates that the algorithm is usable in many scientific applications. A score of 0.82, however, suggests that it can be useful, but only for certain applications — and only if handled appropriately.
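    For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below computes it from counts of true positives, false positives and false negatives; the function name and the example counts are illustrative and are not taken from the studies discussed.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: precision and recall are both 0.8, so F1 is about 0.8.
example = f1_score(tp=8, fp=2, fn=2)
```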

    It is, of course, unfortunate that these studies were published with flawed results stemming from the corrupted data set; our work calls their findings into question. But because the authors of two of the studies followed best practices in computational scientific reproducibility and made their data, code and results fully available, the scientific method worked as intended, and the flawed results were detected and (to the best of our knowledge) are being corrected.

    The third team, as far as we can tell, included neither their data set nor their code, making it impossible for us to properly evaluate their results. If all of the groups had neglected to make their data and code available, this data-leakage problem would have been almost impossible to catch. That would be a problem not just for the studies that were already published, but also for every other scientist who might want to use that data set for their own work.

    More insidiously, the erroneously high performance reported in these papers could dissuade others from attempting to improve on the published methods, because they would incorrectly find their own algorithms lacking by comparison. Equally troubling, it could also complicate journal publication, because demonstrating improvement is often a requirement for successful review — potentially holding back research for years.

    Encouraging reproducibility

    So, what should we do with these erroneous studies? Some would argue that they should be retracted. We would caution against such a knee-jerk reaction — at least as a blanket policy. Because two of the three papers in our analysis included the data, code and full results, we could evaluate their findings and flag the problematic data set. On one hand, that behaviour should be encouraged — for instance, by allowing the authors to publish corrections. On the other, retracting studies with both highly flawed results and little or no support for reproducible research would send the message that scientific reproducibility is not optional. Furthermore, demonstrating support for full scientific reproducibility provides a clear litmus test for journals to use when deciding between correction and retraction.

    Now, scientific data are growing more complex every day. Data sets used in complex analyses, especially those involving AI, are part of the scientific record. They should be made available — along with the code with which to analyse them — either as supplemental material or through open data repositories, such as Figshare (Figshare has partnered with Springer Nature, which publishes Nature, to facilitate data sharing in published manuscripts) and Zenodo, that can ensure data persistence and provenance. But those steps will help only if researchers also learn to treat published data with some scepticism, if only to avoid repeating others’ mistakes.
