Novel genotyping algorithms for rare variants significantly improve the accuracy of Applied Biosystems™ Axiom™ array genotyping calls

O Mizrahi Man, MH Woehrmann, TA Webster, J Gollub, A Bivol, SM Keeble, KH Aull, A Mittal, AH Roter, BA Wong, JP Schmidt

bioRxiv (2021)

Abstract
Objective: To significantly improve the positive predictive value (PPV) and sensitivity of Applied Biosystems™ Axiom™ array variant calling through novel improvements to genotyping algorithms and careful quality control of array probesets. These improvements make array genotyping more suitable for very rare variants.

Design: Retrospective evaluation of UK Biobank array data re-genotyped with improved algorithms for rare variants.

Participants: 488,359 people recruited to the UK Biobank with Axiom array genotyping data, including 200,630 with exome sequencing data.

Main Outcome Measures: A comparison of genotyping calls from array data to genotyping calls on a subset of variants with exome sequencing data.

Results: Axiom genotyping [18] performed well, based on comparison to sequencing data, for over 100,000 common variants directly genotyped on the Axiom UK Biobank array and also exome sequenced by the UK Biobank Exome Sequencing Consortium. However, in a comparison to the initial exome sequencing results of the first 50K individuals, Weedon et al. [1] observed that, when grouping these variants by the minor allele frequency (MAF) observed in UK Biobank, the concordance with sequencing and the resulting positive predictive value (PPV) decreased with the number of heterozygous (Het) array calls per variant. An improved genotyping algorithm, Rare Heterozygous Adjustment (RHA) [16], released mid-2020 for genotyping on Axiom arrays, significantly improves PPV in all MAF ranges for the 50K data as well as when compared to the exome sequencing of 200K individuals, which was released after Weedon et al. [1] performed their comparison. The RHA algorithm improved PPVs in the 200K data in the lowest three frequency groups, [0, 0.001%), [0.001%, 0.005%) and [0.005%, 0.01%), to 83%, 82% and 88%, respectively. PPV was above 95% for higher MAF ranges without algorithm improvement.
PPVs are somewhat higher in the 200K data set, due to a different "truth set" from exome sequencing and because monomorphic exome loci are not included in the joint genotyping calls for the 200K data set, as explained in the methods section. Sensitivity was also higher in the 200K data set than in the original 50K data, especially for low MAF ranges. This increase is due in part to the larger data set over which sensitivity could be computed and in part to the different WES algorithms used for the 200K data [7]. Filtering of a relatively small number of non-performing probesets (determined without reference to the exome sequencing data) significantly improved sensitivities for all MAF ranges, resulting in 70%, 88% and 94%, respectively, in the three lowest MAF ranges and greater than 98% and 99.9% for the two higher MAF ranges ([0.01%, 1%), [1%, 50%]).

Conclusions: Improved algorithms for genotyping, along with enhanced quality control of array probesets, significantly improve the positive predictive value and the sensitivity of array data, making it suitable for the detection of very rare variants. The probeset filtering methods developed have resulted in better probe designs for arrays, and the new genotyping algorithm is part of the standard algorithm for all Axiom arrays since early 2020.

### Competing Interest Statement

The authors are employed by Thermo Fisher Scientific, which participated in the design of and manufactures the UK Biobank Axiom Array.
* Allele: each of two or more alternative forms of DNA that are found at the same location on a chromosome
* Allele A and allele B: for a SNP, the two alternatives that can be observed and measured in a given sample are designated as "allele A" and "allele B"
* Array: DNA microarray that is used to genotype known genetic variants (SNPs and indels) in the population
* Clustering space: the X and Y dimensions defined by Signal Contrast and Signal Size
* Exome: ~1-2% of the human genome that codes for proteins
* Genotyping: method for determining the base (A, G, T, or C) present at a specific location in a person's DNA
* Het call: short for heterozygous call
* Hom call: short for homozygous call
* Heterozygous: two different alleles at a given locus in an individual
* Homozygous: two identical alleles at a given locus in an individual
* Indel: type of variant where one or more bases are inserted or deleted as compared to the reference genome
* Major Homozygous: two identical alleles at a given locus that represent the most common allele at a given locus for the population of interest
* Minor Homozygous: two identical alleles at a given locus that represent the less common allele at a given locus for the population of interest
* Mean positive predictive value over a set of variants and individuals: the average over the positive predictive values for each variant
* nAB: the number of samples called heterozygous by AxiomGT1 in the clustering space
* Negative predictive value: proportion of the normal alleles that are confirmed by the reference standard (true negative/(true negative + false negative))
* Positive predictive value: proportion of variant alleles that are confirmed by the reference standard (true positive/(true positive + false positive))
* Overall positive predictive value: positive predictive value across all variants and samples
* Probeset: a specific set of DNA sequences on the microarray that detect the presence of two or more alleles at a given locus
* Sensitivity: proportion of variant alleles detected by the reference standard that are also found by the index test (true positive/(true positive + false negative))
* Signal Contrast and Signal Size: AxiomGT1 genotype clustering is carried out in two dimensions. The X dimension is called "contrast" and the Y dimension is called "size". They are log-linear combinations of the two allele signal intensities. For alleles A and B, contrast is log2(A/B) and size is (log2(A) + log2(B))/2
* Single nucleotide polymorphism (SNP): type of single nucleotide variant; a position in the genome where an individual differs from the reference human genome by a single base change (i.e., a substitution of a single letter of DNA). A SNP may be rare or common in the population
* Specificity: proportion of normal alleles detected by the reference standard that are also found to be normal by the index test (true negative/(true negative + false positive))
* Variant: locus in the genome where different alleles have been observed in different people
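
The glossary formulas for Signal Contrast, Signal Size and the four accuracy measures can be written out as a minimal Python sketch. It is illustrative only and is not the AxiomGT1 or RHA implementation; the function names and the example counts are hypothetical and do not come from the paper.

```python
import math
from typing import Tuple


def contrast_and_size(a: float, b: float) -> Tuple[float, float]:
    """Map allele A and allele B signal intensities to clustering space.

    Uses the glossary definitions: contrast = log2(A/B) and
    size = (log2(A) + log2(B)) / 2. Intensities must be positive.
    """
    if a <= 0 or b <= 0:
        raise ValueError("signal intensities must be positive")
    contrast = math.log2(a / b)
    size = (math.log2(a) + math.log2(b)) / 2
    return contrast, size


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return _ratio(tp, tp + fp)


def npv(tn: int, fn: int) -> float:
    """Negative predictive value: TN / (TN + FN)."""
    return _ratio(tn, tn + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity: TP / (TP + FN)."""
    return _ratio(tp, tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Specificity: TN / (TN + FP)."""
    return _ratio(tn, tn + fp)


if __name__ == "__main__":
    # Hypothetical intensities and counts, chosen only to exercise the formulas.
    print(contrast_and_size(8.0, 2.0))
    print(ppv(tp=83, fp=17), sensitivity(tp=70, fn=30))
```

In this sketch, equal A and B intensities give a contrast of zero, which is where heterozygous calls would be expected to cluster, while the sign of the contrast indicates which allele's signal dominates.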