Annotating Metabolite Mass Spectra with Domain-Inspired Chemical Formula Transformers
Nat Mach Intell (2023)
Abstract
Tandem mass spectrometry is a useful tool for identifying metabolites, but it is limited by how well computational methods can annotate peaks with chemical structures when spectra are dissimilar to previously observed spectra. Goldman and colleagues use a transformer-based method that incorporates domain insights into its architecture to annotate chemical structure fragments and to simultaneously predict the structure of the metabolite and its fragments from the spectrum.

Metabolomics studies have identified small molecules that mediate cell signaling, competition and disease pathology, in part due to large-scale community efforts to measure tandem mass spectra for thousands of metabolite standards. Nevertheless, the majority of spectra observed in clinical samples cannot be unambiguously matched to known structures. Deep learning approaches to small-molecule structure elucidation have surprisingly failed to rival classical statistical methods, which we hypothesize is due to the lack of in-domain knowledge incorporated into current neural network architectures. Here we introduce a neural network-driven workflow for untargeted metabolomics, Metabolite Inference with Spectrum Transformers (MIST), to annotate tandem mass spectra peaks with chemical structures. Unlike existing approaches, MIST incorporates domain insights into its architecture by encoding peaks with their chemical formula representations, implicitly featurizing pairwise neutral losses and training the network to additionally predict substructure fragments. MIST performs favorably compared with both standard neural architectures and the state-of-the-art kernel method on the task of fingerprint prediction for over 70% of metabolite standards and retrieves 66% of metabolites with equal or improved accuracy, with 29% strictly better.
We further demonstrate the utility of MIST by suggesting potential dipeptide and alkaloid structures for differentially abundant spectra found in an inflammatory bowel disease patient cohort.
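To make the idea of chemical formula representations and pairwise neutral losses concrete, the following is a minimal sketch and not the MIST implementation. It assumes a small fixed element vocabulary (`ELEMENTS`), encodes each peak formula as a vector of element counts, and computes neutral losses explicitly as differences between those vectors, whereas the abstract describes MIST as featurizing them implicitly. The function names and the example peak formulas are illustrative assumptions.

```python
import re
from itertools import combinations

# Assumed element vocabulary for this sketch; a real model may use a larger set.
ELEMENTS = ("C", "H", "N", "O", "P", "S")


def formula_to_vector(formula):
    """Encode a chemical formula string as a tuple of element counts."""
    counts = dict.fromkeys(ELEMENTS, 0)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in counts:
            raise ValueError(f"element {element!r} is outside the assumed vocabulary")
        counts[element] += int(number) if number else 1
    return tuple(counts[e] for e in ELEMENTS)


def vector_to_formula(vector):
    """Turn a non-negative count vector back into a formula string."""
    return "".join(
        e + (str(n) if n > 1 else "") for e, n in zip(ELEMENTS, vector) if n > 0
    )


def pairwise_neutral_losses(peak_formulas):
    """Map (larger, smaller) peak pairs to the formula of the neutral loss.

    A pair is kept only when one formula fully contains the other, so the
    difference is a chemically meaningful (non-negative) formula.
    """
    vectors = {f: formula_to_vector(f) for f in peak_formulas}
    losses = {}
    for a, b in combinations(peak_formulas, 2):
        for big, small in ((a, b), (b, a)):
            diff = tuple(x - y for x, y in zip(vectors[big], vectors[small]))
            if any(diff) and all(d >= 0 for d in diff):
                losses[(big, small)] = vector_to_formula(diff)
    return losses


# Illustrative peak formulas (assumed for this example, not taken from the paper).
peaks = ["C9H12NO2", "C8H10N", "C9H9O2"]
losses = pairwise_neutral_losses(peaks)
for (big, small), loss in losses.items():
    print(f"{big} -> {small}: loss of {loss}")
```

In this sketch, the two fragment formulas do not contain one another, so no neutral loss is reported between them; only the differences relative to the largest formula are kept.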