alternative ORFs; alternative proteins; database; multicoding; precision medicine; proteogenomics; variants; Proteins; Peptides; Humans; Databases, Protein; Open Reading Frames; Peptides/genetics; Peptides/analysis; Proteomics/methods; Proteins/genetics; Chemistry (all); Biochemistry; General Chemistry
Abstract :
Proteomic diversity in biological samples can be characterized by mass spectrometry (MS)-based proteomics using customized protein databases generated from sets of transcripts previously detected by RNA-seq. This diversity has only been increased by the recent discovery that many translated alternative open reading frames remain unannotated at unsuspected locations of mRNAs and ncRNAs. These novel protein products, termed alternative proteins, have been left out of all previous custom database generation tools. Consequently, genetic variations that impact alternative open reading frames and variant peptides from their translated proteins are not detectable with current computational workflows. To fill this gap, we present OpenCustomDB, a bioinformatics tool that uses sample-specific RNA-seq data to identify genomic variants in canonical and alternative open reading frames, allowing for more than one coding region per transcript. In a test reanalysis of a cohort of 16 patients with acute myeloid leukemia, 5666 peptides from alternative proteins were detected, including 201 variant peptides. We also observed that a significant fraction of peptide-spectrum matches previously assigned to peptides from canonical proteins received better scores when reassigned to peptides from alternative proteins. Custom protein libraries that include sample-specific sequence variations of all possible open reading frames are promising contributions to the development of proteomics and precision medicine. The raw and processed proteomics data presented in this study can be found in the PRIDE repository with accession number PXD029240.
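The idea of allowing more than one coding region per transcript can be illustrated with a minimal sketch. This is not OpenCustomDB's implementation. The toy transcript, the ORF coordinates, and the helper names `translate` and `variant_proteins` are all hypothetical, chosen only to show how a single nucleotide variant can alter both a canonical and an overlapping alternative open reading frame.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def variant_proteins(transcript, orfs, pos, alt):
    """Apply one single-nucleotide variant and report every ORF it changes.

    orfs maps an ORF name to (start, end) in 0-based, half-open transcript
    coordinates; pos is the 0-based transcript position of the variant.
    Returns {name: (reference_protein, variant_protein)}.
    """
    mutated = transcript[:pos] + alt + transcript[pos + 1:]
    changed = {}
    for name, (start, end) in orfs.items():
        if not start <= pos < end:
            continue  # variant lies outside this ORF
        ref = translate(transcript[start:end])
        var = translate(mutated[start:end])
        if ref != var:  # skip synonymous changes
            changed[name] = (ref, var)
    return changed


# Hypothetical transcript with two overlapping ORFs in different frames.
TRANSCRIPT = "ATGGCATGCCTGAAGTAAGCTGA"
ORFS = {"canonical": (0, 18), "alternative": (5, 23)}

for name, (ref, var) in variant_proteins(TRANSCRIPT, ORFS, 9, "A").items():
    print(name, ref, "->", var)
```

In this toy case the C>A change at position 9 falls in the region shared by both reading frames, so a database built from the canonical ORF alone would miss the variant alternative protein.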
Disciplines :
Biochemistry, biophysics & molecular biology
Author, co-author :
Guilloy, Noé; Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec J1E 4K8, Canada ; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Montreal, Québec H2X 3Y7, Canada
Brunet, Marie A; Department of Pediatrics, Université de Sherbrooke, Sherbrooke, Québec J1E 4K8, Canada
Leblanc, Sébastien; Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec J1E 4K8, Canada ; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Montreal, Québec H2X 3Y7, Canada
Jacques, Jean-François; Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec J1E 4K8, Canada ; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Montreal, Québec H2X 3Y7, Canada
Hardy, Marie-Pierre; Institute for Research in Immunology and Cancer, Université de Montréal, Montreal, Québec H3C 3J7, Canada
Ehx, Grégory ; Université de Liège - ULiège > Département des sciences cliniques
Lanoix, Joël; Institute for Research in Immunology and Cancer, Université de Montréal, Montreal, Québec H3C 3J7, Canada
Thibault, Pierre ; Institute for Research in Immunology and Cancer, Université de Montréal, Montreal, Québec H3C 3J7, Canada
Perreault, Claude ; Institute for Research in Immunology and Cancer, Université de Montréal, Montreal, Québec H3C 3J7, Canada ; Department of Medicine, Université de Montréal, Montreal, Québec H3C 3J7, Canada
Roucou, Xavier ; Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec J1E 4K8, Canada ; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Montreal, Québec H2X 3Y7, Canada
Language :
English
Title :
OpenCustomDB: Integration of Unannotated Open Reading Frames and Genetic Variants to Generate More Comprehensive Customized Protein Databases.
CIHR - Canadian Institutes of Health Research; FRQS - Fonds de Recherche du Québec - Santé; UdeM - Université de Montréal; LLSC - Leukemia and Lymphoma Society of Canada; Canadian Cancer Society
Funding text :
Computing resources from the Digital Research Alliance of Canada are gratefully acknowledged. We thank Raphaelle Lambert and Jennifer Huber at the genomics facility for RNA-seq, and Patrick Gendron, Eric Audemard, and Genevieve Boucher at the bioinformatic platform for assistance with RNA-seq analysis, at the Institute for Research in Immunology and Cancer of the Université de Montréal. We acknowledge the work of Claude Rondeau and all members of the Quebec Leukemia Cell Bank. This work was supported by a grant from the Canadian Institutes of Health Research (CIHR, PJT-175322) to X.R. and M.A.B., a Canada Research Chair to X.R., a grant from the Canadian Cancer Society (705604) to C.P. and P.T., a grant from the Leukemia and Lymphoma Society of Canada to C.P., and a grant from The Oncopole to C.P., P.T., and X.R. M.A.B. is a Junior 1 research fellow from the Fonds de Recherche du Québec–Santé. G.E. is supported by postdoctoral fellowships from the Institute for Research in Immunology and Cancer of the Université de Montréal, the Fonds de Recherche du Québec–Santé, and the Cole Foundation. Operation of the mp2 supercomputer is funded by the Canada Foundation for Innovation (CFI), le ministère de l’Économie, de la science et de l’innovation du Québec (MESI), and les Fonds de Recherche du Québec.