Arabidopsis thaliana; Cis-regulatory elements; Flavonoid target genes; Genomics; Plant organs; Predictive modeling; R package; Transcription Factors; Chromatin; Protein Binding/genetics; Plants/metabolism; Chromatin/genetics; Binding Sites/genetics; Transcription Factors/genetics; Transcription Factors/metabolism; Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis; Binding Sites; Plants; Protein Binding; Physiology; Plant Science; Cell Biology; General Medicine
Abstract :
[en] The identification of transcription factor (TF) target genes is central to biology. A popular approach is based on locating potential cis-regulatory elements (CREs) by pattern matching. During the last few years, tools integrating next-generation sequencing data have been developed to improve the performance of pattern matching. However, such tools have not yet been comprehensively evaluated in plants. Hence, we developed a new streamlined method aiming at predicting CREs and target genes of plant TFs in specific organs or conditions. Our approach implements a supervised machine learning strategy, which allows decision rule models to be learnt using TF ChIP-chip/seq experimental data. Different layers of genomic features were integrated in predictive models: the position on the gene, the DNA sequence conservation, the chromatin state and various CRE footprints. Among the tested features, the chromatin features were crucial for improving the accuracy of the method. Furthermore, we evaluated the transferability of predictive models across TFs, organs and species. Finally, we validated our method by correctly inferring the target genes of key TFs controlling metabolite biosynthesis at the organ level in Arabidopsis. We developed a tool, Wimtrap, to reproduce our approach in plant species and conditions/organs for which ChIP-chip/seq data are available. Wimtrap is a user-friendly R package that supports an R Shiny web interface and is provided with pre-built models that can be used to quickly get predictions of CREs and TF gene targets in different organs or conditions in Arabidopsis thaliana, Solanum lycopersicum, Oryza sativa and Zea mays.
Disciplines :
Biotechnology
Author, co-author :
Rivière, Quentin ; Brussels Bioengineering School, Laboratory of Plant Physiology and Molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
Corso, Massimiliano ; Brussels Bioengineering School, Laboratory of Plant Physiology and Molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium ; INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), Université Paris-Saclay, Versailles 78000, France
Ciortan, Madalina ; Interuniversity Institute of Bioinformatics in Brussels, Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
Noël, Grégoire ; Université de Liège - ULiège > Département GxABT > Gestion durable des bio-agresseurs
Verbruggen, Nathalie ; Brussels Bioengineering School, Laboratory of Plant Physiology and Molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
Defrance, Matthieu ; Interuniversity Institute of Bioinformatics in Brussels, Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
Language :
English
Title :
Exploiting Genomic Features to Improve the Prediction of Transcription Factor-Binding Sites in Plants.