To organize the methodologies found in (recent) papers on the development of genomic breed/population assignment tools, this review highlights good practice for developing such tools. After appropriate quality control of markers and the building of a representative reference population, three main steps can be followed to develop a genomic breed/population assignment tool: 1) the selection of discriminant markers; 2) the development of a model that accurately assigns animals to their breed/population of origin, the so-called classification step; and 3) the validation of the developed model on new animals to evaluate its performance in real conditions. The first step can be skipped when a mid- or low-density chip is used, depending on the assignment methodology. When selection of SNPs is necessary, we advise using one-stage methodologies and defining a threshold for this selection. Machine learning can then be used to develop the model itself, based on the selected or available markers. To tune the model, we recommend cross-validation. Finally, new animals, not used in the first two steps, should be used to evaluate the performance of the model (e.g., with balanced accuracy and probabilities) as well as its computation time.
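The three steps described in the abstract can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration and is not taken from the review: the genotypes are simulated, the selection criterion (difference in mean genotype between breeds) and the candidate thresholds are arbitrary, and a simple nearest-mean classifier stands in for the machine learning model. Assignment probabilities are not covered. The sketch shows selection with a threshold, tuning of that threshold by cross-validation with the selection redone inside each fold, and a final evaluation on animals kept out of the first two steps, using balanced accuracy and the prediction time.

```python
import random
import time

def simulate(n_per_breed, freqs_by_breed, rng):
    """Draw 0/1/2 genotypes for each breed from per-SNP allele frequencies."""
    X, y = [], []
    for breed, freqs in freqs_by_breed.items():
        for _ in range(n_per_breed):
            X.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            y.append(breed)
    return X, y

def breed_means(X, y, snps):
    """Mean genotype per breed at the given SNPs (the fitted 'model')."""
    means = {}
    for breed in sorted(set(y)):
        rows = [x for x, b in zip(X, y) if b == breed]
        means[breed] = [sum(r[j] for r in rows) / len(rows) for j in snps]
    return means

def select_snps(X, y, threshold):
    """Step 1: keep SNPs whose between-breed gap in mean genotype exceeds the threshold."""
    all_snps = range(len(X[0]))
    means = breed_means(X, y, all_snps)
    keep = []
    for j in all_snps:
        values = [m[j] for m in means.values()]
        if max(values) - min(values) > threshold:
            keep.append(j)
    return keep

def predict(means, snps, x):
    """Step 2: assign an animal to the breed with the closest mean genotype profile."""
    def dist(breed):
        return sum((x[j] - m) ** 2 for j, m in zip(snps, means[breed]))
    return min(means, key=dist)

def balanced_accuracy(y_true, y_pred):
    """Average of per-breed recall, so small breeds weigh as much as large ones."""
    recalls = []
    for breed in set(y_true):
        idx = [i for i, b in enumerate(y_true) if b == breed]
        recalls.append(sum(y_pred[i] == breed for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def cross_validate(X, y, threshold, k, rng):
    """Mean balanced accuracy over k folds; SNPs are reselected within each fold."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    scores = []
    for f in range(k):
        test = set(idx[f::k])
        X_tr = [X[i] for i in idx if i not in test]
        y_tr = [y[i] for i in idx if i not in test]
        snps = select_snps(X_tr, y_tr, threshold)
        means = breed_means(X_tr, y_tr, snps)
        y_te = [y[i] for i in sorted(test)]
        y_hat = [predict(means, snps, X[i]) for i in sorted(test)]
        scores.append(balanced_accuracy(y_te, y_hat))
    return sum(scores) / k

rng = random.Random(42)
# Hypothetical allele frequencies: 10 informative SNPs, 40 uninformative ones.
freqs = {
    "breed_A": [0.8] * 10 + [0.5] * 40,
    "breed_B": [0.2] * 10 + [0.5] * 40,
}
X_ref, y_ref = simulate(40, freqs, rng)   # reference population
X_new, y_new = simulate(20, freqs, rng)   # new animals, used only in step 3

# Tune the selection threshold by 5-fold cross-validation on the reference animals.
cv_scores = {t: cross_validate(X_ref, y_ref, t, 5, rng) for t in (0.2, 0.5, 0.8)}
best_threshold = max(cv_scores, key=cv_scores.get)

# Refit on the whole reference population, then validate on the new animals.
snps = select_snps(X_ref, y_ref, best_threshold)
means = breed_means(X_ref, y_ref, snps)
start = time.perf_counter()
y_pred = [predict(means, snps, x) for x in X_new]
elapsed = time.perf_counter() - start
score = balanced_accuracy(y_new, y_pred)
print(f"threshold={best_threshold}, SNPs kept={len(snps)}, "
      f"balanced accuracy={score:.3f}, prediction time={elapsed:.4f}s")
```

Redoing the SNP selection inside each fold keeps the held-out animals from influencing which markers are chosen, which would otherwise inflate the cross-validated score.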
Disciplines :
Agriculture & agronomy
Author, co-author :
Wilmot, Hélène ; Université de Liège - ULiège > Département GxABT > Animal Sciences (AS)
Gengler, Nicolas ; Université de Liège - ULiège > TERRA Research Centre > Animal Sciences (AS)
Language :
English
Title :
Good practice for assignment of breeds and populations—a review
Funders :
F.R.S.-FNRS - Fonds de la Recherche Scientifique
SPW DG03-DGARNE - Service Public de Wallonie. Direction Générale Opérationnelle Agriculture, Ressources naturelles et Environnement
EU - European Union
Funding text :
The author(s) declare that financial support was received for the
research, authorship, and/or publication of this article. H. Wilmot, as
a former Research Fellow, and N. Gengler, as a former Senior Research
Associate, acknowledge the support of the Fonds de la Recherche
Scientifique – FNRS (Brussels, Belgium). The Walloon Government
(Service Public de Wallonie – Direction Générale Opérationnelle
Agriculture, Ressources Naturelles et Environnement, SPW-DGARNE;
Namur, Belgium) is acknowledged for its financial
support. The authors gratefully acknowledge the support of the
project 'Rotbunt DN' funded under the European Innovation
Partnership (EIP Agri) Schleswig-Holstein, Germany, through the
European Agricultural Fund for Rural Development (EAFRD).