Clustering of a Number of Genes Affecting in Milk Production using Information Theory and Mutual Information

Dehghanzadeh, Houshang; Mirhoseini, Seyed Ziaeddin; Ghaderi-Zefrehei, Mostafa; Tavakoli, Hassan; Esmaeilkhaniyan, Saeid

doi:10.29252/rap.10.23.117

Volume 10, Issue 23 (5-2019) rap 2019, 10(23): 117-132

Dehghanzadeh H, Mirhoseini S Z, Ghaderi-Zefrehei M, Tavakoli H, Esmaeilkhaniyan S. (2019). Clustering of a Number of Genes Affecting in Milk Production using Information Theory and Mutual Information. rap. 10(23), 117-132. doi:10.29252/rap.10.23.117
URL: http://rap.sanru.ac.ir/article-1-790-en.html

Guilan Agricultural and Natural Resources Research and Education Center

Abstract:

Information theory, a branch of mathematics, is used in genetic and bioinformatic analyses and can be applied to many analyses of biological structures and sequences. Computational grouping of genes facilitates genetic, sequencing and structure-based analyses. In this study, gene and exon DNA sequences affecting milk yield in dairy cattle were retrieved, and the entropy of orders one to four was calculated for each gene and its exons. Mutual information was then calculated to extract the distances between genes. The mutual information results for the DNA and exon sequences were used as input to seven general clustering algorithms, and the AdaBoost algorithm was used to aggregate the clustering results. Finally, the AdaBoost results were examined with the GeneMANIA prediction server to assess them from a gene annotation point of view. The integrated result of the clustering algorithms obtained with AdaBoost, represented as a gene tree, indicated that the proposed method grouped the set of genes in a biologically meaningful way, as supported by their gene annotation in GeneMANIA. We believe that the proposed method may be used alongside other competitive DNA-based clustering methods and can therefore be used to group sets of genes in other species.
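As a rough illustration of the two quantities named in the abstract, the following Python sketch computes the Shannon entropy of overlapping k-mers for orders one to four and a plug-in estimate of mutual information from symbol pairs. The abstract does not state which estimators, sequence pairing, or distance definition the authors used, so the function names (`block_entropy`, `mutual_information`, `lagged_mi`), the overlapping k-mer counting, the within-sequence lag pairing, and the toy sequence are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def block_entropy(seq, k):
    """Shannon entropy (bits) of the overlapping k-mer distribution of seq.

    Assumption: 'entropy of order k' is read as the entropy of k-mers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mutual_information(pairs):
    """Plug-in mutual information (bits) from an iterable of (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def lagged_mi(seq, lag):
    """Mutual information between symbols `lag` positions apart in one sequence.

    Hypothetical pairing choice; the paper's pairing scheme is not given.
    """
    return mutual_information(zip(seq, seq[lag:]))


if __name__ == "__main__":
    toy = "ACGTACGTACGTACGT"  # toy sequence, not data from the study
    for k in range(1, 5):
        print("order", k, "entropy:", round(block_entropy(toy, k), 3))
    print("lag-1 mutual information:", round(lagged_mi(toy, 1), 3))
```

Pairwise values of this kind could be arranged into a gene-by-gene matrix and passed to a clustering routine, but how the study converted mutual information into distances is not described in the abstract.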

Keywords: Dairy cattle, Entropy, Gene clustering, Information theory, Mutual information

Type of Study: Research | Subject: Poultry genetics and breeding
Received: 2017/10/6 | Revised: 2019/05/25 | Accepted: 2018/09/22 | Published: 2019/05/22

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Research On Animal Production

Related Websites

Site Keywords

Vote