Abstract
We can discover the effective similarity among pairs of finite objects and denoise a finite object using the Kolmogorov complexity of these objects. The drawback is that the Kolmogorov complexity is not computable. If we approximate it, using a good real-world compressor, then it turns out that on natural data the processes give adequate results in practice. The methodology is parameter-free, alignment-free and works on individual data. We illustrate both methods with examples.
Footnotes
References
- 1
Bennett CH, Gács P, Li M, Vitányi PMB& Zurek W . 1998 Information distance. IEEE Trans. Inf. Theory 44, 1407–1423.doi:10.1109/18.681318 (doi:10.1109/18.681318). Crossref, Web of Science, Google Scholar - 2
Li M, Chen X, Li X, Ma B& Vitányi PMB . 2004 The similarity metric. IEEE Trans. Inf. Theory 50, 3250–3264.doi:10.1109/TIT.2004.838101 (doi:10.1109/TIT.2004.838101). Crossref, Web of Science, Google Scholar - 3
Vereshchagin NK& Vitányi PMB . 2004 Kolmogorov's structure functions and model selection. IEEE Trans. Inf. Theory 50, 3265–3290.doi:10.1109/TIT.2004.838346 (doi:10.1109/TIT.2004.838346). Crossref, Web of Science, Google Scholar - 4
Cilibrasi RL& Vitányi PMB . 2005 Clustering by compression. IEEE Trans. Inf. Theory 51, 1523–1545.doi:10.1109/TIT.2005.844059 (doi:10.1109/TIT.2005.844059). Crossref, Web of Science, Google Scholar - 5
de Rooij S& Vitányi PMB . 2012 Approximating rate–distortion graphs of individual data: experiments in lossy compression and denoising. IEEE Trans. Comput. 61, 395–407.doi:10.1109/TC.2011.25 (doi:10.1109/TC.2011.25). Crossref, Web of Science, Google Scholar - 6
Kolmogorov AN . 1965 Three approaches to the quantitative definition of information. Problems Inf. Transm. 1, 1–7. Google Scholar - 7
Cebrian M, Alfonseca M& Ortega A . 2007 The normalized compression distance is resistant to noise. IEEE Trans. Inf. Theory 53, 1895–1900.doi:10.1109/TIT.2007.894669 (doi:10.1109/TIT.2007.894669). Crossref, Web of Science, Google Scholar - 8
Long C, Zhu X, Li M& Ma B . 2008 Information shared by many objects. Proc. 17th ACM Conf. on Information and Knowledge Management, Napa Valley, CA, 26–30 October, pp 1213–1220. New York, NY: ACM. doi:10.1145/1458082.1458242 (doi:10.1145/1458082.1458242). Google Scholar - 9
Vitányi PMB . 2011 Information distance in multiples. IEEE Trans. Inf. Theory 57, 2451–2456.doi:10.1109/TIT.2011.2110130 (doi:10.1109/TIT.2011.2110130). Crossref, Web of Science, Google Scholar - 10
Cilibrasi RL& Vitányi PMB . 2011 A fast quartet tree heuristic for hierarchical clustering. Pattern Recognit. 44, 662–677.doi:10.1016/j.patcog.2010.08.033 (doi:10.1016/j.patcog.2010.08.033). Crossref, Web of Science, Google Scholar - 11
- 12
Tzanetakis G& Cook P . 2002 Music genre classification of audio signals. IEEE Trans. Speech Audio Process. 10, 293–302.doi:10.1109/TSA.2002.800560 (doi:10.1109/TSA.2002.800560). Crossref, Google Scholar - 13
Grimaldi M, Kokaram A& Cunningham P . 2002 Classifying music by genre using the wavelet packet transform and a round-robin ensemble. Technical Report no. TCD-CS-2002-64 Trinity College Dublin. Google Scholar - 14
Dannenberg R, Thom B& Watson D . 1997 A machine learning approach to musical style recognition. Proc. 1997 Int. Comput. Music Conf., pp. 344–347. Ann Arbor, MI: MPublishing, University of Michigan Library. (http://quod.lib.umich.edu/i/icmc/bbp2372.1997.090/–machine-learning-approach-to-musical-style-recognition). Google Scholar - 15
Chai W& Vercoe B . 2001 Folk music classification using hidden Markov models. Proc. Int. Conf. Artif. Intell, Las Vegas, NV . (http://en.scientificcommons.org/43297168). Google Scholar - 16
Scott P . 2001 Music classification using neural networks. Manuscript, EE 373B Project, Stanford University, CA. (http://cuip.net/∼dloquinte/researchfiles/musicclassification.pdf). Google Scholar - 17
Vitányi PMB, Balbach FJ, Cilibrasi RL& Li M . 2009 Normalized information distance. Information theory and statistical learning (eds, Emmert-Streib F& Dehmer M ), pp. 45–82. New York, NY: Springer. Crossref, Google Scholar - 18
Terwijn SA, Torenvliet L& Vitányi PMB . 2011 Nonapproximability of the normalized information distance. J. Comput. Syst. Sci. 77, 738–742.doi:10.1016/j.jcss.2010.06.018 (doi:10.1016/j.jcss.2010.06.018). Crossref, Web of Science, Google Scholar - 19
Cilibrasi RL, de Wolf R& Vitányi PMB . 2004 Algorithmic clustering of music based on string compression. Comput. Music J. 28, 49–67.doi:10.1162/0148926042728449 (doi:10.1162/0148926042728449). Crossref, Web of Science, Google Scholar - 20
Wehner S . 2007 Analyzing worms and network traffic using compression. J. Comput. Security 15, 303–320 (http://dl.acm.org/citation.cfm?id=1370630). Google Scholar - 21
Zhang X, Hao Y, Zhu X-Y& Li M . 2008 New information distance measure and its application in question answering system. J. Comput. Sci. Technol. 23, 557–572.doi:10.1007/s11390-008-9152-9 (doi:10.1007/s11390-008-9152-9). Crossref, Web of Science, Google Scholar - 22
Cleary JG& Witten IH . 1984 Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32, 396–402.doi:10.1109/TCOM.1984.1096090 (doi:10.1109/TCOM.1984.1096090). Crossref, Web of Science, Google Scholar - 23
Cebrian M, Alfonseca M& Ortega A . 2005 Common pitfalls using the normalized compression distance: what to watch out for in a compressor. Commun. Inf. Syst. 5, 367–384 (http://projecteuclid.org/euclid.cis/1175791028). Google Scholar - 24
Rannalal B& Yang Z . 2008 Phylogenetic inference using whole genomes. Annu. Rev. Genomics Hum. Genet. 9, 217–231.doi:10.1146/annurev.genom.9.081307.164407 (doi:10.1146/annurev.genom.9.081307.164407). Crossref, PubMed, Web of Science, Google Scholar - 25
Cao Y, Janke A, Waddell PJ, Westerman M, Takenaka O, Murata S, Okada N, Pbo S& Hasegawa M . 1998 Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian orders. J. Mol. Evol. 47, 307–322.doi:10.1007/PL00006389 (doi:10.1007/PL00006389). Crossref, PubMed, Web of Science, Google Scholar - 26
Keogh E, Lonardi S& Ratanamahatana CA . 2004 Towards parameter-free data mining. Proc. 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Seattle, WA, 22–24 August, pp. 206–215 New York, NY: ACM. doi:10.1145/1014052.1014077 (doi:10.1145/1014052.1014077). Google Scholar - 27
Ksiazek TG, 2003 A novel coronavirus associated with severe acute respiratory syndrome. New Engl. J. Med. 348, 1953–1956.doi:10.1056/NEJMoa030781 (doi:10.1056/NEJMoa030781). Crossref, PubMed, Web of Science, Google Scholar - 28
Costa Santos C, Bernardes J, Vitányi PMB& Antunes L . 2006 Clustering fetal heart rate tracings by compression. Proc. 19th IEEE Int. Symp. on Computer-Based Medical Systems, Salt Lake City, UT, 22–23 June, pp. 685–690. Los Alamitos, CA: IEEE Computer Society Press. doi:10.1109/CBMS.2006.68 (doi:10.1109/CBMS.2006.68). Google Scholar - 29
Belloni T, Klein-Wolt M, Méndez M, van der Klis M& van Paradijs J . 2000 A model-independent analysis of the variability of GRS 1915+105. Astron. Astrophys. 355, 271–290 (http://adsabs.harvard.edu/full/2000A%26A…355..271B). Google Scholar - 30
Cilibrasi RL& Vitányi PMB . 2007 The Google similarity distance. IEEE Trans. Knowledge Data Eng. 19, 370–383.doi:10.1109/TKDE.2007.48 (doi:10.1109/TKDE.2007.48). Crossref, Web of Science, Google Scholar - 31
Chang C-C& Lin C-J . 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 3.doi:10.1145/1961189.1961199 (doi:10.1145/1961189.1961199). Crossref, Web of Science, Google Scholar - 32
Oliveira LS, Sabourin R, Bortolozzi F& Suen CY . 2002 Automatic recognition of handwritten numerical strings: a recognition and verification strategy. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1438–1454.doi:10.1109/TPAMI.2002.1046154 (doi:10.1109/TPAMI.2002.1046154). Crossref, Web of Science, Google Scholar - 33
Shannon CE . 1948 The mathematical theory of communication. Bell Syst. Tech. J. 27, 623–656. Crossref, Google Scholar - 34
Shannon CE . 1959 Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, Part 4 (Coding Theorems), pp. 142–163 Institute of Radio Engineers. Google Scholar - 35
Vereshchagin NK& Vitányi PMB . 2010 Rate distortion and denoising of individual data using Kolmogorov complexity. IEEE Trans. Inf. Theory 56, 3438–3454.doi:10.1109/TIT.2010.2048491 (doi:10.1109/TIT.2010.2048491). Crossref, Web of Science, Google Scholar - 36
Levenshtein VI . 1966 Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710. Google Scholar - 37
Turing AM . 1936 On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42, 230–265 ; ‘Correction’ 43 (1937), 544–546. Google Scholar - 38
Li M& Vitányi PMB . 2008 An introduction to Kolmogorov complexity and its applications 3rd edn. New York, NY: Springer. Crossref, Google Scholar - 39
Li M, Badger JH, Chen X, Kwong S, Kearney P& Zhang H . 2001 An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149–154.doi:10.1093/bioinformatics/17.2.149 (doi:10.1093/bioinformatics/17.2.149). Crossref, PubMed, Web of Science, Google Scholar