Introduction

Gonzalo Navarro

doi:10.1017/CBO9781316588284.002

1 - Introduction

Published online by Cambridge University Press: 05 September 2016

Affiliation: Universidad de Chile

Summary

Why Compact Data Structures?

Google's stated mission, “to organize the world's information and make it universally accessible and useful,” could not better capture the immense ambition of modern society for gathering all kinds of data and putting them to use to improve our lives. We are collecting not only huge amounts of data from the physical world (astronomical, climatological, geographical, biological), but also human-generated data (voice, pictures, music, video, books, news, Web contents, emails, blogs, tweets) and society-based behavioral data (markets, shopping, traffic, clicks, Web navigation, likes, friendship networks).

Our hunger for more and more information is flooding our lives with data. Technology is improving and our ability to store data is growing fast, but the data we are collecting also grow fast – in many cases faster than our storage capacities. While our ability to store the data in secondary or perhaps tertiary storage does not yet seem to be compromised, performing the desired processing of these data in the main memory of computers is becoming more and more difficult. Since accessing a datum in main memory is about 10^5 times faster than on disk, operating in main memory is crucial for carrying out many data-processing applications.

In many cases, the problem is not so much the size of the actual data, but that of the data structures that must be built on the data in order to efficiently carry out the desired processing or queries. In some cases the data structures are one or two orders of magnitude larger than the data! For example, the DNA of a human genome, of about 3.3 billion bases, requires slightly less than 800 megabytes if we use only 2 bits per base (A, C, G, T), which fits in the main memory of any desktop PC. However, the suffix tree, a powerful data structure used to efficiently perform sequence analysis on the genome, requires at least 10 bytes per base, that is, more than 30 gigabytes.
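The arithmetic behind the genome example can be checked with a short sketch. It uses only the figures quoted above (3.3 billion bases, 2 bits per base, 10 bytes per base for the suffix tree); the function names are hypothetical, and reading "megabytes" and "gigabytes" as binary units (MiB, GiB) is an assumption made here so that the quoted bounds hold.

```python
# Back-of-the-envelope sizes for the human genome example.
BASES = 3_300_000_000             # about 3.3 billion bases, as quoted in the text
BITS_PER_BASE = 2                 # A, C, G, T fit in 2 bits
SUFFIX_TREE_BYTES_PER_BASE = 10   # lower bound quoted in the text

MIB = 2**20  # assumption: "megabytes" read as binary mebibytes
GIB = 2**30  # assumption: "gigabytes" read as binary gibibytes


def packed_genome_bytes(n_bases: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store n_bases symbols packed at bits_per_base bits each."""
    return (n_bases * bits_per_base + 7) // 8  # round up to whole bytes


def suffix_tree_bytes(n_bases: int,
                      bytes_per_base: int = SUFFIX_TREE_BYTES_PER_BASE) -> int:
    """Lower bound on suffix tree size at bytes_per_base bytes per symbol."""
    return n_bases * bytes_per_base


genome = packed_genome_bytes(BASES)
tree = suffix_tree_bytes(BASES)

print(f"packed genome: {genome / MIB:.0f} MiB")
print(f"suffix tree:   {tree / GIB:.1f} GiB")
print(f"ratio:         {tree / genome:.0f}x")
```

Under these assumptions the packed sequence stays below 800 MiB while the suffix tree exceeds 30 GiB, a factor of 40 between the data and the structure built on it.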

The main techniques to cope with the growing size of data over recent years can be classified into three families:

Efficient secondary-memory algorithms. While accessing a random datum from disk is comparatively very slow, subsequent data are read much faster, only 100 times slower than from main memory. Therefore, algorithms that minimize the random accesses to the data can perform reasonably well on disk.
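A toy cost model makes the point about random accesses concrete. It is only a sketch: it takes one main-memory access as the unit, assumes a random disk access costs on the order of 10^5 units and each subsequent datum about 100 units, and ignores block sizes, caching, and every other real-world effect. The function `disk_cost` and its parameters are hypothetical names introduced for illustration.

```python
# Relative access costs, with one main-memory access as the unit.
DISK_RANDOM = 10**5      # assumed cost of fetching a random datum from disk
DISK_SEQUENTIAL = 100    # assumed cost of each subsequent (contiguous) datum


def disk_cost(n_items: int, n_runs: int) -> int:
    """Cost of reading n_items from disk in n_runs contiguous runs.

    Each run pays one random access for its first item and the
    sequential rate for every remaining item in that run.
    """
    if not 1 <= n_runs <= n_items:
        raise ValueError("n_runs must be between 1 and n_items")
    return n_runs * DISK_RANDOM + (n_items - n_runs) * DISK_SEQUENTIAL


n = 1_000_000
scattered = disk_cost(n, n_runs=n)  # every access is a random access
scanned = disk_cost(n, n_runs=1)    # one random access, then a sequential scan

print(f"scattered reads: {scattered:,} units")
print(f"single scan:     {scanned:,} units")
print(f"speed-up:        about {scattered // scanned}x")
```

In this model, an algorithm that reads the same million items in one sequential pass is roughly three orders of magnitude cheaper than one that touches them in random order, which is the gap that secondary-memory algorithms try to exploit.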

Type: Chapter
Information: Compact Data Structures: A Practical Approach, pp. 1-13

Publisher: Cambridge University Press

Print publication year: 2016
