4 - Data Management Architectures

Published online by Cambridge University Press:  05 December 2012

Terence Critchlow
Affiliation:
Pacific Northwest National Laboratory
Ghaleb Abdulla
Affiliation:
Lawrence Livermore National Laboratory
Jacek Becla
Affiliation:
Stanford University
Kerstin Kleese-Van Dam
Affiliation:
Pacific Northwest National Laboratory
Sam Lang
Affiliation:
Pacific Northwest National Laboratory
Deborah L. McGuinness
Affiliation:
Rensselaer Polytechnic Institute
Ian Gorton
Affiliation:
Pacific Northwest National Laboratory, Washington
Deborah K. Gracio
Affiliation:
Pacific Northwest National Laboratory, Washington

Summary

Data management is the organization of information to support efficient access and analysis. For data-intensive computing applications, the speed at which relevant data can be accessed limits the size and complexity of the computation that can be performed. Data access speed depends on the size of the relevant subset of the data, the complexity of the query used to define it, and the layout of the data relative to the query. As the underlying data sets become increasingly complex, the questions asked of them become more involved as well. For example, geospatial data associated with a city is no longer limited to the map data representing its streets, but now also includes layers identifying utility lines, key points, locations and types of businesses within the city limits, tax information for each land parcel, satellite imagery, and possibly even street-level views. As a result, queries have gone from simple questions, such as “How long is Main Street?”, to much more complex questions, such as “Taking all other factors into consideration, are the property values of houses near parks higher than those under power lines, and if so, by what percentage?” Answering these questions requires a coherent infrastructure that integrates the relevant data into a format optimized for the questions being asked.
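As a rough illustration of the kind of integrated query described above, the following sketch joins a tax layer with a spatial-feature layer and computes the percentage difference in average assessed value. It uses Python's standard `sqlite3` module as a stand-in; the table names (`parcels`, `parcel_features`), the column names, and the four toy rows are all assumptions made for this example and do not come from the chapter. It also ignores the "all other factors" part of the question, which would need a statistical model rather than a plain average.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical layers."""
    conn = sqlite3.connect(":memory:")
    # Tax layer: one assessed value per land parcel (toy numbers).
    conn.execute(
        "CREATE TABLE parcels (parcel_id INTEGER PRIMARY KEY, assessed_value REAL)"
    )
    # Spatial layer, assumed to be precomputed: which parcels are near
    # a park and which lie under a power line.
    conn.execute("CREATE TABLE parcel_features (parcel_id INTEGER, feature TEXT)")
    conn.executemany(
        "INSERT INTO parcels VALUES (?, ?)",
        [(1, 300000.0), (2, 340000.0), (3, 240000.0), (4, 260000.0)],
    )
    conn.executemany(
        "INSERT INTO parcel_features VALUES (?, ?)",
        [
            (1, "near_park"),
            (2, "near_park"),
            (3, "under_power_line"),
            (4, "under_power_line"),
        ],
    )
    return conn


def percent_premium(conn):
    """Percentage by which the near-park average exceeds the power-line average."""
    park_avg, line_avg = conn.execute(
        """
        SELECT
            AVG(CASE WHEN f.feature = 'near_park' THEN p.assessed_value END),
            AVG(CASE WHEN f.feature = 'under_power_line' THEN p.assessed_value END)
        FROM parcels AS p
        JOIN parcel_features AS f ON f.parcel_id = p.parcel_id
        """
    ).fetchone()
    return 100.0 * (park_avg - line_avg) / line_avg


if __name__ == "__main__":
    print(f"Near-park premium: {percent_premium(build_demo_db()):.1f}%")
```

The query is only cheap to express because both layers already share a `parcel_id` key; bringing independently collected layers into such a common form is the integration work the summary refers to.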

Data management is critical to supporting analysis because, for large data sets, reading the entire collection is simply not feasible. Instead, the relevant subset of the data must be efficiently described, identified, and retrieved. As a result, the data management approach taken effectively defines the analysis that can be efficiently performed over the data.
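The point that data layout determines which analyses are efficient can be sketched with a small experiment. The example below, again using `sqlite3` as an assumed stand-in rather than any system named in the chapter, asks the database for its query plan before and after an index is created on the column that defines the relevant subset. The `parcels` table, its `zone` column, and the index name `idx_parcels_zone` are hypothetical.

```python
import sqlite3

# Hypothetical query: average assessed value within one zone.
SQL = "SELECT AVG(assessed_value) FROM parcels WHERE zone = ?"


def build_parcels(n=10000):
    """Create an in-memory table of n toy parcels spread over 100 zones."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parcels ("
        "parcel_id INTEGER PRIMARY KEY, zone TEXT, assessed_value REAL)"
    )
    conn.executemany(
        "INSERT INTO parcels (zone, assessed_value) VALUES (?, ?)",
        ((f"zone{i % 100}", 100000.0 + i) for i in range(n)),
    )
    return conn


def query_plan(conn, sql, params=()):
    """Return SQLite's human-readable plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The textual description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def add_zone_index(conn):
    """Lay the data out for zone-based queries by indexing the zone column."""
    conn.execute("CREATE INDEX idx_parcels_zone ON parcels (zone)")


if __name__ == "__main__":
    conn = build_parcels()
    print("before:", query_plan(conn, SQL, ("zone7",)))
    add_zone_index(conn)
    print("after: ", query_plan(conn, SQL, ("zone7",)))
```

Without the index the plan is a scan of the whole table; with it, the plan becomes a search that touches only the matching rows. The answer is the same in both cases, but only the second layout stays practical as the collection grows.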

Type: Chapter
Information: Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 48-84
Publisher: Cambridge University Press
Print publication year: 2012


