Computation semantics of the functional scientific workflow language Cuneiform*

Published online by Cambridge University Press:  24 October 2017

JÖRGEN BRANDT, WOLFGANG REISIG and ULF LESER
Affiliation:
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany (e-mails: brandjoe@informatik.hu-berlin.de, reisig@informatik.hu-berlin.de, leser@informatik.hu-berlin.de)

Abstract

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view on external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines. This paper introduces the computation semantics for Cuneiform. It presents Cuneiform's abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of the behavior of Cuneiform eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform's syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language's safety up to black-box operators. Finally, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2017 

Footnotes

*

This work is funded by the EU FP7 project “Scalable, Secure Storage and Analysis of Biobank Data” under Grant Agreement no. 317871. We also acknowledge funding by the Humboldt Graduate School GRK 1651: SOAMED.
