Computation semantics of the functional scientific workflow language Cuneiform*

JÖRGEN BRANDT; WOLFGANG REISIG; ULF LESER

doi:10.1017/S0956796817000119

Part of: Big Data Special Collection

Published online by Cambridge University Press: 24 October 2017

Affiliation (all authors): Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany (e-mails: brandjoe@informatik.hu-berlin.de, reisig@informatik.hu-berlin.de, leser@informatik.hu-berlin.de)

Abstract

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view of external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines. This paper introduces the computation semantics for Cuneiform. It presents Cuneiform's abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of the behavior of Cuneiform eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform's syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language's safety up to black-box operators. Finally, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.
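
The combination of black-box operators and data-parallel higher-order operators described in the abstract can be illustrated with a minimal sketch. This is plain Python, not Cuneiform syntax, and it does not reproduce the paper's semantics or its Erlang reference implementation. The names `black_box`, `par_map`, and `count_lines` are hypothetical and chosen only for illustration. The sketch assumes that each partition can be processed independently, which is what makes the evaluation parallelizable.

```python
from concurrent.futures import ThreadPoolExecutor


def black_box(name, fn):
    """Wrap a foreign operator (hypothetical helper).

    The workflow layer only calls the operator with arguments and receives
    a result; it never inspects how the operator works internally.
    """
    def task(*args):
        return fn(*args)
    task.__name__ = name
    return task


def par_map(task, partitions, max_workers=4):
    """Apply `task` to every partition independently.

    Executor.map returns results in input order, so the output list lines
    up with the input partitions even though calls may run concurrently.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, partitions))


# Hypothetical operator standing in for an external tool or script.
count_lines = black_box("count_lines", lambda text: len(text.splitlines()))

# A small partitioned data set: each string plays the role of one file.
partitions = ["a\nb\n", "c\n", "d\ne\nf\n"]

counts = par_map(count_lines, partitions)
total = sum(counts)
print(counts, total)
```

Because the wrapped operator has no visible side effects on the workflow state, the order in which the partitions are processed does not affect the result. That independence is the property a parallel evaluation strategy can exploit.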

Information

Type: Research Article
Information: Journal of Functional Programming, Volume 27, 2017, e22

DOI: https://doi.org/10.1017/S0956796817000119
Copyright: Copyright © Cambridge University Press 2017

Footnotes

This work is funded by the EU FP7 project “Scalable, Secure Storage and Analysis of Biobank Data” under Grant Agreement no. 317871. We also acknowledge funding by the Humboldt Graduate School GRK 1651: SOAMED.
