Data oriented workflow using semantic technologies
Hui Bu, Ran Liu
School of Software, North China University of Water Resources and Electric Power, China
Scientific workflows are a topic of great interest in the Grid community, which sees in the workflow model an attractive paradigm for programming distributed wide-area Grid infrastructures. They have recently emerged as a way for scientists to formalize and structure complex, distributed scientific processes and thereby accelerate scientific discovery. In contrast to business workflows, which are typically control-flow oriented, scientific workflows tend to be dataflow oriented, which introduces a new set of requirements for workflow systems. In this paper, we consider a general workflow setting in which input data sets are processed by a graph of transformations to produce output results. Our goal is to perform efficient selective refresh of elements in the output data, i.e., to compute the latest values of specific output elements when the input data may have changed. We investigate the use of data provenance to enable efficient refresh. The proposed approach captures one-level data provenance at each transformation when the workflow is first run. At refresh time, the provenance is used to determine (transitively) which input elements are responsible for the given output elements, and the workflow is rerun only on the portion of the data needed for the refresh. We report preliminary experimental results on the overhead of provenance capture and on the crossover point between selective refresh and full workflow recomputation.
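The refresh procedure outlined above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper. It assumes a linear chain of transformations rather than a general graph, that input elements change in value but are not added or removed, and that the captured provenance remains valid after the change. All names (`double`, `sum_by_group`, `run_workflow`, `trace_inputs`, `selective_refresh`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Data = Dict[str, int]          # element id -> value
Prov = Dict[str, Set[str]]     # output id -> ids of its direct inputs (one level)
Transform = Callable[[Data], Tuple[Data, Prov]]


def double(data: Data) -> Tuple[Data, Prov]:
    """Hypothetical one-to-one transformation: doubles every element."""
    out = {k: 2 * v for k, v in data.items()}
    prov = {k: {k} for k in data}
    return out, prov


def sum_by_group(data: Data) -> Tuple[Data, Prov]:
    """Hypothetical many-to-one transformation: sums elements whose ids
    share the same first character."""
    out: Data = {}
    prov: Prov = {}
    for k, v in data.items():
        group = k[0]
        out[group] = out.get(group, 0) + v
        prov.setdefault(group, set()).add(k)
    return out, prov


def run_workflow(transforms: List[Transform], inputs: Data) -> Tuple[Data, List[Prov]]:
    """Run every transformation in order, capturing one-level provenance at each."""
    data = inputs
    provs: List[Prov] = []
    for transform in transforms:
        data, prov = transform(data)
        provs.append(prov)
    return data, provs


def trace_inputs(provs: List[Prov], output_ids: Iterable[str]) -> Set[str]:
    """Follow provenance backward, stage by stage, to the responsible input ids."""
    needed = set(output_ids)
    for prov in reversed(provs):
        needed = set().union(*(prov[e] for e in needed))
    return needed


def selective_refresh(transforms: List[Transform], provs: List[Prov],
                      current_inputs: Data, output_ids: Iterable[str]) -> Data:
    """Rerun the workflow only on the inputs responsible for the requested outputs.
    Assumes the provenance captured on the initial run is still valid."""
    wanted = list(output_ids)
    needed = trace_inputs(provs, wanted)
    subset = {k: current_inputs[k] for k in needed}
    data, _ = run_workflow(transforms, subset)
    return {o: data[o] for o in wanted}


# Initial run with provenance capture.
workflow = [double, sum_by_group]
inputs = {"a1": 1, "a2": 2, "b1": 10}
outputs, provs = run_workflow(workflow, inputs)

# One input changes; refresh only output element "a".
inputs["a2"] = 5
refreshed = selective_refresh(workflow, provs, inputs, ["a"])
```

In this sketch, refreshing output `a` touches only inputs `a1` and `a2`, and `b1` is never reprocessed. The saving grows with the fraction of input data that is irrelevant to the requested outputs, which is the intuition behind the crossover point between selective refresh and full recomputation.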