Provenance Challenge: University of Manchester, School of Computer Science
Team and Project Details
Project overview: The PC3 project is undertaken as part of the myGrid project, and in particular of the Taverna scientific workflow management system. Taverna 2.1, soon to be released, will offer a provenance API that allows third-party developers to access the Taverna provenance DB. In particular, the API supports queries on the provenance graph for a workflow run (one run at a time), by specifying (a) the workflow ports for which provenance information is sought, and (b) the workflow tasks ("processors") where provenance information should be reported. This gives users the option to focus provenance information only on the portions of the provenance graph that are of interest.
My immediate plan is to incorporate OPM graph generation functionality into the existing provenance query algorithm. This will be followed by a corresponding import functionality into our internal (relational) provenance model. The former is done, and the OPM graphs below are an example. The latter is still in the works and will hopefully be ready in time for the PC3 f2f meeting.
The ability to fine-tune a provenance query means that Taverna can generate and export only the portion of the entire OPM provenance graph that is relevant to answering the query. In practice, the query is answered by our system, and the OPM graph represents the answer to the query. Examples of this are also given below. Note that the entire provenance graph is obtained simply by completely "unfocusing" the query, i.e., it corresponds to the degenerate query on all the output ports and all the workflow tasks.
Workflow representation
The PAN-STARRS workflow is defined as a sequence of steps with exit conditions. While this is naturally expressed using an imperative language, the Taverna workflow model is that of a dataflow, i.e., the entire workflow execution is data-driven and there are (essentially) no control structures. This makes the implementation of the PAN-STARRS workflow less than obvious, in that, for example, one cannot simply "halt" a workflow in response to a data error condition. Taverna's treatment of data errors involves propagating the error from the point it is generated, through the remaining dataflow graph. Errors are treated just like any other piece of data, but can be distinguished as representing an error condition. Thus, the workflow continues to execute, but the workflow controller prevents Taverna processors (i.e., workflow tasks) from executing when any of the inputs are errors, immediately returning the same error values instead.
Thus, the overall behaviour is that of a dataflow where errors are detected and propagated, but do not halt the execution.
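The propagation behaviour described above can be sketched in a few lines of Python. This is only an illustration, not Taverna code: the `ErrorValue` class, the `run_processor` helper and the two processor functions are hypothetical names introduced for this example, and the file name is made up.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ErrorValue:
    """Hypothetical stand-in for an error token: ordinary data marked as an error."""
    description: str


def run_processor(func: Callable[..., Any], *inputs: Any) -> Any:
    """Run func unless some input is an error; in that case return that error unchanged."""
    for value in inputs:
        if isinstance(value, ErrorValue):
            return value  # the processor is skipped and the error keeps flowing
    try:
        return func(*inputs)
    except Exception as exc:  # a failing processor emits a new error value
        return ErrorValue(str(exc))


def read_csv_ready(path: str) -> str:
    # Hypothetical step that fails on an empty file name.
    if not path:
        raise ValueError("empty file name")
    return path


def load_table(path: str) -> str:
    # Hypothetical downstream step.
    return f"loaded {path}"


ok = run_processor(load_table, run_processor(read_csv_ready, "detections.csv"))
failed = run_processor(load_table, run_processor(read_csv_ready, ""))
```

In the second call `load_table` never runs: the error produced upstream simply becomes the downstream output, so the run as a whole completes.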
In my rendering of the challenge workflow I provide one boolean output for each possible error condition. In addition, error values appear in the provenance trace as "error values". All intermediate Taverna values are identified using URIs of the form "t2:ref//dataref.taverna.org?" followed by an identifier (see the artifact tables below for examples). In particular, errors are values that contain a description of the error, and are identified by a URI of the form "t2:error//dataref.taverna.org?". This makes it easy to trace and query errors.
A graphical depiction of the challenge workflow appears in the figure below:
- Taverna version of the challenge workflow (PNG):
The workflow is also available here through the myExperiment workflow repository, part of the myGrid project.
Open Provenance Model Output
The output is in RDF and is produced using the Tupelo provenance API, courtesy of Joe Futrelle at NCSA.
The OPM graphs below, in RDF/XML format, represent (1) the complete OPM graph (the result of the "fully unfocused query" alluded to above), and (2,3,4) for each of the first 3 provenance queries, a graph that contains enough information to answer the query, but does not include parts of the graph that are not relevant for the query. This is done to show how the queries can be answered by Taverna on its native provenance system, and the resulting subgraphs are then exported to OPM.
SPARQL queries
Below are a few simple utility SPARQL queries that may be useful to inspect the OPM RDF graph:
A note on Taverna provenance queries
A provenance query in Taverna consists of two parts:
- a set of ports (output processor variables) whose provenance we are interested in. Optionally, one can also specify an iteration step, i.e., when the value is a list whose elements are produced during iterations. We refer to these using the notation:
target = processor/variable/iteration [,processor/variable/iteration]* | ALL
(in reality the grammar allows for a few more options, but this will suffice here)
- a set of target processors, where we want provenance to be reported. This is done to avoid unnecessary noise in the query answer, i.e., we "jump" over uninteresting processors. We refer to these using the notation:
selected = processor [, processor]* | ALL
For example:
target = TOP / LoadCSVFileIntoTableOutput / 2 denotes element 2 of the value on port LoadCSVFileIntoTableOutput of the top-level Taverna workflow (note that workflows can be nested)
selected = LoadCSVFileIntoTable, IsMatchTableColumnRanges
This query returns the lineage of the value specified in the targets, but only on the input and output ports of processors LoadCSVFileIntoTable and IsMatchTableColumnRanges.
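To make the two-part notation concrete, here is a minimal Python sketch of a parser for it. It is not the Taverna implementation: the `Target` type and both function names are hypothetical, and only the simplified grammar given above is covered. Treating the iteration component as optional follows the description above, but the exact syntax for omitting it is an assumption.

```python
from typing import List, NamedTuple, Optional, Union


class Target(NamedTuple):
    processor: str
    variable: str
    iteration: Optional[str]  # None when no iteration step is given (assumed syntax)


def parse_targets(spec: str) -> Union[str, List[Target]]:
    """Parse 'processor/variable/iteration [, ...]' or the keyword ALL."""
    spec = spec.strip()
    if spec == "ALL":
        return "ALL"
    targets = []
    for item in spec.split(","):
        parts = [p.strip() for p in item.split("/")]
        if len(parts) not in (2, 3) or not all(parts):
            raise ValueError(f"malformed target: {item!r}")
        iteration = parts[2] if len(parts) == 3 else None
        targets.append(Target(parts[0], parts[1], iteration))
    return targets


def parse_selected(spec: str) -> Union[str, List[str]]:
    """Parse 'processor [, processor]*' or the keyword ALL."""
    spec = spec.strip()
    if spec == "ALL":
        return "ALL"
    return [p.strip() for p in spec.split(",") if p.strip()]


targets = parse_targets("TOP / LoadCSVFileIntoTableOutput / 2")
selected = parse_selected("LoadCSVFileIntoTable,IsMatchTableColumnRanges")
```

Note that a target such as TOP / ALL / ALL parses as an ordinary triple here; expanding ALL inside a triple is left to the (not shown) query engine.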
Complete OPM graph
- OPMGraph-complete.rdf: Complete provenance graph (RDF/XML) for one successful run of the PAN-STARRS workflow
- OPMGraph-complete.xml: Complete provenance graph (XML) for one successful run of the PAN-STARRS workflow
- OPMGraph-complete.dot: Complete provenance graph (dot) for one successful run of the PAN-STARRS workflow
- Complete provenance graph (PNG) for one successful run of the PAN-STARRS workflow:
Query 1: For a given detection, which CSV files contributed to it?
Note: this will be updated to reflect Paul's latest qualification of the purpose of the query.
This maps to the Taverna provenance query:
target = TOP/LoadCSVFileIntoTableOutput/2, selected = LoadCSVFileIntoTable
and generates the following OPM graph:
- OPMGraph-query1.xml: Provenance graph (XML) as an answer to provenance query 1
- OPMGraph-query1.dot: Provenance graph (dot) as an answer to provenance query 1
- OPMGraph-query1.rdf: Provenance graph (RDF) as an answer to provenance query 1
- Provenance graph (PNG) as an answer to provenance query 1:
Using the SPARQL queries listed above to inspect the graph, one obtains the artifacts:
------------------------------------------------------------------------------------------------------------------------------------------------------
| artifact | value |
======================================================================================================================================================
| <t2:ref//dataref.taverna.org?test20> | "J062941_LoadDB" |
| <t2:ref//dataref.taverna.org?test21> | "true" |
| <t2:ref//dataref.taverna.org?test15> | "/Users/paolo/Documents/myGRID/OPM/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv" |
------------------------------------------------------------------------------------------------------------------------------------------------------
where the third entry is the file that contributes to the detection. One can inspect the Used relations:
--------------------------------------------------------------------------------------------------------------------------------------------------------
| process | usedArtifact | role | processIteration |
========================================================================================================================================================
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=2" | "t2:ref//dataref.taverna.org?test15" | "LoadCSVFileIntoTable/FileEntry?it=2" | "[2]" |
--------------------------------------------------------------------------------------------------------------------------------------------------------
Query 2: "Was the range check (IsMatchTableColumnRanges) performed for this table?"
This translates simply into a boolean query that tests whether IsMatchTableColumnRanges is mentioned anywhere in the provenance graph, and is basically a reachability query:
target = TOP / ALL / ALL --all output ports of the top level workflow
selected = IsMatchTableColumnRanges
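The reachability test behind this query can be illustrated with a small Python sketch. The dependency map below is hypothetical and heavily simplified (it is not the actual challenge workflow graph, and the node names other than the two processors are made up), and `reaches` is an illustrative helper, not part of the Taverna API.

```python
from collections import deque

# Hypothetical dependency map: each node lists the nodes its value was derived from.
derived_from = {
    "TOP/output": ["IsMatchTableColumnRanges"],
    "IsMatchTableColumnRanges": ["LoadCSVFileIntoTable"],
    "LoadCSVFileIntoTable": ["TOP/input"],
}


def reaches(graph, start_nodes, processor):
    """Return True if `processor` is found when tracing back from any start node."""
    seen = set()
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if node == processor:
            return True
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return False


range_check_performed = reaches(derived_from, ["TOP/output"], "IsMatchTableColumnRanges")
```

Starting the traversal from all top-level output ports corresponds to the target TOP / ALL / ALL; the boolean answer is simply whether the selected processor is encountered.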
- OPMGraph-query2.xml: Provenance graph (XML) as an answer to provenance query 2
- OPMGraph-query2.dot: Provenance graph (dot) as an answer to provenance query 2
- OPMGraph-query2.rdf: Provenance graph (RDF) as an answer to provenance query 2
- Provenance graph (PNG) as an answer to provenance query 2:
Query 3: Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
Taverna can only provide the simple answer to the query, because query answers are based purely on the data dependencies that are exposed to the workflow.
In Taverna, the translation relies on the fact that the Image table is populated from the 2nd CSV file. Since the provenance graph only mentions processors that contributed to the execution, a query that traces back from the target port through every processor collects all the required processors along the path:
target = TOP / LoadCSVFileIntoTableOutput / 1
selected = ALL
- OPMGraph-query3.xml: Provenance graph (XML) as an answer to provenance query 3
- OPMGraph-query3.dot: Provenance graph (dot) as an answer to provenance query 3
- OPMGraph-query3.rdf: Provenance graph (RDF) as an answer to provenance query 3
- Provenance graph (PNG) as an answer to provenance query 3:
Notes on the IDs used in the graph, and on the values of artifacts:
In Taverna, repeated invocation of a processor occurs when the processor expects an atomic value, i.e., a string, but instead its input port is bound to a list of strings (the story is actually a bit longer but this will suffice for this note). So for example, processor LoadCSVFileIntoTable is defined to have one input port, called DBEntry, which expects a string (the file name). In this workflow it receives a list of 3 file names. This causes it to execute independently on each of them. A trace for each of these 3 independent executions appears in the OPM graph, and each occurrence is indexed as [0], [1], and [2].
(This path notation makes it possible to express more complex paths into nested lists).
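As a rough illustration of this implicit iteration and of the index paths it produces, consider the following Python sketch. It is a simplification under stated assumptions: `invoke` and `load_csv_file_into_table` are hypothetical names, the file names are made up, and the result is returned as a flat trace of (index path, result) pairs rather than in whatever structure Taverna uses internally.

```python
def invoke(processor, value, index=()):
    """Apply `processor` to an atomic value, or element-wise to a (nested) list.

    Returns one (index_path, result) pair per independent execution.
    """
    if isinstance(value, list):
        trace = []
        for i, element in enumerate(value):
            trace.extend(invoke(processor, element, index + (i,)))
        return trace
    return [(list(index), processor(value))]


def load_csv_file_into_table(file_name):
    # Hypothetical stand-in for the real processor.
    return f"loaded {file_name}"


trace = invoke(load_csv_file_into_table, ["a.csv", "b.csv", "c.csv"])
```

With a list of three file names the trace contains the index paths [0], [1] and [2]; with a nested list the same function yields longer paths such as [1, 0].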
Since there is no explicit provision in OPM to account for indexing of multiple occurrences of the same process, one new ID is created for each occurrence simply by appending the index to the process name. So for example, the following records appear in the list produced by WhoUsedWhat.sparql:
--------------------------------------------------------------------------------------------------------------------------------------------------------
| process | usedArtifact | role | processIteration |
========================================================================================================================================================
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=2" | "t2:ref//dataref.taverna.org?test16" | "LoadCSVFileIntoTable/FileEntry?it=2" | "[2]" |
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=1" | "t2:ref//dataref.taverna.org?test14" | "LoadCSVFileIntoTable/FileEntry?it=1" | "[1]" |
| "http://taverna.opm.org/LoadCSVFileIntoTable?it=0" | "t2:ref//dataref.taverna.org?test12" | "LoadCSVFileIntoTable/FileEntry?it=0" | "[0]" |
--------------------------------------------------------------------------------------------------------------------------------------------------------
Each row is interpreted as "occurrence [i] of LoadCSVFileIntoTable used the listed artifact (e.g. t2:ref//dataref.taverna.org?test16 for i = 2) with role LoadCSVFileIntoTable/FileEntry?it=i during iteration [i]".
The last item, processIteration, is added as an explicit RDF triple to the Used resource to make it possible to query the graph by iteration.
Note also that the role is used, at the moment, to describe the binding of a variable to a value (an artifact ID), e.g. in the table above, LoadCSVFileIntoTable/FileEntry?it=0 means that t2:ref//dataref.taverna.org?test12 is the artifact (ID) bound to variable FileEntry of processor LoadCSVFileIntoTable during iteration [0].
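The naming scheme visible in the table can be reproduced with a short Python sketch. The `used_record` helper is hypothetical; it only mirrors the string patterns shown in the rows above for a single-level iteration index, and how nested index paths are encoded is not covered here.

```python
BASE = "http://taverna.opm.org/"


def used_record(processor, port, iteration, artifact):
    """Build one Used record following the patterns in the WhoUsedWhat.sparql output."""
    suffix = f"?it={iteration}"
    return {
        "process": f"{BASE}{processor}{suffix}",   # one process ID per occurrence
        "usedArtifact": artifact,
        "role": f"{processor}/{port}{suffix}",      # variable binding for this occurrence
        "processIteration": f"[{iteration}]",       # extra triple, allows querying by iteration
    }


record = used_record(
    "LoadCSVFileIntoTable", "FileEntry", 0, "t2:ref//dataref.taverna.org?test12"
)
```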
Regarding values, at the moment my implementation can optionally include a property, which is not part of standard OPM, to associate the dereferenced value to the artifact. This is useful to quickly inspect the graph when the values are simple.
In the case of the challenge workflow, values are actually Java beans, so it's not immediately clear how to represent their value. In my implementation of the workflow, I am XMLEncoding/XMLDecoding the beans, as Taverna really only likes to work with strings. Rather than associating XMLEncoded beans to the artifacts, I have chosen to extract field values from them. This is done through customizable "data value extractors" that are invoked by the OPM graph builder when a new artifact is produced. So my choice of extractors reflects the needs of the provenance queries, but should be viewed just as an example of a generic data extraction pattern. In particular, my plan is to map beans to JSON, which is a generally useful format when presenting OPM graphs using rich web pages. (I just haven't had the time to do it).
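The extractor pattern can be sketched as a small registry that the graph builder consults whenever an artifact is produced. The Python below is only an illustration of the idea: the actual implementation works on Java beans, and every name here (`register_extractor`, `extract_value`, the "FileEntry" type and its `file_path` field) is hypothetical.

```python
from typing import Any, Callable, Dict, Optional

# Registry mapping a value type name to a function that pulls out a displayable field.
_extractors: Dict[str, Callable[[Any], str]] = {}


def register_extractor(type_name: str, extractor: Callable[[Any], str]) -> None:
    """Make `extractor` responsible for values of the given type."""
    _extractors[type_name] = extractor


def extract_value(type_name: str, value: Any) -> Optional[str]:
    """Called when a new artifact is produced; returns None if no extractor is registered."""
    extractor = _extractors.get(type_name)
    return extractor(value) if extractor is not None else None


# A dict stands in for a decoded bean; the field name is made up for this example.
register_extractor("FileEntry", lambda bean: bean["file_path"])

shown = extract_value("FileEntry", {"file_path": "detections.csv", "row_count": 10})
hidden = extract_value("UnknownBean", {"anything": 1})
```

Keeping the extractors outside the graph builder means the choice of which fields to expose can follow the queries of interest without touching the export code.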
Query Results
This section describes how third party OPM graphs are imported into the Taverna provenance model. Once imported, the challenge queries can be answered using the Taverna provenance query engine, just like the native Taverna provenance traces.
Representing third party OPM graphs in the Taverna provenance model: the MPOD idea
One peculiarity of the Taverna provenance model is that it relies on the structure of the static workflow graph in order to answer provenance queries efficiently. This is not a problem when provenance is captured from a Taverna workflow execution, of course, but it may become problematic when third party OPM graphs are imported, because those do not carry the original workflow structure.
Our solution is to use the causal relations provided by the graph, to "induce" a Minimal Plausible Originating (Taverna) Dataflow (MPOD) that could have produced the graph. As a consequence, the import algorithm consists of two parts:
- generate an MPOD from the artifact-artifact and artifact-process relations, and store it in the provenance DB. This includes processors with input and output ports, connected through data dependencies. A small number of mapping rules are used for this purpose;
- generate the bindings of ports to artifacts that would have been observed upon execution of the generated workflow.
The result is a complete provenance DB. One twist to this approach is that, for multi-account graphs, the algorithm maps each account to a separate MPOD, and in addition it generates a comprehensive dataflow that includes all relations found across all accounts.
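The two import steps can be illustrated with a deliberately naive Python sketch. The mapping rules actually used are not spelled out here, so the rule below (one port per role, one data link wherever a used artifact has a known generating process) is an assumption made purely for illustration, and `induce_dataflow` and the sample identifiers are hypothetical. Accounts are ignored.

```python
from collections import defaultdict


def induce_dataflow(used, generated):
    """Induce processors, data links and port bindings from OPM-style relations.

    used:      iterable of (process, artifact, role)   -- process used artifact
    generated: iterable of (artifact, process, role)   -- artifact was generated by process
    """
    processors = defaultdict(lambda: {"inputs": set(), "outputs": set()})
    bindings = {}   # (processor, port) -> artifact
    producer = {}   # artifact -> (processor, output port)

    for artifact, process, role in generated:
        processors[process]["outputs"].add(role)
        bindings[(process, role)] = artifact
        producer[artifact] = (process, role)

    links = set()   # ((source processor, output port), (sink processor, input port))
    for process, artifact, role in used:
        processors[process]["inputs"].add(role)
        bindings[(process, role)] = artifact
        if artifact in producer:
            links.add((producer[artifact], (process, role)))

    return dict(processors), links, bindings


processors, links, bindings = induce_dataflow(
    used=[("P2", "a1", "in")],
    generated=[("a1", "P1", "out")],
)
```

The first two return values correspond to the induced dataflow (step one), and the bindings to the values that would have been observed on its ports (step two).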
Ideally, if this mapping were completely lossless, the OPM graph obtained by exporting the provenance DB that results from this import method would be identical to the initial third party graph that was imported. While this is not always the case (in some cases the mapping is lossy), the interesting point is that one can pose (with a few renamings) the same queries described above to the imported provenance DB, and obtain structurally similar query answer OPM graphs.
The following OPM XML graphs have been successfully imported into Taverna:
Regarding posing queries on the DB that stores the content of these graphs, the problem seems to be that, although all the necessary information is, at least apparently, in the graph, there is no simple way to map the provenance queries from the first part of the exercise, i.e., those posed on the native Taverna workflow provenance, to these. This is where the current effort is concentrating.
Example: query 3 graph from UC Davis:
Suggested Workflow Variants
Suggested Queries
Suggestions for Modification of the Open Provenance Model
Conclusions
-- PaoloMissier - 07 May 2009