Technical Summary of the Second Provenance Challenge Workshop
This is a summary of the technical issues discussed at the second provenance challenge workshop. The issues have been broadly divided into three topics: interoperability of provenance systems (the main focus of the challenge), models of provenance, and the form of the challenge itself. We also include a list of participants and the next steps to follow on from the second challenge work. Please see the challenge specification, the teams' results and the workshop schedule for more.
Participants
- Natalia Currle-Linde, High Performance Computing Center Stuttgart
- Ewa Deelman, Information Sciences Institute
- Juliana Freire, University of Utah
- Joe Futrelle, National Center for Supercomputing Applications
- Dennis Gannon, Indiana University
- Tara Gibson, Pacific Northwest National Lab
- Paul Groth, University of Southampton
- Thomas Heinis, ETH Zurich
- David Koop, University of Utah
- Ales Krenek, CESNET
- Jose Manuel Gomez-Perez, ISOCO
- Gaurang Mehta, Information Sciences Institute
- Simon Miles, King's College London
- Luc Moreau, University of Southampton
- Jim Myers, National Center for Supercomputing Applications
- Kiran-Kumar Muniswamy-Reddy, Harvard
- Patrick Paulson, Pacific Northwest National Lab
- Norbert Podhorszki, University of California, Davis
- Yogesh Simmhan, Indiana University
- Peter Slaughter, University of California
- Robert Stevens, University of Manchester
- Karan Vahi, Information Sciences Institute
Interoperability of Provenance Systems
The aim of the second provenance challenge was to explore issues of interoperability between provenance systems. In many cases, the provenance of a data item will not be fully captured by the documentation of a single process recorded in one provenance system. Instead, the inputs that feed into the data item potentially come from diverse sources, each of which is relevant to the item's provenance. Therefore, being able to combine provenance data from multiple sources is an important issue.
Some differences between systems that were expected to be problematic happily turned out not to be. For instance, the differences between data from systems that regard provenance primarily as documenting the connections between steps in a process ('process provenance') and those that document dependencies between data ('data provenance') were easily bridged. This is because, while the models differed superficially, details of both data and process dependencies were evident in each, e.g. process steps' inputs and outputs were documented, or the process by which data was produced was documented. Moreover, the teams were easily able to take and interpret each other's data with minimal human interaction needed. This was partly because of shared knowledge about the challenge, but also suggests some commonality in the understanding of provenance between teams.
One issue that recurred through the presentations and discussion was that of naming. Each system independently named the data items, process steps etc. that it was documenting, and so integrating two or three systems' provenance data meant determining where an identifier produced by one system referred to the same entity as an identifier produced by a different system. Discussion covered techniques for trying to semi-automatically match names, mechanisms by which the problem could be avoided in the first place, and the possibility that the problem was created only by the artificial nature of the challenge (see the Challenge Comments section below). Techniques used or proposed for trying to determine matches between names included comparing MD5 checksums of the data referred to, classifying entities by context, and asking the user for guidance. It was suggested that a system may be able to indicate how likely it was to have matched names correctly, giving a measure of the likely quality of the provenance. To prevent the naming issue from occurring at all, it was proposed that provenance systems defer naming to the (software) creator of any data, and communicate these names through their interactions with each other. We also discussed the minimal information needed to preserve naming: some entities may be intensionally defined rather than named explicitly, as long as sufficient information is passed between the parts of the system recording or querying for provenance.
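As an illustration of the checksum-based matching technique mentioned above, the following is a minimal sketch that pairs identifiers from two systems whose data items have identical MD5 checksums. The function names and the dictionaries mapping identifiers to file paths are hypothetical, not part of any challenge team's system.

```python
import hashlib

def checksum(path):
    """MD5 checksum of a file's content, used as a system-independent key."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def match_identifiers(ids_to_paths_a, ids_to_paths_b):
    """Pair identifiers from two provenance systems that appear to refer to
    the same data item, judged by identical content checksums.

    Both arguments are hypothetical dicts mapping each system's own
    identifier to the path of the data item it names.
    """
    by_checksum = {}
    for ident, path in ids_to_paths_a.items():
        by_checksum.setdefault(checksum(path), []).append(ident)
    matches = []
    for ident, path in ids_to_paths_b.items():
        for other in by_checksum.get(checksum(path), []):
            matches.append((other, ident))
    return matches
```

A real matcher would, as discussed, combine such evidence with contextual classification or user guidance, and report a confidence in each match.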
Another issue discussed was that of extraneous information. A provenance query is performed using a particular provenance system. When the data from which the provenance of an item is determined contains information beyond what the querying system expects, because the system that recorded the data had different ideas of what was relevant, what should be done with that extraneous information? Should the querying system always be expected to return it, because it was considered relevant by the recording system? Or should it exclude it, since the information does not fit the model the querying system uses and so might not be correctly interpreted?
The issue of scalability was also discussed. If we have to determine provenance from data produced by multiple systems, there is a performance cost in matching up the data (such as in translating one model to another). It was suggested that benchmarks might help to understand how costly this really is (see Next Steps below).
Models of Provenance
One result of exploring interoperability of provenance systems, and of testing how data from them could be combined, was a better understanding of each other's approaches. There was much discussion of ways of modelling provenance.
In the first provenance challenge workshop, the discussion led to the claim that provenance must, at least, contain a causality graph, i.e. a record of the process that occurred, the derivation of data and so on. This graph may be explicit or implicit in any one system's provenance data. In this second workshop, the idea was extended and discussed. First, it was claimed that it must be an annotated causality graph, in order to capture the details and not just the structure of the provenance. Annotations were discussed more widely. Some teams argued that arbitrary annotations should not be included in the concept of provenance, as this leads to potentially all data being provenance and so an unhelpful lack of focus. Others argued that, as annotations can themselves have annotations, a model to capture provenance may be, at base, arbitrary data graphs, such as RDF encodes. It was noted that, in considering annotations of a causality graph, a provenance model may allow not only the vertices (the things caused, derived, triggered etc.) to be annotated, but also the edges (the causal connections), because causation takes different forms in different instances. This led into discussion about whether the causal graph should really be a hyper-graph, where one edge connects potentially many nodes. For example, where one data item is the division of two others, those latter two items play the roles of divisor and dividend, and these two roles only have meaning for that particular causal connection. Finally, we discussed using annotations to infer causality, particularly annotations stating the times at which the steps of a process occurred.
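As a purely illustrative sketch, not any team's model, the following shows how an annotated causality graph with role-labelled hyper-edges might be represented; the class and field names are assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A thing caused, derived or triggered (a data item, process step, ...)."""
    id: str
    annotations: dict = field(default_factory=dict)

@dataclass
class Edge:
    """A causal connection. 'causes' maps a role name (meaningful only for
    this edge) to the id of a causing node, so one edge can connect many
    nodes, as in a hyper-graph; edges can carry their own annotations."""
    effect: str
    causes: dict
    annotations: dict = field(default_factory=dict)

# The division example: the quotient is caused by two inputs whose roles,
# dividend and divisor, only have meaning on this particular edge.
nodes = {n.id: n for n in [Node("a"), Node("b"), Node("q")]}
edges = [Edge(effect="q",
              causes={"dividend": "a", "divisor": "b"},
              annotations={"operation": "divide"})]
```

Here the roles are attached to the edge rather than to the data items, reflecting the point that they have no meaning outside that causal connection.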
An issue that appeared several times in discussion was that of levels of abstraction, or different views of the same provenance. It was argued that different users may want the provenance of the same data for different purposes and so would view it in different ways, and a model of provenance should support this. For example, someone debugging the execution of a workflow will want details of how it ran and what resources it used (including library versions etc.), while a scientist trying to understand the broad procedure of an experiment will want a more abstract view, with the appropriate level of description for scientific explanation. It was suggested that, when looking at the causality graph for a workflow at one level of abstraction, a vertex in that graph may represent a sub-workflow which can be expanded into another causality graph.
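One way to picture the expandable-vertex idea is sketched below; this is an illustration under assumed names, not a feature of any particular challenge system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CausalityGraph:
    """A causality graph at one level of abstraction."""
    nodes: dict = field(default_factory=dict)   # vertex id -> Vertex
    edges: list = field(default_factory=list)   # (cause id, effect id) pairs

@dataclass
class Vertex:
    """A vertex in an abstract view; if it stands for a sub-workflow,
    'expansion' holds the more detailed causality graph it opens into."""
    id: str
    expansion: Optional[CausalityGraph] = None

def expand(vertex: Vertex) -> Optional[CausalityGraph]:
    """Return the detailed graph behind a vertex, or None for an atomic step."""
    return vertex.expansion
```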
The evolution of workflows was also discussed. If the workflow used to produce a result is important for understanding that result, then the provenance of the workflow (script) also becomes important: there is a causal connection between the workflow being edited and its results being as they are. Some workflow authoring systems provide versioning, and we discussed when a workflow that has been edited becomes a different workflow rather than a different version of the same workflow.
An attempt was made to decide on some common terminology and concepts between teams, which could be used as the basis for future collaboration. The first concept discussed was annotated causality graph, with annotated causality graph node and annotated causality graph edge additionally suggested. The atomicity of a graph node was discussed. It was at least agreed that, where two nodes are causally connected, it should not be ambiguous what 'part' of the entity denoted by each node was caused/affected by the other node, i.e. the nodes should be atomic with respect to that particular causal connection. It was argued that this may be extended to being able to intensionally define nodes as one of a set of atomic things, such as events, atomic data items or state transitions. From the discussion on annotations, other terms were proposed, including node annotation, edge annotation, annotation name and annotation value, but these were determined not to capture the potential for annotations on annotations and were not specifically about provenance, so the terms were dropped. Other terms discussed were depends on and caused by, for describing the edges of the causal graph, and process invocation box, for documentation of a process that may act as a graph node, possibly expandable to a sub-graph describing that process.
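To make the proposed vocabulary concrete, the sketch below shows how the edge terms and the process invocation box might appear in a data model. It is illustrative only; none of the names are agreed definitions from the workshop.

```python
from dataclasses import dataclass, field
from enum import Enum

class EdgeKind(Enum):
    """The proposed labels for edges of the causal graph."""
    DEPENDS_ON = "depends on"
    CAUSED_BY = "caused by"

@dataclass
class CausalEdge:
    effect: str                         # id of the dependent / caused node
    cause: str                          # id of the node it depends on / was caused by
    kind: EdgeKind = EdgeKind.CAUSED_BY

@dataclass
class ProcessInvocationBox:
    """Documentation of one process invocation, usable as a graph node and
    possibly expandable to a sub-graph describing that process."""
    id: str
    sub_graph: list = field(default_factory=list)   # list of CausalEdge, if expanded
```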
Challenge Comments
Feedback was given on the challenge exercise itself, which can hopefully help inform future challenges.
On the positive side, it was noted that interoperability between systems had been achieved fairly straightforwardly, with minimal or no interaction between teams. The main times that interaction did need to occur were where there were bugs in a team's provenance data.
However, questions were also raised about how the structure of the challenge may have artificially aided or inhibited interoperation. First, it is possible that the nature of the challenge amplified the naming issues described in the Interoperability section above: if one part of a workflow was enacted in one system and a second part in another system, then the intermediate data must have passed between the systems, and naming issues may have been partly resolved in this unspecified exchange.
As mentioned in the Models of Provenance section above, some teams argued that arbitrary annotations, as were relied on for queries 8 and 9 of both the first and second challenges, should not be in the scope of the exercise. Questions were also raised, though not further discussed, about how specific the teams' efforts were to the challenge: whether the translation between models was challenge-specific or whether generic rules could be applied, and whether the teams' implementations of the provenance queries used in the first challenge had to be altered to run over other teams' data (which would suggest a lack of generality).
Next Steps
Overall, teams had put a lot of effort into the second provenance challenge, and this meant that the workshop contained a good deal of interesting and wide-ranging discussion. At the end of the workshop, the next steps were discussed.
While the challenge teams present expressed enthusiasm for continuing to discuss provenance-related issues and work together on interoperability between their systems, there will not be a third provenance challenge immediately following this one. Instead, it was decided to encourage teams to use the particular results, issues and topics they had encountered in the second provenance challenge in papers submitted to the International Provenance and Annotation Workshop (IPAW), to be held in 2008. At this workshop, we will set aside time to discuss future steps for the provenance challenge.
As a prelude to a possible third challenge, we will elicit from teams workflows, provenance data or other ideas by which to 'benchmark' provenance systems. A mid-term date will be proposed by the challenge organisers to ensure we stay on track to have results to discuss at IPAW. A subset of the challenge teams has started to build a repository for the benchmarks, and will present it to the rest when ready. Similarly, discussions on a shared vocabulary are continuing between some teams, and this again should be disseminated for wider discussion once ready.
An official report of the provenance challenge workshop will be written.
Feedback and Clarification
To all provenance challenge teams: please feel free to give feedback, add comments or clarify your own position in this section.
--
SimonMiles - 23 Jul 2007