Open Provenance Model Contents
- Introduction
- Basics
- Overlapping and Hierarchichal Descriptions
- Provenance Graph Definition
- Timeless Formal Model
- Inferences
- Formal Model and Time Annotations
- Time Constraints and Inferences
- Support for Collections
- Example of Representation
- Conclusion
- Best Practice on the Use of Agensts
- References
6 Inferences
The Open Provenance Model has defined the notion of
OPM graph based on a set of syntactic rules and the notion of
Provenance Graph adding a set of topological constraints. Provenance graphs are aimed at representing causality graphs explaining how processes and artifacts came out to be. It is expected that a variety of reasoning algorithms will exploit this data model, in order to provide novel and powerful functionality to users. It is beyond the scope of this document to include an extensive coverage of relevant reasoning algorithms. However, provenance graphs, by means of edges, capture causal dependencies, which can be summarised by means of transitive closure that we describe in this section.
6.1 One Step Inferences
In Section 2, we have introduced the two causal dependencies
triggeredBy and
wasDerivedFrom acting as abbreviation for causal dependencies
used and wasGeneratedBy. Figure 9 shows their exact meaning.
Figure 9: One Step Inference in the Provenance Model
Figures 10 and 11 formalize Figure 9 by introducing rules for each inference that can be performed in the Open Provenance Model. A rule consists of two expressions separated by a horizontal line. The expression above the line is a hypothesis, whereas the expression below the line is a conclusion that can be inferred from the hypothesis.
In Equation (1), a
wasTriggeredBy edge is inferred from the existence of a
used and wasGeneratedBy edges, as per described in Figure 9. We note that the inferred \triggeredBy edge relies on both accounts
acc2 and
acc3, hence, it is given _acc2 ∪
acc3 as account.
Figure 10: One Step Inference Rules (1)
Equation (2) is the reverse of Equation (1): it allows us to establish that the edge
<
p2,
p1,
acc > ∈
WasTriggeredBy is hiding the existence
of some artifact
a2, used by
p2 and generated by
p1. The
inferred edges
used and
wasGeneratedBy are asserted in the context of
some account
acc2 and
acc3, whose union is the original account
acc. We note that Equation (2) allows us to
establish the existence of some artifact
a2 (and
r1,
r2,
acc1,
acc2) but it does not tell us what their ids and values are.
This is the consequence of using
wasTriggeredBy , which is a lossy
summary of the composition of
used and
wasGeneratedBy.
The kind of inferences that can be made about wasDerivedFrom is of a different
nature. Indeed, without any internal knowledge of P1 in Figure 9, it is impossible to ascertain there is an actual data dependency between A1 and A2.
Remark.
Concretely, a rule such as the following would lead to incorrect
inferences since it allows arbitrary outputs to a process to be
inferred to be dependent on arbitrary inputs to the same process.
⟨a2,r2,p1,acc2⟩∈WasGeneratedBy ∧ ⟨p1,r1,a1,acc1⟩∈Used
|
|
⟨a2,a1,acc1⋃ acc2⟩∈WasDerivedFrom
|
While it is unreasonable to infer an exact dependency by means of
wasDerivedFrom , it is useful to be able to infer that a dependency
may exist. To this end, we introduce the edge
mayHaveBeenDerivedFrom that marks such a potential dependency. Hence, if <
a1,
a2 > ∈
WasDerivedFrom, then <
a1,
a2 > ∈
MayHaveBeenDerivedFrom, but not vice-versa.
Hence, Equation (
3)
states that a
mayHaveBeenDerivedFrom edge can be derived from the existence of a
succession of
wasGeneratedBy and
used edges.
Equation (4) is to (2) what
wasDerivedFrom is to
wasTriggeredBy.
Figure 11: One Step Inference Rules (2)
In rules 1 and 3, the inferred edges have accounts
acc2 ∪ acc3 and
acc1 ∪ acc2, respectively. Hence, the artifacts and processes connected by these edges will have an effective account membership modified accordingly. We note that rules 1 and 3 effectively creates relationships in the union of multiple account views.
6.2 Transitive Closure
Users want to find out the causes of an artifact, not due to one process, but potentially, due to an unknown number of them.
Hence, for the purpose of expressing queries or expressing inferences about provenance graphs, we introduce
four new relationships, which are transitive versions of existing relationships, namely
Used*,
WasGeneratedBy*,
WasDerivedFrom* and
TriggeredBy*. Their definitions are displayed in Figure 12. We note that Figure 12 contains definitions (as opposed to inference rules of Figures 10 and 11, which specify which edges can be inferred from which edges). For convenience, we have also introduced a generic causal dependency
wasDependentOn∗ (see equations (9) to (12)). Note that similar inference rules can be defined for
MayHaveBeenDerivedFrom.
Figure 12: Transitive Closures
Equations 7 and 8 are one of the multiple possible ways of defining edges
used* and
wasGeneratedBy*. Other definitions could be expressed and proved equivalent (such as
used* can be derived from a single
used and
_wasDerivedFrom*).
Comments
In my mind, 'wasDerivedFrom' has the same meaning as 'mayHaveBeenDerivedFrom', since if the process model is looked into far enough, almost any causality might be found to be false. I don't think we should have two different types of linked; we might name the remaining one 'mayHaveBeenDerivedFrom' if it helps people agree on the meaning.
--
PatrickPaulson - 18 Aug 2008
to top