Data Provenance, Distributed Systems and Regulatory Pressure

Reliably determining the provenance of data across widely distributed systems is a complex problem. Virtually every distributed system tracks data provenance poorly because of the fragmented, heterogeneous foundations that typically underpin these systems.

This is a problem that needs to be addressed. In May 2018, EU Regulation 2016/679, also known as the General Data Protection Regulation (GDPR), will replace the 1995 EU Data Protection Directive[1]. The new regulation extends the 1995 directive and encompasses any organisation[2] accumulating EU data.

GDPR makes it a legal requirement to show an input if it relates to a computed value[3] that results in a decision that impacts an EU citizen. In other words, both data provenance[4] and lineage[5] must be identified and displayed. This covers all data relating to an individual, including email addresses, bank details, medical records and even social network postings. The regulation extends to past and present data states, including being able to show why an expected result or output did not occur[6].

Achieving this over widely distributed, fragmented, heterogeneous systems that process very large message volumes is a significant challenge, one that rules out conventional system mapping and data provenance approaches. Lineage adds further complexity: GDPR requires that the execution path followed by a data item be identified as the “correct” path for determining the resulting data-state, or at least that the path the data followed be shown to comply with the required process model.

One approach is using tuples to relate specific output results to specific inputs[7]. This approach has been explored by tracing data errors back to their source[8]. An extension to this approach is to apply state information – effectively re-applying conventional database tuple concepts to the network to extract “network provenance”[9].
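As a minimal sketch of the tuple idea, assuming the {I, P, o} form given in footnote 7, the Python below records one tuple per operation and walks backwards from an output to the original inputs it depends on. The names ProvenanceTuple and trace_sources and the sample records are hypothetical, chosen for illustration only; they are not taken from the cited papers.

```python
from typing import NamedTuple


class ProvenanceTuple(NamedTuple):
    """One {I, P, o} record: inputs I, operation P, output o (hypothetical layout)."""
    inputs: tuple[str, ...]
    operation: str
    output: str


def trace_sources(output: str, records: list[ProvenanceTuple]) -> set[str]:
    """Return the original inputs (values no recorded operation produced) behind `output`."""
    by_output = {r.output: r for r in records}
    sources: set[str] = set()
    seen: set[str] = set()
    pending = [output]
    while pending:
        item = pending.pop()
        if item in seen:
            continue
        seen.add(item)
        record = by_output.get(item)
        if record is None:
            sources.add(item)  # nothing produced it, so treat it as a source
        else:
            pending.extend(record.inputs)
    return sources


# Hypothetical sample data: two chained operations leading to a decision.
records = [
    ProvenanceTuple(("salary", "debts"), "score", "credit_score"),
    ProvenanceTuple(("credit_score", "age"), "decide", "loan_decision"),
]
sources = trace_sources("loan_decision", records)
print(sorted(sources))
```

Here the decision traces back to three source inputs. The sketch assumes each output is produced by exactly one operation and that the recorded tuples remain valid, which is the assumption that becomes hard to sustain at network scale.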

This approach is not flawed, but it breaks down when applied across large-scale distributed systems, in part because of the frequency of network changes. Whereas tuples in a conventional database describe comparatively stable data, a network can change quite frequently. By way of example, an examination of why Path “P” was used to communicate with application “A”[10] may reveal that Path P no longer exists due to an implementation change.

A further problem is explaining not why a specific output was generated[11], but why an expected result did not occur[12]. This is particularly relevant to GDPR, which will make it a requirement to identify why a specific data-state has resulted from non-compliance with a required process.
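The “why not” question can be illustrated with a small sketch, assuming the required process is an ordered list of steps and that each observed event records a message identifier and the step that handled it. Both assumptions, along with the names first_missing_step and REQUIRED_PROCESS and the sample events, are hypothetical and far simpler than the diagnosis technique cited above.

```python
from typing import Optional

# Hypothetical required process: the ordered steps a message must pass through.
REQUIRED_PROCESS = ["received", "validated", "consent_checked", "stored"]


def first_missing_step(
    message_id: str,
    events: list[tuple[str, str]],
    required: list[str],
) -> Optional[str]:
    """Return the first required step with no observed event for the message, else None."""
    observed = {step for mid, step in events if mid == message_id}
    for step in required:
        if step not in observed:
            return step
    return None


# Observed (message_id, step) events; message "m2" never reached the consent check.
events = [
    ("m1", "received"), ("m1", "validated"), ("m1", "consent_checked"), ("m1", "stored"),
    ("m2", "received"), ("m2", "validated"),
]

complete = first_missing_step("m1", events, REQUIRED_PROCESS)
stalled = first_missing_step("m2", events, REQUIRED_PROCESS)
print(complete, stalled)
```

For message "m2" the sketch reports the consent check as the first step that never happened, which is the point at which the expected data-state stopped being reachable.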

Further complications with the tuple approach arise when dealing with network misconfigurations. Configuration errors, although common and typically widespread, may only manifest themselves in specific circumstances or conditions, and may only impact a subset of messaging events.

This combination of infrequent occurrence and impact on only a subset of messages makes the isolation and correction of network misconfigurations across large systems extremely challenging. Isolation and correction cannot be achieved by adding temporal information or randomly sampling sets of executing events to check for configuration correctness. The only method that ensures all network configuration errors are identified, isolated and addressed is to continuously monitor all system exchanges, across all co-operating nodes, at all times. Stated another way, provenance and lineage across large-scale distributed systems need to be captured in a centralised database and presented as a topology, so that each step in a process chain that results in a data change can be identified.

Such an approach captures “network provenance”[13]. This may be more correctly described as “network state” or the “Network As Used State”. As the system is continuously monitored and the resulting description continuously updated, this may also be described as converting an existing implementation into a “State Machine”.

Once the end-to-end As-Used system state has been captured, it becomes possible to retrace execution sequences to any point in the process where a data change has occurred. The ability to “roll back” to any point in the execution sequence also makes it possible to “roll forward” from any point to identify all locations to which the data has been distributed.
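A minimal sketch of “roll back” and “roll forward”, assuming the captured state can be reduced to a directed graph of sender-to-receiver exchanges. The node names and the reachable helper are hypothetical, and a real capture would carry far more detail than a pair of names per exchange.

```python
from collections import defaultdict, deque


def reachable(start: str, adjacency: dict[str, set[str]]) -> set[str]:
    """Breadth-first search: every node reachable from `start`, excluding `start` itself."""
    found: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in found and nxt != start:
                found.add(nxt)
                queue.append(nxt)
    return found


# Hypothetical captured exchanges: (sender, receiver) pairs.
exchanges = [
    ("web_form", "crm"),
    ("crm", "billing"),
    ("crm", "marketing"),
    ("billing", "ledger"),
]

# Index every exchange in both directions so either traversal is a simple lookup.
forward: dict[str, set[str]] = defaultdict(set)
backward: dict[str, set[str]] = defaultdict(set)
for sender, receiver in exchanges:
    forward[sender].add(receiver)
    backward[receiver].add(sender)

upstream = reachable("billing", backward)   # "roll back": where the data came from
downstream = reachable("crm", forward)      # "roll forward": where the data went
print(sorted(upstream), sorted(downstream))
```

The same traversal serves both directions; only the adjacency index changes. That is why capturing the state once supports provenance and distribution queries alike.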

This enables any input into any operation that results in a computed output value to be identified. This output may then be “traced forward” to all points of distribution. From this, information can be derived about decisions that may or may not impact an EU citizen. In other words, bringing together data provenance, data lineage and network state addresses the central tenets of the GDPR.

The captured As-Used state may also be used to test for correct data outputs and to isolate unexpected results. This can be achieved by comparing the observed behaviour of the system with the expected behaviour of the system model(s). Since temporal information is captured as the As-Used state is being mapped, it becomes possible to examine both past and present system states, and from these to determine why an expected output did, or did not, occur.
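The comparison of observed and expected behaviour can be sketched as a set difference, assuming the system model is expressed as a set of permitted sender-to-receiver links and each observed exchange carries a timestamp. The Exchange record, the compare_with_model function and the sample data are hypothetical illustrations only.

```python
from typing import NamedTuple


class Exchange(NamedTuple):
    """One observed exchange (hypothetical record layout)."""
    timestamp: int  # seconds since an arbitrary start
    sender: str
    receiver: str


def compare_with_model(
    observed: list[Exchange],
    expected: set[tuple[str, str]],
    as_of: int,
) -> tuple[set[tuple[str, str]], set[tuple[str, str]]]:
    """Return (missing, unexpected) links, using only exchanges seen up to `as_of`."""
    seen = {(e.sender, e.receiver) for e in observed if e.timestamp <= as_of}
    return expected - seen, seen - expected


expected_model = {("intake", "validator"), ("validator", "store")}
observed = [
    Exchange(100, "intake", "validator"),
    Exchange(160, "intake", "store"),      # bypasses the validator
    Exchange(220, "validator", "store"),
]

# Examine a past state (t=200) and the present state (t=300).
missing_then, unexpected_then = compare_with_model(observed, expected_model, as_of=200)
missing_now, unexpected_now = compare_with_model(observed, expected_model, as_of=300)
print(missing_then, unexpected_then, missing_now, unexpected_now)
```

At the earlier point in time the expected validator-to-store link has not yet been seen, while the link that bypasses the validator shows up as unexpected in both the past and present views.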

Further benefits may also be derived from the capture of the As-Used state. One such benefit is continuous system optimisation. Once the end-to-end system description has been determined, test data may be recursively deployed to ensure the correctness of all network configurations. Such testing may be fully automated, thus eliminating many of the expensive and time-consuming test sequences currently in use.

The HELIXsystem Process Assembler (HSPA) is the only capability available in the market today that enables the As-Used state of fragmented, widely distributed, heterogeneous systems to be machine captured and unambiguously described. HSPA deploys across any implementation regardless of age or complexity without introducing any change in the observed systems.


[1] 95/46/EC

[2] GDPR extends to cover any non-EU organisation accumulating EU resident data

[3] Or output

[4] identifying the source or origin of data

[5] what happens to data entries as this data migrates over a system over time

[6] Penalties for non-compliance are severe: €20m or 4% of global turnover, whichever is the greater, making non-compliance a non-option

[7] For example, {I,P,o}, where I is an input, P is an oPeration and o is an output.

[8] Meliou et al., “Tracing Data Errors with View-Conditioned Causality”, 2011

[9] W. Zhou et al., “Efficient Querying and Maintenance of Network Provenance at Internet Scale”, 2010

[10] If you like, “What is the Network Provenance of P?”

[11] That is, what Inputs “I” to oPeration “P” produced output(s) “o”

[12] Y. Wu et al., “Diagnosing Missing Events in Distributed Systems”, 2014

[13] As described elsewhere
