When a manual handoff occurs, when sensors glitch, or when legacy systems fail to capture every attribute, the log ends up with holes – blank timestamps, unknown resources, missing activity names. Traditional repair tools either require an existing process model (rarely available in fast‑moving startups) or rely on generic machine‑learning models that guess only a single field at a time. The result? Incomplete insights and costly re‑engineering.
Enter SANAGRAPH, the brainchild of researchers from the University of Trento, unveiled in their paper *Graph‑based Event Log Repair*. This isn’t just another auto‑encoder; it’s a heterogeneous graph neural network (HGNN) that treats every trace as a living graph where each event attribute becomes its own node type. By doing so, SANAGRAPH can simultaneously reconstruct all missing attributes – activities, timestamps, resources, costs, you name it.
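To make the idea concrete, here is a minimal sketch of how a single trace could be turned into such a heterogeneous graph with PyTorch Geometric. The node‑type names, edge layout, and feature encodings are illustrative assumptions, not the paper’s exact schema.

```python
# A minimal sketch: one trace becomes a heterogeneous graph whose node types
# correspond to event attributes. Names and encodings are assumptions.
import torch
import torch.nn.functional as F
from torch_geometric.data import HeteroData

def build_trace_graph(activities, resources, timestamps, num_act, num_res):
    """activities/resources are integer labels, with -1 marking a missing value."""
    data = HeteroData()
    n = len(activities)

    def one_hot_with_missing(labels, num_classes):
        labels = torch.tensor(labels)
        x = F.one_hot(labels.clamp(min=0), num_classes).float()
        x[labels < 0] = 0.0  # a missing value becomes an all-zero vector
        return x

    # One node type per event attribute, plus the event nodes themselves.
    data["activity"].x = one_hot_with_missing(activities, num_act)
    data["resource"].x = one_hot_with_missing(resources, num_res)
    ts = torch.tensor(timestamps, dtype=torch.float).view(-1, 1)
    data["event"].x = (ts - ts.mean()) / (ts.std() + 1e-8)  # z-score normalization

    idx = torch.arange(n)
    # Each event is linked to its attribute nodes in both directions ...
    data["event", "has_activity", "activity"].edge_index = torch.stack([idx, idx])
    data["event", "has_resource", "resource"].edge_index = torch.stack([idx, idx])
    data["activity", "belongs_to", "event"].edge_index = torch.stack([idx, idx])
    data["resource", "belongs_to", "event"].edge_index = torch.stack([idx, idx])
    # ... and to its successor, so context can flow along the trace.
    data["event", "followed_by", "event"].edge_index = torch.stack([idx[:-1], idx[1:]])
    return data
```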
SANAGRAPH builds this graph for every trace, encodes categorical values as one‑hot vectors and normalizes numeric fields, then feeds the structure into a stack of SAGEConv layers – a lightweight but powerful message‑passing engine. Information flows from nodes with known values to those whose values are missing, letting the model infer each gap from its surrounding context.
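Building on the graph sketched above, a repair network in this spirit could look like the following: a stack of SAGEConv layers wrapped in HeteroConv, topped with one prediction head per attribute to reconstruct. The hidden size, the two‑layer default, and the choice of heads are assumptions for illustration, not the authors’ exact configuration.

```python
# A sketch of the repair network: SAGEConv message passing over the
# heterogeneous trace graph, with per-attribute reconstruction heads.
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

EDGE_TYPES = [
    ("event", "has_activity", "activity"),
    ("event", "has_resource", "resource"),
    ("activity", "belongs_to", "event"),
    ("resource", "belongs_to", "event"),
    ("event", "followed_by", "event"),
]

class TraceRepairHGNN(torch.nn.Module):
    def __init__(self, hidden=64, num_act=10, num_layers=2):
        super().__init__()
        self.convs = torch.nn.ModuleList([
            HeteroConv(
                {et: SAGEConv((-1, -1), hidden) for et in EDGE_TYPES},  # lazy input dims
                aggr="mean",
            )
            for _ in range(num_layers)
        ])
        self.activity_head = torch.nn.Linear(hidden, num_act)  # classify masked activities
        self.timestamp_head = torch.nn.Linear(hidden, 1)       # regress masked timestamps

    def forward(self, data):
        x_dict = data.x_dict
        for conv in self.convs:
            x_dict = conv(x_dict, data.edge_index_dict)
            x_dict = {ntype: x.relu() for ntype, x in x_dict.items()}
        return self.activity_head(x_dict["activity"]), self.timestamp_head(x_dict["event"])

# Usage (with the build_trace_graph sketch above):
#   graph = build_trace_graph(acts, ress, times, num_act, num_res)
#   act_logits, ts_pred = TraceRepairHGNN(num_act=num_act)(graph)
```

The lazy `(-1, -1)` input dimensions let each SAGEConv infer the feature width of its source and target node types on the first forward pass, which keeps the sketch agnostic to how many attribute categories a given log contains.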
Key findings:

- Activity reconstruction – SANAGRAPH outperformed the auto‑encoder baseline by an average of 43 percentage points on structured masking scenarios (odd/even/window; see the sketch after this list), and was only slightly behind on random masks.
- Timestamp accuracy – Results were neck‑and‑neck, with SANAGRAPH achieving lower mean absolute error (MAE) on three datasets and marginally higher MAE on the other three.
- Full‑attribute repair – When tasked with restoring *all* attributes, SANAGRAPH’s performance stayed within 0.05 accuracy points of its activity‑only version, proving that adding complexity does not cripple the model.
- Scalability – A modest 2‑layer network already delivered strong results; increasing to four layers boosted accuracy further without exploding training time (thanks to GPU acceleration).
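The masking schemes themselves are simple to picture. The snippet below is a rough sketch of how such evaluation masks might be generated; the window size and random‑masking ratio are assumed values, not the ones used in the paper.

```python
import random

def positions_to_mask(n, scheme, window=3, ratio=0.3, seed=0):
    """Return the event positions whose attributes are hidden before repair.
    Window size and random ratio are illustrative assumptions."""
    rng = random.Random(seed)
    if scheme == "odd":
        return [i for i in range(n) if i % 2 == 1]
    if scheme == "even":
        return [i for i in range(n) if i % 2 == 0]
    if scheme == "window":                       # a contiguous block of events
        start = rng.randrange(max(1, n - window + 1))
        return list(range(start, min(n, start + window)))
    if scheme == "random":                       # each event masked independently
        return [i for i in range(n) if rng.random() < ratio]
    raise ValueError(f"unknown masking scheme: {scheme!r}")
```

Beyond the raw numbers, the practical payoff is what stands out: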
- Model‑Free Flexibility – No need to craft a process model beforehand. SANAGRAPH learns directly from the raw logs, making it perfect for agile environments where processes evolve daily.
- Holistic Data Quality – By repairing every attribute at once, downstream analytics (conformance checking, bottleneck detection, predictive monitoring) receive a clean, richer dataset.
- Speed of Insight – The graph‑based approach runs in minutes on a standard workstation equipped with an RTX 4070 GPU, turning what used to be a manual data‑cleansing marathon into an automated sprint.
- Future‑Ready Architecture – Because it operates on graphs, SANAGRAPH can easily ingest multimodal data (IoT sensor streams, ERP records, unstructured logs) as new node types, paving the way for truly omniscient process mining.
The researchers hint at next steps: incorporating global log‑level features into the graph, adding explainability layers so users can see why a particular value was chosen, and exploring dynamic graph depths that automatically adapt to the complexity of each trace. All this points toward a future where process intelligence is self‑sustaining, continuously learning, correcting, and optimizing without human intervention.
References (selected): Van der Aalst et al., *Process Mining Handbook*; Wu et al., “Comprehensive Survey on Graph Neural Networks”; Dissegna et al., “Multiperspective next event prediction via heterogeneous GNNs”; Nguyen et al., “Autoencoders for improving quality of process event logs”.
The SANAGRAPH code and full experimental results are openly available on GitHub, inviting the community to extend this breakthrough into new domains – from healthcare records to autonomous vehicle fleets.
