
Multimodal Zero-Shot Activity Recognition for Process Mining of Robotic Systems

Corradini F.; Pettinari S.; Re B.; Rossi L.; Sampaolo M.
2025-01-01

Abstract

Understanding and analyzing the behavior of robotic systems is essential to ensure their reliability, efficiency, and continuous improvement, especially as robots are increasingly deployed in complex, dynamic environments. Process mining offers a powerful approach to uncover and analyze the execution of robotic operations. However, applying process mining to robotic systems requires bridging the gap between fine-grained multimodal data and high-level activity representations. Recent advances in foundation models provide a promising solution to this challenge, as the knowledge acquired during their extensive pretraining enables them to interpret multimodal data without the need for task-specific training. In this work, we propose a novel multimodal process mining pipeline that leverages the zero-shot capabilities of foundation models to perform activity recognition from visual and auditory inputs. By transforming fine-grained multimodal data into event logs, the pipeline enables the application of process mining techniques to robotic systems. We applied our approach to the Baxter UR5 95 Objects dataset, which offers synchronized video and audio recordings of a Baxter robot manipulating objects. The fusion of activity recognition results from these complementary modalities yields an event log that more accurately represents the robot’s operations, mitigating imprecision associated with using a single modality. Our results demonstrate that foundation models effectively enable the application of process mining to robotic systems, facilitating monitoring and analysis of their behavior.
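The fusion step described in the abstract can be pictured with a minimal sketch: per-segment activity predictions from a video model and an audio model are merged into a single event log suitable for process mining. The function names, the `(timestamp, label, confidence)` representation, and the simple confidence-weighted agreement rule below are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch of multimodal fusion into an event log.
# Inputs: aligned per-segment predictions from two modalities, each a list of
# (start_time, activity_label, confidence) tuples. These names and the voting
# rule are assumptions for illustration only.

def fuse_predictions(video_preds, audio_preds):
    """Merge aligned video and audio predictions into one event stream.

    If the modalities agree, keep the shared label; if they disagree,
    keep the label from the more confident modality.
    """
    events = []
    for (t, v_label, v_conf), (_, a_label, a_conf) in zip(video_preds, audio_preds):
        label = v_label if (v_label == a_label or v_conf >= a_conf) else a_label
        events.append({"timestamp": t, "activity": label})
    return events


def to_event_log(events, case_id):
    """Attach a case identifier so process mining tools can group
    events into traces (e.g., before exporting to XES or CSV)."""
    return [{"case_id": case_id, **e} for e in events]
```

For instance, if video predicts `("pick", 0.9)` and audio predicts `("pick", 0.7)` for the same segment, the fused event keeps `pick`; if they disagree, the more confident modality wins, which is one simple way the complementary modalities can mitigate single-modality imprecision.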
2025
ISBN: 9783032029355; 9783032029362
Keywords: Activity Recognition; Foundation Models; Process Mining; Robotic Systems
273
Files in this item:
No files are associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11581/494865
Warning! The displayed data have not been validated by the university.

Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a