Based on research by:
Nataliia Klievtsova (Technical University of Munich),
Sherri Hadian (SAP Signavio, @Sherbika),
Timotheus Kampik (SAP Signavio, @TimotheusKampik),
Jürgen Mangler (Technical University of Munich),
Frederick Engelns-Bauer (SAP Signavio),
Stefanie Rinderle-Ma (Technical University of Munich).
Generative AI (especially large language models, LLMs) is now being used to automatically create conceptual models like business process diagrams from natural language descriptions. One big challenge is evaluating these AI-generated models. Traditional evaluation requires ground truth data – for example, a human-crafted process model for a given description – but such datasets are rare and may eventually get contaminated as LLMs train on them.
How can we reliably measure an AI model’s performance without always needing ground truth? This blog introduces Round-Trip Correctness (RTC) as a solution. RTC evaluates generative AI by checking how well an AI can convert a model to text and back to a model (or vice versa) and seeing how much information survives the round-trip. If the final output closely matches the initial input, the generation is considered “round-trip correct.” In this post, we’ll explain the RTC approach, how it works under the hood and what we learned by applying it to real process model datasets. We’ll also explore cross-domain results and discuss why this metric is useful for practitioners in SAP process modeling and beyond. The code, results, and data are available in the accompanying GitHub repositories: Code and Results and Dataset.
Round-Trip Correctness is a way to evaluate AI-generated models by round-tripping between text and model representations. It has its origins in machine translation (Somers, 2005) and was recently proposed for LLM evaluation by Allamanis et al. (2024). In a round-trip, you start with an artifact in one format (say, a process model in JSON format), convert it to the other format (a textual description) using an LLM, and then convert it back to the original format, again using an LLM. You end up with an original and a reconstructed artifact, both in the same format, which you can compare for similarity. The idea is that a high-quality generative model should produce outputs that preserve the content of the input when cycled through this process. For example, if an AI turns a JSON process model into text and then back into a JSON process model, the final model should be almost the same as the original if nothing important was “lost in translation.” This fidelity is the RTC score.
The motivation for RTC is sustainability in evaluation. Instead of relying solely on limited or potentially biased ground truth pairs, we can use the LLM as an evaluator by testing its consistency. This approach is especially handy when ground truth models aren’t available – which is often the case for new process descriptions or proprietary scenarios. In their study, Allamanis et al. (2024) demonstrated that RTC scores, which require no human reference, strongly correlate with traditional evaluation metrics. For instance, when applied to code synthesis benchmarks like HumanEval (Chen et al., 2021), and ARCADE (Yin et al., 2022), RTC scores showed Pearson correlations of 0.96 with standard pass@1 measures (Chen et al., 2021), confirming RTC as a reliable proxy for assessing quality.
In detail, round-trip correctness (RTC) implies that, for a "good" pair of forward and backward models, we expect x̂ = M⁻¹(M(x)) and x to be semantically equivalent. In our case, this means the semantic similarity between the round-tripped artifact (e.g., the regenerated text) and the original should be maximal, so sim(x̂, x) can serve as a measure of the round-trip capacity of the LLM. Because an exact calculation of RTC is not possible, we approximate it by drawing a small number of forward samples N_f and backward samples N_b and averaging the resulting similarity scores.
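Written out, the approximation we compute is (a reconstruction from the definitions above, consistent in spirit with Allamanis et al. (2024); the symbols are those introduced in the text):

$$
\widehat{\mathrm{RTC}}(x) \;=\; \frac{1}{N_f \, N_b} \sum_{i=1}^{N_f} \sum_{j=1}^{N_b} \mathrm{sim}\big(x,\; \hat{x}_{i,j}\big),
\qquad \hat{x}_{i,j} = M^{-1}\big(M(x)_i\big)_j,
$$

where $M(x)_i$ denotes the $i$-th forward sample drawn from $M$ and $M^{-1}(\cdot)_j$ the $j$-th backward sample drawn from it.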
Limitation: The RTC score evaluates the combined performance of M and M⁻¹. If, for example, the forward model M generates poor BPMN models from text, we cannot expect to meaningfully measure the ability of M⁻¹ to generate text from them.
To apply RTC in practice, we define two pipelines: a model-to-model (M2M) pipeline that starts from a process model, generates a textual description (forward step), and regenerates the model from that description (backward step); and a text-to-text (T2T) pipeline that starts from a textual description, generates a process model (forward step), and regenerates the description from it (backward step). Both pipelines thus involve a forward and a backward generation step with an LLM.
In both pipelines, we run multiple samples to get a robust score. In our experiments, we took three forward generations, each followed by one backward generation (i.e., N_f = 3 and N_b = 1), and averaged the similarity scores of the three round-trips. To allow some variability in the forward direction, the LLM is used with a higher temperature (we used temperature 1.0) when generating the first output. This produces diverse phrasings or model structures. The backward direction uses a lower temperature (in fact, 0, to be deterministic) so that, given the intermediate output, the reconstruction is as consistent as possible. We also carefully engineered prompts for each direction (with one-shot examples) to ensure the LLM knows how to format BPMN JSON versus a well-structured textual description. The overall RTC pipeline provides two scores: one from the M2M route (how closely the models match) and one from the T2T route (how closely the texts match).
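To make this concrete, below is a minimal sketch of the M2M pipeline in Python. It assumes the OpenAI Python SDK and GPT-4o; the prompts, the helper names (`generate`, `rtc_m2m`), the simplified JSON handling, and the crude string-overlap `similarity` stub are illustrative placeholders rather than the exact implementation from our repository.

```python
# Minimal sketch of the model -> text -> model (M2M) round trip.
# Assumes the OpenAI Python SDK; prompts and helper names are illustrative only.
import difflib
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def generate(prompt: str, temperature: float) -> str:
    """Single LLM call; the temperature controls how much the output may vary."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def similarity(a: str, b: str) -> float:
    """Placeholder 0-1 similarity (plain string overlap); a semantic metric is sketched later."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def rtc_m2m(model_json: dict, n_forward: int = 3) -> float:
    """Approximate RTC for one model: average over n_forward round trips (N_f = 3, N_b = 1)."""
    scores = []
    for _ in range(n_forward):
        # Forward step: model -> text, sampled at temperature 1.0 for diversity.
        text = generate(
            "Describe this BPMN process in prose:\n" + json.dumps(model_json),
            temperature=1.0,
        )
        # Backward step: text -> model, temperature 0 for a deterministic reconstruction.
        rebuilt = generate(
            "Convert this process description into BPMN JSON:\n" + text,
            temperature=0.0,
        )
        scores.append(similarity(json.dumps(model_json), rebuilt))
    return sum(scores) / len(scores)
```

The T2T pipeline is symmetric: start from a description, generate a model at temperature 1.0, regenerate the description at temperature 0, and compare the two texts.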
Once we have an original and round-tripped pair (either two texts or two models), we need to measure their similarity. Designing a good similarity metric is key to quantifying RTC.
Text-to-Text Similarity: We defined two methods to compare the original description with the round-tripped description: a pure semantic similarity based on sentence embeddings, and a hybrid similarity that additionally rewards preserving the order in which information appears (both are discussed in the results below).
Model-to-Model Similarity: Comparing two BPMN models is tricky, so we implemented two complementary metrics, one of which (Method 2, used in the cross-domain experiments below) focuses on flow correctness, i.e., whether the control flow between model elements is preserved.
For all these similarity calculations, the outputs are normalized to a 0–1 scale (with 1 meaning an identical match). In our implementation, we used a lightweight embedding model (Alibaba-NLP, 2024) to get sentence embeddings efficiently, but any good semantic vector model could serve.
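As an illustration of the text-to-text metrics, the sketch below computes a pure semantic similarity with the sentence-transformers library and the Alibaba-NLP/gte-large-en-v1.5 encoder mentioned above, plus a simple order-aware "hybrid" blend. The sentence splitting, the equal weighting (`alpha`), and the position-wise alignment are simplifying assumptions for illustration, not necessarily the exact hybrid metric from our implementation.

```python
# Sketch of text-to-text similarity: pure semantic vs. an order-aware hybrid.
# Assumes the sentence-transformers library; the hybrid blend is illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)


def semantic_similarity(a: str, b: str) -> float:
    """Pure semantic similarity: cosine of the two text embeddings, clipped to 0..1."""
    emb = encoder.encode([a, b], normalize_embeddings=True)
    return max(0.0, float(util.cos_sim(emb[0], emb[1])))


def hybrid_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Blend whole-text similarity with position-aligned sentence similarity,
    so content that is merely reordered scores lower than content in the same order."""
    sents_a = [s.strip() for s in a.split(".") if s.strip()]
    sents_b = [s.strip() for s in b.split(".") if s.strip()]
    pairs = list(zip(sents_a, sents_b))  # compare sentences position by position
    ordered = sum(semantic_similarity(x, y) for x, y in pairs) / max(len(pairs), 1)
    return alpha * semantic_similarity(a, b) + (1 - alpha) * ordered
```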
We evaluated RTC on two datasets for which ground truth mappings exist, among them the PET dataset (Bellan et al., 2022), to see whether RTC scores reflect actual quality.
We ran both the M2M and T2T pipelines on each dataset using two different LLMs (OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro) and computed the RTC scores. We also measured the traditional accuracy: for M2M, we compared the AI-generated description to the human reference description (text similarity with ground truth); for T2T, we compared the AI-generated model to the human reference model (model similarity with ground truth). Then we checked how well the RTC scores correlate with the ground-truth-based scores across the sample set.
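Checking the agreement between the two kinds of scores boils down to a rank correlation over the per-sample results. Below is a minimal sketch with SciPy, using arbitrary placeholder numbers in place of the scores produced by the pipelines:

```python
# Correlate per-sample RTC scores with ground-truth-based scores (Spearman rank correlation).
from scipy.stats import spearmanr

rtc_scores = [0.81, 0.64, 0.92, 0.55, 0.78]           # RTC per sample, no ground truth needed
ground_truth_scores = [0.84, 0.60, 0.95, 0.58, 0.70]  # similarity to the human reference per sample

rho, p_value = spearmanr(rtc_scores, ground_truth_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```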
The results are encouraging. We observed a positive correlation between the RTC scores (which require no ground truth) and the actual ground-truth evaluation scores on both datasets. In fact, for GPT-4 on the PET dataset, the Spearman correlation coefficient was up to 0.68 (p-value < 0.001) when using the hybrid text similarity. Overall, the M2M pipeline showed a higher correlation with ground-truth evaluations than the text-to-text (T2T) pipeline. In other words, using an original model as the starting point for the round trip gave a more reliable signal of quality than starting from text. This could be because generating text from a model (which tends to be a deterministic, structured task) and then regenerating the model yields less ambiguity than the other way around.
Another clear finding was that the hybrid text similarity metric outperformed pure semantic similarity. The hybrid metric had stronger correlation with the model accuracy measures, indicating that preserving the order of information in the descriptions was important for evaluating model fidelity. Pure semantic similarity sometimes failed to penalize jumbled or rephrased outputs that still “talked about” the right things but in a different sequence.
It is also worth noting that both GPT-4 and Gemini models showed very similar trends and performance on these evaluations. Neither had a clear advantage in the round-trip scores or correlations, and both achieved high RTC correlations in the M2M pipeline.
One concern with generative models is whether they perform equally well on processes from different domains or industries. RTC lets us not just check consistency, but actually rank “domain readiness”: which fields an out-of-the-box LLM can handle well and which require extra tuning. We applied the RTC M2M pipeline to two additional datasets to probe this cross-domain performance: MaD150, a collection of process models spanning a variety of business categories, and a Domains dataset covering GDPR compliance, healthcare, logistics, manufacturing, and tourism.
We ran the model→text→model round-trip on all these models (using both GPT-4o and Gemini 1.5 Pro) and measured the model-to-model similarity scores (using Method 2, which focuses on flow correctness). This lets us see how well each LLM preserves the model across different domains, and whether certain domains are harder.
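The exact formulation of Method 2 is beyond the scope of this post, but the sketch below illustrates the underlying idea of a flow-oriented comparison: reduce each model to its set of “source task → target task” sequence flows and measure their overlap. The dictionary-based model format, the label normalization, and the Jaccard overlap are simplifying assumptions for illustration, not the exact metric from our implementation.

```python
# Illustrative flow-oriented model comparison: treat each (simplified) BPMN model as a
# set of directed "source label -> target label" edges and measure their Jaccard overlap.
def edge_set(model: dict) -> set[tuple[str, str]]:
    """Extract normalized sequence-flow edges from a simplified model representation."""
    labels = {node["id"]: node["label"].strip().lower() for node in model["nodes"]}
    return {(labels[flow["source"]], labels[flow["target"]]) for flow in model["flows"]}


def flow_similarity(original: dict, reconstructed: dict) -> float:
    """Jaccard overlap of the two edge sets: 1.0 means every flow was preserved."""
    a, b = edge_set(original), edge_set(reconstructed)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```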
Consistency across business categories (MaD150): While average RTC scores sat in the mid-80 percent range (≈86 % for GPT-4, ≈85 % for Gemini), it was the evaluation of RTC across categories that revealed where each model excelled or struggled. Manufacturing and finance processes frequently achieved > 90 % fidelity, whereas Customer Support for Ticket Management dipped below 70 % despite its relative simplicity. Even complex workflows (e.g., churn-prevention sequences with up to 19 tasks) could exceed 85 % when terminology was straightforward. GPT-4 produced slightly tighter score distributions than Gemini, but both exhibited outliers on edge-case structures—precisely the cases that merit focused prompt tuning or domain-specific examples.
Relative Performance Across Distinct Domains: In the Domains dataset—covering GDPR compliance, healthcare, logistics, manufacturing, and tourism—we again saw relative strengths and weaknesses rather than flat consistency. Both LLMs averaged around 80 percent similarity, but performance peaked in manufacturing (likely due to its structured, repetitive nature) and troughed in domains like tourism or GDPR, where narrative descriptions and specialized jargon posed greater challenges. The variance between best and worst domains was on the order of 10–15 percent, highlighting where domain adaptation or additional examples could pay off most.
In summary, these cross-domain experiments demonstrate that round-trip evaluation highlights not only absolute fidelity but also relative domain readiness, showing practitioners which industries an LLM handles out of the box and which domains demand more fine-tuning. Rather than expecting a single “consistency” number, you can use RTC to rank domains: prioritize prompt engineering or domain-specific training where fidelity lags, and deploy with confidence in domains where RTC scores are already high.
Introducing Round-Trip Correctness as an evaluation metric opens up several practical possibilities for SAP modelers and developers:
No Ground Truth? With RTC, you can evaluate an LLM-generated model without needing a pre-existing correct model for comparison. This enables continuous evaluation in environments where you are generating models for new process descriptions that haven’t been modeled before. As long as you can, for example, round-trip back to text, you have a self-check mechanism. This is crucial for maintaining evaluation rigor as ground truth datasets are scarce or become part of the LLM’s training data. However, while RTC correlates well with actual ground truth data, it should not be seen as a direct substitute but rather as a good option in case no ground truth data is available. Here the choice of the evaluation metric does matter as discussed above.
In conclusion, Round-Trip Correctness is a promising metric for generative AI in process modeling. It shifts the evaluation burden partly onto the AI itself by testing consistency, and our research indicates it correlates well with traditional evaluation when such data exists. For practitioners, RTC can be a handy tool in the toolbox, enabling more frequent and flexible testing of AI model generators, especially where ground truth is unavailable. By incorporating RTC into development and deployment, we can achieve more reliable AI-generated models and accelerate the adoption of AI in business process management with greater confidence in cases where ground truth is sparse. Whether you’re an SAP workflow expert, a BPMN enthusiast, or an AI researcher, we encourage you to consider round-tripping your next generative model and see what it reveals!
Alibaba-NLP. (2024). gte-large-en-v1.5: A Lightweight Sentence Encoder [Software]. Hugging Face. Retrieved from https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5
Allamanis, M., Panthaplackel, S., & Yin, P. (2024). Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. In International Conference on Machine Learning (ICML). https://doi.org/10.48550/arXiv.2402.08699
Bellan, P., van der Aa, H., Dragoni, M., Ghidini, C., & Ponzetto, S. P. (2022). PET: An Annotated Dataset for Process Extraction from Natural Language Text Tasks. In International Conference on Business Process Management (pp. 315–321). Springer.
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., … & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint arXiv:2107.03374. https://doi.org/10.48550/arXiv.2107.03374
Sola, D., Warmuth, C., Schäfer, B., Badakhshan, P., Rehse, J., & Kampik, T. (2022). SAP Signavio Academic Models: A Large Process Model Dataset. https://doi.org/10.1007/978-3-031-27815-0_33
Somers, H. (2005). Round-Trip Translation: What Is It Good For? In Proceedings of the Australasian Language Technology Workshop 2005 (pp. 127–133).
Voelter, M., Hadian, R., Kampik, T., Breitmayer, M., & Reichert, M. (2024). Leveraging Generative AI for Extracting Process Models from Multimodal Documents. arXiv preprint arXiv:2406.04959. https://doi.org/10.48550/arXiv.2406.04959
Yin, P., Li, W.-D., Xiao, K., … & Sutton, C. (2022). Natural Language to Code Generation in Interactive Data Science Notebooks (the ARCADE benchmark). arXiv preprint arXiv:2212.09248. https://doi.org/10.48550/arXiv.2212.09248