Artificial Intelligence and Machine Learning Blogs
Explore AI and ML blogs. Discover use cases, advancements, and the transformative potential of AI for businesses. Stay informed of trends and applications.
L_Skorwider
Active Participant
2,343

Introduction

When ChatGPT 3.5 was introduced to the world in 2022, I was convinced it was a breakthrough that would change the world - not necessarily version 3.5, but one of the subsequent iterations. And it will. I am fully convinced of this. Now, as we near the end of 2024, two years have passed, and I’ve just finished the proof of concept for my autonomous agent project working in SAP GUI. For a long time, I had wanted to create something like this, and it turned out to be simpler than I expected.

Technology

To be honest, I find the technological aspect a bit dull. It's not the most important part of this project. Nevertheless, it's worth mentioning. The project is built on the popular LangGraph library, an excellent tool for creating AI agents. Previously, I had primarily worked with LangChain, so this was something new for me.

The agent’s behavior is driven by tools I developed. It operates autonomously, taking sequential steps and analyzing the results. It can perform several actions in a row, such as navigating to a transaction, filling out a form, clicking a button, or switching tabs. What’s more, it can handle multiple transactions within the same task. Once it obtains a final result and can respond to the user, it concludes its operations.
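The act-observe-decide cycle described above can be sketched in plain Python. This is only an illustration, not the project's code: `run_agent`, `scripted_policy`, and the tool names are hypothetical, and the scripted policy stands in for the LLM (in the real project that decision is made by the model through LangGraph). Transaction SM50 is used purely as an example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# An action is ("tool", tool_name, argument) or ("final", answer, "").
Action = Tuple[str, str, str]


@dataclass
class AgentState:
    task: str
    observations: List[str] = field(default_factory=list)


def run_agent(
    task: str,
    decide: Callable[[AgentState], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """Call tools one step at a time until the policy returns a final answer."""
    state = AgentState(task)
    for _ in range(max_steps):
        kind, name, arg = decide(state)
        if kind == "final":
            return name  # for a final action, the second slot holds the answer
        tool = tools.get(name)
        result = tool(arg) if tool else f"error: unknown tool {name}"
        # Each result is fed back so the next decision can take it into account.
        state.observations.append(f"{name}({arg!r}) -> {result}")
    return "Stopped: step limit reached."


def scripted_policy(state: AgentState) -> Action:
    """Hypothetical stand-in for the LLM: follows a fixed two-step plan."""
    plan = [("tool", "open_transaction", "SM50"), ("tool", "read_screen", "")]
    step = len(state.observations)
    if step < len(plan):
        return plan[step]
    return ("final", f"Done after {step} steps.", "")


demo_tools = {
    "open_transaction": lambda code: f"opened {code}",
    "read_screen": lambda _: "3 work processes running",
}

answer = run_agent("Check processes in the system", scripted_policy, demo_tools)
print(answer)
```

The step limit is a design choice of this sketch: it keeps a confused policy from looping forever.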

In this project, I used GPT-4o as the large language model. This is a multimodal model, allowing it to analyze images as well. However, I tried to limit image analysis as much as possible, as it generates costs and can be relatively slow. That said, it remains incredibly versatile. For some tasks, the agent can complete actions entirely without screenshots.

In my opinion, the agent is lightning fast, especially for steps that do not require image analysis. You can judge this for yourself in the attached video.

Project Development

What I truly wanted to discuss is the progress and outcomes of this project. Initially, I assumed that to make the agent functional, I would need to build a substantial knowledge base. I planned to prepare documents outlining step-by-step instructions for various operations - for instance, how to check runtime errors in the system or what buttons to press to create a user. This seemed like a rational and necessary approach at the time.

The first stage of the project involved creating tools for the SAP GUI AI Agent. I assumed that the ability to fill text fields, press buttons, and switch tabs would be necessary. This was enough to begin as a proof of concept. While this doesn’t enable full operation of the SAP GUI interface, it suffices for basic tasks. I also added the ability to take screenshots since the agent needs to "look around" and analyze results.
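A minimal sketch of what such a tool set might look like. Everything here is an assumption made for illustration: `FakeSession` replaces a live SAP GUI session so the example runs anywhere, the element IDs are invented, and the function names are not taken from the project. Each tool returns a short status string, since the model can only react to what it is told.

```python
from typing import Dict, List


class FakeSession:
    """Stand-in for a live SAP GUI session; element state lives in a dict."""

    def __init__(self, elements: Dict[str, Dict[str, str]]):
        self.elements = elements
        self.pressed: List[str] = []
        self.active_tab = ""


def _find(session: FakeSession, element_id: str, expected_type: str):
    element = session.elements.get(element_id)
    if element is None or element["type"] != expected_type:
        return None
    return element


def set_text(session: FakeSession, element_id: str, value: str) -> str:
    element = _find(session, element_id, "text_field")
    if element is None:
        return f"error: no text field with id {element_id}"
    element["text"] = value
    return f"ok: {element_id} set to {value!r}"


def press_button(session: FakeSession, element_id: str) -> str:
    if _find(session, element_id, "button") is None:
        return f"error: no button with id {element_id}"
    session.pressed.append(element_id)
    return f"ok: pressed {element_id}"


def select_tab(session: FakeSession, element_id: str) -> str:
    if _find(session, element_id, "tab") is None:
        return f"error: no tab with id {element_id}"
    session.active_tab = element_id
    return f"ok: switched to {element_id}"


def take_screenshot(session: FakeSession) -> str:
    # Placeholder: a real tool would capture the window and attach the image.
    return "screenshot: <image would be attached here>"


# Hypothetical element IDs, for illustration only.
session = FakeSession({
    "wnd[0]/usr/txtUSER": {"type": "text_field", "text": ""},
    "wnd[0]/tbar[1]/btn[8]": {"type": "button", "text": "Execute"},
    "wnd[0]/usr/tabsMAIN/tabpLOGS": {"type": "tab", "text": "Logs"},
})
results = [
    set_text(session, "wnd[0]/usr/txtUSER", "JSMITH"),
    press_button(session, "wnd[0]/tbar[1]/btn[8]"),
    select_tab(session, "wnd[0]/usr/tabsMAIN/tabpLOGS"),
    press_button(session, "wnd[0]/usr/btnMISSING"),
]
print("\n".join(results))
```

Returning errors as text instead of raising exceptions lets the agent see the failure and try something else.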

Next, I planned to work on a sophisticated prompt containing detailed instructions on how to interact with the interface and execute specific tasks.

A Surprise

I launched the agent with a very basic prompt and asked it to check processes in the system. Imagine my surprise when I received a response. It felt a bit like setting up a complex interface, placing someone in front of it, and saying, “Work.” And that person, without hesitation, starts using the interface as if they’re already familiar with it. Of course, I understand that large language models possess extensive knowledge and thus are familiar with most SAP transactions. But sometimes, the ability of these models to connect the dots is truly astonishing.

Where Is This Headed?

I am fully aware that this is a very early stage of the project. Let’s not fool ourselves; the tasks I assigned to my agent are not senior-level. However, consider the pace at which artificial intelligence is advancing. What wasn’t possible a few months ago is now entirely natural. Smaller and smaller models are becoming increasingly intelligent. The progress is lightning-fast. Could my agent, a year from now, be working in a Service Desk and performing as effectively as a human? And over time, could it also replace SAP Consultants?

Of course, SAP GUI is not the ideal interface for a machine. At the moment, we’re teaching artificial intelligence to use the interfaces we use ourselves. We’re adapting it to fit into our world. This is evident in the case of humanoid robots. They are designed to resemble humans so they can operate in the same environments. But is this truly optimal? Is this really the best form for a robot? Perhaps it’s just a transitional phase. The same applies to agents using interfaces we use. In the future, it’s likely that agents will communicate through APIs or, in the case of SAP, via RFC. And we’ll understand less and less of it.

Do you already feel the breath of an Autonomous SAP Consultant on your neck? Share your thoughts in the comments, and don’t forget to check out the video!

13 Comments
L_Skorwider
Active Participant

There is a new version of the SAP GUI AI Agent. You can see a presentation here:

BharathReddyGoli
Explorer

It is an amazing attempt to showcase what is possible

L_Skorwider
Active Participant

Thanks. I'll deliver more in my spare time. 🙂

DJ_ISU
Explorer
0 Kudos

It's amazing! Suddenly I'm feeling bad, because I only managed to train an LLM to assist with functional and technical questions. What you have achieved is amazing!

LeandroRibeiro
Participant
0 Kudos

What a very nice video!

Could you comment (technically) on the tools used by the Agent?

 

L_Skorwider
Active Participant

Hi @LeandroRibeiro 

SAP GUI is, on one hand, a very specific interface, but on the other hand, it's quite typical. If you think about it more closely, you'll find that it shares many common elements with web pages. You have a transaction name that serves a similar role to a URL for web pages. You have elements like buttons and text fields. Additionally, there are, of course, menus and various other elements like dropdown menus within the page content. So it's natural that the tools are somewhat similar to those used for managing websites with artificial intelligence.
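To make the web page analogy concrete, here is a small sketch of a page-like model of a screen: the transaction code plays the role of the URL, and the elements form a tree much like a DOM. The `Element` class, the `describe_screen` function, and the element IDs are assumptions made for illustration, not the project's actual data model. SU01 is used only as an example transaction.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    kind: str            # e.g. "window", "button", "text_field"
    element_id: str      # path-like ID, comparable to a CSS selector
    label: str = ""
    children: List["Element"] = field(default_factory=list)


def render(element: Element, depth: int = 0) -> List[str]:
    """Flatten the element tree into indented text lines."""
    line = f"{'  ' * depth}[{element.kind}] {element.element_id}"
    if element.label:
        line += f" '{element.label}'"
    lines = [line]
    for child in element.children:
        lines.extend(render(child, depth + 1))
    return lines


def describe_screen(transaction: str, root: Element) -> str:
    """Build the text description that would be handed to the model."""
    return "\n".join([f"transaction: {transaction}"] + render(root))


# Hypothetical screen contents, for illustration only.
root = Element("window", "wnd[0]", "User Maintenance", [
    Element("text_field", "wnd[0]/usr/ctxtUSER", "User"),
    Element("button", "wnd[0]/tbar[1]/btn[8]", "Create"),
])
text = describe_screen("SU01", root)
print(text)
```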

The entire challenge of creating an agent lies in preparing a type of "web scraping" that works with the SAP GUI. In the case of the improved version, I was able to extract more information using just text. However, in the first video, you can see that screenshots played a bigger role. For now, screenshots are used where I don't yet support native elements.
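The text-first approach with a screenshot fallback could be expressed roughly like this. It is a sketch under assumptions: the set of supported element kinds and the function name `choose_observation` are invented for the example and do not reflect which elements the agent actually supports.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: element kinds the text extraction can already describe.
SUPPORTED_KINDS: Set[str] = {"window", "text_field", "button", "tab", "menu"}


def choose_observation(kinds: Iterable[str]) -> Tuple[str, List[str]]:
    """Return ("text", []) when every element can be read as text,
    otherwise ("screenshot", [unsupported kinds])."""
    unsupported = sorted(set(kinds) - SUPPORTED_KINDS)
    mode = "screenshot" if unsupported else "text"
    return mode, unsupported


simple_screen = choose_observation(["window", "button", "text_field"])
complex_screen = choose_observation(["window", "grid", "button", "tree"])
print(simple_screen)
print(complex_screen)
```

Deciding per screen keeps the slower and more expensive image analysis for the cases where the text alone would leave gaps.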

As I mentioned in the first post, the whole thing is based on LangGraph and uses SAP GUI scripting both for handling the interface and for reading the content of the screen. The rest is done by the LLM.

BR

MarioAndreschak
Discoverer
0 Kudos

Great stuff! Is this available on GitHub? I am most interested in how you interact with the SAP GUI through GUI scripting. Are there any plans to enable models running locally (e.g. Ollama)?

L_Skorwider
Active Participant

Hi,

No, I haven't published this project anywhere yet. I have a few ideas for its development and for changing the approach, but SAP somewhat disrupted my plans here due to the lack of support for PyRFC.

I am very eager to test the agent with other models, and I plan to use DeepSeek. For now I don't have time for that, as I am currently developing another project, but I will definitely come back to this, especially since I also promised a summary of the costs and a deeper look into how it works from the inside.

BR

thomas_mller13
Participant
0 Kudos

Interesting. Did I get this correctly: you didn't use recorded scripts to retrain the LLM? Would it be possible to retrain the model with the help of recorded scripts in order to succeed in more specific situations, e.g., if one wants to perform a series of postings automatically or run automated tests?

Thanks and regards,

Thomas

 

L_Skorwider
Active Participant
0 Kudos

Hi,

No, the model was not additionally trained in any way, nor was it given any ready-made scripts to execute. All it has is the ability to control individual actions in SAP GUI, such as pressing buttons, filling in fields, or selecting options from a menu.

I think with your approach it would be possible to substitute ready-made scripts as tools. Then, automatic tests would probably be the most efficient.
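As a rough sketch of that idea, a recorded script could be wrapped as a single parameterized tool. The names `make_script_tool` and `record`, the step format, and the element IDs are all hypothetical, and FB50 is only an example transaction. In this sketch, `execute` is whatever function actually drives the GUI; here it just records the calls.

```python
from typing import Callable, Dict, List, Tuple

# One recorded step: (action, element_id, value); the value may hold
# placeholders such as "{amount}" that are filled in at call time.
Step = Tuple[str, str, str]


def make_script_tool(
    steps: List[Step],
    execute: Callable[[str, str, str], None],
) -> Callable[[Dict[str, str]], str]:
    """Wrap a recorded step sequence as one tool the agent can call."""

    def tool(parameters: Dict[str, str]) -> str:
        for action, element_id, value in steps:
            execute(action, element_id, value.format(**parameters))
        return f"replayed {len(steps)} steps"

    return tool


calls: List[Step] = []


def record(action: str, element_id: str, value: str) -> None:
    # Stand-in executor: stores the calls instead of touching a real GUI.
    calls.append((action, element_id, value))


# Hypothetical recording of a posting; IDs are for illustration only.
posting_steps: List[Step] = [
    ("open_transaction", "", "FB50"),
    ("set_text", "wnd[0]/usr/txtAMOUNT", "{amount}"),
    ("press", "wnd[0]/tbar[0]/btn[11]", ""),
]
post_document = make_script_tool(posting_steps, record)
result = post_document({"amount": "100.00"})
print(result)
```

With a tool like this, the model only chooses when to run the script and with which parameters, which should make repeated postings or test runs more predictable.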

On the other hand, a lot can also be achieved by simply giving the model detailed instructions on what to do. Note that in my videos, these instructions are quite general, yet the language model manages to use tools on its own. I usually don't even suggest which transaction to use, and it chooses on its own. However, ready instructions could be provided on how to behave in specific cases.

 

L_Skorwider
Active Participant
0 Kudos

Hi All,

Here's a question for everyone. I have to admit, I've been a bit distracted with other activities and projects, so I haven't focused on this one as much as I should have. But I did promise to make another video. What are you most interested in? Would you like to see the program code, learn about the approach, or maybe get a deeper understanding of how the agent uses tools? For instance, we could explore individual prompts in LangSmith. Which of these do you find most interesting?

L

DJ_ISU
Explorer
0 Kudos

Hi,

First of all, thanks for the interesting videos you have already provided.

I would be interested in both topics. In order of priority:

  1. A deeper understanding of how the agent uses tools
  2. The program code

945_500_4055
Explorer
0 Kudos

Hi,

Honestly, I would like to know everything about it. 😀

  1. A deeper understanding of how the agent uses tools
  2. The program code
  3. Whether you have tried different versions of SAP GUI