Introduction
When ChatGPT, then running on GPT-3.5, was introduced to the world in 2022, I was convinced it was a breakthrough that would change the world - not necessarily that first version, but one of the subsequent iterations. And it will. I am fully convinced of this. Now, as 2024 draws to a close, two years have passed, and I have just finished the proof of concept for my autonomous agent project working in SAP GUI. I had wanted to build something like this for a long time, and it turned out to be simpler than I expected.
Technology
To be honest, I find the technological aspect a bit dull. It's not the most important part of this project. Nevertheless, it's worth mentioning. The project is built on the popular LangGraph library, an excellent tool for creating AI agents. Previously, I had primarily worked with LangChain, so this was something new for me.
The agent’s behavior is driven by tools I developed. It operates autonomously, taking sequential steps and analyzing the results. It can perform several actions in a row, such as navigating to a transaction, filling out a form, clicking a button, or switching tabs. What’s more, it can handle multiple transactions within the same task. Once it obtains a final result and can respond to the user, it concludes its operations.
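To give a rough idea of what this wiring can look like, here is a minimal sketch using LangGraph's prebuilt ReAct agent. It is not the project's actual code: the tool names and their bodies are illustrative placeholders, and the real SAP GUI calls are stubbed out.

```python
# Minimal sketch, assuming LangGraph's prebuilt ReAct agent and placeholder tools.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def open_transaction(tcode: str) -> str:
    """Navigate SAP GUI to the given transaction code, e.g. SM50 or ST22."""
    # Placeholder: the real implementation would drive the SAP GUI here.
    return f"Transaction {tcode} opened."


@tool
def press_button(element_id: str) -> str:
    """Press the button identified by its SAP GUI element id."""
    return f"Button {element_id} pressed."


@tool
def take_screenshot() -> str:
    """Capture the current SAP GUI window for visual analysis."""
    return "screenshot.png"


llm = ChatOpenAI(model="gpt-4o")
agent = create_react_agent(llm, [open_transaction, press_button, take_screenshot])

# The agent decides on its own which tools to call, in what order, and when
# it has gathered enough information to answer the user.
result = agent.invoke(
    {"messages": [("user", "Check the work processes in the system.")]}
)
print(result["messages"][-1].content)
```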
In this project, I used GPT-4o as the large language model. It is a multimodal model, so it can analyze images as well. However, I tried to limit image analysis as much as possible, since it adds cost and can be relatively slow. That said, the capability is incredibly versatile, and for some tasks the agent can complete its work entirely without screenshots.
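When the agent does need to "look" at the screen, a screenshot can be passed to GPT-4o as a base64-encoded image. The snippet below shows the standard LangChain multimodal message format; the file name and prompt are made up for illustration and nothing here is specific to my project.

```python
# Sketch of on-demand image analysis with GPT-4o, assuming a screenshot has
# already been saved to disk. Most steps can rely on cheaper text-only calls.
import base64

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(
    content=[
        {"type": "text", "text": "What does the status bar of this SAP screen say?"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]
)
print(llm.invoke([message]).content)
```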
In my opinion, the agent is lightning fast, especially for steps that do not require image analysis. You can judge this for yourself in the attached video.
Project Development
What I truly wanted to discuss is the progress and outcomes of this project. Initially, I assumed that to make the agent functional, I would need to build a substantial knowledge base. I planned to prepare documents outlining step-by-step instructions for various operations - for instance, how to check runtime errors in the system or what buttons to press to create a user. This seemed like a rational and necessary approach at the time.
The first stage of the project involved creating tools for the SAP GUI AI Agent. I assumed that the ability to fill text fields, press buttons, and switch tabs would be necessary. This was enough to begin as a proof of concept. While this doesn’t enable full operation of the SAP GUI interface, it suffices for basic tasks. I also added the ability to take screenshots since the agent needs to "look around" and analyze results.
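On Windows, one common way to implement tools like these is the SAP GUI Scripting API driven through pywin32. The sketch below is only an assumption about how a fill-field or button-press tool could look, not my actual implementation; scripting has to be enabled on both the client and the server, and the helper functions are illustrative.

```python
# Hedged sketch using the SAP GUI Scripting API via pywin32 (Windows only).
# Element ids such as "wnd[0]/tbar[0]/okcd" are standard scripting paths.
import win32com.client


def get_session():
    """Attach to the first session of the first open SAP GUI connection."""
    sap_gui = win32com.client.GetObject("SAPGUI")
    application = sap_gui.GetScriptingEngine
    connection = application.Children(0)
    return connection.Children(0)


def set_text(session, element_id: str, value: str) -> None:
    """Fill a text field identified by its SAP GUI element id."""
    session.findById(element_id).text = value


def press_button(session, element_id: str) -> None:
    """Press a button identified by its SAP GUI element id."""
    session.findById(element_id).press()


session = get_session()
set_text(session, "wnd[0]/tbar[0]/okcd", "/nSM50")  # command field
session.findById("wnd[0]").sendVKey(0)              # VKey 0 = Enter
```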
Next, I planned to work on a sophisticated prompt containing detailed instructions on how to interact with the interface and execute specific tasks.
A Surprise
I launched the agent with a very basic prompt and asked it to check processes in the system. Imagine my surprise when I received a response. It felt a bit like setting up a complex interface, placing someone in front of it, and saying, “Work.” And that person, without hesitation, starts using the interface as if they’re already familiar with it. Of course, I understand that large language models possess extensive knowledge and thus are familiar with most SAP transactions. But sometimes, the ability of these models to connect the dots is truly astonishing.
Where Is This Headed?
I am fully aware that this is a very early stage of the project. Let’s not fool ourselves; the tasks I assigned to my agent are not senior-level. However, consider the pace at which artificial intelligence is advancing. What wasn’t possible a few months ago is now entirely natural. Smaller and smaller models are becoming increasingly intelligent. The progress is lightning-fast. Could my agent, a year from now, be working in a Service Desk and performing as effectively as a human? And over time, could it also replace SAP Consultants?
Of course, SAP GUI is not the ideal interface for a machine. At the moment, we’re teaching artificial intelligence to use the interfaces we use ourselves. We’re adapting it to fit into our world. This is evident in the case of humanoid robots. They are designed to resemble humans so they can operate in the same environments. But is this truly optimal? Is this really the best form for a robot? Perhaps it’s just a transitional phase. The same applies to agents using interfaces we use. In the future, it’s likely that agents will communicate through APIs or, in the case of SAP, via RFC. And we’ll understand less and less of it.
Do you already feel an Autonomous SAP Consultant breathing down your neck? Share your thoughts in the comments, and don’t forget to check out the video!