L_Skorwider
Participant

Introduction

When ChatGPT 3.5 was introduced to the world in 2022, I was convinced it was a breakthrough that would change the world - not necessarily version 3.5, but one of the subsequent iterations. And it will. I am fully convinced of this. Now, as we near the end of 2024, two years have passed, and I’ve just finished the proof of concept for my autonomous agent project working in SAP GUI. For a long time, I had wanted to create something like this, and it turned out to be simpler than I expected.

Technology

To be honest, I find the technological aspect a bit dull. It's not the most important part of this project. Nevertheless, it's worth mentioning. The project is built on the popular LangGraph library, an excellent tool for creating AI agents. Previously, I had primarily worked with LangChain, so this was something new for me.

The agent’s behavior is driven by tools I developed. It operates autonomously, taking sequential steps and analyzing the results. It can perform several actions in a row, such as navigating to a transaction, filling out a form, clicking a button, or switching tabs. What’s more, it can handle multiple transactions within the same task. Once it obtains a final result and can respond to the user, it concludes its operations.
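The loop described above can be sketched in plain Python. This is only an illustration of the act-observe-decide cycle, not the project's actual LangGraph graph: `run_agent`, `AgentState`, the `(tool_name, argument)` action format, and the scripted policy that stands in for the LLM are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical action format: (tool_name, argument). "finish" ends the run.
Action = Tuple[str, str]


@dataclass
class AgentState:
    task: str
    history: List[str] = field(default_factory=list)  # one entry per executed step
    answer: str = ""


def run_agent(task: str,
              decide: Callable[[AgentState], Action],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 10) -> AgentState:
    """Ask `decide` for the next action, run the tool, record the observation."""
    state = AgentState(task=task)
    for _ in range(max_steps):
        tool_name, argument = decide(state)
        if tool_name == "finish":
            # The decision step has a final result and can answer the user.
            state.answer = argument
            break
        observation = tools[tool_name](argument)
        state.history.append(f"{tool_name}({argument!r}) -> {observation}")
    return state


def scripted_policy(state: AgentState) -> Action:
    """Stand-in for the LLM: follows a fixed plan, then finishes."""
    plan = [("open_transaction", "SM50"), ("read_screen", "")]
    if len(state.history) < len(plan):
        return plan[len(state.history)]
    return ("finish", f"Finished after {len(state.history)} steps")


# Dummy tools that only return strings; real tools would drive SAP GUI.
demo_tools = {
    "open_transaction": lambda code: f"opened {code}",
    "read_screen": lambda _: "work process list",
}

result = run_agent("Check the work processes", scripted_policy, demo_tools)
print(result.answer)
```

In a real agent the `decide` step would be a model call that sees the task and the history, and `max_steps` guards against a run that never reaches a final answer.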

In this project, I used GPT-4o as the large language model. This is a multimodal model, allowing it to analyze images as well. However, I tried to limit image analysis as much as possible, as it generates costs and can be relatively slow. That said, it remains incredibly versatile. For some tasks, the agent can complete actions entirely without screenshots.

In my opinion, the agent is lightning fast, especially for steps that do not require image analysis. You can judge this for yourself in the attached video.

Project Development

What I truly wanted to discuss is the progress and outcomes of this project. Initially, I assumed that to make the agent functional, I would need to build a substantial knowledge base. I planned to prepare documents outlining step-by-step instructions for various operations - for instance, how to check runtime errors in the system or what buttons to press to create a user. This seemed like a rational and necessary approach at the time.

The first stage of the project involved creating tools for the SAP GUI AI Agent. I assumed that the ability to fill text fields, press buttons, and switch tabs would be necessary. This was enough to begin as a proof of concept. While this doesn’t enable full operation of the SAP GUI interface, it suffices for basic tasks. I also added the ability to take screenshots since the agent needs to "look around" and analyze results.
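As a rough idea of what such tools might look like, here is a minimal sketch built on the SAP GUI Scripting calls `findById`, `press`, `select`, and `sendVKey`. The function names, the `FakeSession` and `FakeElement` classes, and the element IDs in the demo are assumptions made for illustration; the fakes only record calls so the example runs without SAP GUI, and they are not how the actual project is implemented.

```python
def start_transaction(session, tcode):
    """Type a transaction code into the command field and press Enter."""
    session.findById("wnd[0]/tbar[0]/okcd").text = f"/n{tcode}"
    session.findById("wnd[0]").sendVKey(0)  # virtual key 0 is Enter


def set_text(session, element_id, value):
    """Fill a text field identified by its scripting ID."""
    session.findById(element_id).text = value


def press_button(session, element_id):
    """Press a button identified by its scripting ID."""
    session.findById(element_id).press()


def select_tab(session, element_id):
    """Switch to a tab identified by its scripting ID."""
    session.findById(element_id).select()


class FakeElement:
    """Hypothetical stand-in for a GUI element; logs calls instead of acting."""

    def __init__(self, element_id, log):
        self.element_id = element_id
        self.log = log
        self.text = ""

    def press(self):
        self.log.append(f"press {self.element_id}")

    def select(self):
        self.log.append(f"select {self.element_id}")

    def sendVKey(self, key):
        self.log.append(f"vkey {key} on {self.element_id}")


class FakeSession:
    """Hypothetical stand-in for a scripting session, for dry runs only."""

    def __init__(self):
        self.log = []
        self.elements = {}

    def findById(self, element_id):
        # Return the same fake element for the same ID so its text persists.
        if element_id not in self.elements:
            self.elements[element_id] = FakeElement(element_id, self.log)
        return self.elements[element_id]


session = FakeSession()
start_transaction(session, "SM50")
press_button(session, "wnd[0]/tbar[1]/btn[8]")  # illustrative button ID
print(session.log)
```

Because every tool takes the session as a parameter, the same functions could be pointed at a real scripting session (obtained, for example, through a COM binding on Windows) or at a fake one for testing.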

Next, I planned to work on a sophisticated prompt containing detailed instructions on how to interact with the interface and execute specific tasks.

A Surprise

I launched the agent with a very basic prompt and asked it to check processes in the system. Imagine my surprise when I received a response. It felt a bit like setting up a complex interface, placing someone in front of it, and saying, “Work.” And that person, without hesitation, starts using the interface as if they’re already familiar with it. Of course, I understand that large language models possess extensive knowledge and thus are familiar with most SAP transactions. But sometimes, the ability of these models to connect the dots is truly astonishing.

Where Is This Headed?

I am fully aware that this is a very early stage of the project. Let’s not fool ourselves; the tasks I assigned to my agent are not senior-level. However, consider the pace at which artificial intelligence is advancing. What wasn’t possible a few months ago is now entirely natural. Smaller and smaller models are becoming increasingly intelligent. The progress is lightning-fast. Could my agent, a year from now, be working in a Service Desk and performing as effectively as a human? And over time, could it also replace SAP Consultants?

Of course, SAP GUI is not the ideal interface for a machine. At the moment, we’re teaching artificial intelligence to use the interfaces we use ourselves. We’re adapting it to fit into our world. This is evident in the case of humanoid robots. They are designed to resemble humans so they can operate in the same environments. But is this truly optimal? Is this really the best form for a robot? Perhaps it’s just a transitional phase. The same applies to agents using interfaces we use. In the future, it’s likely that agents will communicate through APIs or, in the case of SAP, via RFC. And we’ll understand less and less of it.

Do you already feel the breath of an Autonomous SAP Consultant on your neck? Share your thoughts in the comments, and don’t forget to check out the video!

6 Comments
L_Skorwider
Participant

There is a new version of the SAP GUI AI Agent. You can see a presentation here:

BharathReddyGoli
Explorer

It is an amazing attempt to showcase what is possible

L_Skorwider
Participant

Thanks. I'll deliver more in my spare time. 🙂

DJ_ISU
Explorer

It's amazing! Suddenly I'm feeling bad, because I have only managed to train an LLM to assist with functional and technical questions. It's amazing what you have achieved!

LeandroRibeiro
Participant

What a very nice video!

Could you comment (technically) on the tools used by the Agent?

 

L_Skorwider
Participant

Hi @LeandroRibeiro 

SAP GUI is, on one hand, a very specific interface, but on the other hand, it's quite typical. If you think about it more closely, you'll find that it shares many common elements with web pages. You have a transaction name that serves a similar role to a URL for web pages. You have elements like buttons and text fields. Additionally, there are, of course, menus and various other elements like dropdown menus within the page content. So it's natural that the tools are somewhat similar to those used for managing websites with artificial intelligence.

The entire challenge of creating an agent lies in preparing a type of "web scraping" that works with the SAP GUI. In the case of the improved version, I was able to extract more information using just text. However, in the first video, you can see that screenshots played a bigger role. For now, screenshots are used where I don't yet support native elements.

As I mentioned in the first post, the whole thing is based on LangGraph and uses SAP GUI Scripting both for handling the interface and for reading the content of the page. The rest is done by the LLM.
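To make the "web scraping" idea more concrete, here is a minimal sketch of turning a screen into text by walking the element tree. It assumes each element exposes `Type`, `Id`, and `Text`, and that containers expose `Children` with `Count` and `ElementAt`, as in the SAP GUI Scripting object model. The `describe_screen` function and the fake classes are hypothetical; the sketch has only been exercised against these fakes, not against a live SAP GUI session.

```python
def describe_screen(component, depth=0):
    """Return one text line per GUI element: indentation, type, ID, and text."""
    text = getattr(component, "Text", "")
    lines = [f"{'  ' * depth}{component.Type} {component.Id} {text!r}"]
    children = getattr(component, "Children", None)
    if children is not None:
        for i in range(children.Count):
            lines.extend(describe_screen(children.ElementAt(i), depth + 1))
    return lines


class FakeChildren:
    """Hypothetical stand-in for a collection of child elements."""

    def __init__(self, items):
        self._items = items

    @property
    def Count(self):
        return len(self._items)

    def ElementAt(self, index):
        return self._items[index]


class FakeComponent:
    """Hypothetical stand-in for a GUI element, used only for this demo."""

    def __init__(self, type_name, element_id, text="", children=None):
        self.Type = type_name
        self.Id = element_id
        self.Text = text
        if children is not None:
            self.Children = FakeChildren(children)


window = FakeComponent("GuiMainWindow", "wnd[0]", "Process Overview", [
    FakeComponent("GuiOkCodeField", "wnd[0]/tbar[0]/okcd"),
    FakeComponent("GuiButton", "wnd[0]/tbar[1]/btn[8]", "Refresh"),
])

screen_lines = describe_screen(window)
print("\n".join(screen_lines))
```

A text dump like this can be passed to the LLM instead of a screenshot, which is cheaper and faster whenever the relevant elements are readable as text.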

BR
