Artificial Intelligence and Machine Learning Blogs
Explore AI and ML blogs. Discover use cases, advancements, and the transformative potential of AI for businesses. Stay informed of trends and applications.
cancel
Showing results for 
Search instead for 
Did you mean: 
L_Skorwider
Active Participant
3,212

AI Agents

Artificial Intelligence Agents are autonomous programs designed to perform tasks intelligently without human intervention. Leveraging advancements in large language models, these agents can understand and generate human-like text, enabling them to tackle complex challenges across various domains. I think it's worth saying that most agents are also able to use the tools provided to them by the developer. As a result, agents often have specializations that frequently depend on the tools they're able to use.

For example, there are agents that can do a research for you, manage your web browser, set up appointments, or even trade cryptocurrencies for you. The possibilities are essentially limitless, and everything depends on the intelligence of the large language model and the tools with which the agent has been equipped.

To clarify, let's compare agents to popular chat systems. In a typical chat, when a user asks a question, the AI responds, and then it's back to the user for the next input. This back-and-forth creates a conversation. An agent, on the other hand, usually differs as it can execute multiple tasks on its own before returning to the user. This means it can accomplish more and utilize various tools before getting back to the user.

This is just one way to understand the broad concept of an "AI agent." However, for this post, we'll go with this definition.

SAP GUI AI Agent

SAP GUI AI Agent is a program designed to interact with the graphical user interface of the SAP system. It can handle a wide variety of tasks, even those it hasn't been explicitly programmed to perform in advance. This is made possible by the vast knowledge that large language models have.

The agent was created in November 2024, and its functionality was showcased on the YouTube channel Subdalf. The inspiration for its creation came from the occasional solutions capable of using a web browser. I thought that if it's possible to use artificial intelligence to navigate the interface for browsing web pages, then it should also be possible to operate in SAP GUI.

The agent was developed in November 2024 and its capabilities were showcased on the Sapdalf's YouTube channel. The idea came from innovative projects that use a web browser. I figured if AI can navigate web interfaces, then it should be able to work with SAP GUI too.

In the original version, the agent was given a rather limited number of tools, yet it operated surprisingly efficiently. This aroused the interest of many people in the SAP community and encouraged me to further develop the program. I was genuinely surprised by how efficiently the agent behaved once it had the right tools. I hadn't anticipated such performance, as I mentioned in my previous blog post.

In December 2024, I released a video showcasing an upgraded version of the agent. I added new tools that expanded its capabilities, allowing it to create programs in ABAP. This version was also more cost-effective because I worked on reducing the need for costly screenshot analyses in favor of text analysis. I also promised to dive back into the technical details of this project. So here we are with today's post.

 

Agent Architecture

AI

To truly appreciate the capabilities of the SAP AI agent, it's essential to understand the key components and frameworks that shape its architecture. At its core, the agent is built using Python, a widely recognized and versatile programming language in the AI domain. Python's extensive ecosystem and supportive libraries make it a natural choice for developing AI tools like this one.

If Python is the undisputed leader in AI solutions, then LangChain is the go-to library for building LLM-based projects. LangChain simplifies the process of integrating AI models, particularly for conversational and language-heavy tasks. Complementing this is LangGraph, an extension tailored for creating agents that offers developers a balance between simplicity and precise control. There are currently many frameworks, such as Smolagents, that offer greater simplicity but less control. Here, control over operations is very important, which is why it is worth using a well-established solution.

What are libraries without a large language model, though? There's a wide array of choices out there, but I tend to stick with GPT-4o, as it's reasonably priced and offers impressive capabilities. I'm also thinking about checking out other models like o3 Mini. With the LangChain library, switching models, no matter the provider, should be a breeze.

SAP

When discussing architecture, it's crucial to explain how the agent connects to SAP. Here, a mechanism called SAP Scripting comes in handy, allowing the use of SAP graphical user interface elements from external programs. 

The Component Object Model (COM) is a technology developed by Microsoft that enables different applications to communicate and share functionalities. In the context of integrating Python with SAP GUI, COM allows for the automation of interactions with the SAP user interface. Thanks to the Python libraries  pywin32 and pythoncom, it is possible to leverage the COM mechanism to control SAP GUI, enabling our AI agent to use various tools. This allows the agent to navigate between transactions, enter data, and interact with different user interface elements.

Clear division

To maintain greater transparency, I decided to separate all operations related to communication with the SAP GUI and encapsulate them in a separate class called SAPAutomation. This class, apart from the basic methods that enable logging into the system and session handling, allows for performing basic operations on the interface, such as pressing a buttons, selecting options from the menu, or filling out text fields.

In the main part, however, I focus on building the agent's functionality based on the LangGaph framework. Tools are the bridge connecting these two worlds. That is why the main module includes functions that invoke specific actions on the interface, which are in fact wrappers for the methods of the SAPAutomation class. These functions, along with a clear description, are passed to the LLM as an array of tools.

At first glance, the interface seems straightforward, and dealing with elements like buttons, text fields, or menus isn't too challenging. However, over time, you'll realize that there’s actually a rich variety of interface elements within SAP, contrary to what you might initially think. To keep everything running smoothly and make the agent work efficiently, it's crucial to implement as many interface elements as you can. Naturally, thanks to its intelligence, AI can still accomplish a lot even with limited resources, as we saw in the first video.

LangGraph

LangGraph lets you create complex agent structures, but it's often more effective to keep things simple. Therefore, I opted for a very simple construction, centered around two nodes - reasoner and tools. The loop between the reasoner and tools is executed multiple times during the agent's operation. When the agent determines that no further operations need to be performed using the tools, it's a clear sign that the final answer can be generated, returned to the user, and the program can be terminated.

graph.png

 

 

If you are interested in a brief code review of the program, I encourage you to watch the video attached to the post. There, I show and discuss the most important elements used in the program.

 

Tracing agent activity

The LangChain ecosystem provides many useful tools. One of them is LangSmith. It enables a detailed analysis of the query process to a large language model through the LangChain framework methods. This allows for a very precise tracking of agent's queries to the large language model.

Screenshot from 2025-03-03 14-45-43.png

 

In the video attached to this post, you can find a very detailed analysis of a single agent's run. The agent's job is to make changes to an ABAP report. It involves several steps, each of which is thoroughly discussed along with the selection of appropriate tools selected by artificial intelligence.

 

 

Costs

With the rise of more intelligent yet pricey large language models, it's hard to ignore the operating costs of the agent. That's why it's worth monitoring. This is also the reason why the GPT-4o model, which has significant capabilities at a relatively attractive price, was chosen to work with the agent.

A single task performed by an agent usually costs less than one dollar, especially when there's no need for analyzing screenshots. This is why it's worthwhile to work on implementing tools that help avoid taking screenshots, which are a last resort when an agent is unable to analyze the current situation and outcome based solely on the state of the provided interface elements.

The task analyzed in the video cost exactly 55 cents because 209,000 tokens were used. As can be easily calculated, the main burden was the input tokens. This is quite natural for longer conversations because with each iteration, the entire history of previous actions is sent.

There's definitely room for optimization here, especially since a significant number of tokens are used to communicate the exact state of the GUI elements. There may be potential to simplify this process and even limit the history. However, so far, due to the relatively low costs, this hasn't been the main focus of the agent's development. However, this may change when the need arises to use more expensive language models.

 

Summary and further development

As you can see, building an agent is not particularly complicated. Initially, I planned to build a wide knowledge base for each type of task and provide it as context to the agent using the RAG technique. However, over time it turned out to be unnecessary, at least for basic operations. Of course, it's definitely worth considering, especially if we need to teach the agent the rules prevailing in a given organization and provide, for example, information about naming conventions in use. However, when we talk only about the general use of the user interface, the agent handles it well without any additional knowledge base.

What are the further plans and possible directions for development? 

First and foremost, further work on the implementation of various interface elements. Although many of them are used to a lesser extent in SAP, several frequently used ones have still not been implemented.

It is also natural for the agent to develop alongside the development of large language models. Thanks to the use of the LangChain framework, there is no problem with replacing a large language model, even if we decide to change its creator. This is quite important in light of recent dynamic changes in this field. It may turn out that alternatives to OpenAI offer a better quality-to-price ratio, and perhaps even effectiveness.

Working with the right prompts for the agent shows a lot of promise. It's well-known that proper prompting techniques can make a big difference. In fact, recent reports suggest that simple techniques can deliver excellent results and are even more efficient than using Chain-of-Thoughts. Nevertheless, I will definitely want to try to connect the so-called Reasoning Models, for example, o3 Mini.

There is no doubt that the space for development is still very large.

 

Video

I hope the above entry shed some light on how the agent is built. If you're curious about even more details, like what the agent's code looks like, I recommend checking out the attached video.

8 Comments
Duc
Explorer

Very interesting! Please keep sharing more inspiring works like this.

lijc
Explorer
0 Kudos

Hi  I want to know what list_all_object should return.

I understand that this is used to avoid screenshot operations, so it returns the IDs, types, and text values ​​of all elements of the interface. However, I find it easy to exceed the max token in this way, because the messages attached to each call LLM are accumulated.

L_Skorwider
Active Participant

Hi,

Not just to avoid screenshots. It works more like this – the more readable the object is, the less screenshots are needed. However, the list of objects is primarily needed for the agent to be able to interact with interface elements. For example, to press a button, its identifier must be listed.

As for the growing history, it could potentially become a problem. I haven't encountered it yet, but if it does arise, I have a Plan B – removing old lists from history.

BR

lijc
Explorer
0 Kudos

Hi

Thank you for your answer. I am trying to build this project. Most of it has been completed, but there is a problem: I want to read the code content in SE38, but after I get GuiShell through SAP script, I cannot read the specific code content. Do you have any good method?

lijc_0-1742370613505.png

 

L_Skorwider
Active Participant
0 Kudos

Hello,

Unfortunately, using some GUI elements is a bit tricky. In this case, you have to iterate through all the lines of code, retrieving the content using the GetLineText() method.

Lukasz

lijc
Explorer
0 Kudos

Thanks!  I will try this method 

ulf_bethke
Explorer
0 Kudos

Could this be done via SAP webgui as well ?

L_Skorwider
Active Participant

Hi,

This particular agent works with the SAP GUI, not the Web GUI. However, the implementation of the Web GUI itself seems to be much simpler. The challenge here was using GUI Scripting and implementing all those elements. You can first try regular browser use to see how it handles the Web GUI...

BR

Labels in this area