on 2017 Jan 24 4:22 PM
Hi.
Our largest client have a SA16 database, about 25 GB in size. Primary use is a client/server application, but we also have som webservices and some use CLR (.Net code).
As of lately, we are experiencing some issues. Yesterday, one of the webservices used over 30 seconds to connect. This usually takes milliseconds, due to the use of connection pool cache. At that same time, one of the CLR-routines reported "All threads are blocked" in the Windows event log.
The database server is a new HP Proliant server, with lots of RAM and a gnarly CPU.
I had a look at the service, and it is set up like this:
-n <servername> -x tcpip(serverport=2638) -gnl 200 -gnh 400 -ca 0 -c 32G <databasefile.db>
Also, the option max_query_tasks option is set to 1 (a tech at SAP recommended it).
Is there any point in adjusting the gnl and gnh switches? Or could we remove them all together?
It would be great to hear some of your experiences if you have any. Perhaps someone from SAP has anything to contribute?
Regards,
Bjarne Anker Maritech Systems AS Norway
If anyone wants to know, we found the (one?) cause for the troubles we had.
We were triggering events in paralell, but the event's code used a mutex to serialize access. Usually the code was running quickly so there was no problem. But under increased load the events did need more time, so there was a tipping point when the number of events waiting for the mutex was increasing constantly.
Obviously each of the events did use one thread while waiting, and each event opened its own connection to the database. So in Performance Monitor we saw a steady increase of the "Connection Count" counter, as well as of the "CurrentMultiprogrammingLevel" database property.
If the load was not reduced significantly, this usually lead to a database crash when "MaxMultiprogrammingLevel" was reached, after a few "All threads are blocked" exceptions. If load was reduced in time, the Connection Count dropped quite quickly (CurrentMultiprogrammingLevel did so only after a quite long time, ~1 day) and no problems occured.
Reviewing the event's code, and changing it so that no mutex is needed anymore removed the problem and speeded up work considerably, at least at the load levels we are currently able to reach. CurrentMultiprogrammingLevel now remains on a somewhat sensible level (below 100).
So, it is the classical story in multi threaded programming. Maybe the example helps some people to find their own problem.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
We were triggering events in paralell, but the event's code used a mutex to serialize access.
Just out of curiosity: If this relates to one event being triggered in parallel, do you really need it to run in parallel, or could you restrict the event to have only one instance running at any given time via the "old fashion" of checking event_parameter('NumActive') > 1?
(If you would need allow n instances in parallel at max, you could use semaphores with a start count of n instead of mutexes.)
You could use sa_server_option with MaxMultiProgrammingLevel to increase the gnh value online, to check if that helps.
You can use the profiler to check if you really have so many simultaneous requests that 400 tasks are not enough to handle them.
You can check the blocked connections tab to check if more or less all your requests are accessing the same data and therefore have to wait for each other
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
In addition to all the other suggestions, you may find Foxhound to be helpful when dealing with runaway connections; see the white paper here.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
The exercise would be to identify the reason all your worker threads are busy. {as others have correctly identified} Intra-query parallelism is just one possible factor.
The list of possibilities include:
It is worth noting one request from 1 client connection can consume many workers when it combines any number of the above features and parallelism can magnify that a couple of different ways.
It is also worth noting that Tasks can also become blocked and hold their worker threads while locks need to be acquired. If you have a large impact due to contention, you will want to monitor your contention levels as well.
Just some thoughts that may help you in your investigations ...
User | Count |
---|---|
59 | |
10 | |
7 | |
7 | |
6 | |
6 | |
5 | |
5 | |
5 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.