cancel
Showing results for 
Search instead for 
Did you mean: 

All threads are blocked

bjanker77
Participant
4,956

Hi.

Our largest client have a SA16 database, about 25 GB in size. Primary use is a client/server application, but we also have som webservices and some use CLR (.Net code).

As of lately, we are experiencing some issues. Yesterday, one of the webservices used over 30 seconds to connect. This usually takes milliseconds, due to the use of connection pool cache. At that same time, one of the CLR-routines reported "All threads are blocked" in the Windows event log.

The database server is a new HP Proliant server, with lots of RAM and a gnarly CPU.

I had a look at the service, and it is set up like this:

-n <servername> -x tcpip(serverport=2638) -gnl 200 -gnh 400 -ca 0 -c 32G <databasefile.db>

Also, the option max_query_tasks option is set to 1 (a tech at SAP recommended it).

Is there any point in adjusting the gnl and gnh switches? Or could we remove them all together?

It would be great to hear some of your experiences if you have any. Perhaps someone from SAP has anything to contribute?

Regards,

Bjarne Anker Maritech Systems AS Norway

VolkerBarth
Contributor

As option -gnh restricts the maximum number of tasks and thereby worker threads - is there a need for you to specify that? (So my implicit suggestion is to try to omit that limitation...) - As your -gnl option is set to a really high minimum MPL value, I guess that seems to have a reason, however, it would be a further test to omit that, as well, and check what default MPL settings the database server will calculate (listed in the console when starting the process...).

bjanker77
Participant
0 Kudos

I'm not sure why this option is specified actually. I think this is an old setup, before I was involved and before we added several more services and CLR. I guess it's no big problem in omitting set option all together?

Thanks.

VolkerBarth
Contributor
0 Kudos

FWIW, thread deadlock may also result from problems with external environments or suboptimal locking schemes (in which case increasing the maximum MPL value may just delay the problem to a later time, when the new maximum value is reached)...

There are quite a few FAQs here dealing with such problems, cf:

http://sqlanywhere-forum.sap.com/search/?q=All+threads+are+blocked&Submit=search&t=question

bjanker77
Participant
0 Kudos

We are using CLR External Environment, and that is rather new. I think the first "all threads are blocked" came around not long after we started the CLR routines. I think there is a connection here. I will look into some of the other similar problems and see if I can find anything useful.

Thanks.

Accepted Solutions (0)

Answers (3)

Answers (3)

If anyone wants to know, we found the (one?) cause for the troubles we had.

We were triggering events in paralell, but the event's code used a mutex to serialize access. Usually the code was running quickly so there was no problem. But under increased load the events did need more time, so there was a tipping point when the number of events waiting for the mutex was increasing constantly.

Obviously each of the events did use one thread while waiting, and each event opened its own connection to the database. So in Performance Monitor we saw a steady increase of the "Connection Count" counter, as well as of the "CurrentMultiprogrammingLevel" database property.

If the load was not reduced significantly, this usually lead to a database crash when "MaxMultiprogrammingLevel" was reached, after a few "All threads are blocked" exceptions. If load was reduced in time, the Connection Count dropped quite quickly (CurrentMultiprogrammingLevel did so only after a quite long time, ~1 day) and no problems occured.

Reviewing the event's code, and changing it so that no mutex is needed anymore removed the problem and speeded up work considerably, at least at the load levels we are currently able to reach. CurrentMultiprogrammingLevel now remains on a somewhat sensible level (below 100).

So, it is the classical story in multi threaded programming. Maybe the example helps some people to find their own problem.

VolkerBarth
Contributor
0 Kudos

We were triggering events in paralell, but the event's code used a mutex to serialize access.

Just out of curiosity: If this relates to one event being triggered in parallel, do you really need it to run in parallel, or could you restrict the event to have only one instance running at any given time via the "old fashion" of checking event_parameter('NumActive') > 1?

(If you would need allow n instances in parallel at max, you could use semaphores with a start count of n instead of mutexes.)

0 Kudos

In fact we have changed the code so the events can (and do) run in parallel, which resulted in considerable better performance and user experience. 😉

Nevertheless that hint about semaphores may be useful in the future.

MCMartin
Participant

You could use sa_server_option with MaxMultiProgrammingLevel to increase the gnh value online, to check if that helps.

You can use the profiler to check if you really have so many simultaneous requests that 400 tasks are not enough to handle them.

You can check the blocked connections tab to check if more or less all your requests are accessing the same data and therefore have to wait for each other

0 Kudos

Though we have not yet really understood the cause of the issue, one of our applications using SA 17 experienced a significant improvement after increasing MaxMultiProgrammingLevel from the default of 80 to 200 or even 500.

Nevertheless the error still occurs occasionally...

Breck_Carter
Participant

In addition to all the other suggestions, you may find Foxhound to be helpful when dealing with runaway connections; see the white paper here.

Former Member

The exercise would be to identify the reason all your worker threads are busy. {as others have correctly identified} Intra-query parallelism is just one possible factor.

The list of possibilities include:

  • Execution of Parallel access plans,
  • Remote accesses (via proxy tables for example),
  • " " adds extra workers normally but can become new additional requests/tasks and will consume more workers when the remote dbs are on the same server,
  • External Environment uses (your CLR routines are an example),
  • Web requests (both those into and those from the database server),
  • Active sending of emails,
  • The execution of Events (including maintenance plans),
  • Backup tasks,
  • {I believe TDS result sets can also hold worker threads longer than other connections)
  • in addition to the normal Active (scheduled) requests,
  • ... and I may have missed a couple more.

It is worth noting one request from 1 client connection can consume many workers when it combines any number of the above features and parallelism can magnify that a couple of different ways.

It is also worth noting that Tasks can also become blocked and hold their worker threads while locks need to be acquired. If you have a large impact due to contention, you will want to monitor your contention levels as well.

Just some thoughts that may help you in your investigations ...

bjanker77
Participant
0 Kudos

Ok, it looks like we have a lot to learn regarding this. But stupid question: is it an idea to omit both -gnl and -gnh switches? I suppose that the server would dynamically handle this itself when the limitations are gone?

Regards,

Bjarne

VolkerBarth
Contributor
0 Kudos

I would think so, but note that your -gnl value is probably quite high, so I'll check what the defaults are... (see my first comment).