Primary server/database hangs when network latency is high (async mode HA)

Former Member
4,029

Please help,

Our HA configuration:

Server 1 (DC): primary database and arbiter

Server 2 (DRC): mirror database

Async mode

We have a problem: when network latency from DC to DRC is high, the primary database hangs and transactions are left pending. All transactions from client applications to the primary database stay pending until the arbiter learns whether the mirror database is disconnected or connected.

I assume the primary database hangs because the arbiter needs to determine the status of the primary or the mirror.

We need help configuring SQL Anywhere HA so that the primary database does not hang or leave transactions pending when network latency is high.

Thanks for your help.

Accepted Solutions (0)

Answers (2)

Former Member

Debugging slowdown between connected primary/mirror

There are two possible synchronous calls made between a connected primary and an asynchronous mirror that could cause transactions on the primary to slow down:

1.1) As Mark mentioned, every ~100 sends, the primary waits for the mirror to reply that its buffers have been written to disk. This ensures the mirror's buffers aren't overloaded. If this is the cause, there will be evidence in the mirror log files. I would suggest enabling mirror logging on the mirror server (specified with logfile=filename.mlog in the CREATE MIRROR SERVER statement - see http://dcx.sap.com/index.html#sa160/en/dbreference/alter-mirror-server-statement.html*d5e32538 ), and then looking for a line like this:

MM/DD HH:MM:SS.SSS << LOG_PAGES,16,partner,first_page=1,need_ack=1

On an asynchronous mirror, most of these requests should have need_ack=0, but if you see need_ack=1, the primary will wait for a reply. You should look in the log for the next line that looks like this:

MM/DD HH:MM:SS.SSS >> SUCCESS,partner

This line means the reply was sent to the primary. If there is a large time gap between receiving the request and sending the reply, the mirror is delaying your primary. This slowdown could be caused by slow hardware. We recommend that the hardware on the mirror be just as good as (or better than) the hardware used for the primary.

Alternatively, if the time gap isn't very large but you're still seeing long delays, you can also enable mirror logging on the primary to see how long these requests take to get from one server to the other. This travel time could account for the delays as well. If this is the case, we recommend that you try to improve the network between the machines.
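If the .mlog file is long, a small script can do the timestamp arithmetic for you. The following is only a sketch in Python: it assumes the lines look exactly like the two examples above (MM/DD HH:MM:SS.SSS, then << or >>, then the message), and the function names are my own, not part of SQL Anywhere. Real log files may contain other line shapes, which this simply skips.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the two example lines in this post:
#   MM/DD HH:MM:SS.SSS << LOG_PAGES,16,partner,first_page=1,need_ack=1
#   MM/DD HH:MM:SS.SSS >> SUCCESS,partner
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (<<|>>) (.*)$")


def parse_ts(text):
    # The log has no year, so a fixed placeholder year is prepended.
    # Timestamps are only compared with each other, never shown as dates.
    return datetime.strptime("2000/" + text, "%Y/%m/%d %H:%M:%S.%f")


def ack_delays(lines):
    """Seconds between each need_ack=1 request and the next SUCCESS reply."""
    delays = []
    pending = None  # timestamp of the request still waiting for its reply
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not one of the assumed line shapes
        ts = parse_ts(m.group(1))
        direction, body = m.group(2), m.group(3)
        if direction == "<<" and body.startswith("LOG_PAGES") and "need_ack=1" in body:
            pending = ts
        elif direction == ">>" and body.startswith("SUCCESS") and pending is not None:
            delays.append((ts - pending).total_seconds())
            pending = None
    return delays


if __name__ == "__main__":
    import sys

    # Usage: python ack_delays.py mirror.mlog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for seconds in ack_delays(f):
            print(f"{seconds:.3f}")
```

Running the same script against the .mlog from the primary and from the mirror gives a rough way to tell mirror-side delay apart from network travel time.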

1.2) The second possibility is that the mirror is running behind the primary in applying transactions. In this case, you should be able to see the following message in the console log (output from -o server.conslog on the server start command line):

Database "demo" mirroring: primary blocked for x seconds waiting for the mirror to catch up

The fix for this is to ensure that the hardware on the mirror is just as good as (or better than) the hardware used for the primary. It's also possible that the network is causing this problem. As in 1.1, you could check for this by looking for DU_MIRROR_CATCHUP in the primary and mirror .mlog files to see how long each request takes.
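To see how much time the primary spends blocked in total, the console log can be scanned for that message. This is a sketch in Python that assumes the message wording is exactly as quoted above and that x is a plain number; the function name is hypothetical.

```python
import re

# Assumed message wording, copied from the console log line quoted above.
BLOCKED_RE = re.compile(
    r'Database "([^"]+)" mirroring: primary blocked for (\d+(?:\.\d+)?) seconds'
)


def blocked_seconds(lines):
    """Total blocked time per database name, in seconds."""
    totals = {}
    for line in lines:
        m = BLOCKED_RE.search(line)
        if m:
            name, seconds = m.group(1), float(m.group(2))
            totals[name] = totals.get(name, 0.0) + seconds
    return totals


if __name__ == "__main__":
    import sys

    # Usage: python blocked.py server.conslog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for name, total in blocked_seconds(f).items():
            print(name, total)
```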

Configuration

From this posting, though, it sounds like perhaps your problem occurs when the primary loses the connection to the mirror and needs to check with the arbiter whether to stay on as primary. I don't think the arbiter should be checking with the mirror as you've suggested, so I'm not actually sure what slowdown you could be seeing here. However, here are some ideas for improving your configuration:

2.1) Unless you have set the mirroring option auto_failover to ON, you don't actually have high availability with an asynchronous mirror. You could potentially change the mirror to be a read-only scale-out node (i.e. a copy node). This would prevent the primary from checking for quorum when the node disconnects; however, you may still encounter some of the network delays mentioned in 1.1.

2.2) In general, we recommend that the primary, mirror, and arbiter all be located on separate machines for optimal high availability. It sounds like in your situation, the primary is experiencing high enough network latency to cause dropped connections. In this case, I wonder whether it would be better for the mirror to take over as primary, and for the primary to rejoin when the network issues are resolved. For this to work properly, you will need to make the mirror synchronous and move the arbiter to another machine. If you want to have a preferred primary server, you could set the "preferred" option, as Volker has mentioned.

2.3) Perhaps one of your problems is that you're experiencing network drops when in reality there are just extended delays. In this case, you could try increasing the liveness timeout by adding "lto=timeout_value" to the connection strings provided in the CREATE MIRROR SERVER statements. See http://dcx.sap.com/index.html#sa160/en/dbadmin/livenesstimeout.html

2.4) As an alternative to using SQLA database mirroring, you could use live backups. http://dcx.sap.com/index.html#sa160/en/dbadmin/da-backup-dbs-4977640.html



I hope this gives you a number of ideas to try. There are many different possible configurations, and I think it ultimately comes down to whether you're trying to achieve true high availability or simply have an (almost) up-to-date backup of the database, and what kind of slowdown you're willing to trade for data availability.

VolkerBarth
Contributor
0 Kudos

Wow, what a wealth of details, suggestions and explanations - HA (highly appreciated)!

Breck_Carter
Participant
0 Kudos

What are "the primary and mirror .mlog files"?

MarkCulp
Participant

See Mary's suggestion in her second paragraph: the .mlog files are the output files you get from including logfile=filename.mlog in the CREATE MIRROR SERVER statements for the primary and secondary.

Breck_Carter
Participant
0 Kudos

...doh! 🙂

Or... I was playing Jeopardy!

MarkCulp
Participant

(1) If the primary server cannot verify that it has quorum (i.e. it knows that it is still the primary, and to do this it needs to be able to communicate with the arbiter, the mirror, or both), then it will stop all COMMITs until it can get quorum again - this is working as intended. If the primary were to allow COMMITs to complete without quorum, it would be allowing the possibility of lost transactions in the future.
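The rule in (1) can be written down as a tiny truth table. This Python sketch only restates the sentence above; the function name is made up for illustration and it is not how the server is implemented.

```python
def primary_has_quorum(reaches_arbiter, reaches_mirror):
    """The primary keeps quorum if it can talk to the arbiter, the mirror, or both."""
    return reaches_arbiter or reaches_mirror


# COMMITs are held only when both links are down.
for arbiter in (True, False):
    for mirror in (True, False):
        state = "COMMITs proceed" if primary_has_quorum(arbiter, mirror) else "COMMITs wait"
        print(f"arbiter={arbiter} mirror={mirror}: {state}")
```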

(2) If quorum is not the problem, then the other thing that happens under the covers is that the primary will slow down (and stop/wait) if it gets "too far" ahead of the mirror. The primary needs to send the transaction information to the mirror, and it needs to ensure that the mirror does not get overloaded. So after every 100 packets (I think that is the number?), the primary sends a "did you get this?" request and then waits until it gets a response. If the mirror is behind and has a backlog of packets that it has not yet received, the response can be delayed, and you may see connections that are COMMITing on the primary "hang" until the response is received.
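To get a feel for why latency rather than bandwidth matters in (2), here is a back-of-the-envelope model in Python. It is a deliberately crude sketch: it assumes one full round trip of waiting per acknowledgement and ignores any backlog on the mirror, so real stalls can be longer. The function name is hypothetical, and the default of 100 is just the approximate figure mentioned above, not a documented constant.

```python
def estimated_ack_wait(packets, round_trip_s, ack_interval=100):
    """Rough lower bound on time the primary spends waiting for acknowledgements.

    packets       -- number of packets sent to the mirror
    round_trip_s  -- network round-trip time between the sites, in seconds
    ack_interval  -- packets between "did you get this?" requests (assumed ~100)
    """
    waits = packets // ack_interval  # number of times the primary stops to wait
    return waits * round_trip_s


if __name__ == "__main__":
    # Same traffic, different latency between DC and DRC.
    for rtt in (0.005, 0.25, 2.0):
        print(f"rtt={rtt}s: {estimated_ack_wait(10_000, rtt)}s waiting")
```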

The only suggestion that I would make is to attempt to improve your network connection between DC and DRC.

Former Member
0 Kudos

Hi Mark,

Thanks for your suggestion. Please explain: (1) Our primary server and arbiter are on the same server. I think the arbiter needs an answer/ack from the mirror server (mirror connected/disconnected) to decide quorum (the mirror server is on a different server, in a different city). The primary server will hang with pending transactions if the arbiter doesn't get a status from the mirror server. It happens intermittently, and the symptom is bad network latency, not network bandwidth. Can we set a parameter on the arbiter so that when it hasn't received a status from the mirror for about 60s, it decides the mirror is disconnected?

MarkCulp
Participant
0 Kudos

Since your primary and arbiter are co-located, my #1 will not apply to you (I missed that fact when I typed up my answer) - the primary will be able to maintain quorum since it can easily talk to the arbiter. So in your situation, my best guess is that your issue is most likely #2: the primary is waiting for a response from the mirror to the "did you get this?" request.

VolkerBarth
Contributor
0 Kudos

Aside: The above is true as long as "primary" and "mirror" run as expected, i.e. there has not been a role switch in between; otherwise the arbiter may not be running on the same machine as the "primary"... - do you use the "preferred" option?