on 2017 May 11 9:22 AM
Hey everyone,
We're struggling with our MobiLink server performance and are looking for guidance on which combination of parameters would best suit our use case. Any advice would be warmly welcomed.
Infrastructure: We have two MobiLink servers (version 16), each with 12 cores and 80 GB of RAM. Both sit behind reverse proxies and load balancers, with cookie-based sessions to keep clients pinned to a server. This part is working fine.
Requirements:
Current key configuration items:
We've gone through the documentation and have already tuned the server over the last few years, but we're still not happy with the robustness of synchronization. We regularly get unexpected crashes, synchronization errors, and failures to read bytes. These problems occur at random, with no obvious pattern.
We're looking for experiences, KPIs, or any ideas that could help us find the right set of parameters to keep our servers pumping data.
Note: simulation in a "controlled" environment is nearly impossible.
First off, you said you're getting a lot of crashes. The server should never crash. If you get in touch with tech support, they can help you with the crashes and help us fix the underlying problem.
Memory
I. 2017-05-15 08:58:45. <main> PERIODIC: TRACKED_MEMORY: 12 258 748 765
I. 2017-05-15 08:58:45. <main> PERIODIC: MEMORY_USED: 23 155 286 016
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_IN: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_SWAPPED_OUT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: PAGES_LOCKED_MAX: 2 874 283
This looks OK given your large hosts. Pages swapped in and out are 0, so you're staying in memory, which should always be the goal. TRACKED_MEMORY is the amount of memory the server thinks it has allocated, and MEMORY_USED is the amount of memory the OS has given the process. It's surprising that they're so far apart. Do you have a Java or .NET VM configured to use about 10 GB of RAM? Most of what the server has allocated is storing row data, which is perfectly normal.
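As a quick sanity check, the gap between those two counters is roughly 10 GB, which is why I'm asking about an embedded VM. Nothing MobiLink-specific here, just arithmetic on the two numbers above:

tracked_memory = 12_258_748_765   # bytes the MobiLink server thinks it has allocated
memory_used    = 23_155_286_016   # bytes the OS has given the process
gap_gb = (memory_used - tracked_memory) / 1024**3
print(round(gap_gb, 1))           # ~10.1 GB unaccounted for, e.g. a Java/.NET VM heap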
Volume
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_UPLOADED: 5468845
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROWS_DOWNLOADED: 3605732328
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_WRITTEN: 142799040880
I. 2017-05-15 08:58:45. <main> PERIODIC: TCP_BYTES_READ: 4105727260
~230K bytes down per sync
~5.8K rows down per sync
There's nothing to be especially worried about here. You have essentially no upload and your download is not particularly large.
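For anyone wondering where those per-sync figures come from: they're just the running totals divided by the number of successful syncs (NUM_SUCCESS_SYNCS, reported under Errors below). A quick back-of-the-envelope check, assuming that's a reasonable way to average:

tcp_bytes_written   = 142_799_040_880   # total bytes sent to clients
num_rows_downloaded = 3_605_732_328     # total rows downloaded
num_success_syncs   = 616_417           # from the Errors section below

print(round(tcp_bytes_written / num_success_syncs))    # ~231,660 bytes, i.e. ~230 KB down per sync
print(round(num_rows_downloaded / num_success_syncs))  # ~5,850 rows down per sync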
Syncs
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_UPLOAD_CONNS_IN_USE: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_NON_BLOCKING_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_GET_DB_WORKER_FOR_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_WAIT_FOR_DNLD_ACK: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SEND_DNLD: 20
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_END_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_FETCH_DNLD: 87
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_PREP_FOR_DNLD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_APPLY_UPLOAD: 1
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_BEGIN_SYNC: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_AUTH_USER: 5
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_CONNECT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_RECVING_UPLOAD: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_IN_SYNC_REQUEST: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WAITING_CONS: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_SYNC: 2878580
I. 2017-05-15 08:58:45. <main> PERIODIC: LONGEST_DB_WAIT: 0
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_CONNECTED_SYNCS: 113
I. 2017-05-15 08:58:45. <main> PERIODIC: ML_NUM_CONNECTED_CLIENTS: 258
Most of your syncs are either fetching the download from the consolidated or sending it to the client. Given the relative size of your downloads this isn't surprising. What's worrying here is LONGEST_SYNC; this is in milliseconds, so your oldest sync has been running for about 48 minutes and still isn't finished. Given just the ppv output we can't know where that sync is actually stuck. You'll have to connect the MobiLink Profiler to see what the individual syncs are doing.
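To make the "most syncs are in the download phases" point concrete, here's the same tally done by hand from the counters above (no assumptions beyond the numbers already shown):

# Where the 113 connected syncs currently are
in_fetch_dnld   = 87   # fetching download rows from the consolidated
in_send_dnld    = 20   # sending the download to the client
in_auth_user    = 5
in_apply_upload = 1
print(in_fetch_dnld + in_send_dnld + in_auth_user + in_apply_upload)  # 113, matches NUM_CONNECTED_SYNCS

longest_sync_ms = 2_878_580
print(round(longest_sync_ms / 60_000, 1))  # ~48.0 minutes for the longest still-running sync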
Errors
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_WARNINGS: 862621
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ERRORS: 15902
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_FAILED_SYNCS: 232
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_SUCCESS_SYNCS: 616417
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_ROLLBACKS: 221
I. 2017-05-15 08:58:45. <main> PERIODIC: NUM_COMMITS: 3076355
I. 2017-05-15 08:58:45. <main> PERIODIC: INTERRUPTED_TCP_CONNECTION: 2283016
You have 232 failed syncs and 221 rollbacks. I'm pretty sure every rollback will cause a failed sync, which means your failures are mostly happening while the server is doing database work. There are dozens of errors for each failed sync, so I would guess you have a Java or .NET VM printing big stack traces when these syncs are failing.
INTERRUPTED_TCP_CONNECTION is really high, which means our self-recovering HTTP gear is kicking in. It's happening a few times per sync, which seems like a lot. This might indicate a flaky network, or it might be a side effect of how liveness works with HTTP. You could try increasing the liveness timeout on the client, or setting the buffer_size parameter in the server's -x switch to something smaller than the default of 64000, like 32000. This should be a lower priority; it's probably not causing any real problems.
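For completeness, the ratios behind "dozens of errors for each failed sync" and "a few times per sync" (again just arithmetic on the counters quoted above):

num_errors        = 15_902
num_failed_syncs  = 232
interrupted_tcp   = 2_283_016
num_success_syncs = 616_417

print(round(num_errors / num_failed_syncs))            # ~69 errors logged per failed sync
print(round(interrupted_tcp / num_success_syncs, 1))   # ~3.7 interrupted TCP connections per sync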
Summary
To recap: the crashes are the most important thing to chase down with tech support. Memory and sync volume both look reasonable for hosts this size. The biggest open question is where that ~48-minute sync and the failed syncs are spending their time, and the MobiLink Profiler is the tool to answer that. The interrupted TCP connections and any buffer_size tuning are a lower priority.
Thank you Bill. Some of the metrics were somewhat cryptic and your explanations have cleared things up.
The rollbacks I'm talking about are done by the ML server against the consolidated database. The server runs the scripts in a sequence of transactions; if everything succeeds it commits, but if anything goes wrong it will roll back the transaction and fail the sync. Given the ppv output we can't see what specific part of the sync is failing.
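A minimal sketch of the flow I'm describing, using a generic Python DB-API style connection; this is only an illustration of the commit/rollback behaviour, not MobiLink's actual implementation:

def apply_sync_scripts(conn, scripts):
    # Run one transaction's worth of sync scripts against the consolidated;
    # commit if they all succeed, otherwise roll back and fail the sync.
    cur = conn.cursor()
    try:
        for sql in scripts:
            cur.execute(sql)
        conn.commit()        # shows up in NUM_COMMITS
        return True
    except Exception:
        conn.rollback()      # shows up in NUM_ROLLBACKS
        return False         # ... and the sync is counted as failed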
The various STAGE_LEN values list the size of the event queues for some thread pools in the server. They're basically debug values for me and are probably not going to be useful to customers. The only time they might be useful is if STREAM_STAGE_LEN were much larger than the number of syncs (i.e. 20-30 times larger); that would indicate the pool controlled with -wn is overworked and bumping up -wn would be a good idea.
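If you wanted to keep an eye on that automatically, here's a rough sketch that pulls the PERIODIC counters out of the server log text and applies that 20-30x rule of thumb. The log format is assumed to match the lines quoted above, and the 20x threshold is just the heuristic I mentioned:

import re

# Match "... PERIODIC: NAME: value" entries; values may use spaces as thousands separators.
PERIODIC_RE = re.compile(r"PERIODIC:\s+(\w+):\s+([\d ]+)")

def parse_periodic(log_text):
    stats = {}
    for name, value in PERIODIC_RE.findall(log_text):
        stats[name] = int(value.replace(" ", ""))
    return stats

def stream_pool_overworked(stats, factor=20):
    # STREAM_STAGE_LEN much larger than the number of connected syncs
    # suggests the thread pool controlled by -wn can't keep up.
    stage_len = stats.get("STREAM_STAGE_LEN", 0)
    syncs = max(stats.get("NUM_CONNECTED_SYNCS", 1), 1)
    return stage_len > factor * syncs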