on 2010 Feb 26 5:59 PM
Title isn't that great, so feel free to change.
Basically, I will describe an issue we had this week. I want to get responses from everyone on what you would have done in the scenario. I won't post what we did at first, but will later on.
Scenario:
A 35 GB SQL Anywhere 9 database with 1 remote site that receives all data and around 190 remote sites that receive limited data. Two assertion failures in one week: one was an index issue; the second, a little more severe, was an invalid page.
If anyone would like more details, let me know.
Parts of what I would have done (or should say, what I actually did during a SQL Remote consolidated crash some years ago...) include the following (a rough command-line sketch of some of these steps follows after the list):
Try to catch all sent messages that have not yet been read by remotes (i.e. prevent remotes from reading further messages, maybe by removing access to the file link for remotes). That way you may reduce the number of remotes that might need to be re-extracted. And that may be the most important goal...
Test whether the database file is startable without the log (all tests done on copies, of course!), i.e. does dbsrvX mydb.db -f work?
Otherwise, you may be able to restore the database to its current state by using a backed-up database file and applying all logs (including the current one).
If the current log is unusable, use DBTRAN to find out whether there are any transactions from remotes that may not be part of the last valid database you can restore, and apply them manually to reduce the amount of possibly lost data. (Or try to read them from the consolidated's SQL Remote -v log output, if that is available.)
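For illustration, a rough command-line sketch of those steps, assuming hypothetical file names (cons.db / cons.log, a copy named cons_copy.db, a restored file cons_restore.db) and that log backups exist; everything runs against copies, never the production files:

    rem 1) Test whether the database file starts without its log;
    rem    -f forces recovery without the transaction log
    copy cons.db cons_copy.db
    dbsrv9 cons_copy.db -f
    rem 2) Or restore the last backed-up database file and apply the log
    rem    backups in order; with -a the server applies one log and stops
    dbsrv9 cons_restore.db -a backup_1.log
    dbsrv9 cons_restore.db -a backup_2.log
    dbsrv9 cons_restore.db -a cons.log
    rem 3) Translate the current log to SQL to see which transactions from
    rem    remotes would be missing from the restored database
    dbtran cons.log cons_log.sql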
As stated, we had a similar crash some years ago (caused, as Tech Support realized later, by a bug in the server code), resulting in a database file that was usable and a log that was valid by itself (DBTRAN ran with no problems), but the two didn't fit together. With the help of Tech Support, we could keep using the database file, "declared the current log" as an online-backup log and started a new log. Adapting the current log offsets was the key here. So we had no need to rebuild the database, only had to re-extract a few remotes, and had no data loss at all. (And we would always have been able to restore from a nightly backup, so it was not a very painful experience.) But the procedure and tests took quite a while.
Our takeaway was an optimized backup/restore plan (as I guess everybody sets up after such a crash...), based on regular full backups plus frequent incremental log backups.
So in case of a server crash (which happened again, due to the very same bug, some weeks later...), we are able to do a very quick restore by going back to the full backup and the latest valid log. Only the remotes who might have replicated in the last half an hour might need to be re-extracted.
IMHO, bringing up a system fast is usually as important as finding out what went wrong :)
That last point, bringing up a system fast, sometimes means breaking replication and re-extracting. Finding out what went wrong is often a very long process, and there is no guarantee you will EVER find out what really went wrong... and if you do, it often turns out to be a unique one-time likely-to-never-be-repeated affair... so the whole search for what went wrong was pointless. Personally, my track record for Finding Out Why Replication Broke (a slightly different issue than the one being discussed here) is effectively zero.
...which is one of the main reasons I like MobiLink, when the consolidated breaks you fix it and carry on synchronizing, no steenking transaction log involvement. No re-extracting 1,000 remotes. No having to explain to management or a client why you have absolutely no idea why it broke, and never will, because you're too busy re-extracting.
@Breck: With a good backup/restore plan one ought to be able to bring up a system fast without having to re-extract many remotes (of course this assumes your remotes do not replicate every few minutes). I think "breaking replication" should only affect those remotes that have replicated after the point (i.e. the consolidated's log offset) to which one can recover, and I would always prefer to re-extract some remotes over rebuilding the whole system.
@Breck: And I absolutely agree w.r.t. the questionable value of "finding out what broke the system". It's worth the attempt, but that should be done after the system is up and running again. IMHO, one should not have to decide between "Do I try to recover or do I try to find the error condition?" - I guess a system being down is not the situation in which most folks are in the best condition for a complex replication analysis :)
And another addition: In our particular case, the fastest solution would have been to start the broken 8.0.3.x database with dbsrv9 and stop it immediately. That would have "healed" the problem, as the server bug was fixed in 9.x, and the database/log would then have run with 8.0.3 just fine again. - I only found that out afterwards :( - So another tool in such a situation might always be "Try with the latest EBF / the latest version that can load that database without upgrading"...
Assumption: The consolidated dbsrv9.exe is down, having just crashed on the assertion.
First thing: Shut off dbremote.exe at the consolidated end, and stop it from starting again for the time being.
Second thing: Make a full file copy of the crashed consolidated *.db and *.log. That way, no matter how much WORSE things get during attempts to get things going again, you can get the consolidated database back to the point it was right after the crash.
Third thing: Follow the steps in this V9 Help topic (starting at step 3, since 1 and 2 have been covered by the Assumption and Second Thing above):
SQL Anywhere Server - SQL Usage » Remote Data and Bulk Operations » Importing and exporting data » Rebuilding databases » Rebuild databases involved in synchronization or replication
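From memory, the rebuild procedure in that topic centers on dbunload; a minimal sketch with made-up names, to be checked against the topic itself (the offset handling in particular should be verified for your version):

    rem Unload and reload into a fresh database file, working on a copy
    dbunload -c "uid=dba;pwd=***;dbf=cons_copy.db" -an cons_new.db
    rem The Help topic then walks through carrying the transaction log
    rem offsets over to the rebuilt database (using dbtran/dblog) so that
    rem existing SQL Remote subscriptions stay valid - see the topic for
    rem the exact switches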
Assumption: Everything is OK now <g>
(I know, I lose points for all the Assumptions 🙂)
It's not clear to me that you couldn't have recovered from your last valid backup by applying the backed up transaction logs. If the log files that cover the time from your last valid backup to the point of failure don't exist, then your approach sounds fairly reasonable. Since you don't seem to have had to re-extract your users, I expect you had all of the necessary log files to recover from.
There are a few things that could have been done "before the fact" to both better protect your database and to ensure that you wouldn't risk breaking replication. The most important option that I would always recommend using at the consolidated site is the dbremote -u switch, to replicate only backed-up transactions. Using this switch prevents the consolidated from sending out any messages for operations that occurred after the last backup, so a restore from that backup can never leave a remote holding messages the restored consolidated knows nothing about - in other words, replication is not broken by the restore.
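For illustration, a hypothetical consolidated command line with that switch (connection parameters are made up):

    rem -u: send only operations already contained in a backed-up transaction log
    dbremote -u -c "uid=dba;pwd=***;dbf=cons.db"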
If you combine the dbremote -u switch with scheduled incremental transaction log backups on the consolidated as Volker discussed, then you will still keep your replication latency low while astronomically increasing the recoverability of your replicated system.
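A rough sketch of such a scheduled incremental log backup as a database event, with hypothetical names and paths; TRANSACTION LOG ONLY together with TRANSACTION LOG RENAME MATCH keeps a sequence of renamed log files to recover from:

    CREATE EVENT ev_log_backup
    SCHEDULE sched_log_backup
        START TIME '00:15' EVERY 30 MINUTES
    HANDLER
    BEGIN
        -- Back up only the transaction log into the backup directory,
        -- renaming it and starting a fresh log on the consolidated
        BACKUP DATABASE DIRECTORY 'd:\backup\logs'
            TRANSACTION LOG ONLY
            TRANSACTION LOG RENAME MATCH;
    END;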
The other configuration choices to consider include transaction log mirroring and a high availability configuration on the consolidated. Granted, high availability wasn't available in version 9, but you would still have the option of using a transaction log mirror.
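For reference, adding a transaction log mirror to an existing database is a one-line dblog call (paths are hypothetical; the database must not be running at the time):

    rem Put the mirror on a different physical drive than the log itself
    dblog -m e:\mirror\cons.mlg cons.db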
Rob, while I see the benefit in using "-u", there are situations where this doesn't work well. For example, in our setup the consolidated does some kind of complex calculation/reporting based on newly sent-in data and sends the results back to the remotes. So we have to run dbremote on the cons twice in a short cycle (say, 1-2 minutes after an incoming message, which triggers the calculation). In that situation, it's simply not useful to do a backup before each dbremote run. Instead we do log backups every 30 minutes. - So I think "always using -u" is too general a recommendation :)
@Volker - MobiLink would be a preferred architecture for your scenario since it provides for that backend processing to happen within the single synchronization session. I expect your system pre-dates MobiLink.
By choosing not to run with the -u you are creating a failure point that can force you to re-extract remotes. If you have a few hundred or a few thousand remotes, that risk likely isn't acceptable.
In other words, the recommendation to always run -u on the consolidated is still the right technical recommendation. Folks may choose not to use it for other business reasons.
@Rob: I agree with your general recommendation. In our particular case, MobiLink might really fit better - when the system was set up, MobiLink was already available (and we have used it to synchronize the ASA cons with a MS SQL database), but we were (and are) more experienced with SQL Remote. In our case, the omission of -u bears the actual risk that a few remotes (usually 0-10) might have to be rebuilt, which is tolerable for us, particularly as we can always re-apply their changes because we log all incoming statements. But as you state, this is certainly a non-standard situation.
Here's what I did...
As soon as the database went down, I made sure the replication service was stopped and wouldn't start again until I wanted it to.
Find the cause of the problem. I was able to get a backup by restarting the database and triggering the backup event. Once this finished, I worked on the backup copy for testing; my first step was running dbvalid against it. I found the problem was in a heavily populated table (in our world, at least). Naturally, as soon as dbvalid tried to validate that table, the backup database crashed.
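For reference, a minimal dbvalid run against such a backup copy (connection parameters are made up):

    rem Validate all tables in the copy; the run fails (or the engine
    rem asserts) when it reaches the damaged table
    dbvalid -c "uid=dba;pwd=***;dbf=cons_copy.db"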
Now I knew which table it was, but what range of data was affected? I followed the Sybase technical document to figure that out (it took forever).
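I don't know which technote that was, but one generic way to narrow the range (an assumption on my part, not necessarily what the document describes) is to bisect the table's key range until you find the smallest interval that still triggers the failure:

    -- Hypothetical table and key names; halve the interval and keep
    -- working on whichever half crashes or reports corruption
    SELECT COUNT(*) FROM dba.big_table WHERE id BETWEEN 1 AND 500000;
    SELECT COUNT(*) FROM dba.big_table WHERE id BETWEEN 500001 AND 1000000;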
Once I knew exactly where the problem was, I unloaded every bit of data I could. What I couldn't get from this database, I was able to get from the one site that replicates all data.
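A sketch of that unload with made-up names, skipping the bad range on the consolidated copy and pulling the missing rows from the full-data remote:

    -- On the damaged consolidated copy: everything outside the bad range
    UNLOAD SELECT * FROM dba.big_table
        WHERE id NOT BETWEEN 600001 AND 610000
        TO 'd:\recover\big_table_good.dat';
    -- On the remote that receives all data: just the bad range
    UNLOAD SELECT * FROM dba.big_table
        WHERE id BETWEEN 600001 AND 610000
        TO 'd:\recover\big_table_bad_range.dat';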
So. Now I know where the problem is, and I have the data to rebuild.
What next?
Remove said table from all publications.
Drop all constraints/triggers/etc. from table
Rename table
Create a new table
Repopulate with data extracted from above.
Add all constraints/triggers/etc.
Create a new publication with just this table, subscribe all the sites, and simply 'start' the subscriptions (not synchronize) - a rough SQL sketch of these last steps follows below.
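A rough SQL sketch of those last steps, with hypothetical table, file and user names (the publication/subscription statements are SQL Remote ones):

    -- Move the damaged table aside and recreate it with its original definition
    ALTER TABLE dba.big_table RENAME big_table_damaged;
    CREATE TABLE dba.big_table ( id INTEGER PRIMARY KEY /* ... plus the original columns ... */ );
    -- Repopulate from the unloaded data
    LOAD TABLE dba.big_table FROM 'd:\recover\big_table_good.dat';
    LOAD TABLE dba.big_table FROM 'd:\recover\big_table_bad_range.dat';
    -- Re-create constraints/triggers, then publish just this table and
    -- 'start' (not synchronize) the subscriptions so no snapshot is sent
    CREATE PUBLICATION pub_big_table ( TABLE big_table );
    CREATE SUBSCRIPTION TO pub_big_table FOR remote_user_1;
    START SUBSCRIPTION TO pub_big_table FOR remote_user_1;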