on 2016 Feb 18 12:14 PM
We have a customer trying to upgrade from v10.0.1.4051 to v16.0.0.2221
Their plan is to initially run the v10 database under the v16 engine and then, after a suitable period of satisfactory live running, rebuild it as a v16 database. The idea is that if significant issues are found under v16 they can easily revert. Their testing of this process had gone fine until now.
The v10 database was run briefly under the v16 engine. The v16 engine was then stopped cleanly and the v10 engine re-started. The database would not start and gave this message:
02/17 23:13:39. *** ERROR *** Assertion failed: 201822 (10.0.1.4051)
Checkpoint log: attempt to allocate before recovery is complete
However, v16 could open the affected database without any trouble. Despite this, the database still would not open under v10; all attempts at recovery failed, and it had to be restored from backup to run under v10 again.
The only reference to this assertion I can find relates to a bug fixed in 2008 in v9.0.2 (CR496087):
In rare instances, assertion failure 201822 "Checkpoint log: attempt to allocate before recovery is complete" could have been reported during database recovery. The problem has been fixed. If this problem is encountered, recovery with a server containing this fix should complete normally.
The v10 release notes make a few references to checkpoint log issues but all were fixed before build 4051.
Does anyone have any thoughts as to what the issue might be? Thanks!
Using -f should not succeed, as the checkpoint log is actually stored within the database file. The problem seems to be that the server is going through recovery despite the claim that the database was shut down cleanly. I'm trying to figure out why a database that was apparently shut down cleanly would need to go through recovery. I wonder if perhaps there are multiple dbspaces in this database and not all of them are being copied into both environments; swapping copies of dbspaces might do that.
The assertion failure is basically telling me that the server is going through recovery yet we're attempting to allocate new pages to the checkpoint log, which is not allowed at that point. I would need to see the database or a full dump/core to have a chance to figure out what is going on.
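To rule the mismatched-dbspace theory in or out, one could compare checksums of every database file in the two environments. A minimal sketch in Python, assuming the files are plain copies sitting in two directories; the `*.db` / `*.log` patterns and the function names are illustrative assumptions, not anything known about the customer's setup (dbspace files can have any extension, so adjust the patterns accordingly):

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_db_files(dir_a, dir_b, patterns=("*.db", "*.log")):
    """Compare database files by name and content across two directories.

    Returns a dict mapping each file name to one of:
    "match", "differs", "only in A", "only in B".
    """
    def collect(directory):
        files = {}
        for pattern in patterns:
            for p in Path(directory).glob(pattern):
                files[p.name] = p
        return files

    a, b = collect(dir_a), collect(dir_b)
    result = {}
    for name in sorted(set(a) | set(b)):
        if name not in b:
            result[name] = "only in A"
        elif name not in a:
            result[name] = "only in B"
        elif file_digest(a[name]) == file_digest(b[name]):
            result[name] = "match"
        else:
            result[name] = "differs"
    return result
```

This only makes sense when both servers are stopped, so that none of the files is being written while it is hashed. Any entry other than "match" would suggest the two environments are not working from the same set of files.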
[WAG] Since the database was shut down cleanly, there should be no reason not to try dbsrv10 -f to force a restart without the transaction log, then run dbsrv10 normally to create a new transaction log. [/WAG]
Sorry - I should have included that - it was one of the things we tried under the heading of "all attempts at recovery failed". When started with -f, it gave the same assertion.
You would have thought that if the db had been recovered by v16 then v10 would be happy too, once that had happened.
[WAG] Try starting the database with dbsrv12/dbsrv11, then dbsrv10. [/WAG]
Does the database start under V16 in read-only mode (-r)? (Just another very wild guess...) (Just to verify the clean shutdown, apparently.)
Tried this now:
v11 (last EBF) behaves like v10 - i.e. it asserts.
v12 behaves like v16 - it runs with no problem.
In either case the db still can't be run under v10 afterwards. It looks like there is some sort of problem that is fatal for v10/v11 but ignored or irrelevant from v12 onwards.
Has this been a "one-time" problem, i.e. did the customer try once to run a v10 db under v16, stop that, try to run it under v10 again, notice that this did not work as expected and then stop using v16? Or has it been tried again and run into the same issue again?
(I'm just asking whether this issue may be related to a particular combination of data entry, checkpoint timing, shutdown situation or whatever that may not necessarily repeat itself... We once had an issue with MobiLink years ago where a sequence of steps we had done for years in identical fashion suddenly made the database unstartable with v8, whereas v9 could still start it. So I guess if something had been a bit different, that show-stopper might not have shown up... Wild post-mortem guesses, apparently... :))
Of course it would be understandable if your careful customer did not want a second attempt when the first did not run as expected.
You are right Volker - so far this has only happened once in testing. What the customer is worried about is how long it will take to recover / how much data might be lost if they found they needed to revert and this happened in a live situation.
It's a lot of "ifs", but they are rightly being cautious. This is a large database with hundreds of users in a 24x7 retail situation (with lots of other systems connected) - saying "oops, sorry, we can't do anything until we've sorted the computer" isn't going to be good enough. That's why I'm trying to get to the bottom of what happened, so we can come up with a workaround or make sure that it won't happen for real.
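Since it has only happened once, one option would be to script the fallback rehearsal and repeat it many times, checking the v10 server output after each attempt. As a rough sketch (Python, assuming the server's console output is captured to a text file in the same format as the message quoted at the top of this thread; the function name and dict keys are made up for illustration), the check for assertion failures might look like:

```python
import re

# Matches lines such as:
# 02/17 23:13:39. *** ERROR *** Assertion failed: 201822 (10.0.1.4051)
ASSERTION_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2})\. "
    r"\*\*\* ERROR \*\*\* Assertion failed: (?P<code>\d+) \((?P<build>[\d.]+)\)"
)


def find_assertions(lines):
    """Return one dict per assertion failure found in the given log lines.

    The line following the assertion is kept as the detail text,
    since the server prints the description on its own line.
    """
    lines = [line.strip() for line in lines]
    hits = []
    for i, line in enumerate(lines):
        m = ASSERTION_RE.match(line)
        if m:
            detail = lines[i + 1] if i + 1 < len(lines) else ""
            hits.append({
                "timestamp": m.group("stamp"),
                "code": int(m.group("code")),
                "build": m.group("build"),
                "detail": detail,
            })
    return hits
```

Feeding it the lines of the captured log after each v10 restart would give a simple pass/fail signal: an empty list means no assertion was logged on that run.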