cancel
Showing results for 
Search instead for 
Did you mean: 

Database recovery long and quiet

2,442

We are using SQLAnywhere 10.0.1.4295, hosted on Windows 2003R2/64. The database file is approximately 42GB in size. Early yesterday morning, the DB server slowed to a crawl and became unusable. I could not connect with iSQL; 15 minutes after entering login credentials, I still had no screen. At the server console, I tried to shut the database down. I immediately got the message "shutdown initiated", but 90 minutes later the database as still not shut down. I became impatient, and terminated the dbsrv10 task.

In restart, the server announced it was doing a database recovery (no surprise). It did that for 8 hours, and during that time the size of the db file grew to 70GB. At that point, I stopped the task, restored a backup of the database, and restarted.

My questions are:

  • Where can I look to see what might have caused the original problem? The console window provided no insight at all
  • Can I make dbsrv10 more talkative, either while running, or while in recovery mode so I have an idea of what's going on?

Any insight appreciated

VolkerBarth
Contributor
0 Kudos

Can I make dbsrv10 more talkative...?

I guess you are already using dbsrv10 -o console.log to write the console message to a file, aren't you?

0 Kudos

Actually yes are are. There's just not much there.

0 Kudos

What kind of information are you hoping to see? v12 and up print out the recovery progress in percent during automatic recovery. There is some additional debugging information that can be enabled with -z but it is mostly network related.

0 Kudos

Bingo! At the very least then, we'd have something other than manually watching file sizes to see that something is happening.

VolkerBarth
Contributor
0 Kudos

Even it this might be a post-mortem situation, you could try to do the recovery with a current v12 engine - it should be able to cope with a v10 database... (Of course that ought to be tested on a copy of the original database...)

Trying to use the newest SA/EBF engine that a problematic database can be run on is one of the lessons I have learnt after (rare) database crashes...

Accepted Solutions (1)

Accepted Solutions (1)

The excessive login time could be caused by all the worker threads being tied up, make sure you have an appropriate number of worker threads for your environment (-gn). In 12+ we can automatically increase the number of workers as needed. How many logical processors is the database running on? There are several problems related to parallel queries that are fixed in newer versions but aren't available for the 10 branch. After the server is running for sometime, check sa_conn_info and see if there are a large number of exchange tasks that do not disappear.

If you would like more diagnostic information regarding what the database to doing during slowdowns like this, you could try implementing minimal request level logging or database tracing to record the statements being executed. This may allow you to pinpoint the query that is making the server do so much work.

What stage of recovery was the database stuck in? If it was the rollback stage, then a transaction (probably the one that was causing the slow down) had a large number of uncommitted transactions that needed to be rolled back. This would explain the large size of the database file and the very long recovery time. If it was in the applying transactions phase you should check the translation log with dbtran to see what it was trying to apply during recovery.

Mikel -- thanks for the reply. I had thought about the worker threads angle, but haven't investigated yet. The server is virtualized, with 1 vCPU and 4 vCores. So far, this is somewhat of an anomaly; we've been using Sybase for 10 years, and this is the first time anything like this has happened.

The recovery had gone through the checkpoint log and was doing the transaction log. Six hours after it started the trasn log, it was not done, and the only other messages were about expanding the cache size. No matter what, growth of the .db file from 43GB to 72GB seems a bit excessive to me.

I'd recommend taking a look at the transaction log (translate it using dbtran) and see if you can find anything unusual. Usually this kind of problem is caused by a runaway query that is repeatedly executing a DML statement. When you look at the console log, when was the last successful checkpoint compared to the time you shutdown the server? Are all the start's and end's paired very close together?

The growth of the DB file is probably due to the undo log. As we go through and apply the transaction log, we have to write to statement to reverse it (e.g. the insert corresponding to a delete statement). This rollback/undo log is stored in the db file. If there are millions of operations and a commit never occurred, we have to apply them all, then go back to the rollback log and roll them all back. Also, we have to keep the unmodified pages from the previous checkpoint around (checkpoint log) in case you happen to crash again during the recovery.

The growth of the database file sounds normal given the circumstance, it be interesting to know how it got to this state. Hopefully something stands out in the dbtran (or the record of checkpoint times)

0 Kudos

I did not have the presence of mind to save the transaction log prior to restoring the backup, so I can't investigate that. Getting the factory working after 8 hours clouded my thinking. However, I did find out that the night before, a purge process tried to delete appx 10 million records; I suspect that was the root of the problem.

0 Kudos

That could be it, the 29GB of file growth during the recovery would be the ~10 million insert statements (with the full row data) being added to the rollback log and the unmodified pages in the checkpoint log. To avoid this in the future, you could turn such a statement into a truncate table if possible or occasionally commit during the purge (checking the UncommitOp property maybe)

MarkCulp
Participant

Also make sure that all of your tables - and particularly the table from which the 10 million records were being deleted - has a primary key (or unique index). If a table does not have primary key then when a delete is replayed from the transaction log a complete table scan needs to be done for each and every row that is being deleted - the initial delete may have been relatively quick but the replay from the transaction log will be really slow.

Breck_Carter
Participant
0 Kudos

> The server is virtualized, with 1 vCPU and 4 vCores.

That may be part of the problem... those numbers sound low.

How many real CPUs are there on the real computer? What other processes are sharing the real computer? Is the database server lightly loaded and low priority?

Answers (0)