on 2014 May 28 10:51 AM
We are using SQL Anywhere 10.0.1.4295, hosted on Windows 2003 R2/64. The database file is approximately 42GB in size. Early yesterday morning, the DB server slowed to a crawl and became unusable. I could not connect with iSQL; 15 minutes after entering login credentials, I still had no screen. At the server console, I tried to shut the database down. I immediately got the message "shutdown initiated", but 90 minutes later the database was still not shut down. I became impatient and terminated the dbsrv10 task.
On restart, the server announced it was doing a database recovery (no surprise). It did that for 8 hours, and during that time the size of the db file grew to 70GB. At that point, I stopped the task, restored a backup of the database, and restarted.
My questions are:
Any insight appreciated.
The excessive login time could be caused by all the worker threads being tied up; make sure you have an appropriate number of worker threads for your environment (-gn). In version 12 and later the server can automatically increase the number of workers as needed. How many logical processors is the database running on? There are several problems related to parallel queries that are fixed in newer versions but aren't available for the 10 branch. After the server has been running for some time, check sa_conn_info and see if there is a large number of exchange tasks that do not disappear.
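As a rough sketch of that check, run from dbisql with the necessary authority; the column list is what I recall from the 10.x version of sa_conn_info, so fall back to SELECT * if a name differs, and the way internal exchange-task connections are labelled can vary by build:

    -- List current connections; internal parallel-query (exchange) tasks show up
    -- as extra connections that should normally come and go with their queries.
    SELECT Number, Name, Userid, BlockedOn, UncmtOps
    FROM sa_conn_info()
    ORDER BY Number;

The -gn setting itself goes on the server command line, e.g. dbsrv10 -n myserver -gn 40 -x tcpip c:\db\mydb.db (the value 40 is only a placeholder; size it for your workload).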
If you would like more diagnostic information about what the database is doing during slowdowns like this, you could try enabling minimal request-level logging or database tracing to record the statements being executed. That may allow you to pinpoint the query that is making the server do so much work.
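A minimal sketch of switching request-level logging on and off with sa_server_option (option names are as I recall them for the 10.x branch, and the log file path is a placeholder):

    -- Record the SQL text of requests to a file on the server machine.
    CALL sa_server_option( 'RequestLogFile', 'c:\\logs\\reqlog.txt' );
    CALL sa_server_option( 'RequestLogging', 'SQL' );

    -- Turn it off again once you have captured a slowdown.
    CALL sa_server_option( 'RequestLogging', 'NONE' );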
What stage of recovery was the database stuck in? If it was the rollback stage, then a transaction (probably the one that was causing the slowdown) had a large number of uncommitted operations that needed to be rolled back. This would explain the large size of the database file and the very long recovery time. If it was in the applying-transactions phase, you should check the transaction log with dbtran to see what it was trying to apply during recovery.
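If you copy the log aside first, translating it is a one-liner from a command prompt; the paths here are placeholders:

    rem Translate the transaction log into readable SQL for inspection.
    dbtran "c:\backup\mydb.log" "c:\backup\mydb_translated.sql"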
Mikel -- thanks for the reply. I had thought about the worker threads angle, but haven't investigated yet. The server is virtualized, with 1 vCPU and 4 vCores. So far, this is somewhat of an anomaly; we've been using Sybase for 10 years, and this is the first time anything like this has happened.
The recovery had gone through the checkpoint log and was working on the transaction log. Six hours after it started on the transaction log, it was not done, and the only other messages were about expanding the cache size. No matter what, growth of the .db file from 43GB to 72GB seems a bit excessive to me.
I'd recommend taking a look at the transaction log (translate it using dbtran) and seeing if you can find anything unusual. Usually this kind of problem is caused by a runaway query that is repeatedly executing a DML statement. When you look at the console log, when was the last successful checkpoint compared to the time you shut down the server? Are all the checkpoint starts and ends paired very close together?
The growth of the DB file is probably due to the undo log. As we go through and apply the transaction log, we have to write the information needed to reverse each operation (e.g., the insert corresponding to a delete statement). This rollback/undo log is stored in the db file. If there are millions of operations and a commit never occurred, we have to apply them all, then go back to the rollback log and roll them all back. Also, we have to keep the unmodified pages from the previous checkpoint around (the checkpoint log) in case you happen to crash again during the recovery.
The growth of the database file sounds normal given the circumstances; it would be interesting to know how it got into this state. Hopefully something stands out in the dbtran output (or in the record of checkpoint times).
I did not have the presence of mind to save the transaction log prior to restoring the backup, so I can't investigate that. Getting the factory working after 8 hours clouded my thinking. However, I did find out that the night before, a purge process tried to delete approximately 10 million records; I suspect that was the root of the problem.
That could be it. The 29GB of file growth during the recovery would be the ~10 million insert statements (with the full row data) being added to the rollback log, plus the unmodified pages in the checkpoint log. To avoid this in the future, you could turn such a statement into a TRUNCATE TABLE if possible, or commit periodically during the purge (perhaps checking the UncommitOp property).
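A minimal sketch of that kind of batched purge, using a hypothetical old_records table with a record_id primary key and a created_at column (batch size and cutoff date are illustrative only):

    -- Delete in chunks and commit after each chunk so the rollback log
    -- (and the recovery time after a crash) stays small.
    BEGIN
        DECLARE batch_rows INT;
        purge_loop: LOOP
            DELETE FROM old_records
            WHERE record_id IN ( SELECT TOP 10000 record_id
                                 FROM old_records
                                 WHERE created_at < '2013-01-01'
                                 ORDER BY record_id );
            SET batch_rows = @@rowcount;
            COMMIT;                      -- release the uncommitted operations
            IF batch_rows = 0 THEN
                LEAVE purge_loop;
            END IF;
        END LOOP;
    END;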
Also make sure that all of your tables - and particularly the table from which the 10 million records were being deleted - have a primary key (or unique index). If a table does not have a primary key, then when a delete is replayed from the transaction log, a complete table scan needs to be done for each and every row being deleted - the initial delete may have been relatively quick, but the replay from the transaction log will be really slow.
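If the purge table turned out to be missing one, adding a key is simple (table and column names here are placeholders):

    -- Give the table a primary key so that log replay can locate each
    -- deleted row directly instead of scanning the whole table per row.
    ALTER TABLE old_records
        ADD PRIMARY KEY ( record_id );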