cancel
Showing results for 
Search instead for 
Did you mean: 

Corrupted client-side backup

justin_willey
Participant
2,763

We have a utility for automating backups & DR servers for our customers. As well as doing things like making sure that live backup keeps running and regularly applying log files to DR databases, it can invoke client side backups (using dbbackup.exe). This usually works fine even with 200-300GB databases.

However, we have been getting quite a few cases where, although the backup completes and reports success, any attempt to use the database (eg apply logs, verify etc) results in a variety of assertion errors. On checking, the original database is fine, as is the disk system of the DR machine.

We have submitted a few of these databases to SAP support who have said that basically the database file is corrupt. It would seem likely therefore that some of the database pages are getting corrupted by network errors during the back-up process.

Is there any form of in-built error checking, checksum validation etc that goes on during a database backup? Obviously network errors shouldn't be happening, but I'm surprised that the issue isn't being detected during the backup process. Or, is something quite different going on? (As mentioned above, a whole variety of assertion errors can come up - but all seem to relate to a corrupt database file - the log file always seems to be OK when checked)


Our log of events (with a few sensitive things replaced with XXXX:

20:46:58 Backing up Primary Database to E:\\XXXXX\\IQX.db, Replacing Standby

20:46:58 dbbackup.exe -k nocopy -b 240 -y -o "E:\\XXXXX\\Messages\\20170705200027 Full Backup.txt" -c "ENG=BRSSIQX;DBN=BRSSIQX;LINKS=TCPIP(Host=192.168.101.180;Port=2639);UID=pears;PWD=XXXXX;CON=FullBackupBRSS" "E:\\BRSSIQXHubBackup" SQL Anywhere Backup Utility Version 16.0.0.2076 (70302259 of estimated 70302259 pages, 100% complete) Database backup completed


22:18:58 Applying E:\\XXXXX\\IQX.log to Standby Database E:\\XXXXX\\IQX.db

22:18:58 dbsrv16.exe -n LogApplyBRSS -o "E:\\XXXXX\\Messages\\20170705200027 Full Backup.txt" -qs -c 1G -ca 0 -x none -xd "E:\\XXXXX\\IQX.db" -a "E:\\XXXXX\\IQX.log" I. 07/05 22:18:58. SQL Anywhere Network Server Version 16.0.0.2076

I. 07/05 22:18:58. Workgroup Edition I. 07/05 22:18:58.

I. 07/05 22:18:58. Copyright © 2014 SAP SE or an SAP affiliate company. I. 07/05 22:18:58. All rights reserved. I. 07/05 22:18:58. Use of this software is governed by the Sybase License Agreement. I. 07/05 22:18:58. Refer to http://www.sybase.com/softwarelicenses. I. 07/05 22:18:58.

I. 07/05 22:18:58. Connection limit (licensed seats): 155 I. 07/05 22:18:58. Processors detected: 1 (containing 2 logical processors) I. 07/05 22:18:58. Processor limit (Workgroup Edition): 2 I. 07/05 22:18:58. Processor limit (licensed processors): 2 I. 07/05 22:18:58. Maximum number of processors the server will use: 1 physical processor(s), 2 core(s) I. 07/05 22:18:58. This server is licensed to: I. 07/05 22:18:58. XXXX I. 07/05 22:18:58. XXXX I. 07/05 22:18:58. Running Windows 2012R2 Build 9600 on X86_64 I. 07/05 22:18:58. Server built for X86_64 processor architecture I. 07/05 22:19:00. 1050108K of memory used for caching I. 07/05 22:19:00. Minimum cache size: 1050108K, maximum cache size: 11320464K I. 07/05 22:19:00. Using a maximum page size of 4096 bytes I. 07/05 22:19:00. Multiprogramming level: minimum:2, current:20, maximum:80 I. 07/05 22:19:00. Automatic tuning of multiprogramming level is enabled I. 07/05 22:19:00. Starting database "IQX" (E:\\XXXXX\\IQX.db) at Wed Jul 05 2017 22:19 I. 07/05 22:19:00. Performance warning: Database file "E:\\XXXXX\\IQX.db" consists of 138 disk fragments I. 07/05 22:19:00. Database recovery in progress I. 07/05 22:19:00. Last checkpoint at Wed Jul 05 2017 20:46 I. 07/05 22:19:00. Transaction log: E:\\XXXXX\\IQX.log... E. 07/05 22:19:00. ERROR Assertion failed: 201502 (16.0.0.2076) Inconsistent page modification counter value E. 07/05 22:19:00. Error: Internal database error ERROR Assertion failed: 201502 (16.0.0.2076) Inconsistent page modification counter value -- transaction rolled back

15:50:30 ERROR - Failed with exit code 1

VolkerBarth
Contributor
0 Kudos

I have no clue, but do these backups validate successfully?

Breck_Carter
Participant
0 Kudos

That would be a moot point, since dbsrv16.exe is failing to start the backup (or copy thereof) and that would have to run successfully before starting dbvalid.exe.

Vlad
Product and Topic Expert
Product and Topic Expert

Justin, I apologize in advance that I cannot help you with the precise answer. But here are my thoughts:

  1. By default SA has checksum for all global pages enabled (see dbinit, option -s)
    According to the documentation, the checksum is calculated just before the page is written to disk.

  2. If you can reproduce the issue while doing the backup over the network, I'd suggest you to backup the data using the same host, where the DB is running. I know this is what not recommended, but this will help us to exclude the network as a "broken" part.

  3. Again, if the network quality is somehow guilty, it means that every archive (7z/zip/whatever) will be corrupted if you simply copy it from your DB server to the destination machine.

  4. From old Breck's replies (I agree with him), there is a chance that something is wrong with your RAM. It is worth testing RAM using one of the utilities from this page. I am sure you have already done this twice on both machines.

  5. I'd like to add that your SA server is quite old (2.5 years old). The latest version is SQL Anywhere 16.0 SP70 Build 2471.

A short summary - verify the network (do whatever tests you want, e.g. transfer large files over the network, calculate CRC before and after transferring; do the backup on the DB server side), verify RAM (both servers, thoroughly), verify HDD's SMART and surface (just in case), upgrade SA server.

Breck_Carter
Participant
0 Kudos

16.0.0.2076 is rather [cough] immature. Even Foxhound 4 requires 2127 to run its own database, and that build was carefully chosen as "the earliest build that can tie its own laces" 🙂

Breck_Carter
Participant
0 Kudos

@Justin: Do not read this rant until after your problem is solved... it will just annoy you :)... read Vlad's comment first.

@EveryoneElse: There is no point in taking backups if you don't validate them afterwards. To put it another way, if you're not willing to spend the time and expense to validate your backups, don't bother creating the backups in the first place... history shows they will not work.

For large databases the "time and expense" of validating backups is very real, since in most cases you have to create a second (throwaway) copy of the backup for the validation process.

The phrase "in most cases" means "you want to preserve the ability to apply incremental log backups after restoring the backup". In order to validate the backup, dbsrv16.exe must first be run normally and that process makes it impossible to apply a future incremental log backup or multiple differential log backups... hence the "second (throwaway) copy of the backup".

You might ask, "Why can't I just start the backup in read-only mode, to run the validation without making a second copy?

Answer: dbsrv16.exe -r prevents the automatic recovery process from running, and that is needed in order run the validation process... assuming you want to prove the backup can actually be used in a restore.

VolkerBarth
Contributor
0 Kudos

Hm, I wasn't sure whether that error was displayed while already applying transaction logs (as -a is used), so it could be different when the database file would be started alone or with the original log file.

VolkerBarth
Contributor

There are some fixes for assertions and backups between 16.0.0.2076 and 2471, e.g.

Engineering Case #781340)================

    Under rare circumstances, an incremental backup executed on the client could 
    have incorrectly reported success while having copied an incomplete transaction 
    log. This has been fixed.

    A way to check if the problem has occurred is to validate the backup. File 
    sizes between the database transaction log and the backup log can also be 
    compared.

Not that I would anyhow claim that would apply here.

Breck_Carter
Participant
0 Kudos

FWIW £119.99 buys a Seagate Expansion 4 TB USB 3.0 Portable 2.5 inch External Hard Drive which is perfect for those "throwaway copies" 🙂

VolkerBarth
Contributor
0 Kudos

FWIW, v17 has introduced "dbvalid -im v" to allow in-memory validation without modifiying the database and log file.

@Justin: Not helpful for you currently, I know:(

justin_willey
Participant
0 Kudos

No, that's what we would try to do next in the process, but the log has to be applied first to bring the database to a consistent state. We stop at this point if there is an error as any subsequent operation would of course fail.

Breck_Carter
Participant
0 Kudos

To be more specific, dbvalid -im v is not for validating the active database, it is specifically for validating a backup copy that is auto-started by the connection from dbvalid.exe.

So yes, this might make the "throwaway copy" unnecessary if (a) the backup is a matching full backup of both database and log file or (b) the backup is an old database backup file and fresh incremental log backup file... not sure about (c) the backup being validated is an old backup file plus a dozen or so differential log backups.

I have come to believe that (a) and (b) are good, and (c) pretty much sucks, it is 2017 and disk space is cheap, so yes, dbvalid -im v is a good thing!

=====

-im mode

Defines the in-memory mode to use if the database server is auto-started (by setting the -im database server option on the StartLine parameter of the connection string).

v or validation (default) In-memory validation. Use this mode to validate a backup copy of the database without modifying it or its transaction log files.

Breck_Carter
Participant
0 Kudos

> what we would try to do next in the process

OK, so definitely do not read my rant below 🙂

Breck_Carter
Participant
0 Kudos

Indeed... your point was not moot.

justin_willey
Participant

Thanks for the "trigger warning" but not required. That's what this process is all about!

Our normal process (abbreviated) is:

0 Validate existing DR (disaster recovery) database in read only mode - if OK move to safe copies folder

1 Backup database & log to DR machine as new DR database (no truncation)

2 Apply log (from backup) to DR db (what failed in this case)

3 Validate DR DB in read only mode

4 Apply copies of "Live" log backup on regular basis (eg every 20 mins) until we do step 0 again!

This usually works fine and means we have regular off-line validation and a constantly available DR server, with no downtime on the live server for the process. I've missed out the bit where we truncate the log on the live server so it doesn't get too big.

If the live server goes down, we just apply the latest version of the "live backup" log and we are up-to-date.

All the steps can send out email, SMS, SMNP etc notifications.

Breck_Carter
Participant
0 Kudos

Your integration of live backup (dbbackup -l) into the regular backup and validation process is intriguing. I assume you run it locally, rather than use live backup to facilitate an offsite recovery server. I'm guessing this is possible with live backup but not HA, the first time I've heard of live backup having significant technical advantage over HA.

...or am I out to lunch? 🙂

justin_willey
Participant
0 Kudos

It runs on the DR server. Our process, that runs as a service on the DR machine(s), monitors it and ensures it restarts etc if connection is lost, if the live or DR servers are rebooted etc.

Our clients prefer the approach to HA because of the control and visibility. The DR copy can also be duplicated for reporting, warehousing etc.

VolkerBarth
Contributor
0 Kudos

Hm, so the same DR server making the sometimes corrupt client-side backup does also do the live backup?

(I'm just wondering whether possible network problems might corrupt the latter, too...)

Breck_Carter
Participant
0 Kudos

FWIW live backup and HA can coexist, and with HA you can run reports against the secondary, or a third read-only scale-out... something to keep in mind, since HA in V16+ isn't Your Father's HA any more 🙂

justin_willey
Participant
0 Kudos

The HA would deal with the DR side, we would still need to back it up to validate it.

justin_willey
Participant
0 Kudos

The log files have always been fine. It appears that the backup db file is corrupted before we try to start it.

justin_willey
Participant
0 Kudos

Thanks Vlad for your pointers. I really appreciate your input.

5 I certainly agree about using the latest version though I haven't seen any changes that specifically address this sort of issue. We are in the process of pushing out a newer build. All the problems we have seen so far have been on 2076.

1 We have checksums on, but one of the things I'm trying to find out is whether the backup process (as opposed to the engine) uses them.

2 We have done that in every case and the live database has always been OK. It is only the backed-up database that has the issue.

3 Occasionally we have been able to get an error when copying a really large file (similar in size to the databases eg 200G - 300G) When this happens we always know about it immediately - Windows reports a failure in copying - but there may also be an element of hidden re-trying going on and we don't know if dbbackup.exe does the same. The odd thing is that in all these cases the dbbackup.exe -l has been going on for months (on dozens of different servers and networks) copying thousands of gigabytes of data without a single error. Does it have an error detection / correction method that the full backup doesn't, I wonder?

4 We have checked in a couple of cases, without anything showing. However it seems unlikely that a lot of different customers would all have servers with faulty RAM that only manifested this particular problem.

Accepted Solutions (1)

Accepted Solutions (1)

justin_willey
Participant

It seems we were being affected by one or more issues fixed before build 2471 (Thanks for your email Mark Culp!). Certainly we are now not getting repeats.

We did then get hit by the apparently new problem:

================(Build #2524  - Engineering Case #809863)================

Under exceptional rare circumstances, the server may loop infinitely during 
an index backward scan. This has been fixed.

This manifested itself for us by a statement like

select max(column) from table where anothercolumn = 'something'

just never returning, while

select min(column) from table where anothercolumn = 'something'

was fine.

However that was fixed in build 2546 before we got a chance to finish reporting it - so all seems good now.

MarkCulp
Participant
0 Kudos

Glad to hear that your issues were fixed!

Answers (0)