on 2014 Sep 02 5:19 PM
The objective is to have a backup of a relatively large (30 GB) database that can be restored rapidly, with a very narrow window of potential data loss, as this database is in use 24x7x365. This should all be documented in such a manner that we can assume the entire IT team was beamed up by aliens and the janitor is responsible for bringing the database back up.
There is much helpful information on this site. I've read enough now that I've started to second guess the plan. What I've outlined below seems to work just fine, but I'm sure someone can advise a better way. Database mirroring is coming, but not until the first part of next year.
We have 'ServerA', which is running the database, and 'ServerB', which will be the backup target. ServerB is maintained at the same patch level, etc., as ServerA. Both servers run 64-bit Windows.
On ServerB there is a batch file that uses DBBACKUP to create a full backup. For now it runs once a week. It's a pretty straightforward command:
dbbackup -c "DSN=MyDSN;" -x -y c:\\backup\\full
Another batch file runs every 30 minutes to grab the transaction log:
dbbackup -c "DSN=MyDSN;" -t -y c:\backup\incr
Should we need to use the backup -- which implies that all three of my VMWare HA hosts are down and the redundant iSCSI SANs have failed -- the steps are as follows (a consolidated sketch follows the list):
- Make new "working folder" for the recovery server and switch to it
- copy files from c:\\backup\\full to it
- apply the transaction log from the full backup
DBSRV16 mydatabase.db -a mydatabase.log- Exercise patience until the server shuts down
- Copy mydatabase.log from c:\\backup\\incr
- Repeat steps 3 & 4
- miller time
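Strictly for the documentation, here is a minimal consolidated sketch of those steps, assuming the backup folders above and a hypothetical c:\recover working folder (the file names are illustrative):

rem Steps 1 and 2: make the working folder and copy in the full backup.
md c:\recover
copy c:\backup\full\mydatabase.db c:\recover
copy c:\backup\full\mydatabase.log c:\recover
cd /d c:\recover
rem Steps 3 and 4: apply the log from the full backup; with -a the server
rem shuts itself down when it has finished applying the log.
DBSRV16 mydatabase.db -a mydatabase.log
rem Step 5: bring over the latest incremental log and repeat the apply.
copy /y c:\backup\incr\mydatabase.log c:\recover
DBSRV16 mydatabase.db -a mydatabase.log
rem Finally, start the recovered database normally.
DBSRV16 mydatabase.db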
> a very narrow window of potential data loss, as this database is in use 24x7x365
> entire IT team was beamed up by aliens and the janitor is responsible for bringing the database back up.
Those two requirements are reasonable for a high availability setup, not backup and recovery... not even live backup. Experience has shown that backup is easy but recovery is hard because you can never predict exactly what the problem's going to be.
You always need a backup and recovery plan, and it should be tested, but don't expect it to provide high availability... for that you need... wait for it :)... high availability.
FWIW a dbbackup -l live backup command can be wrapped in a GOTO start loop that repeatedly runs the command if it fails. The usual cause for failure is loss of contact with the master database server... if it is intermittent then the GOTO loop will let dbbackup reconnect and continue on... if the loss of contact is permanent then the GOTO loop will continue until you kill it, which you will probably do because it's time to start the secondary server and apply the live log.
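A minimal sketch of that loop, assuming the MyDSN data source from the question and a hypothetical c:\backup\live folder for the live log:

@echo off
rem Keep the live backup running: dbbackup returns when it loses contact
rem with the primary server, so pause briefly and then reconnect.
:start
dbbackup -c "DSN=MyDSN;" -y -l c:\backup\live\mydatabase.log
rem Crude ten-second pause before retrying.
ping -n 11 127.0.0.1 > nul
goto start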
All that live backup stuff is possible, but for V16 it's the Amish way of doing high availability... no offense to the Amish, or to live backup, or SQL Remote, or any of The Old Ways... supported by all but embraced by few.
Remember that when the live backup is used for failover you have to (a) start the second server and (b) apply the log... and if it's been a while since a full backup has been taken, applying the live log may be time consuming.
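As a rough sketch of (a) and (b), assuming the last full backup has already been restored on the secondary machine in a hypothetical c:\failover folder and the live log was being written to c:\backup\live:

rem Apply the live log to the restored database; -a shuts the server down when done.
copy /y c:\backup\live\mydatabase.log c:\failover\live.log
DBSRV16 c:\failover\mydatabase.db -a c:\failover\live.log
rem Then start the failover database for real.
DBSRV16 c:\failover\mydatabase.db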
Plus, it's harder and more time consuming to practice a live log failover than an HA failover. Janitor-ready documentation is a real challenge, far beyond my abilities (I tried, with V9, failed).
I may have drawn an inaccurate picture. The primary server is running in VMWare with HA, connected to redundant iSCSI SANs, with geographic separation of redundant devices. So the primary mission for this exercise is protection from 1) catastrophic hardware failure and 2) severe database corruption.
The 'janitor can do it' factor comes from ISO auditors who believe that everyone can do any job in the factory given good enough instructions. So mostly it's an exercise to make sure a box can be checked off.
Please post a new question... a backup and recovery plan for a geographically separate HA setup is an entirely different proposition. In particular, you do not want any of your regular backup activities to interfere with regular HA operation or even HA failover.
Is it possible to check the box without having an actual janitor demonstrate recovery from catastrophic failure? Without testing, recovery is guaranteed to fail... not 99.9%, but 100%.
It's ISO, so "checking the box" mostly means that we show that we have the instructions available, and have confirmed that if followed, they will work -- janitor not required. Even as we speak, I am doing testing and documentation of the recovery process.
In this instance, I'm using 'HA' in the VMWare context, which will fail a VM over from one host to another. This we have tested (both intentionally and not so much), and it works well -- the most users noticed was a 10-second lag on a couple of screens in the ERP system. 'Geographically separated', for now, means the hardware is at the other end of the building, literally a third of a mile away, with gigabit fiber between.
In reality, the process I am documenting is for a "the building burned down" scenario. Since we are a plastics manufacturer, in that case I'll have more important issues than whether I have an exact, up-to-the-minute inventory.