Mark_A_Parsons

I've been working with Sybase's Data Assurance (DA) product for about 6 weeks and during that time I've tested several of DA's options/functions/switches/bells&whistles (and there are quite a few that I haven't touched).

Caveats

DA has the ability to compare data sets as well as schemas.  To date my work has focused solely on data set comparisons so this is what I'll be concentrating on in this blog.

DA has a plugin module for the Sybase Control Center (SCC) product but a) I'm a command line type of guy and b) I've been burned too many times by the older Sybase Central product (ie, I don't trust Sybase's GUI clients to do what they say they'll do).  Net result is that all of my work, and the following ramblings, are based on using the command line tool (via isql) to interact with DA.

NOTE: DA uses the Apache Derby database as its DASD; unfortunately DA does not come with any means of connecting to the DASD (nor any sort of ER diagram or inventory of the DASD itself).  If you've configured your DASD for direct access and you've got a Derby-capable client (eg, ij, SQuirreL SQL, Aqua Data Studio) you can probably reverse engineer your DDL from the DASD tables, ymmv.

DA has the ability to auto-create some components (eg, comparesets, jobs).  This consists of loading the auto-generated components into the DA's system database (aka DASD).  There is currently no way to reverse engineer these components so you'll have no DDL of your own.  DA is still quite new and I've run into a handful of bugs (including the 'import job' auto-generator) so I prefer to have complete control over my DDL, ie, I do not use any of the auto-creation commands that come with DA.

DA has the ability to compare data between different RDBMS's.  In my case I'm only working with ASE dataservers as the source and target.

My current replication environment consists of a mix of table- and database-level (aka MSA) replication.

DA version & documentation

DA is still in its infancy (ie, still going through growing pains => bugs and unwanted features abound).  You'll want to make sure you're running the latest version of DA to ensure you pick up not only bug fixes but also improvements/changes to commands.

As of this writing I'm running SP200 (plus some one-off patches provided by DA engineering).

You'll also want to make sure you've got the latest copy of the DA Users Guide. [NOTE: DA documentation is still woefully lacking in usable detail but it's all we've got at this point.]

Local vs standalone agent

When you install the DA server you automatically get an internal/local copy of the agent installed for you.

You also have the option of installing one or more standalone agents pretty much anywhere within your network domain (ie, anywhere the DA server and agents are able to see each other on the network).

In the DA documentation you'll see suggestions of installing a standalone agent on each of the hosts where you have a source/target data set (eg, ASE installation).  While there's generally nothing wrong with this suggestion, I'd highly recommend you refrain from co-locating a DA agent with your ASE, at least not until you've had a chance to work with DA and understand how it really works.

In my environment I've installed the DA server (plus its default internal/local agent) as well as a standalone agent on a non-production host that usually has a lot of free cpu cycles.  All of my comparesets reference the standalone agent (no particular reason other than wanting to see how the DA server interacted with a standalone agent).

Because of the parallel processing capability of the DA server and agent I quite often see 100% cpu utilization on my host (3x core AIX lpar) when running just 3 jobs/comparesets at a time.  If you plan on running DA against a production environment the last thing you want to find yourself doing is explaining to your boss (and angry users) why you've got a handful of DA agent threads eating up all of the cpu resources on your production ASE host.

The key takeaway: do not co-locate your DA agents with your (production) dataservers unless you've got a surefire way of limiting the cpu resources used by said DA agent ... otherwise you may want to ensure your resume/CV is up to date.

Basics

Before you can compare any data sets you must tell DA how to connect to your databases, ie, you must create a connection profile (create connection) for each database that contains source or target data.  You can (theoretically) have as many connections as you wish.

For a given set of data you must define a comparison set (aka compareset; create compareset) which designates a source connection, owner and table (or view) plus an optional where clause.  You must also designate the same components for one or more targets. You can (theoretically) have as many comparesets as you wish.

Before you can run your comparesets you must assign said comparesets to a job (create job).  Within a job each compareset is assigned a logical comparison name.  A job must reference at least one compareset; a job can reference (theoretically) as many comparesets as you wish. You can (theoretically) have as many jobs as you wish.

Once you've got a job defined you can run it (ie, perform your actual data comparisons; run job), monitor the job (monitor job) and generate a report of the results of the job (show history).
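
Putting those steps together, a first pass at the DDL tends to look something like the outline below.  NOTE: This is a hedged sketch; the command names are real but I've elided the clause-level syntax (which is version-specific), so take the authoritative forms from the DA Users Guide, and all object names here are placeholders.

     create connection conn_pdb1 ...     -- one profile per source/target database
     create connection conn_rdb1 ...
     create compareset cs_tab1 ...       -- source + target(s) + optional where clause
     create job j_tab1 ...               -- references one or more comparesets
     run job j_tab1
     monitor job j_tab1
     show history j_tab1 latest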

Descriptors (aka names)

Most descriptors/names (eg, connection name, compareset name, job name) are limited to 255 characters.

Descriptor names must be unique within their component type (eg, all connection names must be unique).

In my current environment I've got ~300 connection profiles, 6000 comparesets and 6000 jobs. [NOTE: For reasons I'll outline later I've designed my jobs to reference a single compareset hence my number of comparesets = my number of jobs = 6000.]

You'll want to give some thought up front as to your naming standards so that you do not find yourself running into errors with duplicate descriptors/names.

In my environment I've chosen the following descriptor naming standard:

     connection : conn_<env>_<DS>_<DB>

     compareset : cs_<env>_<PDS>_<PDB>_<owner>_<table>__{tab|MSA}__<repdef>

     comparison : same as compareset

     job        : j_<env>_<PDS>_<PDB>_<owner>_<table>__{tab|MSA}__<repdef>

     where <env> is one of { dev | test | uat | prd }

NOTE: My naming standard for compareset and job names is the same except for the initial prefix (cs_ vs j_); this is by design (more details below).

When logged into the DA server you can run variations of the 'show' command to list out your various components.  You can use the '*' wildcard character with the 'show' command.  Having a common naming standard, along with the use of the '*' wildcard character, allows you to display a list of only those components you're interested in (eg, show connection *_prd_* to show a list of production connection profiles).

If you're like me you'll be storing your DDL for future reference/use.  In my case I work in a UNIX environment so I needed to come up with a directory/file naming standard for managing my repository of DDL scripts.  My current structure looks like:

     DA/<env>/

               comparesets/<repdef>/<compareset_name>.sql

               jobs/<repdef>/<job_name>.sql
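
With that layout in place, (re)creating a component is just a matter of feeding the stored script back into the DA server via isql.  A minimal sketch, assuming a ksh environment and placeholder login (da_admin) and server (DA_SERVER) names:

     #!/bin/ksh
     # (re)apply a stored compareset DDL script to the DA server
     # (env/repdef/compareset values are illustrative placeholders)
     env=prd
     repdef=rd1
     cs=cs_${env}_PDS1_pdb1_dbo_tab1__tab__${repdef}

     isql -U da_admin -S DA_SERVER -i DA/${env}/comparesets/${repdef}/${cs}.sql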

jobs and comparesets

A job may consist of multiple comparesets.  The job can be configured to run X number of comparesets in parallel.  This makes it quite handy to, say, compare all of the tables (aka comparesets) for an MSA repdef/sub by grouping all of the comparesets under a single job ... and this feature appears to work rather well.

Where you run into a problem (IMO) with having a single job manage multiple comparesets is this: how do you re-run a subset of the comparesets within a job (eg, you need to modify a connection or compareset; you applied a reconciliation script to a target and want to verify you're now in sync)?  The only way I've found to do this is to update a job to disable all of the comparesets you don't want to run, re-run the job, then update the job again to re-enable all of the comparesets.

While disabling/re-enabling comparesets is certainly doable, I've opted for a simple approach of 1 job = 1 compareset, and a small UNIX ksh wrapper script to manage running X number of jobs in parallel at a time.  Then if I need to run a specific compareset again I just re-run the associated job.  By using the same name for the compareset and job (except for the appropriate prefix of cs_ or j_) it's quite easy to find the compareset given a job name or vice versa (see the sketch below).
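
Because the names differ only by prefix, mapping between the two is a ksh one-liner; a minimal sketch (object names are placeholders):

     #!/bin/ksh
     # derive the job name from a compareset name (and back) by swapping prefixes
     cs=cs_prd_PDS1_pdb1_dbo_tab1__tab__rd1

     job=j_${cs#cs_}       # cs_xxx -> j_xxx
     cs2=cs_${job#j_}      # j_xxx  -> cs_xxx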

DA server/agent sessions

The vast majority of your interaction with DA will be through the DA server.  The only interaction I have with the DA agent is to a) monitor connection usage stats and b) shut down the agent.

For both components you'll quickly find that when you go to run a command that generates a lot of output (eg, 'show compareset') you'll be presented with just the first 100 lines of output.

For some reason the DA developers opted for a default max output of 100 lines sooooo, the first thing you'll want to do when you log into the DA server (or agent) is to increase the number of lines displayed to your monitor: session max_rows <some_big_number>

In my case I typically issue session max_rows 10000.

NOTE: If there's a configuration setting to change the default max_rows=100 I haven't found it, yet.

Running a job (non-blocking vs blocking)

The 'run job' command is a non-blocking call; this means that the job is kicked off in the background and the command line prompt is presented to you for entering another command.

On the one hand this is a good feature because it allows you to use the same login session to monitor the job (monitor job) while it's running.

On the other hand this is a not-so-good feature if you've got a process (eg, UNIX shell script wrapper) that's responsible for running a large number of jobs.  Let's say you have 2000 jobs to run and your shell script has a loop that issues 'run job <job_name>'; guess what ... you'll kick off 2000 jobs in the background.  While DA could theoretically manage all of those jobs, in reality one of two things will happen: 1) jobs will start aborting all over the place because the DA server and/or agent(s) run out of memory or 2) the jobs will run painfully slow because they are contending with each other for (relatively) limited cpu resources.  [NOTE: A good example of why you wouldn't want a DA agent running on the same machine as your ASE instance, eh!]

Thankfully the SP200 version of DA added the optional 'wait' feature to the 'run job' command which has the effect of turning the command into a blocking command, ie, you will not see the command line prompt until the job has completed running.

If you're running a job manually from the command line then by all means continue to use the non-blocking version of 'run job', but if you've got a shell script that'll be managing a large number of jobs then you should definitely consider using the 'run job <job_name> wait' command.
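
As an illustration, below is a stripped-down version of the kind of ksh wrapper I'm describing.  This is a sketch under a few assumptions: a jobs.list file containing one job name per line, plus placeholder isql login (da_admin) and DA server (DA_SERVER) names; password handling is omitted.  The 'wait' keyword on 'run job' is what makes the throttling work:

     #!/bin/ksh
     # run DA jobs from jobs.list, at most MAXPAR at a time; each background
     # isql session blocks on 'run job ... wait' until its job completes
     MAXPAR=3
     n=0

     while read job
     do
         print "run job ${job} wait\ngo" | isql -U da_admin -S DA_SERVER &

         n=$((n + 1))
         if [[ $n -ge $MAXPAR ]]
         then
             wait      # drain the current batch before starting the next
             n=0
         fi
     done < jobs.list

     wait              # drain the final (partial) batch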

Assorted configurations

java memory settings

The DA server and agent(s) are java processes.  Each server/agent is initially created with its own RUNserver file.  Within this RUNserver file is the normal java scripting environment stuff, including the java command used to startup the server/agent.

If you find your jobs dying because of memory starvation you can look at configuring your DA server and/or agent(s) with a) more starting memory (-Xms) and/or b) a higher max memory (-Xmx) usage.
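
The java command line in the RUNserver file is where the -Xms/-Xmx flags live.  The fragment below is illustrative only (the jar/class names will differ in your installation); the memory flags themselves are standard JVM options:

     # fragment of a RUNserver file (paths and class names are placeholders)
     java -Xms512m -Xmx4096m ... com.sybase.da ...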

disk space

If a DA agent is configured to perform external sorting, keep in mind that the agent can only keep so many rows in memory; intermediate sort sets get written to disk (and are deleted once the sort completes).  Also, the DA server maintains a history of job runs on disk in the form of report *txt/*xml files as well as SQL reconciliation scripts (if configured to generate recon scripts); for large volumes of differences these report/recon files can get quite large.

The key issue here is to make sure your DA server and agent(s) have enough space in their data directories (see my previous blog post for details on configuring your jobs to use external sorting) for any situation you may come across.

sorting: internal (in ASE) vs external (by agent)

To compare data DA must have the data sorted in PK order.  To limit the amount of cpu and tempdb space used in my ASE dataservers, as well as limit the amount of data cache thrashing, I've opted for external sorting (ie, let the DA agent sort my data).  See my previous blog post for details on configuring your jobs to use external sorting.

concurrent connection pool size

By default the DA agent can manage 5 concurrent instances of a given connection (aka connection profile). If you plan to have more than 5 concurrent comparisons running which touch the same database (either as source or target), then you'll need to increase this setting.  This configuration is made at the DA server level via the config command: config db_connection_pool_size 30

If the DA agent runs up against the max number of connection instances then your associated job(s) will appear to hang. The only way to 'fix' this appears to be to bounce the agent and server and restart your jobs where you left off.

NOTE: I've run into a bug (er, 'feature') with the DA agent whereby it doesn't always re-use its connections, with the net result being that even though I may never have more than 3 active connections into a given database (source or target), I still have to bump up db_connection_pool_size to a large number (eg, 30, 50, 70) to make sure my series of jobs complete successfully.  Because of this bug (er, 'feature') I (currently) have to keep an eye on the number of active instances for a given connection (see the agent 'show connection' command); if the number of instances of a connection starts to come close to my db_connection_pool_size setting I either have to a) increase db_connection_pool_size or b) interrupt my processing by bouncing the DA server and agent and then restart my jobs where I left off ... *ugh*
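
Keeping an eye on this from the command line is straightforward; a sketch of the agent session I use (remember to bump max_rows first so the output isn't truncated), with the per-connection instance counts in the output compared against your db_connection_pool_size setting:

     -- logged into the DA agent via isql
     session max_rows 10000
     go
     show connection
     go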

blob comparisons

By default DA will compare the first 1024 bytes of a blob (ASE text/image datatype).  This value is controlled by the DA server level lob_fetch_size configuration parameter.

I've got a few text columns containing upwards of 2M bytes so I've reconfigured my DA server accordingly, eg, config lob_fetch_size 2500000.

parallel configurations

DA has the ability to perform various operations in parallel either through configuration settings (eg, job:MAX_CONCURRENT_COMPARISONS; job:NUM_PARTITIONS; config:sort_max_thread; etc) or design (eg, using multiple agents running on multiple machines; compareset: 1x source, Nx targets; using compareset boundaries; etc).  It's quite easy to flood your host(s) with DA related processing, which in turn can easily max out the cpu resources on the hosts where you have DA server and agent(s) running. You'll need to run some tests in your environment to see what makes operational sense.

In my environment a) my comparesets have 1 or 2 targets (only because that mimics my replication environment; I'd expect no problems using 3+ targets if my environment called for it), b) I don't use compareset boundaries (I've found they're uneven and buggy) and c) my jobs only reference 1 compareset so job:MAX_CONCURRENT_COMPARISONS doesn't apply (and I make sure I control, via UNIX ksh script, the max number of jobs I run at any one time).

Obviously your ability to support a given level of parallel operations will ultimately come down to the available disk/cpu resources you can provide to the DA server and agent(s).

cutoff for comparison differences

By default DA will abort a comparison if it finds more than 1000 differences.  You can change this cutoff with the job/comparison configuration setting ABORT_DIFF_MAX.

In my environment my job generation code automatically sets ABORT_DIFF_MAX to 100000 for all job/comparisons.

SQL reconciliation scripts (auto-apply vs manual)

You'll most likely want to ensure all comparisons generate a reconciliation script (job/comparison:CREATE_COL_LOG=true; job/comparison:CREATE_RECON_SCRIPT=true).

While you can choose to have DA automatically reconcile differences for you, I'd recommend against this and instead configure your job/comparisons to *NOT* auto reconcile for you (job/comparison:AUTO_RECONCILE=false).

NOTE: I choose manual reconciliation for a few reasons: 1) I want control over when/how to run the reconciliation scripts, 2) if there are any differences I probably (as the Repserver DBA) need to investigate why there are differences (easier to do if the problem hasn't been masked over by auto-reconciliation) and 3) I don't want to be surprised by a large volume of differences (eg, 100's of thousands of differences due to a) wholesale replication failure or b) faulty compareset definition) trying to be applied against the target at an inappropriate time.
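
Putting those comparison options together (along with ABORT_DIFF_MAX from earlier), my generated job DDL ends up along the following lines.  NOTE: This is a hedged sketch; the option names are the ones discussed above but the surrounding create job clause shape is elided/paraphrased, so take the authoritative syntax from the Users Guide:

     create job j_tab1 ...
         -- per-comparison options (clause syntax elided)
         ... CREATE_COL_LOG = true
         ... CREATE_RECON_SCRIPT = true
         ... AUTO_RECONCILE = false
         ... ABORT_DIFF_MAX = 100000
     go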

When a job has completed running you can run the 'show history <job_name> latest' command to see where in the DA's data directory structure your SQL reconciliation scripts are located.

reconciliation script batch size

If configured to generate SQL reconciliation scripts, each reconciliation script will consist of a single transaction (begin/commit tran) wrapper around all of the DML commands in said reconciliation script.

You can override this default setting with the DA server configuration setting recon_tran_max_stmts.

In my environment I've opted for a 5K transaction size: config recon_tran_max_stmts 5000.  If, for example, I've got 10,005 differences then my reconciliation script will look like:

     begin tran ... 5000 DML operations ... commit tran
     begin tran ... 5000 DML operations ... commit tran
     begin tran ...    5 DML operations ... commit tran


Conclusion

As you'll find out once you start working with DA, there are a lot of configuration and design options.

As a (relatively) new product some of these options may not work as advertised, while other options may have some unexpected side effects (eg, DA can do fast comparisons but you may (initially) be surprised by the high cpu footprint).

As with its predecessor (rs_subcmp), DA can require a large repository of DDL so you'll want to give some thought to how you'll manage said DDL repository.

While I've found DA to be a major improvement over rs_subcmp, there are still plenty of ways to ... ummm ... errr ... get yourself in trouble, so make sure you thoroughly test your DA installation before using it in your production environment.
