cancel
Showing results for 
Search instead for 
Did you mean: 

Is it advantageous to create FOREIGN KEYs after data is loaded?

VolkerBarth
Contributor
4,420

When rebuilding a database with a RELOAD.SQL file, the rebuild process generally creates foreign keys after the data is loaded.

Is that a performance advantage when compared to including the foreign key definition right in the table creation step? – Or is that just in order to prevent FK violations during the data loading phase?

Or asked more generally:

Will the "quality" of the index automatically built for a FK be different when the FK definition is before or after the table is filled (with LOAD TABLE)?

If the order of steps is important, would it then be as useful to defer PRIMARY KEY definitions as well (what DBUNLOAD does not)?

Current system: 8.0.3.5574, also relating to 11.0.1 and 12 beta (I'm aware that PK/FK indices are handled differently in 10+ databases).

Accepted Solutions (1)

Accepted Solutions (1)

Former Member

There are three performance advantages that come from deferring index creation when loading bulk data.

First, index creation is more efficient when the table is already loaded. Consider a large database with two tables P and F with a foreign key from F to P. Assume that the buffer pool is not large enough to store a significant portion of P. As rows of F are loaded, the server checks for referential integrity (RI) violations. If the foreign key on F is already defined, the server needs to probe the primary key index on P to ensure there is a matching row. Since the buffer pool can't hold enough of P, many of these requests will require a disk seek (perhaps 10ms per row). This is effectively a nested loop join; for a table with millions of rows, it could take thousands of seconds to verify the RI. Instead of using a nested loop join, it would be more efficient in some cases to use a merge join or a hash join.

The server uses sorting during index creation to convert the nested loop join to a merge join. This reduces the random reads on the primary key index, and reduces the cost of verifying RI for foreign key indexes. Sorting is not just beneficial for foreign key indexes. Sorted rows allow the index to be built in physical order, avoiding random I/O within the index being built. This benefits all index types, not just foreign key indexes.

In version 8.0 and above, the server will consider sorting the keys of an index when building an index for a table that is already loaded. In versions 8.0 to 10.0, the server begins loading the table and monitors how many times it needs to block for a disk read (for example, to read the primary index or to read within the current index structure). Once a threshold is met, the server switches to a sorting plan where the rows are sorted in index order. This means each primary key value is probed at most once and the index is built in physical order.

In version 11.0 and later, the server always uses a sorted strategy. Further, the server considers parallel access plans that perform the sort on multiple threads and verifies referential integrity using joins (which may be parallelized). There is special JNLO(m) join type that is a memoizing join. The optimizer can construct plans that sort rows of F in foreign-key order then join to P with only one probe of P's primary key for each unique key. Other join types can also be selected by the query optimizer.

A second benefit to deferred index creation is that index pages are not interspersed with table pages. When an index is created before data is loaded, table pages and index pages are allocated as rows are loaded, and this can lead to index pages being interspersed with table pages. Normally this is not a very significant issue, but it can reduce the efficiency of sequential scans of the table. Sequential scans issue read hints to load data asynchronously. If index pages are interspersed, some additional number of seeks may be needed. If this does occur, reorganizing the table will correct the physical layout issue.

A third benefit to deferred index creation is that index leaf pages can benefit from sorting. Index leaf pages use prefix compression that can pack more entries into a page if they have a common prefix. When the data is loaded in sorted order, the compression does the best job that is possible. If the data is not loaded, there can be cases where compression could have saved space but the load order meant the opportunity is lost. Reorganizing an index also restores it to this "best possible" state.

So, there are some benefits to deferring creation of indexes when loading bulk data. The benefits improve the time needed to create the indexes and also can improve the physical layout of the table and indexes (although this is usually a minor factor). The physical structure can also be restored by reorganizing.

VolkerBarth
Contributor
0 Kudos

Ivan, that is a great answer. I guess it will need some time until I understand all the details but the general message is clear. Big thanks:)

VolkerBarth
Contributor

One question though: As the 2nd and 3rd advantage seem to apply to primary keys as well, would it make sense to defer the PRIMARY KEY definition as well? - I'm just asking as DBUNLOAD does not defer creation of primary and unique keys. - As with SA 10, one might use the ALTER INDEX PRIMARY KEY ... REBUILD clause for that.

Breck_Carter
Participant
0 Kudos

@Ivan: Does the second benefit imply that putting indexes in their own dbspace is a good idea? (which can be done with CREATE INDEX but not with primary and foreign key indexes AFAIK).

Former Member

@Volker: There would indeed be some benefits to building the primary key index after the data is loaded. I think that sorting can also be an issue depending on whether your primary key is the clustered index (default or explicit) for a table and whether the unload data files were created with ordering on. The benefits 2. and 3. diminish over time as changes are made to the table and I don't recall seeing cases where there really is a huge difference. It is something to consider if you create a database that will not be modified much.

Former Member

@Breck, if the dominant access pattern to a large table is sequential, it might be a good idea to put that table in its own dbspace. Some small gaps such as those introduced by index pages are likely just noise in a system handling a concurrent workload. I've noticed these issues when trying to benchmark specific operations with a single client doing big sequential scans. In many real-world situations, benefit 2. is probably not that noticeable.

Answers (1)

Answers (1)

Former Member
0 Kudos

My vote is to "prevent FK violations during the data loading phase". And I don't think that you will get some performance penalty to create FK after load data.

VolkerBarth
Contributor
0 Kudos

Well, the question is whether creating FK before loading the data gives a performance penalty (by means of a "bad balanced index" or the like)...