on 2014 Nov 25 1:56 PM
We're experimenting with developing our own 'data warehouse' with SQL Anywhere (v12 right now), using SQL Anywhere's native SQL for our ETL process. We looked into using the InfoMaker pipeline, but we're nervous about the future of that product, so we wrote our own "pipeline". Our database is growing quite large, and I'm not sure whether that's entirely because we keep adding new data to the db as we develop, or partly because of the method we use each night to refresh. Basically, this is our "pipeline": loop through a table on the warehouse containing the names and locations of the tables to be refreshed:
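A minimal sketch of what that kind of truncate-and-reload loop looks like, assuming a hypothetical control table (the table and column names here are invented placeholders, not our actual schema):

    -- hypothetical control table: one row per warehouse table to refresh
    -- CREATE TABLE etl_control ( target_table VARCHAR(128), source_table VARCHAR(128) );

    BEGIN
        FOR refresh_loop AS refresh_cursor CURSOR FOR
            SELECT target_table, source_table FROM etl_control
        DO
            -- empty the warehouse copy, then reload it in full from the source
            EXECUTE IMMEDIATE 'TRUNCATE TABLE ' || target_table;
            EXECUTE IMMEDIATE 'INSERT INTO ' || target_table ||
                              ' SELECT * FROM ' || source_table;
            COMMIT;
        END FOR;
    END;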
We assumed that, more often than not, the empty space from the truncated table would be reused, but is it possible that it allocates new space with each run? We're up to 10 GB, though that data comes from six other databases, so I'm not totally shocked at the size. The process runs quite fast, so we're happy with the performance, but we don't want to fill up the server unnecessarily. Of course we will continue to fine-tune what data we bring over, but if this approach wastes space, we may need to rethink our pipeline process. TIA.
Becky Snyder, Bradley University
So basically the question is whether a daily repetition of TRUNCATE TABLE/INSERT statements will lead to undesired fragmentation? If so, I would expect SQL Anywhere to reclaim most of the free space from deleted rows for new ones. Of course you can monitor that with built-in tools like the sa_table_page_usage system procedure and/or properties like "FreeSpace" and the like.
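For example, something along these lines shows how much of each table's allocated pages is actually in use, and how many free pages the database file currently has available for reuse (worth double-checking the exact property names in the v12 docs):

    -- per-table page counts and percentage of each page in use
    CALL sa_table_page_usage();

    -- number of free pages available for reuse within the database file
    SELECT DB_PROPERTY( 'FreePages' );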
FWIW: You might also be able to store the remote data in a local NOT TRANSACTIONAL temporary table and then use a MERGE statement to update the "real" warehouse table. That way you might have far fewer modifications on the real table (preventing fragmentation in the system dbspace), at the cost of possibly heavier use of the temporary dbspace...
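A rough sketch of that pattern (all names are invented; it assumes the remote data is reachable through a proxy table, as sketched further below):

    -- stage the remote rows in a non-transactional local temporary table
    DECLARE LOCAL TEMPORARY TABLE stage_customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(100)
    ) NOT TRANSACTIONAL;

    INSERT INTO stage_customer
    SELECT customer_id, customer_name
      FROM proxy_customer;          -- proxy table pointing at the source system

    -- apply only the differences to the permanent warehouse table
    MERGE INTO dw_customer t
    USING stage_customer s
       ON t.customer_id = s.customer_id
     WHEN MATCHED THEN UPDATE SET customer_name = s.customer_name
     WHEN NOT MATCHED THEN INSERT ( customer_id, customer_name )
                           VALUES ( s.customer_id, s.customer_name );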
IMHO, it's a common approach to store remote data in a temporary table before it is applied to or joined with local data, simply because joins and the like on remote data are often very slow and inefficient. If that is the main reason you chose to always truncate and re-insert the remote data, I'd guess a "temporary local copy" would be more efficient.
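In case it helps to picture what "remote data" means here: with SQL Anywhere's remote data access, the remote tables are usually exposed as proxy tables, roughly like this (server class, DSN, and credentials are placeholders for a MS SQL Server source):

    -- define the remote server and a login to use for it
    CREATE SERVER src_mssql CLASS 'MSSODBC' USING 'DSN=SourceDSN';
    CREATE EXTERNLOGIN DBA TO src_mssql REMOTE LOGIN etl_user IDENTIFIED BY 'etl_pwd';

    -- proxy table: looks like a local table, but every query is sent to the
    -- remote server, which is why joining it directly against local tables can be slow
    CREATE EXISTING TABLE proxy_customer
    AT 'src_mssql;SourceDB;dbo;customer';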
In case the remote data is coming from different database systems, you might also consider the "data import from MS SQL Server" test series Breck published on his great blog a while ago:
Yes, I should have specified that some data is coming from different database systems. Thanks for your comments and suggestions. Off to find Breck's blog entry!
FWIW, in case comparing "old" and "new" data from remote systems is part of your ETL steps, the following questions might be helpful: