Overcoming major performance problems loading a large data warehouse

A major Australasian bank has a large data warehouse where they are building a customer-centric view of their corporate data. In the past banks have been limited by the product-based nature of their operational systems. Most banks are now moving towards having complete views of their customers' relationships with them as this greatly enhances marketing activities. Of course this involves extracting large data volumes as well as running very big data transformation jobs.

Initially the data warehouse load jobs worked well and load times were quite acceptable, in the two- to three-hour range. The load jobs took simple comma-delimited data files extracted from the mainframe-based operational systems and stored on disk servers shared by the mainframe and the warehouse data server. The load jobs were custom-built SQL-based stored procedures that performed cleaning and transformation tasks as well as loading.
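A minimal sketch of this style of load job, using Python with an in-memory SQLite database standing in for the warehouse and made-up account records standing in for the mainframe extract (the table names, fields, and data are all illustrative, not the bank's actual schema):

```python
import csv
import io
import sqlite3

# Hypothetical comma-delimited extract, as described in the case study
# (customer id, product, balance); the messy fields are deliberate.
extract = io.StringIO(
    "CUST001, savings ,1500.00\n"
    "CUST002,LOAN,-250.50\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (cust_id TEXT, product TEXT, balance TEXT)")
conn.execute("CREATE TABLE accounts (cust_id TEXT, product TEXT, balance REAL)")

# Bulk-insert the raw file into a staging table.
rows = list(csv.reader(extract))
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)

# One set-based SQL statement that cleans (trim, upper-case) and
# transforms (cast) while loading -- the style of job described above.
conn.execute(
    """
    INSERT INTO accounts (cust_id, product, balance)
    SELECT TRIM(cust_id), UPPER(TRIM(product)), CAST(balance AS REAL)
    FROM staging
    """
)
print(conn.execute("SELECT * FROM accounts ORDER BY cust_id").fetchall())
```

Cleaning, transformation, and loading all happen inside the database engine in one SQL statement, which is exactly the design whose limits the rest of the case study explores.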

Problem: Over time the data loading jobs expanded to cope with more types of data, right down to account transaction details, and loading times grew longer and longer until they became unacceptable.

Solution: A tuning exercise revealed that the load processes were using only a single processor at a time. By changing certain database parameters and multi-threading the slow load steps, the loads were parallelised across multiple processors, reducing run times back to around three hours. This bought breathing space but would be unlikely to cope with the ultimate projected data volumes.
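The core idea of that fix, partitioning a slow serial step so several workers process it at once, can be sketched in Python. This is an assumption-laden illustration, not the bank's actual code: the records, the `transform_partition` step, and the thread-based workers are all stand-ins for the real stored-procedure steps that were multi-threaded.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (customer_id, raw_amount) pairs; parsing the
# amount stands in for a slow per-record transformation step.
records = [(f"CUST{i:03d}", f"{i}.50") for i in range(1, 9)]

def transform_partition(partition):
    # The formerly single-threaded step, now run once per partition.
    return [(cust, float(raw)) for cust, raw in partition]

def parallel_load(records, workers=4):
    # Split the input into one partition per worker, then run the
    # partitions concurrently -- the same idea as turning a serial
    # load into a multi-processor job.
    partitions = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transform_partition, partitions))
    return [row for part in results for row in part]

loaded = parallel_load(records)
print(len(loaded))  # 8
```

The total work is unchanged; it is simply spread across workers, which is why this approach buys breathing space but cannot keep pace indefinitely with growing data volumes.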

Moral: While re-configuring the load processes was the most cost-effective solution, it was also recognized that a database is not the ideal environment in which to clean and transform data. Moreover, SQL, the language of databases, is not very effective or efficient for data cleaning and transformation, primarily because of its set-based logic. Tools that manipulate data record by record outside the database are usually far more efficient and offer better functionality.
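The record-by-record alternative the moral points to can be sketched as a streaming cleaner that processes the extract file before it ever reaches the database. The data, field layout, and cleaning rules here are hypothetical; the point is that per-record logic such as conditionally dropping rows is natural here, while it is awkward in set-based SQL.

```python
import csv
import io

def clean_records(lines):
    """Stream records one at a time, applying per-record cleaning rules
    outside the database, before loading."""
    for row in csv.reader(lines):
        cust_id, product, balance = (field.strip() for field in row)
        if not cust_id:  # drop rows with no customer key
            continue
        yield cust_id, product.upper(), float(balance)

# Hypothetical comma-delimited extract (illustrative data only).
extract = io.StringIO(
    "CUST001, savings ,1500.00\n"
    ",unknown,0.00\n"
    "CUST002,loan,-250.50\n"
)
print(list(clean_records(extract)))
```

Because the cleaner is a generator, it handles the file one record at a time regardless of its size, and only already-clean rows need to be loaded into the warehouse.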