Implementation of a Real-Time Data Warehouse Hampered by Immature Toolset

The opening of the New Zealand electricity industry to competition, with power being bought and sold just like shares on a stock market, brings its own IT challenges to a large state-owned enterprise. To participate in the power "share market", this company has to build an ultra-reliable data warehouse that collects real-time power pricing data 24 hours a day, 7 days a week. Data feeds check for updates every few minutes and collect the data as the trading exchange publishes it. The feeds run across the Internet: two completely independent encrypted data feeds operate in parallel, loading two separate data warehouses. The energy traders make their buying and selling decisions based on the data as soon as it is received.

Millions of dollars of trades are done daily, so both system reliability and having the trading data available as soon as possible are critical to making the deals of greatest benefit to the company.

Problem: The ETL (extraction, transformation, and loading) tool chosen by the customer initially caused endless problems. The extraction jobs would hang for no obvious reason and on a seemingly random basis, and restarting them required manual intervention. As my company was responsible for the after-hours support, the result was endless frustration and many sleepless nights trying to keep both parallel systems in operation! At the height of the problems, being called out five or six times a night was not uncommon. It was fortunate that we were able to dial in remotely to do support.

Solution: We painstakingly and exhaustively went through all the possibilities. Technical support only went so far, and after that it was simply a matter of trial and error. As we discovered, the tool was incredibly sensitive to not being installed quite right, not being configured quite right, not having internet connections with plenty of spare bandwidth, and, most importantly, not having plenty of processor and memory headroom. One by one we sorted out these issues, and gradually the system became more and more stable. We also built special checking jobs that would attempt to reset and restart the ETL jobs if they appeared to have stopped. By the time we finished, three or four after-hours callouts per month were typical. The result was one happy customer and uninterrupted sleep for the support team.
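
The checking jobs mentioned above are not described in any detail, so the following is only a minimal sketch of the general idea, not the code we actually ran. It assumes each ETL job records a timestamp whenever it makes progress, and that a job is treated as hung once that timestamp is older than a chosen timeout. The names JobStatus, is_stalled and watchdog_pass, the timeout value, and the restart callback are all hypothetical; in practice the callback would wrap whatever reset and restart mechanism the ETL tool provides.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class JobStatus:
    """Hypothetical record of one ETL job, as seen by the checking job."""
    name: str
    last_progress: float  # epoch seconds when the job last made progress


def is_stalled(job: JobStatus, now: float, timeout_s: float) -> bool:
    """Treat a job as hung if it has made no progress for longer than timeout_s."""
    return now - job.last_progress > timeout_s


def watchdog_pass(
    jobs: Iterable[JobStatus],
    now: float,
    timeout_s: float,
    restart: Callable[[str], None],
) -> List[str]:
    """Run one check over all jobs and restart those that look hung.

    `restart` is an assumption: a callback that resets and restarts the
    named job. Returns the names of the jobs that were restarted.
    """
    restarted = []
    for job in jobs:
        if is_stalled(job, now, timeout_s):
            restart(job.name)
            restarted.append(job.name)
    return restarted


if __name__ == "__main__":
    import time

    # Illustrative data only: feed_a last progressed 10 minutes ago,
    # feed_b progressed 10 seconds ago.
    now = time.time()
    jobs = [JobStatus("feed_a", now - 600), JobStatus("feed_b", now - 10)]
    watchdog_pass(jobs, now, timeout_s=300,
                  restart=lambda name: print("restarting", name))
```

A check like this would itself be scheduled every few minutes. The timeout has to be longer than the feed's normal polling interval; otherwise a healthy job that is simply waiting for the next update would be restarted needlessly.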

Moral: Immature, unreliable software tools are not normally showstoppers. With perseverance, systematically working through all potential issues, and working around idiosyncrasies and limitations, a satisfactory solution can usually be found. Finding that solution can take a lot of resources, a lot of IT experience and extra money. At the end of the day it is usually cheaper to stick with a tool whose limitations are known than to throw it out for another that may be more reliable but will undoubtedly have its own issues.