SSIS 2005 - Tales from the Trenches
Should you use SSIS 2005 for your precious ETL project?
Should you let your child keep an alligator as a pet?
SSIS 2005 is to ETL tools what Lotus Notes is to email clients - an unfriendly tool that has all the hallmarks of having been written by an outsourced take-away. Here follows a general approach for dealing with this pernicious piece of punk-ware, based on my hard-won experience.
Achtung! Warnings about Microsoft's SSIS 2005
Microsoft's SSIS 2005 is not a mature product (true at the time of writing in OCT 2006, and still true as of SEP 2007, OCT 2008 and FEB 2009) and does not 'do what it says on the tin' consistently. Your beautiful ETL system may work fine today, but run it tomorrow and nothing may work, for no discernible reason. It just does this, you see.
Do not believe that SSIS will save you development time or help you create less buggy programs. The belief that in order to use it, 'all one does is drag and drop these here clever boxes around with your mouse' is dangerously misplaced when you factor in the time required for debugging and dealing with inexplicable behaviour and crashes in your development and production environments.
The Sensible Approach
If you absolutely have to use SSIS - to impress your peers or your project sponsor, or to add brownie points to your CV - but deep down still feel that you are a true IT professional who is interested in delivering a working ETL solution, then restrict the use of SSIS to an absolute minimum.
Minimise the use of SSIS
Do as much of the work as possible using stored procedures, SQL, Perl and DOS command-shell scripts, and call these from Execute SQL and Execute Process tasks in SSIS. There, you have now used SSIS and kept your project sponsor happy, your vanity remains intact, and chances are that the damn thing will work.
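The 'thin SSIS, thick scripts' idea above can be sketched as a plain driver script that runs each external step in order and fails fast - exactly the kind of thing an Execute Process task would invoke. This is a minimal sketch; the step commands and names are illustrative, not from any real project.

```python
import subprocess
import sys

# Hypothetical steps: each stands in for an external script or SQL batch
# that does the real work. The commands here are illustrative only.
STEPS = [
    [sys.executable, "-c", "print('extract: staged rows')"],
    [sys.executable, "-c", "print('transform: cleaned rows')"],
    [sys.executable, "-c", "print('load: loaded rows')"],
]

def run_steps(steps):
    """Run each step in order, failing fast on the first non-zero exit."""
    outputs = []
    for step in steps:
        result = subprocess.run(step, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {step}: {result.stderr}")
        outputs.append(result.stdout)
    return outputs
```

Because each step is an ordinary process with an exit code, the failure behaviour is explicit and debuggable - no behind-the-scenes magic.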
Do not trust SSIS for important overnight processing
You have been warned. If you need lights-out reliability, then use other ETL tools. There are plenty about, some free (Perl) and some very expensive (Informatica). However, they are reliable.
Version Control and .dtsx Files
.dtsx files are the XML files in which SSIS stores its ETL packages. Even the smallest change to a package causes the XML to be completely rejigged. The XML may still be valid, but you cannot see your incremental changes with any sensible diff tool, nor can you merge your changes with those of fellow developers. From a configuration-control perspective, you have to follow the 'lock-modify-unlock' model instead of the collaboration-friendly 'copy-modify-merge' model. In other words, only one person can work on a package at a time.
Lamentation and Remedy: The work-around is to keep packages very small, consisting of only one data flow object. As a result, you cannot pipeline much data, since data can only be handed over from package to package via physical database tables, which may slow your ETL process down. Consider using flat files or raw files for intermediate storage - there may be performance benefits. With smaller packages, silent errors and silent crashes can be pinpointed more easily, and corrupted data can be restored manually.
Keep your Data Pipes Short
"Its eyes are bigger than its stomach"
Data pipes in SSIS 2005 take more than they can eat at the onset of package execution. Curiously, the server can run out of memory half-way down the data pipe, yet it continues to feed the greedy package with input data. The crashes that result are particularly nasty, since things quietly slow down to a silent halt. The remedy has all too frequently been to reboot the SSIS server.
Lamentation and Remedy: Again, use staging tables all over the place. Never mind the loss in performance - your top priority is reliability!
So, what was the state of the data in the midst of your ETL process before SSIS crashed?
Lamentation and Remedy: You will never know this unless you persist some form of physical staging at all critical junctures in your ETL process. Similarly, add reject tables for every data task that has an error output and state the reason for rejection with timestamp and package name. This is particularly important in lookup operations. Did I mention logging? Do loads of it, especially when you are running a lights-out process.
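The reject-table pattern above can be sketched in a few lines of SQL. This is a minimal sketch using SQLite as a stand-in for SQL Server; the table and column names (customer_stage, country_lookup, reject_customer) are illustrative assumptions, not SSIS objects.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database standing in for the real staging area.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_stage (id INTEGER, country_code TEXT)")
conn.execute("CREATE TABLE country_lookup (code TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE reject_customer (
    id INTEGER, country_code TEXT,
    reject_reason TEXT, rejected_at TEXT, package_name TEXT)""")

conn.executemany("INSERT INTO customer_stage VALUES (?, ?)",
                 [(1, "GB"), (2, "XX"), (3, "DE")])
conn.executemany("INSERT INTO country_lookup VALUES (?)", [("GB",), ("DE",)])

def route_lookup_failures(conn, package_name):
    """Move rows whose lookup fails into the reject table, recording
    the reason for rejection, a timestamp and the package name."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("""
        INSERT INTO reject_customer
        SELECT s.id, s.country_code, 'country_code not in lookup', ?, ?
        FROM customer_stage s
        LEFT JOIN country_lookup c ON c.code = s.country_code
        WHERE c.code IS NULL""", (now, package_name))
    conn.execute("""
        DELETE FROM customer_stage WHERE country_code NOT IN
        (SELECT code FROM country_lookup)""")
    conn.commit()

route_lookup_failures(conn, "LoadCustomers.dtsx")
```

When the overnight run goes wrong, the reject table tells you which rows failed, why, when, and in which package - instead of a silent black hole.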
Recovery from Screw-Up using Data Checkpoints
Frequently, when a package fails, the only recourse is to run the entire offending package again, which can be very wasteful, or downright impossible when you have lost your system's off-line window. This is another reason to keep packages small. It is possible to resume the execution of a package from where it broke, if you prepare for it by making your own data checkpoints: decide where your data checkpoints will be and multicast to Raw File destinations. On screw-up, temporarily modify your package to read the data back in through a Raw File source.
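The checkpoint idea generalises beyond SSIS: persist the output of the expensive upstream stage to a file, and on restart read the file back instead of recomputing. A minimal sketch, using a CSV file where SSIS would use a Raw File, with purely illustrative stage functions:

```python
import csv
import os
import tempfile

# Illustrative checkpoint location; an SSIS Raw File would play this role.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "after_transform.csv")

def extract_and_transform():
    # Stand-in for the expensive upstream part of the pipeline.
    return [{"id": str(i), "value": str(i * 10)} for i in range(5)]

def load(rows):
    # Stand-in for the downstream load step; returns rows loaded.
    return len(rows)

def run_with_checkpoint():
    """Resume from the checkpoint if one exists; otherwise run the
    upstream stages and write the checkpoint before loading."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, newline="") as f:
            rows = list(csv.DictReader(f))
    else:
        rows = extract_and_transform()
        with open(CHECKPOINT, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "value"])
            writer.writeheader()
            writer.writerows(rows)
    return load(rows)
```

If the load step dies, the next run skips straight to it - no need to repeat the extract and transform, and no lost off-line window.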
Litany of Problems and further Lamentations
Service Packs ad nauseam: In my last eight major SSIS-based projects (the count has grown with each revision of this article), I have contemplated how much easier each project might have been had I simply used first-principles methods in the ETL process instead of SSIS. I had vivid dreams of using unambiguous Perl scripts, free of SSIS's inexplicable behind-the-scenes magic. The first few times, I secretly hoped that the problems would disappear with the installation of a new Microsoft service pack or hot fix, but this never really fixed much, although it felt better at least.
Performance: I have run SSIS 2005 on an HP DL360, HP DL380s and an HP BL40 (complete with SANs), all hosted on Windows 2003 Server. These are generally quite capable servers. The biggest data loads were 17 million records per batch run - a trivial amount of data by today's standards. Disappointingly, a simple ETL process took well over 14 hours in the latter case.
No matter which environment I was dealing with, the same problems kept on appearing with SSIS 2005:
Spontaneous Package invalidation: Without having touched the SSIS package or the tables that the package references, a package can one day perform correctly and on the next day refuse to run because it has spontaneously invalidated itself.
Metadata invalidation: The smallest DDL change to a table can invalidate packages that reference the table. Why some packages and not others?
Performance problems: Simple table extracts can sometimes hog the server for no apparent reason.
Bad copying of SSIS objects: To save development work, we duplicate an SSIS object (package, ETL task, etc.) in Visual Studio (copy, then paste). Visual Studio, alas, forgets to assign a different GUID to the new object. Different objects with the same GUID can really confuse SSIS 2005! What is clearly missing in Visual Studio is a 'clone' function. (The remedy is simple but tedious: manually assign a new GUID after duplicating the object.)
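The manual GUID fix can be scripted if you are willing to edit the .dtsx XML directly. A minimal sketch: the regex below assumes the SSIS 2005 layout, where object IDs appear as DTSID properties of the form DTS:Name="DTSID">{GUID} - verify the pattern against your own files before relying on it.

```python
import re
import uuid

# Assumed SSIS 2005 .dtsx pattern for object IDs; check your files first.
DTSID_RE = re.compile(r'(DTS:Name="DTSID">)\{[0-9A-Fa-f-]+\}')

def regenerate_guids(dtsx_xml):
    """Replace every DTSID GUID in the XML text with a fresh one."""
    return DTSID_RE.sub(
        lambda m: m.group(1) + "{" + str(uuid.uuid4()).upper() + "}",
        dtsx_xml)
```

Run this over the pasted copy (never the original), and the duplicated objects stop sharing GUIDs with their source.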
Components hiding in the graphic editor: Can't determine why your SSIS package is doing something inexplicable? You may have a stray component hiding behind another. Yes, seriously, the editor allows you to do such silly things.
Hanging Visual Studio: It simply stops. Dead in its tracks. As someone who has used Visual Studio since version 1.0 (it was not much of an IDE back then, but it hung less frequently), I philosophically muse whether mankind in general is actually making any progress at all.
What, no UNDO? That's right, folks! Any changes you make in Visual Studio to an SSIS project cannot be undone.
If it runs in DEBUG mode, it should surely run in RELEASE mode. Ha! Not always so. No, really...
So you really think you should use SSIS 2005 for your ETL project?