
Blog Post: SSIS: Variable length fixed width flat files design pattern with no scripting

Recently, while doing a POC for a client on SSIS, I ran into some *interesting* data in flat files they were comparing. All of the files were fixed width, which should have been relatively straightforward, but the challenge was that none of the files had a delimiter (true old-school fixed-width stuff). What really made it interesting is that each row in the data file had a row type identifier, but after that identifier the data was all different: different columns, different data points, and different destinations. There were two files, each presenting a different scenario: one in which all of the rows were the same length, and one in which each row was a variable length. Additionally, I had a very limited amount of development time to build a package that processed and loaded both of these files. In this blog post I'd like to walk through the solution I devised, in the hope that it may help others who run into the same problem. I'm going to break each file out into a separate part, to give some segmentation.

AUTHOR NOTE: I did some digging around on Bing/Google about this as well, but could only find script components that did it. I haven't benchmarked against scripts, but scripts process the file line by line. My personal opinion is that this approach would probably perform a little better, since I'm keeping SSIS buffers and sets of data intact, but I haven't tested it. I'm very interested to hear others' thoughts on this, so please let me know what you think!

Problem 1 (Row lengths are the same, but different columns and data points)

Below is a simple sample file that I created on my machine to recreate the issue:

At first glance, the header and trailer look just like a normal flat file, right? The issue was that the client didn't want to trim them off altogether; they wanted to write the header and footer to different rows in the database to archive that information.
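(The original screenshot of the sample file isn't reproduced here. As a rough reconstruction of the shape described, every row is the same total width, the row type identifier occupies the first 10 characters, and all field values below are invented purely for illustration:)

```text
HEADER    20130101STORESALES
DATHEADER STORE DATE     AMT
DATAROW   0001  201301010250
DATAROW   0002  201301010475
TRAILER   201301010000000002
```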
With such a limited amount of time, I immediately had to rule out a script task: it would have taken me too long (I had less than a day), and the production size of the file was unknown; the production data sets could be anywhere from 10 to 10 million records. What's my scalability limit? Script task: out.

Then, staring at the flat file connection manager, it hit me. Since I did have one constant among every row, I could use that and a conditional split to derive the destination! I created an initial flat file source with two columns: one for the "Row Type Identifier", and one for the rest of the data (I called it Column 0; make sure to set its output width wide enough to hold the rest of the row):

Now that the source is configured, we want to trim the new RowIdentifier column to get rid of the empty characters. Add a derived column, and put in a trim expression, replacing the existing Row Identifier column:

Click OK, and then add a Conditional Split. Configure the conditional split to send the Row Identifier out to a separate output for each of the row types in your file. Using the sample above, we'd have an output for HEADER, TRAILER, DATHEADER, and DATAROW:

Now each piece of the file that falls under each identifier can be sent out in sets! Let's add a derived column component for each set, since now we need to take care of [Column 0]:

To do this, we're going to make heavy use of the SUBSTRING function. Starting with the header rows, my derived column component would look like this. Notice that I added a type cast to convert the date-sent stamp to an integer. Also notice that since we already took care of the row identifier in the original flat file source, the first remaining column technically starts at string position 1 of Column 0.
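As a sketch of the expressions involved (the exact column names and character positions from the screenshots aren't reproduced here, so the positions below are illustrative assumptions), the trim, the split conditions, and a couple of HEADER-output derived columns might look like this in the SSIS expression language, which uses 1-based SUBSTRING positions:

```text
Derived column (replace RowIdentifier):
    TRIM(RowIdentifier)

Conditional split conditions (one per output):
    RowIdentifier == "HEADER"
    RowIdentifier == "TRAILER"
    RowIdentifier == "DATHEADER"
    RowIdentifier == "DATAROW"

Derived columns on the HEADER output (positions are assumed):
    DateSent : (DT_I4)SUBSTRING([Column 0], 1, 8)
    FileName : TRIM(SUBSTRING([Column 0], 9, 20))
```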
Moving over to another output, the store sales detail output would look like:

After that, you can go about your normal transformations and destinations as usual. Or, if you prefer, you could write the split outputs to new staging files somewhere and create new data flows to move the data. (The destinations are red only because I didn't configure them.)

Problem 2 (Row lengths are different, different columns and data points, same file)

So that takes care of Problem 1, and since I avoided writing code, I'm feeling pretty good about myself at this point. Maybe this second file won't be so bad after all... And then I opened the file (note that I highlighted the rows so you can see they are all different lengths):

Whoa. My first reaction was to run to the closest window, jump, and hope for the best (fortunately, I always pack a parachute). But after I looked at it a little longer, I realized that just like in the first file, the very first value always identifies what kind of row it is. The trick is that unlike the first file, the first column isn't a fixed width; however, the very first character still identifies the row type, which isn't that dissimilar from Problem 1. So in the Flat File source, instead of pulling in the first 10 characters for our analysis, let's change the width to 1:

Since it's only one character, we don't really need to trim it here. After changing the conditional split to this file's "splitters" (h, r, d, or t), the derived columns and the rest of the ETL are exactly the same as in Problem 1 above. Rather than rewriting it all here, I'll just redirect you back to the top.

To conclude, there are a couple of things I really like about both of these approaches: a) no code; b) the data is still handled in sets, so I don't have to drop down to row-by-row processing; and c) it's much faster to develop!
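(For the second file, the only change to the conditional split is that the conditions test the single-character identifier; which row type each letter maps to isn't spelled out above, so the conditions are simply:)

```text
RowIdentifier == "h"
RowIdentifier == "r"
RowIdentifier == "d"
RowIdentifier == "t"
```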
