I got an assignment to develop an ETL Process which contains number of lookups to be made with different tables. So I created a dataflow and made the required lookups one by one and extracted the data as shown in the diagram.
Since those lookups are examining independent fields I thought why can’t we change the dataflow and do those lookups in a parallel way . So I changed my dataflow as shown in the figure. Here I add a Multicast to create multiple data streams and used Merge Join to concatenate those Parallel data streams. However Merge Join supports only for sorted data streams. In order to make the stream sorted I used pre-sort mechanism available at the Advanced editor of source file. Instead you can use Sort component prior to Merge Join. But best practice would be to avoid Sort within SSIS unless it is completely necessary and do presorting before it gets in to the pipeline; because SSIS allocates the memory space for the entire data stream in the sort transformation and creates a bottle neck in the overall flow.

Figure 2
Both works logically fine and gives the same output. But When I look at the processing time both has approximately similar value; actually more time is consumed in manually paralleled method. This wasn’t I expected . What would be the justification?
Is it because I created multiple copies of streams using Multicast?
Though Multicast visualizes that it creates copies of the data stream, at backstage what actually happens is that it enables those three Lookup to work on the same data buffer rather than creating copies which is good in Parallelism aspect.
The Problem is tools (Merge Join) which used to join those streams are less efficient and it requires sort transformations also. This adds more overhead compared to usual sequential Lookup operation and destroys the advantage of Parallel Lookup operations.
When we run the sequential Lookup method we see that the Lookups at the end of the flow are also performing with the completed data while upper Lookups are still on operation. This is because SSIS provides parallelism within the sequential flow.
So can’t we optimize the Lookups?
In order to optimize this Lookup operation what I suggest would be to analyze the entire operation and manage the Caching behavior of the flow.
No comments:
Post a Comment