Fileset Stages In Datastage
Joining of data is an activity that frequently happens in all data warehousing projects. At some point in your project you will definitely need to use such a functionality to get the result you need. Datastage has three different types of stages that can basically carry out the joining of two or more datasets. Choosing which stage to carry out your join depends entirely on the data that you will be processing. Typically if you have a huge amount of data coming in your reference and source links then its best to be using the join stage. The lookup stage just wont be able to handle the load and most likely wont work since it wont be able to hold that much data in the memory. If your data is suitable partitioned and sorted correctly then the join stage would run much faster than any of the other stages. If the data is sorted, the join stage is very fast since it never involves paging or other I/O operations. Ideally it is suggested that if the data you get in your links is more than a million then its best that you use the join stage rather than the lookup stage. The only advantage the other stages have over the join stage is that merge and lookup stages let you capture rejects on failed lookups. But when compared to performance I guess it’s not much of an added advantage.
The input data to the join stage has certain pre conditions. The data must be partitioned and sorted properly or else your join results may vary from what you expect it to be. The partition must be decided on the basis of the keys which you are using for the join. Like in all cases of partitions, the data must be divided into blocks in such a way that data with similar key values will be in the same partition. However if you have not partitioned or sorted the data, and if you have set the partition mode as ‘auto’, Datastage will insert the sort and partitions for you.
Similar to the join operations in any query, the data is identified as left and right blocks. You should specify your left and right links in the ‘Link Ordering’ section of the stage. This is a very step in configuring your join stage since it decides how your output looks like. The join stage allows you all 4 different type of join options. Full Outer, Inner, left outer and right outer joins.
The merge stage is similar to the join stage in certain aspects. Just like the join stage the input data has to be sorted and partitioned. However unlike the join and lookup stage, the merge stage gives the user the option of having multiple reject links. In merge the data is identified as Master and Update blocks. There should be a reject link for each update link. So if a match against a particular update link failed then the data would move to the reject link for that specific update link. Another pre-requisite for the merge stage is that the master data has to be free of duplicates. If in your job you are using more than one update set then you should ensure that even those update sets don’t have duplicates.
Gather proxy premium download pc. The lookup stage should only be used in low volume situations. The join stage, before it carries out its join, requires all the data to be loaded onto the memory first which is why it cant take up huge loads .The lookup stage allows you one reject link ( which is how it differs from the merge stage) and it does not require the data to be pre-sorted( which differentiates it from both merge and join). Data however does need to be partitioned correctly. The lookup stage also allows you to lookup against a lookup fileset (which is not possible with the other two stages). The lookup stage in the version 8 and upwards now give you the range lookup option ( which by the way I haven’t found much use off). In addition to this Datastage also gives you the option of in-memory lookups.
Datastage Copy Stage
An overview of parallel stages in DataStage: General, Debug and development, processing, file, database, real-time, restructure, data quality, sequences. ETL-Tools.Info Business Intelligence - Data warehousing - ETL.