Category Archives: Knowing SSIS

This category describes what SSIS is and what it is used for.

Multicast versus Conditional Split

Multicast and Conditional Split transformations are used quite often when developing an SSIS package. I have compiled some of the differences between the two transformations that might help in understanding them better:

| Multicast | Conditional Split |
| --- | --- |
| Replicates the source data to multiple outputs so that different logic can be applied to the same data | Distributes the source data across multiple outputs based on conditions |
| Directs each row to every output | Directs each row to a single output |
| Has no error output | Has one error output |
| Requires no expressions | Uses expressions to specify the conditions |
| Has no default output | Provides a default output; a row that matches no expression is directed there |
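As an illustration, the Conditional Split editor takes an ordered list of expressions, and the first one a row satisfies determines its output. A minimal sketch, with hypothetical column and output names:

    Order  Output Name   Condition
    1      EMEA Rows     [Region] == "EMEA"
    2      APAC Rows     [Region] == "APAC"
    Default output name: Other Rows

A Multicast, by contrast, needs no such configuration at all: every row is simply copied to every attached output path.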

If you have any constructive feedback, please write to me or post a comment.

Enjoy!

Performance optimizations in SSIS

Developing an ETL package is part of the challenge, but the real challenge lies in developing an optimized one. In this blog I will explain different performance optimization techniques that can be incorporated into ETL packages so that the ETL runs effectively and efficiently. I have collected this material by reading MSBI books, searching through different blogs and from my own experience with ETL.

Here we go:

  1. At connections level: By default, the network packet size property has a value of 0, which means 4 KB of data is transferred at a time. Change this value to 32767 to allow 32 KB of data to be transferred at a time.
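For an OLE DB connection manager this surfaces as the Packet Size keyword in the connection string; a sketch, where server, database and provider names are placeholders:

    Provider=SQLNCLI11;Data Source=MyServer;Initial Catalog=MyDW;Integrated Security=SSPI;Packet Size=32767;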


2. Using the ‘Fast Parse’ property: This option can be used for very fast loading of data from flat files. To enable it, perform the following steps:

a. Right-click the flat file source and click ‘Show Advanced Editor’.


b. Go to the ‘Input and Output Properties’ tab. Under Inputs and Outputs, expand the output columns and select the column for which you want to enable fast parse. Then go to Custom Properties and set the ‘FastParse’ property to ‘True’.


3. Avoid ‘Select *’: Instead of using ‘Select *’ while pulling data from a SQL Server table, use a column list and pull only the required columns. SSIS uses a buffer-oriented architecture to transfer data: data from the source first lands in buffers, the required transformations are performed and the data is then written to the destination. The size of a buffer row depends on the estimated row size, which is calculated from the maximum sizes of all the columns in a row. So the more columns a row has, the more space it occupies in the buffer, and the fewer rows fit into each buffer. Therefore, always pull only the required columns, not all of them.
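A minimal T-SQL illustration (table and column names are hypothetical):

    -- Avoid: every column travels through the pipeline buffers
    SELECT * FROM dbo.SalesOrderDetail;

    -- Prefer: only the columns the package actually needs
    SELECT SalesOrderID, OrderQty, UnitPrice
    FROM dbo.SalesOrderDetail;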

4. Avoid asynchronous transformations: As far as possible, try to avoid them. Asynchronous transformations such as Sort need to read and process all input rows before producing any output rows. As a result, the data transfer process slows down.
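One common workaround, sketched below with hypothetical names, is to push the sort into the source query and tell SSIS the data is already ordered (set IsSorted to True on the source output and SortKeyPosition on the key column in the Advanced Editor):

    -- Sort at the source instead of using the Sort transformation
    SELECT CustomerID, OrderDate, TotalDue
    FROM dbo.SalesOrderHeader
    ORDER BY CustomerID;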

5. Considering data types: Keep data types as narrow as possible, since narrower types are allocated less memory. Allocate only the required number of bytes to each column.
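A hypothetical staging-table sketch of what ‘narrow’ means in practice:

    -- Narrow types keep the estimated row size, and hence buffer usage, small
    CREATE TABLE dbo.StgProduct
    (
        ProductID   INT          NOT NULL,  -- not BIGINT
        ProductCode CHAR(10)     NOT NULL,  -- not NVARCHAR(MAX)
        ListPrice   DECIMAL(9,2) NOT NULL   -- not FLOAT
    );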

6. Indexes: If you are pulling a large volume of data, drop all indexes on the destination if they exist. After your load is complete, re-create the indexes.
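A minimal T-SQL sketch (index and table names are hypothetical):

    -- Before the load
    DROP INDEX IX_FactSales_DateKey ON dbo.FactSales;

    -- After the load completes
    CREATE INDEX IX_FactSales_DateKey ON dbo.FactSales (DateKey);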

7. OLE DB Destination settings (a sketch of the corresponding fast-load properties follows the list):


  1. Keep Identity: By default, this property is unchecked, which means source identity values are not copied to the destination, so the destination table generates its own values if it has an identity column. If you check this property, the data flow engine preserves the source identity values and passes them to the destination. If possible, keep it unchecked.
  2. Check Constraints: By default, this property is checked, which means the data flow pipeline engine validates the incoming data against the constraints of the target table. If you are sure that the incoming data will not violate the destination table’s constraints, uncheck this property; doing so improves the performance of the data load.
  3. Rows per batch: By default, the value of this property is -1 or empty, which means all incoming rows are treated as a single batch. Change it to a positive value that specifies the number of rows per batch.
  4. Insert commit size: By default, the value is ‘2147483647’, which means all the incoming data is committed in one shot after the DFT completes successfully. Specify a positive value to commit after that many rows. Changing the default puts some overhead on the data flow engine, which now commits several times, but it also relieves pressure on the transaction log and tempdb, especially during high-volume data transfers.
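Under the hood, these checkboxes map onto the destination’s fast-load custom properties, visible in the Advanced Editor. A sketch with assumed values:

    AccessMode                  : OpenRowset Using FastLoad
    FastLoadKeepIdentity        : False
    FastLoadMaxInsertCommitSize : 100000
    FastLoadOptions             : TABLOCK, ROWS_PER_BATCH = 100000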

8. Using the Delay Validation property: By default, this property is set to ‘False’. SSIS performs early validation, which validates the package and its components before package execution starts. If some tasks dynamically create objects or pass values that other tasks depend on, set this property to ‘True’ on those dependent tasks; otherwise your package will fail validation.
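A typical scenario, sketched with hypothetical task and table names:

    Execute SQL Task 'Create staging table'  -- creates dbo.StgLoad at run time
    Data Flow Task 'Load staging'            -- reads dbo.StgLoad
        DelayValidation = True               -- skip early validation: the table
                                             -- does not exist until run time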

9. Use Performance Counters to monitor SSIS performance

10. Using checkpoints: Please refer https://ssisblog.wordpress.com/?s=checkpoints

11. Parallel execution of tasks can also yield performance benefits (see the sketch below).
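The degree of parallelism is governed by the package-level MaxConcurrentExecutables property; a sketch:

    MaxConcurrentExecutables = -1   -- default: number of logical processors + 2
    MaxConcurrentExecutables = 8    -- allow up to 8 executables to run in parallel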

Note: These are general guidelines to optimize your ETL package. If you have any constructive feedback or any doubts, do post them here.

Enjoy! SSIS is fun..

Leveraging SSIS in HDInsight

SSIS can be leveraged in HDInsight projects as effectively as in other BI projects. Once the data resides on the Hadoop cluster, you will surely want to import it into a cube or data warehouse for further analytics. There are multiple ways to achieve this:

1. Using the Sqoop connector:

The Sqoop connector basically fires a command on the Hadoop command prompt and imports the data from HDFS/Hive into your SQL Server tables. To automate the process, you can wrap the Sqoop command in a batch file and execute that batch file from SSIS, as sketched below.
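A minimal sketch of such a batch file, assuming hypothetical server, credentials, table and HDFS path; from SSIS you would run it with an Execute Process Task:

    REM export_sales.bat - push HDFS data into a SQL Server table via Sqoop
    sqoop export ^
      --connect "jdbc:sqlserver://myserver:1433;databaseName=MyDW" ^
      --username etl_user --password ***** ^
      --table FactSales ^
      --export-dir /user/hadoop/sales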

2. Using a Hive data source in SSIS:

For this, create a Hive data source and then, in SSIS, use that data source to load the data into the destination.
Note: The second option is not available in preview versions of HDInsight.

SSIS Transformations – Blocking, Partially Blocking and Non blocking

The type of transformation we choose to achieve a particular task has a great impact on the performance of SSIS. Hence we should have a clear understanding of the different types of transformations available in SSIS and be careful when deciding among them.

Here I am going to mention a few points that help in better understanding the transformation types:

| Blocking | Partially/Semi-Blocking | Non-Blocking |
| --- | --- | --- |
| Needs to read all input rows before producing any output, which clearly consumes time | Need not read all input rows | Need not read all input rows |
| Creates a new buffer for the output | Creates a new buffer for the output | Reuses the existing buffer, which brings big performance benefits |
| May create additional records, i.e. the number of output records can differ from the number of input records; hence asynchronous in nature | The number of output records may or may not match the number of input records; also asynchronous in nature | The number of output records equals the number of input records; synchronous in nature |
| A new thread is introduced in the data flow | A new thread might be introduced in the data flow | No additional thread is created |
| Impacts available resources heavily | Impacts available resources less than blocking transformations | Has the minimum impact on available resources of all the transformation types |
| Ex: Sort, Aggregate | Ex: Merge, Merge Join | Ex: Data Conversion, Derived Column, Lookup |

 

Enjoy! Happy Coding..

Checkpoints in SSIS – Implementing restart-ability

Suppose an SSIS package contains around 30-40 tasks/containers and the ETL takes approximately 18 hours to complete. This means the reports developed on the data mart can be refreshed only after 18 hours. Now assume that in the 10th hour something goes wrong and the ETL fails. What would you do? Start the package from the beginning? Obviously not, since stakeholders are waiting for the reports with refreshed data.

What you should do is implement restart-ability in your package. You can implement restart-ability either by using SSIS features or with custom code. In this blog I will discuss the SSIS feature for implementing restart-ability.

SSIS uses checkpoints to restart a package from the point of failure instead of from the beginning. SSIS captures the restart point in the checkpoint file along with the values of variables, details of successfully completed tasks and the unique identifier of the package. With the help of this identifier, SSIS picks up the correct checkpoint file when the package restarts; if the package identifier does not match the identifier stored in the checkpoint file, the restart fails. After you implement a fix in your code and restart the package, SSIS uses the checkpoint file to start execution from the point of failure, skipping all tasks that executed successfully earlier. Once the package succeeds, the checkpoint file is removed from the system.

When a package restarts from the point of failure:

  1. Committed transactions may run again
  2. Loop containers will always run again, even if the tasks inside them completed successfully

Note: For the scenarios mentioned above, we have to go for Custom Checkpoints

Checkpoints and Data Integrity:

When an SSIS package uses a checkpoint file, it tries to acquire a lock on that file. If another instance of the same package is running and accesses the same checkpoint file, it fails, since the file is already locked by the running package. This ensures that no two instances of the same package can run simultaneously, so there are no data integrity issues.

Implementing Checkpoints:

To implement checkpoints, we need to configure only four properties (a sketch of typical values follows the lists below):

  1. CheckpointFileName – specify the name of the checkpoint file
  2. CheckpointUsage – choose one of three values: Never, Always or IfExists
  3. SaveCheckpoints – set this property to True so the package can restart from the point of failure
  4. FailPackageOnFailure – set this to True for every task/container used in the package

CheckpointUsage property values:

  1. Never – Do not use checkpoints
  2. Always – always use the checkpoint file; if the file does not exist, the package fails
  3. IfExists – the package uses the checkpoint file if it exists; otherwise it creates one
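Putting it together, a sketch of typical values (the file path is hypothetical):

    CheckpointFileName = C:\SSIS\Checkpoints\LoadDW.xml   -- package property
    CheckpointUsage    = IfExists                         -- package property
    SaveCheckpoints    = True                             -- package property
    FailPackageOnFailure = True                           -- on each task/container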

Transactions in SSIS

Transactions in SSIS can be implemented at any level: package level, task level, container level and so on.

SSIS transactions (the TransactionOption property) support three values:

Supported: the SSIS object can only join an existing transaction; it cannot initiate one

Not Supported: the SSIS object neither initiates a transaction nor joins one

Required: the SSIS object initiates a transaction (or joins the parent’s transaction if one already exists)

Now that we know the meaning of the transaction values, let’s go a step further and understand transaction behavior in a parent-child relationship:

Scenario 1: Transaction property set to ‘Required’ at parent level

Suppose I set the transaction property to ‘Required’ at the package level. This means the package initiates the transaction, and all its children, whether sequence containers or tasks, join this transaction. If any task fails, everything is rolled back (see the sketch after Scenario 3).

Scenario 2: Transaction property set to ‘Supported’ at parent level

If the transaction property is set to ‘Supported’ at the parent level, the parent does not initiate a transaction. If any task fails, only the work of the failed task is rolled back; the rest of the tasks remain committed.

Scenario 3: Transaction property not configured at parent level but is set to ‘Required’ at child level

In this scenario, the child will initiate the transaction.
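A sketch of Scenario 1, with hypothetical task names (note that SSIS distributed transactions require the MSDTC service to be running):

    Package (TransactionOption = Required)     -- initiates the transaction
      Sequence Container (Supported)           -- joins it
        Execute SQL Task 'Stage' (Supported)   -- joins it
        Data Flow Task 'Load' (Supported)      -- joins it; if 'Load' fails,
                                               -- the work of 'Stage' rolls back too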

 

Playing with SSIS is fun

SSIS Package Design Concepts – Control Flow and Data Flow

Let’s explore the two fundamental concepts related to package design:

  • Control Flow
  • Data Flow

Control flow:

As the name suggests, ‘Control Flow’ means controlling the flow of execution of an SSIS package. While designing an ETL package, we need to think about how one container connects to another and how related tasks are linked. To connect the different components of SSIS, we need to decide whether tasks should run sequentially or in parallel. This overall design depicts the flow of execution and is implemented in the ‘Control Flow’ component.

A control flow can consist of multiple tasks such as DFTs, Sequence Containers, Execute SQL Tasks and so on. These tasks are linked with the help of ‘Precedence Constraints’ (explained later). The following screenshot shows the location of ‘Control Flow’ in the SSIS designer:

[Screenshot: Control Flow tab in the SSIS designer]

The screenshot below shows an example of what the flow looks like:

[Screenshot: example flow of execution]

Note:

  1. The two sequence containers will run in parallel with each other
  2. Within a sequence container, the tasks run in sequence. Ex: the first sequence container contains 3 tasks – ‘Execute SQL Task’, ‘DFT task’ and ‘For Each Loop Container’. First the ‘Execute SQL Task’ executes, then the ‘DFT task’ and then the ‘For Each Loop Container’
  3. ‘Send Mail Task’ will execute when both the containers execute successfully

Data Flow Task (DFT):

The DFT is the heart of SSIS; most ETL operations are performed in it. As the name suggests, this task is used to transfer data from a source to a destination. While transferring the data, operations like Merge, Merge Join, Data Conversion, etc. can be performed. The whole process involves Extracting the data from the source, Transforming the data into the form the destination expects and then Loading the data into the destination, which is nothing but ETL. However, the ‘Transformation’ part is optional.

A DFT has to be part of the control flow. To add a DFT, drag the Data Flow Task from the SSIS toolbox onto the designer. The screenshot below shows a DFT in the control flow:

[Screenshot: a DFT in the Control Flow]

On double-clicking the DFT, the Data Flow tab opens, where you can add a source, a destination and any transformations if required. The screenshot below adds one source, one transformation and one destination:

[Screenshot: a source, a transformation and a destination in the Data Flow]

Precedence Constraint:

Now that we have different tasks in the control flow and DFTs, the question is: how do we create the flow between these tasks? Precedence Constraints serve this purpose. They connect different tasks/containers and direct the flow of execution.

Precedence Constraint can be of 3 types:

  • Success – indicated by Green arrow
  • Failure – indicated by Red arrow
  • Completion – indicated by Blue arrow

The screenshot below shows the different Precedence Constraints:

[Screenshot: Success, Failure and Completion precedence constraints]

How to create a Precedence Constraint?

Whenever you drag a task from the SSIS toolbox onto the designer, you will notice a green arrow coming out of the task. To create a Precedence Constraint, connect this arrow to another task/container.

On double-clicking the constraint, the following dialog box opens:

[Screenshot: Precedence Constraint Editor]

You can do the following things with a Precedence Constraint (an example follows the list):

  • Write Constraints
  • Write Expressions
  • For multiple constraints you can choose Logical And/ Logical Or
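For example, to let the next task run only when the previous one succeeded and a row-count variable is positive, the constraint could be configured as below (the variable name is hypothetical):

    Evaluation operation : Expression and Constraint
    Value                : Success
    Expression           : @[User::RowCount] > 0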