Performance optimizations in SSIS

Developing an ETL package is part of the challenge, but the real challenge lies in developing an optimized one. In this blog I will explain different performance optimization techniques that can be incorporated into ETL packages so that they run effectively and efficiently. I have collected this material from MSBI books, from various blogs, and from my own experience with ETL.

Here we go..

1. At the connections level: By default, the network packet size property has the value 0, which means the SQL Server default of 4 KB of data is transferred at a time. Change this value to 32767 (the maximum) so that roughly 32 KB of data is transferred in each packet.

[Screenshot: connection manager properties showing the PacketSize setting]

2. Using the 'Fast Parse' property: This option can be used for very fast loading of data from flat files. To enable it, perform the following steps:

a. Right-click the flat file source and click 'Show Advanced Editor':

[Screenshot: 'Show Advanced Editor' option on the flat file source]

b. Go to the 'Input and Output Properties' tab. Under the outputs, expand the output columns and select the column for which you want to enable fast parse. After selecting the column, go to Custom Properties and set the 'FastParse' property to 'True' as shown:

[Screenshot: FastParse property under Custom Properties in the Advanced Editor]

3. Avoid 'SELECT *': Instead of using 'SELECT *' while pulling data from a SQL Server table, list the columns and pull only the ones you need. SSIS uses a buffer-oriented architecture to transfer data: data from the source first lands in buffers, the required transformations are performed, and the data is then written to the destination. The size of a buffer depends on the estimated row size, which is calculated as the maximum size of all the columns in a row. The more columns in a row, the more space each row occupies in a buffer, and the fewer rows fit into it. Therefore, always pull only the required columns, as in the sketch below.
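For example, a minimal sketch (the table and column names here are made up for illustration):

-- Pull only the three columns the package actually needs,
-- keeping the estimated row size (and hence buffer usage) small
SELECT CustomerID, OrderDate, OrderAmount
FROM dbo.FactOrders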

4. Avoid asynchronous transformations: As far as possible, try to avoid them. Asynchronous transformations such as Sort need to read and process all input rows before producing any output rows, which slows down the data transfer. Often the work can be pushed to the source instead, as shown below.
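For instance, instead of a Sort transformation you can often sort in the source query and then tell SSIS the data is already ordered (set IsSorted to True on the source output and SortKeyPosition on the sort column in the Advanced Editor). A sketch, reusing the tblCountry table from later in this blog:

-- Sorting at the source replaces the asynchronous Sort transformation
SELECT CountryName
FROM [dbo].[tblCountry]
ORDER BY CountryName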

5. Consider data types: Keep data types as narrow as possible, since narrower types are allocated less memory. Allocate only the required number of bytes to each column.

6. Indexes: If you are pulling a large volume of data, drop the indexes on the destination table if they exist, and re-create them after the load completes, for example as sketched below.
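A minimal sketch (the index and table names are hypothetical):

-- Drop the index before the bulk load...
DROP INDEX IX_FactOrders_OrderDate ON dbo.FactOrders

-- ...run the data flow...

-- ...then re-create the index once the load completes
CREATE NONCLUSTERED INDEX IX_FactOrders_OrderDate ON dbo.FactOrders (OrderDate)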

7. OLE DB Destination settings:

[Screenshot: OLE DB Destination Editor settings]

a. Keep Identity: By default, this property is unchecked, which means source identity values are not copied to the destination; if the destination table has an identity column, it generates its own values. If you check this property, the data flow engine preserves the source identity values and passes them to the destination. If possible, keep it unchecked.
b. Check Constraints: By default, this property is checked, which means the data flow pipeline engine validates the incoming data against the constraints of the target table. If you are sure that the incoming data will not violate those constraints, uncheck this property; it will improve the performance of the data load.
c. Rows per batch: By default, the value of this property is -1 (empty), which means all incoming rows are treated as a single batch. Change it to a positive value to specify the number of rows per batch.
d. Insert commit size: By default, the value is 2147483647 (as you can see in the screenshot above), which means all the incoming data is committed in one shot after the DFT completes successfully. Specify a positive value to commit after that many records instead. This puts some overhead on the data flow engine, which now commits several times, but it relieves pressure on the transaction log and tempdb, especially during high-volume data transfers; the same trade-off exists in plain T-SQL bulk loads, as sketched after this list.
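For comparison, batching commits in a plain T-SQL bulk load looks like this (the file and table names are hypothetical):

-- Commits every 10,000 rows instead of the whole load in one transaction
BULK INSERT dbo.FactOrders
FROM 'F:\loads\orders.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 10000)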

8. Using the DelayValidation property: By default, this property is set to 'False'. SSIS performs early validation, which validates the package and its components before execution starts. If some tasks depend on values that other tasks supply dynamically at run time, set this property to 'True' on those tasks; otherwise your package will fail validation.

9. Use performance counters to monitor SSIS performance, for example the 'SQLServer:SSIS Pipeline' counters such as 'Rows read', 'Rows written', and 'Buffers spooled'.

10. Using checkpoints: Please refer https://ssisblog.wordpress.com/?s=checkpoints

11. Parallel execution of tasks can also bring performance benefits. The MaxConcurrentExecutables package property controls how many tasks may run in parallel; its default value of -1 means the number of processors plus two.

Note: These are general guidelines for optimizing your ETL packages. If you have any constructive feedback or any doubts, do post them here.

Enjoy! SSIS is fun..


Using Script Task

While developing an ETL solution, the developer may be required to write some .NET code to perform functions/operations that are not supported by the built-in tasks and components. These operations may include:

  • Interacting with Active Directory to fetch user information
  • Interacting with the file system
  • Creating performance counters

To perform all the functions mentioned above, we need an editor where we can write our .NET code. The Script task gives us exactly that: we can write .NET code and read/write SSIS variables (both system and user-defined). SQL Server 2012 introduced the concept of 'parameters'; these parameters can also be accessed within the Script task, just like variables.

In the following example, I want to check whether a text file exists in a directory. There are many other ways to perform this task, but here I will make use of the Script task. I will also show you how to access SSIS parameters/variables inside it.

The flow of execution goes like this: I pass the directory path to the Script task via a parameter. The Script task reads that path and checks whether a text file exists there. If a file exists, the Script task returns its full path (including the file name) to SSIS; otherwise it returns '0'. This return value is captured in a variable for further processing if required.

Let’s begin with the example:

1. Go to the Parameters tab in SSIS and create a parameter as:

[Screenshot: Parameters tab with the new parameter]

This parameter contains the directory path where my text file is located.

2. Create a variable that will hold the value returned by the Script task:

[Screenshot: Variables window with the new variable]

3. Drag the Script task from the SSIS toolbox to the Control Flow window:

[Screenshot: Script Task in the SSIS toolbox]

The control flow will look like this:

[Screenshot: control flow with the Script Task]

4. Double-click the Script task to open the Script Task Editor:

[Screenshot: Script Task Editor]

ReadOnlyVariables: variables/parameters that the Script task can read but cannot modify. Put the variables that the Script task should not change in this section.

ReadWriteVariables: variables that the Script task can both read and modify, returning the new values to SSIS.

5. Click the ellipsis next to 'ReadOnlyVariables':

[Screenshot: ellipsis next to ReadOnlyVariables]

Next, select the parameter that we want to pass to the Script task by checking its box:

[Screenshot: selecting the parameter for ReadOnlyVariables]

Similarly, click the ellipsis next to 'ReadWriteVariables' and select the variable:

[Screenshot: selecting the variable for ReadWriteVariables]

After performing the above steps, the Script Task Editor will look like:

[Screenshot: Script Task Editor after configuration]

Click 'Edit Script...' as shown in the above screenshot to open the .cs file.

6. Once the .cs file opens up, type the following code in the Main method (the file needs 'using System.IO;' and 'using System.Linq;' at the top):

string filepath = Dts.Variables["$Package::param_FilePath"].Value.ToString();

// Grab the first .txt file in the directory, or null if there is none
string firstTxtFile = Directory.GetFiles(filepath, "*.txt").FirstOrDefault();

if (String.IsNullOrEmpty(firstTxtFile))
{
    Dts.Variables["User::var_ReturnValue"].Value = "0";
}
else
{
    // Return the full path of the file, including the file name
    Dts.Variables["User::var_ReturnValue"].Value = firstTxtFile;
}

Dts.TaskResult = (int)ScriptResults.Success;

Note the use of parameters and variables in the Script task. Using 'Dts.Variables', we can read/write SSIS variables/parameters. Also note that variable/parameter names are case sensitive, so be careful when typing them.

7. Now save the file and build the script. After the build succeeds, close the file. You will land back on the Script Task Editor. Click OK and the job is done.

8. Since you now have the Script task output in the variable 'var_ReturnValue', you can decide how to use it as per your requirements.

Happy Coding!!

Shortest code to get comma separated values from a column

Getting the results of a column in comma-separated format is often required when writing a stored procedure. The comma-separated values can then be used, via dynamic SQL, in a SQL 'IN' clause to filter results. The example below shows how to get the values of a column in comma-separated format.

Suppose you have a table named 'tblCountry' with the following data:

[Screenshot: tblCountry sample data]

Now you want your countries in comma-separated format, like India,Australia,US,UK.

The following snippet gives the desired result:

DECLARE @SQL nvarchar(MAX)
SELECT @SQL = COALESCE(@SQL + ',', '') + CountryName FROM [dbo].[tblCountry]
SELECT @SQL

The result is a string of characters that can be used further in the code.
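On SQL Server 2017 and later, the same result is available declaratively with the built-in STRING_AGG function:

-- Equivalent one-liner on SQL Server 2017+
SELECT STRING_AGG(CountryName, ',') FROM [dbo].[tblCountry]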

Enjoy! Happy Coding!!

Leveraging SSIS in HDInsight

SSIS can be leveraged in HDInsight projects as effectively as in other BI projects. Once the data resides on the Hadoop cluster, you will surely want to import it into a cube or data warehouse for further analytics. There are multiple ways to achieve this:

1. Using the Sqoop connector:

The Sqoop connector basically fires a command on the Hadoop command prompt and imports the data from HDFS/Hive into your SQL Server tables. To automate the process, you can wrap the Sqoop command in a batch file and execute that batch file from SSIS, as sketched below.
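A minimal sketch of such a command (the server, database, credentials, table, and HDFS path are all placeholders):

sqoop export --connect "jdbc:sqlserver://<server>;databaseName=<db>" --username <user> --password <pass> --table <table> --export-dir <hdfs-path>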

2. Using a Hive data source in SSIS:

For this, one has to create a Hive data source and then use it in SSIS to move the data to the destination.
Note: The 2nd option is not available in preview versions of HDInsight.

Using Data Profiling Task

Sometimes you may need to review a data source before you start processing data from it: for example, checking columns for NULL values, counting distinct rows, and so on. This initial analysis may help developers build their logic around the actual data. The Data Profiling Task in SSIS can be used to perform such an analysis (for a quick manual alternative, see the sketch below).
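If you only need a quick look at a single column before reaching for the task, a throwaway T-SQL query gives similar information; here is a sketch reusing the tblCountry table from earlier in this blog:

-- Manual mini-profile: row count, distinct values, NULL count
SELECT COUNT(*) AS TotalRows,
COUNT(DISTINCT CountryName) AS DistinctValues,
SUM(CASE WHEN CountryName IS NULL THEN 1 ELSE 0 END) AS NullValues
FROM [dbo].[tblCountry]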

To use the Data Profiling Task, drag it from the SSIS toolbox to the control flow. Double-click the task and the following window will be displayed:

[Screenshot: Data Profiling Task Editor]

Click the drop-down, create a new file connection as shown below, and click OK:

[Screenshot: creating a new file connection]

Now that the destination connection is set, we will proceed to set the requests.

Note: Requests indicate the different ways in which we want to analyze the source.

Click the 'Quick Profile' button shown in the image below:

[Screenshot: Quick Profile button]

Configure the settings as shown in the figure below:

[Screenshot: Quick Profile settings]

Now click OK and a dialog box will open showing the requests selected for analyzing the data. Refer to the figure below:

[Screenshot: selected profile requests]

Now run the task. Once execution completes, open the Data Profile Viewer from Windows, or double-click the Data Profiling Task and open the Data Profile Viewer as shown:

[Screenshot: opening the Data Profile Viewer]

After clicking the button, the following window will open. Click Open as shown in the screenshot below:

[Screenshot: Open button in the Data Profile Viewer]

Browse to the location of the XML file as specified in the destination of the Data Profiling Task. Click OK and the following window will open:

[Screenshot: Data Profile Viewer showing the profile output]

Now select any of the options pointed out in the figure above to fetch the source information.

Like the page http://www.facebook.com/pages/SSISBlog/200678990060457  if the blog helps you in any aspect

Happy Coding!! Cheers!

Using File System Task

Sometimes we need to copy files from one drive to another. One approach is to copy and paste manually, but since that is not automated, you will waste time whenever the same task comes around again. The other approach is to use a File System Task and develop a small ETL package to perform the file transfer.

Let's proceed with a simple example. Create and populate a table in SQL Server using the following script:

-- Create the table
CREATE TABLE [FileInfoTable]
(
[Source] varchar(50),
[FileName] varchar(20),
[Destination] varchar(50)
)

-- Populate the table
INSERT INTO [FileInfoTable]
VALUES
('F:\S66\SOURCE','DG.txt','F:\D67'),
('F:\S66\SOURCE','HR.txt','F:\D67'),
('F:\S66\SOURCE','AB.txt','F:\D67'),
('F:\S66\SOURCE','DG.txt','F:\D68'),
('F:\S66\SOURCE','HR.txt','F:\D68'),
('F:\S66\SOURCE','AB.txt','F:\D68'),
('F:\S67\SOURCE','DG.txt','F:\D69'),
('F:\S67\SOURCE','HR.txt','F:\D69'),
('F:\S67\SOURCE','AB.txt','F:\D69')

Create two folders S66 and S67 in the F: drive and create some text files in them, say DG.txt, HR.txt, and AB.txt. This is just an example; you can create the folders in any drive and put different files in them. Be sure to populate the table with the proper file paths.

Similarly, create the destination folders where you want to put these files. In my case, I have created three folders D67, D68, and D69 under the F: drive.

Building SSIS package:

Below is the screenshot of SSIS package:

[Screenshot: SSIS package control flow]

'GetFileInfoTable' fetches the records from the table created previously, stores the result set in an Object variable, and provides it to the For Each Loop container in the format below (see the query sketch after the table):

SourceName Destination
F:\S66\SOURCE\DG.txt F:\D67
F:\S66\SOURCE\HR.txt F:\D67
F:\S66\SOURCE\AB.txt F:\D67
F:\S66\SOURCE\DG.txt F:\D68
F:\S66\SOURCE\HR.txt F:\D68
F:\S66\SOURCE\AB.txt F:\D68
F:\S67\SOURCE\DG.txt F:\D69
F:\S67\SOURCE\HR.txt F:\D69
F:\S67\SOURCE\AB.txt F:\D69
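The query inside 'GetFileInfoTable' is not shown in the screenshot; a minimal sketch that produces these two columns from the table created above is:

-- Concatenate the folder and file name into a full source path
SELECT [Source] + '\' + [FileName] AS SourceName,
[Destination] AS DestinationName
FROM [FileInfoTable]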

One by one, these records are fetched by the For Each Loop (into the two SSIS variables SourceName and DestinationName) and passed to the File System Task.

Note: For passing an Object and configuring the For Each Loop container, please see https://ssisblog.wordpress.com/2013/03/06/working-with-loops-in-ssis/

Double-click the File System Task and set the properties as:

[Screenshot: File System Task Editor properties]

Run the package and check the destination. The files will be copied to the destination.

 

Leave a comment if it is helping you in any aspect.

Cheers! Happy Coding!!

Code to divide a column's data into ranges

The snippet below splits a numeric key column into evenly sized ranges so that a large table can be pulled in chunks (for example from an SSIS Execute SQL Task, which is why the '?' parameter marker appears):

DECLARE @NoOfIterations int, -- indicates in how many chunks you want to pull the data
@RangeAdd int,
@ColSIDStart int,
@ColSIDEnd int,
@Counter int,
@MaxColSID int,
@MinColSID int

DECLARE @ColSIDRANGETBL TABLE
(
Iteration int,
ColSIDStart int,
ColSIDEnd int
)

SET @NoOfIterations = ? -- SSIS parameter marker; use a literal value when running the script standalone
SET @Counter = 1

-- Boundaries of the key column
SELECT @MinColSID = MIN(ColSID) FROM [dbo].[Table] (NOLOCK)
SELECT @MaxColSID = MAX(ColSID) FROM [dbo].[Table] (NOLOCK)

-- Width of each range
SET @RangeAdd = (@MaxColSID - @MinColSID)/@NoOfIterations

SET @ColSIDStart = @MinColSID
SET @ColSIDEnd = @ColSIDStart + @RangeAdd

WHILE ( (@Counter <= @NoOfIterations) AND (@ColSIDStart <= @ColSIDEnd) )
BEGIN
INSERT INTO @ColSIDRANGETBL
SELECT @Counter, @ColSIDStart, @ColSIDEnd

-- Move on to the next range
SET @ColSIDStart = @ColSIDEnd + 1
SET @ColSIDEnd = @ColSIDStart + @RangeAdd

-- Clamp the last range to the maximum key value
IF (@ColSIDEnd > @MaxColSID)
BEGIN
SET @ColSIDEnd = @MaxColSID
END

SET @Counter = @Counter + 1
END

SELECT ColSIDStart, ColSIDEnd FROM @ColSIDRANGETBL
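Each range can then drive a chunked extract inside a loop; a hypothetical consuming query (reusing the same placeholder table and key column) would look like:

-- Pull one chunk at a time using the range boundaries
-- (list only the columns you need, as advised earlier in this blog)
SELECT ColSID
FROM [dbo].[Table]
WHERE ColSID BETWEEN @ColSIDStart AND @ColSIDEnd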