
Performance optimizations in SSIS

Developing an ETL package is a challenge in itself, but the real challenge lies in developing an optimized one. In this blog I will explain different performance optimization techniques that can be incorporated into ETL packages so that they run effectively and efficiently. I have collected this material from MSBI books, various blogs, and my own experience with ETL.

Here we go:

1. At the connection level: By default, the network packet size property of the connection manager is 0, which means the SQL Server default of 4 KB of data is transferred at a time. Change this value to 32767 so that 32 KB of data is transferred at a time.

[Screenshot: OLE DB connection manager properties – network packet size]
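For reference, a minimal sketch of how this setting appears in an OLE DB connection string; the provider, server, and database names below are placeholders, not taken from an actual package:

Provider=SQLNCLI11;Data Source=MyServer;Initial Catalog=MyDatabase;Integrated Security=SSPI;Packet Size=32767;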

2. Using the ‘Fast Parse’ property: This option can be used for very fast loading of data from flat files; it uses simpler, locale-insensitive parsing routines for date, time and integer columns. To enable this option, perform the following steps:

a. Right-click the flat file source and click ‘Show Advanced Editor’:

[Screenshot: Flat File Source – Show Advanced Editor]

b. Go to the ‘Input and Output Properties’ tab. In the Inputs and Outputs tree, expand the output columns and select the column for which you want to enable fast parse. After selecting the column, go to Custom Properties and set the ‘FastParse’ property to ‘True’ as shown:

[Screenshot: Advanced Editor – FastParse property set to True]

3. Avoid ‘SELECT *’: Instead of using ‘SELECT *’ while pulling data from a SQL Server table, list only the columns you actually need. SSIS uses a buffer-oriented architecture to transfer data: rows from the source first land in buffers, the required transformations are performed, and the data is then written to the destination. The size of a buffer row depends on the estimated row size, which is the sum of the maximum sizes of all the columns in the row. The more columns in a row, the more space each row occupies in the buffer, and the fewer rows fit into each buffer. Therefore, always pull only the required columns, not all of them.
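As a quick sketch (the table and column names are hypothetical), prefer an explicit column list in the OLE DB Source query:

-- Avoid:
-- SELECT * FROM dbo.Sales
-- Prefer only the columns the package actually needs:
SELECT SaleID, SaleDate, Amount
FROM dbo.Sales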

4. Avoid asynchronous transformations: As far as possible, try to avoid them. Asynchronous transformations such as Sort need to read and process all input rows before producing any output rows. As a result, the data transfer process slows down.
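One common alternative (a sketch with hypothetical table and column names): push the sort down to the source query and tell SSIS the data is already sorted, instead of using the Sort transformation:

SELECT CustomerID, CustomerName
FROM dbo.Customers
ORDER BY CustomerID
-- Then, in the Advanced Editor of the source, set IsSorted = True on the output
-- and SortKeyPosition = 1 on CustomerID, so downstream components (e.g. Merge Join)
-- know the input is already sorted.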

5. Consider data types: Keep data types as narrow as possible so that less memory is allocated in the buffers. Allocate only the required number of bytes to each column.
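For example (hypothetical table and column), if you know the data fits in a smaller type, cast it down in the source query so each buffer row stays small:

-- Comments is declared as varchar(max) in the source table, but the values are short:
SELECT FeedbackID, CAST(Comments AS varchar(200)) AS Comments
FROM dbo.Feedback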

6. Indexes: If you are loading a large volume of data into a table, drop its indexes (if any exist) before the load. After the load is complete, re-create the indexes.
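A minimal sketch of the pattern, with hypothetical table and index names (typically done in Execute SQL Tasks placed before and after the data flow):

-- Before the load:
DROP INDEX IX_Sales_SaleDate ON dbo.Sales;
-- ... run the data flow ...
-- After the load:
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate ON dbo.Sales (SaleDate);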

7. OLE DB Destination settings:

[Screenshot: OLE DB Destination Editor]

  1. Keep Identity: By default, this property is unchecked, which means source identity values are not copied to the destination; if the destination table has an identity column, it generates its own values. If you check this property, the data flow engine preserves the source identity values and passes them to the destination. If possible, keep it unchecked.
  2. Check Constraints: By default, this property is checked, which means the data flow pipeline engine validates incoming data against the constraints of the target table. If you are sure the incoming data will not violate the destination table’s constraints, uncheck this property; doing so improves the performance of the data load.
  3. Rows per batch: By default, the value of this property is -1 or empty, which means all incoming rows are treated as a single batch. Change it to a positive value specifying the number of rows per batch.
  4. Maximum insert commit size: By default, the value is ‘2147483647’, as you can see in the above screenshot, which means all incoming data is committed in one shot after the data flow task completes successfully. Specify a smaller positive value to commit after that many records. Changing the default adds some overhead because the data flow engine commits several times, but it relieves pressure on the transaction log and tempdb, especially during high-volume data transfers.

8. Use the DelayValidation property: By default, this property is set to ‘False’. SSIS performs early validation, which validates the package and its components before package execution starts. If some tasks depend on values that are passed dynamically by other tasks at run time, set DelayValidation to ‘True’ for those tasks; otherwise validation, and therefore your package, will fail.

9. Use performance counters (for example, the SSIS pipeline counters ‘Buffers spooled’, ‘Buffer memory’, ‘Rows read’ and ‘Rows written’) to monitor SSIS performance.

10. Using checkpoints: Please refer to https://ssisblog.wordpress.com/?s=checkpoints

11. Parallel execution of tasks can also yield performance benefits; the package-level MaxConcurrentExecutables property controls how many executables can run in parallel (the default of -1 means the number of logical processors plus 2).

Note: These are general guidelines to optimize your ETL package. If you have any constructive feedback or any doubts, do post them here.

Enjoy! SSIS is fun..


Using Script Task

While developing an ETL solution, the developer may need to write some .NET code to perform functions or operations that are not supported by the built-in tasks and connection types. These operations may include:

  • Interacting with Active Directory to fetch user information
  • Interacting with file system
  • Creating performance counters

To perform all the functions mentioned above, we need an editor where we can write our .NET code. The Script Task provides this capability: we can write .NET code and read/write SSIS variables (both system and user-defined). SQL Server 2012 introduced the concept of ‘parameters’; these parameters can also be accessed within the Script Task, just like variables.

In the following example, I want to check whether a text file exists in a directory. There are many other ways to perform this task, but here I will use the Script Task. I will also show you how to access SSIS parameters and variables in the Script Task.

The flow of execution is as follows: I pass the directory path to the Script Task via a parameter. The Script Task reads that path and checks whether a text file exists. If a file exists, the Script Task returns the full path of the file (including the file name) to SSIS; otherwise it returns 0. This return value is captured in a variable for further processing if required.

Let’s begin with the example:

  1. Go to the Parameters tab in SSIS and create a parameter as shown:

[Screenshot: package parameter definition]

This parameter contains the directory path where my text file is located.

2. Create a variable that will contain the value returned by the script task as:

[Screenshot: SSIS variable definition]

3. Drag the Script Task from the SSIS toolbox onto the control flow window:

[Screenshot: Script Task in the SSIS toolbox]

Control flow will look like:

[Screenshot: control flow with the Script Task]

4. Double-click the Script Task to open the Script Task Editor:

[Screenshot: Script Task Editor]

ReadOnlyVariables: contains the variables/parameters that the Script Task can read but cannot modify. Variables that we don’t want the Script Task to modify are listed here.

ReadWriteVariables: contains the variables that the Script Task can both read and modify, returning the new values to SSIS.

5. Click the ellipsis next to ‘ReadOnlyVariables’:

[Screenshot: Select Variables dialog for ReadOnlyVariables]

Next, select the parameter that we want to pass to Script task by checking the box:

[Screenshot: parameter selected in the Select Variables dialog]

Similarly, click the ellipsis next to ‘ReadWriteVariables’ and select the variable by checking its box:

[Screenshot: variable selected for ReadWriteVariables]

After performing above steps, Script task editor will look like:

[Screenshot: Script Task Editor with variables configured]

Click ‘Edit Script’ as shown in the above screenshot to open the .cs file.

6. Once the .cs file opens, add ‘using System.IO;’ and ‘using System.Linq;’ to the directives at the top of the file (they are needed for Directory.GetFiles and FirstOrDefault), and then type the following code in the Main function:

// Read the directory path passed in through the package parameter.
string filepath = Dts.Variables["$Package::param_FilePath"].Value.ToString();

// Return "0" if no .txt file exists; otherwise return the full path of the first .txt file found.
if (String.IsNullOrEmpty(Directory.GetFiles(filepath, "*.txt").FirstOrDefault()))
{
    Dts.Variables["User::var_ReturnValue"].Value = "0";
}
else
{
    Dts.Variables["User::var_ReturnValue"].Value = Directory.GetFiles(filepath, "*.txt").FirstOrDefault();
}

Dts.TaskResult = (int)ScriptResults.Success;

Note the use of parameters and variables in the Script Task. Using ‘Dts.Variables’, we can read/write SSIS variables and parameters. Also, variable and parameter names are case sensitive, so be careful when typing them.

7. Now save the file and build the script. After the build succeeds, close the file. You will land back on the Script Task Editor. Click OK and the job is done.

8. Since you now have the Script Task output in the variable ‘var_ReturnValue’, you can decide how to use it as per your requirements (see the sketch below).
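For example (just one possible way to consume the value, not part of the original package), you can put an expression on the precedence constraint that follows the Script Task so that downstream tasks run only when a file was actually found:

@[User::var_ReturnValue] != "0"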

Happy Coding!!

Shortest code to get comma separated values from a column

Getting the values of a column in comma-separated format is often required when writing a stored procedure. The comma-separated values can then be used in a SQL ‘IN’ clause to filter results. The example below shows how to get the values of a column in comma-separated format.

Suppose you have a table named ‘tblCountry’ with a CountryName column containing the following data: India, Australia, US, UK.

Now you want your countries to be in comma separated format like India,Australia,US,UK.

Following snippet will give you the desired result:

DECLARE @SQL nvarchar(MAX)
SELECT @SQL = COALESCE(@SQL + ',', '') + CountryName FROM [dbo].[tblCountry]
SELECT @SQL

The result is a string of characters that can be used further in the code.
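As mentioned earlier, the string can be used with an ‘IN’ clause. Because the values here are character data, one way to do that is to build the list with quoted values and run it as dynamic SQL; this is a sketch, and the dbo.tblSales table is hypothetical:

DECLARE @CSV nvarchar(MAX), @Query nvarchar(MAX)
-- Builds the list with quotes between values; the outer quotes are added below
SELECT @CSV = COALESCE(@CSV + ''',''', '') + CountryName FROM [dbo].[tblCountry]
-- Resulting query: SELECT * FROM dbo.tblSales WHERE Country IN ('India','Australia','US','UK')
SET @Query = N'SELECT * FROM dbo.tblSales WHERE Country IN (''' + @CSV + ''')'
EXEC sp_executesql @Query
-- Note: only do this with trusted values, since the list is concatenated into dynamic SQL.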

Enjoy! Happy Coding!!

Leveraging SSIS in HDInsight

SSIS can be leveraged in HDInsight projects as effectively as in other BI projects. Once the data resides on a Hadoop cluster, you will typically want to import it into a cube or data warehouse for further analytics. There are multiple ways to achieve this:

1. Using SQOOP connector:

The SQOOP connector basically fires a command at the Hadoop command prompt and moves data from HDFS/Hive into your SQL Server tables. To automate the process, you can wrap the SQOOP command in a batch file and execute that batch file from SSIS, as sketched below.
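A rough sketch of what such a batch file might contain; the server, database, table, and HDFS path names below are hypothetical, and the exact arguments depend on the Sqoop / SQL Server connector version you have installed:

sqoop export --connect "jdbc:sqlserver://myserver:1433;databaseName=MyDW" --username etl_user --password MyPassword --table StagingCountrySales --export-dir /user/hive/warehouse/countrysales --input-fields-terminated-by "\t"

In SSIS, an Execute Process Task can then point at this batch file so it runs as part of the package.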

2. Using Hive data source in SSIS:

For this, one has to create a Hive data source and then use it in SSIS to move the data to the destination.
Note: The second option is not available in preview versions of HDInsight.