Tuesday, 9 September 2014

Basics of ETL and Informatica



Posted by Koti Reddy on September 10, 2014
This post provides some basics of ETL and Informatica PowerCenter 8.6.
What is ETL?
ETL stands for Extraction, Transformation and Load. It is the process of transferring heterogeneous data from different sources to a target system, called a data warehouse, as per the business requirements, for purposes such as management reporting and data analysis. ETL tools provide the facility to extract data from different non-coherent systems, cleanse it, merge it and load it into target systems.
Informatica is one of the most powerful ETL tools, is easy to work with, and supports all the steps of the Extraction, Transformation and Load process. It has a simple visual interface, like forms in Visual Basic. You just need to drag and drop different objects (known as transformations) and design the process flow for data extraction, transformation and load.
Main Components of Informatica Power Center
1. Repository:
The repository is the heart of the Informatica tool. The repository is a kind of data inventory where all the metadata related to mappings, sources, targets etc. is kept. All the client tools and the Informatica Server fetch data from the repository.
2. Informatica PowerCenter Server:
All executions take place in the server: it makes physical connections to sources/targets, fetches data, applies the transformations mentioned in the mapping, and loads the data into the target system.
3. Informatica PowerCenter Client Tools:
These tools enable a developer to define the transformation process, known as a mapping, in the Designer; define run-time properties for a mapping, known as sessions, in the Workflow Manager; monitor the execution of sessions in the Workflow Monitor; and manage the repository (useful for administrators) in the Repository Manager.
Clients Tools
Administration Console
It’s a web-based client tool responsible for creating and managing services, creating users/groups, and assigning privileges. It creates repositories and services and initializes the services.
It creates and maintains two important services: 1. Repository Service – interacts with the repository, and 2. Integration Service – the main service, which executes the process through the Repository Service.
Repository Manager
It is responsible for creating the folders that manage the metadata used by different users.
Designer
It is responsible for defining the flows between source systems and target systems through transformation logic, called mappings. There are five components of the Designer:
1. Source Analyzer: Defines the source definition through an ODBC connection.
2. Target Designer: Defines the target definition through an ODBC connection.
3. Transformation Developer: Transformations help to transform the source data according to the requirements of the target system. Sorting, filtering, aggregation and joining are some examples of transformations.
4. Mapplet Designer: A mapplet is a reusable object that you create in the Mapplet Designer. It contains a set of transformations and lets you reuse the transformation logic in multiple mappings.
5. Mapping Designer: Mappings are developed in the Mapping Designer. A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation.
Workflow Manager
A workflow helps to load the data from source to target in a sequential manner. A workflow is a set of instructions that tells the Integration Service how to run tasks such as sessions, email notifications, and shell commands. After you create tasks in the Task Developer and Workflow Designer, you connect the tasks with links to create a workflow. There are three components of the Workflow Manager:
1. Task Developer: Defines tasks such as sessions, email, and command tasks. A session is a repository object that instructs the server how to execute, carrying the mapping definition and the production connections.
2. Worklet Designer: A worklet is an object which carries a set of sessions for execution in a serial or parallel flow.
3. Workflow Designer: A workflow is a repository object which instructs the server when to execute the tasks.
Workflow Monitor
The Workflow Monitor is helpful in monitoring and tracking the workflows created on each PowerCenter Server.
Simple steps to start with Informatica 8.6
1. Install Informatica
2. Create a repository in the Admin Console
3. Create Integration Services in the Admin Console
4. Connect to the Repository Manager (the first time, add the repository created, connect to it, and create a folder)
5. Work in the Designer to create mappings
6. Work in the Workflow Manager to create workflows
7. Work in the Workflow Monitor to track workflow runs
- Koti Reddy

Tuesday, 2 September 2014

Overview of SQL Server Stored Procedure


1. Overview 

Storing and retrieving data, that is, accessing and manipulating data within a database, is a rather straightforward operation in SQL Server, so we can go for T-SQL with four simple commands – SELECT, INSERT, DELETE, and UPDATE – and perform all the required tasks. The skill in working with databases lies in being able to not just work with these commands, but also to complete each unit of work efficiently.
We will cover the different types of stored procedures used in SQL Server and their usage. We will also examine the working of stored procedures, the potentially tricky areas, and how to avoid them.
To increase performance, every developer should consider a number of factors during development. However, we will focus on running T-SQL from within stored procedures; we will examine what constitutes a stored procedure and discuss the different types. We will cover:
  • The difference between system stored procedures and user stored procedures
  • Creating and designing stored procedures  
  • Using parameters and best practices 
  • Flow control statements  
  • Recursion 
  • Return values 

2. What is a Stored Procedure? 

If a repetitive T-SQL task has to be executed within an application, then the best repository for it is a program called a stored procedure, stored in SQL Server. Storing the code inside a SQL Server object gives us many advantages, like:
  • Security due to encryption  
  • Performance gains due to compilation 
  • Being able to hold the code in a central repository: 
    • Altering the code in SQL Server without replicating in several different programs
    • Being able to keep statistics on the code to keep it optimized 
  • Reduction in the amount of data passed over a network by keeping the code on the server
  • Hiding the raw data by allowing only stored procedures to gain access to the data
You may have executed some ad-hoc queries for tasks like inserting data, querying information in other systems, or creating new database objects such as tables. All these tasks can be placed within a stored procedure, so that any developer can run the same code without having to recreate the T-SQL commands. Also, generalizing the code for all values makes it generic and reusable.
Stored procedures are more than just tools for performing repetitive tasks. There are two main types of stored procedure – system stored procedures and user-defined stored procedures. We also have extended stored procedures that can reside as either system or user-defined types. Extended stored procedures give functionality that is not necessarily contained within SQL Server, like allowing DOS commands to run and working with e-mail. It is also possible to create your own extended stored procedures.
A sample stored procedure:
    /*
    -- Example call:
    DECLARE @OutPutValue VARCHAR(100)
    EXEC spExample 'CodeProject', @OutPutValue OUTPUT
    PRINT @OutPutValue
    */
    CREATE PROCEDURE [dbo].[spExample]
        @parameter1 VARCHAR(100)
       ,@parameter2 VARCHAR(200) OUTPUT
    AS
    BEGIN
        -- Local variable appended to the output message
        DECLARE @parameter3 VARCHAR(100)
        SET @parameter3 = ' Your development resources.'

        IF @parameter1 IS NOT NULL AND LEN(@parameter1) > 1
            SELECT @parameter2 = 'The ' + @parameter1 + @parameter3
        ELSE
            SELECT @parameter2 = 'CodeProject is cool!'

        RETURN
    END
    GO

2.1 System Stored Procedures

In SQL Server, many administrative and informational activities can be performed using system stored procedures. Every time we add or modify a table, make a backup plan, or perform any other administrative function from within Enterprise Manager, we actually call a stored procedure specifically written to complete the desired action. These stored procedures are known as system stored procedures; they enhance the basic functionality of SQL Server itself, either by extending the functionality of an existing system stored procedure or by creating new functionality on top of what is already there.
System stored procedures are prefixed with sp_, so it is not advisable to use the sp_ prefix for any of the stored procedures that we create, unless they form a part of our SQL Server installation. Creating a stored procedure prefixed with sp_ and placing it in the master database will make it available to any database without the need to prefix the stored procedure with the name of the database.
Let's clarify this with an example. If we take the sp_who stored procedure, copy it as sp_mywho, store it in the master database, and then move to another database such as northwind, we can still execute sp_mywho, rather than having to specify the procedure in the fully qualified manner as master.dbo.sp_mywho.
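As a minimal sketch (a simple wrapper around the real sp_who, for illustration only), the steps would look like this:
-- Create our own "system" procedure in master (illustrative sketch)
USE master
GO
CREATE PROCEDURE sp_mywho
AS
   EXEC sp_who   -- delegate to the built-in procedure
GO

-- From any other database, the sp_ prefix makes SQL Server look in master
USE northwind
GO
EXEC sp_mywho    -- no need to write master.dbo.sp_mywho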

2.2 User Stored Procedures  

A user stored procedure is any program that is stored and compiled within SQL Server, but not held in the master database and not prefixed with sp_. User stored procedures can be categorized into three distinct types:
  • User stored procedures 
  • Triggers, and
  • User defined functions

2.3 Creating Stored Procedures

The creation process depends on what we want the procedure to do. Now let's take a look at the syntax for creating a stored procedure:

Syntax
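The original syntax diagram is not reproduced in this copy; a simplified form of the CREATE PROCEDURE syntax (SQL Server 2000 era) is:
CREATE PROC [ EDURE ] procedure_name
    [ @parameter data_type [ = default ] [ OUTPUT ] ] [ ,...n ]
[ WITH { RECOMPILE | ENCRYPTION | RECOMPILE , ENCRYPTION } ]
AS
    sql_statement [ ...n ]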


An example of a simple stored procedure follows, where two numbers are passed in and the midpoint of the two numbers is listed:
CREATE PROCEDURE ut_MidPoint @LowerNumber int, @HigherNumber int
AS
BEGIN

   DECLARE @Mid int
   IF @LowerNumber > @HigherNumber
   BEGIN
      RAISERROR('You have entered your numbers the wrong way round',16,1)
      RETURN   -- without this, execution would continue past the error
   END

   SET @Mid = ((@HigherNumber - @LowerNumber) / 2) + @LowerNumber

   SELECT @Mid

END
At the time of creation, SQL Server takes our code and parses it for any syntactical errors, but the objects the procedure references are not checked for existence until the procedure runs. This is known as deferred name resolution. It can be an asset or a drawback: we can create temporary tables that exist for the lifetime of the stored procedure execution only, which is desirable, and they will still pass the compilation stage; but if we define a temporary table and get something wrong later on, such as a column name, then the compilation will not pick up this error.
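A short sketch of how this can bite (the procedure, table, and column names here are invented for illustration):
-- This compiles, even though #Results does not exist yet and the column
-- name in the INSERT is misspelt; name resolution is deferred.
CREATE PROCEDURE ut_DeferredDemo
AS
BEGIN
   CREATE TABLE #Results (CorrectColumn int)
   INSERT INTO #Results (WrongColumn) VALUES (1)  -- typo not caught here
END
GO

EXEC ut_DeferredDemo   -- fails only now, with an invalid column name error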
Once it is compiled, the details of the stored procedure are stored in three system tables in the relevant database, as follows:

sysobjects

Contains one row for each object (constraint, default, log, rule, stored procedure, and so on) created within a database. In tempdb only, this table includes a row for each temporary object.



sysdepends 

Contains dependency information between objects (views, procedures, and triggers) in the database, and the objects (tables, views, and procedures) that are contained in their definition.

syscomments

Contains entries for each view, rule, default, trigger, CHECK constraint, DEFAULT constraint, and stored procedure within the database. The text column contains the original SQL definition statements.
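For example, to look up the row and the stored source text for the ut_MidPoint procedure created earlier, you could query these tables directly (legacy system-table syntax):
-- 'P' in sysobjects.type marks a stored procedure
SELECT o.name, o.type, c.text
FROM sysobjects o
JOIN syscomments c ON c.id = o.id
WHERE o.type = 'P'
  AND o.name = 'ut_MidPoint'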

2.3.1 Performance Consideration

When a stored procedure is created, it passes through several steps. First of all, the T-SQL is parsed and resolved, saved to disk, and stored in SQL Server. The first time the procedure is executed, it is retrieved and optimized on the basis of any data queries and values passed through parameters. SQL Server inspects the code and tries to use the best indexes on the referenced tables, by checking the statistics that are held for those tables.
The query plan is then cached within SQL Server, ready for any further executions. SQL Server will always use this plan, provided the plan has not been retired from the cache. Thus, the performance gain of stored procedures comes from compiled, cached plans.

2.3.2 Network Consideration

You might think that the T-SQL statement to insert a row into a table takes very few characters to pass. However, creating a stored procedure and passing only the name of the stored procedure, its parameters, and their values reduces the characters needed even further. We can see the extra overhead imposed by inline T-SQL statements on our network by multiplying this difference by the number of calls being made. This can be a significant issue, especially if we are working over the Internet.
Compare the next two statements with each other:
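The original pair of statements is not preserved in this copy of the article; the following pair is an illustrative reconstruction (the table and procedure names are hypothetical), consistent with the character counts quoted below:
-- Ad-hoc T-SQL: the whole statement travels over the network
INSERT INTO Customers (CustName, Country) VALUES ('CodeProject', 'Canada')

-- Stored procedure call: only the name and parameter values travel
EXEC ut_InsertCustomer 'CodeProject', 'Canada'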
The first statement has 74 characters, while the second has 46 characters – a difference of a mere 28 characters. However, if this were a more column-intensive insert, with 10,000 of these in a day, for example, it amounts to almost 280 KB of wasted bandwidth! What if an image data type were being uploaded or downloaded? Anything of a binary data type, such as images or sounds, is sent as binary values when using T-SQL inline. These are converted to character strings, which will double the size of the ad-hoc query that we are sending.

3. How Stored Procedures Work 

When we run a stored procedure, SQL Server prepares an execution plan so that the procedure's execution is very fast. Stored procedures can:
  • Take parameters 
  • Call other procedures
  • Return a status value to a calling procedure or batch to indicate success or failure and the reason for failure
  • Return values of parameters to a calling procedure or batch
  • Be executed on remote servers
The ability to write smart stored procedures significantly enhances the power, efficiency, and flexibility of SQL. Compiled procedures radically improve the performance of SQL statements and batches. In addition, stored procedures on other servers can be executed if both your server and the remote server are set up to allow remote logins. You can write triggers on your local server that execute procedures on a remote server whenever certain events, such as deletions, updates, or inserts, take place locally.
Stored procedures differ from ordinary SQL statements and from batches of SQL statements in that they are precompiled. The first time you run a procedure, the query processor analyzes it and prepares an execution plan that is ultimately stored in a system table. Subsequently, the procedure is executed according to the stored plan. Since most of the query processing work has already been performed, stored procedures execute almost instantly.

4. Parameter Usage 

Data will be passed in and out of a stored procedure, even when the purpose of the procedure is not to return information. All variables defined as parameters have to be prefixed with an @ sign. The normal practice is to define input parameters before any output parameters.
As with any column in a database, use the data type that is most appropriate for the parameter; if the parameter has to match a column in a table, then ensure that both data types match exactly.
If the parameter does not match a column but will still be used, for example as a join condition or a filter condition, then use a data type and size appropriate to the type of data, rather than making everything a character data type. Also, wherever possible, use varchar or nvarchar rather than char, just as you would in your inline T-SQL, to avoid unnecessary spaces being passed across the network.
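As a sketch of these practices (the Customers table and the procedure name are hypothetical), here is an input parameter typed to match its column, plus an output parameter declared last:
CREATE PROCEDURE ut_CountCustomers
   @Country varchar(30),   -- input: same type and size as the Country column
   @Matches int OUTPUT     -- output: declared after the input parameters
AS
BEGIN
   SELECT @Matches = COUNT(*)
   FROM Customers
   WHERE Country = @Country
END
GO

-- Calling it and reading the output value back:
DECLARE @Total int
EXEC ut_CountCustomers 'Canada', @Total OUTPUT
PRINT @Total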

5. Calling a Stored Procedure

Many people tend to miss a performance enhancement related to executing or calling a stored procedure. For example, if you want to call the [Ten Most Expensive Products] stored procedure in the northwind database, you can simply do it as:
[Ten Most Expensive Products]
You can omit EXEC(UTE) when the call is the first statement in a batch, but you will need it otherwise, for example when you run one stored procedure from within another.
However, this is not the most efficient way to call a stored procedure. When such a command is processed, SQL Server uses a hierarchical method to find the stored procedure and execute it. First of all, it takes the login ID of the user who wants to execute the procedure and checks whether a stored procedure of that name exists under that owner. If it doesn't, SQL Server looks for the stored procedure under the dbo owner. Hence, always fully qualify your stored procedure with the owner.
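So, for the Northwind example above, the preferred form is:
-- Unqualified: SQL Server first searches under the caller's login
EXEC [Ten Most Expensive Products]

-- Fully qualified: the lookup goes straight to the right object
EXEC dbo.[Ten Most Expensive Products]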

Summary 

We have covered several issues concerning stored procedures. We looked at what a stored procedure is, how to create one, and how stored procedures can help your development run at its optimum. Even when a database is being replicated, stored procedures can and do improve performance.

Types of Join in SQL Server



Introduction

In this tip, I am going to explain the types of join in SQL Server.

What is a Join?

An SQL JOIN clause is used to combine rows from two or more tables, based on a common field between them.
There are many types of join:
  • Inner Join
    1. Equi-join
    2. Natural Join
  • Outer Join
    1. Left outer Join
    2. Right outer join
    3. Full outer join
  • Cross Join
  • Self Join

Using the Code

A join is very useful for fetching records from multiple tables with reference to a common column between them.
To understand joins with examples, we have to create two tables in a SQL Server database.
  1. Employee
    create table Employee(
    id int identity(1,1) primary key,
    Username varchar(50),
    FirstName varchar(50),
    LastName varchar(50),
    DepartID int
    )
  2. Departments
    create table Departments(
    id int identity(1,1) primary key,
    DepartmentName varchar(50)
    )
Now fill the Employee and Departments tables with some demo records.
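The demo records were shown as screenshots in the original post; a small illustrative data set (the values are made up) that the join queries below can run against would be:
 insert into Departments (DepartmentName) values ('Sales')
 insert into Departments (DepartmentName) values ('IT')
 insert into Departments (DepartmentName) values ('HR')

 insert into Employee (Username, FirstName, LastName, DepartID) values ('jsmith', 'John', 'Smith', 1)
 insert into Employee (Username, FirstName, LastName, DepartID) values ('akhan', 'Asif', 'Khan', 2)
 -- an employee with no department, so the outer joins have unmatched rows to show
 insert into Employee (Username, FirstName, LastName, DepartID) values ('mlee', 'Mary', 'Lee', NULL)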

1) Inner Join

The join that displays only the rows that have a match in both the joined tables is known as an inner join.
 select e1.Username,e1.FirstName,e1.LastName,e2.DepartmentName
from Employee e1 inner join Departments e2 on e1.DepartID=e2.id 
It gives the matched rows from both tables, matching the DepartID of the first table with the id of the second table.
Equi-Join
An equi-join is a special type of join in which we use only the equality operator. Hence, when you write a join query using the equality operator, that join query comes under equi-join.
An equi-join has only the (=) operator in the join condition.
An equi-join can be an inner join, a left outer join or a right outer join.
Check the query for an equi-join:
 SELECT * FROM Employee e1 JOIN Departments e2 ON e1.DepartID = e2.id 

2) Outer Join

An outer join returns rows from the joined tables even when there is no match on the other side.
We have three types of outer join:
  1. Left outer join
  2. Right outer join
  3. Full outer join
a) Left Outer Join
A left outer join displays all the rows from the first table and the matched rows from the second table:
 SELECT * FROM Employee e1 LEFT OUTER JOIN Departments e2
ON e1.DepartID = e2.id 
b) Right Outer Join
A right outer join displays all the rows from the second table and the matched rows from the first table:
 SELECT * FROM Employee e1 RIGHT OUTER JOIN Departments e2
ON e1.DepartID = e2.id
c) Full Outer Join
A full outer join returns all the rows from both tables, whether they have been matched or not:
 SELECT * FROM Employee e1 FULL OUTER JOIN Departments e2
ON e1.DepartID = e2.id 

3) Cross Join

A cross join produces the Cartesian product of the tables that are involved in the join. The size of a Cartesian product is the number of rows in the first table multiplied by the number of rows in the second table:
 SELECT * FROM Employee e1 CROSS JOIN Departments e2 
You can also write the query like this:
 SELECT * FROM Employee e1, Departments e2

4) Self Join

Joining a table to itself is called a self join. A self join is used to retrieve the records having some relation or similarity to other records in the same table. Here, we need to use aliases for the same table to set up a self join between instances of the single table and retrieve the records satisfying the condition in the WHERE clause.
SELECT e1.Username,e1.FirstName,e1.LastName from Employee e1
inner join Employee e2 on e1.id=e2.DepartID
Here, I have retrieved the rows in which the id and DepartID of the Employee table match.

Points of Interest

Here, I have taken one more example of a self join: a scenario where the manager's name can be retrieved via managerid, with reference to the employee id, in the same table. For this, I have created a table called employees.
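The table itself was shown as a screenshot; a structure consistent with the query below (the column types are assumed) would be:
 create table employees(
 id int identity(1,1) primary key,
 empName varchar(50),
 managerid int
 )

 -- Illustrative rows: employee 1 manages employees 2 and 3
 insert into employees (empName, managerid) values ('Koti', NULL)
 insert into employees (empName, managerid) values ('Ravi', 1)
 insert into employees (empName, managerid) values ('Sita', 1)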
If we have to retrieve the manager name from the manager id, it can be done with a self join:
 select e1.empName as ManagerName,e2.empName as EmpName
from employees e1 inner join employees e2 on e1.id=e2.managerid 

Wednesday, 27 August 2014

Dimensional Modeling in Data Warehousing



The dimensional data model concept involves two types of tables, and it is different from the third normal form.

This concept uses a fact table, which contains the measurements of the business, and dimension tables, which contain the context (the dimensions of calculation) for those measurements.
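A minimal sketch of the two table types (the table and column names are hypothetical):
-- Dimension table: the context of the measurements
create table DimProduct(
ProductKey int identity(1,1) primary key,  -- surrogate key
ProductName varchar(100),
Category varchar(50)
)

-- Fact table: the measurements of the business, keyed by its dimensions
create table FactSales(
ProductKey int references DimProduct(ProductKey),
DateKey int,
SalesAmount money,    -- measure
QuantitySold int      -- measure
)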

Fundamental Stages of Data Warehousing



Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact the operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.

Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis: every time an operational system performs a transaction (e.g. an order, a delivery or a booking), the warehouse is updated.

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

Data Warehousing: Intro



What is Data Warehousing?

A data warehouse is the main repository of an organization's historical data, its corporate memory.

It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems (Ref:Wikipedia).

Data warehousing is the collection of data designed to support management decision making.

Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.

It is a repository of integrated information, available for queries and analysis.

OLTP vs OLAP : Difference



Data Source
OLTP: Operational data; the OLTP system is the original source of the data.
OLAP: Consolidated data; OLAP data comes from various OLTP sources.

Process Goal
OLTP: Snapshot of the business processes that perform the fundamental business tasks.
OLAP: Multi-dimensional views of business activities for planning and decision making.

Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.
OLAP: Complex, long-running queries run by the system to update the aggregated data.

Database Design
OLTP: Normalized, small database. Speed is not an issue due to the smaller database, and normalization will not degrade performance. This adopts the entity-relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, large database. Speed is an issue due to the larger database; de-normalizing improves performance, as there are fewer tables to scan while performing tasks. This adopts a star, snowflake or fact constellation schema and a subject-oriented database design.

Backup and System Administration
OLTP: Regular database backups and standard system administration do the job.
OLAP: Reloading the OLTP data is considered a good backup option.

Define : Surrogate Key



A surrogate key is a substitution for the natural primary key in data warehousing.

It is just a unique identifier or number for each row that can be used as the primary key of the table.

The only requirement for a surrogate primary key is that it is unique for each row in the table.

It is useful because the natural primary key can change, and this makes updates more difficult.

Surrogate keys are typically integer or numeric.
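A minimal sketch in SQL Server, assuming an IDENTITY column is used to generate the surrogate values (the table and column names are hypothetical):
create table DimCustomer(
CustomerKey int identity(1,1) primary key,   -- surrogate key
CustomerCode varchar(20),                    -- natural/business key, which may change
CustomerName varchar(100)
)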