Sunday, February 15, 2015

My first learning example in Kettle An example is reproduced from Pentaho Tutorials Topic Filtering Rows

Hi Guys,

Here is my first learning experience with Kettle Community ETL.

Download Kettle from :  Click Me

Sources to get start with Kettle :

1) http://wiki.pentaho.com/display/EAI/Getting+Started
2) http://localhost:8080/pentaho/docs/InformationMap.html
3) http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+%28Kettle%29+Tutorial

You should have installed java in your local machine and path, java_home, jre_home set for it.

Below topic is a re-production of Kettle Transformation & Job example from Pentaho Tutorials and the description is slightly differ from the actual description.

Lets look at simple basics and then well jump into 1st example.


#  What is Transformation in Kettle ?
# What are jobs in Kettle ?
# Core difference b/w Transformation & Jobs in Kettle ? 
# Extensions for transformations & jobs in Kettle ?
# What are steps & hops ? 

1) What is the usage of Transformation ?

    Transformations are used to describe the data flows for ETL such as reading from a source, transforming data and
    loading it into a target location.

2) Jobs

Jobs are used to coordinate ETL activities such as

Defining the flow and dependencies for what order transformations should be run
Preparing for execution by checking conditions such as, "Is my source file available?," or "Does a table exist?"
Performing bulk load database operations
File Management such as posting or retrieving files using FTP, copying files and deleting files
Sending success or failure notifications through email


3)Whats the difference between transformations and jobs?

Transformations are about moving and transforming rows from source to target. Jobs are more about high level flow control: executing transformations, sending mails on failure, ftping files

4) Extensions for transformations & jobs in Kettle ?
Transformation : .ktr
Jobs : .kjb

5) What are steps & hops ?

  • A transformation is a network of logical tasks called steps.
  • Transformations are essentially data flows.
  • The transformation is, in essence, a directed graph of a logical set of data transformation configurations.
  • Steps are the building blocks of a transformation, 
  • for example a text file input or a table output. 
  • There are over 140 steps available in Pentaho Data Integration and they are grouped according to function; for example, input, output, scripting, and so on. 
  • Each step in a transformation is designed to perform a specific task, such as reading data from a flat file, filtering rows, and logging to a database 
  •  Steps can be configured to perform the tasks you require. 
  •  
  • Hops are data pathways that connect steps together and allow schema metadata to pass from one step to another. 
  •  Hops determine the flow of data through the steps not necessarily the sequence in which they run.
  •  When you run a transformation, each step starts up in its own thread and pushes and passes data. 
More info  at :  Click Me


TOPIC :
Load sales data into a database. Several of the customer records are missing postal codes(zip codes) that must be resolved before loading into the database.

You will be given two .csv files to load the data into database. 
The first file is : sales_data.csv and later one is Zipssortedbycitystate.csv

( Location of the files : /home/sada/softwares installed/Pentaho/data-integration/samples/transformations/files
)


In short , there is a column called "postalcode"  in sales_data.csv with missing postal codes(i.e., the values are null) and need to fill the missed the missed postal codes using "postalcode"  column of Zipsortedbycitystate.csv file. 


Explanation : 
The Final Transformation looks as shown in below :

The final job looks as shown in below(doing job is optional for this topic)


Transformation is divided into below sub tasks.
1) Retrieve Data From Flat File ( i.e., Retrieve Data From "sales_data.csv")
2) Filtering the records ( from the source file get only not null valued rows for "postal code").
3) Load to into Relational database ( Total Number of rows in this 2747)
4) Retrieve Data From your look-up file.
5) Resolve Missing zip information.


Download the example : Click Me

NOTE : While running the .ktr file (or .kjb file) in your environment you should specify the flat files(.csv files) location as per your folder structure.

#:  Ive taken postgresql as output table.

Thank you.

Sadakar
BI developer.
(Pentaho/Jasper/Talend/Kettle).

:-)







No comments:

Post a Comment

Note: Only a member of this blog may post a comment.