ETL, which stands for Extract, Transform, and Load, is the process of moving data from a source to a destination. I use this generic definition because the tools are not specific to data warehousing: ETL tools and processes can be used to move data in any context, from loading a data warehouse to migrating data during an OLTP system update.
The rest of this document focuses on general ETL issues and on issues specific to Pentaho Kettle.
A good resource for Pentaho Kettle is the Spoon User Guide: http://wiki.pentaho.com/display/EAI/Spoon+User+Guide
ETL BASICS
Some common uses of ETL are:
Merging Data – Data is pulled from multiple sources to be merged into one or more destinations.
Data Scrubbing (Format Modification) – Data formats can be changed (e.g., string to int). Additionally, ETL can be used to “scrub” the data: bad data can be “fixed” using fuzzy lookups or other operations that infer a correct value (see the sketch after this list).
Automation – ETL can be used to automate the movement of data between two locations. This also standardizes the process, so that the load is done the same way in every run.
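To make the format-modification and scrubbing ideas concrete, here is a minimal sketch in plain Java (not a Pentaho API; the field name, sentinel default, and sample values are illustrative assumptions):

    public class ScrubExample {
        // Hypothetical sentinel used when a value cannot be parsed.
        static final int DEFAULT_AGE = -1;

        static int scrubAge(String raw) {
            if (raw == null) {
                return DEFAULT_AGE;
            }
            try {
                // Trim whitespace, a common source-data defect, before
                // converting the string to an int.
                return Integer.parseInt(raw.trim());
            } catch (NumberFormatException e) {
                // "Fix" bad data by falling back to the sentinel; a real
                // process might infer a value with a fuzzy lookup instead.
                return DEFAULT_AGE;
            }
        }

        public static void main(String[] args) {
            System.out.println(scrubAge(" 42 "));  // prints 42
            System.out.println(scrubAge("forty")); // prints -1
        }
    }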
JOBS, TRANSFORMATIONS, STEPS, and HOPS
In Pentaho, an entire ETL process is described in one or more TRANSFORMATIONS (which can in turn be sequenced and scheduled by JOBS). A transformation is an entity that defines how to move data from one or more sources to one or more destinations. Specifically, a transformation contains the following parts:
1. Input(s) – One or more input steps define the source of the data for a transformation. A transformation can contain one or more sources. Due to the nature of the script component, there is no practical limit on the kinds of sources that can be used.
2. Output(s) – One or more output steps define the end point of the data’s journey. A transformation must contain one or more destinations. As with sources, there is no practical limit on what can be used as a destination.
3. Steps – These are used to massage the data in transit between the source and destination. Data format modifications, data clean-up, and data lookups commonly appear in this part of a transformation.
4. Hops – Steps (boxes) are connected by hops (lines), which define the path data follows from one step to the next (see the sketch below).
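To make the relationship between inputs, steps, hops, and outputs concrete, here is a minimal sketch in plain Java. This is only an analogy, not Pentaho code: Pentaho builds the same pipeline graphically, and here each map() call stands in for a step while the method chaining stands in for the hops.

    import java.util.List;
    import java.util.stream.Collectors;

    public class TransformationSketch {
        public static void main(String[] args) {
            // Input step: in a real transformation these rows would come
            // from a file, database, or other source.
            List<String> source = List.of("  alice ", "BOB", "  carol");

            // Intermediate steps, connected by "hops" (the chaining).
            List<String> cleaned = source.stream()
                    .map(String::trim)        // step: scrub whitespace
                    .map(String::toUpperCase) // step: format modification
                    .collect(Collectors.toList());

            // Output step: write the rows to the destination (stdout here).
            cleaned.forEach(System.out::println); // ALICE, BOB, CAROL
        }
    }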
SOURCE DATA
Below is a list of five types of source data that can be used in an ETL process. This list should not be viewed as comprehensive. Note that sources are called “inputs” in Pentaho.
1. OLTP (RDBMS) – This is a transactional database, typically attached to an application. This source provides the benefit of known data types and standardized access methods. Additionally, most developers are familiar with the technology, and such a system will typically enforce data integrity. The ETL developer must be cognizant of how the application uses the OLTP database in order not to adversely affect the OLTP system’s performance. Additionally, depending on the implementation, business rules may not reside within the database tier and may thus be inaccessible to the ETL process. A minimal extract sketch follows this item.
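As an illustration of the standardized access methods mentioned above, here is a minimal JDBC extract sketch in Java; the connection URL, credentials, and table are placeholder assumptions, and any JDBC-accessible RDBMS would work the same way:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class OltpExtractSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details.
            String url = "jdbc:postgresql://localhost:5432/sales";
            try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
                 Statement stmt = conn.createStatement();
                 // Keep extract queries lightweight: a heavy or unindexed
                 // query here is exactly how an ETL run can degrade the
                 // OLTP application's performance.
                 ResultSet rs = stmt.executeQuery("SELECT id, amount FROM orders")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + "\t" + rs.getBigDecimal("amount"));
                }
            }
        }
    }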
2. Mainframe (Legacy Systems) – While not as common as in the past, these systems still exist in many workplaces. They offer the ETL developer few if any benefits and many detriments. Access to these systems is very limited, and FTP transfer of text files is typically used to facilitate it. Additionally, many mainframe systems use a character set called EBCDIC, which requires conversion to the more common ASCII format (a conversion sketch follows this item). Finally, many routines are written in a flavor of COBOL or Natural; these may or may not be correct, and the original developer may not be available to explain how the algorithm works.
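The EBCDIC-to-ASCII conversion mentioned above can be done with the JVM’s built-in Cp037 (US EBCDIC) charset. A minimal sketch, where the sample bytes are assumed to be the EBCDIC encoding of “HELLO”:

    import java.nio.charset.Charset;

    public class EbcdicSketch {
        public static void main(String[] args) {
            // "HELLO" encoded in EBCDIC code page 037.
            byte[] ebcdic = { (byte) 0xC8, (byte) 0xC5, (byte) 0xD3,
                              (byte) 0xD3, (byte) 0xD6 };
            // Decode with the built-in US EBCDIC charset, yielding a normal
            // Java (Unicode) string that can then be written out as ASCII.
            String decoded = new String(ebcdic, Charset.forName("Cp037"));
            System.out.println(decoded); // prints HELLO
        }
    }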
3. Flat Files – This format holds the distinction of being the easiest to consume from the ETL standpoint. All data within the file is of the same format (string), and a methodology can be applied that allows easy error processing in a rerun scenario. This source does have a few caveats that need to be acknowledged. The first