Fantastic Fireworks Data Warehousing ETL Process Design

1 Executive Summary

In the era of information explosion, a data warehouse helps enterprises manage and analyze their business data to support decision-making. After modeling the star schema for the Fantastic Fireworks data warehouse, the next step is to populate the designed data warehouse with source data that may be stored in several independent systems. Data integration and data quality are the critical issues in this step.

  • This report will refine the previous star schema.
  • Additionally, the report will discuss the design of the ETL process, which extracts the raw data from the operational systems, transforms it into a unified format to ensure data quality, and loads the processed data into the data warehouse.
  • The data dictionary is attached to provide the user with the definition of the schema and data content of the data warehouse; it also shows the end user where the integrated data is sourced from.

Issues in ETL process design:

  • Date formats differ between source systems and need to be unified in the data warehouse.
  • Low-quality records and invalid inputs must be handled (for example, replacing the letter “O” with the digit “0” in postcodes in the Customer table).
  • Data must be integrated across systems (for example, retrieving the cost per item from the Inventory System for the Sales fact table).

2 Design of the ETL Process

2.1 Customer Dimension Table
Figure 1 Customer ETL Process

  • Extraction: extract data from the four source customer datasets. The mapping between source data and extracted data is shown in Figure 2. Several data issues arise when extracting from multiple sources: the same field may have different data types across sources, postcodes may be misspelled (the letter O in place of the digit 0, or an incorrect length), and the same customer may appear in different stores. The extraction is executed weekly, and the version field is used to handle historical customer records.
Figure 2 Customer source-target Data Mapping

  • Transformation: the first step is to unify the field names and data types after reading the four source files. Then replace the letter O with the digit 0 in postcodes and handle postcodes with an incorrect length. Meanwhile, detect and merge duplicate customer records between the Darwin and Melbourne stores. After that, merge and sort all customer records and calculate the age of each customer. Finally, add the version number to handle historical records and output to the target table (a sketch of these cleaning steps is given at the end of this subsection).
  • Loading: all customer records are loaded into the data warehouse initially; new and modified records are then loaded incrementally with a constructive merge. The final records stored in the data warehouse are shown in Figure 3.
Figure 3 Customer Dimensional Table
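
To make the cleaning steps above concrete, below is a minimal pandas sketch of the Customer transformation, assuming the four store files have already been read into DataFrames with the target field names (custCode, postcode, name, dob). The duplicate-detection key and the four-digit postcode rule are illustrative assumptions, not part of the original design.

    import pandas as pd

    def clean_customers(store_frames):
        """Sketch of the Customer cleaning steps; store_frames is a list of DataFrames, one per store."""
        df = pd.concat(store_frames, ignore_index=True)              # merge the four store files
        # Replace the letter O with the digit 0 in postcodes.
        df["postcode"] = df["postcode"].astype(str).str.upper().str.replace("O", "0", regex=False)
        # Flag postcodes with an incorrect length (assumed to be 4 digits) for further handling.
        df["postcodeValid"] = df["postcode"].str.len() == 4
        # Detect the same customer appearing in different stores (assumed key: name + date of birth).
        df = df.drop_duplicates(subset=["name", "dob"])
        df = df.sort_values("custCode").reset_index(drop=True)       # merge and sort all records
        # Calculate age in whole years from the date of birth.
        dob = pd.to_datetime(df["dob"])
        df["age"] = ((pd.Timestamp.today() - dob).dt.days // 365).astype(int)
        df["version"] = 1                                            # version number for historical records
        return df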

2.2 Product Dimension Table
Figure 4 Product ETL Process

  • Extraction: Product and Product Order are the source files in the Inventory System. The mapping relationship is shown in Figure 5. The integration challenges are removing the large number of null rows in the Product Order table and joining the two tables. The extraction is executed weekly.
Figure 5 Product source-target Data Mapping

  • Transformation: the first step is to remove the null rows by filtering out rows whose partNumber field is null. Then retrieve the unitPrice and validFromDate fields from the Product table; to do this, the year must be extracted from orderDate and validFromDate so that the unitPrice can be retrieved for each product. However, the unitPrice for 2019 orders was not updated and remains the same as the 2018 unitPrice, so a filter is needed to separate the 2019 records, set their lookup year back to 2018, and then retrieve the corresponding unitPrice (a sketch of this logic is given at the end of this subsection).
  • Loading: existing product records are loaded into the data warehouse initially; new records are then incrementally appended. A sample of the resulting records is shown in Figure 6.
Figure 6 Product Dimensional Table
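
A minimal pandas sketch of the Product transformation described above is given below. It assumes the Product and ProductOrder files have been read into DataFrames with the field names listed in Section 4.2, and the join on price year is an illustrative assumption.

    import pandas as pd

    def build_product_dimension(product, product_order):
        """Sketch of the Product dimension transformation."""
        orders = product_order.dropna(subset=["partNumber"]).copy()       # remove null rows
        orders["lookupYear"] = pd.to_datetime(orders["orderDate"]).dt.year
        # 2019 prices were not updated, so 2019 orders reuse the 2018 unitPrice.
        orders["lookupYear"] = orders["lookupYear"].replace({2019: 2018})
        prices = product.copy()
        prices["priceYear"] = pd.to_datetime(prices["validFromDate"]).dt.year
        # Retrieve unitPrice and validFromDate from the Product table by part number and year.
        dim = orders.merge(
            prices[["partNumber", "priceYear", "unitPrice", "validFromDate", "description"]],
            left_on=["partNumber", "lookupYear"],
            right_on=["partNumber", "priceYear"],
            how="left",
        )
        return dim.drop(columns=["lookupYear", "priceYear"])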

2.3 Employee Dimension Table
Figure 7 Employee ETL Process

  • Extraction: the source file in the Sales System is SalesPerson. The mapping between source data and extracted data is shown in Figure 8. The only difference in data format is the commission rate, which is a String in the source but should be a DOUBLE in the data warehouse for the convenience of calculation. The table is updated weekly because employee records may change frequently.
Figure 8 Employee source-target Data Mapping

  • Transformation: the first step is to split commissionRate to keep the number and remove the percent sign. Then a formula step converts the integer percentage into a decimal value, which is finally rounded to two decimal places (a sketch of this conversion is given at the end of this subsection).
  • Loading: all existing records are loaded into the data warehouse initially; modified records are then loaded incrementally with a destructive merge. A sample of the resulting records is shown in Figure 9.
Figure 9 Employee Dimensional Table
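
For illustration, a minimal pandas sketch of the commissionRate conversion is shown below, assuming the rate arrives as a string such as '15%'.

    import pandas as pd

    def parse_commission_rate(sales_person):
        """Sketch: convert commissionRate from a string such as '15%' to the DOUBLE value 0.15."""
        df = sales_person.copy()
        df["commissionRate"] = (
            df["commissionRate"]
            .str.rstrip("%")       # split off the percent sign, keeping the number
            .astype(float)
            .div(100)              # integer percentage -> decimal value
            .round(2)              # keep two decimal places
        )
        return df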

2.4 Store Dimension Table
Figure 10 Store ETL Process

  • Extraction: the source file in the Sales System is Store. The mapping between source data and extracted data is shown in Figure 11. The data format remains the same. The table is updated weekly.
Figure 11 Store source-target Data Mapping

  • Transformation: no special transformation is needed in this step.
  • Loading: current store records are loaded into the data warehouse initially. A destructive merge is used to handle address changes or the opening of new stores (a sketch of this merge is given at the end of this subsection). A sample of the resulting records is shown in Figure 12.
Figure 12 Store Dimensional Table
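
The destructive merge mentioned above can be sketched as follows: incoming store records simply overwrite existing records with the same storeID, and genuinely new stores are appended. This implementation is an illustrative assumption, not the exact tool step used.

    import pandas as pd

    def destructive_merge(existing, incoming, key="storeID"):
        """Sketch of a destructive merge: incoming rows replace existing rows with the same key."""
        kept = existing[~existing[key].isin(incoming[key])]     # drop rows that are being replaced
        return pd.concat([kept, incoming], ignore_index=True)   # keep the latest versions plus new stores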

2.5 Date Dimension Table
Figure 13 Date ETL Process

  • Extraction: the Date dimension extracts data from the data.xlsx file. The mapping between source data and target data is shown in Figure 14. There is a data quality issue: in some records the day component of the Date field does not match the corresponding Day field. The extraction takes place daily.
Figure 14 Date source-target Data Mapping

  • Transformation: after extracting the source data, the data types are converted to those predefined in the data warehouse. Then incorrect values in the Date field are corrected by a formula that rebuilds the date from the Day, Month and Year fields (a sketch of this formula is given at the end of this subsection).
  • Loading: existing date records are loaded into the data warehouse initially; new records are appended daily. The final result is shown in Figure 15.
Figure 15 Date Dimensional Table
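
A minimal pandas sketch of the Date correction formula is shown below. It assumes the Month field is stored as a full month name, as suggested by its VARCHAR10 type in Section 4.5.

    import pandas as pd

    def fix_date_field(dates):
        """Sketch: rebuild the Date field from the Year, Month and Day columns."""
        df = dates.copy()
        # Assumed: Month holds a full month name such as 'January'.
        month_num = pd.to_datetime(df["Month"], format="%B").dt.month
        df["Date"] = pd.to_datetime(dict(year=df["Year"], month=month_num, day=df["Day"]))
        return df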

2.6 Sales Fact Table
Figure 16 Sales ETL Process

  • Extraction: the Sales fact table extracts data from the Order and OrderItem source files, and also uses the related dimension tables to map some source fields to foreign keys. The mapping between source data and target data is shown in Figure 17. The extraction is executed weekly.
Figure 17 Sales source-target Data Mapping

  • Transformation: the first step is to combine the Order and OrderItem files. Then convert the letter case and the order date format so that storeID and dateID, which act as foreign keys, can be retrieved from the Store and Date tables respectively. Next, change the data types to formats suitable for later calculation. After that, a database lookup step retrieves the unitCost from the Product table, and a calculator step computes dollarSales and margin (a sketch of these calculations is given at the end of this subsection). Finally, convert the data types to the formats predefined in the data warehouse design.
  • Loading: all existing records are loaded into the data warehouse initially; incremental records are then loaded with the append method. The final result is shown in Figure 18.
Figure 18 Sales Fact Table
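
To make the measure calculations concrete, here is a minimal pandas sketch. The field names follow the target schema in Section 4.6 rather than the raw Order.csv/OrderItem.csv headers, and the formulas for dollarSales and margin are assumptions based on the usual definitions of revenue and profit per order line.

    import pandas as pd

    def build_sales_fact(order, order_item, product):
        """Sketch of the Sales fact measures; lookups of the other foreign keys are omitted."""
        sales = order_item.merge(order, on="OrderID", how="inner")   # combine Order and OrderItem
        # Database lookup: retrieve unitCost from the Product table.
        sales = sales.merge(
            product[["partNumber", "unitCost"]],
            left_on="Product_partNumber", right_on="partNumber", how="left",
        )
        # Calculator step (assumed formulas): revenue and profit per order line.
        sales["dollarSales"] = sales["unitSales"] * sales["salePrice"]
        sales["margin"] = sales["dollarSales"] - sales["unitSales"] * sales["unitCost"]
        return sales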

2.7 Employees Fact Table
Figure 19 Employees ETL Process

  • Extraction: the source data come from the Sales fact table and the Employee dimension table. The ETL steps for the Sales fact table already provide Employee_employeeID and Date_dateID within each sale record. For the convenience of tracking each employee's daily commission, the table is sorted by date. The table is updated monthly.
Figure 20 Employees source-target Data Mapping

  • Transformation: select only the fields needed for the Employees table, namely employeeID, dateID and dollarSales. After sorting these records by date, group the records with the same employee and date and sum the dollarSales within each group. Then use employeeID to retrieve each employee's commissionRate from the Employee table; the total commission is derived by multiplying commissionRate by the summed dollarSales (a sketch of this aggregation is given at the end of this subsection).
  • Loading: existing records are loaded into the data warehouse initially; new Employees records are appended to the table without overwriting. A sample of the resulting records is shown in Figure 21.
Figure 21 Employees Fact Table
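
The aggregation described above can be sketched with pandas as follows, using the field names from Sections 4.3 and 4.6.

    import pandas as pd

    def build_employees_fact(sales_fact, employee_dim):
        """Sketch of the Employees fact table: daily commission per employee."""
        daily = (
            sales_fact[["Employee_employeeID", "Date_dateID", "dollarSales"]]
            .groupby(["Employee_employeeID", "Date_dateID"], as_index=False)["dollarSales"]
            .sum()                                                   # total sales per employee per day
        )
        fact = daily.merge(
            employee_dim[["employeeID", "commissionRate"]],
            left_on="Employee_employeeID", right_on="employeeID", how="left",
        )
        fact["commission"] = fact["dollarSales"] * fact["commissionRate"]   # rate x summed daily sales
        fact.insert(0, "ID", range(1, len(fact) + 1))                       # auto-generated surrogate key
        return fact[["ID", "Employee_employeeID", "Date_dateID", "commission"]]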

3 Redesign of the Data Warehouse

  1. Change Employee table from the dimension table to the fact table
  2. Add ‘orderID’ and ‘lineID’ attributes to the Sales fact table
  3. Integrate the inventory fact table with Product dimension

Figure 22 illustrates the redesign of the Fantastic Fireworks data warehouse; the detailed modifications are addressed in the following paragraphs.

Figure 22 Redesign of Data Warehouse

In the first version of the design, a dimension table called Employee helps measure key employees through the Sales fact table. To better measure key employees on the basis of commission, an Employee fact table is created; the detailed design is listed below.

  • Employee Fact Table:

    Attributes: ID, employeeID, dateID, commission

    The Employee Fact Table uses ID as a surrogate key to identify records uniquely. It has two foreign keys, employeeID and dateID, which connect it to the Employee and Date dimension tables and provide the commissionRate and time data. The commission is calculated by multiplying commissionRate by dollarSales.

  • Sales Fact Table:

    In terms of the Sales fact table, the surrogate key ‘saleID’ is replaced by OrderID and LineID in order to better match and track the source data from the original system.

Finally, unlike the first version, the Inventory Fact Table is integrated into the Product Dimension Table, because a separate fact table for the Inventory System is unnecessary.

4 Data Dictionary

There are three types of metadata: operational metadata, business metadata, and technical metadata. Operational metadata provides the schema of the data tables in the data warehouse. Business metadata shows the data sources and the mapping between source data and target data, which has been illustrated in the source-target mapping tables in Part 2. Technical metadata describes the structural information of the target tables and the extraction and loading rules of the data warehouse.

4.1 Customer table metadata
  • Operational:

    custCode (VARCHAR45), postcode (VARCHAR45), name (VARCHAR45), dob (DATE), validUntilDate (DATE), age (INT), version (INT). The age field is used to support age-group queries in the data warehouse. The version field is used to handle historical records.

  • Business:

    Source Systems: Sales System (Customer-Brisbane.txt, Customer-Darwin.txt, Customer-Melbourne.txt, Customer-Sydney.txt)

  • Technical:

    Primary key: custCode, postcode

    Extraction rules: extract the new and modified records weekly

    Loading rules: initial load and incremental load of new records with constructive merge
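
As an illustration only, the metadata above could be realized as the following table definition; SQLite is used here just to make the sketch runnable, and the exact DDL is an assumption rather than part of the original design.

    import sqlite3

    # Sketch: Customer dimension table built from the metadata above (illustrative DDL).
    ddl = """
    CREATE TABLE IF NOT EXISTS Customer (
        custCode        VARCHAR(45) NOT NULL,
        postcode        VARCHAR(45) NOT NULL,
        name            VARCHAR(45),
        dob             DATE,
        validUntilDate  DATE,
        age             INT,
        version         INT,
        PRIMARY KEY (custCode, postcode)
    );
    """

    with sqlite3.connect("fantastic_fireworks_dw.db") as conn:
        conn.execute(ddl)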

4.2 Product table metadata
  • Operational:

    partNumber (VARCHAR45), validFromDate (DATE), orderDate (DATE), description (VARCHAR100), unitPrice (DOUBLE), quantity (INT), unitCost (DOUBLE).

  • Business:

    Source Systems: Inventory System (Product.csv, ProductOrder.csv)

  • Technical:

    Primary key: partNumber, validFromDate, orderDate

    Extraction rules: extract new records weekly

    Loading rules: initial load and incremental appending

4.3 Employee table metadata:
  • Operational:

    employeeID (VARCHAR45), name (VARCHAR45), commissionRate (DOUBLE).

  • Business:

    Source System: Sales System (SalesPerson.csv)

  • Technical:

    Primary key: employeeID

    Extraction rules: extract new and modified records weekly

    Loading rules: initial load and incremental load with destructive merge

4.4 Store table metadata:
  • Operational:

    storeID (INT), description (VARCHAR45), address (VARCHAR45).

  • Business:

    Source System: Sales System (Store.csv)

  • Technical:

    Primary key: storeID

    Extraction rules: extract new and modified records weekly

    Loading rules: initial load and incremental load with destructive merge

4.5 Date table metadata:
  • Operational:

    dateID (INT), Date (DATE), DayofWeek (VARCHAR10), Month (VARCHAR10), Quarter (INT), Year (INT), Day (INT), Season (VARCHAR10).

  • Business:

    Source System: data.xlsx

  • Technical:

    Primary key: dateID

    Extraction rules: extract new records daily

    Loading rules: initial load and append new record daily

4.6 Sales table metadata:
  • Operational:

    OrderID (INT), LineID (INT), Product_partNumber (VARCHAR45), unitSales (INT), salePrice (DOUBLE), Customer_custCode (VARCHAR45), Date_dateID (INT), Store_storeID (INT), Employee_employeeID (VARCHAR45), dollarSales (DOUBLE), margin (DOUBLE). dollarSales records the sales amount of each item line in an order, and margin shows the profit.

  • Business:

    Source System: Sales System(Order.csv, OrderItem.csv)

  • Technical:

    Primary key: OrderID, LineID

    Foreign key: Product_partNumber, Customer_custCode, Date_dateID, Store_storeID, Employee_employeeID

    Extraction rules: extract new records weekly
    Loading rules: initial load and incremental appending.

4.7 Employees table metadata:
  • Operational:

    ID (INT), Employee_employeeID (VARCHAR45), Date_dateID (INT), commission (DOUBLE). ID is an auto-generated surrogate key. Commission is derived by multiplying commissionRate by dollarSales.

  • Business:

    Source System: Sales System (Order.csv, OrderItem.csv, SalesPerson.csv)

  • Technical:

    Primary key: ID

    Foreign key: Employee_employeeID, Date_dateID

    Extraction rules: extract new records monthly
    Loading rules: initial load and incremental appending