Employment Analysis - BI Solution (Part III)

December 11th, 2009

In this post we will finish the Employment BI project by creating an SSIS project to populate the dimensional model with the data captured in the Excel spreadsheet.

First, we will slightly modify the relational schema generated by the automatic schema generator.
The measure fields in the fact tables are generated as “Real”, a 4-byte floating point number. We will need to switch this to “Float”, the 8-byte floating point number.

Go to SQL Server Management Studio, navigate to database “Employment BI”, select table “Employment_Facts”, right-click, select “Design” and change the type of field “Employment” to “float”. Apply the same change to field “Unemployment_Rate” in table “Unemployment_Facts”.
Save the changes and switch back to the Employment BI solution in Business Intelligence Development Studio.

Now we will create a new SSIS project which will load the data into the relational database.
Select File/New/Project, select “Integration Services Project”, name it “SSIS”, select “Add to solution” and click “OK”.

  1. Create a new data flow task in the control flow of the SSIS project.
  2. Assuming that you have installed DataDefractor for SSIS, select the DataDefractor source from the toolbox and drop it on the canvas:
    Drop DataDefractor Source
  3. Double-click the DataDefractor source component to invoke its wizard-driven interface and click “New” to create a new DataDefractor connection manager to connect to the Excel file. Select “DataDefractor.ExcelFile” from the list of connection managers and click “Add…”:
    Select DataDefractor Connection Manager
  4. Select employment_report.xls and click Ok:
    Select Excel File
  5. Select “Table-1”, move it to the list on the right by clicking the right-arrow “Add-table” button and click “Next”:
    Step 1
  6. In this step we will cut off the top header of the spreadsheet. We are interested in the area starting with the first row that has text in column C. Select “Vertical Sub-Pages”, set property “MatchSingleSubpage” to “True” and type in .+ as the regular expression for column C:
    Step 2
    Click “Next” to continue to the next page.
  7. In this step, select row 5 as the first row to contain facts and click “Next”:
    Step 3
  8. Select columns C through H as the columns containing facts and add them as a “Flexible Data Range”:
    Step 4
    Click “Next”
  9. We will spend some more time on this page to define the dimensionality of our data source. We will process the data for the Employment facts first. As far as this measure group is concerned, there is only one measure - Employment. So, instead of adding a list of measure names from the data source, we simply type “Employment” as the name of the measure:
    Step 5-1
  10. Next we will handle the MSA dimension. Select the “New Dimension” item and rename it to MSA. Then select column “B” and move it to the “Sources” list by clicking the left-arrow:
    Step 5-2
  11. Now we will add MSA’s attributes “Region”, “Sub Region” and “State”. Start by adding a new attribute called “Region”. Double-click column A to select it as the source of items for the attribute, select the “Column:A” source in the “Sources” list and press the “Properties” button. In the source properties dialog box select “Custom regular expression” and type in ^(\S*)
    This regex captures the run of non-whitespace characters at the start of a cell, so only items which begin with a non-whitespace character (the region names) produce a value; indented items yield a blank.
    Select “Fill blank cells with carryover values” and click OK:
    Step 5-3
  12. Add the “Sub Region” and “State” attributes following the same steps, but type in the following regular expressions for those attributes:
    For “Sub Region”: ^\s(\S.*)
    For “State”: ^\s{2}(\S.*)
    These regular expressions extract values preceded by exactly one whitespace character and by exactly two whitespace characters, respectively (the sketch at the end of this list demonstrates all of these patterns).
  13. Create a Time Period dimension, select row 3 as the source of member names, click “Properties” and use the following regex to match only time period items: ^(\w{3}-\d{4})
    Step 5-4
  14. Then create the “Year” and “Month” attributes by using the following regexes to extract the appropriate data:
    For “Year” : ^\w{3}-(\d{4})
    For “Month” : ^(\w{3})-\d{4}
  15. We saved the easy one for dessert - create an “Employment Category” dimension and select row 1 as the source of member names. No extractions are needed:
    Step 5-5
    We’re done with the dimensional model - move on by pressing “Next”.
  16. Leave the default selections on page 6 and press Finish:
    Step 6
  17. Add four SQL Server Destination components to the data flow task and name them “Facts”, “MSA”, “Time Period” and “Employment Category”. Bind the DataDefractor outputs to their respective destinations, then select each destination, map it to the appropriate relational table in the Employment BI database and map the columns from the DataDefractor output to the appropriate table fields. For example, the ID columns must be mapped to the PK_ and FK_ columns in the destination tables:
    DataDefractor Data Flow
    Note the warning signs in the destinations - they are caused by the fact that DataDefractor emits varchar(1000) columns for the dimension names and attributes, while the schema generation wizard creates shorter fields. You can resolve this by modifying the size of the DataDefractor output columns in the SSIS “Advanced Editor”.
  18. Run the SSIS project to populate the relational model, then deploy the SSAS cube and process it.
    After this, you can browse the Employment part of the cube. Here’s a snapshot of the cube browser:
    Employment cube browser
  19. At this point we add a new data flow task to the SSIS package and process the Unemployment data using another DataDefractor source.
    We end up with two measure groups with one measure per group.
    There is only one thing left to do - make the Unemployment Rate measure aggregate by averaging the data points at lower levels.
    We do this by adding two new measures based on the same data column in the fact table - let’s call them URC and URS. Then we make the URC measure aggregate using the “Count” aggregation rule and the URS measure use the “Sum” aggregation.
    We hide the two measures, then create a calculated measure called “Unemployment Rate”. The MDX formula for this new measure is:

    iif([Measures].[URC] = 0, Null, [Measures].[URS] / [Measures].[URC])

    This pretty much wraps it up. We now have an analytical model containing employment and unemployment data created from a seemingly unwieldy Excel source.
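
As a quick sanity check of the extraction patterns used in steps 11 through 14, here is a minimal Python sketch (the sample cell values are invented for illustration; DataDefractor applies equivalent regular expression semantics internally):

    import re

    # Attribute patterns from steps 11-12: indentation depth encodes the level.
    region = re.compile(r"^(\S*)")
    sub_region = re.compile(r"^\s(\S.*)")
    state = re.compile(r"^\s{2}(\S.*)")

    print(region.match("Northeast").group(1))         # "Northeast"
    print(region.match(" New England").group(1))      # "" - blank, filled by carryover
    print(sub_region.match(" New England").group(1))  # "New England"
    print(state.match("  Connecticut").group(1))      # "Connecticut"

    # Time Period patterns from steps 13-14.
    period = "Jan-2009"  # hypothetical member name from row 3
    print(re.match(r"^(\w{3}-\d{4})", period).group(1))  # "Jan-2009"
    print(re.match(r"^\w{3}-(\d{4})", period).group(1))  # "2009"
    print(re.match(r"^(\w{3})-\d{4}", period).group(1))  # "Jan"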

Employment Analysis - BI Solution (Part II)

December 11th, 2009

In this part of the tutorial we will decorate the UDM with some attributes and will generate the relational database schema.

We will start with the attributes of the MSA dimension. Looking back at the data source, we see that the MSA dimension (columns A and B) contains three levels of hierarchy - Region (Northeast, Midwest, etc.), Sub-region (New England, Middle Atlantic, etc.) and State - plus the lowest level, the MSA itself. So we have a total of four attributes including the key attribute.

We manually add the attributes to the MSA dimension:
MSA Attributes

Then we add the Month and Year attributes to the Time Period dimension:
Time Attributes

The Employment Category dimension is flat - there is no hierarchical structure to it, so we’ll leave this dimension alone.
It is time to generate the schema now.
Follow these steps to get this done:

  1. Launch SQL Server Management Studio, connect to SQL Server Database Services and create a new relational database - we’ll call this database “Employment BI”:
    New Relational Database
  2. Go back to Business Intelligence Development Studio and create a new data source in the SSAS project - right-click on the “Data Sources” icon in the solution explorer and select “New Data Source”. Provide the necessary information to create the data source connection to the “Employment BI” relational database.
    Here’s a snapshot of my settings:
    New Connection
    For impersonation mode I usually select “Use the service account” in the Data Source Wizard, but feel free to select whatever makes sense in the context of your security settings:
    Impersonation Settings
    Go through the Data Source wizard and when prompted, name the data source “Employment BI”.
  3. Launch the Schema Generation Wizard by selecting “Database\Generate Relational Schema…”
  4. On the first screen select “Create a new data source view”, name the new data source view “SSAS” and select the “Employment BI” data source:
    Generation Wizard - Page 1
  5. On the next screen uncheck the settings to preserve referential integrity and to preserve data on regeneration:
    Generation Wizard - Page 2
  6. Leave the naming conventions page as is and click “Finish”

The Schema Generator produces the entire relational schema and an SSAS data source view based on the metadata we provided earlier. Here’s a snapshot of the data source view:
DSV
The generator has produced two fact tables: Employment_Facts and Unemployment_Facts and three dimension tables: Employment_Category, Time_Period and MSA.
Note that the schema follows Kimball’s guidelines for keeping the dimension tables denormalized. In other words, the Schema Generator favors a denormalized star schema over a normalized snowflake schema. In a star schema, all the attributes of a dimension are collapsed into a single table to avoid complex and slow multi-table joins when the data is queried.

In the next post we will finish this mini-project by creating an SSIS project to populate the dimensional model with the data captured in the Excel spreadsheet.

Part III >>

Employment Analysis - BI Solution (Part I)

December 11th, 2009

This is the first of a series of posts, which I will use to describe the implementation of a SQL Server 2008 BI solution designed to extract and analyze data captured in a semi-structured data source I discussed in my previous post.

In the course of this exercise, we will:

  • Define a top-down SQL Server Analysis Services (SSAS) Unified Dimensional Model (UDM), which will capture the dimensional structure of the source information.
  • Generate a relational star schema based on the UDM.
  • Develop a SQL Server 2008 Integration Services (SSIS) solution, which will extract the data from the semi-structured spreadsheet and use it to populate the UDM.
  • Revisit the SSIS solution and augment it with maintenance procedures such as updating the UDM with new data feeds.

A typical data integration solution loads data into a preexisting data warehouse schema. For the purposes of these posts, however, we will assume that the data warehouse schema does not exist yet - this will make the solution more thorough (by forcing us to design the schema) and will give us a chance to explore a rarely discussed feature of Analysis Services - the “top-down” schema generator. In case you have a preexisting schema and you need to apply the techniques discussed here, SSIS comes packaged with a toolbox full of data flow transformations and destinations that will help you funnel your data into your particular schema.

Developing the UDM

Let’s once again quickly review the contents of the data source.

It contains employment numbers (in thousands) for two categories of employees: non-agricultural and civilian employees. The numbers are broken down by metropolitan statistical area (MSA) and time period. The report also contains unemployment rates broken down by MSA and time period, but not by employment category.

Here’s a screenshot of the original report with highlighted fact areas:

Employment Report Measure Groups

Essentially, this report contains two separate measure groups which share some, but not all of the dimensions reflected in the report:

  • The Employment measure group (the orange highlight above) contains numeric facts which refer to three dimensions: Employment Category and Time Period located in the header rows above the facts, and MSA located in the header columns to the left of the facts.
  • The Unemployment Rate measure group (the green highlight above) contains numeric facts which refer to two dimensions: Time Period above and MSA to the left of the facts.

You may wonder why we didn’t highlight PCT CHG (percent change) as part of the facts. The reason is that we do not consider “percent change” a raw fact - it is a calculation derived from the raw facts, the employment numbers. The problem with the “percent change” calculation (and many other pre-calculated bits of analysis typically present in data reports) is that it is correct in the current report layout; however, once the data is loaded into a flexible dimensional model, these static pre-calculated data points become wrong as the user starts pivoting the data, viewing it from different angles, aggregating at different levels, and so on. The sketch below illustrates the problem.
We will skip PCT CHG while extracting the data from the raw report, but later we will extend our UDM with a flexible Percent Change calculated measure, which will be correct in the context of any pivoted point of view.
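
A tiny numeric illustration in Python of why pre-computed percent changes cannot be aggregated (the employment figures below are invented for the example):

    # Employment (in thousands) for two hypothetical MSAs across two periods.
    prior = {"MSA A": 100.0, "MSA B": 400.0}
    current = {"MSA A": 110.0, "MSA B": 380.0}

    # Per-MSA percent change, as a report would pre-compute it.
    pct = {m: (current[m] - prior[m]) / prior[m] * 100 for m in prior}
    print(pct)  # {'MSA A': 10.0, 'MSA B': -5.0}

    # Averaging the stored percent changes suggests +2.5%...
    print(sum(pct.values()) / len(pct))  # 2.5

    # ...but the true percent change of the combined total is -2.0%.
    total_prior, total_current = sum(prior.values()), sum(current.values())
    print((total_current - total_prior) / total_prior * 100)  # -2.0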

Now, let’s build us a UDM.

We will start by creating a new blank Visual Studio 2008 solution which will contain all the projects involved in getting our BI system up and running - we will call this solution Employment BI (at the end of this series, I will post the final solution for download).

We will continue by adding a new SSAS project to the solution - we will call it SSAS:
New SSAS Project

Now, we will create a new cube from scratch. It will not be based on a pre-existing data source schema - we will rather use this cube as the blueprint for creating a new relational schema using the new SSAS “Schema Generation Wizard”. This is usually called a “top-down” approach to building a data warehouse. Even though I am not sure about the practical implications of this approach (database architects, developers and DBAs typically prefer to have full control over the design of their data warehouse schema), I believe it is a very useful tool for educational and demonstration purposes.

The way to create a new cube from scratch is to right-click on the “Cubes” folder in Solution Explorer, and select “New Cube…”. Skip the “Welcome” page and navigate to the “Select Creation Method” page. Select “Generate tables in the data source”. Keep the “Template” selection to “None” - we want to design the cube manually:

New Cube

On the next page of the wizard, we will add the two measures: Employment and Unemployment Rate. Since they have different dimensionality (Unemployment Rate does not relate to the Employment Category dimension), we will create two separate measure groups - one measure group per measure, like this:

Measure groups

Employment is an additive fact - it presents the nominal number (in thousands) of people employed in the particular employment category, time frame and metropolitan area. When this measure is aggregated to a higher level across any of the dimensions, it is summed up. Therefore, we will use Sum as the aggregation rule.

On the other hand, Unemployment Rate is a percent value - it is non-additive. It makes sense to average the unemployment rate when rolling up totals and sub-totals. Unfortunately, Microsoft Analysis Services does not provide such an aggregation rule out of the box. However, such an aggregation can be accomplished with a combination of a calculated measure and two hidden physical measures - something I will demonstrate later, and something the numeric sketch below previews. For now we will use the “Sum” aggregation for Unemployment Rate and replace it with the correct solution later.
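
Here is that sketch in Python (the rates and group sizes are invented): keeping a hidden sum and a hidden count and dividing at query time yields a correctly weighted average at every level, whereas averaging pre-averaged values does not.

    # Unemployment rates for hypothetical MSAs, grouped by region.
    regions = {
        "Northeast": [4.0, 6.0, 8.0],  # three MSAs
        "Midwest": [10.0],             # one MSA
    }

    # Naive approach: average the per-region averages.
    region_avgs = [sum(r) / len(r) for r in regions.values()]
    print(sum(region_avgs) / len(region_avgs))  # 8.0 - wrong

    # Sum/count approach (what the hidden URS and URC measures implement later).
    urs = sum(sum(r) for r in regions.values())  # 28.0
    urc = sum(len(r) for r in regions.values())  # 4
    print(urs / urc)                             # 7.0 - correctly weighted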

One more thing before we move on to the next page - the original facts in the report are floating point numbers. We will use Double precision number as the data type for both measures.

On the next page, we get to define the dimensions of our UDM:

SSAS Dimensions

Note that I unchecked the “Time” dimension from the upper list of template dimensions. This special time dimension is useful in those cases where the dimensional model contains semi-additive measures - i.e. measures which are additive across all dimensions but non-additive across the time dimension. An example of a semi-additive measure is the measure “amount” in a typical store inventory snapshot. It is additive across all dimensions such as store or product, but it is averaged across the time period.
In our Employment dimensional model we have only one special measure, Unemployment Rate, which is averaged across all dimensions, including time. This measure is fully non-additive. This is why we do not need the special time dimension - we do have a Time Period dimension in the report, but for the purposes of our analysis it will suffice to create it as a regular dimension.

We move on to the next page, where we will define the dimension usage per measure group:

Dimension Usage

We set up the dimension usage to reflect the model observed in the raw data source - the Time Period and MSA dimensions are shared between both measure groups, but the Employment Category dimension is referred to only by the Employment Facts measure group. Unemployment Facts has no direct relation to Employment Category.

On the last page of the wizard, we can review the main features of our cube and give it a name:

Create Cube

We will leave “Generate schema now” unchecked for now - we will make a few more tweaks to the UDM before we generate the schema.

We press “Finish” on this last page and, voila!, the cube and dimensions are created for us:

Created Cube

In my next post I will continue by decorating the dimensions with attributes, which will make for some interesting analysis. Then, with the help of the SSAS Schema Generation Wizard I will create the relational data warehouse schema. This will set the stage for the most interesting part of our mini-BI solution - the data extraction and loading procedure.

Stay tuned for more!

Part II >>

Importing HTML Tables with DataDefractor

October 23rd, 2007

Have you ever needed to scrape the data from a table posted to a website (or otherwise stored in HTML format) and import it into a database, data mart or other structured data form? If so, you’ve probably looked for and found a good number of HTML scraper applications that take care of just that. Unfortunately, sifting through the HTML data and normalizing it using a typical HTML(XML)-oriented helper proves to be a lot of work, especially when dealing with nested and denormalized tables.

On the other hand, you know that DataDefractor is very good at dissecting and normalizing complex tables, but how can we get HTML data into DataDefractor in the first place? The answer is simple - use Excel to acquire the data, save the spreadsheet as an Excel workbook or CSV file and, voilà, you have something to play with in DataDefractor. The trick is to open the HTML file directly with Excel as a spreadsheet as opposed to importing it through the “Get External Data” wizard.

There are several issues you need to address before you can do this. The first is that Excel is not good at importing HTML that is not well structured.

HTML Tidy to the Rescue
HTML Tidy is a nice little program that takes HTML on the input and spits out structured and cleaned up XHTML, which is easily consumed by Excel as a spreadsheet. You can find tips about using HTML Tidy here.
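
If you want to script the cleanup step, a minimal Python sketch along these lines should work (it assumes the tidy executable is on your PATH; the file names are placeholders):

    import subprocess

    # Convert messy HTML into XHTML that Excel can open directly as a
    # spreadsheet. HTML Tidy exits with code 1 for warnings and 2 for
    # errors, so only an exit code of 2 is treated as a failure here.
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml", "-o", "report.xhtml", "report.html"],
        capture_output=True, text=True)
    if result.returncode == 2:
        raise RuntimeError("HTML Tidy failed:\n" + result.stderr)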

Once the HTML stream is converted to XHTML, you can open it as a spreadsheet with any version of Excel (2000 and up) through File/Open. Excel will recognize the <TABLE> tags and position their contents respectively on the surface of the spreadsheet.
Newer versions of Excel provide a feature called “Get External Data from Web”. This feature provides you with the option of selecting specific tables from the HTML document. However, if you plan to extract data with DataDefractor you are better off opening the entire document as a spreadsheet - DataDefractor’s pattern-based rules will help you define a flexible mapping schema that will accommodate changes in the document.

Another issue is the size of the table (I am referring to the total number of rows and columns occupied by the spreadsheet). In case your HTML document is spread out too much, you can consider using Excel 2007, which can handle much more data than previous versions.
The new Excel 2007 spreadsheet limits are:
Number of columns per sheet: 16,384
Number of rows per sheet: 1,048,576

You can see some more Excel 2007 numbers here.

Bottom line - if HTML is the source of data targeted by DataDefractor, just funnel it through Excel but be careful with the HTML’s structure - should you have any issues loading it with Excel, use HTML Tidy to tidy it up.

Loading complex Excel spreadsheets into SAP NetWeaver BI

July 9th, 2007

Summary

This article should help you get a quick start on using DataDefractor Solo to load complex spreadsheets into SAP NetWeaver 2004s BI (formerly known as SAP BW). It illustrates the process of converting a financial Excel spreadsheet into a complete structured data model and then loading it into an SAP BI InfoCube.

Applies to

SAP NetWeaver 2004s BI (formerly SAP BW) and DataDefractor Solo.

Introduction

It might be useful in certain customer scenarios to extract data from complex business spreadsheet reports and load it into SAP BI InfoCubes for special data analysis.

This sample is based on a financial report provided by Freddie Mac, which is freely available on Freddie Mac’s website. It is also available in the DataDefractor Solo samples folder. Therefore, it should be possible for every DataDefractor and SAP NetWeaver 2004s BI (SAP BI) customer to set up the same environment that was used here.

For the purposes of this example, SAP BI and DataDefractor Solo were deployed to a single Windows Server 2003 machine. The connection point between DataDefractor Solo and SAP BI is a set of flat files. Therefore, it is also possible to extract Excel data with DataDefractor on a Windows machine and then load it into SAP BI deployed to a UNIX machine.

Source spreadsheet

Freddie Mac Paged
The source spreadsheet is a financial report provided by Freddie Mac. It contains average U.S. national mortgage rates for 17 years, broken down by month.

The data model captured in the spreadsheet contains numeric mortgage rate facts recorded in the context of a time period. Let’s look at the contents and the layout of the spreadsheet.

Data Facts

Freddie Mac Facts
The facts are concentrated in several groups of numbers - one group per year. There are two columns of numbers for each month of each year - one column for the mortgage rates and one for the discount points that guarantee the rate.

Time Period

Freddie Mac Time
The time period of the recorded facts consists of a year and a month of the year. The years of the facts are located in the header rows above each group of facts, while the months are located in the header column to the left of each group of years.

Product

Freddie Mac Product
And finally, the product that is being sold in the particular time period for the particular rate and discount points is located in the global header of the report - “CONVENTIONAL CONFORMING 15-YEAR FIXED-RATE MORTGAGE”.

Defining the data model

In this section we will use DataDefractor Solo Wizard to map out the data model captured in the mortgage rate report. We will draw the layout of the spreadsheet, define the location of the facts and define the dimensionality of the model. Once we’re done, we will use this mapping to convert the spreadsheet to a set of CSV files organized as a standard Kimball-style star schema ready to be loaded in SAP BI.

Step 1: Choosing a data source

Freddie Mac Page 1 - Data Source
The first step is to launch the DataDefractor Solo Wizard. You can start it by either double-clicking the DataDefractor Solo Wizard icon on your desktop or by navigating the following path on your “Start” menu:
Start\All Programs\Interactive Edge\DataDefractor Solo\DataDefractor Solo Wizard

The first page of the wizard asks you to specify the location of the data source.

Follow these steps:

  1. Select Excel radio button - this will launch an “Open File” dialog box.
  2. Navigate to the following Excel file:
    \Program Files\Interactive Edge\DataDefractor Solo\Samples\Freddie Mac\Freddie Mac.xls
    and select “Open”.
  3. Select Use all check box. This will instruct DataDefractor to use all the worksheets available in the workbook.
  4. Click Next to move to the next page of the wizard.

Step 2: Mapping out the data layout

Freddie Mac - Page 2 - Data Source Layout
In this step we will define the data layout of the spreadsheet. We will start by isolating the header of the worksheet, then we will slice the worksheet into vertical sub-pages, and then slice those pages further into a collection of low-level horizontal sub-pages which contain clean fact data surrounded, but not interrupted, by context information.

Follow these steps:

  1. Select the Header/Footer radio button - this will isolate the header above the data area. A “Drill Down” reminder tool-tip balloon may pop up - close it.
  2. Select the drill-down button Drill-Down Button to drill down into the data area. We can now slice this data area into vertical sub-pages.
  3. Select Vertical Sub-Pages. This will add a pattern row above the data area:
    Vertical Pattern
    We will use this pattern row to define a border-recognition pattern, which will split the data area into separate vertical pages.
  4. The spreadsheet is logically partitioned into several vertically stacked sub-pages. Each page contains a group of seven years. Looking at the report, we can deduce that every page begins at a row which contains a year in column B. Click on the pattern entry below column B, type “^\d{4}” (without the quotes) and press Enter:
    Freddie Mac Layout - Vertical year pages
  5. Select the drill-down button Drill-Down Button to drill down again, this time into the vertical sub-pages. We will slice them further into horizontal sub-pages - one horizontal sub-page per year.
  6. Select Horizontal Sub-Pages. This will add a pattern column to the left of the data area. We will use this pattern column to define a border-recognition pattern, which will split the data area into separate horizontally positioned pages.
  7. We see that each horizontal page begins with a year in row 1. Click on the pattern entry to the left of row 1, type “^\d{4}” (without the quotes) and press Enter:
    Freddie Mac - Layout Horizontal Years
  8. Each horizontal sub-page begins with the year, but if we drill down, we will lose the month column. We need to carry it down into the new horizontal pages so that we can build a correct data model, which will consist of atomic low level virtual spreadsheets that contain facts and are surrounded by complete context information.
    Select property CarryoverColumns in the property grid below the layout radio buttons, type in 1 and press Enter:
    Freddie Mac Data Layout - Carry Over
  9. Select the drill-down button Drill-Down Button again, to drill down into the newly defined horizontal sub-pages:
    Freddie Mac - Horizontal Pages
  10. The new horizontal pages contain the lowest level of facts, the year and the month. We will move on to eliminating the extra blank columns to the right of the facts.
    We need to eliminate all the columns to the right of the last column, which has a non-empty cell in row 2:

    * Select Horizontal Sub-Pages
    * Set MatchSingleSubPage property to True
    * Set PatternDirection property to RightToLeft
    * In the pattern for row 2 type in the following regular expression (without the quotes): “.+”
    (This regular expression matches any cell which contains at least one character)
    Freddie Mac Layout - Horizontal Pages

  11. Drill down into the new pages.
  12. The only thing left to do is eliminate the footer of the virtual spreadsheets. This footer includes pre-calculated Annual Averages, which we don’t need in the data warehouse. Follow these steps to eliminate all the rows past the last row which contains the text “December” in column A:
    * Select Vertical Sub-Pages
    * Set MatchSingleSubPage property to True
    * Set PatternDirection property to BottomToTop
    * In the pattern for column A type in “December” (without the quotes)
    Freddie Mac - Eliminate Annual Averages

  13. We are done with defining the spreadsheet layout. Drill down into the new pages and review them by clicking on the tabs below the grid:
    Grid Tabs
    You should see a set of virtual spreadsheets with a uniform layout featuring twelve months in column A, mortgage rate facts in column B, discount points in column C and a year in cell B1:
    Freddie Mac Virtual Spreadsheet 1 Freddie Mac Virtual Spreadsheet 2

  14. We dissected the spreadsheet into smaller pieces and continued drilling into them until we reached a repetitive, uniform data layout which contains a core of fact data surrounded by complete context information. Now we are ready to move on and specify the precise location of that fact area and its context as it appears on all these virtual spreadsheets. (The short sketch after this list shows the border-recognition idea in miniature.)
    Click Next to move to the next page of the wizard.
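
To make the border-recognition idea concrete, here is a toy Python sketch of how a pattern like ^\d{4} splits rows into sub-pages (the cell values are invented stand-ins for the real report):

    import re

    # Each row is [column A, column B, ...]; a new vertical sub-page starts
    # at every row whose column-B cell matches ^\d{4}.
    rows = [
        ["", "1991", "1992"],         # year header row -> opens sub-page 1
        ["January", "9.64", "9.03"],
        ["February", "9.37", "8.76"],
        ["", "1993", "1994"],         # next year header -> opens sub-page 2
        ["January", "8.07", "7.06"],
    ]

    pages = []
    for row in rows:
        if re.match(r"^\d{4}", row[1]):
            pages.append([])          # border hit: open a new sub-page
        pages[-1].append(row)

    print(len(pages))  # 2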

Step 3: Specifying the fact area

The fact area is indicated by picking the first row that contains facts and specifying the collection of columns that contain fact data.

Follow these steps:

  1. Click somewhere on row 4 in the grid to select it as the first row that contains facts:
    Freddie Mac - First Fact Row
  2. Click Next to move to the next page of the wizard.
  3. On page Fact Source Columns, select columns B and C by clicking on the header of column C and dragging the mouse pointer over to the header of column B.
  4. Click button “Add Selected Columns” Add Selected Columns Button to add the selected column range to the list of column ranges, which contain facts.
  5. Select Flexible Range when prompted to select the type of column range to add to the list. This instructs DataDefractor to consider all the columns that appear to the right of column A as columns containing facts. This will accommodate spreadsheets which may contain more than two measures.
    Click OK to confirm the column range type:
    Freddie Mac Fact Column Range
  6. Click Next to move to the next page of the wizard.

Step 4: Defining the dimensional model

In this step we will define the dimensional model in terms of dimensions and attributes. We will also specify the location of each dimension and attribute as well as the location of measure names.

Follow these steps:

  1. Select item Measures in the axis tree on the left-hand side of the Axes Definition page.
  2. Select row 2 in the grid by clicking on the header of row 2.
  3. Click button “Add selected grid cells” Add Selected Columns Button to add this row as the source of measure names:
    Freddie Mac - Measures
  4. Select item New Dimension in the axis tree, click button Rename below the grid and type in Time; press Enter.
  5. While item Time is selected, select column A by clicking on its header.
  6. Click button Add Selected Columns Button to make the months represent the set of members for the time dimension.
  7. Click on cell B1 in the grid and click button Add Selected Columns Button to append the year to the names of the months. This will make the members of the time dimension unique across all virtual spreadsheets.
  8. Select Position when prompted to select the type of column reference:
    Freddie Mac - Time Dimension
  9. While dimension Time is selected, click button New Attribute below the grid and type Year; press Enter.
  10. While the new attribute Year is selected, click on cell B1 in the grid and click button Add Selected Columns Button to make the contents of cell B1 appear as the Year attribute of the members of the Time dimension; select Position when prompted for the type of column reference.
  11. Click button New Attribute again and type Month; press Enter.
  12. While attribute Month is selected, select column A by clicking on its header and click button Add Selected Columns Button to make the respective cell of column A appear as the Month attribute of each member of the Time dimension.
    Freddie Mac - Time attributes
  13. Click button New Dimension below the grid and type Product; press Enter.
  14. Click the drill-up button Drill Up Button five times until you zoom out to Level 1 of the spreadsheet.
  15. Click on cell B2 in the grid and click button Add Selected Columns Button to make the contents of cell B2 appear as the Product dimension; select Position when prompted for the type of column reference.
  16. The actual name of the product is a substring of the contents of cell B2 - the string that appears before the word “SINCE”.
    To extract this substring, select source Level 1: Cell:B2 in the list of Sources and click button Properties below that list.
  17. In the newly invoked Define Extraction dialog box, select extraction Custom regular expression and type in the following expression (without the quotes): “(.*)\sSINCE” - a sketch of this extraction appears after this list:
    Freddie Mac - Product Name extraction
  18. Click OK to confirm the custom extraction rule:
    Freddie Mac - Product
  19. We’ve completed the dimensional model of the spreadsheet.
    Click Next to move to the next page of the wizard.
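
As a quick check of that extraction rule, a minimal Python sketch (the cell text is a hypothetical stand-in for the actual contents of B2):

    import re

    cell_b2 = "CONVENTIONAL CONFORMING 15-YEAR FIXED-RATE MORTGAGE SINCE 1991"
    match = re.match(r"(.*)\sSINCE", cell_b2)
    print(match.group(1))  # "CONVENTIONAL CONFORMING 15-YEAR FIXED-RATE MORTGAGE"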

Step 5: Defining measure data types

Freddie Mac - Measure Type
In this step we will define the data types of our measures. Both the interest rate and the discount points must be of type Float. Since this is the default type for all the measures detected by DataDefractor, we will simply verify the data type.

  1. Click on the name of each measure and make sure that property FactsType below the list of measures is set to Float.
  2. Click Next to move to the next page of the wizard.

Step 6: Setting output options and processing the spreadsheet

Freddie Mac - Output Options
This is the last page of the wizard. Here we will specify the type of output we need and where to generate it. We will also select some options which will make the output easier to load into SAP BI.

Follow these steps:

  1. Select radio button Star.
  2. Select field Files Prefix and type FreddieMacOutput.
  3. Select field Files Location and type output. This will create a new sub-folder output in the folder where the Excel file is located.
  4. Check check-box Include member names in fact table.
  5. The data mapping schema for converting Freddie Mac spreadsheets into a dimensional model is now complete. Click Save schema… to save the mapping schema for modifications or to be able to schedule batch processing of Freddie Mac Excel workbooks.
  6. Click Finish to process the current spreadsheet.

At this point, DataDefractor Solo processes the input spreadsheet, applying the dimensional model and layout rules defined in the DataDefractor data mapping schema. When it is done processing, DataDefractor displays the results:
Freddie Mac - Results Dialog

Click on Open Destination Folder… to navigate to the folder where the output was generated.
The output is a classical Kimball-style star schema, which consists of three CSV files - one for the facts and one for each dimension:

  • FreddieMacOutput.Facts.csv - contains a normalized view of the facts extracted from the input spreadsheet. The measures are stored in separate columns. Each row in the file also contains dimension names and surrogate keys which can be used as foreign keys linked to the Time and Product dimension tables discussed below.
  • FreddieMacOutput.Time.csv - contains the keys and attributes of the Time dimension.
  • FreddieMacOutput.Product.csv - contains the keys and attributes of the Product dimension.
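
To see how the three files fit together, here is a minimal Python sketch that joins the fact file to the Time dimension on the surrogate key (it assumes the CSV files have header rows; the column names follow the DataSource field mappings used later in this article):

    import csv

    # Load the Time dimension keyed by its surrogate key.
    with open("FreddieMacOutput.Time.csv", newline="") as f:
        time_dim = {row["MEMBERID"]: row for row in csv.DictReader(f)}

    # Each fact row carries TIMEID as a foreign key into the Time dimension.
    with open("FreddieMacOutput.Facts.csv", newline="") as f:
        for fact in csv.DictReader(f):
            t = time_dim[fact["TIMEID"]]
            print(t["YEAR"], t["MONTH"], fact["PRODUCT"],
                  fact["RATE"], fact["POINTS"])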

Loading the data into SAP BI

In this section we will load the output generated by DataDefractor Solo into SAP BI. We will use SAP Data Warehousing Workbench to build a data warehouse and to populate it with the Freddie Mac data.
Here is the list of procedures we will go through to accomplish this task:

  1. Build a dimensional model (InfoCube).
  2. Load the DataDefractor output into a Persistent Stage Area (PSA).
  3. Transfer/transform the data from PSA into the InfoCube.

Note: The following sections contain steps that create many SAP objects, some of which may need to be “activated” before they can be used. The steps to “activate” these objects have been omitted from this tutorial for brevity.

Step 1: Building the dimensional data model

Dimensional data modeling defines business entities such as cubes, conformed dimensions, attributes, measures, etc. It gives diverse source data semantic meaning for the business user.
In SAP BI, the terms used to describe a dimensional model are:

  • InfoCube - a multidimensional data model, which contains facts that can be queried, pivoted and aggregated according to criteria defined by means of selecting dimension members on the various axes of the cube.
  • Characteristic - equivalent to a conformed dimension. Used to query one or more InfoCubes.
  • Navigation Attribute - equivalent to a conformed dimension’s attribute. Used to navigate and filter the members of a characteristic.
  • Key Figure - equivalent to a measure.

Freddie Mac SAP BW Data Model
In this example, we will define the data model as one InfoCube FREDDIEMAC which consists of two characteristics - PRODUCT and TIME. The InfoCube also contains two key figures - POINTS and RATE. The TIME characteristic has two navigation attributes - YEAR and MONTH.

Follow these steps to build the model:

  1. Log into SAP BI using SAP GUI.
  2. Launch SAP Data Warehousing Workbench.
  3. Create a new InfoArea called IAFRED.
  4. In this area, create a new InfoCatalog called ICFRED.
  5. In this catalog create a new characteristic InfoObject and call it YEAR. Make its data type CHAR(8).
    Freddie Mac - YEAR Navigation Attribute
  6. Create another characteristic InfoObject in catalog ICFRED - name this characteristic MONTH. Make its data type CHAR(16).
  7. Create another characteristic InfoObject and name it TIME. Make its data type CHAR(16). Navigate to tab Attribute and add the YEAR and MONTH attributes to the list of attributes of the TIME InfoObject. Turn on the navigation option for both attributes:
    Freddie Mac Time Navigation Attributes
  8. Create a PRODUCT InfoObject characteristic in catalog ICFRED. Make its data type CHAR(60).
  9. Create a new InfoCatalog - ICKFFRED.
  10. In this catalog, create a new key figure InfoObject - RATE. Choose base aggregation SUM. Select exception aggregation AVERAGE on characteristic TIME. Make this key figure’s data type FLOAT:
    Freddie Mac SAP RATE Key Figure
  11. Create another key figure InfoObject in the same catalog - call it POINTS. Choose base aggregation SUM. Select data type FLOAT.
  12. Navigate to InfoProviders and create a new InfoCube FREDDIEMAC in InfoArea IAFRED. Follow these steps to build the InfoCube:
    * Define dimension PRODUCT and assign InfoObject PRODUCT to this dimension using Direct Input.
    * Define dimension TIME and assign InfoObject TIME to this dimension using Direct Input.
    * Using Direct Input, assign key figure InfoObject RATE to the list of InfoCube key figures.
    * Using Direct Input, assign key figure InfoObject POINTS to the list of InfoCube key figures.
    Freddie Mac - SAP InfoCube

Step 2: Loading Freddie Mac data into Persistent Staging Area

In this step we will load the input data into a Persistent Staging Area (PSA). PSA is an intermediate data storage area where data is collected before it is used to populate the dimensional data model.

Follow these steps:

  1. Navigate to DataSources.
  2. Create a new Application Component called ACFRED.
  3. Create a new DataSource FREDFACT of type Transaction Data and map it to file FreddieMacOutput.Facts.csv located in the folder used by DataDefractor Solo to generate the Freddie Mac star schema.
    Use the following data type mapping for the fields in this file:
    * TIMEID - INT2
    * TIME - CHAR(14)
    * PRODUCTID - INT2
    * PRODUCT - CHAR(45)
    * RATE - CHAR(10)
    * POINTS - CHAR(5)
    Freddie Mac - SAP Fact Data Source
  4. Create InfoPackage FREDFACTPKG under DataSource FREDFACT.
    Freddie Mac - SAP InfoPackage
  5. Navigate to tab Schedule and then execute the package to load the data into PSA.
  6. Create a new DataSource FREDTIME of type Master Data Attributes and map it to file FreddieMacOutput.Time.csv located in the Freddie Mac star schema folder.
    Use the following data type mapping for the fields in this file:
    * MEMBERID - INT2
    * NAME - CHAR(14)
    * YEAR - CHAR(5)
    * MONTH - CHAR(9)
  7. Create InfoPackage FREDTIMEPKG under DataSource FREDTIME.
  8. Navigate to tab Schedule and then execute the package to load the data into PSA.
  9. Create a new DataSource FREDPROD of type Master Data Attributes and map it to file FreddieMacOutput.Product.csv located in the Freddie Mac star schema folder.
    Use the following data type mapping for the fields in this file:
    * MEMBERID - INT2
    * NAME - CHAR(45)
  10. Create InfoPackage FREDPRODPKG under DataSource FREDPROD.
  11. Navigate to tab Schedule and then execute the package to load the data into PSA.

Step 3: Populating the dimensional data model

In this step we will populate the FREDDIEMAC InfoCube fact and master data with the data stored in PSA. To do this, we will define some transformations to transform, transfer and clean the data as it flows from PSA to the data targets.

Follow these steps:

  1. Under DataSource FREDFACT create a new InfoCube transformation targeting InfoCube FREDDIEMAC.
  2. Map the data source fields to the InfoObjects in the InfoCube by dragging lines from the data source to the InfoCube, following this mapping:
    * TIME to TIME
    * PRODUCT to PRODUCT
    * RATE to RATE
    * PTS to POINTS
    Replace the Rule Type of the TIME transformation with the following formula (without the quotes): “TOUPPER(TIME)”.
    Freddie Mac SAP Fact Transformation
  3. Under DataSource FREDFACT create a new Data Transfer Process.
  4. Navigate to tab Execute and click Execute to populate the key figures of the InfoCube:
    Freddie Mac - SAP Fact Transfer
  5. Under DataSource FREDTIME create a new InfoObject transformation targeting InfoObject TIME.
  6. Map the data source fields to the attributes in the InfoObject by dragging lines from the data source to the InfoObject, following this mapping:
    * MEMBERNAME to TIME
    Rule Type: Formula “TOUPPER(MEMBERNAME)”
    * YEAR to YEAR
    Rule Type: Formula “TOUPPER(YEAR)”
    * MONTH to MONTH
    Rule Type: Formula “TOUPPER(MONTH)”
  7. Under DataSource FREDTIME create a new Data Transfer Process.
  8. Navigate to tab Execute and click Execute to populate the TIME master data.
  9. Under DataSource FREDPROD create a new InfoObject transformation targeting InfoObject PRODUCT.
  10. Map the data source fields to the attributes in the InfoObject by dragging lines from the data source to the InfoObject, following this mapping:
    * MEMBERNAME to PRODUCT
    Rule Type: Formula “TOUPPER(MEMBERNAME)”
  11. Under DataSource FREDPROD create a new Data Transfer Process.
  12. Navigate to tab Execute and click Execute to populate the PRODUCT master data.

Step 4: Verifying the dimensional data model

Now that we’ve populated the InfoCube with facts and master data extracted from the Freddie Mac financial report, it is time to verify the results.

Follow these steps to verify the InfoCube contents:

  1. Navigate to InfoProviders, right-click on InfoCube FREDDIEMAC and select Display Data.
  2. Uncheck check-box Do not use any conversion.
  3. Choose Field selection for output and select the following fields:
    * PRODUCT
    * TIME
    * MONTH [TIME]
    * YEAR [TIME]
    * POINTS
    * RATE
  4. Click Execute twice.
    You should see the following query result:
    Freddie Mac - SAP InfoCube Result

The InfoCube is ready and populated with data, which can be analyzed using various front-end applications including NetWeaver Enterprise Portal, BEx Analyzer, BEx Web Analyzer, etc. Additionally, with the help of the DataDefractor Solo Batch Command, the extraction and transformation of future Freddie Mac reports can be automated and scheduled to run periodically. Furthermore, the loading of DataDefractor output into SAP can be automated with the help of an SAP BI Process Chain.

“Employment Analysis” BI Solution - Part I (Building the UDM)

May 2nd, 2007

This is the first of a series of posts, which I will use to describe the implementation of a SQL Server 2005 BI solution designed to extract and analyze data captured in a semi-structured data source I discussed in my previous post.

In the course of this exercise, we will:

  • Define a top-down SQL Server Analysis Services (SSAS) Unified Dimensional Model (UDM), which will capture the dimensional structure of the source information.
  • Generate a relational star schema based on the UDM.
  • Develop a SQL Server 2005 Integration Services (SSIS) solution, which will extract the data from the semi-structured spreadsheet and use it to populate the UDM.
  • Revisit the SSIS solution and augment it with maintenance procedures such as updating the UDM with new data feeds.

Read the rest of this entry »

Semi-structured data spreadsheets

April 16th, 2007

I will christen this blog with a brief discussion of what we here at DataDefractor.com call “semi-structured spreadsheets” - the primary kind of data source targeted for normalization by DataDefractor.

DataDefractor’s User Guide contains an extensive chapter dedicated to the intricate details of this kind of data source, so I will try to be concise - consider this an introduction to the main problem DataDefractor is designed to solve.
Read the rest of this entry »

Hello world!

March 30th, 2007

Welcome to the DataDefractor blog.
We will try to make this a place where you can learn more about normalizing and extracting business data from semi-structured data sources, such as Excel worksheets, CSV files and flat files, for the purposes of performing deep-dive analysis of this data. We will also cover both simple and advanced features of DataDefractor as a product.