Archive for December, 2009

Employment Analysis - BI Solution (Part III)

In this post we will finish the Employment BI project by creating an SSIS project to populate the dimensional model with the data captured in the Excel spreadsheet.

First we will start by slightly modifying the relational schema generated by the automatic schema generator.
The measure fields in the fact table are generated as “Real” which is a 4-byte floating point number. We will need to switch this to “Float” - the 8-byte floating point number.

Go to SQL Server Management Studio, navigate to database “Employment BI”, select table “Employment_Facts”, right-click, select “Design” and change the type of field “Employment” to “float”. Apply the same change to field “Unemployment_Rate” in table “Unemployment_Facts”.
Save the changes and switch back to the Employment BI solution in Business Intelligence Development Studio.

Now we will create a new SSIS project which will load the data into the relational database.
Select File/New/Project, select “Integration Services Project”, name it “SSIS”, select “Add to solution” and click “OK”.

  1. Create a new data flow task in the control flow of the SSIS project.
  2. Assuming that you have installed DataDefractor for SSIS, select the DataDefractor source from the toolbox and drop it on the canvas:
    Drop DataDefractor Source
  3. Double-click on the DataDefractor source component to invoke its wizard driven interface and click “New” to create a new DataDefractor connection manager to connect to the Excel file. Select “DataDefractor.ExcelFile from the list of connection managers and click “Add…”:
    Select DataDefractor Connection Manager
  4. Select employment_report.xls and click Ok:
    Select Excel File
  5. Select “Table-1″, move it to the list on the right by clicking the right-arrow “Add-table” button and click “Next”:
    Step 1
  6. On this step we will cut off the top header of the spreadsheet. We are interested in the area starting with the first row which has text in column C. Select “Vertical Sub-Pages”, Set property “MatchSingleSubpage” to “True” and type in .+ as a regular expression for column C:
    Step 2
    Click “Next” to continue to the next page.
  7. On this step select row 5 as the first row to contain facts and click “Next”:
    Step 3
  8. Select columns C through H as the columns containing facts and add them as a “Flexible Data Range”:
    Step 4
    Click “Next”
  9. We will spend some more time on this page to define the dimensionality of our data source. We will be processing the data of the Employment Facts first. As far as this measure group is concerned, there is only one measure - Employment Rate. So, instead of adding a list of measures from the data source, we simply type “Employment Rate” as the name of the measures:
    Step 5-1
  10. Next we will handle the MSA dimension. Select the “New Dimension” item and rename it to MSA. Then select column “B” and move it to the “Sources” list by clicking the left-arrow:
    Step 5-2
  11. Now we will add MSA’s attributes “Region”, “Sub Region” and “State”. Start by adding a new attribute called “Region”. Double-click column A to select it as the source of items for the attributes, select the “Column:A” source in the “Sources” list and press the “Properties” button. In the source properties dialog box select “Custom regular expression” and type in ^(\S*)
    This regex will match and extract only items which begin with a non-whitespace character.
    Select “Fill blank cells with carryover values” and click OK:
    Step 5-3
  12. Add the “Sub Region” and “State Attributes” following the same steps, but type in the following regular expressions for those attributes:
    For “Sub Region”: ^\s(\S.*)
    For “State”: ^\s{2}(\S.*)
    These regular expressions extract respectively values with a single whitespace before a non-space character and a two-white-spacer before a non-whitespace character.
  13. Create a Time Period dimension, select Row 3 as the source of member names, click “Properties” and use the the following regex to match only time period items: ^(\w{3}-\d{4})
    Step 5-4
  14. Then create the “Year” and “Month” attributes by using the following regexes to extract the appropriate data:
    For “Year” : ^\w{3}-(\d{4})
    For “Month” : ^(\w{3})-\d{4}
  15. We left the easy one for dessert - create an “Employment Category” dimension and select row 1 as the source of member names. No need to perform extractions:
    Step 5-5
    We’re done with the dimensional model - move on by pressing “Next”.
  16. Leave the default selections on page 6 and press Finish:
    Step 6
  17. Add four SQL Server Destination components to the data flow task and name them “Facts”, “MSA”, “Time Period” and “Employment Category”. Bind the DataDefractor outputs to their respective outputs then select each destination, map it to the appropriate relational table in the Employment BI database and map the columns from the DataDefractor output to the appropriate table fields. For example, the ID columns must be mapped to the PK_ and FK_ columns in the destination components:
    DataDefractor Data Flow
    Note the warning signs in the destination - they are caused by the fact that DataDefractor emits 1000 varchar data for the dimension names and attributes, while the schema generation wizard creates shorter fields. You can resolve this by modifying the size of the DataDefractor output columns in the “SSIS Advanced Editor”.
  18. Run the SSIS project to populate the relational model, then deploy the SSAS cube and process it.
    After this, you can browse the Employment part of the cube. Here’s a snapshot of the cube browser:
    Employment cube browser
  19. At this point we add a new SSIS data flow task and process the Unemployment data using another DataDefractor data flow task.
    We end up with two measure groups with one measure per group.
    There is only one thing left to do - make the Unemployment Rate measure aggregate by averaging the data points at lower levels.
    The way we do this is by adding two new measures based on the same data column in the fact table - lets’ call them URC and URS. Then we make the URC measure aggregate using the “Count” aggregation rule and the URS measure use the “Sum” aggregation.
    We hide the two measures, then create a calculated measure called “Unemployment Rate”. The MDX formula for this new measure is:

    iif([Measures].[URC] = 0, Null, [Measures].[URS] / [Measures].[URC])

    This pretty much wraps it up. We now have an analytical model containing employment and unemployment data created off of a seemingly unwieldy Excel source.

Employment Analysis - BI Solution (Part II)

In this part of the tutorial we will decorate the UDM with some attributes and will generate the relational database schema.

We will start with the attributes of the MSA dimension. Looking back at the data source, we see that the MSA dimension (columns A and B) contains three levels of hierarchy - Region (Northeast, Midwest, etc.), Sub-region (New England, Middle Atlantic, etc.), State and the lowest level - MSA. So we have total of four attributes including the key attribute.

We manually add the attributes to the MSA dimension:
MSA Attributes

Then we add the Month and Year attributes to the Time Period dimension:
Time Attributes

The Employment Category dimension is flat - there is no hierarchical structure to it, so we’ll leave this dimension alone.
It is time to generate the schema now.
Follow these steps to get this done:

  1. Launch SQL Server Management Studio, connect to SQL Server Database Services and create a new relational database - we’ll call this database “Employment BI”:
    New Relational Database
  2. Go back to Business Intelligence Development Studio and create a new data source in the SSAS project - right-click on the “Data Sources” icon in the solution explorer and select “New Data Source”. Provide the necessary information to create the data source connection to the “Employment BI” relational database.
    Here’s a snapshot of my settings:
    New Connection
    For impersonation mode I usually select to have SSAS impersonate my windows account by selecting “Use the service account” in the Data Source Wizard, but feel free to select whatever makes sense in the context of your security settings:
    Impersonation Settings
    Go through the Data Source wizard and when prompted, name the data source “Employment BI”.
  3. Launch the Schema Generation Wizard by selecting “Database\Generate Relational Schema…”
  4. On the first screen select “Create a new data source view”, name the new data source “SSAS” and select the “Employment BI” data source:
    Generation Wizard - Page 1
  5. On the next screen uncheck the settings to preserve referential integrity and to preserve data on regeneration:
    Generation Wizard - Page 2
  6. Leave the naming conventions page as is and click “Finish”

The Schema Generator produces the entire relational schema and an SSAS data source view based on the metadata we provided earlier. Here’s a snapshot of the data source view:
DSV
The generator has produced two fact tables: Employment_Facts and Unemployment_Facts and three dimension tables: Employment_Category, Time_Period and MSA.
Note that the schema follows Kimball’s guidelines for keeping the dimensional tables denormalized. In other words the Schema Generator favors denormalized star schema over normalized snowflake schema. In a star schema all the attributes of a dimension are collapsed together in a single table to avoid complex and slow multi-table joins when data is queried.

In the next post we will finish this mini-project by creating an SSIS project to populate the dimensional model with the data captured in the Excel spreadsheet.

Part III >>

Employment Analysis - BI Solution (Part I)

This is the first of a series of posts, which I will use to describe the implementation of a SQL Server 2008 BI solution designed to extract and analyze data captured in a semi-structured data source I discussed in my previous post.

In the course of this exercise, we will:

  • Define a top-down SQL Server Analysis Services (SSAS) Unified Dimensional Model (UDM), which will capture the dimensional structure of the source information.
  • Generate a relational star schema based on the UDM.
  • Develop a SQL Server 2008 Integration Services (SSIS) solution, which will extract the data from the semi-structured spreadsheet and use it to populate the UDM.
  • Revisit the SSIS solution and augment it with maintenance procedures such as updating the UDM with new data feeds.

A typical data integration solution loads data into a preexisting data warehouse schema. For the purposes of these posts, however, we will assume that the data warehouse schema does not exist yet - this will make the solution more thorough (by forcing us to design the schema) and will give us a chance to explore a rarely discussed feature of Analysis Services - the “top-down” schema generator. In case you have a preexisting schema and you need to apply the techniques discussed here, SSIS comes packaged with a toolbox full of data flow transformations and destinations that will help you funnel your data into your particular schema.

Developing the UDM

Let’s once again quickly review the contents of the data source.

It contains employment numbers (in thousands) for two categories of employees: non-agricultural and civilian employees. The numbers are broken down by metropolitan statistical area (MSA) and time period. The report also contains unemployment rates broken by MSA and time period, but not by employment category.

Here’s a screenshot of the original report with highlighted fact areas:

Employment Report Measure Groups

Essentially, this report contains two separate measure groups which share some, but not all of the dimensions reflected in the report:

  • The Employment measure group (the orange highlight above) contains numeric facts which refer to three dimensions: Employment Category and Time Period located in the header rows above the facts, and MSA located in the header columns to the left of the facts.
  • The Unemployment Rate measure group (the green highlight above) which refer to two dimensions: Time Period above and MSA to the left of the facts.

You may wonder why we didn’t highlight PCT CHG (percent change) as part of the facts. The reason is that we do not consider “percent change” a raw fact - it is in fact a calculation derived from the raw facts, the employment rate numbers. The problem with the “percent change” calculation (and many other pre-calculated bits of analysis typically present in data reports) is that they are correct in the current report layout, however, once the data is loaded into a flexible dimensional model, these static pre-calculated data points become wrong as the user starts pivoting the data and viewing it from many different angles, aggregating at different levels, etc.
We will skip PCT CHG while extracting the data from the raw report, but later we will extend our UDM with a flexible Percent Change calculated measure, which will be correct in the context of any pivoted point of view.

Now, let’s build us a UDM.

We will start by creating a new blank Visual Studio 2008 solution which will contain all the projects involved in getting our BI system up and running - we will call this solution Employment BI (at the end of this series, I will post the final solution for download).

We will continue by adding a new SSAS project to the solution - we will call it SSAS:
New SSAS Project

Now, we will create a new cube from scratch. It will not be based on a pre-existing data source schema - we will rather use this cube as the blueprint for creating a new relational schema using the new SSAS “Schema Generation Wizard”. This is usually called a “top-down” approach to building a data warehouse. Even though I am not sure about the practical implications of this approach (database architects, developers and DBAs typically prefer to have full control over the design of their data warehouse schema), I believe it is a very useful tool for educational and demonstration purposes.

The way to create a new cube from scratch is to right-click on the “Cubes” folder in Solution Explorer, and select “New Cube…”. Skip the “Welcome” page and navigate to the “Select Creation Method” page. Select “Generate tables in the data source”. Keep the “Template” selection to “None” - we want to design the cube manually:

New Cube

On the next page of the wizard, we will add the two measures: Employment and Unemployment Rate. Since they have different dimensionality (Unemployment Rate does not relate to the Employment Category dimension), we will create two separate measure groups - one measure group per measure, like this:

Measure groups

Employment is an additive fact - it presents the nominal number (in thousands) of people employed in the particular employment category, time frame and metropolitan area. When this measure is aggregated to a higher level across any of the dimensions, it is summed up. Therefore, we will use Sum as the aggregation rule.

On the other hand, Unemployment Rate is a percent value - it is non-additive. It makes sense to average the unemployment rate when rolling up totals and sub-totals. Unfortunately, Microsoft Analysis Services does not provide such an aggregation rule out-of-the box. However, such an aggregation could be accomplished by a combination of a calculated measure and two hidden physical measures - something I will demonstrate later. For now we will use aggregation “Sum” for Unemployment Rate and we will replace this with the correct solution later.

One more thing before we move on to the next page - the original facts in the report are floating point numbers. We will use Double precision number as the data type for both measures.

On the next page, we get to define the dimensions of our UDM:

SSAS Dimensions

Note that I unchecked the “Time” dimension from the upper list of template dimensions. This special time dimension is useful in those cases where the dimensional model contains semi-additive measures - i.e. measures which are additive across all dimensions but non-additive across the time dimension. An example of a semi-additive measure is the measure “amount” in a typical store inventory snapshot. It is additive across all dimensions such as store or product, but it is averaged across the time period.
In our Employment dimensional model we have only one special measure, Unemployment Rate, which is averaged across all dimensions, including time. This measure is fully non-additive. This is why, we do not need the special time dimension - we do have a Time Period dimension in the report, but for the purposes of our analysis it will suffice to create it as a regular dimension.

We move on to the next page, where we will define the dimension usage per measure group:

Dimension Usage

We set up the dimension usage to reflect the model observed in the raw data source - the Time Period and MSA dimensions are shared between both measure groups, but dimension Employee Category is referred to only by the Employment Facts measure group. Unemployment Facts has no direct relation to Employee Category.

On the last page of the wizard, we can review the main features of our cube and give it a name:

Create Cube

We will leave “Generate schema now” unchecked for now - we will make a few more tweaks to the UDM before we generate the schema.

We press “Finish” on this last page and, voila!, the cube and dimensions are created for us:

Created Cube

In my next post I will continue by decorating the dimensions with attributes, which will make for some interesting analysis. Then, with the help of the SSAS Schema Generation Wizard I will create the relational data warehouse schema. This will set the stage for the most interesting part of our mini-BI solution - the data extraction and loading procedure.

Stay tuned for more!

Part II >>