I will christen this blog with a brief discussion about semi-structured spreadsheets (or what we, here at DataDefractor.com, call “semi-structured spreadsheets”) - the primary kind of data sources targeted for normalization by DataDefractor.
DataDefractor’s User Guide contains an extensive chapter dedicated to the intricate details of this kind of data sources, so I will try to be concise and you can consider this an introduction to the main problem DataDefractor is designed to solve.
Depending on who you ask, you may get a different definition of the term “semi-structured data”. Some people, for instance regard semi-structured data as data stored in XML or HTML format. Another group defines it as data, which contains free-form text, video or sound information, which others believe is actually “unstructured” data.
For us, folks who make a living by building business intelligence systems and enterprise data warehouses, the most interesting definition of the term is the one given by database theorists:
Semi-structured data is data that exhibits any of the following characteristics:
- No fixed schema (e.g. not relational)
- Structure is implicit and irregular
- Data is nested and heterogeneous
In our data integration efforts, we often encounter data sources, which seem to fit this description. These are typically spreadsheet documents, arbitrary data dumps, business intelligence cross-tab reports, etc.; documents, produced by various processes within and outside of the enterprise. It is a common requirement to augment a data warehouse, data mart and other kinds of data analytical systems with data captured in such semi-structured data sources for the purposes of performing business intelligence.
The problem of course is the fact that being semi-structured, this data is not loadable into the relational BI system right off the bat even with the more advanced ETL tools. Usually it requires some involved, at times even complex scripting exercise, which leads to a limited solution, bound to handle only part of the possible variations of the individual data source.
Let’s take a look at a sample business report and examine the characteristics that would land it into the category of semi-structured data. I downloaded this report from the National Association of Home Builders website:
No fixed schema (e.g. not relational)
The report has no fixed schema (from relational point of view), because it may grow horizontally and because the columns of the report have no fixed type.
The report contains data about the employment rates of specific categories of employees. Since these categories are positioned across the columns of the report, adding more categories does not change the purpose of the report, but it certainly changes the schema (again, in the relational sense of the term).
As for data type, the second through seventh columns contain both text and numbers. The numbers are the facts captured by the report, while the text is the context of these facts, namely category of employees and the time period the facts have been recorded for.
Structure is implicit and irregular
To the human eye, there is clear structure to the report. However, this structure is implicit and parts of it are not contained within the report itself.
For instance the first column contains both geographic regions and sub-totals. In addition to this, a natural hierarchy exists for the regions. We can easily distinguish the geographic regions from the names of the sub-totals. We can also tell which level a geographic region is at.
We make these distinctions based on our implicit knowledge and the visual guides embedded in the report. Our knowledge is obviously external to the report. As for the visual guides, yes, they are part of the report, but these guides alone do not contribute to the formal structure of the document. What’s missing is a system of rules, which explicitly defines this structure.
Data is nested and heterogeneous
Data nesting can be observed in several places on the report. To begin with, the time period is nested within the context of the employee categories on the horizontal axis of the report. The separate levels of geographic regions, on the other hand, are nested within each other on the vertical axis.
The fact that the data is heterogeneous ties back to the no-fixed-schema discussion. Many data columns in the report contain different types of data at the same time - for example text, integer numbers and currency.
Well, this pretty much covers it for now.
See this series of posts, where I am demonstrating the development of a SQL Server 2005 BI system, which extracts the data stored in this semi-structured data source and loads it into a SQL Server Analysis Services UDM.
Next post: Employment Analysis BI Solution - Part I (Building the UDM).
Hi Vassil,
The last 8 days I have been working with DataDefractor - for a customer - to extract data out of Excel sheets and CSV files. I was sceptic in the beginning, but I have to say that you can really extract the desired range of data and getting structured data (in rows and columns) in return.
However I have also noticed that to extract data - in one procesflow - from different Excel files and/or tab’s within Excel files containing the same content, that these files have to be in the same layout/format/structure on which the data range has been defined, otherwise DataDefractor will not give data back (you can see what DataDefractor does by selecting an Excelfile with a slightly different layout… at some point a nested level will not show a range with data).
Do you know if there is a solution for this?
Like to hear from you!
René van Kooten
Great Post! How do you find your ideas? I find a few here and there looking at other articles and my own problems. I discovered rencently random word searches on Google, you can find random words here, http://watchout4snakes.com (no relation). You just frame the search like this Powered by WordPress Random Word. It will bring up other peoples blogs to get inspiration from. This post was a little longer then I was planning on, good luck!
Extremely superb entry, very enlightening stuff. Never thought I’d obtain the information I want in this article. I have been looking all over the web for a while now and was starting to get discouraged. Luckily, I happened onto your page and got exactly what I had been looking for.
Fairly great entry, really educational information. Never thought I would discover the info I need right here. I’ve been hunting throughout the internet for some time now and had been starting to get discouraged. Fortunately, I came onto your website and acquired precisely what I was struggling to find.