Archive for October, 2007

Importing HTML Tables with DataDefractor

Have you ever needed to scrape the data from a table posted to a website (or otherwise stored in HTML format) and import into a database, data mart or other structured data form? If yes, you’ve probably looked for and found a good number of HTML scraper applications that take care of just that. Unfortunately sifting through the HTML data and normalizing it using a typical HTML(XML)-oriented helper proves to be a lot of work, especially when having to deal with nested and denormalized tables.

On the other hand, you know that DataDefractor is very good at dissecting and normalizing complex tables, but how can we get HTML data in DataDefractor in the first place? The answer is simple - use Excel to acquire the data, save the spreadsheet as an Excel workbook or CSV file and, voilĂ , you have something to play with in DataDefractor. The trick is to open the HTML file directly with Excel as a spreadsheet as opposed to importing it through the “Get External Data” wizard.

There are several issues you need to address before you can do this. The first one being, Excel is not good at importing HTML if it’s not well structured.

HTML Tidy to the Rescue
HTML Tidy is a nice little program that takes HTML on the input and spits out structured and cleaned up XHTML, which is easily consumed by Excel as a spreadsheet. You can find tips about using HTML Tidy here.

Once the HTML stream is converted to XHTML, you can open it as a spreadsheet with any version of Excel (2000 and up) through File/Open. Excel will recognize the <TABLE> tags and position their contents respectively on the surface of the spreadsheet.
Newer versions of Excel provide a feature called “Get External Data from Web”. This feature provides you with the option of selecting specific tables from the HTML document. However, if you plan to extract data with DataDefractor you are better off opening the entire document as a spreadsheet - DataDefractor’s pattern-based rules will help you define a flexible mapping schema that will accommodate changes in the document.

Another issue is the size of the table (I am referring to the total number of rows and and columns occupied by spreadsheet). In case your HTML document is spread out too much, you can consider using Excel 2007 which can handle much more data than the previous versions.
The new Excel 2007 spreadsheet limits are:
Number of columns per sheet: 16,384
Number of rows per sheet: 1,048,576

You can see some more Excel 2007 numbers here.

Bottom line - if HTML is the source of data targeted by DataDefractor, just funnel it through Excel but be careful with the HTML’s structure - should you have any issues loading it with Excel, use HTML Tidy to tidy it up.