Importing HTML Tables with DataDefractor

Have you ever needed to scrape the data from a table posted to a website (or otherwise stored in HTML format) and import into a database, data mart or other structured data form? If yes, you’ve probably looked for and found a good number of HTML scraper applications that take care of just that. Unfortunately sifting through the HTML data and normalizing it using a typical HTML(XML)-oriented helper proves to be a lot of work, especially when having to deal with nested and denormalized tables.

On the other hand, you know that DataDefractor is very good at dissecting and normalizing complex tables, but how can we get HTML data in DataDefractor in the first place? The answer is simple - use Excel to acquire the data, save the spreadsheet as an Excel workbook or CSV file and, voilà, you have something to play with in DataDefractor. The trick is to open the HTML file directly with Excel as a spreadsheet as opposed to importing it through the “Get External Data” wizard.

There are several issues you need to address before you can do this. The first one being, Excel is not good at importing HTML if it’s not well structured.

HTML Tidy to the Rescue
HTML Tidy is a nice little program that takes HTML on the input and spits out structured and cleaned up XHTML, which is easily consumed by Excel as a spreadsheet. You can find tips about using HTML Tidy here.

Once the HTML stream is converted to XHTML, you can open it as a spreadsheet with any version of Excel (2000 and up) through File/Open. Excel will recognize the <TABLE> tags and position their contents respectively on the surface of the spreadsheet.
Newer versions of Excel provide a feature called “Get External Data from Web”. This feature provides you with the option of selecting specific tables from the HTML document. However, if you plan to extract data with DataDefractor you are better off opening the entire document as a spreadsheet - DataDefractor’s pattern-based rules will help you define a flexible mapping schema that will accommodate changes in the document.

Another issue is the size of the table (I am referring to the total number of rows and and columns occupied by spreadsheet). In case your HTML document is spread out too much, you can consider using Excel 2007 which can handle much more data than the previous versions.
The new Excel 2007 spreadsheet limits are:
Number of columns per sheet: 16,384
Number of rows per sheet: 1,048,576

You can see some more Excel 2007 numbers here.

Bottom line - if HTML is the source of data targeted by DataDefractor, just funnel it through Excel but be careful with the HTML’s structure - should you have any issues loading it with Excel, use HTML Tidy to tidy it up.

3 Responses to “Importing HTML Tables with DataDefractor”


  1. 1 Melanie

    Hi, do you know how to recover data from a formatted hard disk? I need to recover our data from my external hard drive.

    Yesterday, me and my friends spent the whole day shooting our short film. When we were about to transfer the file to my computer the files were no longer accessible and so as the hard drive. The data has been corrupted. Due to lack of better knowledge when it comes to corrupted files and inaccessible hard drive we all agreed to format hard drive. Now we are all researching on how we can recover our files.

  2. 2 Dorcas Amrich

    Assessing the money flow is one more essential element in the organization technique format, so as to sustain a normal money flow to meet the important capital requirements. Probability of monetary crisis and also the ways of crisis management should be pointed out within the structure. The business strategy should consist from the advertising plans and strategy leading to the expansion from the company.

  3. 3 lainaa 50

    Tarkista luottotiedot ennen kuin haet lainaa - ei ainoastaan tarkistaa luotto-raportti kannella, miten voit katsoa tulevaisuuteen lainanantajia, se näyttää myös mahdollisuuden kerätä tietoja, jotka ovat väärässä ja tuottaa itsellesi parhaat mahdollisuudet on hyväksynyt .

Leave a Reply