Automatically generating Web page scraper templates

Most Web developers are aware of tools that scrape data from Web pages, including from Search Engine Results Pages (SERPs), but such tools usually have a big drawback: they require you to painstakingly model the page structure of each target website by manual inspection. Instead of doing that, we built a tool that auto-generates the templates, enabling webpage scraping without any human design step. We call it SERPy, which stands for Search Engine Result Parser. I’m going to explain how it works, as an example of how to scale semantic technology without an army of knowledge engineers constantly configuring things.

A truism in scaling any technology is not to re-invent the wheel when you don’t have to. Accordingly, SERPy runs on top of a third-party module called the Document Profiling Service (DPS), which analyzes documents sent to it via software as a service (SaaS). The DPS returns metadata about the document such as Simple Dublin Core tags, Cartesian coordinates of words and sentences, spam and porn probability factors, and similar information about the page. With these elements in hand, SERPy is ready to go. You can give SERPy any search engine results page, and it will locate the URI, title, and excerpt for each entry, along with the start and stop offsets of each result in the document.
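The exact DPS response format isn’t documented here, but a concrete stand-in helps make the later sketches readable. Every field name below is a hypothetical placeholder for the categories of metadata described above (Dublin Core tags, word coordinates, spam/porn probabilities), not the real DPS schema.

```python
# Hypothetical stand-in for the DPS response; all field names are assumptions,
# chosen only to make the later sketches concrete.
from dataclasses import dataclass, field

@dataclass
class WordBox:
    text: str
    x: float        # Cartesian coordinates of the rendered word
    y: float
    width: float
    height: float
    font_size: float

@dataclass
class DPSProfile:
    dublin_core: dict = field(default_factory=dict)   # e.g. {"dc.title": "..."}
    words: list = field(default_factory=list)         # list of WordBox
    spam_probability: float = 0.0
    porn_probability: float = 0.0
```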

What’s inside SERPy? It’s a Finite State Machine (FSM) stack. The first FSM takes the output provided by the DPS and attempts to identify the type of search results page: a classic Google-style web result versus a product, news, image, or other style of SERP. Based on that classification, it then chooses a second FSM to identify the features associated with that type of search result.

Take, for example, a repetitive list consisting of
[Link] [Description]
[Link] [Description]
[Link] [Description]

In the DPS output, those entries line up on the same X coordinate, with larger text (the link) on top and smaller text (the description) underneath. Each state in the sequence fires a bonus or a penalty. The final state (the accept state) sums the scores and applies a threshold for pass or fail. If the pattern passes, the system has found a bona fide instance of one of the known types of SERPs. There are several hurdles the FSM stack has to get over. Ads and sponsored links can often be identified by text aligned to the rightmost third of the page. Natural Language Processing (NLP) techniques are used to get past breadcrumbs at the top of the page and to identify the end of the page. There are quite a few of these little rules that make up the secret sauce. When enough of them fire, section identification becomes very accurate.
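Here is a minimal sketch of that bonus/penalty scoring in Python, not SERPy’s actual implementation. The weights, threshold, and the Candidate fields are all invented for illustration; the real rule set is the secret sauce and isn’t published.

```python
# Illustrative bonus/penalty scoring over one repeated [Link][Description] group.
# All weights and the threshold are made up for the sketch.
from dataclasses import dataclass

@dataclass
class Candidate:
    link_x: float          # left edge of the link text
    desc_x: float          # left edge of the description text
    link_font_size: float
    desc_font_size: float
    page_width: float

def score_candidate(c: Candidate) -> int:
    score = 0
    if abs(c.link_x - c.desc_x) < 2.0:          # bonus: same X alignment
        score += 10
    if c.link_font_size > c.desc_font_size:     # bonus: larger text on top
        score += 5
    if c.link_x > (2.0 / 3.0) * c.page_width:   # penalty: rightmost third, likely an ad
        score -= 20
    return score

ACCEPT_THRESHOLD = 25  # hypothetical

def accept(candidates: list) -> bool:
    """The accept state: sum the per-state scores and compare to the threshold."""
    return sum(score_candidate(c) for c in candidates) >= ACCEPT_THRESHOLD
```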

Up to this point, all the FSMs in the stack are accept-or-reject machines, but the final one is a transducer: it infers how to generate a template from the pattern, and that template can then be used for all future scraping on an automated basis.

This technique is very effective. Given 100 sample search engine result pages, all from distinctly different domains, the initial challenge was to make SERPy work on at least 80% of them. We hit 92% in roughly 30 days of development.

But wait, one problem with applications like SERPy is that they can be very CPU-intensive, often taking 20-30 seconds to generate each result. So we extended SERPy to generate scraper templates to speed things up.

The template generation mechanism fetches three or four sample search result pages from a single search engine (C-SPAN.org, for example) and runs them through the core SERPy FSM stack. To make the data more predictable to work with, the documents are first pre-processed with HTML Tidy to produce well-formed XHTML in pure UTF-8. This can be done by setting tidy with the following options (a minimal invocation sketch follows the list):

  • TidyFixComments -> yes
  • TidyXhtmlOut -> yes
  • TidyNumEntities -> yes
  • TidyOutCharEncoding -> utf8

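Here is one way to drive those options from Python via pytidylib, which wraps the same libtidy the options above come from. The string option names are my mapping of the C enum names in the list (for example, TidyXhtmlOut becomes "output-xhtml"); verify them against your tidy version.

```python
# Pre-processing sketch: raw HTML in, well-formed UTF-8 XHTML out.
# Requires libtidy on the system and `pip install pytidylib`.
from tidylib import tidy_document

TIDY_OPTIONS = {
    "fix-bad-comments": 1,     # TidyFixComments
    "output-xhtml": 1,         # TidyXhtmlOut
    "numeric-entities": 1,     # TidyNumEntities
    "output-encoding": "utf8", # TidyOutCharEncoding
}

def to_clean_xhtml(raw_html: str) -> str:
    xhtml, errors = tidy_document(raw_html, options=TIDY_OPTIONS)
    return xhtml
```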
The start/stop offsets of each excerpt area are automatically marked with the custom tags <TMPLExcerptBegin /> and <TMPLExcerptEnd />. The marked documents are then fed through libexpat to produce a traversable DOM. (Expat is a stream parser, but building a DOM on top of it is quite easy. You could also just use libxml2, which has a DOM interface.)
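As a sketch of that stream-to-DOM step, here is a tiny tree built from expat callbacks using Python’s xml.parsers.expat binding to the same libexpat. The Node class is a hypothetical lightweight stand-in for a real DOM, not the data structure SERPy actually uses.

```python
# Build a minimal traversable tree from expat's streaming callbacks.
import xml.parsers.expat

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.parent = parent
        self.children = []
        self.text = ""

def parse_to_dom(xhtml: str) -> Node:
    root = Node("#document")
    current = root

    def start(tag, attrs):
        nonlocal current
        node = Node(tag, attrs, parent=current)
        current.children.append(node)
        current = node

    def end(tag):
        nonlocal current
        current = current.parent

    def chars(data):
        current.text += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xhtml, True)
    return root
```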

The code then traverses the DOM until it hits the first custom tag, TMPLExcerptBegin, and creates a backtrace of tags. It then follows the tag path until TMPLExcerptEnd is encountered. After normalization (removing whitespace and sequential numbers in identifiers), we end up with the tag path that delimits the excerpt.
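A simplified sketch of the backtrace and normalization follows, reusing the hypothetical Node tree from the previous snippet. Real SERPy follows the path between the Begin and End markers; here the path to each marker’s parent stands in for that.

```python
# Backtrace a tag path for each TMPLExcerpt marker and normalize it.
import re

def path_step(node):
    # Include the id attribute when present; that is typically where the
    # sequential numbering lives that normalization strips out.
    ident = node.attrs.get("id", "")
    return f"{node.tag}#{ident}" if ident else node.tag

def path_to(node):
    """Backtrace of tag steps from the document root down to `node`."""
    steps = []
    while node is not None and node.tag != "#document":
        steps.append(path_step(node))
        node = node.parent
    return list(reversed(steps))

def normalize(tag_path):
    # Remove whitespace and sequential numbers, so "div#result3" and
    # "div#result4" collapse to the same step, "div#result".
    return [re.sub(r"\d+", "", step).strip() for step in tag_path]

def marker_paths(root):
    """Normalized paths of the parents of every TMPLExcerptBegin marker."""
    paths, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.tag == "TMPLExcerptBegin":
            paths.append(normalize(path_to(node.parent)))
        stack.extend(reversed(node.children))
    return paths
```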

The DOM paths from all the samples are then put into a string counter, and a common pattern emerges. This one-time process takes approximately one to two minutes and yields a very valuable result: a template that can be used repeatedly, so that subsequent calls can be done in milliseconds. The search result extraction engine that uses these templates is the easiest part of the process. Simply perform the search, convert the document to XHTML and parse it into a DOM. Then, using the excerpt tag path template generated above, traverse the DOM and locate the excerpts, titles, and URIs. In total, you’ve only added milliseconds to the search time and have cleanly scraped the results.
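Putting the pieces together, here is a sketch of the one-time aggregation and the fast per-query extraction, building on the hypothetical helpers from the earlier snippets (to_clean_xhtml, parse_to_dom, path_to, normalize, marker_paths). Counter stands in for the post’s string counter.

```python
# One-time: derive the template from a handful of marked-up samples.
from collections import Counter

def build_template(marked_samples):
    """marked_samples: XHTML strings with TMPLExcerpt markers already inserted."""
    counter = Counter()
    for xhtml in marked_samples:
        counter.update("/".join(p) for p in marker_paths(parse_to_dom(xhtml)))
    most_common_path, _count = counter.most_common(1)[0]
    return most_common_path.split("/")

# Per query: tidy the page, parse it, and collect nodes matching the template path.
def extract_excerpts(raw_html, template_path):
    root = parse_to_dom(to_clean_xhtml(raw_html))
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if normalize(path_to(node)) == template_path:
            hits.append(node.text.strip())
        stack.extend(reversed(node.children))
    return hits
```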

This type of automatic template generation is not limited to search engine result pages. It can also be used for fast extraction of related images and Dublin Core information, and for ad removal. Using the DPS directly to identify those parts and then applying the same algorithm described above should yield similar results.

SERPy is a demonstrable example of a standard method for developing scalable semantic technology. It starts out as a tool that can perform a particular task on a case-by-case basis with manual configuration, and then transforms into an engine that replaces the manual configuration with automated modeling.
