
Information Extraction: Web Scraping

Web scraping is the general term for the various means of extracting content (data) from a website in order to transform that content into a format suitable for use in another context.

Web pages are coded in HTML, which uses a tree-like structure to represent the information, and the actual content data is mingled with layout and rendering markup. Scrapers are programs that ‘know’ how to get particular content back from an HTML page: by learning how the page is put together, they figure out where the actual content data sits, then extract it and structure it so it can be reused and integrated into a new or existing system or website.
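As a concrete illustration, here is a minimal sketch in Python using the third-party requests and BeautifulSoup libraries. The URL and the CSS selectors are hypothetical placeholders; a real scraper has to be written against the actual markup of the page it targets.

```python
# A minimal scraper sketch. The URL and the 'div.product' / 'span.price'
# selectors are hypothetical; a real page needs its own selectors.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

# Walk the HTML tree and keep only the content data, discarding the
# surrounding layout and rendering markup.
products = []
for item in soup.select("div.product"):
    name = item.select_one("h2")
    price = item.select_one("span.price")
    if name and price:
        products.append({
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True),
        })

# The extracted data is now structured and ready to be reused or
# integrated into another system or website.
print(products)
```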

A web scrape can occur over one website or many, in order to compile all the relevant data into one source. Think of it as ‘scraping’ all your favourite bits off a cake, or a selection of cakes, in order to create your ultimate cake.

Companies generally utilise this information extraction technique as a means of obtaining the most recent data possible, particularly when working with information that is subject to frequent change. Access to certain information can also provide a strategic business advantage. For example, a business that knows the locations of its competitors can make better decisions about where to focus further growth.

However, the most common, and most controversial, use of information taken from websites is to repost the scraped data to other sites.

A typical example application of web scraping is called a ‘web crawler’. A web crawler will ‘crawl’ through sites and copy content from one or more existing websites in order to generate a scraper site. The result can range from fair-use excerpts or reproduction of text and content to plagiarized content.

Web scraping differs from ‘screen scraping’ in the sense that a website is not really a visual screen, but a live HTML/JavaScript-based content structure with a graphical interface in front of it. Web scraping does not involve working at the visual interface as screen scraping does, but rather works on the underlying object structure (the DOM) of the HTML and JavaScript.
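The difference can be seen in how a scraper accesses content. The sketch below assumes a Selenium and headless-Chrome setup (an assumption, not part of the original article): the browser executes the page’s JavaScript, and the scraper then queries the resulting object structure directly, rather than reading anything off a rendered screen.

```python
# A sketch of working on the underlying object structure rather than
# the visual interface. Selenium with headless Chrome is an assumed
# setup; the URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")      # no visual screen is involved
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
# Query the live DOM, including any nodes generated by JavaScript.
for heading in driver.find_elements(By.TAG_NAME, "h2"):
    print(heading.text)

driver.quit()
```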

Recursive web scraping, in which the scraper follows links to other pages across many websites, is called ‘web harvesting’ and is performed by software called a bot, ‘webbot’, ‘harvester’ or ‘spider’.
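A minimal sketch of such a spider might look like the following. The start URL and page limit are illustrative placeholders, and a real harvester should also respect robots.txt and rate-limit its requests.

```python
# A toy web harvester: starting from one page, it recursively follows
# links it has not seen yet, staying within the starting site here to
# keep the example bounded.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def harvest(start_url, max_pages=20):
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Follow each hyperlink on the page, resolving relative URLs.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == domain and link not in seen:
                queue.append(link)
    return seen

print(harvest("https://example.com"))  # hypothetical start page
```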

There are legal web scraping sites that provide free content, and these are commonly used by companies looking to populate their own sites with web content, often hoping to profit in some way from the traffic that content brings. However, this content does not help the site’s ranking in search engine results, because the content is not original to that site, and original content is the priority of search engines.
