Web page segmentation by visual clustering

Date: 2019-12-05


Our job at Mapado is to collect all the "things to do", all around the planet.

To gather this huge amount of information, we crawl the web, like Google does, searching for content related to concerts, shows, visits, attractions and so on. When we find an interesting page, we try to extract the "good" data from it.

One of our major challenges is to separate the content we are interested in (title, description, photos, dates, …) from all the crap around it (advertising, navigation bar, footer content, related content, …).

Within that challenge, one task is to group together pieces of content that are visually close to each other. Usually, the elements that make up the main content of a web page are close to each other.

When we began working on this task, we innocently thought that we could work with the HTML DOM alone. In the DOM, elements are stored as a hierarchy, so elements with the same parent have a good chance of being related.
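To make the "same parent" idea concrete, here is a minimal sketch that groups text snippets by the parent of the element that contains them, using only Python's standard `html.parser`. The class name `ParentGrouper`, the helper `group_text_by_parent` and the `tag#n` labels are illustrative assumptions, not something taken from our production code.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParentGrouper(HTMLParser):
    """Group text by the parent of the element that directly contains it."""

    def __init__(self):
        super().__init__()
        self.counter = 0                  # gives every element a unique label
        self.stack = []                   # (tag, label) pairs, root first
        self.groups = defaultdict(list)   # parent label -> text snippets

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.counter += 1
        self.stack.append((tag, f"{tag}#{self.counter}"))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in [t for t, _ in self.stack]:
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and len(self.stack) >= 2:
            parent_label = self.stack[-2][1]
            self.groups[parent_label].append(text)


def group_text_by_parent(html):
    """Hypothetical helper: return {parent label: [text, ...]} for an HTML string."""
    parser = ParentGrouper()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)


sample = ("<div><h1>Concert</h1><p>Friday 8pm</p></div>"
          "<div><a>Contact</a></div>")
groups = group_text_by_parent(sample)
```

In this toy page the title and the date end up in the same group because they share a parent `div`, while the footer link lands in a group of its own.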

A very interesting paper covering page segmentation is "Page Segmentation by Web Content Clustering".

Using the DOM hierarchy is a good starting point, but in many cases things get a lot more complicated:

CSS stylesheets can move elements: elements that are close in the DOM hierarchy can be placed anywhere, including outside the browser window
CSS stylesheets can hide or show elements: several pieces of content can share the same visual position and simply be moved (or removed) by CSS and JavaScript
JavaScript code can display things that are not even in the initial DOM
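One practical consequence of the points above is that hidden or off-screen elements should be dropped before any visual grouping. The sketch below assumes each rendered element is available as a plain dict with `display`, `visibility`, `left`, `top`, `width` and `height` keys; that dict layout and the function name `is_visible` are hypothetical and do not correspond to the output format of any particular renderer.

```python
def is_visible(el, page_width):
    """Return True if the element occupies some visible area of the page.

    `el` is a hypothetical dict of computed properties (an assumption).
    """
    if el.get("display") == "none" or el.get("visibility") == "hidden":
        return False
    if el["width"] <= 0 or el["height"] <= 0:
        return False
    right = el["left"] + el["width"]
    bottom = el["top"] + el["height"]
    # Pages scroll vertically, so only reject boxes pushed entirely above
    # the page, entirely left of it, or entirely right of it.
    return right > 0 and bottom > 0 and el["left"] < page_width


elements = [
    {"left": 0, "top": 0, "width": 600, "height": 80},                      # normal block
    {"left": -9999, "top": 0, "width": 200, "height": 20},                  # moved off-screen
    {"left": 0, "top": 90, "width": 600, "height": 40, "display": "none"},  # hidden by CSS
]
visible = [el for el in elements if is_visible(el, page_width=1024)]
```

Only the first element survives the filter here; the other two are exactly the kind of content that looks related in the DOM but is never shown next to the main content.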

So we started considering WebKit as a visual renderer in order to get visual features. There are a number of headless browser packages, such as PhantomJS, Zombie.js or CasperJS. A WebKit-based renderer can render a web page and expose the computed properties of every element on it.

The following features are useful for clustering things that are visually close together:

position of the element in the page (from the top and from the left)
width and height of the element

Below is a screenshot of the Quai Branly Museum page whose elements we want to cluster.

When building the clustering model, we found that one of the main features is the position of the leftmost and rightmost pixel of each block. Indeed, if you look at web pages, different content blocks are very often separated by a vertical gap.
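The vertical-gap observation can be illustrated without any machine learning: project every block onto the horizontal axis and start a new column whenever the next block begins well to the right of everything seen so far. This is only a sketch of the intuition, not the model described in this post; the function name `split_on_vertical_gaps` and the `min_gap` threshold of 20 pixels are assumptions.

```python
def split_on_vertical_gaps(spans, min_gap=20):
    """Group blocks into columns separated by empty vertical strips.

    `spans` is a list of (leftmost_pixel, rightmost_pixel) pairs, one per block.
    Returns a list of columns, each a list of indices into `spans`.
    """
    order = sorted(range(len(spans)), key=lambda i: spans[i][0])
    columns = []
    current_right = None   # rightmost pixel covered by the current column
    for i in order:
        left, right = spans[i]
        if current_right is None or left - current_right >= min_gap:
            columns.append([i])          # empty strip found: new column
            current_right = right
        else:
            columns[-1].append(i)        # overlaps or touches current column
            current_right = max(current_right, right)
    return columns


# Two main-content blocks and two sidebar blocks, in document order.
spans = [(0, 600), (650, 950), (10, 580), (660, 940)]
columns = split_on_vertical_gaps(spans)
```

With these sample spans the 50-pixel empty strip between x = 600 and x = 650 separates the main content (blocks 0 and 2) from the sidebar (blocks 1 and 3).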

Adding the position of the center of each block and its DOM depth improves the clustering results.
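Put together, each block can be described by a small numeric vector. The sketch below assumes the renderer output has already been converted to a dict with `left`, `top`, `width`, `height` and `depth` keys; that layout and the function name `block_features` are hypothetical.

```python
def block_features(el):
    """Return [left, right, center_x, center_y, depth] for one rendered block.

    `el` is a hypothetical dict (an assumption, not a real renderer format).
    """
    left = el["left"]
    right = left + el["width"]                    # rightmost pixel
    center_x = left + el["width"] / 2
    center_y = el["top"] + el["height"] / 2
    return [left, right, center_x, center_y, el["depth"]]


row = block_features(
    {"left": 100, "top": 40, "width": 600, "height": 200, "depth": 5}
)
```

Because pixel coordinates and DOM depth live on very different scales, such vectors usually need to be normalized before a distance-based clustering algorithm is applied to them.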

Below is a first implementation of these concepts in Python, using Scikit-Learn to perform the clustering.

Python
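As a rough illustration of how such features can be fed to Scikit-Learn, here is a minimal sketch. It is not our original implementation: the choice of `DBSCAN`, the `eps` and `min_samples` values, the function name `cluster_blocks` and the sample rows are all assumptions made for the example. DBSCAN is used here only because it does not require the number of clusters to be known in advance.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def cluster_blocks(features, eps=1.5, min_samples=2):
    """Cluster blocks described by [left, right, center_x, center_y, depth].

    eps and min_samples are illustrative values (assumptions), to be tuned
    on real pages. Returns one label per block; -1 means "noise".
    """
    X = np.asarray(features, dtype=float)
    # Put pixel coordinates and DOM depth on a comparable scale.
    X = StandardScaler().fit_transform(X)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)


# Made-up sample: three stacked blocks in a main column (x from 0 to 600)
# and three stacked blocks in a sidebar (x from 650 to 950).
features = [
    [0, 600, 300, 50, 4],
    [0, 600, 300, 150, 4],
    [0, 600, 300, 250, 4],
    [650, 950, 800, 60, 5],
    [650, 950, 800, 160, 5],
    [650, 950, 800, 260, 5],
]
labels = cluster_blocks(features)
```

On this sample the main-column blocks and the sidebar blocks fall into two separate clusters, because the left and right edges differ far more between the columns than the vertical positions differ within a column.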