
Building Web Page Clustering Application, Need Help

pseudocode

4 replies to this topic

#1 vytska007

vytska007

    CC Lurker

  • New Member
  • Pip
  • 3 posts
  • Programming Language:C, Java, C++, C#, Python
  • Learning:C, Java, C++, Python

Posted 19 May 2012 - 09:20 AM

Hello there guys, I'm building an application in Python that crawls a given website and gathers all the links inside it. After that, the program has to group the site's pages into clusters that share the same template. I'm trying to work out how to decide whether two pages use the same template. So far my idea is to parse each page's HTML tags into a string, so I get for example string=htmlmetatitle..... etc. Then I compare the strings using the Levenshtein string edit distance algorithm and decide whether they are roughly the same. Pages that meet a defined similarity ratio are put in the same cluster. But if, for example, a forum page X using template A has one post and a page Y of the same template has twenty posts, then their html_tag_strings differ a lot. So I'm asking for a better algorithm to decide whether two pages use the same template. Any code or pseudocode would be welcome :)
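
A minimal sketch of the tag-string idea described above, to make the one-post versus twenty-posts problem concrete. It is only an illustration under assumptions: the class and function names are made up for this example, the pages are synthetic, and `difflib.SequenceMatcher` from the standard library stands in for a real Levenshtein implementation.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order and ignores all text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_similarity(html_a, html_b):
    """Similarity ratio in [0, 1] between the two tag sequences.

    Assumption: difflib's ratio is used here instead of Levenshtein distance.
    """
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b))
    return matcher.ratio()


def make_thread_page(post_count):
    """Synthetic forum page: same template, variable number of posts."""
    posts = "<div><p>post</p></div>" * post_count
    return "<html><head><title>t</title></head><body>" + posts + "</body></html>"


same = tag_similarity(make_thread_page(1), make_thread_page(1))
different_length = tag_similarity(make_thread_page(1), make_thread_page(20))
print(same, different_length)
```

The second ratio comes out far lower than the first even though both pages come from one template, which is exactly the weakness raised in the post.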
  • 0

#2 jwxie518

jwxie518

    Speaks fluent binary

  • Senior Member
  • PipPipPipPipPipPip
  • 517 posts

Posted 20 May 2012 - 01:25 AM

Comparing HTML tags is not going to help you. Comparing whole HTML pages is extremely slow. Just check out the page source of CodeCall (right here, check it out). The file is huge and contains many strings.

You need to set up some criteria, and based on them decide whether pages are using the same template.
Remember that nowadays we don't make one big template. We "include" or "extend" other templates. You also can't remove all strings from the page source, because some content can provide useful information.

Why not analyze the links, the URLs? Make them into a graph and consider the weights and distances.
I mean, if you leave certain HTML elements and the post contents out of your search (you know for certain that some pages contain dynamic information, such as a forum thread page) and just concentrate on links such as the URLs of button images, avatars, etc., then you can say the pages very likely come from the same group of templates.
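
One way to read the suggestion above, sketched in Python: collect only the resource URLs a page references (images, scripts, stylesheets) and compare the resulting sets. Everything here is an assumption for illustration: the choice of tags, the Jaccard measure, the helper names, and the sample pages. Per-user resources such as avatars would probably need filtering in a real crawler.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects image, script and stylesheet URLs; ignores <a> links and text."""

    # Assumed mapping of tag name to the attribute that holds its URL.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.resources.add(value)


def page_resources(html):
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    return collector.resources


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


HEAD = ('<head><link rel="stylesheet" href="/css/forum.css">'
        '<script src="/js/forum.js"></script></head>')
thread_short = "<html>" + HEAD + '<body><img src="/img/reply.png"><p>one post</p></body></html>'
thread_long = ("<html>" + HEAD + "<body>"
               + '<img src="/img/reply.png"><p>post</p>' * 20 + "</body></html>")
blog_page = ('<html><head><link rel="stylesheet" href="/css/blog.css"></head>'
             '<body><img src="/img/logo.png"></body></html>')

print(jaccard(page_resources(thread_short), page_resources(thread_long)))
print(jaccard(page_resources(thread_short), page_resources(blog_page)))
```

Because the comparison is over sets, the number of posts on a page does not change the score, only which resources appear at all.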

Also, just because two pages differ in 80% of their HTML page source does not mean they are not based on the same template.

I can have this:
<html>
<head></head>
<body>
{% load data based on url %}    // RESTful API .....
</body>
</html>

From that I can generate a page of tables or a page of lists. They don't show the same structure.

So you will definitely miss some of them.
  • 1

#3 vytska007

vytska007

    CC Lurker

  • New Member
  • Pip
  • 3 posts
  • Programming Language:C, Java, C++, C#, Python
  • Learning:C, Java, C++, Python

Posted 20 May 2012 - 02:32 AM

So I should analyze all the hrefs on a page and, for example, get their XPaths, and then compare the XPaths of each page?
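
A rough sketch of that idea, assuming simplified XPath-like paths without positional indexes (so `/html/body/div/ul/li/a` rather than `/html/body/div[2]/ul/li[3]/a`). The names are hypothetical, and the parser is the standard library `html.parser`, which does not repair broken markup the way a browser does.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Records the tag path from the root to every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.paths.add("/" + "/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def link_paths(html):
    collector = LinkPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def make_listing(item_count):
    items = "<li><a href='/topic/1'>thread</a><br></li>" * item_count
    return "<html><body><div><ul>" + items + "</ul></div></body></html>"


print(link_paths(make_listing(1)))
print(link_paths(make_listing(3)) == link_paths(make_listing(1)))
```

Dropping the positional indexes is a deliberate choice here: repeated items collapse to one path, so a page with one entry and a page with many entries yield the same set.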
  • 0

#4 jwxie518

jwxie518

    Speaks fluent binary

  • Senior Member
  • PipPipPipPipPipPip
  • 517 posts

Posted 20 May 2012 - 10:43 PM

You said you might end up visiting two threads (one with 1 reply, one with 30 replies). If the link is of the form `http://codecall.net/topic/xxxxxxxxx`, then you should just ignore the rest. Just analyze one?

This leads to my question: what's the purpose of this application? Logically, from a human perspective, it is very unlikely that `http://codecall.net/forum/xxxxxx` uses the same template as `http://codecall.net/topic/xxxx`. So, as the author of the machine: if it is the first time you encounter the form `/topic/xxxx`, put it in the queue to analyze. Learn it. Then if your machine finds another path of exactly that form, don't put it in the analyzer queue; instead, just mark it "walk-through only", that is, only collect its links.
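
The queueing rule above could be sketched like this. It is a simplification under stated assumptions: a URL's "form" is taken to be the host plus the path with its last segment replaced by `*`, and `url_pattern` and `plan_crawl` are names invented for this example.

```python
from urllib.parse import urlsplit


def url_pattern(url):
    """Reduce a URL to its form, e.g. .../topic/123 -> codecall.net/topic/*.

    Assumption: only the last path segment varies between pages of one form.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments:
        segments[-1] = "*"
    return parts.netloc + "/" + "/".join(segments)


def plan_crawl(urls):
    """Split URLs into ones to analyze and ones to walk through for links only."""
    seen_patterns = set()
    analyze, walk_only = [], []
    for url in urls:
        pattern = url_pattern(url)
        if pattern in seen_patterns:
            walk_only.append(url)      # form already learned
        else:
            seen_patterns.add(pattern)
            analyze.append(url)        # first page of this form
    return analyze, walk_only


urls = [
    "http://codecall.net/topic/1001",
    "http://codecall.net/topic/1002",
    "http://codecall.net/forum/7",
    "http://codecall.net/topic/1003",
]
analyze, walk_only = plan_crawl(urls)
print(analyze)
print(walk_only)
```

With this input, one `/topic/` page and one `/forum/` page go to the analyzer and the remaining two `/topic/` pages are only walked for links.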

Now, if you have worked on sites like Drupal-based ones, much of their content is actually based on some global templates (like a Page template). The URLs may look different. That's the part you need to analyze. If the links are of the same format, you should assume they use the same template. Or you can just pick some of them at random and analyze those. If your machine learns that the difference ratio is high, alert the learning state and say "hey bro, this URL form may actually use different templates based on some kind of preference": maybe by GET, POST, PUT, DELETE, or by language, by JavaScript / non-JavaScript, etc.

That's the way I see it... I also wonder whether compressing the two HTML files would help at all (bare-bones structures with names and tags, maybe; definitely no content strings, though). That might be the first step of the comparison.
  • 2

#5 vytska007

vytska007

    CC Lurker

  • New Member
  • Pip
  • 3 posts
  • Programming Language:C, Java, C++, C#, Python
  • Learning:C, Java, C++, Python

Posted 21 May 2012 - 05:51 AM

Hmmm, interesting thoughts. Thx for replying btw :)

So in your opinion I should consider not only the HTML tag structure but also the page's link? If I get it right, I should analyze only the first link matching a URL template such as www.example.com/products/xxxxx; then if I meet another link with the same prefix, i.e. www.example.com/products/yyyyy, I should just put it in the same cluster, because they are likely from the same template?

Another idea is to use a tree edit distance algorithm (http://cs.brown.edu/...in-2000-TED.pdf) to calculate the similarity between two HTML tag trees. The edit distance between two trees is the minimum cost of a sequence of edit operations that turns one input tree into the other. I think this would give better results, because the tree of a forum thread page with 1 thread would look like this (example):

body
/ | \
div div div
/
li (link to thread page)

and the tree of forum thread page with 3 posts would look like:

body
/ | \
div div div
/ | \
li li li

I've got to check the result for these two trees and decide whether it's a good idea. What do you think?
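
For the two example trees, a small sketch can give a first number. Note the assumptions: this is not the general tree edit distance from the linked paper but a simpler top-down variant (in the spirit of Selkow's algorithm) where only whole subtrees are inserted or deleted, every node costs 1, and the `li` nodes are read as children of the first `div`. Trees are plain `(label, children)` tuples, and all names are made up for the example.

```python
def size(tree):
    """Number of nodes in a (label, children) tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(a, b):
    """Top-down edit distance: relabel roots, then align the child lists.

    Inserting or deleting a child costs the size of its whole subtree;
    matching two children costs their recursive distance.
    No memoization, so this is only suitable for small trees.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    relabel = 0 if label_a == label_b else 1

    rows, cols = len(kids_a) + 1, len(kids_b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),                       # delete subtree
                d[i][j - 1] + size(kids_b[j - 1]),                       # insert subtree
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[-1][-1]


def tree_similarity(a, b):
    """Normalize the distance to [0, 1] by the combined tree size."""
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


one_item = ("body", [("div", [("li", [])]), ("div", []), ("div", [])])
three_items = ("body", [("div", [("li", []), ("li", []), ("li", [])]),
                        ("div", []), ("div", [])])

print(tree_distance(one_item, three_items))
print(tree_similarity(one_item, three_items))
```

Under these assumptions the two trees differ only by the two extra `li` leaves, so the distance stays small relative to the tree sizes, but it still grows with every additional repeated item on the page.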
  • 0




