
Web spider


13 replies to this topic

#1
martin2311

    Learning Programmer

  • Members
  • 32 posts
Hey guys, I need some help. I made a spider that gets the URLs from a given site, but it only gets the URLs (href) from the main page. I want it to get the links from the other pages too.

Like:

www.site.com/
www.site.com/blog.php?id=1 <<<< I need the links/URLs (href) on this page too

#2
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
Without seeing the code you already have, nobody can add much beyond this: load the other pages that the site links to.
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

#3
martin2311

    Learning Programmer

  • Members
  • 32 posts
Sure, here you go:


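' Collect every <a> element from the page currently loaded in WebBrowser1.
Dim pageelement As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")

For Each curelement As HtmlElement In pageelement
    ' Add the link target; a ListBox item does not need a trailing newline.
    ListBox1.Items.Add(curelement.GetAttribute("href"))
Next

' Select the first entry once, after the loop, and only if a link was found.
If ListBox1.Items.Count > 0 Then ListBox1.SelectedIndex = 0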

This adds all the links of the base page, i.e. every link that is on www.site.com/.

But some sites have links under links, like:

www.site.com/something.php <<<<< when we click on some image or URL there, some other URL comes up. I want to crawl each and every link of the website and add them all to the list box.

#4
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
Yes, you'll have to add sublinks using the same logic, but only new links that aren't already on the list.

#5
LuthfiHakim

    Programming God

  • Members
  • 765 posts
What do you mean by "links under links"? And what do you mean by this?

martin2311 said:

then we click on some image ur url some other ur comes i want to crwal all the links of web site .....

I developed a similar kind of software before, and would love to share some tips with you.

#6
martin2311

    Learning Programmer

  • Members
  • 32 posts
@luthfihakim

Bro, I need to code a web spider that can get all the links of a website, and I mean ALL links. What I coded gets only the links on the main page. I want it to go deep into the website and get the links there too, and of course remove duplicate links.

#7
LuthfiHakim

    Programming God

  • Members
  • 765 posts
That's very obvious. What I don't understand is what you mean by "links under links" and "we click on some image ur url some other ur comes...".

Anyway, you need a global collection object to hold all the links. This collection must automatically discard duplicate links, so that you only get unique links. Accompany each link with a flag telling whether you have visited it or not. Using this flag, you will not visit the same page more often than you want to.
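A minimal sketch of such a collection, written in Python rather than VB.NET so it can be run as a quick experiment. The class name `LinkRegistry` and its method names are made up for illustration; the idea is only that a dictionary keyed by URL gives you both the duplicate rejection and the per-link visited flag.

```python
class LinkRegistry:
    """Holds unique URLs, each paired with a 'visited' flag (hypothetical helper)."""

    def __init__(self):
        # A dict keeps insertion order, so links are handed out in discovery order.
        self._visited = {}

    def add(self, url):
        """Add url unless it is already known. Returns True if it was new."""
        if url in self._visited:
            return False
        self._visited[url] = False
        return True

    def next_unvisited(self):
        """Return the first URL not visited yet, or None when all are done."""
        for url, seen in self._visited.items():
            if not seen:
                return url
        return None

    def mark_visited(self, url):
        self._visited[url] = True

    def __len__(self):
        return len(self._visited)
```

In VB.NET the same role could probably be played by a `Dictionary(Of String, Boolean)`, but that mapping is a suggestion, not something tested here.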

#8
martin2311

    Learning Programmer

  • Members
  • 32 posts
Yes, that is what I want. How can I do this? I'm a noob, so how do I make this "global collection object"? And the flag too?

#9
LuthfiHakim

    Programming God

  • Members
  • 765 posts
Basically you need to define 3 classes.
  • Link, with properties:
    • Text (the url)
    • Visited, this property tells you if the link has been visited or not.


  • Links, a collection class to hold multiple Link objects.
  • Page, a class that does the spidering. Here you visit a URL and collect the links in the page (like you did in your posted code). This class must have an internal Links collection object to temporarily store the collected links. Only after the spidering of the page has finished should the links in this internal collection be submitted to the global links collection.
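The three classes above could look roughly like this. This is a sketch in Python, not the original Delphi code (which is not shown in the thread) and not VB.NET. The member names (`text`, `visited`, `internal_links`, `collect_links`) follow the description but are otherwise assumptions, and `Page` is fed an HTML string here instead of downloading anything.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class Link:
    text: str              # the URL
    visited: bool = False  # set to True once the page has been spidered


class Links:
    """Collection of Link objects that silently ignores duplicate URLs."""

    def __init__(self):
        self._by_url = {}

    def add(self, url):
        # setdefault keeps the existing Link (and its flag) if the URL is known.
        return self._by_url.setdefault(url, Link(url))

    def add_links(self, other):
        for link in other:
            self.add(link.text)

    def __iter__(self):
        return iter(self._by_url.values())

    def __len__(self):
        return len(self._by_url)


class Page(HTMLParser):
    """Collects the href of every <a> tag into its own internal Links."""

    def __init__(self, url):
        super().__init__()
        self.url = url
        self.internal_links = Links()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs such as "blog.php?id=1" against the page URL.
                self.internal_links.add(urljoin(self.url, href))

    def collect_links(self, html):
        self.feed(html)
        return self.internal_links
```

Keeping a separate `internal_links` per page means the global collection is only touched once per page, via `add_links`, which is the point where a lock would go if several spiders ran in parallel.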

That's the basic design of my spidering software. This design allows multiple spiders to visit several pages at once (of course it's multithreaded).

Based on the classes, the pseudocode would be:

GlobalLinks := New Links;
Spider := New Page;
GlobalLinks.Add("http://Start.com");

while not GlobalLinks.AllVisited do
begin
  Spider.Visit(GlobalLinks.FirstUnvisitedLink);
  Spider.CollectLinks;
  GlobalLinks.AddLinks(Spider.InternalLinks);
end;


Note that you can add another flag that lets you stop or pause the spidering, instead of having to wait until all links have been visited.
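One way such a stop flag might be wired in, sketched in Python. The function `crawl` and its parameters are hypothetical: `visit` stands for whatever code loads a page and returns the URLs found on it, and `threading.Event` is used as the flag so that another thread (a Stop button handler, say) can set it.

```python
import threading


def crawl(pending, visit, stop_requested):
    """Visit URLs from `pending` until none are left or a stop is requested.

    `visit` is a caller-supplied function (an assumption of this sketch) that
    takes a URL and returns the URLs found on that page.
    """
    seen = set(pending)
    visited = []
    while pending and not stop_requested.is_set():
        url = pending.pop(0)
        visited.append(url)
        for found in visit(url):
            if found not in seen:   # only queue links that are new
                seen.add(found)
                pending.append(found)
    # Return what is left so the caller can resume later.
    return visited, pending
```

The flag is checked once per page, so a stop request takes effect after the page currently being visited. This sketch forgets `seen` between calls; a real pause/resume would need to keep that set around as well.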

#10
martin2311

    Learning Programmer

  • Members
  • 32 posts
Is this C#? I don't have any knowledge of C#. Can you please tell me where I could learn more about it? The code you gave is, I guess, what I need.

#11
LuthfiHakim

    Programming God

  • Members
  • 765 posts
Nope, it's not C#. That's pseudocode with some Pascal flavor :). It's not real code, but you can see the basic flow of the program. Please read it carefully, as if each line were a simple English sentence. You will get the flow easily.

I believe you can define custom classes in any flavor of VB (classic VB or VB.NET). I hope you have no problem coding with custom classes/objects instead of only with the GUI (the RAD approach).

Anyway, my spider project was in Delphi/Pascal. I don't have the source code anymore, so I have to rely on my memory.

[Added]
In the pseudocode I forgot to update the Link's flag, so here it is again:

GlobalLinks := New Links;
Spider := New Page;
GlobalLinks.Add("http://Start.com");

while not GlobalLinks.AllVisited do
begin
  Spider.Visit(GlobalLinks.FirstUnvisitedLink);
  Spider.CollectLinks;
  GlobalLinks.FirstUnvisitedLink.Visited := True;  <-- I left this out in the prior post
  GlobalLinks.AddLinks(Spider.InternalLinks);
end;
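For anyone who wants to try the flow before writing the classes, here is one possible runnable reading of the pseudocode in Python (not VB.NET). Several things are assumptions that the pseudocode does not specify: the names `spider`, `fetch` and `fetch_page`, the `max_pages` safety limit, dropping `#fragment` parts, and keeping only links on the same host as the start URL so the spider stays inside one website.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _HrefParser(HTMLParser):
    """Gathers the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Download a page as text (needs network; pass a stub to spider() for tests)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def spider(start_url, fetch_page=fetch, max_pages=100):
    """Return every same-host URL reachable from start_url, in discovery order."""
    host = urlparse(start_url).netloc
    global_links = {start_url: False}           # URL -> visited flag

    for _ in range(max_pages):
        # GlobalLinks.FirstUnvisitedLink
        current = next((u for u, done in global_links.items() if not done), None)
        if current is None:                      # GlobalLinks.AllVisited
            break
        try:
            html = fetch_page(current)           # Spider.Visit
        except OSError:
            html = ""                            # unreachable page: no links
        parser = _HrefParser()
        parser.feed(html)                        # Spider.CollectLinks
        global_links[current] = True             # Visited := True
        for href in parser.hrefs:                # GlobalLinks.AddLinks
            url, _fragment = urldefrag(urljoin(current, href))
            if urlparse(url).netloc == host:
                global_links.setdefault(url, False)
    return list(global_links)
```

Calling `spider("http://www.site.com/")` would crawl over the network; passing `fetch_page=` a function that returns canned HTML lets you check the loop offline first.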



#12
martin2311

    Learning Programmer

  • Members
  • 32 posts
Well, I don't know about custom classes. I'll search for them and then try it. I hope this works, because the code is exactly what I need, but I don't know classes for now.