Hey guys, I need some help. I made a spider that gets URLs from a given site, but it only gets the URLs (href) from the main page. I want it to get the links from the other pages too,
like
www.site.com/
www.site.com/blog.php?id=1 <<<< I need the links/URLs/hrefs on this page too
Web spider
Started by martin2311, Dec 30 2010 04:44 AM
13 replies to this topic
#1
Posted 30 December 2010 - 04:44 AM
#2
Posted 30 December 2010 - 09:44 AM
Without seeing the code you have already, nobody can add a lot, besides: load the other pages that are linked to on the site.
#3
Posted 30 December 2010 - 11:10 AM
See, here you go:
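' Collect every <a> element in the currently loaded document.
Dim pageelement As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
For Each curelement As HtmlElement In pageelement
    ' ListBox items are already separate rows, so no newline is needed.
    ListBox1.Items.Add(curelement.GetAttribute("href"))
Next
' Select the first entry once, after the loop, instead of on every iteration.
If ListBox1.Items.Count > 0 Then ListBox1.SelectedIndex = 0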
This adds all the links of the base page, i.e. every link that is on www.site.com/.
But some sites have links under links, like
www.site.com/something.php <<<<< then we click on some image or URL and some other URL comes up. I want to crawl all the links of the website, each and every link on the site, and add them to the list box.
#4
Posted 30 December 2010 - 12:43 PM
Yes, you'll have to add sublinks using the same logic, but only new links that aren't already on the list.
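One way to read "the same logic, but only new links" is sketched below. It is written in Python only for brevity (the code in this thread is VB.NET), and the names `LinkParser` and `collect_new_links` and the sample HTML are made up for illustration. Restricting the crawl to the starting host is an assumption about what "all the links of the site" means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkParser(HTMLParser):
    """Collects the raw href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_new_links(page_url, html, seen):
    """Return same-host links found in html that are not yet in seen.

    seen is updated in place, so calling this once per page never
    yields the same URL twice.
    """
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    new_links = []
    for href in parser.hrefs:
        # Resolve relative hrefs and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc != host:
            continue  # assumption: only crawl the starting site
        if absolute not in seen:
            seen.add(absolute)
            new_links.append(absolute)
    return new_links


seen = {"http://www.site.com/"}
html = ('<a href="/blog.php?id=1">Blog</a> <a href="/">Home</a> '
        '<a href="http://other.example/x">Out</a>')
print(collect_new_links("http://www.site.com/", html, seen))
```

Each list returned this way can be fed back into the same function, one page at a time, which is the "sublinks using the same logic" step.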
#5
Posted 03 January 2011 - 06:50 AM
What do you mean by links under links? And also, what do you mean by this?
martin2311 said:
then we click on some image or URL and some other URL comes up. I want to crawl all the links of the website .....
I developed a similar kind of software before, and would love to share some tips with you.
#6
Posted 03 January 2011 - 10:07 AM
@luthfihakim
Bro, I need to code a web spider that can get all the links of a website, I mean ALL links. What I coded gets only the links on the main page; I want it to go deep into the website and get the links, and of course remove duplicate links.
#7
Posted 03 January 2011 - 08:11 PM
That's very obvious. What I don't understand is what you mean by "links under links" and "we click on some image or URL and some other URL comes up...".
Anyway, you need a global collection object to hold all the links. This collection must automatically discard duplicate links, so you only get unique links. As for each link, accompany it with a flag telling whether you have visited it or not. Using this flag, you will not visit the same page more often than you want to.
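A minimal sketch of such a collection follows, in Python for brevity (the thread's own code is VB.NET). The class name `LinkRegistry` and its method names are hypothetical; the only idea taken from the post is "unique links, each with a visited flag".

```python
class LinkRegistry:
    """Holds every discovered URL once, with a visited flag per URL."""

    def __init__(self):
        # A dict keeps insertion order, so links come back in discovery order.
        self._visited = {}

    def add(self, url):
        """Add url if it is new; return True when it was actually added."""
        if url in self._visited:
            return False
        self._visited[url] = False
        return True

    def first_unvisited(self):
        """Return the oldest unvisited URL, or None when all are visited."""
        for url, visited in self._visited.items():
            if not visited:
                return url
        return None

    def mark_visited(self, url):
        self._visited[url] = True

    def __len__(self):
        return len(self._visited)


links = LinkRegistry()
links.add("http://www.site.com/")
links.add("http://www.site.com/blog.php?id=1")
links.add("http://www.site.com/")  # duplicate, ignored
links.mark_visited("http://www.site.com/")
print(len(links), links.first_unvisited())
```

Keying a dictionary by URL gives both properties at once: a second `add` of the same URL changes nothing, and the stored value is the visited flag.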
#8
Posted 03 January 2011 - 08:17 PM
Yes, that is what I want. How can I do this? I'm a noob: how do I do this "global collection object" thing, and the flag one too?
#9
Posted 03 January 2011 - 08:41 PM
Basically you need to define 3 classes.
- Link, with properties:
  - Text (the URL)
  - Visited, which tells you whether the link has been visited or not.
- Links, a collection class that holds multiple Link objects.
- Page, a class that does the spidering. Here you visit a URL and collect the links in the page (like you did in your posted code). This class must have an internal Links collection object to temporarily store the collected links. Only after the spidering of the page has finished should the links in this internal collection be submitted to the global links collection.
That's the basic design of my spidering software. This design allows multiple spiders to visit several pages at once (of course it's multithreaded).
Based on the classes, the pseudocode would be:
GlobalLinks := New Links;
Spider := New Page;
GlobalLinks.Add("http://Start.com");
while not GlobalLinks.AllVisited do
begin
  Spider.Visit(GlobalLinks.FirstUnvisitedLink);
  Spider.CollectLinks;
  GlobalLinks.AddLinks(Spider.InternalLinks);
end;
Note that you can add another flag that lets you stop or pause the spidering, besides waiting for all links to be visited.
#10
Posted 03 January 2011 - 10:07 PM
Is this C#? Because I don't have any knowledge of C#. Can you please tell me where I could learn more about it? I guess the code you gave is what I need.
#11
Posted 03 January 2011 - 11:41 PM
Nope, it's not C#. That's pseudocode with some Pascal flavor :). It's not real code, but you can see the basic flow of the program. Please read it carefully, as if each line were a simple English sentence, and you will get the flow easily.
I believe that in any flavor of VB (classic VB or VB.NET) you can define custom classes. Hopefully you will have no problem coding with custom classes/objects instead of only with the GUI (RAD approach).
Anyway, my spider project was in Delphi/Pascal. I don't have the source code anymore, so I have to rely on my memory.
[Added]
I forgot to update the Link's flag in the pseudocode. So here it goes:
GlobalLinks := New Links;
Spider := New Page;
GlobalLinks.Add("http://Start.com");
while not GlobalLinks.AllVisited do
begin
  Spider.Visit(GlobalLinks.FirstUnvisitedLink);
  Spider.CollectLinks;
  GlobalLinks.FirstUnvisitedLink.Visited := True; <-- I left this out in the prior post
  GlobalLinks.AddLinks(Spider.InternalLinks);
end;
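For anyone who wants to run the flow above, here is a rough translation into Python (chosen for brevity; it is not the original Delphi code, which was not posted). The `fetch_links` callable and the in-memory `FAKE_SITE` are hypothetical stand-ins for real downloading and parsing, and visiting and collecting are folded into one `visit` call.

```python
class Link:
    def __init__(self, text):
        self.text = text      # the URL
        self.visited = False  # set to True once the page has been spidered


class Links:
    """Collection of Link objects that ignores URLs it already holds."""

    def __init__(self):
        self._by_url = {}

    def add(self, url):
        if url not in self._by_url:
            self._by_url[url] = Link(url)

    def add_links(self, urls):
        for url in urls:
            self.add(url)

    def first_unvisited(self):
        return next((l for l in self._by_url.values() if not l.visited), None)

    def all_visited(self):
        return self.first_unvisited() is None

    def urls(self):
        return list(self._by_url)


class Page:
    """Visits one URL and keeps the links found there in an internal list."""

    def __init__(self, fetch_links):
        self._fetch_links = fetch_links  # hypothetical: url -> iterable of URLs
        self.internal_links = []

    def visit(self, url):
        self.internal_links = list(self._fetch_links(url))


# Stand-in for a real website: each URL maps to the links on that page.
FAKE_SITE = {
    "http://start.example/": ["http://start.example/a", "http://start.example/b"],
    "http://start.example/a": ["http://start.example/", "http://start.example/c"],
    "http://start.example/b": [],
    "http://start.example/c": ["http://start.example/a"],
}


def crawl(start_url, fetch_links):
    global_links = Links()
    spider = Page(fetch_links)
    global_links.add(start_url)
    while not global_links.all_visited():
        current = global_links.first_unvisited()
        spider.visit(current.text)
        current.visited = True  # the flag update added in the correction
        global_links.add_links(spider.internal_links)
    return global_links.urls()


print(crawl("http://start.example/", lambda url: FAKE_SITE.get(url, [])))
```

Because pages link back to each other here, the loop only ends thanks to the two safeguards from the design: duplicates are never added again, and every visited link is flagged.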
#12
Posted 04 January 2011 - 03:37 AM
Well, I don't know about custom classes. I'll search for them and then try it. I hope this works, because the code is exactly what I need to do, but I don't know about classes for now.











