Get all links from a site

#1
bogus

    Newbie

  • Members
  • 6 posts
Hello, I want to get all the links from a site, more precisely the values of the href attributes, and this is what I did:

    string currentAddress;
    string currentAddressFull;
    List<string> links = new List<string>();
    HttpWebRequest req = WebRequest.Create(addressTextBox.Text) as HttpWebRequest;
    HttpWebResponse res = req.GetResponse() as HttpWebResponse;
    StreamReader reader = new StreamReader(res.GetResponseStream(), true);
    string all = reader.ReadToEnd();
    string host = req.Address.DnsSafeHost.ToString();
    // href is a Regex defined elsewhere; Cut and Connect are described below
    foreach (Match m in href.Matches(all))
    {
        currentAddress = Cut(m.ToString());
        if (Connect(currentAddress))
        {
            // the link works as it is
            if (!links.Contains(currentAddress))
                links.Add(currentAddress);
        }
        else
        {
            // otherwise treat it as relative and prepend the host
            currentAddressFull = "http://" + host + "/" + currentAddress;
            if (Connect(currentAddressFull))
            {
                if (!links.Contains(currentAddressFull))
                    links.Add(currentAddressFull);
            }
        }
    }

I get all matches like href="index.html" or href="http://index.html", and using the Cut method I obtain index.html or http://goal.com/index.html.
The Connect method verifies whether an address is valid.
The problem is that at some point, stepping through with the debugger, I get an address for which if (Connect(currentAddressFull)) is skipped, which means it cannot connect to that address. But it is a valid address, and I don't understand why.

Thanks for reading this.

#2
bogus

There was a small mistake in the example.

I get all matches like href="index.html" or href="http://goal.com/index.html", and using the Cut method I obtain index.html or http://goal.com/index.html.
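
One possible cause of the skipped addresses, assuming the failing links are relative ones: "http://" + host + "/" + currentAddress always resolves against the site root, so a link such as href="index.html" found on a page inside a subfolder ends up pointing at the wrong place. Below is a minimal sketch of resolving each href against the page address with Uri.TryCreate instead. LinkResolver, ExtractLinks and the Href pattern are made-up names for illustration only; they are not the original href regex or the Cut method, which are not shown in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkResolver
{
    // Hypothetical pattern: captures the value of a quoted href attribute.
    static readonly Regex Href = new Regex(
        "href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

    // Returns the distinct absolute http/https links found in html,
    // resolving relative hrefs against the address of the page itself.
    public static List<string> ExtractLinks(string html, Uri pageUri)
    {
        List<string> links = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            Uri resolved;
            // TryCreate handles both absolute and relative href values.
            if (Uri.TryCreate(pageUri, m.Groups[1].Value, out resolved)
                && (resolved.Scheme == Uri.UriSchemeHttp
                    || resolved.Scheme == Uri.UriSchemeHttps)
                && !links.Contains(resolved.AbsoluteUri))
            {
                links.Add(resolved.AbsoluteUri);
            }
        }
        return links;
    }
}
```

For example, on a hypothetical page http://goal.com/news/today.html, href="index.html" resolves to http://goal.com/news/index.html, while the concatenation above would try http://goal.com/index.html. Whether this explains the failing address depends on what Connect and Cut actually do.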