Hello, I want to get all the links from a site, more exactly the ones inside href attributes, and I did it like this:
string currentAddress;
string currentAddressFull;
List<string> links = new List<string>();
HttpWebRequest req = WebRequest.Create(addressTextBox.Text) as HttpWebRequest;
HttpWebResponse res = req.GetResponse() as HttpWebResponse;
StreamReader reader = new StreamReader(res.GetResponseStream(), true);
string all = reader.ReadToEnd();
string host = req.Address.DnsSafeHost.ToString();
foreach (Match m in href.Matches(all))
{
    currentAddress = Cut(m.ToString());
    if (Connect(currentAddress))
    {
        if (!links.Contains(currentAddress))
            links.Add(currentAddress);
    }
    else
    {
        currentAddressFull = "http://" + host + "/" + currentAddress;
        if (Connect(currentAddressFull))
        {
            if (!links.Contains(currentAddressFull))
                links.Add(currentAddressFull);
        }
    }
}
I get all matches like href="index.html" or href="http://index.html", and using the Cut method I obtain index.html or http://goal.com/index.html.
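For reference, the match-then-Cut step could be folded into a single regex with a capture group. This is only a minimal sketch: the pattern and the names HrefExtractor and Extract are made up for illustration and are not the href regex or the Cut method from the code above, which are not shown here. A regex will also miss unquoted attribute values and other HTML edge cases.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical pattern (an assumption, not the original href regex):
    // matches href="..." or href='...' and captures the value in the "url" group.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the distinct href values in order of first appearance.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            string value = m.Groups["url"].Value.Trim();
            if (value.Length > 0 && !results.Contains(value))
                results.Add(value);
        }
        return results;
    }
}
```

Calling HrefExtractor.Extract(all) would then give the raw attribute values directly, so no separate trimming step is needed.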
The Connect method verifies whether an address is valid.
The problem is that at some point, while stepping through with the debugger, I get an address for which the body of if (Connect(currentAddressFull)) is skipped. That means it cannot connect to that address, but it is a valid address, and I don't understand why.
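One possible cause (an assumption, since Cut and Connect are not shown) is that "http://" + host + "/" + currentAddress does not always produce the address a browser would use, for example when the page sits in a subdirectory, or when the link is something like mailto: or javascript:. A minimal sketch that lets System.Uri do the resolution instead; the names LinkResolver and Resolve are hypothetical:

```csharp
using System;

public static class LinkResolver
{
    // Resolves an href value against the address of the page it was found on.
    // Returns the absolute URL, or null when the value cannot be parsed
    // or is not an http/https link (e.g. mailto:, javascript:).
    public static string Resolve(Uri pageAddress, string href)
    {
        Uri absolute;
        if (!Uri.TryCreate(pageAddress, href, out absolute))
            return null;

        if (absolute.Scheme != Uri.UriSchemeHttp &&
            absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        return absolute.AbsoluteUri;
    }
}
```

With this, LinkResolver.Resolve(req.Address, currentAddress) could replace the manual concatenation. For a page at http://goal.com/news/today.html, the value index.html resolves to http://goal.com/news/index.html rather than http://goal.com/index.html, which may explain why a hand-built address fails to connect.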
Thanks for reading this.
Get all links from a site
Started by bogus, Sep 15 2008 12:44 AM
1 reply to this topic
#1
Posted 15 September 2008 - 12:44 AM
#2
Posted 15 September 2008 - 12:47 AM
There was a small mistake in the example.
I get all matches like href="index.html" or href="http://goal.com/index.html", and using the Cut method I obtain index.html or http://goal.com/index.html.











