Developing a Web Crawler

Tagged: authentication


#1 DorumonSg

    CC Lurker

  • Just Joined
  • 5 posts

Posted 18 August 2011 - 09:49 PM

Hi, first things first: I have never done anything more than basic 3-tier web site development and basic cookie and session work before, so this may be a stretch for me.

It doesn't have to be in VB.NET, but I assume VB.NET has easier classes for making Internet Explorer toolbars, since both come from Microsoft? Hopefully the toolbar can be installed on Firefox and Chrome as well.

Okay, here's the deal. I need to create a web crawler that can crawl web sites that require login authentication; assume it's a simple J2EE text form. So I was thinking I would make a toolbar, because if I install the toolbar in a browser, I won't need to go through the hassle of handling authentication for websites myself: I can just search the HTML source of whatever page the browser is showing.

I also want to implement multi-tier searching for my web crawler, where you can follow another hyperlink to continue the search. But the problem is, if I do that from my toolbar, will I face the authentication problem again? In that case I am no longer searching the HTML of the page the browser is on; I am fetching another page from my toolbar, which has no relationship with the browser's authentication cookies, so I would need to handle the authentication for that request too, right?

Can somebody point me in the right direction? What language should I use, and how do I implement cookies in my application? My main concerns are:

1. Which language has the best libraries for what I am trying to do? From Microsoft I can only use VB.NET and C#, because I am very bad at C and C++; I can also use Java.

2. I need to know how authentication works outside of the web browser. I know that upon authentication the web site will usually give your computer cookies, stored in the browser's cache, and on a later visit the site will look for those cookies? But if I were to implement this in an application instead of the normal web browser, how would I make the web site send cookies to my application, and present those cookies back to the site for authentication?
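
To make my question concrete, here is the kind of thing I imagine for point 2, sketched in C# (the URL and the form field names are made up; I would swap in whatever the real login form uses). Is this roughly how it's done?

using System;
using System.IO;
using System.Net;
using System.Text;

class LoginSketch
{
    static void Main()
    {
        // One CookieContainer shared by every request plays the role
        // the browser's cookie store normally plays.
        var cookies = new CookieContainer();

        // 1. POST the login form. "username"/"password" are placeholder
        //    field names; the site's actual <form> defines the real ones.
        var login = (HttpWebRequest)WebRequest.Create("http://example.com/login");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;   // the server's Set-Cookie lands here

        byte[] body = Encoding.UTF8.GetBytes("username=alice&password=secret");
        login.ContentLength = body.Length;
        using (Stream s = login.GetRequestStream())
            s.Write(body, 0, body.Length);
        using (var resp = (HttpWebResponse)login.GetResponse())
            Console.WriteLine("Login status: " + resp.StatusCode);

        // 2. Fetch a protected page with the SAME container; the session
        //    cookie is sent back automatically, so no second login.
        var page = (HttpWebRequest)WebRequest.Create("http://example.com/members/data");
        page.CookieContainer = cookies;
        using (var resp = (HttpWebResponse)page.GetResponse())
        using (var reader = new StreamReader(resp.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd());
    }
}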

#2 WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others

Posted 19 August 2011 - 04:40 AM

I think you need to back up about 3 steps. Why are you talking about piggy-backing this on a browser? Do you know how the HTTP protocol (the way webpages get passed back and forth) works? Will you be crawling secure sites? Do you have examples of sites you want to crawl? What do you intend to do with the crawled sites?

Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

My MineCraft server site: http://banishedwings.enjin.com/


#3 DorumonSg

    CC Lurker

  • Just Joined
  • 5 posts

Posted 19 August 2011 - 10:34 AM

WingedPanther73, on 19 August 2011 - 04:40 AM, said:

    I think you need to back up about 3 steps. Why are you talking about piggy-backing this on a browser? Do you know how the HTTP protocol (the way webpages get passed back and forth) works? Will you be crawling secure sites? Do you have examples of sites you want to crawl? What do you intend to do with the crawled sites?


1. Why not? I want the user to be able to navigate to what he wants to crawl and then do a search accordingly.

2. Not really, which is another reason I want to piggy-back on a browser: I can jump directly into the HTML without touching the authentication. Unless you can point me to a place where I can learn how to make an external application log in to web sites and maintain a session while it works, because I have no idea how cookies would be saved and sent back by an external application, if that's even possible?

3. Yeah, I do, but I don't just want to crawl that specific web site; I want to be able to crawl other web sites that require authentication too. Take, for example, this forum?

4. That's not my call. I am helping a professor build this for her own research purposes. I can also contact the administrators of the web site I will be crawling if I need any technical assistance from their side, so it's all okay. But I want to build the crawler so that it can handle not only the intended web site but most sites as well. I figure the best way to do that is to piggy-back on a browser that supports Java and cookies, so I can dive directly into the HTML without touching much of the request, POST, GET, and cookie plumbing. It would work like this: the user logs in, browses to the intended page, and then starts crawling from the toolbar.

#4 WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others

Posted 19 August 2011 - 10:42 AM

I have a feeling you are using the phrase "web crawler" differently from how it is normally used. Normally, a web crawler just follows all the links through a site, keeping a copy of the HTML in a separate location for further processing. You would not normally do any form submissions as part of the process. It sounds like you're trying to do something different, but it's not really clear what. You keep talking about the HTML coding, for example, but a web crawler doesn't normally need to know or care what the HTML means; it just needs to identify the links to request next.
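
For example, that whole "normal" crawl loop fits in a page of code. Here's a bare-bones sketch in C# (one of the languages you mentioned), against a made-up start URL; note that it never interprets the HTML, it only saves it and harvests hrefs:

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class CrawlerSketch
{
    static void Main()
    {
        var start = new Uri("http://example.com/");   // hypothetical start page
        var queue = new Queue<Uri>();
        var seen = new HashSet<string>();
        queue.Enqueue(start);
        seen.Add(start.AbsoluteUri);

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && seen.Count < 100)  // cap for the sketch
            {
                Uri page = queue.Dequeue();
                string html;
                try { html = client.DownloadString(page); }
                catch (WebException) { continue; }       // skip dead links

                // "Keep a copy of the HTML for further processing":
                // here we just report the size; a real crawler writes it to disk.
                Console.WriteLine("{0} ({1} bytes)", page, html.Length);

                // Identify the links to request next; no understanding of the
                // page needed. A crude regex is enough for a sketch.
                foreach (Match m in Regex.Matches(html,
                         "href\\s*=\\s*[\"']([^\"'#]+)[\"']",
                         RegexOptions.IgnoreCase))
                {
                    Uri link;
                    if (Uri.TryCreate(page, m.Groups[1].Value, out link)
                        && link.Host == start.Host        // stay on one site
                        && seen.Add(link.AbsoluteUri))
                        queue.Enqueue(link);
                }
            }
        }
    }
}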

With all that said, have you looked at the FoxySpider addon for Firefox? It may do what you're looking for.



#5 DorumonSg

    CC Lurker

  • Just Joined
  • 5 posts

Posted 19 August 2011 - 10:53 AM

WingedPanther73, on 19 August 2011 - 10:42 AM, said:

    I have a feeling you are using the phrase "web crawler" differently from how it is normally used. Normally, a web crawler just follows all the links through a site, keeping a copy of the HTML in a separate location for further processing. You would not normally do any form submissions as part of the process. It sounds like you're trying to do something different, but it's not really clear what. You keep talking about the HTML coding, for example, but a web crawler doesn't normally need to know or care what the HTML means; it just needs to identify the links to request next.

    With all that said, have you looked at the FoxySpider addon for Firefox? It may do what you're looking for.


Ah, I see... that's just how the project was described to me, so I wrote it up as a web crawler. No, I can't use an addon, because this is supposed to be my own assignment.

Now that I think about it, it's more like a Ctrl-F function that can save the data.

I found out that Visual Basic provides a WebBrowser control that is based on IE? May I know if it supports cookies too? If it does, my job gets a lot simpler: I can just wrap the WebBrowser in a search function that scans the page's HTML and saves whatever data matches the search keyword.

And if I wish to crawl other hyperlinks like a normal web crawler, I can simply search the HTML and save the URLs in an array. With that list I can direct the WebBrowser to the next hyperlink and continue searching and saving data; since the WebBrowser holds the cookies, I don't need to worry about authentication again.
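
Something like this is what I have in mind, sketched in C# since the WebBrowser control works the same from VB.NET (the start URL and the search keyword are just placeholders):

using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Sketch of the plan: let the user log in inside the embedded browser
// (which keeps the cookies), then search each loaded page's HTML and
// follow the links we find.
class BrowserCrawler : Form
{
    WebBrowser browser = new WebBrowser();
    Queue<string> toVisit = new Queue<string>();
    string keyword = "research";              // placeholder search keyword

    public BrowserCrawler()
    {
        browser.Dock = DockStyle.Fill;
        browser.DocumentCompleted += OnPageLoaded;
        Controls.Add(browser);
        browser.Navigate("http://example.com/members/");  // hypothetical
    }

    void OnPageLoaded(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (browser.Document == null) return;

        // The "Ctrl-F that saves data" part: scan the page's HTML.
        string html = browser.DocumentText;
        if (html.Contains(keyword))
            Console.WriteLine("Hit on " + e.Url);

        // Collect hyperlinks into the queue, as described above.
        foreach (HtmlElement a in browser.Document.GetElementsByTagName("a"))
        {
            string href = a.GetAttribute("href");
            if (!string.IsNullOrEmpty(href)) toVisit.Enqueue(href);
        }

        // Navigate to the next link; IE's cookie store rides along.
        if (toVisit.Count > 0) browser.Navigate(toVisit.Dequeue());
    }

    [STAThread]
    static void Main() { Application.Run(new BrowserCrawler()); }
}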

As a side note, I am really interested in how to make an application store cookies and maintain a session with a website. Is there anywhere I can read about it? It may come in useful in the future.

#6 WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts
  • Location:Upstate, South Carolina
  • Programming Language:C, C++, PL/SQL, Delphi/Object Pascal, Pascal, Transact-SQL, Others
  • Learning:Java, C#, PHP, JavaScript, Lisp, Fortran, Haskell, Others

Posted 19 August 2011 - 11:29 AM

As I recall, the WebBrowser control in VB actually is IE (it hosts the same engine), so it should support cookies, etc.

A session is stored on the server and tied to a cookie the browser holds. I'm sure a search will help you understand more about it.
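
In other words, the exchange looks roughly like this (the header names are real, the values are invented): the server hands out a session ID once at login, and the browser echoes it back on every later request. J2EE sites typically call the cookie JSESSIONID:

Client:  POST /login HTTP/1.1
         Content-Type: application/x-www-form-urlencoded

         username=alice&password=secret

Server:  HTTP/1.1 200 OK
         Set-Cookie: JSESSIONID=ab12cd34; Path=/

Client:  GET /members/data HTTP/1.1
         Cookie: JSESSIONID=ab12cd34

Any HTTP library that lets you capture that Set-Cookie value and attach it to later requests can maintain a session the same way a browser does.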







