WebBrowser web scraping


5 replies to this topic

#1 Tonchi

Tonchi

    Helping the world with programming

  • Expert Member
  • 1249 posts
  • Location:Zagreb
  • Programming Language:C#, Others
  • Learning:C, C++, Python, JavaScript, Transact-SQL, Assembly

Posted 31 December 2012 - 10:24 AM

Have you ever wanted to write an application that scrapes data from a web site?
If so, this article is for you.

The .NET Framework is large and powerful, and it covers most of what a typical application needs.
It provides the WebBrowser class, which is a Windows Forms class, but you can use it from other .NET project types as long as the project references System.Windows.Forms and the control runs on an STA thread.
Since I am a C# developer, the examples are in C#.


WebBrowser class


The WebBrowser class is very powerful: with it you can manipulate HTML, navigate between sites, interact with JavaScript functions and more.
Sure, there are other classes that can be used for web scraping, but the WebBrowser class is the easiest to learn.
To work with the WebBrowser class you first need to learn what it can do, so let's start with the basics.
WebBrowser is a non-static class, so you need to create an instance of it.
I will create the instance here so that you recognize it in the following code examples.


System.Windows.Forms.WebBrowser wb = new System.Windows.Forms.WebBrowser();


You can find the full documentation for the class on MSDN, but I will show and explain the parts we need to make a basic web scraper.


Methods in WebBrowser class


Navigate

This is the one method every web scraper needs. When you call it, the control connects to the given URL and starts loading the page.
The best part is that it loads the complete HTML document from the URL, so you can easily work with it afterwards. Note that Navigate returns before loading has finished; the document is ready once the DocumentCompleted event fires. An example of navigating to a site:

wb.Navigate("http://www.imdb.com");

When you call this method, your application connects to www.imdb.com, a site where you can find information about movies such as the rating, the release year, the list of actors and so on. In the later code we will scrape that information into your application.
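
One way to wait for the page is to subscribe to DocumentCompleted before navigating and to read the document only inside that handler. The following is a minimal sketch, assuming a Windows Forms project; the class name ScraperForm and the helper EnsureScheme are illustrative names of my own and are not part of the original tutorial.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical form, used only to illustrate the pattern.
public class ScraperForm : Form
{
    private readonly WebBrowser wb = new WebBrowser();

    public ScraperForm()
    {
        wb.Dock = DockStyle.Fill;
        Controls.Add(wb);

        // Subscribe before navigating so the event cannot be missed.
        wb.DocumentCompleted += OnDocumentCompleted;
        wb.Navigate(EnsureScheme("www.imdb.com"));
    }

    // Hypothetical helper: prefixes "http://" when the address has no scheme.
    public static string EnsureScheme(string address)
    {
        if (string.IsNullOrWhiteSpace(address))
            throw new ArgumentException("Address must not be empty.", "address");

        address = address.Trim();
        return address.Contains("://") ? address : "http://" + address;
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame; wait until the whole page is ready.
        if (wb.ReadyState != WebBrowserReadyState.Complete) return;

        // From here on wb.Document is populated and safe to inspect.
        Text = wb.DocumentTitle;
    }
}
```

Any code that reads wb.Document, including the examples further down, is best called from this handler or after it has run.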

Stop

This method cancels any pending navigation and stops dynamic page elements such as background sounds and animations. It comes in handy if you want to silence sounds coming from the web site, which I find very annoying when I navigate through sites with my application. Call it once the page has loaded; if you call it immediately after Navigate, it can cancel the navigation itself. The navigation click is a Windows system sound, so Stop may not silence it. This is how to call the method:

wb.Stop();

Properties in WebBrowser class

Document

This property gets the HtmlDocument for the page that is currently loaded. Once you have the HtmlDocument, you can read or write the attribute values of its tags. For example:

wb.Document.GetElementById("navbar-query").SetAttribute("value", textBox1.Text);

As you can see, I used the textBox1.Text value to set the "value" attribute of the HTML element. It is comparable to typing input into a console, except that the text goes into the page's search box.
It is wise to check first whether textBox1.Text is empty, because it makes no sense to put an empty input into the "value" attribute of the HtmlDocument.

I mentioned that you can interact with JavaScript functions. So here is an example of using the JavaScript "click" function to programmatically click a button on the web site.

HtmlElement acceptButton = wb.Document.GetElementById("navbar-submit-button");
if (acceptButton != null)
{
    acceptButton.InvokeMember("click");
}

It is also wise to check on imdb.com which button has the element id "navbar-submit-button". That is the button your program clicks with the code above.
We simply use an instance of the HtmlElement class to store the "navbar-submit-button" element from the document. Once it is stored, we check whether it is null. If it is not null, we call the JavaScript "click" function on it to click that button.
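
The two steps, filling in the search box and clicking the button, can be combined in one small helper. This is only a sketch: SearchHelper and TrySubmitSearch are hypothetical names, and the element ids are the ones used in this tutorial, which the site may have changed since.

```csharp
using System.Windows.Forms;

// Hypothetical helper class, not part of the WebBrowser API.
public static class SearchHelper
{
    // Element ids taken from the tutorial; they are assumptions about the page.
    private const string QueryBoxId = "navbar-query";
    private const string SubmitButtonId = "navbar-submit-button";

    // Returns true when the query was entered and the button was clicked.
    public static bool TrySubmitSearch(HtmlDocument document, string query)
    {
        // Nothing to do without a loaded document or a real search term.
        if (document == null || string.IsNullOrWhiteSpace(query)) return false;

        HtmlElement queryBox = document.GetElementById(QueryBoxId);
        HtmlElement submitButton = document.GetElementById(SubmitButtonId);
        if (queryBox == null || submitButton == null) return false;

        queryBox.SetAttribute("value", query.Trim());
        submitButton.InvokeMember("click");
        return true;
    }
}
```

It could be called as SearchHelper.TrySubmitSearch(wb.Document, textBox1.Text) once the page has finished loading; a false return value means nothing was submitted.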

Web scraping

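// Collect every <table> element in the loaded document.
HtmlElementCollection tables = wb.Document.GetElementsByTagName("table");
try
{
    // Nothing to scrape if the page has no tables.
    if (tables.Count <= 0) return;
    // Read the rows of the first table only.
    HtmlElementCollection rows = tables[0].GetElementsByTagName("tr");
    foreach (HtmlElement row in rows)
    {
        HtmlElementCollection cells = row.GetElementsByTagName("td");
        foreach (HtmlElement cell in cells)
        {
            String text = cell.InnerText;
            // Skip cells that are null, empty or whitespace only.
            if (!String.IsNullOrEmpty(text) && !String.IsNullOrWhiteSpace(text))
            {
                listBox1.Items.Add(text);
            }
        }
    }
}
catch (ArgumentOutOfRangeException exc)
{
    // Show the error in the list instead of letting it surface as a dialog.
    listBox1.Items.Add(exc.Message);
}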

Now for the most exciting part of this how-to. This code displays the search results in a listBox1 control, which you have to create on your Form or window.
I created an instance of the HtmlElementCollection class that stores all table tags from the document. Since I ran into an ArgumentOutOfRangeException here, I suggest wrapping the code in a try-catch statement, so that the exception does not pop up in a message box that the user has to close; instead, the message is shown in the listBox1 control, which makes the application more user friendly. It is important to check whether tables.Count is less than or equal to zero; if it is, the method simply returns. Otherwise the code creates a second HtmlElementCollection, "rows", which stores all "tr" elements of the first table.
The outer foreach loop goes through every row in the collection and stores that row's "td" elements in "cells", another HtmlElementCollection.
The inner foreach loop then goes through every cell so that you can read its InnerText and store it in a string variable. This is the most important part.
Finally, the code checks that both String.IsNullOrEmpty and String.IsNullOrWhiteSpace return false for the text variable before adding that InnerText to the listBox1 control. (String.IsNullOrWhiteSpace alone would be enough, since it also returns true for null and empty strings.)
Now run your code and search for a movie; the results will be displayed in the listBox1 control.
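
If you prefer to keep the scraping logic separate from the user interface, the same loops can be moved into a method that returns the cell texts as a list. This is a sketch under the same assumptions as above (first table only, "td" cells only); TableScraper and ExtractCellTexts are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper class, not part of the WebBrowser API.
public static class TableScraper
{
    // Returns the trimmed text of every non-blank <td> in the first table.
    public static List<string> ExtractCellTexts(HtmlDocument document)
    {
        var texts = new List<string>();
        if (document == null) return texts;

        HtmlElementCollection tables = document.GetElementsByTagName("table");
        if (tables.Count == 0) return texts;

        foreach (HtmlElement row in tables[0].GetElementsByTagName("tr"))
        {
            foreach (HtmlElement cell in row.GetElementsByTagName("td"))
            {
                string text = cell.InnerText;
                if (!string.IsNullOrWhiteSpace(text))
                {
                    texts.Add(text.Trim());
                }
            }
        }
        return texts;
    }
}
```

The form can then fill the list with listBox1.Items.AddRange(TableScraper.ExtractCellTexts(wb.Document).ToArray()); and the returned list can just as well be written to a file or a database.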


Note: This tutorial is entirely my own work. I originally posted it on the Microsoft TechNet Wiki.

Edited by Tonchi, 01 January 2013 - 11:12 AM.


Microsoft Student Partner, Microsoft Certified Professional


#2 GrazerCode

GrazerCode

    CC Newcomer

  • Member
  • 24 posts
  • Location:Indonesia
  • Programming Language:C, C++, (Visual) Basic
  • Learning:Java, Python, Ruby

Posted 30 March 2013 - 05:22 AM

Wow, C# is great for making a browser too  :rules:



Inspiration can come everywhere and at anytime ~GrazerCode~


#3 sam_coder

sam_coder

    CC Addict

  • Senior Member
  • 380 posts

Posted 13 May 2014 - 10:07 AM

Great tutorial!  Haven't considered using WebBrowser.

 

I've used the Html Agility Pack though, which also works great. http://htmlagilitypack.codeplex.com/

What's nice about the Html Agility Pack is that it very much mirrors the Xml library built into .NET.  It handles all the inconsistencies of html (e.g. lack of well-formedness) and even allows XPath expressions, something like: /html/body/table/tr/td[@class='odd-row']

 

It's headless, so could be hidden away in the darker depths of an ASP.NET application without having to link Windows.Forms.

 

Anyway, if you still do this type of thing, should check it out! Another tool on the belt is always a great thing!



#4 DataExtraction

DataExtraction

    CC Lurker

  • Just Joined
  • 1 posts

Posted 01 September 2015 - 05:33 AM

Yes. WebBrowser is one of the helpful tools for web scraping if you are using .NET technology :thumbup1:


Edited by DataExtraction, 01 September 2015 - 05:47 AM.


#5 RobP

RobP

    CC Lurker

  • Just Joined
  • 1 posts

Posted 04 November 2015 - 03:30 PM

I've not used Webbrowser before so thanks for the post, will be looking into it very soon. I tend to take the approach of using an HttpWebRequest along with the HtmlAgilityPack to achieve the same thing. Here's another introduction to .net web scraping which uses this technique.



#6 Hawker

Hawker

    CC Lurker

  • Just Joined
  • 1 posts

Posted 25 August 2016 - 10:42 PM

Brilliant tutorial Tonchi, this will help me get started with my new scraper a lot faster now.

