
How To Capture Html Source Of A Web Page

html source web page network

2 replies to this topic

#1 Luthfi

Posted 20 April 2012 - 05:31 AM

Overview

There are many libraries available for the Delphi/FreePascal environment that make it easy to capture the raw content of a web page. But if you are working under Windows with at least Internet Explorer 5 installed, you already have access to a small, fast, and easy-to-use web client. This web client lets you read the raw HTML source of a web page as plain text (and as a parsed XML document when the response is well-formed XML).

This web client is IXmlHttpRequest, which is part of Microsoft XML Core Services (MSXML). Microsoft publishes official documentation for both MSXML and IXmlHttpRequest.

Quote from the official page on IXmlHttpRequest:

Provides client-side protocol support for communication with HTTP servers.



Demo Project

Set Up the Demo Application and GUI
  • Create a new application. Name the project "CaptureHtml", and save the auto-created form as "Form_Main.pas".
  • Drop a TLabel from the Standard tab of the Component Palette. It will automatically be named Label1. Leave the name as is. Set its Caption property to "URL".
  • Drop a TEdit. Name it edtUrl.
  • Drop a TSpeedButton from the Additional tab of the Component Palette. Place it to the right of edtUrl. Name it btRetrieveHtmlSource.
  • Drop a TMemo from the Standard tab of the Component Palette. It will automatically be named Memo1. Place it under edtUrl and resize it.

Adjust the position and size of the controls so you get something like the layout shown below.

WebSourceCapture_Design001.jpg


Retrieval Code

In Delphi 7 and above (I cannot check other versions), the MSXML-related interfaces are declared in unit MsXml.pas, so you have to use this unit to access them. Add the MsXml unit to the uses list in the interface section of your form unit.

In our demo project, the HTML source is retrieved after the user enters the URL of the page and clicks the button. So at design time, double-click btRetrieveHtmlSource to generate the skeleton code for its OnClick event handler; this is where the retrieval code goes. Place the following code in btRetrieveHtmlSource's OnClick event handler and study it carefully. I have put extensive comments there to help you.

procedure TForm1.btRetrieveHtmlSourceClick(Sender: TObject);
var
  vUrl: WideString;
  vLowerUrl: string;
  vClient: IXMLHttpRequest;
  vBodyData: OleVariant;
begin
  // Clear the memo control where we want to show the raw page content
  Memo1.Clear;

  // Trim the URL but keep its case: paths and query strings can be
  // case-sensitive, so lower-casing the whole URL could break them
  vUrl := Trim(edtUrl.Text);
  if vUrl = '' then
    raise Exception.Create('No URL specified');

  // Make sure the URL starts with a protocol prefix (http:// or https://).
  // The check is done on a lower-cased copy and only at position 1.
  vLowerUrl := LowerCase(vUrl);
  if (System.Pos('http://', vLowerUrl) <> 1)
     and (System.Pos('https://', vLowerUrl) <> 1)
  then begin
    vUrl := 'http://' + vUrl;
    edtUrl.Text := vUrl;
  end;

  // Create the web client
  vClient := CoXMLHTTPRequest.Create;

  // Tell the web client which operation to perform. The third argument
  // (varAsync) is False, so the request is synchronous.
  vClient.open('GET', vUrl, False, EmptyParam, EmptyParam);

  // A GET request carries no body, so pass an empty parameter
  vBodyData := EmptyParam;

  // This is where the retrieval actually begins. In synchronous mode,
  // send returns only after the request has completed (success or failure).
  vClient.send(vBodyData);

  // Wait until the web client reports readyState 4 (completed).
  // In synchronous mode this loop is skipped; it only matters if
  // open is called with varAsync = True.
  while vClient.readyState <> 4 do
  begin
    Application.ProcessMessages;
    Sleep(100);
  end;

  // Show the response headers we received
  Memo1.Lines.Add('HEADER:');
  Memo1.Lines.Add(vClient.getAllResponseHeaders);

  // Show the raw content of the page
  Memo1.Lines.Add('');
  Memo1.Lines.Add('CONTENT:');
  Memo1.Lines.Add(vClient.responseText);
end;
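
The URL clean-up step in the handler above can also be pulled out into a small function of its own. The following is only a minimal sketch: the function name EnsureHttpPrefix and the sample URLs are my own assumptions and are not part of the attached demo. It trims the input, keeps the original letter case, and prepends http:// only when the text does not already start with http:// or https://.

```pascal
program NormalizeUrlDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not in the original demo): returns AUrl with a
// protocol prefix, adding 'http://' when none is present.
function EnsureHttpPrefix(const AUrl: string): string;
var
  vLower: string;
begin
  Result := Trim(AUrl);
  if Result = '' then
    raise Exception.Create('No URL specified');

  // Compare on a lower-cased copy so 'HTTP://' is recognised too,
  // and require the prefix at position 1, not just anywhere in the text.
  vLower := LowerCase(Result);
  if (Pos('http://', vLower) <> 1) and (Pos('https://', vLower) <> 1) then
    Result := 'http://' + Result;
end;

begin
  WriteLn(EnsureHttpPrefix('www.google.com'));           // http://www.google.com
  WriteLn(EnsureHttpPrefix('HTTPS://example.com/Path')); // HTTPS://example.com/Path
end.
```

With a helper like this, the OnClick handler would only need to call it once and write the result back to edtUrl.Text.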

And with that, the demo project is finished. Time for testing. Let's run the demo project.


Running The Demo

Upon running the demo, you will get something like this:

CaptureHtmlSrc_Run00.png


Now enter "http://www.google.com" into the edit box, then click the button. The retrieval process will start, and if everything is okay you will get the HTML source of the index page of www.google.com, something like what is shown below.

WebSource_Captured001.jpg
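
"If everything is okay" can also be checked in code. IXMLHttpRequest exposes the HTTP result through its status and statusText properties. Below is a minimal console sketch of such a check, assuming the same MsXml unit and CoXMLHTTPRequest class used in the demo (in newer Delphi versions the unit may need a scope prefix). The helper IsSuccessStatus and the rule "any 2xx code counts as success" are my own assumptions, not part of the attached demo.

```pascal
program CheckStatusDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Variants, ActiveX, MSXML;

// Hypothetical helper: treats any 2xx HTTP status code as success.
function IsSuccessStatus(AStatus: Integer): Boolean;
begin
  Result := (AStatus >= 200) and (AStatus < 300);
end;

var
  vClient: IXMLHttpRequest;
begin
  // A console program has to initialise COM itself; a VCL forms
  // application normally has this done for it.
  CoInitialize(nil);
  try
    try
      vClient := CoXMLHTTPRequest.Create;
      // Synchronous request, as in the demo handler
      vClient.open('GET', 'http://www.google.com', False, EmptyParam, EmptyParam);
      vClient.send(EmptyParam);

      if IsSuccessStatus(vClient.status) then
        WriteLn('OK, received ', Length(vClient.responseText), ' characters')
      else
        WriteLn('Request failed: ', vClient.status, ' ', string(vClient.statusText));
    except
      // Network-level problems (no connection, unknown host) surface as exceptions
      on E: Exception do
        WriteLn('Request could not be completed: ', E.Message);
    end;
  finally
    // Release the interface before COM is shut down
    vClient := nil;
    CoUninitialize;
  end;
end.
```

In the demo form, the same check could go right after the send call, before the headers and content are written to Memo1.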

So that's it. Ain't it easy to get the HTML source of a web page? The full source code of the demo is attached to this tutorial. Feel free to use or improve it for anything.

Cheers!

Attached Files

  • Attached File  Demo.zip   221.94KB   1347 downloads


#2 papabear

Posted 20 April 2012 - 05:17 PM

Very nice and informative tutorial. I like how you explained it and did it, job well done :)
I managed to make my first Pascal program! :)

#3 Luthfi

Posted 23 April 2012 - 10:40 AM

You did? That's great!