There are many libraries available for the Delphi/Free Pascal environment that make it easy to capture the raw content of a web page. But if you are working under Windows with at least Internet Explorer 5 installed, you already have access to a small, easy-to-use, and fast web client. This web client gives you the raw HTML source of a web page, and can even hand back a parsed XML document when the server returns XML.
This web client is IXmlHttpRequest, which is part of Microsoft XML Core Services (MSXML). Official information about MSXML is available here; for IXmlHttpRequest specifically, you can read the official documentation here.
Quote from the official page on IXmlHttpRequest:
Provides client-side protocol support for communication with HTTP servers.
Setting Up The Demo Application and GUI
- Create a new application. Name the project "CaptureHtml" and save the autocreated form as "Form_Main.pas".
- Drop a TLabel from the Standard tab of the Component Palette. It will automatically be named Label1; leave it as is. Set its Caption property to "URL".
- Drop a TEdit. Name it edtUrl.
- Drop a TSpeedButton from the Additional tab of the Component Palette. Place it to the right of edtUrl and name it btRetrieveHtmlSource.
- Drop a TMemo from the Standard tab of the Component Palette. It will automatically be named Memo1. Place and resize it under edtUrl.
Adjust the position and size of the controls so you get something like shown below.
In Delphi 7 and above (I cannot verify other versions), the MSXML-related interfaces are declared in the unit MsXml.pas. Therefore you have to use this unit in order to access them: add MsXml to your interface uses list.
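For example, the interface-section uses clause of Form_Main.pas might look like the sketch below. All unit names other than MsXml are just the defaults Delphi generates for a new form; adjust them to match your project.

```pascal
uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls,
  Forms, Dialogs, StdCtrls, Buttons,
  MsXml; // declares IXMLHttpRequest and the CoXMLHTTPRequest co-class
```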
In our demo project, HTML source retrieval starts after the user specifies the URL of the page whose source they want. So, at design time, double-click btRetrieveHtmlSource to generate the skeleton code for its OnClick event handler; our retrieval code goes there. Place the following code in btRetrieveHtmlSource's OnClick handler and study it carefully. I have put extensive comments there to help you.
procedure TForm1.btRetrieveHtmlSourceClick(Sender: TObject);
var
  vUrl: WideString;
  vClient: IXMLHttpRequest;
  vBodyData: OleVariant;
begin
  // Clear the memo control where we will show the page's raw content
  Memo1.Clear;

  // Massage the URL of the web page
  vUrl := Trim(edtUrl.Text);
  if vUrl = '' then
    raise Exception.Create('No URL specified');

  // Make sure the protocol prefix http:// is always present.
  // Check against a lowercased copy only, so the URL itself
  // keeps its original case (paths can be case-sensitive).
  if (System.Pos('http://', LowerCase(vUrl)) < 1) and
     (System.Pos('https://', LowerCase(vUrl)) < 1) then
  begin
    vUrl := 'http://' + vUrl;
    edtUrl.Text := vUrl;
  end;

  // Create the web client
  vClient := CoXMLHTTPRequest.Create;

  // Initialize with the type of operation we want the web client to do.
  // The third parameter (False) requests a synchronous call.
  vClient.open('GET', vUrl, False, EmptyParam, EmptyParam);

  // Set up the initial body data to empty
  vBodyData := EmptyParam;

  // This is where the retrieval operation actually begins
  vClient.send(vBodyData);

  // Wait until the web client completes the retrieval; "complete" here
  // can mean success or failure. Because we opened the request
  // synchronously, send already blocks until completion, so this loop
  // is just a safeguard (it becomes essential if you pass True to open).
  repeat
    Application.ProcessMessages;
    Sleep(100);
  until vClient.readyState = 4;

  // Show the headers we got
  Memo1.Lines.Add('HEADER:');
  Memo1.Lines.Add(vClient.getAllResponseHeaders);

  // Show the raw content of the page
  Memo1.Lines.Add('');
  Memo1.Lines.Add('CONTENT:');
  Memo1.Lines.Add(vClient.responseText);
end;
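As mentioned in the introduction, the same client can also hand back parsed XML when the server returns an XML document. Here is a minimal sketch of that idea, assuming vClient holds a completed request (readyState = 4) whose response body is well-formed XML; IXMLDOMDocument2 also comes from the MsXml unit.

```pascal
var
  vDoc: IXMLDOMDocument2;
begin
  // responseXML is declared as a plain IDispatch in the type library,
  // so cast it to the DOM document interface from MsXml.
  vDoc := vClient.responseXML as IXMLDOMDocument2;
  if (vDoc <> nil) and (vDoc.documentElement <> nil) then
    Memo1.Lines.Add('Root element: ' + vDoc.documentElement.nodeName);
end;
```

For an ordinary HTML page this will usually yield an empty document, since most HTML is not well-formed XML; responseText remains the right property for raw HTML.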
And with that, we are finished with the demo project. Time for testing. Let's run it.
Running The Demo
Upon running the demo, you will get something like this:
Now enter "http://www.google.com" into the edit box, then click the button. The retrieval process will run, and if everything is okay you will get the HTML source of the index page of www.google.com, something like what is shown below.
So that's it. Ain't it easy to get the HTML source of a web page? The full source code of the demo is attached to this tutorial. Feel free to use or improve it for anything.