
Extracting parts of a web page...


#1
Jurgnetje

Hi,

I'm trying to write some code to extract the content of a given container (e.g. <div id="content">), markup included.

Every regex solution I tried tripped over the closing </div> tags of elements nested inside the big container.

I then started digging into the php.net/dom approach, but the best I got was plain text only (all the markup was gone).

Here's my sample code:
(I actually used the PHP home page for this test.)

<?php

$url = 'myfavoritesite.net';

$html = new DOMDocument();
@$html->loadHTMLFile($url);

$result = $html->getElementById('content');
$text = utf8_decode($result->nodeValue);

// output the result
echo "<pre>" . $text . "</pre>";

?>


I really hope there's some genius here who can enlighten me!

#2
Jurgnetje

Hi guys,

Just for your information: I found the solution. nodeValue only returns the text, so I serialize each child node back to HTML instead.
Here's the code:

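<?php

$url = 'myfavoritesite.net';   // placeholder; use a full URL including http://
$elem_id = 'content';
$innerHTML = '';

$doc = new DOMDocument();

// the @ hides the warnings that sloppy real-world HTML triggers
@$doc->loadHTMLFile($url);

$elem = $doc->getElementById($elem_id);

// getElementById() returns null when no element has that id
if ($elem === null) {
    die("No element with id '$elem_id' found.");
}

// loop through all childNodes, getting html:
// copy each child into a temporary document and save that as HTML
foreach ($elem->childNodes as $child) {
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child, true));
    $innerHTML .= $tmp_doc->saveHTML();
}

echo $innerHTML;

?>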
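
A possibly shorter variant of the same idea, as a rough sketch only: since PHP 5.3.6, saveHTML() accepts a node argument, so the temporary document may not be needed. The helper name innerHtml() and the inline sample markup are my own placeholders (not from the real page), and the return type declaration assumes PHP 7 or later.

```php
<?php

// Hypothetical helper: returns the markup *inside* $node
// by serializing each of its children in turn.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument dumps only that subtree
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Inline sample markup, standing in for the fetched page (an assumption).
$sample = '<html><body><div id="content">'
        . '<p>Hello <b>world</b></p><div>nested</div>'
        . '</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

$elem = $doc->getElementById('content');

if ($elem !== null) {
    // Prints the paragraph and the nested div with their tags intact.
    echo innerHtml($elem), "\n";
}
```

The nested <div> comes through untouched because the parser tracks the element tree, which is exactly where the regex attempts fell over. Whitespace between the serialized children may differ from the source.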



#3
smith

haha, funny avatar!
Using a DOMDocument worked? Very nice work.

for (int i;;) {

   cout << "Smith";

}