Help with PREG_MATCH_ALL function

7 replies to this topic

#1
rsnider19

    Learning Programmer

  • Members
  • 34 posts
I'm working on a script to parse URLs out of a Sitemap.xml file. The script works fine on some sitemap files but not on others. The ones it doesn't work on have specific XML tags like:

<urlset xsi:schemaLocation="http://www.xxxx.org/schemas/sitemap/0.9 http://www.sitemaps....0.9/sitemap.xsd">

<url>
<loc>
http://xxxxxx.wordpr...010/05/16/blog/
</loc>
<lastmod>2010-05-20T13:39:02+00:00</lastmod>
<changefreq>monthly</changefreq>
</url>

I end up picking up the <lastmod> and <changefreq> tags along with the URL. How can I modify my preg call to pull just the URL?

Here is the snippet of code where I am looping through the file

foreach ( $lines as $line_number => $line )
{
    $line = trim($line);

    preg_match_all('/(?<=\<loc\>)(.*?)(?=\<\/loc\>)/U', $line, $matches, PREG_SET_ORDER);

    if ($matches)
    {
        if ( $matches[0][0] != '' )
        {
            $allMatches[] = $matches[0][0];
        }
    }
}

Also, is there a good document that explains pattern matching in detail?

Thanks in advance
Rob

#2
webcodez

    Programmer

  • Members
  • 149 posts
Hmm, I'm not quite sure what your script looks like exactly, but I recreated it from the pattern you supplied, and it worked for me:

$file = file_get_contents("test.txt");
preg_match_all('/(?<=\<loc\>)(.*?)(?=\<\/loc\>)/Uis', $file, $matches, PREG_SET_ORDER);

if ($matches)
{
    if ( $matches[0][0] != '' )
    {
        $allMatches[] = $matches[0][0];
    }
}

print_r($allMatches);

I made $file contain the whole file contents and then added "is" after /U in the pattern so that it applies across all lines ("s" lets the dot match newlines, "i" makes the match case-insensitive). Change "test.txt" to the name of the file containing the XML you supplied.
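
One thing that may be worth checking in the pattern above: in PCRE the U modifier inverts greediness, so with /U the quantifier .*? becomes greedy instead of lazy. That could explain a match that runs from the first <loc> through to the last </loc>, picking up <lastmod> and <changefreq> on the way, when the sitemap arrives as one long line. Below is a minimal sketch without U that collects every match rather than only the first. The sample string in $xml and the example.com URLs are made up for illustration.

```php
<?php
// Hypothetical sample data; a real script would read the sitemap file instead.
$xml = '<url><loc> http://example.com/a/ </loc><lastmod>2010-05-20</lastmod></url>'
     . '<url><loc>http://example.com/b/</loc><changefreq>monthly</changefreq></url>';

// No U modifier, so .*? stays lazy and stops at the nearest </loc>.
// s lets . match newlines; i makes the tag names case-insensitive.
preg_match_all('/<loc>(.*?)<\/loc>/is', $xml, $matches);

// With the default flags, $matches[1] holds the captured text of every <loc>.
$allMatches = array();
foreach ($matches[1] as $url) {
    $url = trim($url);
    if ($url !== '') {
        $allMatches[] = $url;
    }
}

print_r($allMatches);
```

The capture group replaces the lookbehind and lookahead here only to keep the pattern short; the lookaround form should behave the same once U is dropped.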

#3
rsnider19

    Learning Programmer

  • Members
  • 34 posts
Thanks for your reply. I added the "is" and am still getting the same results. I have attached the script and test files. I am running the script on WAMP; I'm not sure if that makes a difference.

AP1 contains address of sitemap(s)
SITEMAP.php is the script
URLLIST.TXT is output of run

Let me know if you get the same results.

Thanks again
Rob


#4
Orjan

    Writes binary right handed and hex left handed

  • Moderators
  • 3,299 posts
Another approach would be to do it like this:
$xml = new SimpleXML($lines);
$url = $xml->url->loc;

which parses the XML as XML instead of treating it as plain text.
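
For readers who have not used SimpleXML, here is a rough sketch of that idea, not the exact code from this thread. It assumes the whole sitemap is available as a single string; the sample document and its example.com URLs are placeholders.

```php
<?php
// Hypothetical sample sitemap; the example.com URLs are placeholders.
$source = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      http://example.com/2010/05/16/blog/
    </loc>
    <lastmod>2010-05-20T13:39:02+00:00</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://example.com/about/</loc>
  </url>
</urlset>
XML;

// simplexml_load_string() returns false when the string is not well-formed XML.
$xml = simplexml_load_string($source);
if ($xml === false) {
    die("Could not parse the sitemap\n");
}

// Walk every <url> element, cast its <loc> child to a string, and trim
// the whitespace that surrounds the URL in the file.
$urls = array();
foreach ($xml->url as $entry) {
    $urls[] = trim((string) $entry->loc);
}

print_r($urls);
```

Because the parser only hands back the <loc> text, the <lastmod> and <changefreq> elements never end up in the result.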
__________________________________________
I study Information Systems at Karlstad University when I'm not on CodeCall

#5
rsnider19

    Learning Programmer

  • Members
  • 34 posts
Sorry, I'm not familiar with SimpleXML. I got the following error trying it:

Fatal error: Class 'SimpleXML' not found in C:\wamp\www\buildsite.php on line 10

Did you mean this?

$xml = new SimpleXMLElement($lines);

If so, I get:

Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in C:\wamp\www\buildsite.php:10 Stack trace: #0 C:\wamp\www\buildsite.php(10): SimpleXMLElement->__construct('') #1 {main} thrown in C:\wamp\www\buildsite.php on line 10
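
The stack trace above shows SimpleXMLElement->__construct(''), which suggests an empty string reached the constructor rather than the sitemap text. The thread does not say how $lines was built, so the cause is a guess. One way to sidestep the question is to let SimpleXML read the file itself and report parse errors instead of throwing. In the sketch below, loadSitemap() is a hypothetical helper and the temporary file only stands in for a real sitemap.

```php
<?php
// Hypothetical helper: returns the parsed document, or false after
// printing the libxml errors, so bad input does not end in a fatal error.
function loadSitemap($path)
{
    // Collect parse errors instead of emitting PHP warnings.
    libxml_use_internal_errors(true);

    $xml = simplexml_load_file($path);
    if ($xml === false) {
        foreach (libxml_get_errors() as $error) {
            echo 'XML error: ', trim($error->message), "\n";
        }
        libxml_clear_errors();
    }
    return $xml;
}

// Demonstration with a throwaway file; a real script would pass the sitemap path.
$path = tempnam(sys_get_temp_dir(), 'sitemap');
file_put_contents($path, '<urlset><url><loc>http://example.com/</loc></url></urlset>');

$xml = loadSitemap($path);
$firstUrl = ($xml === false) ? null : (string) $xml->url->loc;
echo $firstUrl, "\n";

unlink($path);
```

If the content is already in memory as an array of lines, joining it into one string before handing it to simplexml_load_string() is another option.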

Thanks,
Rob



#6
Orjan

    Writes binary right handed and hex left handed

  • Moderators
  • 3,299 posts
Ah, sorry, that is what I meant. Oh, so they're not real XML files either?

#7
rsnider19

    Learning Programmer

  • Members
  • 34 posts
Never mind, I got it working with SimpleXML. Thanks for pointing that out to me because, well, it was Simple :)

Thanks,
Rob

#8
Orjan

    Writes binary right handed and hex left handed

  • Moderators
  • 3,299 posts
What was wrong, and how did you solve it?