I'm working on a script to parse URLs out of a Sitemap.xml file. The script works fine on some sitemap files but not on others. The ones it doesn't work on have specific XML tags like:
<urlset xsi:schemaLocation="http://www.xxxx.org/schemas/sitemap/0.9 http://www.sitemaps....0.9/sitemap.xsd">
<url>
<loc>
http://xxxxxx.wordpr...010/05/16/blog/
</loc>
<lastmod>2010-05-20T13:39:02+00:00</lastmod>
<changefreq>monthly</changefreq>
</url>
I end up picking up the <lastmod> and <changefreq> tags along with the URL. How can I modify my preg function to pull just the URL?
Here is the snippet of code where I loop through the file:
foreach ( $lines as $line_number => $line )
{
    $line = trim($line);
    preg_match_all('/(?<=\<loc\>)(.*?)(?=\<\/loc\>)/U', $line, $matches, PREG_SET_ORDER);
    if ($matches)
    {
        if ( $matches[0][0] != '' )
        {
            $allMatches[] = $matches[0][0];
        }
    }
}
Also, is there a good document that explains pattern matching in detail?
Thanks in advance
Rob
Help with PREG_MATCH_ALL function
Started by rsnider19, May 20 2010 06:19 AM
7 replies to this topic
#1
Posted 20 May 2010 - 06:19 AM
#2
Posted 20 May 2010 - 07:57 AM
Hmm, I'm not sure exactly what your script looks like, but I recreated it from the pattern you supplied and it worked for me:
$file = file_get_contents("test.txt");
$allMatches = array();
// "s" lets "." match newlines; "i" makes the match case-insensitive.
// Caution: with /U, the lazy ".*?" becomes greedy.
preg_match_all('/(?<=\<loc\>)(.*?)(?=\<\/loc\>)/Uis', $file, $matches, PREG_SET_ORDER);
// With PREG_SET_ORDER, each entry of $matches holds one full match at index 0.
foreach ($matches as $match)
{
    $url = trim($match[0]);
    if ($url != '')
    {
        $allMatches[] = $url;
    }
}
print_r($allMatches);
I made $file contain the whole file contents and then added "is" after /U in the pattern so that it applies across all lines. Change "test.txt" to the name of the file containing the XML you supplied.
#3
Posted 20 May 2010 - 09:07 AM
Thanks for your reply. I added the "is" and am still getting the same results. I have attached the script and test files. I am running the script on WAMP; not sure if that makes a difference.
AP1 contains address of sitemap(s)
SITEMAP.php is the script
URLLIST.TXT is output of run
Let me know if you get the same results.
Thanks again
Rob
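A note on the pattern itself: one likely reason the "is" flags make no difference is the /U modifier. /U inverts greediness, so the lazy (.*?) in the pattern becomes greedy and runs to the last </loc> it can find. On a sitemap where several <url> entries share one line, that would pull in <lastmod> and <changefreq> as described. This is only a sketch of that effect: the sample string is made up, and it assumes the failing sitemaps have more than one entry per line.

```php
<?php
// Hypothetical sitemap fragment with two entries on a single line.
$xml = '<urlset><url><loc>http://example.com/a/</loc><lastmod>2010-05-20</lastmod></url>'
     . '<url><loc>http://example.com/b/</loc><changefreq>monthly</changefreq></url></urlset>';

// With /U, ".*?" turns greedy and stretches to the LAST </loc>.
preg_match_all('/(?<=<loc>)(.*?)(?=<\/loc>)/Us', $xml, $greedy);

// Without /U, ".*?" stays lazy and stops at the first </loc>.
// The \s* parts drop whitespace around a URL that sits on its own line.
preg_match_all('/<loc>\s*(.*?)\s*<\/loc>/s', $xml, $lazy);

print_r($greedy[1]); // a single entry that still contains <lastmod> markup
print_r($lazy[1]);   // one entry per URL
```

Dropping /U (or keeping /U and removing the "?") is the smallest change to the original pattern; which one to pick is a matter of taste.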
#4
Posted 20 May 2010 - 09:44 AM
Another option would be to do it like this:
$xml = new SimpleXML($lines);
$url = $xml->url->loc;
which parses the XML as XML instead.
__________________________________________
I study Information Systems at Karlstad University when I'm not on CodeCall
#5
Posted 20 May 2010 - 11:13 AM
Sorry, I'm not familiar with SimpleXML. I got the following error trying it:
Fatal error: Class 'SimpleXML' not found in C:\wamp\www\buildsite.php on line 10
Do you mean this?
$xml = new SimpleXMLElement($lines);
If so, I get:
Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in C:\wamp\www\buildsite.php:10 Stack trace: #0 C:\wamp\www\buildsite.php(10): SimpleXMLElement->__construct('') #1 {main} thrown in C:\wamp\www\buildsite.php on line 10
Thanks,
Rob
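For what it's worth, the stack trace shows SimpleXMLElement->__construct('') being called, which suggests $lines was empty at that point rather than that the sitemap is malformed. SimpleXMLElement also expects one string, not an array of lines such as file() returns. A minimal sketch under those assumptions follows; extractSitemapUrls() and the sample data are illustrative names, not part of the attached script.

```php
<?php
// extractSitemapUrls() is a hypothetical helper, not a built-in function.
function extractSitemapUrls($source)
{
    // file() returns an array of lines; SimpleXMLElement needs one string.
    if (is_array($source)) {
        $source = implode('', $source);
    }
    // An empty string is what triggers "String could not be parsed as XML".
    if (trim($source) === '') {
        return array();
    }

    try {
        $xml = new SimpleXMLElement($source);
    } catch (Exception $e) {
        return array(); // input was not well-formed XML
    }

    $urls = array();
    // $xml->url->loc alone reads only the first entry, so loop over all of them.
    foreach ($xml->url as $url) {
        $urls[] = trim((string) $url->loc);
    }
    return $urls;
}

// Made-up sample data in the shape of the sitemap from the first post.
$source = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      http://example.com/2010/05/16/blog/
    </loc>
    <lastmod>2010-05-20T13:39:02+00:00</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://example.com/about/</loc>
  </url>
</urlset>
XML;

print_r(extractSitemapUrls($source));
```

In a real script, $source would more likely come from file_get_contents() on the sitemap address, checked for false before parsing.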
#6
Posted 20 May 2010 - 11:43 AM
Ah, sorry, that's what I meant. Oh, so they're not real XML files either?
#7
Posted 20 May 2010 - 11:46 AM
Never mind, I got it working with SimpleXML. Thanks for pointing that out to me because, well, it was Simple :)
Thanks,
Rob
#8
Posted 20 May 2010 - 11:47 AM
What was wrong, and how did you solve it?

