Hello all!
New to the forums and pretty excited to start learning some programming. I have a moderate amount of experience with XHTML/PHP/JS, but I would not classify myself as experienced at all.
Anyway, I'm in need of a program that crawls a URL and retrieves 2 values that are between a specific set of HTML tags.
The ideal program would go through a list of URLs in a text document, crawl the source of each page for the chosen values, and write the values to a new text file in the same order as the URLs.
So essentially the steps here would be:
1. Program opens the txt document and reads the first URL
2. Program opens the URL and uses a check like "IF TRUE PRINT value" to write the value to a new document
3. Program repeats for the next URL until all URLs are accounted for
Can anyone give me some direction on how to accomplish this? I'd really appreciate it! :)
Crawler to retrieve 2 values from source
Started by bemaitea, Jul 31 2010 12:03 PM
5 replies to this topic
#1
Posted 31 July 2010 - 12:03 PM
#2
Posted 31 July 2010 - 01:44 PM
Just a quick update:
Found a precompiled JS web crawler which does EXACTLY what I'm looking for :thumbup:
The only thing is that now I need to figure out how to customize the output HTML file...
I've posted in the JS forum, so if anyone wants to take a look and offer some advice, that would be awesome :)
------
Link to new post:
http://forum.codecal...html#post266377
#3
Posted 31 July 2010 - 06:22 PM
Well, you could try regular expressions to get the values pretty easily out of any file you want.
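To make the regular-expression suggestion concrete, here is a minimal sketch in Python (my choice of language, not something named in the thread). The function name `extract_between` and the sample HTML are made up for illustration. It assumes the start and end tags appear literally in the page source and that you want the first match only.

```python
import re


def extract_between(source, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag, or None."""
    # re.escape treats the tags as literal text, so quotes and angle
    # brackets inside them need no special handling.
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    # re.DOTALL lets the value span several lines; .*? stops at the first end_tag.
    match = re.search(pattern, source, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical page source, used only to demonstrate the call.
sample = '<html><head><title>Demo</title></head><body><h3 class="zmp"> 42 </h3></body></html>'
print(extract_between(sample, "<head>", "</head>"))
print(extract_between(sample, '<h3 class="zmp">', "</h3>"))
```

Keep in mind that a regular expression like this is fragile on real HTML: it will not cope with nested tags of the same name or with attributes written in a different order, so a proper HTML parser is usually the safer choice once pages get complicated.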
#4
Posted 02 August 2010 - 10:39 AM
Here is a script that will do that.
# Script Crawler.txt
var str page, starttag, endtag, source
# Check if values are assigned to page, starttag, endtag
if ( $page=="" OR $starttag=="" OR $endtag=="")
    exit 1 "ERROR: Value for page, starttag or endtag is not specified."
endif
# Read the page into a string variable.
cat $page > $source
# Remove portion before (and including) $starttag.
stex -c ("^"+$starttag+"^]") $source > null
# Remove portion after (and including) $endtag
stex -c ("[^"+$endtag+"^") $source > null
# What's remaining in $source is what we want to extract.
echo $source
The script is in biterscripting. Save the script in file C:/Scripts/Crawler.txt, start biterscripting and enter this command:
script "C:/Scripts/Crawler.txt" page("http://www.somesite.com/somepage.asp") starttag("<head>") endtag("</head>")
This will show you the value extracted from that page between the tags <head> and </head>.
#5
Posted 02 August 2010 - 07:17 PM
Hey JenniC!
Thanks a lot for this script! It is pretty awesome. I had never heard of biterscripting, and after going through the site I can see I might be using it quite often to perform routine tasks! So thank you twofold!
One of the issues I am having with the current script is that the tags I am trying to retrieve have a class attribute, which biterscripting does not like. For example, the tags I want to extract from look like this: <h3 class="zmp"> </h3>. The quotes around zmp lead to Error 351: Invalid syntax. Any ideas?
Also, how difficult would it be to modify this script to read a txt file with a list of URLs and output a new txt file with the extracted values from each URL?
Again, thanks for introducing me to biterscripting. I'm going through the lessons right now, so hopefully I can pick up some knowledge!
#6
Posted 04 August 2010 - 07:06 AM
Cool. You are welcome.
Quote
...<h3 class="zmp"> </h3>. As you can see, the quotes around the zmp are leading to Error 351...
Escape the double quotes in strings with backslash.
script "C:/Scripts/Crawler.txt" page("http://www.somesite.com/somepage.asp") starttag("<h3 class=\"zmp\">") endtag("</h3>")
Quote
how difficult would it be to modify this script to read a txt file with a list of URLs and output a new txt file with the extracted values from each URL?
Not difficult at all. Call the Crawler.txt script in a loop.
Put the list of URLs in text file C:/URLList.txt, one per line. Then write a second script.
# Script ListCrawler.txt
var string listfile, URLlist, URL
# Check if value is assigned to listfile
if ( $listfile=="")
    exit 1 "ERROR: Value for listfile is not specified."
endif
# Get the contents of $listfile into $URLlist.
cat $listfile > $URLlist
# Go thru lines of $URLlist one by one.
lex "1" $URLlist > $URL
while ($URL <> "")
do
    # The next URL is in $URL. Call our Crawler.txt script with it.
    script "C:/Scripts/Crawler.txt" page($URL) starttag("<h3 class=\"zmp\">") endtag("</h3>")
    # Get the next URL
    lex "1" $URLlist > $URL
done
Save this second script in file C:/Scripts/ListCrawler.txt and call it with:
script "C:/Scripts/ListCrawler.txt" listfile("C:/URLList.txt")
This would call the Crawler.txt script in a loop. You will get extracted values from all URLs, one per line.
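For comparison, the same loop can be sketched in Python, this time writing the values to a text file as the original question asked. This is only an illustration under assumptions, not a translation of the biterscripting scripts: the function names (`fetch_page`, `extract_between`, `crawl_list`, `crawl_file`), the "NOT FOUND" placeholder and the output path C:/Values.txt are all hypothetical, and pages are assumed to decode as UTF-8.

```python
import re
from urllib.request import urlopen


def fetch_page(url):
    """Download a page and return its source as text (assumes UTF-8)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_between(source, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag, or None."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    match = re.search(pattern, source, re.DOTALL)
    return match.group(1) if match else None


def crawl_list(urls, start_tag, end_tag, fetch=fetch_page):
    """Return one extracted value per non-blank URL, in input order."""
    results = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the list
        try:
            value = extract_between(fetch(url), start_tag, end_tag)
        except (OSError, ValueError):
            value = None  # network failure or malformed URL
        if value is None:
            results.append("NOT FOUND")  # placeholder keeps output aligned with the URLs
        else:
            results.append(" ".join(value.split()))  # collapse whitespace to one line
    return results


def crawl_file(list_path, out_path, start_tag, end_tag, fetch=fetch_page):
    """Read URLs (one per line) from list_path and write one value per line to out_path."""
    with open(list_path, encoding="utf-8") as handle:
        urls = handle.read().splitlines()
    values = crawl_list(urls, start_tag, end_tag, fetch)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(values) + "\n")
    return values


# Example call (list path and tags from the thread; the output path is hypothetical):
# crawl_file("C:/URLList.txt", "C:/Values.txt", '<h3 class="zmp">', "</h3>")
```

The `fetch` parameter is there so the download step can be swapped out, for instance with a function that reads saved pages from disk while you test the tag matching. Writing a placeholder for pages that fail or lack the tags keeps line N of the output matched to URL N of the list.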











