Crawler to retrieve 2 values from source

This topic has been archived. This means that you cannot reply to this topic.
5 replies to this topic

#1
bemaitea

Hello all!

New to the forums and pretty excited to start learning some programming. I have a moderate amount of experience with XHTML/PHP/JS, but I would not classify myself as experienced at all.

Anyway, I'm in need of a program that crawls a URL and retrieves 2 values that are between a specific set of HTML tags.

The ideal program would go through a list of URLs in a text document, crawl the source of each page for the chosen values, and write the values to a new text file in the same order as the URLs.

So essentially the steps here would be:

1. Program opens the txt document and reads the first URL
2. Program opens the URL and uses a check like "IF TRUE PRINT "value"" to write the value to a new document
3. Program repeats for the next URL until all URLs are accounted for

Can anyone give me some direction on how to accomplish this? I'd really appreciate it! :)

#2
bemaitea

Just a quick update:

Found a precompiled JS webcrawler which does EXACTLY what I'm looking for :thumbup:

Only thing is now I need to figure out how to customize the output HTML file...

I've posted in the JS forum, so if anyone wants to take a look and offer some advice, that would be awesome :)
------

Link to new post:
http://forum.codecal...html#post266377

#3
kresh7

Well, you could try regular expressions to pull the values out of any file you want pretty easily.
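
As an illustration of the regular-expression idea, here is a minimal Python sketch. It is not from the thread: the function name extract_between and the sample string are made up for the example, and it assumes the tags appear in the page exactly as typed.

```python
import re

def extract_between(html, start_tag, end_tag):
    """Return every piece of text found between start_tag and end_tag."""
    # re.escape makes the tags literal, so quotes and angle brackets are safe.
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    # DOTALL lets a value span several lines; the non-greedy group stops
    # at the first end tag that follows each start tag.
    return [value.strip() for value in re.findall(pattern, html, flags=re.DOTALL)]

# Hypothetical page fragment, used only to show the call.
sample = '<h3 class="zmp">first</h3><p>skip</p><h3 class="zmp"> second </h3>'
print(extract_between(sample, '<h3 class="zmp">', '</h3>'))
# prints ['first', 'second']
```

A regex like this is fine for pulling a couple of known values out of predictable markup. For messy or nested HTML, a real parser is usually the safer choice.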

#4
JenniC

Here is a script that will do that.

 
# Script Crawler.txt
var str page, starttag, endtag, source
# Check if values are assigned to page, starttag, endtag
if ( $page=="" OR $starttag=="" OR $endtag=="")
    exit 1 "ERROR: Value for page, starttag or endtag is not specified."
endif

# Read the page into a string variable.
cat $page > $source

# Remove portion before (and including) $starttag.
stex -c ("^"+$starttag+"^]") $source > null

# Remove portion after (and including) $endtag
stex -c ("[^"+$endtag+"^") $source > null

# What's remaining in $source is what we want to extract.
echo $source


The script is written in biterscripting. Save it in the file C:/Scripts/Crawler.txt, start biterscripting, and enter this command:


script "C:/Scripts/Crawler.txt" page("http://www.somesite.com/somepage.asp") starttag("<head>") endtag("</head>")


This will show you the value extracted from that page between the tags <head> and </head>.
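
For readers who do not use biterscripting, the same two-step trim (drop everything up to and including the start tag, then drop everything from the end tag on) can be sketched in Python. This is an assumed equivalent, not a translation checked against biterscripting: the name between_tags is hypothetical, and returning an empty string when a tag is missing is a choice made for this sketch.

```python
def between_tags(source, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag."""
    start = source.find(start_tag)
    if start == -1:
        return ""  # assumption: a missing start tag yields an empty result
    # Step 1: remove the portion before (and including) the start tag.
    rest = source[start + len(start_tag):]
    end = rest.find(end_tag)
    if end == -1:
        return ""  # assumption: a missing end tag yields an empty result
    # Step 2: remove the portion after (and including) the end tag.
    return rest[:end]

# Hypothetical page source, used only to show the call.
sample = "<html><head><title>Some page</title></head><body></body></html>"
print(between_tags(sample, "<head>", "</head>"))
# prints <title>Some page</title>
```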

#5
bemaitea

Hey JenniC!

Thanks a lot for this script! It is pretty awesome. I had never heard of biterscripting, and after going through the site I can see I might be using it quite often for routine tasks. So thank you twofold!

One issue I am having with the current script is that the tags I am trying to retrieve have a class attribute, which biterscripting does not like. For example, the tags I want to obtain look like this: <h3 class="zmp"> </h3>. The quotes around zmp lead to Error 351: Invalid syntax. Any ideas?

Also, how difficult would it be to modify this script to read a txt file with a list of URLs and output a new txt file with the extracted values from each URL?

Again, thanks for introducing me to biterscripting. I'm going through the lessons right now, so hopefully I can pick up some knowledge!

#6
JenniC

Cool. You are welcome.

Quote

...<h3 class="zmp"> </h3>. As you can see, the quotes around the zmp are leading to Error 351...

Escape the double quotes in strings with a backslash.

script "C:/Scripts/Crawler.txt" page("http://www.somesite.com/somepage.asp") starttag("<h3 class=\"zmp\">") endtag("</h3>")



Quote

how difficult would it be to modify this script to read a txt file with a list of URLs and output a new txt file with the extracted values from each URL?

Not difficult at all. Call the Crawler.txt script in a loop.

Put the list of URLs in text file C:/URLList.txt, one per line. Then write a second script.

 
# Script ListCrawler.txt
var string listfile, URLlist, URL
# Check if value is assigned to listfile
if ( $listfile=="")
    exit 1 "ERROR: Value for listfile is not specified."
endif

# Get the contents of $listfile into $URLlist.
cat $listfile > $URLlist

# Go thru lines of $URLlist one by one.
lex "1" $URLlist > $URL
while ($URL <> "")
do
    # The next URL is in $URL. Call our Crawler.txt script with it.
    script "C:/Scripts/Crawler.txt" page($URL) starttag("<h3 class=\"zmp\">") endtag("</h3>")

    # Get the next URL
    lex "1" $URLlist > $URL
done


Save this second script in the file C:/Scripts/ListCrawler.txt and call it with:


script "C:/Scripts/ListCrawler.txt" listfile("C:/URLList.txt")


This would call the Crawler.txt script in a loop. You will get extracted values from all URLs, one per line.
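
The original question also asked for the values to end up in a new text file. A rough Python sketch of that full loop follows, under stated assumptions: the names fetch_page, first_between and crawl_list are hypothetical, the pages are assumed to decode as UTF-8, and a failed download is recorded as an ERROR line so the output stays in the same order as the URL list.

```python
from urllib.request import urlopen

def fetch_page(url):
    """Download a page and return its source as text (UTF-8 is assumed)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def first_between(source, start_tag, end_tag):
    """Return the trimmed text between the first start_tag and the next end_tag."""
    _, found, rest = source.partition(start_tag)
    if not found:
        return ""
    value, found, _ = rest.partition(end_tag)
    return value.strip() if found else ""

def crawl_list(list_path, out_path, start_tag, end_tag, fetch=fetch_page):
    """Read one URL per line from list_path; write one value per line to out_path."""
    with open(list_path, encoding="utf-8") as urls, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in urls:
            url = line.strip()
            if not url:
                continue  # skip blank lines in the list
            try:
                value = first_between(fetch(url), start_tag, end_tag)
            except (OSError, ValueError) as err:
                # Keep one output line per URL even when the download fails.
                value = f"ERROR: {err}"
            out.write(value + "\n")
```

With the list from this thread, the call would look like crawl_list("C:/URLList.txt", "C:/Values.txt", '<h3 class="zmp">', "</h3>"), where C:/Values.txt is a made-up output path. The fetch parameter exists only so the download step can be swapped out, for example when trying the loop on saved pages.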