I've done extensive C and C++ work, but have never done anything web-related. What I'm looking to do is download all the tabs from ultimate-guitar.com.
I found that the tabs are indexed by a certain path pattern: /guitar/first_letter/bandname/songname_verN.gpN (song name, song version, Guitar Pro version)
An example would be:
ultimate-guitar.com/guitar/h/hammerfall/always_will_be_ver3.gp5
The problem is that:
1) I do not have a list of all the band names (I want to extract all of them), so I would have to loop through every alphanumeric combination that could be a band name and test whether it exists in the database.
2) I do not know the song names either, so I would have to search for them incrementally too, along with the version of each song (always_will_be_ver1 or _ver2, etc.).
3) I also do not know the GP number, which can range from 1 to around 8 for popular bands.
So my problem is that I do not know the band names, the song names, the song versions, or the Guitar Pro file versions.
I would have to probe the database incrementally for everything, unless I could access it and pull the lists in some way.
Since I have no web experience, I was wondering how to do this. Should I use PHP for this? I could learn it quite easily; the question is whether it is the right tool for the job, and whether there is an easier way than brute-forcing every character combination to find downloadable files.
Thanks
-G Friggen Unit
Trying to decide how to do this...
Started by Someonespecial, Apr 01 2009 11:54 AM
1 reply to this topic
#1
Posted 01 April 2009 - 11:54 AM
#2
Posted 01 April 2009 - 12:40 PM
If you're going to use PHP, I suggest you do this:
A foreach that loops over an array of the different indexes ('0-9', 'a', 'b', ..., 'z').
Within that loop,
connect to the site using curl and fetch the page.
On the lower part of the page there is a "Pages:" section. Match all of its links with a regexp (there's usually a unique div tag or something similar to match on).
Loop through each page.
Read the page with curl, and use a regexp to find the name of each tab page.
Loop through all the matches found.
Now you're on the songs page. Do the same again: read the page with curl and match each song page name.
Loop through all the song matches.
Finally, inside the innermost loop, curl the page and save whatever you want to save.
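The nesting of those loops can be sketched roughly as follows. This is only an illustration, written in Python rather than PHP, with `re.findall` standing in for `preg_match_all` and an injected `fetch` function standing in for curl. The URL layout (`/bands/h.htm`, `/tabs/...`), the `class="page"`, `class="band"` and `class="song"` attributes, and the contents of `FAKE_SITE` are all assumptions made up for the example; the real site's markup has to be inspected first. It collects the song URLs instead of saving files.

```python
import re
from typing import Callable, List

# Hypothetical link patterns; the real pages need their own unique tags.
PAGE_LINK = re.compile(r'<a class="page" href="([^"]+)">')
BAND_LINK = re.compile(r'<a class="band" href="([^"]+)">')
SONG_LINK = re.compile(r'<a class="song" href="([^"]+)">')


def crawl(fetch: Callable[[str], str], indexes: List[str]) -> List[str]:
    """Walk index -> result pages -> band pages -> song links."""
    found = []
    for index in indexes:                      # '0-9', 'a', 'b', ..., 'z'
        first_url = f"/bands/{index}.htm"      # assumed index URL layout
        first_html = fetch(first_url)
        # The first page plus every page linked from its "Pages:" section.
        pages = [first_html]
        for page_url in PAGE_LINK.findall(first_html):
            pages.append(fetch(page_url))
        for page_html in pages:
            for band_url in BAND_LINK.findall(page_html):
                for song_url in SONG_LINK.findall(fetch(band_url)):
                    found.append(song_url)     # a real script would download here
    return found


# Offline stand-in for the site, so the sketch runs without any network access.
FAKE_SITE = {
    "/bands/h.htm": '<a class="band" href="/tabs/hammerfall.htm">HammerFall</a>'
                    ' Pages: <a class="page" href="/bands/h2.htm">2</a>',
    "/bands/h2.htm": '<a class="band" href="/tabs/hypothetical_band.htm">X</a>',
    "/tabs/hammerfall.htm":
        '<a class="song" href="/guitar/h/hammerfall/always_will_be_ver3.gp5">A</a>',
    "/tabs/hypothetical_band.htm":
        '<a class="song" href="/guitar/h/hypothetical_band/some_song_ver1.gp4">B</a>',
}


def fake_fetch(url: str) -> str:
    """Return the canned page for a URL, or an empty page if unknown."""
    return FAKE_SITE.get(url, "")


print(crawl(fake_fetch, ["h"]))
```

Passing `fetch` in as a parameter keeps the loop logic separate from the HTTP client, so the real fetcher (curl in PHP, or any HTTP library) can be swapped in later without touching the loops.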
If you then want to refresh and update daily, use the fresh tabs page ordered by date, and you'll get only the new ones very easily.
But for you, the main thing is curl to read the pages and regexps to match the content on unique tags.
I think preg_match_all is a great function to run on a page's content, since it returns the info you want from the page in an array, so you can easily loop over it at the next level.
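To show what "returns the info in an array" looks like in practice, here is a small sketch using Python's `re.findall`, which behaves much like `preg_match_all` for this purpose: with two capture groups it returns one (url, title) pair per match. The HTML snippet is invented for illustration (only the `always_will_be_ver3.gp5` path comes from the question, and the `.gp4` entry is a guess at what a sibling version might look like), so the pattern would need adjusting to the real markup.

```python
import re

# Hypothetical listing markup; the real page will look different.
html = """
<div class="tabs">
  <a href="/guitar/h/hammerfall/always_will_be_ver1.gp4">Always Will Be (ver 1)</a>
  <a href="/guitar/h/hammerfall/always_will_be_ver3.gp5">Always Will Be (ver 3)</a>
</div>
"""

# Group 1 captures the link to a Guitar Pro file, group 2 the link text.
LINK = re.compile(r'<a href="([^"]+\.gp\d)">([^<]+)</a>')

matches = LINK.findall(html)   # list of (url, title) tuples, in page order

for url, title in matches:
    print(title, "->", url)
```

Because every match comes back as one entry in the list, the result can be fed straight into the next loop level, which is the property the advice above relies on.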
__________________________________________
I study Information Systems at Karlstad University when I'm not on CodeCall