
Trying to decide how to do this...



#1
Someonespecial

I have extensive C and C++ experience, but I have never done anything web-related. What I'm looking to do is download all the tabs from ultimate-guitar.com.

I found that the tabs follow a fixed path pattern: /guitar/first_letter/bandname/songname_ver(song version).gp(Guitar Pro version)

An example would be:

ultimate-guitar.com/guitar/h/hammerfall/always_will_be_ver3.gp5
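
To make the pattern concrete, here is a minimal Python sketch that splits such a path into its parts. The regular expression is inferred from the single example above, so it is an assumption about the naming scheme rather than a confirmed rule; `TAB_PATH` and `parse_tab_path` are hypothetical names.

```python
import re

# Pattern inferred from one example URL only; the real site may use other forms.
TAB_PATH = re.compile(
    r"^/guitar/(?P<letter>[^/]+)/(?P<band>[^/]+)/"
    r"(?P<song>.+?)(?:_ver(?P<version>\d+))?\.gp(?P<gp>\d+)$"
)


def parse_tab_path(path):
    """Split a tab path into its parts, or return None if it does not match."""
    match = TAB_PATH.match(path)
    return match.groupdict() if match else None


parts = parse_tab_path("/guitar/h/hammerfall/always_will_be_ver3.gp5")
print(parts)
```

The four unknowns listed below correspond to the `band`, `song`, `version`, and `gp` groups.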

The problem is that:

1) I do not have a list of all the band names (I want to extract all of them), so I would need a loop that runs through every alphanumeric combination and tests whether a band by that name exists on the site.

2) I do not know the song names either, so I would also have to search for them incrementally, along with the version of each song (always_will_be_ver1, _ver2, etc.).

3) I also do not know the GP number, which usually ranges from 1 to around 8 for popular bands.

So my problem is that I do not know the band names, the song names, the version of the song names, or the version of the guitar pro file name.

I would have to incrementally look for everything on the database, unless I could access it and have some sort of way to pull the lists.
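
A rough count shows how large that search would be. This is only a back-of-the-envelope sketch in Python: it assumes names use just a-z and 0-9 (real names such as always_will_be also contain underscores, so it undercounts), and `candidates_up_to` is a hypothetical helper.

```python
ALPHABET_SIZE = 36  # a-z plus 0-9; an assumption that undercounts real names


def candidates_up_to(max_len, alphabet_size=ALPHABET_SIZE):
    """Number of distinct strings of length 1..max_len over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


# "hammerfall" is 10 characters long, so band names alone need length 10.
for length in (4, 8, 10):
    print(length, candidates_up_to(length))
```

At length 10 the count is already on the order of 10^15 for band names alone, before multiplying by song names and versions, which is why pulling the lists from the site is the only practical route.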

Since I have no web experience, I was wondering how to do this. Should I use PHP for this? I could learn it quite easily; I just want to know whether it is the right tool for the job, and whether there is an easier way than brute-forcing every possible name to find the files.

Thanks

-G Friggen Unit

#2
Orjan

If you're going to use PHP, I suggest you do this:

Use a foreach that loops over an array of the different indexes ('0-9', 'a', 'b', ..., 'z').

Within that loop,
connect to the site using curl and fetch the page.
On the lower part of the page there is a "Pages:" section. Match all of its links with a regexp (there's usually a unique div tag or something similar to match on).

Loop through each page:
read the page with curl, and with a regexp find the name of each tab page.

Loop through all matches found:
now you're on the songs page. Here you do the same again: read the page with curl and try to match each page name.

Loop through all song matches:
finally, inside the innermost loop, curl the page and save whatever you want to save.
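
The nested loops above can be sketched roughly as follows. This is a Python version rather than PHP (`urlopen` stands in for curl and `re.findall` for preg_match_all), and it leaves out the "Pages:" pagination level for brevity. Every URL, the link pattern, and the names `crawl`, `fake_fetch`, and `FAKE_SITE` are made up for illustration; the real markup has to be inspected first.

```python
import re
import string
from urllib.request import urlopen

# Index keys as described above: '0-9' followed by 'a'..'z'.
INDEXES = ["0-9"] + list(string.ascii_lowercase)

# Hypothetical link pattern; a real page needs a more specific one.
LINK_RE = re.compile(r'href="([^"]+)"')


def fetch_http(url):
    """Download a page over HTTP and return its text (not called in this demo)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(fetch, indexes):
    """Walk index page -> band pages -> song links and return the song links."""
    tabs = []
    for index in indexes:
        index_html = fetch(f"/bands/{index}")  # assumed URL scheme
        for band_url in LINK_RE.findall(index_html):
            for song_url in LINK_RE.findall(fetch(band_url)):
                tabs.append(song_url)  # innermost loop: download and save here
    return tabs


# Tiny in-memory stand-in for the site so the sketch runs offline.
FAKE_SITE = {
    "/bands/h": '<a href="/band/hammerfall">HammerFall</a>',
    "/band/hammerfall": '<a href="/tab/always_will_be_ver3">Always Will Be</a>',
}


def fake_fetch(url):
    return FAKE_SITE.get(url, "")


print(crawl(fake_fetch, INDEXES))
```

Passing the fetch function in as a parameter keeps the loop logic testable without touching the network; swapping `fake_fetch` for `fetch_http` (with full URLs) would run it against a live site.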

If you then want to refresh and update daily, use the fresh tabs page ordered by date, and you'll get only the new ones very easily.
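
A minimal Python sketch of that daily update, assuming the fresh tabs listing is ordered newest first and that everything after the first already-known entry has been downloaded before. `new_tabs` and the URLs are hypothetical.

```python
def new_tabs(fresh_listing, seen):
    """Return entries from a newest-first listing that are not yet in `seen`,
    stopping at the first entry that is already known."""
    fresh = []
    for url in fresh_listing:
        if url in seen:
            break  # everything older is assumed to be downloaded already
        fresh.append(url)
    return fresh


already_downloaded = {"/tab/b", "/tab/c"}
listing = ["/tab/a", "/tab/b", "/tab/c"]  # newest first
print(new_tabs(listing, already_downloaded))
```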

But for you, the main thing is curl to read the pages and regexps to match the content on unique tags.

I think preg_match_all is a great function to run on the page content, as it returns the info you want from a page in an array, so you can easily loop over it at the next level.
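
To illustrate the "array you can loop over" idea, here is a small Python sketch using `re.findall`, which plays a similar role to preg_match_all. The HTML snippet and the pattern are made up; real pages need a pattern matched to their actual markup.

```python
import re

# Made-up markup standing in for a fetched page.
html = (
    '<a href="/tab/one">One</a>'
    '<a href="/tab/two">Two</a>'
)

# With two capture groups, findall returns a list of (url, title) tuples.
matches = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

for url, title in matches:
    print(url, title)
```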