
Best language to use for web scraping Yahoo! Finance?


2 replies to this topic

#1
ThePistonDoctor


    Newbie

  • Members
  • 6 posts
Good afternoon ladies/gents,

I am looking for recommendations on how to scrape data from company financial statements on Yahoo! Finance. Basically, I have developed a system for picking stocks which analyzes quarterly and annual revenue growth, cost of revenue growth, cash flows, balance sheet variables, various multiples/indicators, etc., and I want to automate the process.

The first thing calculated is whether or not annual revenues have risen over the last three years. For example, on "PSDV Income Statement | pSivida Corp. Stock - Yahoo! Finance" we can see that since 2008 annual revenue has fallen from 12.162m to 4.965m. So the first thing I would want to note is whether revenue is increasing or decreasing (i.e. if ((2008 < 2009 AND 2009 < 2010) OR 2008 < 2010) then revenue is increasing overall, and vice versa). Next I would want to calculate the average percentage change over the three-year period (i.e. ((%change year0, year0-1) + (%change year0-1, year0-2)) / 2) to see if the company's revenue is increasing at an increasing rate, decreasing at a decreasing rate, increasing at a decreasing rate, etc.
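The two checks described above could be sketched roughly like this in Python. This is only an illustration of the arithmetic, not a finished tool: the function names are made up for this example, and the revenue figures in the usage lines are invented round numbers, not real Yahoo! Finance data.

```python
def pct_change(old, new):
    """Percentage change from old to new (100 -> 110 gives roughly 10)."""
    return (new - old) / old * 100.0


def revenue_trend(oldest, middle, newest):
    """Classify three annual revenues using the rule from the post.

    Note that (oldest < middle and middle < newest) already implies
    oldest < newest, so the rule boils down to comparing the end points.
    """
    if (oldest < middle and middle < newest) or oldest < newest:
        return "increasing"
    if (oldest > middle and middle > newest) or oldest > newest:
        return "decreasing"
    return "flat"


def average_pct_change(oldest, middle, newest):
    """Average of the two year-over-year percentage changes."""
    return (pct_change(oldest, middle) + pct_change(middle, newest)) / 2


# Hypothetical revenues for three consecutive years, oldest first.
print(revenue_trend(100.0, 110.0, 121.0))
print(average_pct_change(100.0, 110.0, 121.0))
```

Comparing the two year-over-year changes to each other (rather than just averaging them) would be one way to tell "increasing at an increasing rate" apart from "increasing at a decreasing rate".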

The calculations will obviously become more complex than this, but that's not the part I'm worried about. My concern is picking the proper language to make the application work efficiently and quickly. Now, I know that all of the data in the financial statements is displayed in tables, so I assume the best way to pull the proper data would be to determine where the data lies in the tables and scrape it into my application. The three major tables I will be using are the income statement, balance sheet and cash flow statement, each in both annual and quarterly form. I will also be using a few select criteria from the key statistics table, the competitor table, and the industry tables. That stuff can all come later though.

To keep this short and sweet, which language would you all recommend for scraping the data out of the page code? Python? Ruby? Java? I have no real bias, since I don't know any language particularly well and this will be a learning process for me no matter which language I choose. I do have a bit of programming experience though and understand the concepts (I know VB, Java, and various web design languages fairly well and also have experience with batch and shell scripting).

Any help will be greatly appreciated! :)

Thanks

#2
2Root


    Newbie

  • Members
  • 5 posts
Well, according to the Yahoo API, Yahoo! Finance uses RSS feeds based on ticker symbols. You could use a JavaScript library such as YUI to grab the data and a language like PHP to store it and do calculations, or you could use Python with Django to do the same thing. There are so many available options that it comes down to which language you prefer.

#3
ThePistonDoctor


    Newbie

  • Members
  • 6 posts
Thanks for the reply. I opted to use Python with Mechanize and BeautifulSoup. I am actually not using the ticker info for anything, only the historical financial statements, so that makes it a bit easier. I have started using BeautifulSoup to scrape the data and am learning my way around the different functions it offers. It seems fairly powerful and looks like it will do what I want. So far I've got it to download the entire contents of the page, pull out the relevant table, search the table for the appropriate rows and pull some data out of those rows. Now I need to make it more efficient by finding a way to access the rows/data I want directly instead of having to search the whole table each time. Right now it's slow as hell, but it works. Lol
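One way to get at rows directly might be to walk the table once, build a dictionary keyed by each row's first cell, and then look rows up by label. The sketch below shows that idea using only the standard library's html.parser rather than BeautifulSoup, so it runs with no extra installs; the same indexing step should carry over to rows pulled out with BeautifulSoup. The sample HTML, the row labels and the helper names are all invented for illustration and are not the actual Yahoo! Finance markup.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text pieces of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def index_rows(html):
    """Parse the HTML once and map each row label to its remaining cells."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1:] for row in parser.rows if row}


def to_number(text):
    """Turn a cell like '1,210' into 1210.0."""
    return float(text.replace(",", ""))


# Made-up table standing in for a downloaded statement page.
SAMPLE = """
<table>
  <tr><th>Period Ending</th><th>Year 3</th><th>Year 2</th><th>Year 1</th></tr>
  <tr><td>Total Revenue</td><td>1,210</td><td>1,100</td><td>1,000</td></tr>
  <tr><td>Cost of Revenue</td><td>600</td><td>550</td><td>500</td></tr>
</table>
"""

rows = index_rows(SAMPLE)
revenue = [to_number(cell) for cell in rows["Total Revenue"]]
print(revenue)
```

After the single pass, each lookup such as rows["Cost of Revenue"] is a plain dictionary access, so the table is not searched again for every value. If two rows share the same label, the later one overwrites the earlier one in this sketch.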

BTW 2Root, your username looks familiar to me for some reason. This might be a weird question, but were you by any chance a member of ShadowCrew back in the day?



