Does anyone know how to pull text strings off of a website?

4 replies to this topic

#1
roxygirl123

    Newbie

  • Members
  • 1 posts
I want to pull text strings off a website, and write them to a .txt file or even a spreadsheet.
Does anyone have an idea of how to go about this?
I will be very grateful :)




-----------
Damsel in distress

#2
Flying Dutchman

    Programming God

  • Members
  • 890 posts
  • Location: ::1
You can do it easily with the urllib and urllib2 modules. To parse the text, you can either do it by hand or use regexes (regular expressions).
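import urllib
import re

URL = "http://www.xkcd.com"

# Python 2 code. Fetch the page once and reuse it for both approaches.
html = urllib.urlopen(URL).read()

# By hand: find the line holding the title and cut out the text between the tags.
for line in html.splitlines():
    if "<title>" in line and "</title>" in line:
        start = line.index("<title>") + len("<title>")
        end = line.index("</title>", start)
        print line[start:end] # print without html tags
        break

# With a regex: capture everything between the opening and closing title tags.
regex = re.compile(r"<title>(.*?)</title>")

for title in regex.findall(html):
    print title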
The code above prints the contents of the <title> tag on xkcd's front page, which includes the title of the latest comic.
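
The original question also asked about writing the strings to a .txt file or a spreadsheet, which the code above does not cover. Here is a minimal sketch of that step, assuming Python 3 and only the standard library. The names `extract_strings`, `save_as_text`, `save_as_csv`, `SAMPLE_HTML`, and the output file names are made up for illustration, and the choice of `<title>` and `<h1>` tags is just an example of "strings you might want". A CSV file is used as the spreadsheet format because most spreadsheet programs can open it.

```python
import csv
import re

# Stand-in for a downloaded page (an assumption for this sketch). In Python 3
# a real page could be fetched with
# urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """<html><head><title>xkcd: Example</title></head>
<body><h1>First heading</h1><h1>Second heading</h1></body></html>"""


def extract_strings(html):
    """Return the text inside every <title> and <h1> tag, in document order."""
    pattern = re.compile(r"<(title|h1)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
    return [match.group(2).strip() for match in pattern.finditer(html)]


def save_as_text(strings, path):
    """Write one string per line to a plain .txt file."""
    with open(path, "w", encoding="utf-8") as handle:
        for text in strings:
            handle.write(text + "\n")


def save_as_csv(strings, path):
    """Write the strings to a CSV file that a spreadsheet program can open."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "text"])
        for index, text in enumerate(strings, start=1):
            writer.writerow([index, text])


if __name__ == "__main__":
    strings = extract_strings(SAMPLE_HTML)
    save_as_text(strings, "strings.txt")
    save_as_csv(strings, "strings.csv")
```

A regex like this is fine for simple, predictable pages. For messy real-world HTML, a proper parser is usually more robust.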
A conclusion is where you got tired of thinking.
#define class struct    // All is public.

#3
brokenbylaw

    Learning Programmer

  • Members
  • 62 posts
I'd use urllib2 to fetch the page and HTMLParser to parse the HTML.
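
To give an idea of what the HTMLParser approach might look like, here is a rough sketch, assuming Python 3, where the module is called `html.parser` (in Python 2 it is `HTMLParser`). The class name `TextCollector`, the helper `extract_text`, and the sample page are hypothetical. The parser is fed an inline string here so the example runs without a network connection.

```python
from html.parser import HTMLParser  # Python 2 spells this module "HTMLParser"


class TextCollector(HTMLParser):
    """Collect the visible text chunks of a page, skipping scripts and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text chunks found in the given HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample page used only for this sketch.
SAMPLE_HTML = """<html><head><title>Demo page</title>
<style>p { color: red; }</style></head>
<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"""

if __name__ == "__main__":
    for chunk in extract_text(SAMPLE_HTML):
        print(chunk)
```

The parser calls the `handle_*` methods as it walks through the markup, so you only override the events you care about.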

#4
Flying Dutchman

    Programming God

  • Members
  • 890 posts
  • Location: ::1
I forgot to mention earlier: there's also the BeautifulSoup module for parsing HTML.

#5
ReekenX

    Programmer

  • Members
  • 134 posts
There is also lxml, which makes parsing HTML easy if you know XPath.
www.jarmalavicius.lt | www.github.com/reekenx | www.twitter.com/reekenx



