
[Python] - Python Reddit Scraper V1.0

python scraper reddit xml lxml requests python2.7.5 python 2.7.5


#1 Sundance


Posted 15 November 2013 - 09:19 AM

This is my Python scraper. I won't include its counterpart (the PHP document that renders the output) here; you can find it on my Pastebin, and the link is in the script's header comments.

 

NOTE: The PHP script is NOT commented, as this post is about my Python script and NOT my PHP rendering. Also note there is NO CSS file to go with it.

# Python Reddit Scraper V1.0
# PHP XML Render page: http://pastebin.com/nip125AJ
# This script scrapes a Reddit listing page.
# It then exports the results to an XML file, to be read by whichever method you decide is best.
# I wanted to create a scraper that would grab three things: 1. the name of the post, 2. the comments section URL, and 3. the image / YouTube video attributed to the post.

# NOTES
# Coded on Python 2.7.5 
# Requires Requests and LXML modules
# Coded by LKP from CodeCall.net 
# Pastebin: http://pastebin.com/u/LorenKPetrov

# IMPORTS #
from lxml import html # Imports HTML from LXML
import xml.etree.cElementTree as XMLT # Imports element tree for python so it can write XML in the right style.
import requests # Imports Requests so Python can fetch the web page over HTTP.

# VARIABLES #
page = requests.get('http://www.reddit.com/r/minecraft') # Gets the page to scrape
tree = html.fromstring(page.text) # converts the HTML page into a tree for XPATH to read
title = tree.xpath('//a[@class="title "]/text()') # Grabs the text of each hyperlink whose class is "title ". NOTE: The trailing space is supposed to be there; Reddit's markup includes it.
link = tree.xpath('//li[@class="first"]/a/@href') # Similar to above, but grabs the href attribute of the hyperlink inside each li tag with the class "first" (the comments link).
imgur = tree.xpath('//p[@class="title"]/a/@href') # Likewise, grabs the href attribute of the hyperlink inside each paragraph tag with the class "title" (the image / video link).
root = XMLT.Element("ENTRY") # This is my root XML tag so it doesn't become part of the loop
start = 0 # Loop counter for the while loop below.
total = min(len(title), len(link), len(imgur)) # Counts the entries to write. Uses the shortest of the three lists so that indexing with 'start' cannot run past the end of any of them if the XPath queries return different numbers of matches.

# MAIN CODE #
while start < total: # While start (initially 0) is less than total, do the following
	doc = XMLT.SubElement(root, "POST") # Writes the XML tag POST
	field1 = XMLT.SubElement(doc, "TITLE") # Writes the XML tag TITLE
	field1.text = title[start] # Writes the tag content for TITLE
	field2 = XMLT.SubElement(doc, "MEDIA") # Writes the XML tag MEDIA
	field2.text = imgur[start] # Writes the tag content for MEDIA
	field3 = XMLT.SubElement(doc, "LINK") # Writes the XML tag LINK
	field3.text = link[start] # Writes the tag content for LINK

	start = start + 1 # Adds 1 to 'start' so the loop moves on to the next entry and stops once start reaches total

tree = XMLT.ElementTree(root) # Wraps the ENTRY root element in an ElementTree so it can be written to a file (this reuses the name 'tree' from the HTML parsing above).
tree.write("MC.xml") # Finally, it writes the info to the specified XML document.
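
For anyone adapting this, here is a minimal sketch of the XML-building step on its own, assuming you already have the three lists from the XPath queries. The function name build_entry and the sample data are made up for illustration and are not part of the script above. It pairs the lists with zip(), which stops at the shortest one, instead of indexing with a counter, and then reads the XML back the way a rendering page might.

```python
import xml.etree.ElementTree as ET


def build_entry(titles, media, links):
    """Build an ENTRY element with one POST child per (title, media, link) triple."""
    root = ET.Element("ENTRY")
    # zip() stops at the shortest list, so lists of different lengths
    # simply produce fewer POST elements.
    for title, media_url, link in zip(titles, media, links):
        post = ET.SubElement(root, "POST")
        ET.SubElement(post, "TITLE").text = title
        ET.SubElement(post, "MEDIA").text = media_url
        ET.SubElement(post, "LINK").text = link
    return root


# Hypothetical sample data standing in for the three XPath result lists.
# The media list is deliberately one item longer than the others.
sample_titles = ["First post", "Second post"]
sample_media = [
    "http://example.com/a.png",
    "http://example.com/b.png",
    "http://example.com/extra.png",
]
sample_links = ["http://example.com/comments/1", "http://example.com/comments/2"]

root = build_entry(sample_titles, sample_media, sample_links)
xml_text = ET.tostring(root).decode("utf-8")

# Reading the XML back: parse the text and collect each POST's fields.
parsed = ET.fromstring(xml_text)
posts = [
    (p.find("TITLE").text, p.find("MEDIA").text, p.find("LINK").text)
    for p in parsed.findall("POST")
]
for post_title, post_media, post_link in posts:
    print("%s | %s | %s" % (post_title, post_media, post_link))
```

To write a file as the script does, wrap the result with ET.ElementTree(root) and call its write() method with a filename.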


Edited by LKP, 16 November 2013 - 07:46 AM.
