
slow module or bad coding?

6 replies to this topic

#1
Flying Dutchman

    Programming God

  • Members
  • 890 posts
  • Location:::1
So, I made myself an app that checks when the next episode of each show (the list is stored in an .xml file) is going to air. It looks up the dates on Wikipedia, and it takes about 10 seconds on my PC to show results.

Here's the code (minus the Qt part):

import datetime
import re
import threading
import urllib2
import xml.dom.minidom

# QtGui and dlg come from the Qt part, which is omitted here.

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

today = datetime.date.today()

# Group 1 is the displayed air date, group 2 the hidden ISO date (YYYY-MM-DD).
regex = re.compile(r'<td>([^<]*)<span style="display:none"> \(<span class="bday dtstart published updated">([^<]*)</span>\)</span></td>')

# Map each show title to the URL of its episode list.
series = {}
xmldata = xml.dom.minidom.parse("list.xml")
for el in xmldata.getElementsByTagName("show"):
    series[el.getAttribute("title")] = el.getAttribute("url")


class Updater(threading.Thread):
    def __init__(self, content, row, regex):
        threading.Thread.__init__(self)
        self.content = content
        self.row = row
        self.regex = regex

    def run(self):
        for date1, date2 in self.regex.findall(self.content):
            if date2 == str(today):
                dlg.tableWidget.setItem(self.row, 1, QtGui.QTableWidgetItem("It's today!"))
                break

            y, m, d = map(int, date2.split("-"))
            # Compare as (year, month, day) so dates in a later year also count.
            if (y, m, d) > (today.year, today.month, today.day):
                dlg.tableWidget.setItem(self.row, 1, QtGui.QTableWidgetItem(date1))
                break
        else:
            # The loop ended without finding today's date or a later one.
            dlg.tableWidget.setItem(self.row, 1, QtGui.QTableWidgetItem("No date"))


dlg.tableWidget.setColumnWidth(0, 250)
dlg.tableWidget.setColumnWidth(1, 140)

row = 0
for title, url in series.items():
    dlg.tableWidget.insertRow(row)
    dlg.tableWidget.setItem(row, 0, QtGui.QTableWidgetItem(title))
    content = opener.open(url).read()
    t = Updater(content, row, regex)
    t.start()
    row += 1


I'm guessing the opener part is slow.
A conclusion is where you got tired of thinking.
#define class struct    // All is public.

#2
Alexander

    It's Science!

  • Moderators
  • 4,124 posts
  • Location:Vancouver, Eh! Cleverness: 200
A few metrics would help diagnose the problem, such as how many unique addresses the program loads in those 10 seconds. That said, I believe most of the time is taken up by the latency of the HTTP requests, which cannot be improved.

If you are indeed making around 10 HTTP requests, and your threads are actually making them concurrently, 5-10 seconds would not seem unreasonable: HTTP was not designed with low latency in mind, so this could very well be normal behaviour. I would write a basic test script that connects concurrently to a content-heavy page (like Wikipedia) a few times and measures each response time. Simple tests like that can help you understand what has gone wrong (profiling is the idea).
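
A test script along those lines could look roughly like the sketch below. It is only an illustration: it assumes Python 3, where urllib.request replaces urllib2, and the names fetch and time_requests are made up for this example. Each URL is fetched in its own thread and the time each request took is recorded.

```python
import threading
import time
import urllib.request


def fetch(url):
    """Download a page (assumed Python 3 stand-in for opener.open(url).read())."""
    request = urllib.request.Request(url, headers={"User-agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def time_requests(urls, fetch=fetch):
    """Fetch every URL in its own thread and return {url: seconds taken}."""
    timings = {}

    def worker(url):
        start = time.perf_counter()
        fetch(url)
        timings[url] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings
```

Calling time_requests with the URLs from list.xml and printing the result shows whether individual requests are slow or whether the time is lost somewhere else. If every request takes about a second but the whole program takes ten, the requests are probably not running concurrently.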

If the list of episodes is on the same page, which I believe is how Wikipedia handles a given show, can you not store the whole page once and use threads only to match the X episodes against that local data? I am not sure whether that can be done.
Be sure to read the updated FAQ! || Health is achieved through the same 10,000 steps.
If a suggested code/method fails, informing us is less important than telling us why or what errors occurred.

#3
Flying Dutchman

Yes, Wikipedia stores all dates for a given show on the same page, but I'm afraid I don't quite understand your suggestion. Doesn't the read() method download the entire page?

content = opener.open(url).read()



#4
Alexander

Ah, my point is that if all the episodes are on the same page, you appear to re-download the page each time you want a piece of information out of it, rather than storing the whole list of episodes and THEN extracting the data. HTTP requests are heavy, so it would be better to design your program around making fewer of them, for example by taking the .read() out of your threading unless it is a new URL (series).
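
One way to make sure a URL is only ever downloaded once is a small cache in front of the download call. This is just a sketch: PageCache is a made-up name, and fetch stands for whatever function actually downloads the page (for example a wrapper around opener.open(url).read()).

```python
class PageCache:
    """Remember each page's contents so a URL is downloaded at most once."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical: callable taking a URL, returning its contents
        self._pages = {}      # URL -> contents already downloaded

    def get(self, url):
        # Only go to the network the first time a URL is requested.
        if url not in self._pages:
            self._pages[url] = self._fetch(url)
        return self._pages[url]
```

Note that this version is not thread-safe: if several threads ask for the same new URL at the same moment, it may still be downloaded more than once.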

#5
Flying Dutchman

Here's a couple of lines in my .xml file:

    <show title="The Big Bang Theory" url="http://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes" />

    <show title="Stargate Universe" url="http://en.wikipedia.org/wiki/List_of_Stargate_Universe_episodes" />


And I have a dictionary whose keys are show titles and whose values are their URLs. I go through that dictionary and create a new thread for each entry, passing it the content (the page contents from the .read() method) and the regular expression, and then display the results.

Oh, I found an earlier version of this program that I wrote without regular expressions, and it takes less than a second to display results. It seems that I've written a slow regex. I'll take a closer look at that.
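
A quick way to confirm or rule out the regex would be to time it on its own, on a page that has already been downloaded, so that network time is not mixed in. The sketch below assumes Python 3; time_findall is a made-up helper name, and the pattern is the one from the first post.

```python
import re
import time

# The pattern from the first post: group 1 is the displayed date, group 2 the ISO date.
AIR_DATE = re.compile(r'<td>([^<]*)<span style="display:none"> \(<span class="bday dtstart published updated">([^<]*)</span>\)</span></td>')


def time_findall(pattern, text):
    """Return (matches, seconds) for one findall pass over text."""
    start = time.perf_counter()
    matches = pattern.findall(text)
    return matches, time.perf_counter() - start
```

If the measured time on a saved copy of an episode-list page is a few milliseconds, the regex is not where the 10 seconds go.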

#6
Alexander

That is better! I would have considered the regex, but it did not look out of the ordinary. I wonder why it is slow. I am glad you have diagnosed the problem!

#7
Flying Dutchman

Finally cracked this one! :)

The problem was here:

content = opener.open(url).read()

t = Updater(content, row, regex)

I opened the web page in the for loop (which runs in the main thread) and then passed the contents to the sub-thread, so the downloads happened one after another. I fixed it by passing the URL to the thread and opening the page in the sub-thread instead of the main one. Runs like a charm now.
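
The reworked thread looks roughly like the sketch below. It is not the exact code from my app: fetch, on_done and start_updaters are made-up names so the example runs without Qt or a network connection. In the real program fetch would be the opener.open(url).read() call and on_done would run the regex and fill in the table row.

```python
import threading


class Updater(threading.Thread):
    """Download one show's page in a worker thread and hand over the result."""

    def __init__(self, url, row, fetch, on_done):
        threading.Thread.__init__(self)
        self.url = url
        self.row = row
        self.fetch = fetch        # hypothetical: callable(url) -> page contents
        self.on_done = on_done    # hypothetical: callable(row, content)

    def run(self):
        # The slow network call now happens here, in the worker thread,
        # so the loop that starts the threads never waits on it.
        content = self.fetch(self.url)
        self.on_done(self.row, content)


def start_updaters(series, fetch, on_done):
    """Start one Updater per show and return the started threads."""
    threads = []
    for row, (title, url) in enumerate(series.items()):
        t = Updater(url, row, fetch, on_done)
        t.start()
        threads.append(t)
    return threads
```

The loop now only creates and starts threads, so all pages download at the same time and the total wait is about as long as the slowest single request.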



