downloading images

21 replies to this topic

#1
Hot_Milo23

    Programmer

  • Members
  • 120 posts
Hey all,
I've got a small problem and I'm not really sure where to start with it (I haven't used the HTML libraries before). I quite enjoy the internet comic Ansem Retort (some of you may know of it), and I would like a way to view the strips offline. I started going through each one methodically, copying and pasting, but got bored pretty quickly, so I'm looking for an easier way to do it.

If it helps, each picture is shown on a page whose address follows this pattern:
ansemretort.org/ansemretort/index.html?comic=x
where x is the comic number (up to 529 at the moment).
Each strip's image file is also named "Comicx.png" (x being the number again).

If possible, could someone show me a way to accomplish this?
At first I just wanted it for convenience, but now I want to use it as a learning opportunity.

Thanks in advance!

#2
Aereshaa

    Programming God

  • Members
  • 790 posts
Well, I'm not too familiar with Python, but here's the way I would do it in pseudocode:
-while there are more comics to fetch:
--fetch url "blah.com/comic=" + num.to_s into string
--write string to file "comic" + num.to_s
--num = num + 1
-end while.
If someone proficient in Python could translate this into Python, it should work.
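
A rough Python 3 translation of the pseudocode above might look like the following. It is only a sketch: `BASE_URL` is a placeholder rather than the real image location, the `Comic<n>.png` naming is taken from the first post, and the function names `comic_url` and `download_comics` are made up for illustration.

```python
from urllib.request import urlretrieve

# Placeholder, not the real address: replace with wherever the images live.
BASE_URL = "http://example.com/comics/"


def comic_url(num, base=BASE_URL):
    """Build the image URL for comic number `num` (assumed 'Comic<n>.png')."""
    return "{}Comic{}.png".format(base, num)


def download_comics(first, last, base=BASE_URL):
    """Fetch comics first..last (inclusive) into the current folder."""
    for num in range(first, last + 1):
        url = comic_url(num, base)
        filename = "Comic{}.png".format(num)
        print("fetching", url)
        urlretrieve(url, filename)  # saves the response body to filename
```

With a real base address filled in, calling `download_comics(1, 529)` would try to fetch every strip up to the latest one mentioned in the first post.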

#3
psam

    Learning Programmer

  • Members
  • 34 posts
I wrote a Python script that will download the comics up to edition 531 (which I believe is the latest one right now) into the folder where you save the script. I can't post the code because it contains the download link and my post count is below 10, so I attached it to this message.

Attached Files

  • Attached File  pic.zip   279bytes   24 downloads


#4
Hot_Milo23

    Programmer

  • Members
  • 120 posts
Haha, wow.
Thanks, psam.
You're a legend! :D
Appreciate it.

#5
psam

    Learning Programmer

  • Members
  • 34 posts
No problem.
Any time you need ;).

#6
Hot_Milo23

    Programmer

  • Members
  • 120 posts
I'm sure this thread is very dead by now, but if you are still around, psam, I would like your help again with a similar problem.

Analyzing the code you used last time, I see you didn't use the web page address at all; you used the address where the pictures are stored. I was wondering how you knew where that was, and whether you know how to do it again (with the "Pokemon x" comics).

So if you still frequent this site, psam, I would appreciate your help.
Or could anyone else have a look into this for me?

Thanks in advance, guys :D

#7
Davison

    Newbie

  • Members
  • 9 posts
I slightly modified the previous poster's code so that you can choose which comic to start downloading from and which to finish with. For example, if you know you have not read 25 or 26, you run this from the Windows command prompt as "spam and eggs.py" 25 26 (with 'spam and eggs.py' being the script name).


from urllib import urlretrieve
import sys

# First and last comic numbers come from the command line
n = int(sys.argv[1])
finish = int(sys.argv[2])

while n <= finish:  # <= so that the last comic is downloaded too
    url = 'urlurlurlurlurlurlurl' + str(n) + '.png'
    name = 'comictitlehere' + str(n) + '.PNG'
    print 'downloading %s NOW' % url
    # urlretrieve creates and writes the file itself, so no open() is needed
    urlretrieve(url, name)
    n += 1



*It says urlurl... because I cannot post links.

To find the image URL, the simplest method is:
  • Right click the picture
  • Copy the image url
  • Paste to your address bar/empty txt file
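
The same lookup can be done in code instead of by right-clicking. The sketch below, written for Python 3, uses the standard library's `html.parser` to list the `src` of every `<img>` tag in a page. The class name, the helper name, and the sample HTML are made up for illustration, and a real page may need extra filtering to pick out the comic image from logos and banners.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for key, value in attrs:
                if key == "src" and value:
                    self.sources.append(value)


def image_urls(html, page_url):
    """Return absolute URLs for all images found in the HTML text."""
    collector = ImageCollector()
    collector.feed(html)
    # urljoin turns relative paths like 'comic/500.png' into full URLs
    return [urljoin(page_url, src) for src in collector.sources]


# Made-up page content, standing in for HTML fetched from a comic page
sample = '<html><body><img src="comic/500.png"><img src="/logo.gif"></body></html>'
print(image_urls(sample, "http://example.com/strips/index.html"))
```

In practice the HTML text would come from fetching the comic's page first, for example with `urllib.request.urlopen`.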

If the URL is along the lines of 'comic/500.png' or similar, you are fine.

However, my code will not work if the comic uses the date it was posted as the file name, à la Ctrl-Alt-Del.

I could probably find a solution to this, but it is midnight and I have university tomorrow. I'll try to update tomorrow night with a solution to the date problem.

Hope I've helped.

#8
Davison

    Newbie

  • Members
  • 9 posts
from urllib import urlretrieve
import sys, os

def main():
    n = int(sys.argv[1])  # comic number to start from
    while True:
        url = 'urlurlurlurlurl' + str(n) + '.png'
        name = 'comic-' + str(n) + '.PNG'
        # urlretrieve creates and writes the file itself
        urlretrieve(url, name)
        # A very small file is assumed not to be a real comic
        if os.stat(name).st_size < 10000:
            print 'Updated to Comic', str(n-1)
            break
        print 'downloaded %s' % url
        n += 1
    # Delete the undersized file that ended the loop
    os.remove(name)

main()

This is a mildly updated script.

I added the os.stat call.

Now you enter your starting comic number on the command line, and the program fetches every subsequent comic until it stores an image smaller than 10000 bytes (this can be changed to any value you like; it is just an example). At that point it deletes the sub-10000-byte file and exits.
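
For anyone on Python 3, here is a minimal sketch of the same stop-when-the-download-is-too-small idea. It assumes the caller supplies a `fetch` function that returns the raw bytes for a URL (for example a thin wrapper around `urllib.request.urlopen`); `fetch`, `download_until_small`, and the URL template are illustrative names, not part of the script above. Note that in Python 3 `urllib.request.urlopen` raises `HTTPError` for a missing page, so a real wrapper would likely need to catch that as well.

```python
import os


def download_until_small(start, fetch, url_template, min_bytes=10000, folder="."):
    """Save comics start, start+1, ... until one is under min_bytes.

    fetch(url) must return the response body as bytes (hypothetical helper).
    Returns the number of the last comic that was kept, or start - 1 if none.
    """
    n = start
    while True:
        data = fetch(url_template.format(n))
        if len(data) < min_bytes:
            # Too small to be a comic: assume it is an error page and stop
            return n - 1
        path = os.path.join(folder, "comic-{}.png".format(n))
        with open(path, "wb") as out:  # binary mode, since this is image data
            out.write(data)
        n += 1
```

Because the size is checked before anything is written, there is no undersized file to delete afterwards.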

I'm brushing up on regular expressions now to handle comics with date-based URLs.

#9
Davison

    Newbie

  • Members
  • 9 posts
I'm an idiot; regular expressions are not needed.

I love Python's included libraries, by the way. This variant of the code is for sites whose image names follow the pattern

yyyymmdd.jpg
You can change the URL and the extension to fit.

from urllib import urlretrieve
import sys, os, datetime

def parse_date(text):
    # text is a date in yyyymmdd form, e.g. 20090131
    return datetime.date(int(text[:4]), int(text[4:6]), int(text[6:]))

start = parse_date(sys.argv[1])
end = parse_date(sys.argv[2])

print 'Start: ', start, ' End: ', end

site = 'urlurlurlurlurl'
while start <= end:
    # strftime zero-pads the month and day, giving yyyymmdd
    stamp = start.strftime("%Y%m%d")
    url = site + stamp + '.jpg'
    name = 'ctrlaltdel - ' + stamp + '.jpg'
    # urlretrieve creates and writes the file itself
    urlretrieve(url, name)

    # Small files are assumed not to be real comics, so discard them
    if os.stat(name).st_size < 20000:
        os.remove(name)
    else:
        print 'downloaded', url
    start = start + datetime.timedelta(1)


Process:
Get the start date and end date from the command line
Download the image for the current date and check whether it is at least 20000 bytes
If it is smaller than 20000 bytes, delete it
Add 1 day and repeat until the end date has been processed
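
The date-stepping part of the process above can be sketched on its own in Python 3. The generator below only builds the URLs and downloads nothing; `dated_urls` and the example.com address are placeholders made up for illustration.

```python
import datetime


def dated_urls(start_text, end_text, site, ext=".jpg"):
    """Yield (date, url) pairs for every day from start to end, inclusive.

    start_text and end_text are 'yyyymmdd' strings, as on the command line.
    """
    start = datetime.datetime.strptime(start_text, "%Y%m%d").date()
    end = datetime.datetime.strptime(end_text, "%Y%m%d").date()
    day = start
    while day <= end:
        # strftime zero-pads month and day, giving e.g. 20090105
        yield day, site + day.strftime("%Y%m%d") + ext
        day += datetime.timedelta(days=1)


for day, url in dated_urls("20090130", "20090202", "http://example.com/comics/"):
    print(day, url)
```

Keeping the URL building separate from the downloading makes it easy to check the generated addresses before fetching anything.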

Edit: changed the loop condition to start <= end, so that a comic posted on the end date is also downloaded.

#10
Hot_Milo23

    Programmer

  • Members
  • 120 posts
Davison,
You have been an awesome help, but for some reason it still won't work.
I've used urlretrieve to download the Google logo, and I've used it to download every file type (including PNG, which is what the comic is saved as). Both worked, but when I try to do the exact same thing with the comic, it won't.

I'm stumped.

Thanks for the help though :D
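
One way to narrow down a problem like this is to check what actually got saved. The sketch below (Python 3) tests whether a file starts with the eight-byte PNG signature; if it does not, the server probably sent something else, such as an HTML error page. This is only a diagnostic idea, not a confirmed explanation of the failure described above, and `looks_like_png` is a made-up helper name.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def looks_like_png(path):
    """Return True if the file at `path` begins with the PNG signature."""
    with open(path, "rb") as handle:
        return handle.read(8) == PNG_SIGNATURE
```

Running `looks_like_png` on one of the downloaded comic files, and opening the file in a text editor if the result is False, should show whether the download is an image at all.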

#11
debtboy

    Programming God

  • Members
  • 916 posts
Good to see some Python :thumbup1:

#12
Davison

    Newbie

  • Members
  • 9 posts
Can you post the code the way you are using it, and, if possible, the site as well, so I can see the URL format?