Indexing files on an Apache server.

11 replies to this topic

#1
hodge-podge

    Learning Programmer

  • Members
  • 47 posts
Alright, here's my problem. The Library of Congress has an atrocious site. Navigating the UI is painful enough, but finding pictures is even harder. As of now I've resorted to going directly through the Apache index pages to look at each photo. For each photo, however, there is a .gif copy and a .tif copy along with the standard .jpg copy. For some reason Apache doesn't have the option to arrange the files by file type, and sorting through these pictures is almost as annoying as using the main site. So what I'm hoping to do is write a program to scrape the HTML of the page and arrange the links by file type.

And this is what I'm talking about when I say index page...

http://memory.loc.go...b41000/3b41500/

So here's my question:

What language would be best to accomplish this?
Or is this not a simple task, so that I'd be better off just dealing with it?
If that is the case, does anyone know of software that can do this?
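
For reference, the task described above (scrape the index page's HTML and arrange the links by file type) can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not the program attached later in the thread: the names `LinkCollector`, `group_links_by_extension`, and `fetch_index` are made up for illustration, the file names in the sample are hypothetical, and it assumes the index page is a plain Apache listing in which every file is an ordinary `<a href>` link.

```python
from collections import defaultdict
from html.parser import HTMLParser
from posixpath import splitext
from urllib.parse import urlsplit
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def group_links_by_extension(html):
    """Return {extension: sorted list of links} for the file links in html."""
    parser = LinkCollector()
    parser.feed(html)
    groups = defaultdict(list)
    for href in parser.links:
        path = urlsplit(href).path
        # Skip column-sort links such as "?C=N;O=D" (empty path) and
        # directory links such as the parent directory (trailing slash).
        if not path or path.endswith("/"):
            continue
        extension = splitext(path)[1].lower() or "(none)"
        groups[extension].append(href)
    return {ext: sorted(links) for ext, links in groups.items()}


def fetch_index(url):
    """Download one index page; url is whatever listing you want to sort."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Hypothetical listing; a real run would use fetch_index(url) instead.
    sample = (
        '<a href="?C=N;O=D">Name</a> <a href="/parent/">Parent Directory</a> '
        '<a href="photo1.gif">photo1.gif</a> <a href="photo1.jpg">photo1.jpg</a> '
        '<a href="photo1.tif">photo1.tif</a> <a href="photo2.jpg">photo2.jpg</a>'
    )
    for extension, links in sorted(group_links_by_extension(sample).items()):
        print(extension, links)
```

Extensions are lowercased so that `.JPG` and `.jpg` end up in the same group, and the links themselves are left untouched so they can still be joined to the index URL.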

#2
dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,721 posts
Funny you should ask. Last year I actually did something very similar, parsing the HTML of the LOC site for ISBN information. I can give you the code and you can modify it, if you like. What operating system do you want this for?
sudo rm -rf /

#3
hodge-podge

    Learning Programmer

  • Members
  • 47 posts
That'd be awesome. When you say operating system, I'm assuming you are asking what OS I'm using. I am using Windows....

#4
dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,721 posts
Perfect. Here you go. Mind you, it's multithreaded, so be careful how many threads you use. I used 64 and crashed the LOC server. Once I stopped my program the site was back up.

I am not responsible if things go wrong. :)

Attached Files



#5
hodge-podge

    Learning Programmer

  • Members
  • 47 posts
Haha, you really crashed it? How would it do that? Simply too many requests? And thanks for this, I really appreciate it.

#6
dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,721 posts
Apparently it can't handle 64 requests at the same time. Basically an unintentional DoS attack. I scaled it back to 32 and I think that worked.
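
To illustrate the idea of capping simultaneous requests, here is a rough Python sketch using a bounded thread pool. It is not the attached program, which is not shown in this thread; `fetch_all`, its `fetch` argument, and the default of 4 workers are assumptions chosen for illustration, and what a given server tolerates will vary.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL with at most max_workers running at once.

    fetch is any function that downloads one URL and returns its content.
    Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map never runs more than max_workers calls concurrently,
        # so the server sees at most that many open requests from us.
        return list(pool.map(fetch, urls))
```

Starting with a small `max_workers` and raising it only if the server stays responsive avoids the kind of accidental overload described above.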

#7
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
I remember the first time I heard about that. Somehow, I can actually believe that. A DDoS from a single computer in a campus dorm :)
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

#8
dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,721 posts
Amazing what people can do with their free time. Just to be an idiot I sent two friends an email from another friend (we're all close, so it's okay). It had...some rather...um... I'll just leave it at "there were goats involved." :D Anyway, I spoofed the headers to make it seem like it came from my victim friend. The others eventually figured it out and came to my room at three in the morning. When I answered the door, they attacked me with a large vibrating dildo. Apparently they found it at a frat house.

#9
chirag.jain18

    Newbie

  • Members
  • 10 posts

dargueta said:

I spoofed the headers to make it seem like it came from my victim friend.
Hey, what email server were you using? How did you change the headers? I am interested to know how it is done. Please guide me.

#10
dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,721 posts
Pretty much any computer connected to a network with a program like sendmail installed can pull it off.

Edited by dargueta, 10 March 2010 - 12:30 AM.
Added link


#11
hodge-podge

    Learning Programmer

  • Members
  • 47 posts

dargueta said:

Perfect. Here you go. Mind you, it's multithreaded, so be careful how many threads you use. I used 64 and crashed the LOC server. Once I stopped my program the site was back up.

I am not responsible if things go wrong. :)

Hmmm, I'm getting a lot of errors when compiling. You think it could be that I don't have some necessary libraries?

It seems like the problem lies in the http.h header, which it can't seem to find. I'll look it up and see if that's it.

#12
dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,721 posts
Oops. I attached the files. The http.txt file is actually http.h, but it won't let me upload a .h file for reasons beyond my understanding.

Attached Files


Edited by dargueta, 11 March 2010 - 05:47 PM.
Forgot to attach files.
