Alright, here's my problem. The Library of Congress has an atrocious site. Navigating the UI is painful enough, but finding pictures is even harder. As of now I've resorted to going directly through the Apache index pages to look at each photo. For each photo, however, there is a .gif copy and a .tif copy along with the standard .jpg copy. For some reason Apache doesn't have an option to arrange the files by file type, and sorting through these pictures is almost as annoying as using the main site. So what I'm hoping to do is write a program to scrape the HTML of the page and arrange the links by file type.
And this is what I'm talking about when I say index page...
http://memory.loc.go...b41000/3b41500/
So here's my question:
What language would be best to accomplish this?
And is this actually not a simple task, so that I'd be better off just putting up with it?
If that is the case, does anyone know of software that can do this?
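For reference, the task described above (pull the links out of an index page's HTML and group them by file extension) can be sketched in Python using only the standard library. This is a minimal sketch, not anyone's actual solution from this thread: the `LinkCollector` and `group_links_by_type` names, the sample HTML, the file names, and the `example.org` base URL are all made up for illustration, and it assumes the index page lists files as plain `<a href="...">` links, with column-sort links starting with `?` and directories ending in `/`.

```python
from collections import defaultdict
from html.parser import HTMLParser
from os.path import splitext
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def group_links_by_type(html, base_url):
    """Return {extension: [absolute URLs]} for the file links in html."""
    parser = LinkCollector()
    parser.feed(html)
    groups = defaultdict(list)
    for href in parser.links:
        # Assumption: sort links start with "?" and directories end with "/".
        if href.startswith("?") or href.endswith("/"):
            continue
        ext = splitext(urlparse(href).path)[1].lower()
        groups[ext or "(none)"].append(urljoin(base_url, href))
    return dict(groups)


# Hypothetical sample standing in for a real index page.
SAMPLE = """
<a href="?C=N;O=D">Name</a>
<a href="/parent/">Parent Directory</a>
<a href="0001r.jpg">0001r.jpg</a>
<a href="0001t.gif">0001t.gif</a>
<a href="0001u.tif">0001u.tif</a>
<a href="0002r.JPG">0002r.JPG</a>
"""

if __name__ == "__main__":
    grouped = group_links_by_type(SAMPLE, "http://example.org/index/")
    for ext in sorted(grouped):
        print(ext)
        for url in grouped[ext]:
            print("   ", url)
```

To run it against a live page, the HTML string could come from `urllib.request.urlopen(url).read().decode()`; extensions are lowercased so `.JPG` and `.jpg` land in the same group.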
Indexing files on an apache server.
Started by hodge-podge, Mar 09 2010 05:12 PM
11 replies to this topic
#1
Posted 09 March 2010 - 05:12 PM
#2
Posted 09 March 2010 - 05:24 PM
Funny you should ask - Last year I actually did something incredibly similar parsing the HTML of the LOC site for ISBN information. I can give you the code and you can modify it, if you like. What operating system do you want this for?
sudo rm -rf /
#3
Posted 09 March 2010 - 05:32 PM
That'd be awesome. When you say operating system, I'm assuming you are asking what OS I'm using. I am using Windows....
#4
Posted 09 March 2010 - 05:40 PM
Perfect. Here you go. Mind you, it's multithreaded, so be careful how many threads you use. I used 64 and crashed the LOC server. Once I stopped my program the site was back up.
I am not responsible if things go wrong. :)
Attached Files
#5
Posted 09 March 2010 - 05:47 PM
Haha, you really crashed it? How would it do that? Simply too many requests? And thanks for this, I really appreciate it.
#6
Posted 09 March 2010 - 05:51 PM
Apparently it can't handle 64 requests at the same time. Basically an unintentional DoS attack. I scaled it back to 32 and I think that worked.
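The point about thread counts can be illustrated with a small Python sketch that caps how many requests run at once. This is not the attached program (which is not shown in the thread); `fetch_all`, `ConcurrencyProbe`, and the default of 4 workers are hypothetical names and values chosen for illustration, and the probe only simulates a download by sleeping so the example runs without touching any server.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL, with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


class ConcurrencyProbe:
    """Stand-in for a real download that records peak simultaneous calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, url):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # simulate network latency
        with self._lock:
            self.active -= 1
        return len(url)


if __name__ == "__main__":
    # Hypothetical URLs; nothing is actually requested.
    urls = ["http://example.org/file%d.jpg" % i for i in range(20)]
    probe = ConcurrencyProbe()
    fetch_all(urls, probe, max_workers=4)
    print("peak simultaneous requests:", probe.peak)
```

With a real `fetch` function in place of the probe, `max_workers` is the knob that corresponds to the 64 versus 32 threads discussed above; starting low and adding a short delay between requests is the gentler choice for a shared public server.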
#7
Posted 09 March 2010 - 05:52 PM
I remember the first time I heard about that. Somehow, I can actually believe that. A DDoS from a single computer in a campus dorm :)
#8
Posted 09 March 2010 - 05:57 PM
Amazing what people can do with their free time. Just to be an idiot I sent two friends an email from another friend (we're all close, so it's okay). It had...some rather...um... I'll just leave it at "there were goats involved." :D Anyway, I spoofed the headers to make it seem like it came from my victim friend. The others eventually figured it out and came to my room at three in the morning. When I answered the door, they attacked me with a large vibrating dildo. Apparently they found it at a frat house.
#9
Posted 10 March 2010 - 12:26 AM
dargueta said:
I spoofed the headers to make it seem like it came from my victim friend.
#11
Posted 11 March 2010 - 12:32 PM
dargueta said:
Perfect. Here you go. Mind you, it's multithreaded, so be careful how many threads you use. I used 64 and crashed the LOC server. Once I stopped my program the site was back up.
I am not responsible if things go wrong. :)
Hmmm, I'm getting a lot of errors when compiling. You think it could be that I don't have some necessary libraries?
It seems like the problem lies in the http.h header, which it can't seem to find. I'll look it up and see if that's it.
#12
Posted 11 March 2010 - 12:46 PM
Oops. I've attached the files now. The http.txt file is actually http.h, but it won't let me upload a .h file for reasons beyond my understanding.
Attached Files
Edited by dargueta, 11 March 2010 - 05:47 PM.
Forgot to attach files.

