CodeCall Programming Forum > Software Development > Python
  #1 (permalink)  
Old 03-03-2009, 10:31 AM
Newbie
 
Join Date: Mar 2009
Posts: 4
Jumbala102
Need help extracting text from pdf files

Okay, so I'm looking for a way to extract text from some PDF files and write it to a .txt file so it's easier to manipulate (I have to sort the entries by highest average, etc.).

I thought about using Python at first, since it's pretty easy to sort data from a .txt file with it. I found the pyPdf library, but it gives me some weird strings because of the formatting of the PDF files, so I don't think that library will do.

Here's an example of one of the PDF files in question:
complexejuliequilles.com/files/_leagues_Leagues1_1240.pdf

The result should be something like what you get when you use the Select Text tool in Adobe Reader and copy and paste into a .txt file. If it comes out differently, I don't mind much, as long as I can work with the data extracted from the PDF files.

It doesn't have to be in Python either. If you've got an alternative, just let me know. I know some C++ and Java, and I can probably pick up another language pretty easily.

Thanks a lot in advance!

Jumbala102
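
One route that stays close to the copy-and-paste result described above is to drive the `pdftotext` command-line tool (part of Xpdf/Poppler) from Python. This is only a minimal sketch: it assumes `pdftotext` is installed and on the PATH, which nothing in this thread confirms, and the helper names are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call; -layout tries to keep columns aligned."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Convert every .pdf in `folder` to a .txt file next to it.

    Assumes the external `pdftotext` tool is installed (an assumption).
    Returns the list of .txt paths that were written.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext fails.
        subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
        written.append(txt_path)
    return written
```

Calling `convert_folder("leagues")` (a hypothetical folder holding the weekly PDFs) would leave one .txt file beside each PDF. Whether the text comes out cleanly depends on how the PDFs were produced, so the output still needs checking by eye.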
  #2 (permalink)  
Old 03-03-2009, 12:06 PM
WingedPanther
Super Moderator
 
Join Date: Jul 2006
Age: 36
Posts: 11,435
Re: Need help extracting text from pdf files

Are you going to be dealing with a large number of files, or is this something that can be partially dealt with by hand? I've used OpenOffice to open .pdf files (using a plugin).
__________________
CodeCall Blog | CodeCall Wiki | Shareware
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog
  #3 (permalink)  
Old 03-03-2009, 03:05 PM
Newbie
 
Join Date: Mar 2009
Posts: 4
Jumbala102
Re: Need help extracting text from pdf files

I'm going to be dealing with 28 files on a weekly basis (most of them are larger than the one I linked in the original post), so I'd really like to avoid doing it all by hand...

Right now, every week I have to pick out the top 20 averages for men and the top 20 for women. I open each file one after the other and try to find the top 20 of each category manually, which obviously takes a while, so I'd rather have it automated. That would also let me sort every single entry instead of just the top 20 (there are about two thousand entries, so doing it all by hand is out of the question).

Thanks for replying, though. If you find a way to deal with those PDF files, I'd really appreciate it!

Jumbala102
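
Once the entries are out of the PDFs, the weekly "top 20 men, top 20 women" step is short in Python. The sketch below assumes each entry has already been turned into a `(name, category, average)` tuple; that layout, the function name, and the sample data are all made up for illustration and are not taken from the real files.

```python
import heapq
from collections import defaultdict


def top_averages(records, n=20):
    """Return the n highest averages per category, best first.

    `records` is an iterable of (name, category, average) tuples;
    that layout is an assumption about the extracted data.
    """
    by_category = defaultdict(list)
    for name, category, average in records:
        by_category[category].append((name, average))
    return {
        category: heapq.nlargest(n, bowlers, key=lambda b: b[1])
        for category, bowlers in by_category.items()
    }


# Made-up sample data, only to show the call.
sample = [
    ("Alice", "women", 172.4),
    ("Bob", "men", 201.0),
    ("Carol", "women", 188.9),
    ("Dave", "men", 195.5),
]
print(top_averages(sample, n=1))
```

With about two thousand entries, sorting everything instead of taking the top 20 costs next to nothing, so passing a larger `n` (or sorting each list fully) covers the "sort every single one" case as well.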
  #4 (permalink)  
Old 03-03-2009, 03:54 PM
WingedPanther
Super Moderator
 
Join Date: Jul 2006
Age: 36
Posts: 11,435
Re: Need help extracting text from pdf files

I would look into the OpenOffice macro language... I think it may be able to open each file and get the info you want.
  #5 (permalink)  
Old 03-03-2009, 04:01 PM
Newbie
 
Join Date: Mar 2009
Posts: 4
Jumbala102
Re: Need help extracting text from pdf files

I guess I'll look into it, but I don't know where to start... Do you have a site I could visit or something? Do the macros work like writing a program, or are they pre-built functions? Also, would they let me sort the data I extract, or would I have to do it in two steps: (1) use the macros in OpenOffice to extract the data, and (2) use Python to work on the data and sort it the way I want?
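
For the second step of that two-step idea, a rough sketch of what the Python side could look like once the text is in a .txt file. The line layout here (a name followed by a numeric average at the end of the line) is purely a guess, and the sample text is invented; the real extracted text may need a different pattern.

```python
import re

# Hypothetical line layout: a name, then a number (the average) at the
# end of the line. The real text from the PDFs may look different.
LINE_PATTERN = re.compile(r"^\s*(?P<name>.+?)\s+(?P<average>\d+(?:\.\d+)?)\s*$")


def parse_lines(lines):
    """Turn text lines into (name, average) pairs, skipping non-matches."""
    entries = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            entries.append((match.group("name"), float(match.group("average"))))
    return entries


def sort_by_average(entries):
    """Return the entries ordered with the highest average first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Invented sample text standing in for one extracted file.
text = """League standings
Jane Doe      187
John Smith    201.5
"""
print(sort_by_average(parse_lines(text.splitlines())))
```

Lines that do not end in a number (headings, blank lines) are simply skipped, which keeps the parser tolerant of page furniture in the extracted text.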
  #6 (permalink)  
Old 03-03-2009, 10:38 PM
WingedPanther
Super Moderator
 
Join Date: Jul 2006
Age: 36
Posts: 11,435
Re: Need help extracting text from pdf files

Macros are written in a programming language that runs inside another application. I would look in the help system for more information.
  #7 (permalink)  
Old 03-03-2009, 10:41 PM
Newbie
 
Join Date: Mar 2009
Posts: 4
Jumbala102
Re: Need help extracting text from pdf files

Okay, I think I'm going to do that... Do you guys think it would be possible to do it in Microsoft Office instead of OpenOffice, though? We use Microsoft Office at my job (a student job at a bowling center) and my boss is computer illiterate, so I'd rather have my stuff work without having to install other programs (other than a Python interpreter or some small apps like that).
  #8 (permalink)  
Old 03-03-2009, 10:46 PM
WingedPanther
Super Moderator
 
Join Date: Jul 2006
Age: 36
Posts: 11,435
Re: Need help extracting text from pdf files

I don't know of anything that lets MS Office open pdf documents. I believe OOo 3.0 can use the same scripting language, however.
  #9 (permalink)  
Old 04-12-2009, 10:38 AM
Newbie
 
Join Date: Apr 2009
Posts: 3
togs
Re: Need help extracting text from pdf files

Pretty long shot, but perhaps try using Adobe's accessibility service to convert to text, then parse using whatever you like:

adobe.com/products/acrobat/access_onlinetools.html

You might even be able to do the submission programmatically.

Cheers,
togs