Thread: Need help extracting text from pdf files

  1. #1
    Jumbala102 is offline Newbie
    Join Date
    Mar 2009
    Posts
    4
    Rep Power
    0

    Need help extracting text from pdf files

    Okay, so I'm looking for a way to extract text from some PDF files and write it to a .txt file so it's easier to manipulate (I have to sort the entries by highest average, etc.).

    I thought about using Python at first, since it's pretty easy to sort data from a .txt file with it. I found the pyPdf library, but it gives me some weird strings because of the formatting of the PDF files, so I don't think that library will do.

    Here's an example of one of the PDF files in question:
    complexejuliequilles.com/files/_leagues_Leagues1_1240.pdf

    The result should be something like what you get when you select the text in Adobe Reader and copy and paste it into a .txt file. If it comes out differently, I don't mind all that much, as long as I can work with the data extracted from the PDF files.

    It doesn't have to be in Python either. If you've got an alternative, just let me know. I know some C++ and Java, but I can learn another language pretty easily, I guess.

    Thanks a lot in advance!

    Jumbala102

  3. #2
    Join Date
    Jul 2006
    Posts
    16,486
    Blog Entries
    75
    Rep Power
    143

    Re: Need help extracting text from pdf files

    Are you going to be dealing with a large number of files, or is this something that can be partially dealt with by hand? I've used OpenOffice to open .pdf files (using a plugin).
    Programming is a branch of mathematics.
    My CodeCall Blog | My Personal Blog

  4. #3
    Jumbala102 is offline Newbie

    Re: Need help extracting text from pdf files

    I'm going to be dealing with 28 files on a weekly basis (most of them are larger than the one I linked in my original post), so I'd really like to avoid doing it all by hand...

    Right now, every week I have to find the top 20 averages for men and the top 20 for women. I open each file one after the other and try to find the top 20 of each category manually, which obviously takes a while. That's why I'd rather have it automated. It would also let me sort every single entry instead of just the top 20 (there are about two thousand entries, so doing it all by hand is out of the question).

    Thanks for replying, though. If you find a way to deal with those PDF files, I'd really appreciate it!

    Jumbala102
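
For the ranking step described in this post, here is a minimal Python sketch. It assumes the text has already been extracted and that each bowler sits on one line as `name;category;average`, with category `M` or `F`. That line format, the sample names, and the function names `parse_entries` and `top_averages` are made up for illustration; the real league files will almost certainly need their own parsing.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    category: str  # "M" or "F" (assumed coding, not taken from the real files)
    average: float


def parse_entries(text: str) -> list[Entry]:
    """Parse lines of the assumed form 'name;category;average'."""
    entries = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) != 3:
            continue  # skip blank lines and anything with the wrong shape
        name, category, raw_average = parts
        try:
            average = float(raw_average)
        except ValueError:
            continue  # skip header rows or rows whose average is not a number
        entries.append(Entry(name, category, average))
    return entries


def top_averages(entries: list[Entry], category: str, n: int = 20) -> list[Entry]:
    """Return the n highest averages within one category, best first."""
    selected = [e for e in entries if e.category == category]
    return sorted(selected, key=lambda e: e.average, reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical sample data, only to show the call pattern.
    sample = """Name;Category;Average
Alice;F;182.5
Bob;M;201
Carol;F;190.25
Dave;M;175
"""
    for entry in top_averages(parse_entries(sample), "F", n=2):
        print(f"{entry.name}\t{entry.average}")
```

Leaving `n` at its default of 20 gives the weekly top 20 for a category, and sorting the full list instead of slicing it covers the "sort every single entry" case.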

  5. #4
    Join Date
    Jul 2006
    Posts
    16,486
    Blog Entries
    75
    Rep Power
    143

    Re: Need help extracting text from pdf files

    I would look into the OpenOffice macro language... I think it may be able to open each file and get the info you want.

  6. #5
    Jumbala102 is offline Newbie

    Re: Need help extracting text from pdf files

    I guess I'll look into it, but I don't know where to start... Do you have a site I could visit or something? Do the macros work kind of like a programming language, or are they pre-built functions? Also, would they let me sort the data I extract, or would I have to do it in two steps: (1) use the macros in OpenOffice to extract the data, then (2) use Python to sort it the way I want?
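
If it does end up as two steps, the second one could look roughly like the sketch below: read every extracted .txt file in a folder, merge the rows, and write one file sorted by average. It assumes step 1 produced rows of the form `name;category;average`; that format and the name `combine_and_sort` are assumptions for illustration, not something from the thread.

```python
import csv
from pathlib import Path


def combine_and_sort(text_dir: Path, out_file: Path) -> int:
    """Merge every .txt file in text_dir into out_file, highest average first.

    Each input row is assumed to look like 'name;category;average'.
    Returns the number of rows written.
    """
    rows = []
    for path in sorted(text_dir.glob("*.txt")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle, delimiter=";"):
                if len(row) != 3:
                    continue  # not a data row
                try:
                    rows.append((row[0].strip(), row[1].strip(), float(row[2])))
                except ValueError:
                    continue  # header row or malformed average
    rows.sort(key=lambda r: r[2], reverse=True)
    with out_file.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=";").writerows(rows)
    return len(rows)
```

Give the output file a different extension (or a different folder) than the inputs; otherwise a second run would pick it up as one more input file.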

  7. #6
    Join Date
    Jul 2006
    Posts
    16,486
    Blog Entries
    75
    Rep Power
    143

    Re: Need help extracting text from pdf files

    Macros are written in a programming language that runs within another application. I would look in the help system for more information.

  8. #7
    Jumbala102 is offline Newbie

    Re: Need help extracting text from pdf files

    Okay, I think I'm going to do that... Do you guys think it would be possible to do it in Microsoft Office instead of OpenOffice, though? We use Microsoft Office at my job (a student job at a bowling center), and my boss is computer illiterate, so I'd rather have my stuff work without having to install other programs (other than a Python interpreter or some small apps like that).

  9. #8
    Join Date
    Jul 2006
    Posts
    16,486
    Blog Entries
    75
    Rep Power
    143

    Re: Need help extracting text from pdf files

    I don't know of anything that lets MS Office open PDF documents. I believe OOo 3.0 can use the same scripting language, however.

  10. #9
    togs is offline Newbie
    Join Date
    Apr 2009
    Posts
    3
    Rep Power
    0

    Re: Need help extracting text from pdf files

    Pretty long shot, but perhaps try using Adobe's accessibility service to convert to text, then parse using whatever you like:

    adobe.com/products/acrobat/access_onlinetools.html

    You might even be able to do the submission programmatically.

    Cheers,
    togs
