Thread: Need help extracting text from pdf files

  1. #1
    Newbie Jumbala102
    Join Date
    Mar 2009
    Posts
    4

    Need help extracting text from pdf files

    Okay, so I'm looking for a way to extract text from some PDF files and output it to a .txt file so it's easier to manipulate (I have to sort the entries by highest average, etc.).

    I thought about using Python at first, since it's pretty easy to sort data from a .txt file with it. I found the pyPDF library, but it's giving me some weird strings because of the formatting of the PDF files, so I don't think that library will do.

    Here's an example of one of the PDF files in question:
    complexejuliequilles.com/files/_leagues_Leagues1_1240.pdf

    The result should be something like what you get when you use the select text tool in Adobe Reader and copy and paste into a .txt file. If it comes out differently, I don't mind all that much, as long as I can work with the data extracted from the PDF files.

    It doesn't have to be in Python either. If you've got an alternative, just let me know. I know some C++ and Java, but I can learn another language pretty easily, I guess.

    Thanks a lot in advance!

    Jumbala102

  2. #2
    Super Moderator WingedPanther
    Join Date
    Jul 2006
    Age
    36
    Posts
    11,665
    Blog Entries
    57

    Re: Need help extracting text from pdf files

    Are you going to be dealing with a large number of files, or is this something that can be partially dealt with by hand? I've used OpenOffice to open .pdf files (using a plugin).
    CodeCall Blog | CodeCall Wiki | Shareware
    Programming is a branch of mathematics.
    My CodeCall Blog | My Personal Blog

  3. #3
    Newbie Jumbala102

    Re: Need help extracting text from pdf files

    I'm going to be dealing with 28 files on a weekly basis (most of them are larger than the one I linked in the original post, too), so I'd really like it if I didn't have to do it all by hand...

    Right now, every week I have to find the top 20 averages for men and the top 20 for women. I open each file one after the other and try to find the top 20 of each category manually, which obviously takes a while. That's why I'd rather have it automated. It would also make it possible for me to sort every single entry instead of just the top 20 (there are about two thousand entries, more or less, so doing it all by hand is out of the question).

    Thanks for replying, though. If you find a way to deal with those PDF files, I'd really appreciate it!

    Jumbala102
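
Editor's sketch: once the entries are out of the PDFs, the "top 20 per category" step described above could look roughly like this in Python. It assumes each entry has already been parsed into a (name, category, average) tuple; that layout, the function name `top_averages`, and the sample names and numbers are all made up for illustration and do not come from the linked PDF.

```python
import heapq
from collections import defaultdict


def top_averages(records, n=20):
    """Return {category: [(name, average), ...]} holding the n highest
    averages in each category, best first."""
    by_category = defaultdict(list)
    for name, category, average in records:
        by_category[category].append((name, average))
    return {
        category: heapq.nlargest(n, bowlers, key=lambda b: b[1])
        for category, bowlers in by_category.items()
    }


# Made-up sample entries: (name, category, average)
records = [
    ("A. Tremblay", "M", 201.5),
    ("B. Roy", "M", 187.0),
    ("C. Gagnon", "F", 176.3),
    ("D. Cote", "F", 192.8),
    ("E. Morin", "M", 210.2),
]

top = top_averages(records, n=2)
for category, bowlers in sorted(top.items()):
    print(category, bowlers)
```

`heapq.nlargest` avoids sorting the full list when only the top few are needed; with about two thousand entries a plain `sorted(...)` would be just as practical.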

  4. #4
    Super Moderator WingedPanther

    Re: Need help extracting text from pdf files

    I would look into the OpenOffice macro language... I think it may be able to open each file and get the info you want.

  5. #5
    Newbie Jumbala102

    Re: Need help extracting text from pdf files

    I guess I'll look into it, but I don't know where to start... Do you have a site I could visit or something? Do the macros work kind of like programming, or are they pre-built functions? Also, would they let me sort the data I extract, or would I have to do it in two steps? Like: step 1, use the macros in OpenOffice to extract the data, and step 2, use Python to work on the data and sort it the way I want?
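
Editor's sketch: step 2 of the two-step approach asked about above (Python working on text that was already extracted) might look something like this. The line layout is an assumption, since the thread never shows the extracted text: each useful line is taken to be a name followed by the average as the last field. `LINE_PATTERN`, `parse_lines`, and `sort_by_average` are hypothetical names, and the sample lines are invented.

```python
import re

# Assumed layout: a name, some whitespace, then the average as the last field.
LINE_PATTERN = re.compile(r"^(?P<name>.+?)\s+(?P<average>\d+(?:\.\d+)?)\s*$")


def parse_lines(lines):
    """Turn raw text lines into (name, average) tuples, skipping any line
    that does not end in a number (titles, blank lines, and so on)."""
    entries = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            entries.append((match.group("name").strip(),
                            float(match.group("average"))))
    return entries


def sort_by_average(entries):
    """Return the entries ordered from highest to lowest average."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Invented sample of what an extracted .txt file might contain
sample = [
    "League standings",
    "John Smith 175",
    "",
    "Jane Doe   182.4",
]

ranked = sort_by_average(parse_lines(sample))
for name, average in ranked:
    print(f"{average:6.1f}  {name}")
```

In practice the lines would come from reading each exported .txt file. Any header line that happens to end in a number would be picked up as an entry, so the pattern would need tightening against the real files.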

  6. #6
    Super Moderator WingedPanther

    Re: Need help extracting text from pdf files

    Macros are written in a programming language that runs within another application. I would look in the help system for more information.

  7. #7
    Newbie Jumbala102

    Re: Need help extracting text from pdf files

    Okay, I think I'm going to do that... Do you guys think it would be possible to do it in Microsoft Office instead of OpenOffice, though? We use Microsoft Office at my job (a bowling center; it's a student job), and my boss is computer illiterate, so I'd rather have my stuff work without having to install other programs (other than a Python interpreter or some small apps like that).

  8. #8
    Super Moderator WingedPanther

    Re: Need help extracting text from pdf files

    I don't know of anything that lets MS Office open PDF documents. I believe OOo 3.0 can use the same scripting language, however.

  9. #9
    Newbie togs
    Join Date
    Apr 2009
    Posts
    3

    Re: Need help extracting text from pdf files

    Pretty long shot, but perhaps try using Adobe's accessibility service to convert to text, then parse using whatever you like:

    adobe.com/products/acrobat/access_onlinetools.html

    You might even be able to do the submission programmatically.

    Cheers,
    togs
