Python: discussion forum for Python, a high-level language with simple yet powerful syntax.
Need help extracting text from pdf files
Okay, so I'm looking for a way to extract text from some PDF files and write it to a .txt file, which would make it easier to manipulate (I have to sort the entries by highest average, etc.).

I thought about using Python at first, since sorting data from a .txt file is pretty easy with it. I found the pyPDF library, but it gives me weird strings because of the way the PDF files are formatted, so I don't think that library will do. Here's an example of one of the PDF files in question: complexejuliequilles.com/files/_leagues_Leagues1_1240.pdf

The result should look like what you get when you select the text in Adobe Reader, copy it, and paste it into a .txt file. If it comes out differently, I don't mind much, as long as I can work with the data extracted from the PDF files. It doesn't have to be in Python either. If you've got an alternative, just let me know: I know some C++ and Java, and I can pick up another language pretty easily.

Thanks a lot in advance!

Jumbala102
Re: Need help extracting text from pdf files
Are you going to be dealing with a large number of files, or is this something that can be partially dealt with by hand? I've used OpenOffice to open .pdf files (using a plugin).
__________________
CodeCall Blog | CodeCall Wiki | Shareware Programming is a branch of mathematics. My CodeCall Blog | My Personal Blog |
Re: Need help extracting text from pdf files
I'm going to be dealing with 28 files on a weekly basis (most of them larger than the one linked in my original post), so I'd really rather not do it all by hand...

Right now, every week I have to take the top 20 averages for men and the top 20 for women. I open each file one after the other and look for the top 20 of each category manually, which obviously takes a while. That's why I'd rather have it automated. It would also let me sort every single entry instead of just the top 20 (there are about two thousand entries, so doing all of that by hand is out of the question).

Thanks for replying, though. If you find a way to deal with those PDF files, I'd really appreciate it!

Jumbala102
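For what it's worth, once the entries are available as plain data, the "top 20 per category" step described above is short in Python. The following is only a sketch under assumptions: the record layout `(name, category, average)`, the category labels `"M"` and `"F"`, and the function name `top_by_category` are all made up for illustration, since the thread does not say how the PDF marks men and women.

```python
import heapq
from collections import defaultdict


def top_by_category(records, n=20):
    """Group records by category and keep the n highest averages in each.

    `records` is an iterable of (name, category, average) tuples.
    This layout is an assumption, not something taken from the PDF files.
    """
    groups = defaultdict(list)
    for name, category, average in records:
        groups[category].append((name, average))

    # heapq.nlargest returns its results sorted from highest to lowest,
    # so each list is already in ranking order.
    return {
        category: heapq.nlargest(n, players, key=lambda player: player[1])
        for category, players in groups.items()
    }


# Made-up sample data, only to show the call.
sample = [
    ("Player A", "M", 187.4),
    ("Player B", "F", 172.0),
    ("Player C", "M", 201.3),
    ("Player D", "F", 180.5),
]
top = top_by_category(sample, n=1)
```

Passing an `n` larger than the number of entries simply returns every entry in descending order, which would cover the "sort every single one of them" case as well.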
Re: Need help extracting text from pdf files
I would look into the OpenOffice macro language... I think it may be able to open each file and get the info you want.
Re: Need help extracting text from pdf files
I guess I'll look into it, but I don't know where to start... Do you have a site I could visit or something? Do the macros work kind of like programming, or are they pre-built functions? Also, would it let me sort the data I extract, or would I have to do it in two steps? Like: step 1, use the macros in OpenOffice to extract the data, and step 2, use Python to sort the data the way I want?
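If the two-step route is taken, the start of step 2 could be as small as the sketch below. It assumes step 1 has already left one `.txt` file per PDF in a single folder; the folder layout, the UTF-8 encoding, and the helper name `load_lines` are assumptions for illustration, not something any of the tools mentioned here guarantee.

```python
from pathlib import Path


def load_lines(folder):
    """Collect the non-empty lines of every .txt file in `folder`.

    Files are read in name order so that repeated runs give the same result.
    Undecodable bytes are replaced instead of raising, because the encoding
    of the exported text is not known in advance (assumed UTF-8 here).
    """
    lines = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            stripped = line.strip()
            if stripped:
                lines.append(stripped)
    return lines
```

Pooling the lines from all the weekly files into one list first means the sorting afterwards can run once over everything, rather than file by file.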
Re: Need help extracting text from pdf files
Macros are written in a programming language that runs within another application. I would look in the help system for more information.
Re: Need help extracting text from pdf files
Okay, I think I'm going to do that... Do you guys think it would be possible to do it in Microsoft Office instead of OpenOffice, though? We use Microsoft Office at my job (a student job at a bowling center), and my boss is computer illiterate, so I'd rather have my stuff work without having to install other programs (other than a Python interpreter or small apps like that).
Re: Need help extracting text from pdf files
I don't know of anything that lets MS Office open PDF documents. I believe OOo 3.0 can use the same scripting language, however.
Re: Need help extracting text from pdf files
Pretty long shot, but perhaps try using Adobe's accessibility service to convert to text, then parse using whatever you like:
adobe.com/products/acrobat/access_onlinetools.html
You might even be able to do the submission programmatically.
Cheers, togs
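Whichever route produces the text, the "parse using whatever you like" step might look roughly like this in Python. It is a sketch only: it assumes each useful line ends with a numeric average and that everything before it is the name, which may not match the real layout of the league files. The names `LINE_PATTERN` and `parse_line` are hypothetical.

```python
import re

# Assumed line shape: "<name> <average>", with the average as the last token.
# The real files may contain more columns; adjust the pattern to match them.
LINE_PATTERN = re.compile(r"^(?P<name>.+?)\s+(?P<average>\d+(?:\.\d+)?)$")


def parse_line(line):
    """Return (name, average) for a data line, or None for anything else.

    Returning None lets the caller skip headers, page footers, and blank
    lines without special-casing each of them.
    """
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("name"), float(match.group("average"))


# Made-up lines, only to show the call.
parsed = [parse_line(text) for text in ["Jane Doe 181.5", "League standings"]]
```

Lines that do not fit the pattern come back as `None`, so filtering them out before sorting keeps stray headings from ending up in the ranking.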