
Getting Text, Image or PDF from a PDF

extract pdf

2 replies to this topic

#1 Dorgon


Posted 04 January 2012 - 02:19 AM

Good morning,

I've been searching for a long time for a program or script that can extract information from a PDF file at given coordinates.
But I didn't find anything, so I hope someone knows a program that can do this.

Why do I need it?

At the company where I currently work as a trainee, bills from our suppliers are checked and paid entirely by hand.
They want this to be semi-automated.
There are a few things that make automating this process really hard.

At the moment we have the following database/script running:
All article prices are in a database. The employees at the finance department read the bill and fill in the order number, total price, article numbers, and quantities.
The system returns its own information (the total price according to the system, and the article numbers with their quantities).
If the price on the bill is more than *% higher than the price in the system, they have to take a closer look.
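
The check described above could look roughly like this in Python. This is only a sketch: the function name `needs_review` and its parameters are made up for illustration, and the tolerance percentage is left as a parameter because the actual value isn't given here.

```python
from decimal import Decimal


def needs_review(bill_total, system_total, tolerance_percent):
    """Return True when the bill is more than tolerance_percent above the system total.

    All three arguments are Decimal values; tolerance_percent is whatever
    threshold the finance department uses (not specified in this thread).
    """
    if system_total <= 0:
        # Nothing sensible to compare against, so let a human look at it.
        return True
    allowed = system_total * (Decimal(1) + tolerance_percent / Decimal(100))
    return bill_total > allowed
```

Decimal is used instead of float so that amounts of money are compared exactly.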

But now we want the following:
We scan the bill, get the order number out of it (not that hard, I've done it before), get the information out of the system, and compare it with the bill.
Not that hard, but the problem is: how do I get the right information out of the PDF?

With the order number I did it like this:
PDF -> [OCR] -> TXT

I put the text document in a string, searched for the order number by looking for P1, P4 or P7, got the location, and took the next 7 characters. Comparing that with the database gave me a success rate of 75 - 100%
(OCR software isn't 100% accurate).
But I can't do this with a total price, article number or quantity, because the search term isn't really unique or special.
Let's say I need the quantity and the database says it is 5. I can't search the string for 5, because 5 will probably appear something like 20 times in the string.
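
For reference, the order-number search could be sketched like this in Python. Two things are assumptions on top of what's described above: that the 7 characters following the prefix are capital letters or digits, and that the known order numbers are available as a set (`known_orders` is a hypothetical stand-in for the database lookup).

```python
import re

# Assumption: an order number is P1, P4 or P7 followed by 7 more
# characters, taken here to be capital letters or digits.
ORDER_PATTERN = re.compile(r"P[147][A-Z0-9]{7}")


def find_order_candidates(ocr_text):
    """Return every substring of the OCR text that looks like an order number."""
    return ORDER_PATTERN.findall(ocr_text)


def match_order(ocr_text, known_orders):
    """Return the first candidate that exists in known_orders, or None.

    known_orders stands in for the database; checking against it filters
    out candidates that the OCR step misread.
    """
    for candidate in find_order_candidates(ocr_text):
        if candidate in known_orders:
            return candidate
    return None
```

This works because the prefix is distinctive. A bare quantity like 5 has no such anchor, which is exactly the problem with the other fields.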

So I was thinking: a total price, article number and things like that are almost always in the same position on a given supplier's bills.
So to extract the right information I just have to cut a field out of the PDF file, OCR it, and then check it against the database.
If the price, article number and quantity are okay, the bill is entirely correct, and the only thing the employees at the finance department have to do is pay.
If the program can't find the right information, a human has to look at what's wrong with it (like a 4000 euro bill where the system says 2000, so they have to call the supplier to ask why).
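
To make the idea concrete, here is a rough Python sketch of picking a field by position. It assumes the OCR step can report a bounding box for each recognised word, which is an assumption about the tool rather than something I have working. The `Word` and `Field` types, the coordinates and the function name are all hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One OCR'd word with its bounding box (page coordinates, any unit)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


class Field(NamedTuple):
    """Rectangle where a supplier usually prints a given value."""
    x0: float
    y0: float
    x1: float
    y1: float


def words_in_field(words, field):
    """Join the words whose centre point lies inside the field.

    Assumes the field covers a single line of text, so sorting the
    matching words from left to right gives the reading order.
    """
    inside = []
    for w in words:
        cx = (w.x0 + w.x1) / 2
        cy = (w.y0 + w.y1) / 2
        if field.x0 <= cx <= field.x1 and field.y0 <= cy <= field.y1:
            inside.append(w)
    inside.sort(key=lambda w: w.x0)
    return " ".join(w.text for w in inside)
```

Each supplier would then get its own set of rectangles (one for the total price, one for the quantity, and so on), and an empty or unparseable result would send the bill to a human.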


In short: does anyone know software that can extract information from a PDF at given coordinates?


Other ideas are also welcome, of course :D

Thanks in advance!

William

#2 cdg10620


Posted 04 January 2012 - 12:05 PM

Look at these items that I found doing a quick Google search:

Extract data from PDF coordinates in C# with PDF Extractor SDK | Software for home, business and developers
c# - Extract PDF text by coordinates - Stack Overflow
c# code to extract data from pdf file.

It seems that if you wanted to create a C# program that extracts the information, it would be entirely possible, and there are even .NET libraries already available (such as PDFsharp, from the PDFsharp & MigraDoc site) that allow you to process PDF information.
-CDG10620
Software Developer

#3 Dorgon


Posted 05 January 2012 - 04:09 AM

Thanks!

This isn't quite what I was looking for yet, but it helps me a lot!

William