I'm new to this forum and I hope someone here can help me.
I'm just starting my master's thesis and my professor has given me a topic that I don't quite understand. I am supposed to do layout recognition on OCR XML output. So I need an OCR system that produces XML, then I need to parse that XML, and then I need to write a new XML file in which the text is sorted into title, headline and so on. My plan is to recognize dictionary pages, where I want to identify what is a lemma, what are the inflection notations, etc.
But right now I'm stuck on the first part. I can't find any OCR software that gives me plain .xml output (not Open XML or XSLT or something like that). My professor gave me some XML files generated by ABBYY FineReader, but now that I'm trying to use FineReader on my own computer I can't figure out how he did it. ABBYY's support wasn't much help either: they said they had never seen anything like it, although I have .xml files that say they were generated by FineReader...
I'm looking for something like this where I can get all of the coordinates for the letters:
<?xml version="1.0" encoding="UTF-8"?>
<document version="1.0" producer="FineReader 7.0" xmlns=".../FineReader_xml/FineReader6-schema-v1.xml" xmlns:xsi="..w3.org/2001/XMLSchema-instance" xsi:schemaLocation="...FineReader_xml/FineReader6-schema-v1.xml ...://ww.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml" mainLanguage="OldGerman" languages="OldGerman,Russian">
  <page width="1126" height="1418" resolution="300" originalCoords="true">
    <block blockType="Text" l="192" t="140" r="242" b="170">
      <region><rect l="192" t="140" r="242" b="170"></rect></region>
      <text>
        <par>
          <line baseline="165" l="208" t="145" r="238" b="165">
            <formatting lang="OldGerman" ff="Arial" fs="15." spacing="-49">
              <charParams l="208" t="148" r="233" b="165" suspicious="true">^</charParams>
              <charParams l="233" t="145" r="238" b="151" suspicious="true">-</charParams>
            </formatting>
          </line>
        </par>
      </text>
    </block>
    ...
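To show what I mean by getting the coordinates of the letters, here is a minimal Python sketch of how a file like the one above could be read once I have it. It is only an illustration under some assumptions: the namespace URI in my sample is cut off, so the code ignores namespaces and matches on the local tag name, and `urn:example:placeholder` in the test data is a made-up stand-in, not the real FineReader schema URI. The helper names `local_name` and `iter_char_boxes` are my own.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag name without a leading '{namespace}' part."""
    return tag.rsplit("}", 1)[-1]


def iter_char_boxes(root):
    """Yield (character, (l, t, r, b), suspicious) for every charParams element."""
    for elem in root.iter():
        if local_name(elem.tag) != "charParams":
            continue
        # l/t/r/b are the left, top, right and bottom pixel coordinates.
        box = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
        yield (elem.text or ""), box, elem.get("suspicious") == "true"


# Cut-down copy of the sample; the xmlns value is a placeholder (assumption).
SAMPLE = """<document xmlns="urn:example:placeholder">
  <page width="1126" height="1418" resolution="300">
    <block blockType="Text" l="192" t="140" r="242" b="170">
      <text><par>
        <line baseline="165" l="208" t="145" r="238" b="165">
          <formatting lang="OldGerman" ff="Arial" fs="15.">
            <charParams l="208" t="148" r="233" b="165" suspicious="true">^</charParams>
            <charParams l="233" t="145" r="238" b="151" suspicious="true">-</charParams>
          </formatting>
        </line>
      </par></text>
    </block>
  </page>
</document>"""

chars = list(iter_char_boxes(ET.fromstring(SAMPLE)))
for ch, box, suspicious in chars:
    print(ch, box, suspicious)
```

For a real file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`. The same loop could be extended to `line` and `block` elements, since they carry the same `l`, `t`, `r`, `b` attributes.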
Could anyone please help...

