
ocr - .xml-output



#1
Marion

Hi there!

I'm new to this forum and I hope someone here can help me.

I'm just starting my master's thesis, and my professor has given me a topic that I don't fully understand yet. I'm supposed to do layout recognition on OCR XML output. So I need an OCR system that outputs XML, then I need to parse that output, and finally I need to write a new XML file in which the text is sorted into title, headline and so on. The plan is to recognize dictionary pages, where I want to identify what is a lemma, what are the inflection notations, etc.

But right now I'm stuck on the first part. I can't find any OCR program that gives me plain .xml output (not Open XML or XSLT or anything like that). My professor gave me some XML files generated by ABBYY FineReader, but when I try to use FineReader on my own computer I can't figure out how he produced them. ABBYY's support wasn't much help either: they said they had never seen anything like it, even though I have .xml files that say they were generated by FineReader.

I'm looking for something like this, where I get the coordinates of every letter:



<?xml version="1.0" encoding="UTF-8"?>
<document version="1.0" producer="FineReader 7.0" xmlns=".../FineReader_xml/FineReader6-schema-v1.xml"
 xmlns:xsi="..w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="...FineReader_xml/FineReader6-schema-v1.xml ...://ww.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml" mainLanguage="OldGerman" languages="OldGerman,Russian">
<page width="1126" height="1418" resolution="300" originalCoords="true">
<block blockType="Text" l="192" t="140" r="242" b="170"><region><rect l="192" t="140" r="242" b="170"></rect></region>
<text>
<par>
<line baseline="165" l="208" t="145" r="238" b="165"><formatting lang="OldGerman" ff="Arial" fs="15." spacing="-49"><charParams l="208" t="148" r="233" b="165" suspicious="true">^</charParams><charParams l="233" t="145" r="238" b="151" suspicious="true">-</charParams></formatting></line></par>
</text>
</block>


...


Could anyone please help...

#2
dargueta

Apparently it's some export feature. Have you checked the menus for something like that? It might be a plugin that you have to download and/or enable.
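
For the parsing step, once you do have files like the one in post #1, here is a rough sketch of how the per-character coordinates could be pulled out with Python's standard library. It is only an illustration under a few assumptions: the sample below is a trimmed copy of the posted snippet, the namespace URI in it is a placeholder (the real one is elided in the post), and the names `local_name` and `extract_chars` are made up for this example. The code ignores the namespace entirely, so it should not matter which schema URI your files actually declare.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the snippet in post #1.
# ASSUMPTION: the namespace URI is a placeholder; the real one is elided in the post.
SAMPLE = """<document xmlns="urn:example:placeholder-namespace">
<page width="1126" height="1418" resolution="300">
<block blockType="Text" l="192" t="140" r="242" b="170">
<text><par>
<line baseline="165" l="208" t="145" r="238" b="165">
<formatting lang="OldGerman" ff="Arial" fs="15.">
<charParams l="208" t="148" r="233" b="165" suspicious="true">^</charParams>
<charParams l="233" t="145" r="238" b="151" suspicious="true">-</charParams>
</formatting>
</line>
</par></text>
</block>
</page>
</document>"""


def local_name(tag):
    """Drop a leading '{namespace}' so matching works for any namespace URI."""
    return tag.rsplit("}", 1)[-1]


def extract_chars(xml_text):
    """Return one dict per charParams element: character, bounding box, font size."""
    root = ET.fromstring(xml_text)
    chars = []
    for fmt in root.iter():
        if local_name(fmt.tag) != "formatting":
            continue
        # 'fs' is the font size attribute in the posted sample (e.g. "15.").
        font_size = float(fmt.get("fs", "0"))
        for cp in fmt:
            if local_name(cp.tag) != "charParams":
                continue
            # Box is (left, top, right, bottom) in page pixel coordinates.
            box = tuple(int(cp.get(side)) for side in ("l", "t", "r", "b"))
            chars.append({
                "char": cp.text or "",
                "box": box,
                "font_size": font_size,
                "suspicious": cp.get("suspicious") == "true",
            })
    return chars


if __name__ == "__main__":
    for entry in extract_chars(SAMPLE):
        print(entry)
```

From a list like this you could then group characters by line or block and use things like font size or position as hints for what is a title, a headline or a lemma. That classification part is up to your thesis; the sketch only covers reading the coordinates.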