OCR PDF file exports translation XML with extra tags

felipecasta · 11 December 2014 21:33

I'm performing OCR with Omnipage and exporting a PDF. I then open the PDF in Infix and export an XML file for translation. The problem is that many extra tags appear, specifically at points where text meets text box borders. There are no visible line breaks in the PDF, or anything else I can see, yet the tags are there.
How can I get rid of these tags?

martin · 12 December 2014 10:09

Hi,

A PDF document does not contain any carriage returns or paragraph marks. Our software uses heuristics to infer where carriage returns and paragraph marks should be in the PDF. If you feel that we are getting them “wrong”, can you email a copy of the PDF to support@iceni.com and we’ll have a look at it.
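
As a rough illustration of the kind of heuristic involved, here is a minimal Python sketch. It is not Infix's actual algorithm: the `Fragment` class, the `infer_breaks` function and the `para_gap` threshold are all hypothetical names made up for this example. It assumes each piece of text comes with a position and a font size, and it starts a new paragraph whenever the vertical gap to the previous fragment is large relative to that font size.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A positioned run of text (hypothetical structure for this sketch)."""
    text: str
    x: float       # left edge, in points
    y: float       # baseline, in points, measured from the top of the page
    height: float  # font size, in points


def infer_breaks(fragments, para_gap=1.5):
    """Group fragments into paragraphs using vertical gaps only.

    A new paragraph starts when the baseline moves down by more than
    para_gap times the previous fragment's font size (an assumed threshold).
    """
    paragraphs = []
    current = []
    prev = None
    # Read top to bottom, then left to right.
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if prev is not None and frag.y - prev.y > para_gap * prev.height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(frag.text)
        prev = frag
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    Fragment("Hello", 72, 100, 12),
    Fragment("world", 110, 100, 12),
    Fragment("again", 72, 114, 12),
    Fragment("New para", 72, 150, 12),
]
result = infer_breaks(sample)
print(result)
```

With OCR output, small errors in the detected positions or text box edges can push a gap over a threshold like this, which is one plausible way an unexpected break tag could end up in the export.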

Regards,

Martin.