Posted Aug. 2, 2017, 9:37 a.m.

The 3 types of scanned PDFs

Did you know there are actually 3 different types of scanned PDF? If you’re not careful, the differences between them can complicate the task of translation:

  • The simple scan – every page is just an image.
  • Searchable scans – each image has hidden text behind it.
  • Mixed – can include scanned images, hidden and real text all in the same PDF.

TransPDF will automatically run OCR on a PDF if it detects no real text – in other words, the simple scan (type 1 in the list above). But for types 2 and 3 it will sense the presence of text and skip the OCR phase. This can be a problem when you need to translate all the text in the PDF, since any text that exists only inside a scanned image will not be picked up.
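To make the three types and the OCR rule concrete, here is a small Python sketch. It is only a model of the behaviour described above, not TransPDF's actual code. The `PageSummary` record, its fields and both function names are hypothetical, and it assumes you already know, for each page, whether it holds a scan and how many hidden and visible words it carries.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    """Hypothetical per-page summary; not part of any real PDF library."""
    has_scanned_image: bool
    hidden_words: int    # invisible text layered behind a scan
    visible_words: int   # ordinary text drawn on the page


def classify(pages):
    """Return 1, 2 or 3 for the scan types listed above, or 0 if nothing is scanned."""
    scanned = [p for p in pages if p.has_scanned_image]
    if not scanned:
        return 0
    hidden = sum(p.hidden_words for p in pages)
    visible = sum(p.visible_words for p in pages)
    if hidden == 0 and visible == 0:
        return 1  # simple scan: images only
    if visible == 0 and all(p.hidden_words > 0 for p in scanned):
        return 2  # searchable scan: every image has hidden text behind it
    return 3      # mixed: anything else


def ocr_would_run(pages):
    """Model of the rule described above: OCR runs only if no text is found at all."""
    return sum(p.hidden_words + p.visible_words for p in pages) == 0
```

In this model only a type 1 document makes `ocr_would_run` return `True`, which is why types 2 and 3 need the extra step described next.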

Infix to the rescue

Using Infix PDF Editor you can convert all the real text into artwork. The PDF will look exactly the same but won’t contain any real text or fonts and will no longer be editable.

The upshot of this conversion is that TransPDF will detect the absence of editable text and process the PDF with OCR.

All you need to do is open the PDF using Infix and choose Text->Create Outlines… This will generate a new PDF suitable for uploading to TransPDF where it will be processed with OCR.

How do you tell which type is which?

A simple way to tell which kind of PDF you have is to do a word count. Using Infix PDF Editor, choose Document->Word Count… If there are no words, it’s a simple scanned PDF (type 1).

If your word count isn’t zero, then you need to investigate further. Export all the images in the PDF and look through them to decide whether OCR is warranted for the entire document.

To do this, choose File->Export->Pages As… and press the Format… button. In the “General” tab, disable text output and enable image output then press OK.

Once all your images have been output, select them all and open them together. This should allow you to view them as a slideshow on Windows or Macintosh.

If it looks like the images contain lots of text that needs translating, you’ll need to convert the PDF using the Create Outlines technique detailed above.
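The whole checklist fits in a few lines of Python. This is just a sketch of the decision steps in this post. The function name `next_step` and its two parameters are made up for illustration: `word_count` is the number Infix reports, and `images_contain_text` is your own judgement after looking through the exported images. The final branch, uploading unchanged when the images hold little text, is an assumption drawn from the advice above rather than something the post spells out.

```python
def next_step(word_count, images_contain_text=None):
    """Suggest what to do next with a scanned PDF.

    word_count: result of Document->Word Count in Infix.
    images_contain_text: True/False once you have reviewed the exported
        images, or None if you have not looked at them yet.
    """
    if word_count == 0:
        # Type 1: no text at all, so OCR is triggered automatically.
        return "upload as-is: simple scan, OCR will run"
    if images_contain_text is None:
        # Text is present, so the images have to be checked by eye.
        return "export the page images and review them"
    if images_contain_text:
        # Turn the existing text into artwork so OCR covers everything.
        return "use Text->Create Outlines, then upload the new PDF"
    # Assumption: images without much text do not justify OCR.
    return "upload as-is: existing text can be translated directly"
```

For example, `next_step(0)` recommends uploading straight away, while `next_step(250, True)` points you to Create Outlines first.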