PDF Conversion Drama: To OCR or Not to OCR (Act 1)

Founder at OCR Craft
Over 15 Years of Experience in the Translation Industry

When you get a large, urgent project from your client, every minute counts. So, when your client asks you to translate a PDF document, the fastest and easiest method is converting the PDF to MS Word so a CAT tool can process it. Today’s CAT tools have advanced significantly, serving as smart storage and quick providers of the growing translation memory. However, your CAT might struggle with non-editable text.

Autoconverting PDFs to MS Word is justified for simple documents, but even then, the similarity between the converted Word and the original can be very misleading and often causes issues in later stages of working with the document.

Why is that so, you may ask? Well, if a perfect tool does not exist (not yet, that is), how can one hope for a perfect result? Even though, nowadays, automatic conversion tools are able to produce seemingly accurate copies of original documents, do not be fooled! The resemblance is only superficial. Once you choose to display all the formatting marks within the document, you will see how fragile, overly complicated, and maddeningly illogical the structure underneath tends to be. You will be shocked to find such a large number of formatting attributes applied that dealing with them manually may take ages.

But wait, why should you even care about what lies beneath a perfectly looking replica? Just “feed” it to your CAT, right? Unfortunately, no. If you put in garbage, you get out garbage.

There are several very important reasons for that:

Reason #1: Incorrect Segmentation

As you know, a CAT, just like us, can only process any given text in manageable chunks, which are, in most cases, segments. Automatic PDF converters, unfortunately, often cause breaks in sections, pages, paragraphs, and in the middle of sentences and table columns. For a CAT, this, of course, complicates the processing of segments. With broken segments, a CAT may not be able to retrieve any matches from your Translation Memory, which will, in turn, become polluted with unusable sentence fragments.

Not only can a PDF converter cause fragmented segmentation, but it can also affect the arrangement of the fragmented segments. In short, without double-checking the converted source file, you may end up translating an inaccurate source material, albeit unwittingly so.

Reason #2: Excessive Formatting Tags

As I mentioned before, automatic conversion software tends to use an excessive number of formatting attributes. These minor formatting changes are often barely visible to the naked eye—but not to your very sensitive CAT tool.

CATs internally convert into tags every formatting attribute found in a document. Tags, or labels, function as the CAT’s personal “sticky notes” to remind the software later how to “format back” the target file. So, when translating a tag-ridden file, not only will you have trouble staying focused due to numerous distracting tags, you will also need to transfer each of these tags into the target file and then double-check for any errors (which is very time-consuming!). What’s worse, however, is that you may risk missing a tag or two.

If you happen to miss any tag, here are some potential consequences:

Some text in the target document, after you’ve exported it from the CAT, may remain partially invisible; it can be a footnote (and its description), for instance, or something else.
The original text flow may be disrupted: for instance, in the text styles or colors.
You may fail to generate your target file, since CATs won’t export a file with missing tags.

And finally, if any repeated segments in the source file are formatted differently and, therefore, have dissimilar tags, your CAT application will not recognize these segments as repetitions.

As you can see, failure at this stage to professionally preprocess translation documents before “feeding” them to your CAT tool may complicate the project workflow at later stages. Document optimization for CAT use and logical arrangement of the text will significantly reduce any risks associated with bad segmentation and excessive or missing tags, which will also help you with your post-translation workload.

Should you need any assistance with PDFs in any of your urgent projects, we’d be happy to help you ensure professional and timely delivery.

Reason #1: Incorrect Segmentation

Reason #2: Excessive Formatting Tags

Contact us

Request a Quote