How to Prepare a PDF File for a CAT Tool in 4 Steps

Founder at OCR Craft
Over 15 Years of Experience in the Translation Industry

At OCR Craft, we specialize in preparing non-editable and scanned documents for CAT translation, including converting PDFs and other file formats into the editable .docx and .pptx formats. Our 1-to-1 PDF to DOC conversion service involves four important steps, which we are happy to share with you here.

Here is how we do it:

1. We use an automatic OCR program to convert the PDF file into an editable format. Then we carefully examine the resulting Word file for any misrecognized characters, including spelling and punctuation issues.
2. We remove any unnecessary breaks in the file. Before reformatting the converted file to look exactly like the original document, we delete any unnecessary breaks, including line, page, and section breaks. Deleting these excess breaks lets us ensure the best CAT segmentation results.
3. We recreate the original document layout. To recreate the exact layout of the source PDF document, we use the appropriate MS Word and PowerPoint functions to structure the document logically. Because the target text differs in length from the source, not all the translated text will fit inside the table cells, frames, or text boxes, and some of it may become partially hidden. With this in mind, we painstakingly clean the document of any extraneous formatting attributes and restructure its text flow to mirror the original format while keeping it consistent and logical. We ensure a smooth, break-free transition from page to page; apply variable height settings to table rows so that no text is hidden; generate an automatic table of contents; and use grouped text boxes so that images move together with their corresponding annotations.
4. We extract and format any legible text within stamps and seals. In addition to the main text, we also make editable any other details found in the document, such as seals, stamps, and labels. Even if the seals or stamps in the document do not contain any text, we still make sure to indicate their presence in the document.
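The break-removal step above can be illustrated on plain text. This is only a minimal sketch, not our production workflow: the function name `merge_broken_lines` and the rule "a line that does not end in sentence punctuation continues on the next line" are assumptions made for the example, and real documents (headings, lists, abbreviations) still need a human check.

```python
# Characters that normally end a sentence. Treating a line that ends with one
# of these as a complete paragraph is an assumption of this sketch.
SENTENCE_END = (".", "!", "?", ":", ";")


def merge_broken_lines(lines):
    """Join lines that a converter split in the middle of a sentence."""
    paragraphs = []
    buffer = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph collected so far.
            if buffer:
                paragraphs.append(buffer)
                buffer = ""
            continue
        # Glue the line onto any unfinished text from the previous line.
        buffer = f"{buffer} {line}" if buffer else line
        if buffer.endswith(SENTENCE_END):
            paragraphs.append(buffer)
            buffer = ""
    if buffer:
        # Keep trailing text even if it never reached sentence punctuation.
        paragraphs.append(buffer)
    return paragraphs


converted = [
    "The contract enters into",
    "force on 1 May.",
    "",
    "Either party may terminate it.",
]
print(merge_broken_lines(converted))
```

A heading without final punctuation would be merged into the following line by this rule, which is one reason the automated pass has to be followed by a manual review.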

These are some of the crucial steps in preparing PDF files for CAT translation, though by no means an exhaustive list of everything the service involves. OCR conversion must be complemented by human input. Although contemporary CAT tools have advanced significantly, they are still helpless (at least for now) when it comes to scanned or non-editable documents. Clients often turn to us for help, asking us to fix the issues they have encountered during the post-conversion or post-translation stage.

Preparing a PDF file for translation can be stressful, challenging, and time-consuming. If you are hoping that automatic PDF conversion will save you time on your project, I’m afraid you’re in for a disappointment. As it happens, correcting and restructuring someone else’s work can be harder and take longer than recreating it from scratch! If you are struggling, don’t hesitate to ask for help. Do not wait until it’s too late; let us help you deliver your project safely and on time.

Keep in mind that after you use an OCR tool to convert your PDF documents into Word files, the resulting copies may have numerous issues, including:

– Sentences broken in the middle:

Bad segmentation is one of the major risks associated with the lack of post-conversion preparation of translation files.

– Section breaks added on each page:

Another risk associated with insufficient preparation before translation is that the flow of the text may be broken, with a break inserted on each page. After you complete and generate the target translation, the file may have extraneous empty paragraphs added after each section break. Needless to say, your document should not have any extra section breaks or empty paragraphs, which most OCR programs rely on heavily to control the structural format of the document.
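If you want a rough idea of how many such breaks a converted file contains, the sketch below counts them in the main document part of a .docx file. It relies on two facts about the format: a .docx is a ZIP archive whose body lives in `word/document.xml`, and a section break inside the body is stored as a `w:sectPr` element within a paragraph's `w:pPr`. The function names `audit_document_xml` and `audit_docx` are hypothetical, and counting every paragraph without text as "empty" is a simplifying assumption.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def audit_document_xml(xml_data):
    """Count mid-document section breaks and paragraphs that hold no text."""
    root = ET.fromstring(xml_data)
    # A section break inside the body sits in the paragraph properties
    # (<w:pPr>) of the last paragraph of that section.
    section_breaks = len(root.findall(".//w:pPr/w:sectPr", NS))
    empty_paragraphs = 0
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if not text.strip():
            empty_paragraphs += 1
    return {"section_breaks": section_breaks, "empty_paragraphs": empty_paragraphs}


def audit_docx(path):
    """Run the same audit on a .docx file on disk."""
    with zipfile.ZipFile(path) as archive:
        return audit_document_xml(archive.read("word/document.xml"))
```

A paragraph that contains only an image also counts as empty here, so treat the numbers as a prompt for manual inspection rather than a verdict.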

– Structural issues in post-conversion tables:

Post-conversion tables are often reproduced only superficially: the indented text may be aligned with tab characters, spaces, text boxes, or text columns where a simple table would have sufficed. Moreover, the slightest change in the number of characters after translation may cause the structure of the document to fall apart. The length and structure of the target text may differ significantly from the source, so if you don’t know the target language, you may not be able to adjust the formatting.
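One quick way to spot such pseudo-tables is to look for paragraphs whose text is held in place by tabs or long runs of spaces. The sketch below assumes the paragraphs are already available as plain strings; the name `find_fake_table_rows` and the threshold of three consecutive spaces are illustrative choices, not a rule taken from any particular tool.

```python
import re

# A tab, or three or more consecutive spaces, inside the text of a paragraph.
# The three-space threshold is an assumption chosen for this example.
ALIGNMENT_PATTERN = re.compile(r"\t| {3,}")


def find_fake_table_rows(paragraphs):
    """Return the indexes of paragraphs that look aligned by tabs or spaces."""
    flagged = []
    for index, text in enumerate(paragraphs):
        # Strip first so that plain indentation at the start is not flagged.
        if ALIGNMENT_PATTERN.search(text.strip()):
            flagged.append(index)
    return flagged


rows = ["Item\tQty\tPrice", "A normal sentence.", "Total      42"]
print(find_fake_table_rows(rows))
```

Flagged paragraphs are candidates for conversion into a real table, where cell widths rather than character counts hold the layout together.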

– Incomplete formatting of the document:

Due to expansion during translation, some text may no longer fit inside structural elements of the document and may become partially hidden. The CAT tool may not recognize the missing text. What’s more, PDF converters are notoriously bad at recognizing and extracting image annotations.

– Incorrectly recognized or missing text:

Even with spell check, automatic PDF converters might not recognize the entire text correctly: punctuation, letters, even whole words might be missing.

– Erroneous and inconsistent text segmentation by the CAT tool:

During conversion, OCR programs tend to insert many section breaks, some of which may break sentences in the middle. A section break operates like a hard return, which is one of the delimiters CAT tools use to define segments. A text with broken sentences will be cut apart or run together in ways that may be grammatically illogical and either incomprehensible or, worse, misleading and therefore prone to mistranslation.
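The effect of a stray hard return on segmentation can be shown with a deliberately simplified segmenter. Real CAT tools use configurable segmentation rules, so the `segment` function below is only an assumption-laden stand-in: it splits on hard returns first and then after a period, exclamation mark, or question mark followed by whitespace.

```python
import re


def segment(text):
    """Split text into segments: hard returns first, then sentence ends."""
    segments = []
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Split at whitespace that follows '.', '!' or '?'.
        segments.extend(re.split(r"(?<=[.!?])\s+", paragraph))
    return segments


clean = "The supplier shall deliver the goods within 30 days. Late delivery incurs a penalty."
broken = "The supplier shall deliver the goods\nwithin 30 days. Late delivery incurs a penalty."

print(len(segment(clean)))   # 2 segments, one per sentence
print(segment(broken)[:2])   # the first sentence arrives as two fragments
```

With the break in place, the translator receives "The supplier shall deliver the goods" and "within 30 days." as separate segments, and neither can be translated reliably without the other.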

In other words, using an automatic PDF converter may jeopardize your project commitments and your ability to ensure professional and timely project delivery.

Should you need any assistance with CAT file preparation or DTP, we’d be more than happy to share our expertise and experience with you, just as we do with over 150 LSPs in Europe and worldwide.