Founder at OCR Craft
Over 15 Years of Experience in the Translation Industry

Normally, a Translation Memory (TM) is created when translators start working on a client’s project document. Alternatively, a TM can be compiled from previous translation projects by aligning the old source and target files with a CAT alignment tool.

Creating a TM is a simple task if the source and target files share the same editable format. However, if your source file is a PDF document, you may run into some difficulties:

1. A poor PDF conversion may recreate the content of a project document with errors, which causes bad segmentation (see the example below). Too many broken or misaligned segments jeopardize the accuracy of the TM and risk contaminating it. Unfortunately, a misaligned or broken segment can only be fixed manually, which takes time.

2. Poor conversion can also cause some of the PDF content to be lost, as illustrated by the screenshot below.

3. To recreate the exact layout of the source PDF, an automatic PDF converter often inserts an excessive number of section breaks to control the flow of the text. A section break, like a hard return, is one of the markers, or delimiters, that CAT tools use to decide where to divide the text into sentence segments. Any text with an excessive number of breaks is therefore fragmented poorly and illogically.

Not only is such text a pain to translate, it is also very detrimental to the TM: when the misaligned segments are “fed” back into the CAT tool, the TM is contaminated with unusable sentence fragments and unnecessary tags. This can lead to all sorts of issues, including excessive spaces, empty paragraphs, and stray line breaks, as the example below illustrates.
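To make the fragmentation effect concrete, here is a minimal Python sketch. It assumes a deliberately simplified segmenter that splits at every hard break and after sentence-final punctuation. Real CAT tools apply their own configurable segmentation rules, so the function name `naive_segments` and the sample sentences are purely illustrative.

```python
import re


def naive_segments(text: str) -> list[str]:
    """Split text roughly the way a simplified segmenter might:
    first at every hard break, then after sentence-final punctuation.
    (Hypothetical helper, not the behavior of any specific CAT tool.)"""
    segments = []
    for block in text.split("\n"):
        # Split after ., ! or ? when followed by whitespace.
        for piece in re.split(r"(?<=[.!?])\s+", block.strip()):
            if piece:
                segments.append(piece)
    return segments


# The same two sentences, once as clean text and once with the
# layout-driven breaks a PDF converter might insert.
clean = "The pump must be switched off before servicing. Wear gloves."
converted = "The pump must be\nswitched off before\nservicing. Wear gloves."

print(naive_segments(clean))      # two complete sentences
print(naive_segments(converted))  # four pieces, three of them fragments
```

In this toy case, two extra breaks are enough to turn one reusable sentence into three fragments that are unlikely to match anything in a TM.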

The only solution here is to optimize the post-OCR source document in order to improve its formatting and boost your TM leverage. A properly prepared post-OCR file is better formatted, with excessive tags removed and no mid-sentence breaks left, which increases TM match rates. In the end, this improves the quality of the TM and the results you get from it in the future. In the long term, a more responsible approach to formatting will help increase the value of your translation memories.
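One part of this preparation (removing mid-sentence breaks, empty paragraphs, and repeated spaces) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a replacement for proper post-OCR editing: the function name `merge_mid_sentence_breaks` and the rule "a line that does not end in sentence-final punctuation continues on the next line" are hypothetical simplifications.

```python
import re

# Assumption: a line ending in one of these characters is complete.
SENTENCE_END = (".", "!", "?", ":", ";")


def merge_mid_sentence_breaks(text: str) -> str:
    """Join lines that were broken mid-sentence, drop empty lines,
    and collapse runs of whitespace into single spaces."""
    merged: list[str] = []
    for raw_line in text.splitlines():
        line = re.sub(r"\s+", " ", raw_line).strip()
        if not line:
            continue  # drop empty paragraphs
        if merged and not merged[-1].endswith(SENTENCE_END):
            # Previous line stopped mid-sentence: continue it.
            merged[-1] += " " + line
        else:
            merged.append(line)
    return "\n".join(merged)


ocr_output = "The pump must be\n\nswitched  off before\nservicing.\nWear   gloves."
print(merge_mid_sentence_breaks(ocr_output))
```

A rule this simple would also merge headings and list items that lack final punctuation into the following line, and it does not rejoin words hyphenated across lines, so its output still needs a human check before the file goes into a CAT tool.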

Please remember that carefully and systematically managed TMs help both your clients and your vendors gain confidence in your translation resources and in your consistent use of DTP and formatting best practices. It is a win-win for everybody.