Sunday, March 21, 2010

Translating Thai: Some Experiences of Digitisation

Is it possible to produce a reasonably accurate translation from Thai into English with only a basic knowledge of the language and the aid of electronic tools? I’m not going to make great claims as my experiences are from home-grown experimentation over a few months. However, having recently completed a few translations, I think there are promising signs. At least I’m quite satisfied with a translation of my mother's article concerning Buddhism in Hampshire in the '60s, which runs to about 2,000 words. So there may be some pointers that others find helpful.

Setting this post in the context of biographical research, I’ll first describe some broad considerations and then discuss digitisation (scanning and optical character recognition). One tip I’d offer is that there needs to be attention to detail – rough and ready methods won’t yield very much that's of value. Certainly, there’s been more involved than I anticipated!

I’ll start with a list of very basic questions - as much for my own benefit as anyone else’s :-)

  • What are you trying to learn? Why is it significant? Even when carrying out research entirely in one’s native language, time often forces choices with regard to the materials that you examine closely. If they are in a foreign language, then that imposes further constraints.
  • Is there anyone who can help? It may be that you can effectively form a team.
  • Of the materials available, which ones are going to shed most light in key areas?
  • Among these materials, which ones are amenable to analysis? Are they easy to access physically? Are they printed or hand-written?

All these points apply to any language, but then each language has further characteristics that can make the situation more or less difficult.

With regard to Thai, its alphabet (44 consonants and 28 vowel forms) is much more elaborate than the Roman alphabet, particularly in its use of diacritics. Even Thais will tell you that looking up words in a dictionary can be quite a chore. Yet, if the letters are clearly formed, then actually reading it is not so hard because it’s generally phonetic. As someone with a limited vocabulary, needing to look up many words, I soon decided that it’d be much more convenient to have an accurate transcription in electronic form so that I could use software-based dictionaries.

A note on reading handwriting

So what about Thai handwriting?! In the Thai education system, primary school children learn to write by copying individual Thai printed letters – I’ve seen one of my cousins do this repeatedly when she was 5 years old. When they leave primary school they then learn cursive script and that stage can mark a huge departure. It’s a similar approach to the one I learnt for English, but I don’t know whether children develop their own style or are guided to adopt one of a number of standard styles. I’ve shown relatives and friends sets of photos with Thai writing on the back – quite often there is a struggle to read what’s written, so it appears to be no easier than English. It’s a daunting prospect, but assuming that the writing is consistent, it becomes a question of recognising patterns, and perhaps understanding its topology will help. So for a given author, it may suffice for someone to translate a sample for me and I can try to figure out the rest.

Anyway, at the moment I can’t read much beyond the printed word, which means I have to ask others to copy type what I can’t read. For general documents concerning work, that’s quite feasible: at least for someone in Europe, the costs of getting this done in Thailand are affordable. However, a biography containing personal items (which are often of greater interest) requires more care – until their contents are known they should be read only by people you can trust.

So in the remainder of what I share here I’ll confine my attention to printed documents as I indicate a methodology I’m adopting for their translation.

Copy type or scan for OCR?

Technology-assisted translations often start with flatbed scanners that can convert the physical page into an image that then gets ‘read’ using optical character recognition software (OCR). In theory, since print forms letters uniformly, software can accurately interpret them. In practice, results are imperfect for most kinds of sources and the process can take longer than expected. It may be better simply to copy type.

So when should OCR be used? Whatever language you are trying to read, the utility depends upon the nature and condition of the original document – if it is a fragile pocket volume with hundreds of faded pages with tiny letters in an obscure font, then even if you manage to safely scan the page, you may find OCR yields very poor results.

However, this kind of discussion assumes that there actually is some decent software for any language, when in fact for languages that don’t use Roman script, the situation seems to be very varied...

Available OCR options for Thai (very few!)

For Thai the available options have been very few. On asking a few Thai friends, I drew only blanks and when I carried out a quick investigation it seemed that until only a few years ago, the options were not far out of the university laboratory and didn’t look very amenable. An example is NEC-0006 อ่านไทย เวอร์ชัน 2.5 (“Read Thai” version 2.5, OCR), which is inexpensive, but it doesn’t get very good experience reports from a Thai OCR discussion thread.

The larger well-established commercial products such as Omnipage and Abbyy seemed for a long time to have ignored Thai until a couple of years ago when additional language support for Abbyy FineReader Pro was introduced for Thai in version 9. Trusting the claims of accuracy I took the plunge and bought a copy - quite an investment, even with an educational discount.

I’m glad I did as the results are generally good, although its accuracy is inferior to that for languages based on Roman script. For someone like me who types Thai very slowly it is a useful start, but unless the lettering in the documents is very clear so that the accuracy is close to 100%, its utility will fall away for anyone who can type reasonably quickly and accurately.
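To make that trade-off a little more concrete, here is a rough back-of-envelope sketch. It is only a toy model under my own assumptions: the function names and the example figures (typing speeds, seconds per correction) are hypothetical, not measurements, and it ignores the time spent proofreading to find the errors in the first place.

```python
def typing_minutes(n_chars, chars_per_minute):
    """Time to copy type the whole text from scratch."""
    return n_chars / chars_per_minute


def ocr_minutes(n_chars, accuracy, seconds_per_fix, scan_minutes=0.0):
    """Time to scan plus time to correct every misrecognised character.

    accuracy is the fraction of characters the OCR gets right (0.0 to 1.0).
    """
    errors = n_chars * (1.0 - accuracy)
    return scan_minutes + errors * seconds_per_fix / 60.0


def ocr_is_worthwhile(n_chars, accuracy, seconds_per_fix, chars_per_minute,
                      scan_minutes=0.0):
    """True if scanning and correcting is quicker than copy typing."""
    return (ocr_minutes(n_chars, accuracy, seconds_per_fix, scan_minutes)
            < typing_minutes(n_chars, chars_per_minute))


# Illustrative figures only: 2,000 characters, 95% accuracy, 10 s per fix.
print(ocr_is_worthwhile(2000, 0.95, 10, chars_per_minute=20))   # slow typist
print(ocr_is_worthwhile(2000, 0.95, 10, chars_per_minute=200))  # fast typist
```

With these made-up figures the slow typist comes out ahead with OCR and the fast typist does not, which matches my experience in spirit if not in exact numbers.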

(In case you are wondering, there have been efforts to recognise handwriting, but it’s a much harder task – I was interested to note, though, that a fairly recent paper, Maximization of Mutual Information for Offline Thai Handwriting Recognition, in IEEE Transactions on Pattern Analysis and Machine Intelligence, makes use of a toolkit that is primarily used for speech recognition research. It prompts the question of the relationship of Thai speech to writing. From my very rudimentary knowledge of Thai linguistics I gather that it has roots in Sanskrit, where the letters of the alphabet are placed according to where in the throat/mouth/lips they are formed. Thai reflects this ordering quite substantially, though not completely.)

Undertaking the OCR

I think getting the best results is an art, and it’s worth persevering to make improvements. For all but a few cases with one or two small documents, the whole scanning workflow ought to be considered, as a successful process requires a good rhythm. Washington State Library has a useful checklist and there are some good tips on the OCR process provided by About.com. These cover physical aspects including the selection of the scanner itself, keeping it clean, the placement of the source document, the scan settings (resolution, colour contrast, expected language(s)), and how the scanned image is divided up for the actual process of recognition.

One particular aspect that many software packages provide is training. For text recognition this is basically the process of chopping up the scanned image into a sequence of glyphs (character elements) and assigning glyphs to character names – see e.g. Wikipedia for a detailed entry. As you feed in multiple samples and specify the assignments, it learns how particular characters should be interpreted. There’s a training tutorial for a software library called Gamera, which I found very helpful in explaining the concepts.
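To give a flavour of what ‘training’ means, here is a toy sketch of my own. It is not how FineReader, Gamera or any other package mentioned here actually works: the class name, the tiny 3×3 bitmaps and the labels are all made up for illustration. Each labelled glyph is stored, and an unknown glyph is given the label of the stored example it most resembles (a simple nearest-neighbour match).

```python
def pixel_differences(a, b):
    """Count the positions where two equal-sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


class ToyGlyphClassifier:
    """Stores labelled glyph bitmaps and labels new ones by closest match."""

    def __init__(self):
        self.examples = []  # list of (bitmap, label) pairs

    def train(self, bitmap, label):
        """Record that this bitmap should be read as this label."""
        self.examples.append((bitmap, label))

    def classify(self, bitmap):
        """Return the label of the stored example with the fewest differing pixels."""
        if not self.examples:
            raise ValueError("classifier has not been trained yet")
        closest = min(self.examples,
                      key=lambda ex: pixel_differences(ex[0], bitmap))
        return closest[1]


# Hypothetical 3x3 glyphs, flattened row by row ('#' = ink, '.' = paper).
BAR = ".#." ".#." ".#."
CORNER = "#.." "#.." "###"

classifier = ToyGlyphClassifier()
classifier.train(BAR, "bar")
classifier.train(CORNER, "corner")

# A faded bar with one pixel of ink missing is still closest to the bar.
faded_bar = ".#." ".#." "..."
print(classifier.classify(faded_bar))  # prints "bar"
```

The more samples you feed in for each character, the more variation in print quality such a matcher can absorb, which is the essential point of training.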

I’ve not yet used training, probably because I’ve been a bit too lazy to make the effort to learn how to make it learn!

FineReader’s Thai OCR Performance and Correcting the OCR

Here’s a sample of FineReader's output.

Thai OCR in Abbyy FineReader Pro 9

As you can see, it’s a long way from perfection! Here it obviously doesn’t handle the English. I actually set it to interpret everything as Thai – although I could have included English as an additional language, it seems to have a net effect of adversely affecting the Thai rendering, so since English is easy for me to recognise and type, I prefer to let it get that part wrong.

A Thai person might well be dissatisfied with the results, but overall I was quite happy given my very slow Thai typing speed. There were one or two characters that FineReader seemed to really struggle with, but correction was not difficult as the suggested match was often a character used here and not elsewhere – so I could do a ‘search and replace.’ More challenging was the handling of the small diacritical marks – in Thai they are all glyphs since they each contribute towards meaning, either as vowel sounds or tone marks. Instances where there are two such marks on a single letter are common and FineReader often struggled to pick out -่ ไม้เอก (mai ek) – it looks like a hyphen, but its placement varies a lot. If you look at the screenshot carefully, you can see that FineReader simply omits quite a few of these, perhaps because the original source document was not clear enough.
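The ‘search and replace’ idea can be sketched in a few lines. This is only an illustration under my own assumptions: the function names are hypothetical, and the sample is invented (it is not real FineReader output) with a Latin “n” standing in for a misread ท. The first function lists characters that hardly ever occur, which are good candidates for systematic misreadings; the second applies a table of corrections.

```python
from collections import Counter


def rare_characters(text, max_count=2):
    """Return non-space characters occurring at most max_count times, rarest first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    rare = [ch for ch, n in counts.items() if n <= max_count]
    return sorted(rare, key=lambda ch: (counts[ch], ch))


def apply_corrections(text, corrections):
    """Replace each wrongly recognised string with its correction, in dict order."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


# Invented example: "n" appears where the letter ท should be.
sample = "การnำงาน และ การnำบุญ"
print(rare_characters(sample))
fixed = apply_corrections(sample, {"n": "ท"})
print(fixed)
```

On such a short sample nearly every character counts as rare, so the list is only useful on a full page or more. Replacements are applied one after another, so it is safest to check that no correction produces text that a later correction would then change.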

Even if you train an OCR package, there will still be imperfections, so the output needs to be corrected. This process is tedious, but helpful – not least in learning to read! It helps you to familiarise yourself with the alphabet and especially pay attention to the way letters are formed.
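One small aid while correcting: in electronic text the vowel and tone marks written above or below a letter are stored as separate characters following that letter, so a program can group each letter with its marks and flag marks that have landed on something other than a Thai consonant. The sketch below is my own and the function names are hypothetical. It relies on Python’s standard unicodedata module, and it can only catch stray marks, not marks that the OCR has left out altogether.

```python
import unicodedata

# Thai consonant code points ก (U+0E01) to ฮ (U+0E2E); the range also
# includes the two letters ฤ and ฦ, which is good enough for this check.
THAI_CONSONANTS = range(0x0E01, 0x0E2F)


def is_mark(ch):
    """True for non-spacing marks, e.g. Thai above/below vowels and tone marks."""
    return unicodedata.category(ch) == "Mn"


def clusters(text):
    """Split text into (base, marks) pairs: each character with the marks after it."""
    result = []
    for ch in text:
        if is_mark(ch) and result:
            base, marks = result[-1]
            result[-1] = (base, marks + ch)
        else:
            result.append((ch, ""))
    return result


def suspicious_clusters(text):
    """Return clusters whose marks do not sit on a Thai consonant."""
    return [(base, marks) for base, marks in clusters(text)
            if is_mark(base) or (marks and ord(base) not in THAI_CONSONANTS)]


# ที่ is stored as three characters: ท, then sara ii, then mai ek.
word = "\u0e17\u0e35\u0e48"
print(len(word))                 # prints 3
print(len(clusters(word)))       # prints 1
print(suspicious_clusters("a\u0e48"))  # a tone mark stranded on a Latin letter
```

Listing the clusters that carry two marks is also a handy way of finding the places in the scan that deserve a closer look.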

If you have a large screen, particularly with widescreen dimensions, then it’s probably easiest to use the scanned image, set the zoom as needed, and place it next to the OCR’ed version that you’re editing.

Conclusion

Although a quick and perfect system is far away, a few OCR options are emerging that I find helpful in digitising printed Thai texts. Alternative suggestions are very welcome – I’m keen to improve what I’m doing, even though it’s already been quite an effort and I haven’t yet started talking about the translation itself...!
