Friday, April 9, 2010

Word group theory (2). Front-end processor's paradigmatic change.

When I write in Japanese, I usually use the ATOK front-end processor. ATOK's usability suits me so well that I still can't bring myself to switch to Google's Japanese front-end processor. Nevertheless, I often feel that Japanese front-end processors are now facing a time of paradigmatic change.

When I write the Japanese phrase '検索する' using ATOK, what is finally recorded in the document file is the set of Kanji and Hiragana character codes corresponding to '検索する'. The alphabetical keystrokes 'kennsakusuru' that I actually typed are not included in the file. In other words, only the result of the conversion, the Kanji (or Hiragana or Katakana) codes translated from the alphabetical characters, has been treated as important information so far.
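To make this concrete, here is a minimal Python sketch of what survives in the file. It assumes the document is saved as UTF-8, which is my assumption and not something ATOK dictates; the names typed, converted, stored and script_of are only illustrative.

```python
import unicodedata

typed = "kennsakusuru"   # keystrokes entered on the keyboard
converted = "検索する"    # text produced by the conversion

# Assumption: the document file stores the converted text as UTF-8.
stored = converted.encode("utf-8")

def script_of(ch):
    """Classify one character by its Unicode name."""
    name = unicodedata.name(ch)
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Kanji"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith("KATAKANA"):
        return "Katakana"
    return "other"

# One label per stored character.
scripts = [script_of(ch) for ch in converted]

# True only if the typed keystrokes appear somewhere in the stored bytes.
keystrokes_survive = typed.encode("utf-8") in stored

print(scripts)
print(len(stored), "bytes stored; keystrokes present:", keystrokes_survive)
```

The stored bytes describe two Kanji and two Hiragana characters, and nothing in them records how the phrase was typed.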

Conversely, in the search-engine era, the conversion process and its embedding into document files will become more important, because Japanese (and Chinese) sentences are not written with spaces between words.

Every word in English is separated by spaces. I guess the power of Google search is strongly supported by these spaces between words, that is, by the structure of the English language itself. Meanwhile, when analyzing Chinese (or Japanese) sentences, computer algorithms first have to tackle the word-separation problem.

Let me take an example from the Chinese web.

A sentence '那麼胡錦濤你面臨了利用百度訴江案這件詭異的事情' was found on a Chinese web page.

As I can't understand Chinese at all, I'm not able to separate the above sentence into words correctly, except for the well-known words '胡錦濤' (Hu Jintao) and '百度' (Baidu). A few other words like '利用' and '事情' seem to have the same meanings as in Japanese, but this is only a guess and not conclusive. Thanks to these words, I can at least guess the meaning of the sentence roughly: 'Hu Jintao said something about peculiar situation related to using Baidu search engine.' If the spaces are deleted, this sentence becomes 'HuJintaosaidsomethingaboutpeculiarsituationrelatedtousingBaidusearchengine.' This string contains unintended words like 'Ted', 'sing', 'use' and 'arch'.
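The effect of deleting the spaces can be checked with a short Python sketch. The four-word lexicon is a hypothetical stand-in for a real dictionary, and find_hidden_words is only an illustrative helper.

```python
sentence = ("Hu Jintao said something about peculiar situation "
            "related to using Baidu search engine.")

# With spaces, splitting on whitespace recovers the words directly.
tokens = sentence.split()

# Without spaces, the same split returns one long unsegmented token.
joined = sentence.replace(" ", "")
joined_tokens = joined.split()

# Hypothetical mini-lexicon; a real system would use a full dictionary.
lexicon = ["ted", "sing", "use", "arch"]

def find_hidden_words(text, words):
    """Return the lexicon words that occur as substrings of text."""
    lowered = text.lower()
    return sorted(w for w in words if w in lowered)

# Words that were never in the sentence but appear once spaces are removed.
original_words = {t.strip(".").lower() for t in tokens}
hidden = [w for w in find_hidden_words(joined, lexicon)
          if w not in original_words]

print(len(tokens), "tokens with spaces,", len(joined_tokens), "without")
print("unintended words:", hidden)
```

Note that 'use' does not belong to any single word of the sentence: it only appears where 'Baidu' and 'search' are joined, which is exactly the kind of false match a word-separation algorithm has to reject.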

If a Chinese front-end processor could embed the alphabetical keystrokes typed for 'Hu Jintao', 'Baidu' and other words in the document file, it might help search engines decide the correct word separation.
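As a rough sketch of this idea, suppose the document file kept each converted span together with the keystrokes that produced it. The Span record, the lowercase keystroke strings 'hujintao' and 'baidu', and the helper names below are all my assumptions for illustration and not an existing file format. Only the two words I can identify carry keystrokes; the rest of the sentence is left unsegmented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Span:
    text: str                         # converted characters as stored today
    keystrokes: Optional[str] = None  # hypothetical: what was typed, if recorded

# The example sentence, with keystrokes attached only to the two known words.
document = [
    Span("那麼"),
    Span("胡錦濤", "hujintao"),
    Span("你面臨了利用"),
    Span("百度", "baidu"),
    Span("訴江案這件詭異的事情"),
]

def plain_text(spans: List[Span]) -> str:
    """What an ordinary document file contains: the converted text only."""
    return "".join(s.text for s in spans)

def known_words(spans: List[Span]) -> List[Tuple[int, int, str, str]]:
    """Word boundaries a search engine could read off the embedded keystrokes."""
    found = []
    pos = 0
    for s in spans:
        end = pos + len(s.text)
        if s.keystrokes is not None:
            found.append((pos, end, s.text, s.keystrokes))
        pos = end
    return found

print(plain_text(document))
for start, end, text, keys in known_words(document):
    print(start, end, text, keys)
```

Each recorded span gives the search engine a word boundary for free, so the dictionary-based guessing is needed only for the spans that have no keystrokes.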

Of course, a corpus-based dictionary of the Chinese language may be the basic approach to solving this problem in the Baidu era. But on another front, a paradigmatic change in front-end processors may also become important in the Google era.

Although Google has decided to leave China, the possibility still remains of producing a new kind of Chinese front-end processor based on a quite different approach.