The quality of the Unihan database, while good overall, degrades along with the popularity of the languages covered. Chinese (Mandarin and Cantonese) is doing okay, Japanese isn’t too bad, whereas Korean and Viêtnamese leave a lot to be desired. So I decided to lend a helping hand and see if I could plug a few holes.
Step 1: What holes are there to fill?
The first step was to identify what’s missing. It’s all good and well to say that Korean isn’t well covered by the Unihan database, but actual facts would be better. Over the last few years (10?), I have done terrible things to the Unihan in my own little backyard. I have it today indexed more or less to my liking as an sqlite database. The tables are (as of last week, who knows what I’ll add):
Don’t mind the initial k- in the table names, it’s how I prefix Constants in my favorite languages, and the habit carried over to sqlite tables. Which is convenient, since Unicode does the same to the field names in Unihan… It could even be that this k- prefix habit was acquired from too much time reading Unihan docs… People familiar with the Unihan file will sneer at the kHakka table. Si señor, I know that Unihan doesn’t cover Hakka, dammit! I had to fetch data from Dr Lau, and had to first build a Hakka input method (劉拼法) based on Dr Lau’s work, for my Macs. From that, indexing Hakka readings into my Unihan sqlite database wasn’t exactly a hardship.
Likewise, building a jyutping 粵拼 input system for Mac OS X from the Unihan wouldn’t be so hard, but I only reinvent the wheel when it’s really necessary. And a dude called Dominic Yu produced an input plugin back in the days. There you go, complete with instructions. For the curious here’s what my input plugins panel looks like:
So, from this Unihan sqlite database, how to determine what’s missing for Korean? Easy. The gist of it is a simple SQL query:
select distinct codepoint from '+tbl+' where '+tbl+'.codepoint not in (select codepoint from kKorean);
where tbl is each of the tables in turn (except kKorean, of course). So I wrote a Python script that iterates over these tables and takes care of the duplicates. This yielded close to 18,000 characters without a Korean reading. That’s quite a lot…
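To make the idea concrete, here is a minimal sketch of such a script, not the actual one. It assumes every table has an integer codepoint column, and the function name missing_korean is made up for illustration. A set does the deduplication across tables.

```python
import sqlite3


def missing_korean(conn, target="kKorean"):
    """Return sorted codepoints present in any other table but absent from `target`.

    Assumption: every table has a `codepoint` column (integers here).
    """
    cur = conn.cursor()
    # sqlite_master is SQLite's built-in schema catalog.
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = [name for (name,) in cur.fetchall() if name != target]

    missing = set()  # a set removes duplicates found in several tables
    for tbl in tables:
        # Table names cannot be bound as parameters, so they are interpolated.
        cur.execute(
            f'SELECT DISTINCT codepoint FROM "{tbl}" '
            f'WHERE codepoint NOT IN (SELECT codepoint FROM "{target}")'
        )
        missing.update(cp for (cp,) in cur.fetchall())
    return sorted(missing)
```

Called as missing_korean(sqlite3.connect("unihan.db")), where the file name is again hypothetical, it returns the list of codepoints to look up in step 2.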
Step 2: Let’s Grab Some Data
Next I had to find a reliable online source to fill in the gaps. I know exactly where to find info on all these missing sinograms, and more, in the dead-tree world (I used to own a copy of the 大漢韓辭典 which has 56,000 chars, give or take). But that wouldn’t be exactly practical… The best source I have found so far is Zonmal, which despite its third-world 20th century, webmaster-as-an-anally-retentive-dictator interface and ugly name, has quite a bit of information. After a little poking around, the local Adolf having tried hard to hide things from people like me – he who should be happy that some people are actually interested – I found out where to POST my queries, and how to find the results if any.
Since I didn’t want to hammer this site – the idea being to retrieve the data, not take it down, this affair being an .aspx thingy hosted on IIS – I had to be gentle. Also, the whole thing being encoded in EUC-KR, grrr, I needed to do on-the-fly conversions. For these reasons, I went back to my favorite language, REAL Basic, which is much better equipped than Python for the task. I set a timer at 8 seconds, and for the next 38 hours or so, my trusty MBP pinged that web site one request at a time, gently extracting the information I needed. Tonight I finally saw the result: 8,346 characters with a match, and readings filled out. That’s nearly half of the missing characters. Not so bad.
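The harvesting itself was done in REAL Basic, but the same throttle-and-decode loop can be sketched in Python for comparison. This is only an illustration under assumptions: harvest and its post argument are made-up names, and post stands for whatever function sends one query and returns the raw response bytes. The actual endpoint and form fields are deliberately left out.

```python
import time


def harvest(chars, post, delay=8.0, sleep=time.sleep):
    """Query one character at a time, pausing `delay` seconds between requests.

    `post` is a hypothetical callable: it takes a character and returns the
    raw response bytes. Responses are assumed to be EUC-KR encoded.
    """
    results = {}
    for i, ch in enumerate(chars):
        if i:
            sleep(delay)  # be gentle: never more than one request per `delay`
        raw = post(ch)
        # Convert on the fly from EUC-KR to a regular Unicode string.
        results[ch] = raw.decode("euc_kr", errors="replace")
    return results
```

Passing sleep as a parameter is a small design choice: it lets the loop be tried out without actually waiting 8 seconds per character.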
In my list of tables, the one for Korean is called kKorean, and not kHangul – which is the name used in Unihan. The reason is that I store the Korean syllables in romanization, using the Yale system. Yale is definitely not the most common, but it is very well suited for automated conversion to and from hangul. I have two small functions in every language I use that provide this conversion. And they will be used in the next step: indexing.
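Why Yale converts so easily: precomposed hangul syllables are laid out arithmetically in Unicode, so each one splits into initial, vowel and final with two divisions. Below is a sketch of the hangul-to-Yale direction only, not my actual functions. It assumes the unabbreviated Yale spelling (ㅜ is always written wu), and the name hangul_to_yale is invented for the example.

```python
# Yale spellings, listed in the Unicode order of the jamo.
INITIALS = ["k", "kk", "n", "t", "tt", "l", "m", "p", "pp", "s", "ss", "",
            "c", "cc", "ch", "kh", "th", "ph", "h"]
VOWELS = ["a", "ay", "ya", "yay", "e", "ey", "ye", "yey", "o", "wa", "way",
          "oy", "yo", "wu", "we", "wey", "wi", "yu", "u", "uy", "i"]
FINALS = ["", "k", "kk", "ks", "n", "nc", "nh", "t", "l", "lk", "lm", "lp",
          "ls", "lth", "lph", "lh", "m", "p", "ps", "s", "ss", "ng", "c",
          "ch", "kh", "th", "ph", "h"]

HANGUL_BASE = 0xAC00    # first precomposed syllable, 가
HANGUL_COUNT = 11172    # 19 initials * 21 vowels * 28 finals


def hangul_to_yale(text):
    """Romanize precomposed hangul syllables; other characters pass through."""
    out = []
    for ch in text:
        index = ord(ch) - HANGUL_BASE
        if 0 <= index < HANGUL_COUNT:
            lead, rest = divmod(index, 21 * 28)
            vowel, final = divmod(rest, 28)
            out.append(INITIALS[lead] + VOWELS[vowel] + FINALS[final])
        else:
            out.append(ch)
    return "".join(out)
```

For instance, hangul_to_yale("한글") gives "hankul". Going the other way means parsing the romanization back into syllables, which is best done one syllable at a time since concatenated Yale can be ambiguous.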
Step 3: Cleanup and Indexing
For indexing I went back to Python, since I already had code for indexing from previous experiments. All I needed to do was read each line of the output from step 2, check whether there was at least one valid reading, convert the readings to Yale (as the output from Zonmal was in hangul), and update the sqlite database. Barely forty lines of code. My Unihan database is now 35.6MB, including the indexes, and powers a small web app I use daily to look up sinograms I don’t know, or whose Cantonese reading or meaning I don’t know. Very handy.
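The indexing loop boils down to something like the sketch below. It is not the real script: the input format (one line per character, U+XXXX, a tab, then space-separated hangul readings) and the kKorean columns (codepoint, reading) are assumptions made for the example, and to_yale stands for any hangul-to-Yale converter.

```python
import sqlite3


def index_readings(conn, lines, to_yale):
    """Insert Yale readings into kKorean; return the number of rows added.

    Assumed line format: "U+4E00<TAB>reading1 reading2".
    Assumed schema: kKorean(codepoint INTEGER, reading TEXT).
    """
    added = 0
    for line in lines:
        line = line.strip()
        code, sep, readings = line.partition("\t")
        # Skip blank lines, lines without readings, and anything unexpected.
        if not sep or not code.startswith("U+"):
            continue
        codepoint = int(code[2:], 16)
        for hangul in readings.split():
            conn.execute(
                "INSERT INTO kKorean (codepoint, reading) VALUES (?, ?)",
                (codepoint, to_yale(hangul)),
            )
            added += 1
    conn.commit()
    return added
```

Taking the converter as an argument keeps the loop independent of how the romanization is done.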
You will find below the source code for steps 1 and 3. You’d need my Unihan sqlite database to run them but it’s too heavy to upload – instead I’ll write another post on how to build it from the Unihan.txt file.
- Dylan’s Hakka Page – hideous but lots of good stuff in there
- Unihan database lookup – the original!
- Dominic Yu’s page on Chinese and computers – jyutping plugin
- Dr Lau’s PinFa input – Big5 encoding
- My own web app, based on Unihan
- Zonmal – the input form only
- Wiktionary zh – useful source but encoding of pinyin borked
- Wiktionary en – same, in English though, and encodings not borked. I’m planning to do a similar operation to fill in the gaps for kMandarin.
- ZDic – built on the Unihan too. And another eye-sore.
- Chinese Text Project – yet another Unihan-based 1996 eye-sore.
- 康熙字典網上版 – Want more eye-soreness, “made in China”?
- vi-nom-vni.mim – Lisp crapola, part of m17n library. Useful chu nom data. This will be used at some stage to fill in the gaps for Viêt.
- Narrow Python – what happens if you wanna go beyond the basic plane in Python? Boom. Read this.