一分錢, 一分貨 jat1 fan1 cin4, jat1 fan1 fo3
You get what you pay for.
Category Archives: Languages
駟 Cantonese: si3. “Team of four horses”.
(Yeah, the dude next to me is reading 賽馬 rags…)
宮室築成以後, 董卓強選民間少女八百多人, 充作宮娥彩女. 至於從民間搜刮來的財物更是不計其數, 僅囤積的糧食, 便足夠食用二十年.
成 sing4, seng4, cing4. finished.
When the palace was finished,
選 syun2. to choose, select.
民 man4. People, citizen.
間 gaan1. space, interval.
少 siu3/2. Few, less.
女 neoi5. girl
–> young girl
八 baat3. 8
人 jan4. Man. Person
–> 800+ girls
Dong Zhuo selected (forcibly) eight hundred young girls or more
充 cung1. to fill, full, supply
作 zok3. to make, work, perform.
–> supplied to work as
娥 ngo4. beautiful. good
彩 coi2. colour(ful).
–> 彩女 (lower-rank) maids in the palace.
And sent them to work as maids in the palace.
至 zi3. to reach, arrive
於 wu1, jyu1. in, at, on.
從 zung6/cung4/sung1. from, by, since, whence, through
–> as for
搜 sau2/1. search, seek; investigate
刮 gwaat3. shave, pare off, scrape
–> plundered, seized
來 loi4/6, lai4. to come, return.
財 coi4. valuables, riches, possessions.
物 mat6 thing, substance, creature.
更 ga(a)ng1, gang3. further, more.
是 si6. this. yes.
其 kei4. that, his/her/its.
數 sou3/2, sok3. number, several.
As for the property/resources seized from the public/civilians, they were innumerable.
僅 gan2/6. Only, merely, just.
囤 tyun4, deon6. grain basket.
積 zik1. accumulate, store up.
糧 loeng4. food, grain, provisions
食 sik6. eat, food
–> the accumulated/stored up provisions (food)
便 bin6, pin4. convenient, expedient.
足 zuk1, zeoi3. foot; enough.
夠 gau3. enough.
用 jung6. to use.
年 nin4. year
Just the accumulated food was enough to last 20 years.
董卓強迫獻帝遷都長安以後, 強征了二十五萬民夫, 在離長安二百多里的地方, 另築郿塢城, 建造宮室, 規模和京城不相上下.
董 dung2. Supervise. Surname.
卓 coek3/zoek3. Brilliant.
–> Dong Zhuo, died 192. Dictator.
強 koeng4/5, goeng6. Strong.
迫 baak1/3, bik1. Coerce. Busy.
–> Forced, coerced.
獻 hin3. Offer, present. Display.
帝 dai3. Emperor
–> Emperor Xian. Puppet of Dong Zhuo.
遷 cin1. To move, transfer.
都 dou1. Capital
–> Changed the capital to:
長 coeng4. Long.
安 on1. Peace.
後 hau6. After.
After Dong Zhuo forced Emperor Xian to move the capital to Chang’an,
征 zing1. Invade. Conquered.
了 liu5. Past particle.
二 ji6. 2
十 sap6. 10
五 ng5. 5
萬 maan6. 10K
民 man4. People.
夫 fu1/4. Man, adult man. Those.
He conscripted 250,000 laborers by force,
在 zoi6. At.
離 lei4/6. Depart. Separate.
百 baak3. 100
多 do1. Numerous. Several.
里 lei5. Distance unit. Village.
–> At 200+ li away.
的 dik1. Genitive.
地 dei6. Place.
方 fong1. Region.
另 ling6. Another. Separate.
築 zuk1. Build(ing).
郿 mei4. County in Shaanxi.
塢 wu2. Embankment. Low wall.
城 sing4, seng4. Castle, town.
–> Meiwu (name of the new city)
建 gin3. Build.
造 zou6, cou3/5. Build. Begin. Prepare.
宮 gung1. Palace. Temple.
室 sat1. Room. Place.
規 kwai1. Rules. Law.
模 mou4. Model, pattern. Copy.
–> Size, format.
和 wo6/4. Peace. Harmony. And.
京 ging1. Capital city.
不 bat1. Not.
相 soeng1. Mutual. Each other.
上 soeng6/5. Up/top, superior. Go/send up.
下 haa6/5. Bottom, below, inferior. Send down.
–> comparable, equivalent
sent them to a place 200 li or more from there, and had a new city built, Meiwu, and a palace comparable to the one in the capital.
The quality of the Unihan database, while good overall, degrades along with the popularity of the languages covered. Chinese (Mandarin and Cantonese) is doing okay, Japanese isn’t too bad, whereas Korean and Vietnamese leave a lot to be desired. So I decided to lend a helping hand and see if I could plug a few holes.
Step 1: What holes are there to fill?
The first step was to identify what’s missing. It’s all well and good to say that Korean isn’t well covered by the Unihan database, but actual facts would be better. Over the last few years (10?), I have done terrible things to the Unihan in my own little backyard. Today I have it indexed more or less to my liking as an sqlite database. The tables are (as of last week, who knows what I’ll add):
Don’t mind the initial k- in the table names, it’s how I prefix Constants in my favorite languages, and the habit carried over to sqlite tables. Which is convenient, since Unicode does the same to the field names in Unihan… It could even be that this k- prefix habit was acquired from too much time reading Unihan docs… People familiar with the Unihan file will sneeze at the kHakka table. Si señor, I know that Unihan doesn’t cover Hakka, dammit! I had to fetch data from Dr Lau, and had to first build a Hakka input method (劉拼法) based on Dr Lau’s work, for my Macs. From that, indexing Hakka readings into my Unihan sqlite database wasn’t exactly a hardship.
Likewise, building a jyutping 粵拼 input system for Mac OS X from the Unihan wouldn’t be so hard, but I only reinvent the wheel when it’s really necessary. And a dude called Dominic Yu produced an input plugin back in the day. There you go, complete with instructions. For the curious here’s what my input plugins panel looks like:
So, from this Unihan sqlite database, how to determine what’s missing for Korean? Easy. The gist of it is a simple SQL query:
select distinct codepoint from '+tbl+' where '+tbl+'.codepoint not in (select codepoint from kKorean);
where tbl is each of the tables (except kKorean, of course). So I wrote a Python script that iterates over these tables and takes care of the duplicates. This yielded close to 18,000 characters without a Korean reading. That’s quite a lot…
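For illustration, here is a minimal sketch of what such a script could look like. It is not the original script: it only assumes, as the query above does, that every table has a codepoint column. The kCantonese table name, the reading column and the rows in the demo are made up for the example.

```python
import sqlite3

def missing_codepoints(conn, target="kKorean"):
    """Return code points found in any other table but absent from `target`."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    missing = set()  # a set takes care of duplicates across tables
    for tbl in tables:
        if tbl == target:
            continue
        # Table names cannot be bound as parameters, so they are quoted
        # and interpolated into the query text.
        query = (f'SELECT DISTINCT codepoint FROM "{tbl}" '
                 f'WHERE codepoint NOT IN (SELECT codepoint FROM "{target}")')
        missing.update(row[0] for row in conn.execute(query))
    return missing

# Tiny in-memory stand-in for the real database (placeholder rows only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kKorean (codepoint TEXT, reading TEXT);
    CREATE TABLE kCantonese (codepoint TEXT, reading TEXT);
    CREATE TABLE kHakka (codepoint TEXT, reading TEXT);
    INSERT INTO kKorean VALUES ('U+4E00', 'il');
    INSERT INTO kCantonese VALUES ('U+4E00', 'jat1'), ('U+99DF', 'si3');
    INSERT INTO kHakka VALUES ('U+99DF', 'placeholder'), ('U+90FF', 'placeholder');
""")
print(sorted(missing_codepoints(conn)))
```

Collecting the results in a set is what handles the duplicates: U+99DF shows up in two tables in the demo but is reported once.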
Step 2: Let’s Grab Some Data
Next I had to find a reliable online source to fill in the gaps. I know exactly where to find info on all these missing sinograms, and more, in the dead-tree world (I used to own a copy of the 大漢韓辭典 which has 56,000 chars, give or take). But that wouldn’t be exactly practical… The best source I have found so far is Zonmal, which despite its third-world 20th century, webmaster-as-an-anally-retentive-dictator interface and ugly name, has quite a bit of information. After a little poking around, the local Adolf having tried hard to hide things from people like me – he who should be happy that some people are actually interested – I found out where to POST my queries, and how to find the results if any.
Since I didn’t want to hammer this site – the idea being to retrieve the data, not take it down, this affair being an .aspx thingy hosted on IIS – I had to be gentle. Also, the whole thing being encoded in EUC-KR, grrr, I needed to do on-the-fly conversions. For these reasons, I went back to my favorite language, REALbasic, which is much better equipped than Python for the task. I set a timer at 8 seconds, and for the next 38 hours or so, my trusty MBP pinged that web site one request at a time, gently extracting the information I needed. Tonight I finally saw the result: 8,346 characters with a match, and readings filled out. That’s a bit under half of the missing characters. Not so bad.
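The actual harvesting was done in REALbasic, but the same idea (one request at a time, a fixed pause in between, EUC-KR decoded on the fly) can be sketched in Python. Everything named here is an assumption made for illustration: http_post, harvest, and whatever URL and form fields the caller passes in. The demo runs offline against a fake fetcher, so no real site is contacted.

```python
import time
import urllib.parse
import urllib.request

def http_post(url, form):
    """POST a form (a dict) and return the raw response bytes.
    The URL and field names are up to the caller; none are known here."""
    data = urllib.parse.urlencode(form).encode("ascii")
    with urllib.request.urlopen(url, data=data, timeout=30) as resp:
        return resp.read()

def harvest(chars, fetch, delay=8.0, sleep=time.sleep):
    """Call fetch(char) for each character, pausing `delay` seconds
    between requests, and decode each EUC-KR response to str."""
    pages = {}
    for i, ch in enumerate(chars):
        if i:
            sleep(delay)  # pause between requests, not before the first
        pages[ch] = fetch(ch).decode("euc_kr", errors="replace")
    return pages

# Offline demonstration: the fake fetcher returns EUC-KR bytes, and the
# pauses are recorded in a list instead of actually sleeping.
pauses = []
fake_fetch = lambda ch: "한".encode("euc_kr")
pages = harvest(["一", "駟"], fake_fetch, sleep=pauses.append)
print(pages, pauses)
```

At one request every 8 seconds, 38 hours works out to about 17,100 requests, which fits the "close to 18,000" missing characters from step 1.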
In my list of tables, the one for Korean is called kKorean, and not kHangul – which is the name used in Unihan. The reason is that I store the Korean syllables in romanization, using the Yale system. Yale is definitely not the most common, but it is very well suited for automated conversion to and from hangul. I have two small functions in every language I use that provide this conversion. And they will be used in the next step: indexing.
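To show why Yale lends itself to automation, here is a hedged sketch of the hangul-to-Yale direction only. It is not the pair of functions mentioned above. It relies on the arithmetic layout of precomposed hangul syllables in Unicode (19 initials × 21 vowels × 28 finals, starting at U+AC00), and it simplifies one thing: ㅜ is always written wu, whereas Yale usually shortens it to u after labial consonants.

```python
# Jamo tables in Unicode order; an empty string means "no consonant".
INITIALS = ["k", "kk", "n", "t", "tt", "l", "m", "p", "pp", "s", "ss", "",
            "c", "cc", "ch", "kh", "th", "ph", "h"]
VOWELS = ["a", "ay", "ya", "yay", "e", "ey", "ye", "yey", "o", "wa", "way",
          "oy", "yo", "wu", "we", "wey", "wi", "yu", "u", "uy", "i"]
FINALS = ["", "k", "kk", "ks", "n", "nc", "nh", "t", "l", "lk", "lm", "lp",
          "ls", "lth", "lph", "lh", "m", "p", "ps", "s", "ss", "ng", "c",
          "ch", "kh", "th", "ph", "h"]

def hangul_to_yale(syllable):
    """Romanize one precomposed hangul syllable (U+AC00..U+D7A3)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed hangul syllable: {syllable!r}")
    initial, rest = divmod(index, 21 * 28)  # 588 syllables per initial
    vowel, final = divmod(rest, 28)         # 28 finals per vowel
    return INITIALS[initial] + VOWELS[vowel] + FINALS[final]

print(hangul_to_yale("한"), hangul_to_yale("국"), hangul_to_yale("글"))
```

Going the other way means parsing the romanized string back into the three indexes and adding them up again from U+AC00, which is left out here.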
Step 3: Cleanup and Indexing
For indexing I went back to Python, since I already had code for indexing from previous experiments. All I needed to do was read each line of the output from step 2, check whether there was a valid reading (or more), convert them to Yale (as the output from Zonmal was in hangul), and update the sqlite database. Barely forty lines of code. My Unihan database is now 35.6MB, including the indexes, and is used by a small web app I use daily to look up sinograms when I don’t know the character, its Cantonese reading, or its meaning. Very handy.
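As a rough sketch of that indexing loop, and not the original forty lines: the input format (one line per character, code point and hangul readings separated by a tab, readings separated by spaces) and the reading column name are assumptions. The hangul-to-Yale converter is passed in as a function so the block stays self-contained.

```python
import sqlite3

def index_readings(conn, lines, to_yale):
    """Insert (codepoint, Yale reading) rows into kKorean.
    Lines with no reading are skipped. Returns the number of rows added."""
    added = 0
    for line in lines:
        codepoint, _, readings = line.rstrip("\n").partition("\t")
        for hangul in readings.split():  # zero, one or several readings
            conn.execute(
                "INSERT INTO kKorean (codepoint, reading) VALUES (?, ?)",
                (codepoint, to_yale(hangul)))
            added += 1
    conn.commit()
    return added

# Demonstration with an in-memory table and a stand-in converter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kKorean (codepoint TEXT, reading TEXT)")
demo_lines = ["U+4E00\t일\n", "U+99DF\t\n"]  # second line: no match found
stand_in = {"일": "il"}.__getitem__           # replace with a real converter
count = index_readings(conn, demo_lines, stand_in)
print(count, conn.execute("SELECT * FROM kKorean").fetchall())
```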
You will find below the source code for steps 1 and 3. You’d need my Unihan sqlite database to run them but it’s too heavy to upload – instead I’ll write another post on how to build it from the Unihan.txt file.
- Dylan’s Hakka Page – hideous but lots of good stuff in there
- Unihan database lookup – the original!
- Dominic Yu’s page on Chinese and computers – jyutping plugin
- Dr Lau’s PinFa input – Big5 encoding
- My own web app, based on Unihan
- Zonmal – the input form only 😉
- Wiktionary zh – useful source but encoding of pinyin borked
- Wiktionary en – same, in English though, and encodings not borked. I’m planning to do a similar operation to fill in the gaps for kMandarin.
- ZDic – built on the Unihan too. And another eye-sore.
- Chinese Text Project – yet another Unihan-based 1996 eye-sore.
- 康熙字典網上版 – Want more eye-soreness, “made in China”?
- vi-nom-vni.mim – Lisp crapola, part of m17n library. Useful chu nom data. This will be used at some stage to fill in the gaps for Viêt.
- Narrow Python – what happens if you wanna go beyond the basic plane in Python? Boom. Read this.
Dear Appeul, I can haz a betteur string encoding endjin for ze Français ? It iz vélocité. Note to Appeul: ASCII is for ze ouik. You Tee Eff foh evah!
A long long time ago, when I was studying linguistics and Asian languages in Paris, I was introduced to a researcher who had written his PhD about the dialect spoken in a Hakka village called Sung Him Tong. He gave me a copy of his PhD dissertation, which I probably have somewhere up in storage in Kwaichung or wherever it is that my stuff is stored.
Back then I wasn’t interested in Cantonese, and other non-mainstream Chinese languages. I was immersed in Middle Chinese and other dead languages, and had little time for the languages spoken by live people. I was twenty-something and allowed to be foolish. Anyway, I chucked the dissertation in my library, where it accumulated dust. The only impression I kept was that this man must’ve been very determined to live in a Hakka village for 6 months or more, just to study their dialect. The image I had of Hakka villages was that of round, multi-storey wooden houses shared by several families in the boondocks of Mainland China.
Except it’s not in the boondocks… Well, at least not in the middle of nowhere in Mainland China, but a cannon-shot away from Fanling KCR/MTR station. It looks nowhere near as shiny and modern as, say, Central 🙂 (see some nice pics here), but even 20 years ago, it must’ve been less of a hardship than living in Shenzhen today…
Anyway for some strange reason I got a blast from the past — I was looking up some references about the Hakka language, and a bibliographical reference to that PhD dissertation came up — and I thought that I should look up that place, 20 years after receiving the dissertation… Better late than never, right? I felt some kind of disappointment — here I was, as a kid, imagining that dude slumming it in the mountains with the indigenous population, whereas he was probably commuting every day on a 小巴… My hero’s a commuter. Sigh…
So I poked around a bit, since this place is near 沙頭角, a place I really want to visit, and hk-place is always a good start when you’re looking for info and piccies about forgotten places in HK. There are lots of not-so-ancient buildings, but there also seems to be a bunch of 圍村, home to the 鄧 family, cousins to the people who live in 錦田 (and thus 吉慶圍 I suppose). The two pictures here show the contrast that you can find in such HK villages: concrete and stone walls. This is something that I enjoy quite a bit. This village is definitely on my list of things to visit in 2011!