We’ve built a Chinese and Korean language quiz to estimate how many TV drama words you know. Here is how it works.
Table of contents
- What is Drama Vocab Quiz?
- How is the Chinese drama word list created?
- How is the Korean drama word list created?
- How is the Drama Vocab Quiz constructed?
- How do we estimate how many drama words you know?
What is Drama Vocab Quiz?
Our Drama Vocab Quiz is a free language learning supplement for anyone studying Chinese or Korean, developed by the Rakuten Institute of Technology Singapore in collaboration with Viki. It aims to help English-speaking learners estimate how many words from popular Chinese or Korean TV dramas they already know.
The quiz is a new companion tool for Learn Mode Chinese and Learn Mode Korean, interactive tools we developed with Viki that let users click on drama subtitles to complete simple language learning tasks, such as getting instant definitions of the Chinese or Korean words spoken by the actors.
After you take the Drama Vocab Quiz, we tell you which words you got right, and provide links to the drama episodes featuring words you missed. Based on your familiarity with the easy, medium, and hard words presented in the quiz, we also estimate how many drama words in total you are likely to know.
You can then practice by watching dramas on Viki and looking up new words with Learn Mode enabled, and retake our quiz to check your progress.
How is the Chinese drama word list created?
To create the Chinese Drama Vocab Quiz, we take all the Chinese dramas on Viki together with all their original language subtitles.
Since Chinese writing has no spaces to indicate where one word ends and another begins, we take the additional step of splitting sentences into individual words, or tokens, using a word tokenization tool. For example, the sentence 妳放心我已经请我的会计在处理了 is split as follows:
Note that in the case of English or Korean, we would need to perform an additional stemming step. For instance, the English word “walk” has different forms, such as “walking”, “walked”, “walks”, and “walkable”. In contrast, Chinese words don’t take on affixes, so the words produced by a tokenizer are all treated as word types without any need for stemming.
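As a rough illustration of what a tokenization step does, here is a minimal sketch of greedy forward maximum matching, one of the simplest dictionary-based segmentation techniques. It is not the tool used for the quiz. The function name `max_match` and the tiny `TOY_LEXICON` are hypothetical, and the resulting split reflects only this toy lexicon, not necessarily the segmentation a production tokenizer would return.

```python
def max_match(sentence, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, just large enough for the example sentence.
TOY_LEXICON = {"放心", "已经", "会计", "处理"}

sentence = "妳放心我已经请我的会计在处理了"
tokens = max_match(sentence, TOY_LEXICON)
print(" / ".join(tokens))
```

Real tokenizers rely on much larger dictionaries and statistical models to resolve ambiguous boundaries, which a greedy match cannot do.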
Next, we look at word frequencies. For example, we find that the 10 most common words in Chinese dramas are:
Since we aim to focus our language learners on the vocabulary they are likely to encounter in popular culture, we remove rare words that appear fewer than 20 times in Chinese dramas, e.g.
At times, a specific word may be used a lot in a single drama, but not much elsewhere; that’s why we only include words that appear across at least 10 different dramas. After removing rare words, we match the remaining drama words against the entries in our in-house Chinese-English dictionary. Our final list comprises about 6K of the most common Chinese drama words.
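The two rarity filters described above can be sketched as follows. This is a minimal illustration, assuming the tokenized subtitles are available as a mapping from a drama identifier to its list of tokens; the function name `frequent_words` and that data layout are hypothetical, while the thresholds of 20 occurrences and 10 dramas come from the text.

```python
from collections import Counter


def frequent_words(drama_tokens, min_count=20, min_dramas=10):
    """Keep words that occur at least `min_count` times overall
    and appear in at least `min_dramas` different dramas."""
    total = Counter()         # total occurrences across all dramas
    drama_spread = Counter()  # number of distinct dramas containing the word
    for tokens in drama_tokens.values():
        total.update(tokens)
        drama_spread.update(set(tokens))
    return {
        word
        for word, count in total.items()
        if count >= min_count and drama_spread[word] >= min_dramas
    }


# Toy data with lowered thresholds so the effect is visible.
toy = {"d1": ["a", "a", "b"], "d2": ["a", "c"], "d3": ["c"]}
kept = frequent_words(toy, min_count=2, min_dramas=2)
```

Counting each drama's tokens as a set is what prevents a word repeated many times in a single drama from passing the spread filter.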
How is the Korean drama word list created?
To create the Korean Drama Vocab Quiz, we take all the Korean dramas on Viki together with all their original language subtitles.
Korean differs from Chinese in that it is an agglutinative language, featuring a very high number of affixes. Recall that the English word “walk” has many different forms, such as “walking”, “walked”, “walks”, “walkable”, etc. Korean takes this complexity to a whole new level…
For example, the word 미남 [minam] “handsome guy” can be followed by multiple suffixes at once, -이시라구요 [-issilaguyo], to form a single word meaning “someone said that he is handsome” (see here for a helpful discussion).
The above one-word “sentence” demonstrates how Korean word types can recursively be extended with different word endings to change the meaning of the original root word.
For our quiz, in the case of 미남이시라구요, we only extract the root word 미남 [minam], and count it as a unique word type, ignoring all other variations on the word associated with different endings.
Further, Korean writing is phonemic (i.e. each Hangul character corresponds to a particular sound), so single-character Hangul words can be ambiguous when written out of context. To avoid ambiguity in the quiz items, we remove single-character Hangul words.
After filtering out the single-character Hangul words, the 10 most common words in Korean dramas on Viki are:
We also remove rare words that appear fewer than 20 times in the Korean dramas, or across fewer than 10 different dramas, e.g.
After removing the rare words and the single-character Hangul words, we match the remaining drama words against the entries in our in-house Korean-English dictionary. Our final list comprises about 4K of the most common Korean drama words.
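A filter of this kind can be sketched in a few lines. This is only an illustration, assuming that a “single character Hangul word” means a word consisting of exactly one precomposed Hangul syllable (Unicode range U+AC00 to U+D7A3); the helper names are hypothetical.

```python
def is_single_hangul_syllable(word):
    """True if the word is exactly one precomposed Hangul syllable."""
    return len(word) == 1 and "\uac00" <= word <= "\ud7a3"


def drop_single_hangul(words):
    """Remove one-syllable Hangul words, keeping everything else in order."""
    return [w for w in words if not is_single_hangul_syllable(w)]


filtered = drop_single_hangul(["미남", "나", "물", "사랑"])
```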
How is the Drama Vocab Quiz constructed?
For both the Chinese and Korean quizzes, we sort drama words by their frequency of appearance in Viki dramas. We rank the words from the most frequent to the least frequent, and split them into three difficulty levels using the percentiles of the word frequencies. We use a word’s frequency as a proxy for its difficulty; that is, we assume that the words appearing most frequently in dialogues are generally “easier” than the words that are rarely used. The hard level includes the words with the lowest frequencies, from the 1st to the 24th percentile. The medium level is made up of words in the 25th to 74th percentile. The easy level consists of words in the 75th to 100th percentile.
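The percentile split can be sketched as below. This is a minimal illustration that assumes a simple rank-based percentile (position in the frequency-sorted list divided by the list length); the text does not say how ties or boundary cases are handled, and the function name `assign_levels` is hypothetical.

```python
def assign_levels(word_freq):
    """Split words into hard / medium / easy by frequency percentile.

    `word_freq` maps each word to its frequency count.
    """
    ranked = sorted(word_freq, key=word_freq.get)  # least frequent first
    n = len(ranked)
    levels = {"hard": [], "medium": [], "easy": []}
    for rank, word in enumerate(ranked):
        percentile = 100 * rank / n
        if percentile < 25:
            levels["hard"].append(word)    # lowest frequencies
        elif percentile < 75:
            levels["medium"].append(word)
        else:
            levels["easy"].append(word)    # highest frequencies
    return levels


# Toy vocabulary: w1 is the rarest word, w8 the most common.
toy_freq = {f"w{i}": i for i in range(1, 9)}
levels = assign_levels(toy_freq)
```

With this split, the medium level holds roughly half of the word list and the easy and hard levels about a quarter each.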
For each quiz, we draw a sample of 30 words – 10 easy, 10 medium, and 10 hard. Every time you take our quiz, you see a new set of words. Based on your answers, we estimate your drama vocabulary size, or how many words used in Viki dramas you know.
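Drawing a fresh quiz can be sketched as sampling without replacement within each level. This is an illustration only: the function name `draw_quiz`, the shape of `levels`, and the choice of uniform sampling are assumptions, since the text specifies only the 10/10/10 split.

```python
import random


def draw_quiz(levels, per_level=10, rng=random):
    """Pick `per_level` distinct words uniformly at random from each level."""
    return {level: rng.sample(words, per_level) for level, words in levels.items()}


# Toy word lists standing in for the real easy / medium / hard levels.
toy_levels = {
    "easy": [f"e{i}" for i in range(50)],
    "medium": [f"m{i}" for i in range(50)],
    "hard": [f"h{i}" for i in range(50)],
}
quiz = draw_quiz(toy_levels, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a draw reproducible, which is handy for testing; the default uses the global generator so each quiz differs.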
How do we estimate how many drama words you know?
To estimate the percentage of unique drama words you know, we first tabulate the number of easy, medium, and hard questions you answered correctly or incorrectly.
To improve the accuracy of your vocabulary knowledge estimate, we try to discourage random guessing. Since there are only 3 answer options per question, the chance of guessing correctly is 1 in 3. We penalize potential guessing by discounting 1 correctly answered question for every 2 questions answered incorrectly within the same difficulty level.
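The reasoning behind the penalty is that a pure guesser with 3 options is wrong 2 times out of 3 and right 1 time out of 3, so every 2 wrong answers suggest about 1 lucky hit. A minimal sketch of the rule follows; rounding down to whole questions and never letting the score drop below zero are assumptions, and `penalized_correct` is a hypothetical name.

```python
def penalized_correct(correct, incorrect):
    """Discount one correct answer for every two incorrect answers
    within a difficulty level, never going below zero."""
    return max(0, correct - incorrect // 2)


# 7 right and 3 wrong out of 10: one correct answer is discounted.
adjusted = penalized_correct(7, 3)
```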
From there, we estimate the percentage of words you know per level.
For the medium and hard words we use the lower bound, and for the easy words the upper bound, of a confidence interval from a one-tailed t-test with a significance level of 0.1. We treat the share of correct answers per level as an observed sample mean and model it with a t-distribution. The confidence interval gives the range of values for the true share of known words that is statistically plausible given the observed mean.
For example, let’s assume you know 5 out of 10 medium-frequency words. How many medium-frequency drama words do you know in total? The lower bound of the interval in the one-tailed t-test is 0.34 (i.e. 0.5 - 0.16), so we estimate that you know about 34% of all the drama words in the medium category. We multiply this lower bound by the total number of medium-frequency words to get the estimated number of medium words you know. We perform similar computations for the easy and hard words, and add up the words from each difficulty level to arrive at the final estimate of the number of drama words you know.
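The arithmetic of this example can be sketched as follows. The margin of 0.16 is taken directly from the example rather than derived from a t-test here, and the medium level size of 3,000 words is a hypothetical figure (about half of a list of roughly 6K words); the function names are illustrative as well.

```python
def lower_bound(observed_share, margin):
    """Lower end of the interval around the observed share, floored at 0."""
    return max(0.0, observed_share - margin)


def estimated_words(share_known, level_size):
    """Scale the estimated share up to the number of words in the level."""
    return round(share_known * level_size)


MEDIUM_LEVEL_SIZE = 3000  # hypothetical: about half of a ~6K word list

medium_share = lower_bound(5 / 10, 0.16)  # 5 of 10 medium words known
medium_words = estimated_words(medium_share, MEDIUM_LEVEL_SIZE)
print(f"{medium_share:.2f} of medium words, about {medium_words} words")
```

The total estimate would then be the sum of the per-level word counts, using the upper bound for the easy level and the lower bound for the medium and hard levels.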
As more students take the quiz, we will be able to refine the scoring method, and improve our estimates of the number of drama words you know. We will be updating this page to reflect any adjustments to our approach.
We are excited to work on discovering new ways of powering language learning with authentic content. Thank you for reading, and let us know what you think!