Designing a metric
Contact is a word game. I recently created a multiplayer implementation called sibylant graze that runs on WebSockets; that page details the rules if you’re unfamiliar.
A natural question is how to select the best words to use when it’s our turn. A simple heuristic is that we want the other players to be unable to guess the word we’re thinking of given any prefix of that word. But given a prefix like “cham” for “champion,” even though there are many other words that start with “cham,” it’s likely that a player will think of “champion” as well.
One way we can mitigate this, while making sure we choose common, well-known words, is to optimize for “phonemic hairpin turns,” in which players think the revealed prefix of the word is pronounced one way (since that’s the most common pronunciation of that prefix in other words), but in our word, it’s pronounced differently.
For example, the word “champagne” would probably not cross the mind of a player running through options with more common pronunciations of “cham” like “champion,” “chameleon,” or “chamber.”
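To see the hairpin in phoneme terms, compare a few “cham” words. The transcriptions below are hard-coded approximations for illustration; we’ll pull real ones from a corpus shortly.

```python
# Hard-coded, approximate ARPAbet transcriptions (illustration only).
cham_words = {
    'champion':  ['CH', 'AE1', 'M', 'P', 'IY0', 'AH0', 'N'],
    'chamber':   ['CH', 'EY1', 'M', 'B', 'ER0'],
    'champagne': ['SH', 'AE0', 'M', 'P', 'EY1', 'N'],
}

for word, phones in cham_words.items():
    print(f"{word:>10}: {' '.join(phones)}")

# "champagne" opens with SH where its neighbors open with CH: a player
# sounding out the prefix "cham" is likely to be led down the wrong path.
```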
Borrowed foreign language words are a good source of phonemic hairpin turns, but my guess is that plenty exist in more English-y words as well.
We’ll use the cmudict phoneme dictionary corpus, which gives us the pronunciation of an entire word in ARPAbet code. Unfortunately, there’s no good way to get the prefix of the ARPAbet transcription induced by a prefix of the word; I couldn’t find any other corpus that gives this information.
Although likely to elicit ire from linguists, we could try assuming a 1:1 phoneme-to-character ratio.
```python
# arpabet :: Word -> [ARPAbet-Transcription]
# (maps to a list to account for homographs)
import nltk

arpabet = nltk.corpus.cmudict.dict()

offsets = []
for w, ts in arpabet.items():
    # Average |characters - phonemes| gap across a word's transcriptions.
    offsets.append(sum(abs(len(w) - len(t)) for t in ts) / len(ts))

print([offsets.count(idx) / len(offsets) for idx in range(3)])
# [0.25005062573407316, 0.399700295654287, 0.24193430804746668]
```
25% of words have the same number of characters as phonemes, and another 40% are only one off. Though this is, linguistically speaking, completely incorrect, we’ll go ahead and assume that the first n characters of a word correspond to the first n phonemes of the ARPAbet transcription.
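To make the assumption concrete, here’s what it buys us for one word, with a hard-coded, cmudict-style transcription (illustration only; real lookups come from the corpus loaded above):

```python
# Hard-coded, cmudict-style transcription for "dolphin" (illustration only):
# 7 characters, 6 phonemes, so only one off.
word = 'dolphin'
transcription = ['D', 'AA1', 'L', 'F', 'IH0', 'N']

def transcription_prefix(transcription, prefix):
    # Under the 1:1 assumption, a character-prefix of length n
    # induces the first n phonemes of the transcription.
    return transcription[:len(prefix)]

print(transcription_prefix(transcription, 'do'))     # ['D', 'AA1']
print(transcription_prefix(transcription, 'dolph'))  # ['D', 'AA1', 'L', 'F', 'IH0']
```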
First, we’ll write a function that computes a prefix’s subscore. It looks for all other words that begin with the same prefix, and checks the frequency of the word’s transcription-prefix against all the other words’ transcription-prefixes.
```python
words = set(arpabet.keys())

from itertools import chain

def flatten(xss):
    return chain.from_iterable(xss)

def subscore(word, word_transcription, prefix):
    n = len(prefix)
    # All corpus words sharing this character-prefix.
    candidates = [w for w in words if w.startswith(prefix)]
    # Their transcription-prefixes, under the 1:1 assumption.
    ts = list(t[:n] for t in flatten(arpabet[c] for c in candidates))
    t = word_transcription[:n]
    return len(ts) / ts.count(t)

print(subscore('dolphin', arpabet['dolphin'][0], 'do'))
# 3.3979933110367893
print(subscore('domestic', arpabet['domestic'][0], 'do'))
# 21.617021276595743
```
Clearly, the pronunciation of “do” in “dolphin” is much more common than in “domestic.”
The straightforward way to extend this to an entire word is to just calculate the subscores for each successive prefix.
```python
def score(word, word_transcription):
    return [subscore(word, word_transcription, word[:idx])
            for idx in range(1, len(word))]

print(score('dolphin', arpabet['dolphin'][0]))
# [1.0025953802232026, 3.3979933110367893, 2.4375, 1.3333333333333333, 1.3333333333333333, 1.0]
print(score('domestic', arpabet['domestic'][0]))
# [1.0025953802232026, 21.617021276595743, 3.607142857142857, 3.0, 1.4285714285714286, 1.2857142857142858, 3.0]
```
Depending on your play style, you may prefer the score to be spread throughout the word or concentrated at the beginning. This is easy; we can select for the former using a concave-down function like the square root, or the latter using a concave-up function like the square. I personally prefer the score to be spread out.
```python
def realize_score(score, fn):
    return sum(fn(s) for s in score)

sqrt = lambda a: a ** .5

print(realize_score(score('dolphin', arpabet['dolphin'][0]), sqrt))
# 7.715312096763084
print(realize_score(score('domestic', arpabet['domestic'][0]), sqrt))
# 13.343179316809652
```
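To see the difference between the two realizing functions, here’s a toy comparison. The subscore lists are made up, chosen to have the same total, and `realize_score` and `sqrt` are redefined so the snippet stands alone:

```python
def realize_score(score, fn):
    return sum(fn(s) for s in score)

sqrt = lambda a: a ** .5
square = lambda a: a * a

spread       = [2.0, 2.0, 2.0, 2.0]  # total 8, evenly distributed
concentrated = [5.0, 1.0, 1.0, 1.0]  # total 8, front-loaded

print(realize_score(spread, sqrt), realize_score(concentrated, sqrt))
# ~5.66 vs ~5.24: the square root favors the spread-out profile
print(realize_score(spread, square), realize_score(concentrated, square))
# 16.0 vs 28.0: the square favors the concentrated profile
```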
Finally, we can try to get the word with the best realized score. The runtime as of now is quadratic, so we’ll just select a random subset of the corpus words.
```python
import random

scores = [(word, score(word, arpabet[word][0]))
          for word in random.sample(sorted(words), 100)]
best = sorted(scores, reverse=True, key=lambda x: realize_score(x[1], sqrt))
print([word for word, _ in best][:20])
# ['schunk', 'entrees', 'psychosomatic', 'lieutenants', "executive's", 'equalization', 'schiel', 'chemically', 'disinvited', 'levina', 'thuot', 'revolutionize', 'deruyter', 'viewership', "absolut's", "edition's", 'premonitions', 'watersheds', 'brocades', 'transpac']
```
As expected, foreign language words did well, as did words with silent letters. “executive” did well because of the odd pronunciation of its first two letters, and “premonitions” because of the odd pronunciation of its first three.
Other great words I’ve come across are “hourglass”, “tournament”, “asymmetry”, “bronchitis”, “herbal”, and “overt.”
Some ideas for further exploration:
- Speed up the algorithm (is it possible to improve the asymptotics?) and run it with a larger dataset
- Find some balance between normalizing for word length and actually selecting longer words (since those are probably harder)
- Try out other realizing functions
- Weight a candidate more with greater frequency, so it is more likely you will know the answer
- Weight a list of candidates less with greater length, so it’s harder to lose by the other players spamming hints
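On the first idea: one way to improve the asymptotics is to precompute, for each character-prefix, a counter of transcription-prefixes, so each subscore becomes a couple of dictionary lookups instead of a scan over the whole corpus. A minimal sketch, using a tiny hard-coded stand-in for cmudict so it’s self-contained:

```python
from collections import Counter, defaultdict

# Tiny hard-coded stand-in for the cmudict mapping (illustration only);
# swap in the real `arpabet` dict to use it for real.
toy_arpabet = {
    'dolphin':  [['D', 'AA1', 'L', 'F', 'IH0', 'N']],
    'dollar':   [['D', 'AA1', 'L', 'ER0']],
    'domestic': [['D', 'AH0', 'M', 'EH1', 'S', 'T', 'IH0', 'K']],
}

# One pass over the corpus: for every character-prefix, count how often each
# transcription-prefix occurs among the words sharing that character-prefix.
prefix_counts = defaultdict(Counter)
for w, ts in toy_arpabet.items():
    for t in ts:
        for n in range(1, len(w) + 1):
            prefix_counts[w[:n]][tuple(t[:n])] += 1

def fast_subscore(word_transcription, prefix):
    # Same ratio as before (all transcription-prefixes over ours),
    # but now per-prefix work no longer scans the corpus.
    counts = prefix_counts[prefix]
    return sum(counts.values()) / counts[tuple(word_transcription[:len(prefix)])]

print(fast_subscore(toy_arpabet['domestic'][0], 'do'))  # 3.0
print(fast_subscore(toy_arpabet['dolphin'][0], 'do'))   # 1.5
```

The precomputation is a single pass over every (word, prefix) pair, after which scoring a word is linear in its length.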