
Achieving Sibylant Graze: Word-Selection for Contact with NLTK

Designing a metric

Contact is a word game. I recently created a multiplayer implementation that runs on WebSockets called sibylant graze; that page details the rules if you’re unfamiliar.

A natural question is how to select the best words to use when it’s our turn. A simple heuristic is that we want the other players to be unable to guess the word we’re thinking of from any prefix of it. But given a prefix like “cham” for “champion,” even though there are many other words that start with “cham,” it’s likely that a player will think of “champion” as well.

One way we can mitigate this, while making sure we choose common, well-known words, is to optimize for “phonemic hairpin turns,” in which players think the revealed prefix of the word is pronounced one way (since that’s the most common pronunciation of that prefix in other words), but in our word, it’s pronounced differently.

For example, the word “champagne” would probably not cross the mind of a player running through options with more common pronunciations of “cham,” like “champion,” “chameleon,” or “chamber.”

Borrowed foreign language words are a good source of phonemic hairpin turns, but my guess is that plenty exist in more English-y words as well.

Implementation

We’ll use the cmudict phoneme dictionary corpus with Python’s nltk.

cmudict gives us the pronunciation of an entire word in ARPAbet code. Unfortunately, there’s no good way to get the prefix of the ARPAbet transcription induced by a prefix of the word, and I couldn’t find any other corpus that gives this information either.

Although it’s likely to elicit ire from linguists, we could try assuming a 1:1 phoneme-to-character ratio.

# arpabet :: Word -> [ARPAbet-Transcription]
# (maps to a list to account for homographs)
import nltk
# nltk.download('cmudict')  # fetch the corpus on first use
arpabet = nltk.corpus.cmudict.dict()

# For each word, take the average |characters - phonemes| gap across its transcriptions.
offsets = []
for w, ts in arpabet.items():
    offsets.append(sum(abs(len(w) - len(t)) for t in ts)/len(ts))
# Fraction of words whose gap is exactly 0, 1, or 2.
print([offsets.count(idx)/len(offsets) for idx in range(3)])
# [0.25005062573407316, 0.399700295654287, 0.24193430804746668]

25% of words have the same number of characters as phonemes, and another 40% are only one off. Though this is, linguistically speaking, completely incorrect, we’ll go ahead and assume that the first n characters of a word correspond to the first n phonemes of the ARPAbet transcription.
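To make the assumption concrete, here’s a quick sketch that reuses the arpabet dict from above (it assumes cmudict has been downloaded, and the transcriptions noted in the comment may vary slightly by cmudict version):

prefix = 'cham'
for word in ('champion', 'champagne'):
    for transcription in arpabet[word]:
        # the first four phonemes stand in for the first four characters,
        # e.g. champion -> CH AE1 M P vs. champagne -> SH AE0 M P
        print(word, transcription[:len(prefix)])

Under this assumption, the “cham” of “champagne” already looks nothing like the “cham” of “champion.”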

First, we’ll write a function that computes a prefix’s subscore. It collects every word that begins with the same prefix and checks how often our word’s transcription-prefix appears among all of those words’ transcription-prefixes: the rarer our pronunciation of the prefix, the higher the subscore.

words = set(arpabet.keys())
from itertools import chain
def flatten(xss):
    return chain.from_iterable(xss)

def subscore(word, word_transcription, prefix):
    n = len(prefix)
    # All corpus words that share this spelled prefix.
    candidates = [w for w in words if w.startswith(prefix)]
    # Their transcription-prefixes, under the n-characters ~ n-phonemes assumption.
    ts = list(t[:n] for t in flatten(arpabet[c] for c in candidates))
    t = word_transcription[:n]
    # Ratio of all candidate prefixes to those pronounced like ours: higher means rarer.
    return len(ts) / ts.count(t)

print(subscore('dolphin', arpabet['dolphin'][0], 'do'))
# 3.3979933110367893
print(subscore('domestic', arpabet['domestic'][0], 'do'))
# 21.617021276595743

Clearly, the pronunciation of “do” in “dolphin” is much more common than in “domestic.”

The straightforward way to extend this to an entire word is to just calculate the subscores for each successive prefix.

def score(word, word_transcription):
    # Subscore for every proper prefix (lengths 1 through len(word) - 1).
    return [subscore(word, word_transcription, word[:idx]) for idx in range(1, len(word))]

print(score('dolphin', arpabet['dolphin'][0]))
# [1.0025953802232026, 3.3979933110367893, 2.4375, 1.3333333333333333, 1.3333333333333333, 1.0]
print(score('domestic', arpabet['domestic'][0]))
# [1.0025953802232026, 21.617021276595743, 3.607142857142857, 3.0, 1.4285714285714286, 1.2857142857142858, 3.0]

Depending on your play style, you may prefer the score to be spread out throughout the word or concentrated in one big hairpin. This is easy to tune: for a fixed total, a concave-down function like the square root rewards subscores that are spread evenly, while a concave-up function like the square rewards one large spike. I personally prefer the score to be spread out.

def realize_score(score, fn):
    return sum(fn(s) for s in score)
sqrt = lambda a: a ** .5

print(realize_score(score('dolphin', arpabet['dolphin'][0]), sqrt))
# 7.715312096763084
print(realize_score(score('domestic', arpabet['domestic'][0]), sqrt))
# 13.343179316809652
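If you lean toward the concentrated scores instead, realize_score takes a convex function just as easily; a minimal sketch with a square helper (its outputs are determined by the same subscores printed above):

square = lambda a: a ** 2
print(realize_score(score('dolphin', arpabet['dolphin'][0]), square))
print(realize_score(score('domestic', arpabet['domestic'][0]), square))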

Results

Finally, we can look for the words with the best realized scores. Scoring a single word scans the whole corpus once per prefix, so scoring every word is quadratic in the size of the corpus; we’ll just score a random subset of the corpus words.

import random  # needed for random.sample

scores = [(word, score(word, arpabet[word][0]))
          for word in random.sample(list(words), 100)]  # sample needs a sequence, not a set
best = sorted(scores, reverse=True, key=lambda x: realize_score(x[1], sqrt))

print([word for word, _ in best][:20])
# ['schunk', 'entrees', 'psychosomatic', 'lieutenants', "executive's", 'equalization', 'schiel', 'chemically', 'disinvited', 'levina', 'thuot', 'revolutionize', 'deruyter', 'viewership', "absolut's", "edition's", 'premonitions', 'watersheds', 'brocades', 'transpac']

As expected, foreign-language words did well, as did words with silent letters. “executive” did well because of the odd pronunciation of its first two letters, and “premonitions” because of the odd pronunciation of its first three.

Other great words I’ve come across are “hourglass”, “tournament”, “asymmetry”, “bronchitis”, “herbal”, and “overt.”

Further work