Language-specific random spelling: fesh, excenture, and the like

John Kerl
kerl.john.r@gmail.com
July 15, 2009

Contents:
The mathematical model  |  Plusses and minuses  |  Word lists  |  Word-length frequencies for English  |  More output  |  Three-letter correlations  |  Technical details


When I play Scrabble, I often think that what’s on my rack (motch, say) looks like it should be English — but (unfortunately for my score!) it isn’t in the dictionary.

Likewise, when I read translations of Lewis Carroll’s Jabberwocky (some of which are compiled here), it’s clear that even nonsense words can look like they certainly belong to a particular language: who could doubt the Englishness of “’Twas brillig, and the slithy toves did gyre and gimble in the wabe”, or the Frenchness of «Évite le frumieux Band-à-prend!»?

Question: Is it possible to encode the property of looking Englishy, Frenchish, and so on?

Here are some randomly generated words, which are results of my experiment to do just that. Hopefully, you can tell which language is which:

  adeal mather emoor snve fesh excenture fomposh cong egg recerante came opin warl weldury bang gomerick tower inyoad axe priecitant  

  figutes houpon tecourd’hui dinchode sans l’entreprise l’ésion si prenide muimesion dectacs auinde toyais bout qusicont placérde coluer doyeai a consicement  

  vor govio elanar habibisidar doy miear solosibar denco concucicer nepecar prantrzo pensamionto moladar pante tío sado numbiar dudo escrerir adqucés  

  jeden anssen Taten welbet bller eigerer ohnegen hohe aum senem getolen Abef muöls ihrar wowors wevor ihrd leinf alte veropäischen  


The mathematical model

My model uses Markov chains for letter-to-letter transition probabilities, arranged hierarchically by word length.

What that means is: when I generate a random word, I first choose a word length. To do that, I’ve consulted a word list for the particular language. If 21% of words in the list have 5 letters, then I give myself a 21% chance of generating a 5-letter word, and so on for other word lengths. Having selected the word length, the second step is to pick the letters. I’ve consulted the word list to find how often 5-letter words start with each letter — h, for example, is the first letter 2.5% of the time; suppose I pick h. Then I consult my data to see which letters follow h as the second letter of five-letter words, and how often, maybe picking a. Consulting my data to see which letters follow a as the third letter of a five-letter word, maybe I pick b: this gets me up to hab _ _. And so on, to the last letter of the word.

What makes this a Markov-chain model is that I only look at transition probabilities from one letter to the next — I don’t look at correlations between, say, the 1st and 4th letters. What makes it hierarchical is that I keep a separate Markov chain for each word length: the five-letter words don’t know what the four-letter words are doing.
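
In code, the whole recipe is just counting and then sampling from the counts. Here is a minimal Python sketch of the two-letter model (an illustration, not the actual program linked under Technical details; the function names and the toy word list are mine):

    import random
    from collections import defaultdict

    def train(words):
        # Tally word-length counts and, for each word length, first-letter
        # counts and per-position letter-to-letter transition counts.
        length_counts = defaultdict(int)               # length -> count
        first = defaultdict(lambda: defaultdict(int))  # length -> first letter -> count
        trans = defaultdict(lambda: defaultdict(int))  # (length, position, letter) -> next letter -> count
        for w in words:
            n = len(w)
            length_counts[n] += 1
            first[n][w[0]] += 1
            for i in range(n - 1):
                trans[(n, i, w[i])][w[i + 1]] += 1
        return length_counts, first, trans

    def weighted_choice(counts):
        # Pick a key with probability proportional to its count.
        return random.choices(list(counts), weights=list(counts.values()))[0]

    def generate(length_counts, first, trans):
        n = weighted_choice(length_counts)     # step 1: pick a word length
        word = weighted_choice(first[n])       # step 2: pick a first letter
        for i in range(n - 1):                 # step 3: chain through the rest
            word += weighted_choice(trans[(n, i, word[-1])])
        return word

    # Toy demonstration; a real run reads thousands of words from a word list.
    model = train(["habit", "hatch", "haven", "corgi", "motch"])
    print(" ".join(generate(*model) for _ in range(10)))

Any letter reachable during generation was seen in the training data at that same position and word length, so the chain never gets stuck.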


Plusses and minuses

Plusses:

- The model is simple: nothing more than letter-pair counts, tallied separately for each word length.
- The output usually carries the flavor of its source language; as the samples above show, you can often name the language at a glance.

Minuses:

- Only adjacent letters are correlated, so un-English-looking bits such as oux or tss slip through (the three-letter model below does better).
- Accented characters are counted as two separate bytes, which skews letter pairs and word lengths (see Technical details below).
- Real words (egg, axe) occasionally come out alongside the nonsense.

Conclusion: This model generates some reasonable-looking words, as well as some unreasonable-looking ones. The results are better than completely random words such as mhhnj, xyeot, or ivgcv, but with more effort one could do even better!


Word lists

The word lists were all found via web searches, and are cached here. (Set your browser’s text encoding to Unicode after clicking through the French, Spanish, or German links, so that characters such as é will appear correctly.)


Word-length frequencies for English

How long do English words tend to be? It depends! One of the central issues in statistics is the sampling problem: when we want to collect information about English, which English utterances do we look at? Everything written since 1900 (too much data!), an abridged dictionary, an unabridged dictionary, the front page of today’s New York Times? What is a representative sample? If I picked not the front page of the Times but the sports page, would I still be getting a representative sample of English words? How many pages of the newspaper would be enough? The same issue confronts pollsters: any sample which is smaller than the entire population has its own randomness, introduced by the very selection of the sample.

So, the answers you get depend on the sample you pick. Here are word-length frequencies for the English 2000-word general-service list and the Scrabble dictionary, respectively:

      General-service list          Scrabble dictionary
    Word length Count Percent     Word length  Count Percent
    ----------- ----- -------     -----------  ----- -------
              1     2   0.08%               1      0   0.00%
              2    18   0.78%               2    102   0.05%
              3   170   7.44%               3   1015   0.56%
              4   493  21.58%               4   4030   2.25%
              5   472  20.66%               5   8938   5.00%
              6   390  17.07%               6  15788   8.83%
              7   293  12.82%               7  24029  13.44%
              8   178   7.79%               8  29766  16.65%
              9   128   5.60%               9  29150  16.31%
             10    87   3.80%              10  22326  12.49%
             11    28   1.22%              11  16165   9.04%
             12    13   0.56%              12  11417   6.38%
             13     7   0.30%              13   7750   4.33%
             14     4   0.17%              14   5059   2.83%
             15     1   0.04%              15   3157   1.76%
    -----------  ----             ----------- ------
    Total:       2284             Total:      178692

Note that the Scrabble dictionary has no one-letter words, since one-letter words aren’t legal in Scrabble. The Scrabble dictionary doesn’t go past 15-letter words, since the Scrabble board is 15×15. Coincidentally, the general-service 2000-word list (which actually has 2,284 words) has one 15-letter word (dissatisfaction) but nothing longer.

Also note that the English GSL-2000 words tend to be shorter than the 178,692 words of the Scrabble dictionary — as we would expect for more common words.
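
Tabulating a list like this takes only a few lines of Python. A sketch, assuming a plain-text word list with one word per line (wordlist.txt is a placeholder name):

    from collections import Counter

    # Count how many words in the list have each length.
    with open("wordlist.txt") as f:
        lengths = Counter(len(line.strip()) for line in f if line.strip())

    total = sum(lengths.values())
    for n in sorted(lengths):
        print("%11d %6d %6.2f%%" % (n, lengths[n], 100.0 * lengths[n] / total))
    print("Total:      %6d" % total)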


More output

Here’s some more output:

English GSL-2000 used as input:

  paghore secusise bauple shake mend bulcuessman mawyel arime resser veant compere abstint mecitric paty swich shit deare deliete obed sounden excrse wemase comevey reflory momerve blake mollor socker empape livesite weed wall potwove each mug varele everm roke soncose chivage elail grep imitulake nod inccist eshacle adeer desporitial piglagh cay fatton russ reg ponqueman rate consareart wint breage pidor wich stopy mint shoss cowhice ballive  

English Scrabble dictionary used as input:

  bynxer rearuoors stapee shoskioutic coeaties rencobucleins colferors mitelescks foxinint dioblpolates plambogial wrilite upebiff macta dispatusm phocleramamed prostimmate chedot pyliglin sppoxiter meschizes cyeended breished  
 


Three-letter correlations

If we correlate three letters at a time instead of two (that is, if each letter is conditioned on the previous two letters rather than just the previous one), the results are better. (In each of the examples below, I used the longest word list available for the given language.) We no longer get un-English-looking bits such as oux or tss:

  anacurbage plachers clobberil gusttion tarring ginfindingly shipwomer swinnes churency hygintods boatyers reectatted cashers kingling anything viscin specifief gota argoristed prectabilition doxiconmetts swanting jelemninas rephispating encioters deadstoned alts natchloins supprotophated camonsible doconic egoutsu panconden slitlest discoruartion taxilocum cumidiners dogitintably muckoical overretrams aurzy defs noncefesical conctilliship occurpins linictoxly aurfuself rioreates ravied pasters stewalorties murine hawkinesses ultiquing indebuses preocelissness exechips mantificatible chomarability orles comptuousnesses plex midable egoucker casbud culsies axe mancemied gransolonias cledicon mathenesses caysses perbarthorists praeming dabyins virsitude conkstice flionizations hompharableness presenciptinary translociship sverksives catster skilaparoments piquatrized pantance heassbopped pargendents sallotes cralless rubulassed finprayers nutpentations mistain prammetive dulmils receder pentordeus trapharancies jaistilles  

Here’s some multilingual output using the three-letter hierarchical model. Hopefully, it’s obvious by now which language is which!

  texueux roseras plaçâtes exhumèrent recturraient orileffé cinquetassions laissiez sauceptant astilette palleriez montrenards cainguassions resputosionnes résaïsmez enjupillâmes industionnerait ratît fausive étipitent attecturdas dînermé ponctionnerons klaineront glasées pannappors pilette succelanimerait habant condilluses débaillonnerez perpirez déconnantes démodonnerai défouts acondrions lamiteriez soussonnée affiardre jascutas vidoutasses regre-nèses délierais bronteraient aboues hydrocipait cambrais cahrages xérontivise roncélassez prévantiques étant sinquageoiserai enjoinguais inverture décorbât mestpode perollaiera controchâmes amientonnasse échelation rembilants careinèrent préliquerie préatterez groyer conjectisions parnifien parafeux torrèvera explissionnait tandgengla dimptanisatives briquilisant terinats trintiques intempterait décomationt bêtendâtes défrassités affrulaiens gamarquerons lansonnant potumorne perbureraient lanalises supilée horaillera bouip-tuassions dégîtes  

  perónimo bolón hipergametre progabozar cremornero sanfija buintivo desafandez morricete despraías avitasol curacipsis esmotorrar antupelante diacnograncia micrugamiento dicricarcear bisfato filamberecer estempolí mícleta nocupo encumanico elotando paleador zarífero manquear curioso malomperar senestrosia rasexuar desalificapio enseche taristero locentáceo cimoyata grecon taragratico hilicótana alcino fajo evaseación sanzanear menioso catulariamiento incunvolar acúlle motas raivero sobriopón ahope adenorretro relpar irrepciolísimo tranjear flusismo dedio amulero sochetón dítruco delazor puertorriqueño lablarainia quincuagonador vinencia guinquez engendenda varondar antitorra empredor plaurismo damargono traptarpo mesiedad porcular aceitertar desandiguible rapiñaque cizañras coquente rizo esfenatico ahilluta emparionario pacero viguito halocatoria ultenciada mustero canguno estalapra ligón desconstecimiento cardica lena rondable dividir dolteo tardache telial  

  Konditsstangige Böservolle Sirplen Grundecher Spreible Marsales techtausfälle Nah wohlassee mild Tängebianen schwachlosten Frakungen evandligte versaufehne Yelsch verschützen Probinus Handen träßcher dreindereis Halzten böhebrufen Blanit stagödien weliehmen Trupt Netrantagte Schmutenen anglagente abnaubt unbeitkrauer Terbst absensige Jetes Inlatschente Koopronsduo Mills Postenpland Erwalinanischer einprückt Bußerfere reilen Nathesann Gelasseist Teamaufsstade Juwenhert introbildenen insagte fründeuteses Balbstischen Rechbeschlegung höhegendeter Peitungengeher Harsten Ramorm klähunger Pflitärpolierheit Diskurverschnisitzung Watonois Glume feichmerket Besunkinn Heizschten Imre Eingezweirme Gleigessischer Algaber Mohnarte Aukürdett Econlosen Krack Nexy Ohrisetzen Materalrat Wucke entporten Nachnast stemlich umgige Kaminettaatos Asieg imprakrika Cladlult Stampfalle unweinabene abgenachlonnen Fesikommischef Bieler Mittbilitatzsittlinik Gastitätschaftliche viehrt unterristig volomende unritteit Erfizeingen anmische aufreogisten Wäschschrischt  

  epulo occlamo voculosum domitor nestum legumen pyus alumen immonitio ipsum luminarium volacrius inhibeo prohisus lennatio equino post invicero perviter eribro persona obvolla exteptor exibro abduco navito interrogatia alicus presto resisto vovi iniete forper dilevo equa solutantim maria elerlitio spera poeta id desteram verfera excensus volamitas dependo pautum libeficus etium soctus demplius sloratio exulto famessi impraesentialis aucron susquis apostagius auctitera tanquam perfidum formadvenus lascedi proponsiva reductor re rabolvo eadeo quasi cridne volubiliter esse aesena cortineo vertonsiva fortitier hano depremus triminatus prodigentio adsula meddera instito insanus runc sceleritus beatum croluxus fundo rigre flus incruscus annutum infandus acculo auxilium occasco dulceus aliquesco loci  
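
For the curious, the change from the two-letter sketch above is small: key the tables on the previous two letters instead of one. Again, a sketch with placeholder names rather than the actual program:

    import random
    from collections import defaultdict

    def train3(words):
        length_counts = defaultdict(int)
        starts = defaultdict(lambda: defaultdict(int))  # length -> first two letters -> count
        trans = defaultdict(lambda: defaultdict(int))   # (length, position, letter pair) -> next letter -> count
        for w in words:
            n = len(w)
            length_counts[n] += 1
            starts[n][w[:2]] += 1                       # for a 1-letter word this is the whole word
            for i in range(n - 2):
                trans[(n, i, w[i:i + 2])][w[i + 2]] += 1
        return length_counts, starts, trans

    def weighted_choice(counts):
        return random.choices(list(counts), weights=list(counts.values()))[0]

    def generate3(length_counts, starts, trans):
        n = weighted_choice(length_counts)
        word = weighted_choice(starts[n])               # first two letters at once
        for i in range(n - 2):
            word += weighted_choice(trans[(n, i, word[-2:])])
        return word

    model = train3(["church", "crunch", "clench"])      # toy input; use a real list
    print(" ".join(generate3(*model) for _ in range(5)))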


Technical details

The software is written in Python. Click here for the version using two-letter correlations, or here for the version using three-letter correlations. (The software for generating uncorrelated random words — this is what came up with xyeot, ivgcv, etc. above — is written in Python as well: click here.)
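
For comparison, an uncorrelated generator amounts to nothing but uniform draws. A sketch (the real program’s choice of word lengths may differ):

    import random
    import string

    # Completely random baseline: uniform letters, uniform length from 1 to 8.
    def uncorrelated_word(max_len=8):
        n = random.randint(1, max_len)
        return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

    print(" ".join(uncorrelated_word() for _ in range(10)))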

I don’t handle UTF-8 two-byte sequences (e.g. é, ö) as single characters. Rather, the two bytes are counted separately. Thus, été ends up being treated as a 5-letter word, and words such as fröären appear even though German never has ö followed by ä. A nice little enhancement to my program would be to instruct Python to treat the input files as Unicode, so that each letter with a diacritical mark would be treated as a single character.
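
In Python 3 that enhancement is a one-line change when reading the word list (wordlist.txt is again a placeholder name):

    # Decode the file as UTF-8 so each accented letter is a single character.
    with open("wordlist.txt", encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    print(len("été"))  # 3 once decoded, rather than 5 raw bytes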

