Language-specific random spelling: fesh, excenture, and the like

John Kerl
kerl.john.r@gmail.com
July 15, 2009

Contents:
The mathematical model  |  Plusses and minuses  |  Word lists  |  Word-length frequencies for English  |  More output  |  Three-letter correlations  |  Technical details


When I play Scrabble, I often think that what’s on my rack (motch, say) looks like it should be English — but (unfortunately for my score!) it isn’t in the dictionary.

Likewise, when I read translations of Lewis Carroll’s Jabberwocky (some of which are compiled here), it’s clear that even nonsense words can look like they certainly belong to a particular language: who could doubt the Englishness of “’Twas brillig, and the slithy toves did gyre and gimble in the wabe”, or the Frenchness of «Évite le frumieux Band-à-prend!»?

Question: Is it possible to encode the property of looking Englishy, Frenchish, and so on?

Here are some randomly generated words, which are results of my experiment to do just that. Hopefully, you can tell which language is which:

  adeal mather emoor snve fesh excenture fomposh cong egg recerante came opin warl weldury bang gomerick tower inyoad axe priecitant  

  figutes houpon tecourd’hui dinchode sans l’entreprise l’ésion si prenide muimesion dectacs auinde toyais bout qusicont placérde coluer doyeai a consicement  

  vor govio elanar habibisidar doy miear solosibar denco concucicer nepecar prantrzo pensamionto moladar pante tío sado numbiar dudo escrerir adqucés  

  jeden anssen Taten welbet bller eigerer ohnegen hohe aum senem getolen Abef muöls ihrar wowors wevor ihrd leinf alte veropäischen  


The mathematical model

My model uses Markov chains for letter-to-letter transition probabilities, arranged hierarchically by word length.

What that means is: when I generate a random word, I first choose a word length. To do that, I’ve consulted a word list for the particular language. If 21% of words in the list have 5 letters, then I give myself a 21% chance of generating a 5-letter word, and so on for other word lengths. Having selected the word length, the second step is to pick the letters. I’ve consulted the word list to find how often 5-letter words start with each letter — h, for example, is the first letter 2.5% of the time; suppose I pick h. Then I consult my data to see which letters follow h as the second letter of five-letter words, and how often, maybe picking a. Consulting my data to see which letters follow a as the third letter of a five-letter word, maybe I pick b: this gets me up to hab _ _. And so on, to the last letter of the word.

What makes this a Markov-chain model is that I only look at transition probabilities from one letter to the next — I don’t look at correlations between, say, the 1st and 4th letters. What makes it hierarchical is that I keep a separate Markov chain for each word length: the five-letter words don’t know what the four-letter words are doing.
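
In code, the whole recipe is just counting and then sampling from the counts. Here is a minimal Python sketch of the two-letter model (an illustration, not the actual program linked under Technical details; the function names and the toy word list are mine):

    import random
    from collections import defaultdict

    def train(words):
        # Tally word-length counts and, for each word length, first-letter
        # counts and per-position letter-to-letter transition counts.
        length_counts = defaultdict(int)               # length -> count
        first = defaultdict(lambda: defaultdict(int))  # length -> first letter -> count
        trans = defaultdict(lambda: defaultdict(int))  # (length, position, letter) -> next letter -> count
        for w in words:
            n = len(w)
            length_counts[n] += 1
            first[n][w[0]] += 1
            for i in range(n - 1):
                trans[(n, i, w[i])][w[i + 1]] += 1
        return length_counts, first, trans

    def weighted_choice(counts):
        # Pick a key with probability proportional to its count.
        return random.choices(list(counts), weights=list(counts.values()))[0]

    def generate(length_counts, first, trans):
        n = weighted_choice(length_counts)     # step 1: pick a word length
        word = weighted_choice(first[n])       # step 2: pick a first letter
        for i in range(n - 1):                 # step 3: chain through the rest
            word += weighted_choice(trans[(n, i, word[-1])])
        return word

    # Toy demonstration; a real run reads thousands of words from a word list.
    model = train(["habit", "hatch", "haven", "corgi", "motch"])
    print(" ".join(generate(*model) for _ in range(10)))

Any letter reachable during generation was seen in the training data at that same position and word length, so the chain never gets stuck.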


Plusses and minuses

Plusses:

- The model is simple: nothing more than letter-pair counts, tallied separately for each word length.
- The output usually carries the flavor of its source language; as the samples above show, you can often name the language at a glance.

Minuses:

- Only adjacent letters are correlated, so un-English-looking bits such as oux or tss slip through (the three-letter model below does better).
- Accented characters are counted as two separate bytes, which skews letter pairs and word lengths (see Technical details below).
- Real words (egg, axe) occasionally come out alongside the nonsense.

Conclusion: This model generates some reasonable-looking words, as well as some unreasonable-looking ones. The results are better than completely random words such as mhhnj, xyeot, or ivgcv, but with more effort one could do even better!


Word lists

The word lists were all found via web searches, and are cached here. (Set your browser’s text encoding to Unicode after clicking through the French, Spanish, or German links, so that characters such as é will appear correctly.)


Word-length frequencies for English

How long do English words tend to be? It depends! One of the central issues in statistics is the sampling problem: when we want to collect information about English, which English utterances do we look at? Everything written since 1900 (too much data!), an abridged dictionary, an unabridged dictionary, the front page of today’s New York Times? What is a representative sample? If I picked not the front page of the Times but the sports page, would I still be getting a representative sample of English words? How many pages of the newspaper would be enough? The same issue confronts pollsters: any sample which is smaller than the entire population has its own randomness, introduced by the very selection of the sample.

So, the answers you get depend on the sample you pick. Here are word-length frequencies for the English 2000-word general-service list and the Scrabble dictionary, respectively:

      General-service list          Scrabble dictionary
    Word length Count Percent     Word length  Count Percent
    ----------- ----- -------     -----------  ----- -------
              1     2   0.08%               1      0   0.00%
              2    18   0.78%               2    102   0.05%
              3   170   7.44%               3   1015   0.56%
              4   493  21.58%               4   4030   2.25%
              5   472  20.66%               5   8938   5.00%
              6   390  17.07%               6  15788   8.83%
              7   293  12.82%               7  24029  13.44%
              8   178   7.79%               8  29766  16.65%
              9   128   5.60%               9  29150  16.31%
             10    87   3.80%              10  22326  12.49%
             11    28   1.22%              11  16165   9.04%
             12    13   0.56%              12  11417   6.38%
             13     7   0.30%              13   7750   4.33%
             14     4   0.17%              14   5059   2.83%
             15     1   0.04%              15   3157   1.76%
    -----------  ----             ----------- ------
    Total:       2284             Total:      178692

Note that the Scrabble dictionary has no one-letter words, since one-letter words aren’t legal in Scrabble. The Scrabble dictionary doesn’t go past 15-letter words, since the Scrabble board is 15×15. Coincidentally, the general-service 2000-word list (which actually has 2,284 words) has one 15-letter word (dissatisfaction) but nothing longer.

Also note that the English GSL-2000 words tend to be shorter than the 178,692 words of the Scrabble dictionary — as we would expect for more common words.
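
Tabulating a list like this takes only a few lines of Python. A sketch, assuming a plain-text word list with one word per line (wordlist.txt is a placeholder name):

    from collections import Counter

    # Count how many words in the list have each length.
    with open("wordlist.txt") as f:
        lengths = Counter(len(line.strip()) for line in f if line.strip())

    total = sum(lengths.values())
    for n in sorted(lengths):
        print("%11d %6d %6.2f%%" % (n, lengths[n], 100.0 * lengths[n] / total))
    print("Total:      %6d" % total)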


More output

Here’s some more output:

English GSL-2000 used as input:

  paghore secusise bauple shake mend bulcuessman mawyel arime resser veant compere abstint mecitric paty swich shit deare deliete obed sounden excrse wemase comevey reflory momerve blake mollor socker empape livesite weed wall potwove each mug varele everm roke soncose chivage elail grep imitulake nod inccist eshacle adeer desporitial piglagh cay fatton russ reg ponqueman rate consareart wint breage pidor wich stopy mint shoss cowhice ballive  

English Scrabble dictionary used as input:

  bynxer rearuoors stapee shoskioutic coeaties rencobucleins colferors mitelescks foxinint dioblpolates plambogial wrilite upebiff macta dispatusm phocleramamed prostimmate chedot pyliglin sppoxiter meschizes cyeended breished  
 


Three-letter correlations

If we correlate three letters at a time instead of two (that is, if each letter is conditioned on the previous two letters rather than just the previous one), the results are better. (In each of the examples below, I used the longest word list available for the given language.) We no longer get un-English-looking bits such as oux or tss:

  anacurbage plachers clobberil gusttion tarring ginfindingly shipwomer swinnes churency hygintods boatyers reectatted cashers kingling anything viscin specifief gota argoristed prectabilition doxiconmetts swanting jelemninas rephispating encioters deadstoned alts natchloins supprotophated camonsible doconic egoutsu panconden slitlest discoruartion taxilocum cumidiners dogitintably muckoical overretrams aurzy defs noncefesical conctilliship occurpins linictoxly aurfuself rioreates ravied pasters stewalorties murine hawkinesses ultiquing indebuses preocelissness exechips mantificatible chomarability orles comptuousnesses plex midable egoucker casbud culsies axe mancemied gransolonias cledicon mathenesses caysses perbarthorists praeming dabyins virsitude conkstice flionizations hompharableness presenciptinary translociship sverksives catster skilaparoments piquatrized pantance heassbopped pargendents sallotes cralless rubulassed finprayers nutpentations mistain prammetive dulmils receder pentordeus trapharancies jaistilles  

Here’s some multilingual output using the three-letter hierarchical model. Hopefully, it’s obvious by now which language is which!

  texueux roseras plaçâtes exhumèrent recturraient orileffé cinquetassions laissiez sauceptant astilette palleriez montrenards cainguassions resputosionnes résaïsmez enjupillâmes industionnerait ratît fausive étipitent attecturdas dînermé ponctionnerons klaineront glasées pannappors pilette succelanimerait habant condilluses débaillonnerez perpirez déconnantes démodonnerai défouts acondrions lamiteriez soussonnée affiardre jascutas vidoutasses regre-nèses délierais bronteraient aboues hydrocipait cambrais cahrages xérontivise roncélassez prévantiques étant sinquageoiserai enjoinguais inverture décorbât mestpode perollaiera controchâmes amientonnasse échelation rembilants careinèrent préliquerie préatterez groyer conjectisions parnifien parafeux torrèvera explissionnait tandgengla dimptanisatives briquilisant terinats trintiques intempterait décomationt bêtendâtes défrassités affrulaiens gamarquerons lansonnant potumorne perbureraient lanalises supilée horaillera bouip-tuassions dégîtes  

  perónimo bolón hipergametre progabozar cremornero sanfija buintivo desafandez morricete despraías avitasol curacipsis esmotorrar antupelante diacnograncia micrugamiento dicricarcear bisfato filamberecer estempolí mícleta nocupo encumanico elotando paleador zarífero manquear curioso malomperar senestrosia rasexuar desalificapio enseche taristero locentáceo cimoyata grecon taragratico hilicótana alcino fajo evaseación sanzanear menioso catulariamiento incunvolar acúlle motas raivero sobriopón ahope adenorretro relpar irrepciolísimo tranjear flusismo dedio amulero sochetón dítruco delazor puertorriqueño lablarainia quincuagonador vinencia guinquez engendenda varondar antitorra empredor plaurismo damargono traptarpo mesiedad porcular aceitertar desandiguible rapiñaque cizañras coquente rizo esfenatico ahilluta emparionario pacero viguito halocatoria ultenciada mustero canguno estalapra ligón desconstecimiento cardica lena rondable dividir dolteo tardache telial  

  Konditsstangige Böservolle Sirplen Grundecher Spreible Marsales techtausfälle Nah wohlassee mild Tängebianen schwachlosten Frakungen evandligte versaufehne Yelsch verschützen Probinus Handen träßcher dreindereis Halzten böhebrufen Blanit stagödien weliehmen Trupt Netrantagte Schmutenen anglagente abnaubt unbeitkrauer Terbst absensige Jetes Inlatschente Koopronsduo Mills Postenpland Erwalinanischer einprückt Bußerfere reilen Nathesann Gelasseist Teamaufsstade Juwenhert introbildenen insagte fründeuteses Balbstischen Rechbeschlegung höhegendeter Peitungengeher Harsten Ramorm klähunger Pflitärpolierheit Diskurverschnisitzung Watonois Glume feichmerket Besunkinn Heizschten Imre Eingezweirme Gleigessischer Algaber Mohnarte Aukürdett Econlosen Krack Nexy Ohrisetzen Materalrat Wucke entporten Nachnast stemlich umgige Kaminettaatos Asieg imprakrika Cladlult Stampfalle unweinabene abgenachlonnen Fesikommischef Bieler Mittbilitatzsittlinik Gastitätschaftliche viehrt unterristig volomende unritteit Erfizeingen anmische aufreogisten Wäschschrischt  

  epulo occlamo voculosum domitor nestum legumen pyus alumen immonitio ipsum luminarium volacrius inhibeo prohisus lennatio equino post invicero perviter eribro persona obvolla exteptor exibro abduco navito interrogatia alicus presto resisto vovi iniete forper dilevo equa solutantim maria elerlitio spera poeta id desteram verfera excensus volamitas dependo pautum libeficus etium soctus demplius sloratio exulto famessi impraesentialis aucron susquis apostagius auctitera tanquam perfidum formadvenus lascedi proponsiva reductor re rabolvo eadeo quasi cridne volubiliter esse aesena cortineo vertonsiva fortitier hano depremus triminatus prodigentio adsula meddera instito insanus runc sceleritus beatum croluxus fundo rigre flus incruscus annutum infandus acculo auxilium occasco dulceus aliquesco loci  
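
For the curious, the change from the two-letter sketch above is small: key the tables on the previous two letters instead of one. Again, a sketch with placeholder names rather than the actual program:

    import random
    from collections import defaultdict

    def train3(words):
        length_counts = defaultdict(int)
        starts = defaultdict(lambda: defaultdict(int))  # length -> first two letters -> count
        trans = defaultdict(lambda: defaultdict(int))   # (length, position, letter pair) -> next letter -> count
        for w in words:
            n = len(w)
            length_counts[n] += 1
            starts[n][w[:2]] += 1                       # for a 1-letter word this is the whole word
            for i in range(n - 2):
                trans[(n, i, w[i:i + 2])][w[i + 2]] += 1
        return length_counts, starts, trans

    def weighted_choice(counts):
        return random.choices(list(counts), weights=list(counts.values()))[0]

    def generate3(length_counts, starts, trans):
        n = weighted_choice(length_counts)
        word = weighted_choice(starts[n])               # first two letters at once
        for i in range(n - 2):
            word += weighted_choice(trans[(n, i, word[-2:])])
        return word

    model = train3(["church", "crunch", "clench"])      # toy input; use a real list
    print(" ".join(generate3(*model) for _ in range(5)))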


Technical details

The software is written in Python. Click here for the version using two-letter correlations, or here for the version using three-letter correlations. (The software for generating uncorrelated random words — this is what came up with xyeot, ivgcv, etc. above — is written in Python as well: click here.)
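
For comparison, an uncorrelated generator amounts to nothing but uniform draws. A sketch (the real program’s choice of word lengths may differ):

    import random
    import string

    # Completely random baseline: uniform letters, uniform length from 1 to 8.
    def uncorrelated_word(max_len=8):
        n = random.randint(1, max_len)
        return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

    print(" ".join(uncorrelated_word() for _ in range(10)))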

I don’t handle UTF-8 two-byte sequences (e.g. é, ö) as single characters. Rather, the two bytes are counted separately. Thus, été ends up being treated as a 5-letter word, and words such as fröären appear even though German never has ö followed by ä. A nice little enhancement to my program would be to instruct Python to treat the input files as Unicode, so that each letter with a diacritical mark would be treated as a single character.
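
In Python 3 that enhancement is a one-line change when reading the word list (wordlist.txt is again a placeholder name):

    # Decode the file as UTF-8 so each accented letter is a single character.
    with open("wordlist.txt", encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    print(len("été"))  # 3 once decoded, rather than 5 raw bytes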

