A MATHEMATICAL / STATISTICAL TECHNIQUE
FOR VOWEL ISOLATION
IN SIMPLE-SUBSTITUTION CIPHERTEXTS

Donald R. Burleson, Ph.D.
(American Cryptogram Association nom: CTHULHU)

Copyright © 2007, 2020 by Donald R. Burleson. All rights reserved.

In 1989 I discovered a new (but simple) algorithm for isolating vowels in a simple-substitution cryptogram.
As a member of the American Cryptogram Association I published an article
("The ‘MACC’—A Statistical Technique for Vowel Isolation," by-lined as CTHULHU,
my member "nom" in the ACA) in the March-April 1989 issue of the official ACA journal
The Cryptogram. During the intervening years, I have tested the method extensively,
finding it quite satisfactory, but until now (January 2007) I have not written
about the algorithm for a broader audience.

To illustrate the method here, I will use a cryptographic problem (number A-20)
from the September-October 2006 issue of The Cryptogram, composed by ACA member PETROUSHKA
and reprinted here by permission of the American Cryptogram Association.
On this problem my vowel isolation algorithm turns out to work perfectly.
The technique does not always work this well, but in general it works well enough
to assist greatly in the solution of difficult cryptograms such as this one:

AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ
TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN.

To apply my technique, which I call MACC (the "Mean Associated Contact Count"),
I first do a frequency count of the ciphertext letters and list them
from highest to lowest frequency of occurrence:

(10) K; (9) Z; (8) X; (7) A, N; (6) Q, W; (5) B, I, P, U, Y; (4) L, S; (3) G, M, T; (2) J, V.
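
This tally is easy to reproduce mechanically. The following is a minimal Python sketch, not a definitive implementation; the names CIPHERTEXT, letter_frequencies, and frequency_listing are illustrative assumptions. It counts letters only, ignoring spaces and punctuation, and groups letters that share a count.

```python
from collections import Counter

# The example ciphertext, copied from the problem as printed above.
CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ "
    "TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def letter_frequencies(text):
    """Count alphabetic characters only, ignoring spaces and punctuation."""
    return Counter(ch for ch in text.upper() if ch.isalpha())


def frequency_listing(text):
    """Return [(count, [letters]), ...] ordered from highest to lowest count."""
    groups = {}
    for letter, count in letter_frequencies(text).items():
        groups.setdefault(count, []).append(letter)
    return [(count, sorted(groups[count])) for count in sorted(groups, reverse=True)]


for count, letters in frequency_listing(CIPHERTEXT):
    print(f"({count}) {', '.join(letters)}")
```

Run on the ciphertext above, the sketch prints the same groups as the hand count, from (10) K down to (2) J, V.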

Next, for each ciphertext letter, I list the distinct letters that the given letter contacts
(stands adjacent to within a word), and I record the number of such distinct adjacent letters
as the given letter’s "variety of contact count," or VCC. This gives VCC values of:

K(12), Z(13), X(11), A(8), N(12), Q(6), W(6), B(7), I(7), P(7), U(5), Y(6), L(6), S(6), G(2),
M(3), T(4), J(4), V(2).
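
The contact lists can also be built by program. The sketch below is one possible way to do it in Python; the function names contact_sets and variety_of_contact_counts are assumptions made for illustration. It treats two letters as contacts only when they stand side by side inside the same word. Run on the ciphertext exactly as printed, it agrees with the hand count except for W, which it credits with seven distinct contacts rather than six, because W stands next to Z in VNWKQZWG.

```python
import re

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ "
    "TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def contact_sets(text):
    """Map each letter to the set of distinct letters adjacent to it within a word."""
    contacts = {}
    for word in re.findall(r"[A-Z]+", text.upper()):
        for left, right in zip(word, word[1:]):
            contacts.setdefault(left, set()).add(right)
            contacts.setdefault(right, set()).add(left)
    return contacts


def variety_of_contact_counts(text):
    """VCC: the number of distinct contacts recorded for each letter."""
    return {letter: len(adjacent) for letter, adjacent in contact_sets(text).items()}


vcc = variety_of_contact_counts(CIPHERTEXT)
# Print in the same order as the frequency list.
print(" ".join(f"{letter}({vcc[letter]})" for letter in "KZXANQWBIPUYLSGMTJV"))
```

A letter that occurs only in one-letter words would have no contacts and would not appear in the result; no such letter occurs in this example.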

Then, in each letter’s adjacent-letter list, I record each adjacent letter’s own VCC
and add the VCCs for each list, then divide this total by the root letter’s own VCC
to produce an average, the Mean Associated Contact Count (MACC, the average number of contacts
a given letter’s contacts themselves have), computed for each separate letter:

For K: P(7), W(6), Q(6), Y(6), X(11), I(7), U(5), A(8), Z(13), M(3), L(6), N(12).
TOTAL 90. MACC = 90 / 12 = 7.500.

For Z: V(2), I(7), A(8), T(4), Q(6), S(6), M(3), W(6), P(7), U(5), K(12), X(11), Y(6).
TOTAL 83. MACC = 83 / 13 = 6.385.

For X: I(7), U(5), T(4), K(12), J(4), Y(6), W(6), L(6), Q(6), Z(13), A(8).
TOTAL 77. MACC = 77 / 11 = 7.000.

For A: Q(6), B(7), Z(13), J(4), K(12), N(12), X(11), Y(6).
TOTAL 71. MACC = 71 / 8 = 8.875.

For N: Q(6), I(7), S(6), B(7), P(7), V(2), W(6), J(4), L(6), A(8), Y(6), K(12).
TOTAL 77. MACC = 77 / 12 = 6.417.

For Q: A(8), N(12), U(5), K(12), Z(13), X(11).
TOTAL 61. MACC = 61 / 6 = 10.167.

For W: K(12), G(2), N(12), B(7), P(7), X(11).
TOTAL 51. MACC = 51 / 6 = 8.500.

For B: N(12), A(8), L(6), S(6), I(7), W(6), T(4).
TOTAL 49. MACC = 49 / 7 = 7.000.

For I: N(12), P(7), Z(13), X(11), K(12), U(5), B(7).
TOTAL 67. MACC = 67 / 7 = 9.571.

For P: I(7), K(12), N(12), W(6), Z(13), S(6), L(6).
TOTAL 62. MACC = 62 / 7 = 8.857.

For U: X(11), Q(6), I(7), K(12), Z(13).
TOTAL 49. MACC = 49 / 5 = 9.800.

For Y: K(12), S(6), X(11), N(12), A(8), Z(13).
TOTAL 62. MACC = 62 / 6 = 10.333.

For L: B(7), T(4), N(12), P(7), X(11), K(12).
TOTAL 53. MACC = 53 / 6 = 8.833.

For S: N(12), B(7), Y(6), Z(13), P(7), J(4).
TOTAL 49. MACC = 49 / 6 = 8.167.

For G: W(6), M(3).
TOTAL 9. MACC = 9 / 2 = 4.500.

For M: Z(13), K(12), G(2).
TOTAL 27. MACC = 27 / 3 = 9.000.

For T: Z(13), X(11), L(6), B(7).
TOTAL 37. MACC = 37 / 4 = 9.250.

For J: A(8), X(11), S(6), N(12).
TOTAL 37. MACC = 37 / 4 = 9.250.

For V: Z(13), N(12).
TOTAL 25. MACC = 25 / 2 = 12.500.

Now I rank the ciphertext letters by listing them from lowest to highest
corresponding MACC values:

G Z N X B K S W L P A M T J I U Q Y V
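
The whole computation, from contact lists to ranking, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the published program of any MACC user: the function names are invented here, and ties are broken alphabetically, a choice the method itself does not prescribe. Because a mechanical tally of the ciphertext as printed gives W seven distinct contacts (it stands next to Z in VNWKQZWG), a few computed values differ slightly from the hand-worked figures, and W falls a few places in the computed ranking; the six lowest letters, G Z N X B K, come out the same.

```python
import re

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ "
    "TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def contact_sets(text):
    """Map each letter to the set of distinct letters adjacent to it within a word."""
    contacts = {}
    for word in re.findall(r"[A-Z]+", text.upper()):
        for left, right in zip(word, word[1:]):
            contacts.setdefault(left, set()).add(right)
            contacts.setdefault(right, set()).add(left)
    return contacts


def macc_table(text):
    """MACC of a letter: the sum of its contacts' VCCs divided by its own VCC."""
    contacts = contact_sets(text)
    vcc = {letter: len(adjacent) for letter, adjacent in contacts.items()}
    return {
        letter: sum(vcc[c] for c in adjacent) / vcc[letter]
        for letter, adjacent in contacts.items()
    }


def macc_ranking(text):
    """Letters from lowest to highest MACC; ties broken alphabetically (an assumption)."""
    table = macc_table(text)
    return sorted(table, key=lambda letter: (table[letter], letter))


table = macc_table(CIPHERTEXT)
for letter in macc_ranking(CIPHERTEXT):
    print(f"{letter}: {table[letter]:.3f}")
```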

The idea behind the algorithm is that, at least in theory, the vowels
should strongly tend to "float" to the top, or near the top, of the ranking.
Practice bears this out significantly: I have solved hundreds of difficult cryptograms this way.
In this case, the leading symbols G, Z, N, X, B, K, … are statistically
the most likely to stand for vowels.

The reasons for this are fairly elementary. Vowels contact a richer variety of letters
than consonants do; consonants, by contrast, typically contact relatively few
different letters. For each ciphertext letter, there are two things that can make
the Mean Associated Contact Count (MACC) have a low numerical value so that the letter
floats to a position near the top of the distribution:
(1) the denominator in the fraction TOTAL / VCC being large, denoting a high number of different contact letters; and
(2) the numerator in the fraction TOTAL / VCC being small, due to many of the contact letters having a
low number of contacts themselves, suggesting that they are most likely consonants.
Both effects point toward the "root" letter (for which the contact list
has been made) being a vowel.
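
In symbols, the fraction TOTAL / VCC can be written as follows. The notation C(x), for the set of distinct letters that the ciphertext letter x contacts, is introduced here only for illustration and does not come from the original article.

```latex
\mathrm{VCC}(x) = \lvert C(x) \rvert ,
\qquad
\mathrm{MACC}(x) = \frac{\mathrm{TOTAL}(x)}{\mathrm{VCC}(x)}
                 = \frac{1}{\lvert C(x) \rvert} \sum_{y \in C(x)} \mathrm{VCC}(y)
```

A large |C(x)| enlarges the denominator (effect 1), and contacts y with small VCC(y) shrink the sum in the numerator (effect 2).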

One now takes note of the ciphertext letters leading off the MACC ranking, i.e. the letters
most likely to stand for vowels; one examines the positions of such letters in the
ciphertext words; and fortified by these observations one uses one’s knowledge of word
structure and sentence structure to proceed with the processes of solution in the usual ways
familiar to the cryptanalyst. In this case the solution turns out to be:

STODGILY PEDANTIC HOUSEMAID, GOURMET HUSBAND CHEF POLITELY INDULGE MUSICAL
NEIGHBOR, GRATIFY SOCIAL FEAST SCENARIO.

This solution would have been a great deal more difficult without the vowel isolation procedure described.
If we list the ciphertext letters and give their ultimate plaintext equivalents as known with the solution
complete, we see that the vowels have indeed floated to the top of the distribution:

Cipher: G Z N X B K S W L P A M T J I U Q Y V
Plain:  Y E O A U I H L R G S F M B D N T C P
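
The table can be checked mechanically. The short Python sketch below (the names KEY, decrypt, and vowels_in_top are illustrative assumptions) applies the cipher-to-plain pairs above to the ciphertext and counts how many of the leading letters in the ranking stand for vowels.

```python
CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ "
    "TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)

# MACC ranking and plaintext equivalents, both copied from the text.
RANKING = "GZNXBKSWLPAMTJIUQYV"
PLAIN = "YEOAUIHLRGSFMBDNTCP"
KEY = dict(zip(RANKING, PLAIN))  # cipher letter -> plaintext letter


def decrypt(text, key):
    """Substitute each cipher letter; leave spaces and punctuation unchanged."""
    return "".join(key.get(ch, ch) for ch in text)


def vowels_in_top(ranking, key, k, vowels="AEIOU"):
    """Count how many of the first k ranked cipher letters stand for vowels."""
    return sum(1 for cipher_letter in ranking[:k] if key[cipher_letter] in vowels)


print(decrypt(CIPHERTEXT, KEY))
print(vowels_in_top(RANKING, KEY, 6))            # counting A, E, I, O, U only
print(vowels_in_top(RANKING, KEY, 6, "AEIOUY"))  # counting Y as a vowel too
```

For this example the first six ranked letters account for A, E, I, O, U, and Y, so the two counts are 5 and 6.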

Sometimes the vowel isolation algorithm may produce some anomalies.
For example, the "liquid" consonants L and R are notorious vowel imitators,
and they may "float up" among the vowels; note that in the example given,
the plaintext letters L and R do end up fairly close to the top of the distribution;
this is typical. Other letters, for one reason or another, may occasionally
"float up" as well.

However, experience (my own and that of other cryptanalysts who have used the method,
in some cases writing computer programs to run it) has shown that the method always
works well enough to help identify most of the vowels, leading to a speedier solution
of the whole problem. (As a rough rule of thumb, among the top seven or so letters
in the MACC ranking, at least four can usually be assumed to be vowels.)
My own statistical analysis of large bodies of text has shown that, not surprisingly,
the longer the ciphertext, the more successfully the MACC method isolates all the vowels,
or at least A, E, I, O, and U (Y being a semi-consonant and thus not quite so reliable).
My study of large amounts of English-language text suggests that overall the MACC ranking
tends to run approximately as follows:

E O A I U L R N C P S T D G B H Y X M K W F V Q J Z

(It is understood that not all twenty-six letters may be present in short ciphertexts.)

Thus for sufficiently long texts, the five major vowels float to the top of the distribution
closely followed by the liquids L and R, the behavior of the rest of the alphabet being somewhat
less describable. Even in shorter texts, which can be viewed as statistical samples
of the workings of the language as a whole, the tendency of the vowels to move to the top
is reliable enough (as statistical sample behavior reflecting the tendencies of the language
"population" at large) to be exceedingly helpful. And as vowels are the morphological sites
at which words "breathe," their identification can only facilitate cryptanalytic solution in general.

A natural question arises: does this technique still work well for simple substitution ciphers
without word divisions? Such ciphers of course exhibit "false adjacencies," since the last letter of one
word appears to be adjacent to the first letter of the next when the word division is not shown.
One might therefore expect the method to fail, but rather surprisingly, in my experience it still tends
to work reasonably well on such ciphers. I have at times worked out examples in which all five vowels
come to the top even without word divisions. One should keep in mind, though, that "false adjacencies"
may increase a given cipher letter's variety of contact count
enough to give the letter more credit for "vowel-hood" than it may merit.
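
The effect of false adjacencies can be explored with a small variation on the contact tally. The Python sketch below is illustrative only; the keep_word_divisions parameter and the function names are assumptions. With the flag turned off, it runs all the words together before pairing neighbors. In the example cryptogram, G then picks up V, K, and A as false contacts in addition to its true contacts W and M.

```python
import re

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ "
    "TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def contact_sets(text, keep_word_divisions=True):
    """Distinct contacts per letter, with or without respecting word breaks."""
    words = re.findall(r"[A-Z]+", text.upper())
    if not keep_word_divisions:
        words = ["".join(words)]  # treat the whole text as one unbroken string
    contacts = {}
    for word in words:
        for left, right in zip(word, word[1:]):
            contacts.setdefault(left, set()).add(right)
            contacts.setdefault(right, set()).add(left)
    return contacts


def macc_ranking(text, keep_word_divisions=True):
    """Letters from lowest to highest MACC; ties broken alphabetically (an assumption)."""
    contacts = contact_sets(text, keep_word_divisions)
    vcc = {letter: len(adjacent) for letter, adjacent in contacts.items()}
    macc = {
        letter: sum(vcc[c] for c in adjacent) / vcc[letter]
        for letter, adjacent in contacts.items()
    }
    return sorted(macc, key=lambda letter: (macc[letter], letter))


print("with divisions:   ", " ".join(macc_ranking(CIPHERTEXT)))
print("without divisions:", " ".join(macc_ranking(CIPHERTEXT, keep_word_divisions=False)))
```

Comparing the two printed rankings for a given cipher shows how far the false contacts disturb the order.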

Finally I should mention that while the basic technique described here works quite well in the great majority
of cases, it is possible to experiment with some extensions of the method, and I will briefly describe one such
extension with which I have experimented.

After doing the basic MACC procedure described above, so that one has a mean associated contact count (MACC) for
each cipher letter, one can go back to the point at which one has only listed each given cipher letter's
contacts and copy in, next to each of those contacts, not the variety of contact count as in the basic procedure, but
rather the contacting letter's MACC value itself. One then averages these for each cipher letter, adding the associated-MACC
values and dividing by the given cipher letter's own variety of contact count, producing MMACC, the mean-mean associated contact count.
For a cipher letter that ends up representing a plaintext vowel, however, this averages relatively large
values, since most of such a letter's contacts are consonants
with relatively large MACC values. So after averaging the given cipher letter's associated-MACC values I reverse the
lowest-to-highest order of the original MACC ranking and rank the MMACC values from
highest to lowest, expecting the vowels by and large to gravitate to the head of the ranking.
This extension may sometimes take what would have been a misleadingly upward-floating low-frequency
consonant and move it farther to the right in the ranking. It may provide a modest improvement
in the ranking, though sometimes no improvement is actually needed.
In any event, a completely worked out example of the enhancement, complete with worksheets, can be seen at
www.blackmesapress.com/Crypto2.htm .
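
As a rough illustration of the extension as described in the paragraph above, the following Python sketch computes an MMACC for each letter and ranks from highest to lowest. It is a reading of that description, not a transcription of the worksheets at the address above; the function names and the alphabetical tie-breaking are assumptions.

```python
import re

CIPHERTEXT = (
    "AQNIPKWG VZIXUQKY SNBAZTXKI, PNBLTZQ SBAJXUI YSZM VNWKQZWG KUIBWPZ "
    "TBAKYXW UZKPSJNL, PLXQKMG ANYKXW MZXAQ AYZUXLKN."
)


def contact_sets(text):
    """Map each letter to the set of distinct letters adjacent to it within a word."""
    contacts = {}
    for word in re.findall(r"[A-Z]+", text.upper()):
        for left, right in zip(word, word[1:]):
            contacts.setdefault(left, set()).add(right)
            contacts.setdefault(right, set()).add(left)
    return contacts


def macc_table(contacts):
    """Basic MACC: the sum of the contacts' VCCs divided by the letter's own VCC."""
    vcc = {letter: len(adjacent) for letter, adjacent in contacts.items()}
    return {
        letter: sum(vcc[c] for c in adjacent) / vcc[letter]
        for letter, adjacent in contacts.items()
    }


def mmacc_table(text):
    """MMACC: the mean of the MACC values of a letter's contacts."""
    contacts = contact_sets(text)
    macc = macc_table(contacts)
    return {
        letter: sum(macc[c] for c in adjacent) / len(adjacent)
        for letter, adjacent in contacts.items()
    }


def mmacc_ranking(text):
    """Letters from highest to lowest MMACC; ties broken alphabetically (an assumption)."""
    table = mmacc_table(text)
    return sorted(table, key=lambda letter: (-table[letter], letter))


mmacc = mmacc_table(CIPHERTEXT)
for letter in mmacc_ranking(CIPHERTEXT):
    print(f"{letter}: {mmacc[letter]:.3f}")
```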

Often I find that the basic procedure suffices: list each cipher letter's contacts (including the letter itself
in the case of a doubled letter, though doubled letters are infrequent in difficult cryptograms and are usually consonants
anyway), write in the contacting letters' own variety of contact counts, average these by adding them up and dividing by
the given cipher letter's variety of contact count to produce a MACC (mean associated contact count) for each cipher letter,
rank the cipher letters from smallest to largest MACC value, and generally expect to see the letters standing for
plaintext vowels come to the top of the distribution.