Statistical Vowel Isolation in a Ciphertext: An Enhancement

Donald R. Burleson, Ph.D.

Copyright (c) 2020 by Donald R. Burleson. All rights reserved.


Previously (see blackmesapress.com/crypto.htm) I discovered and wrote about a mathematical/statistical technique
for isolating vowels in a simple-substitution ciphertext. The present article describes an enhancement or extension of that method.

The idea behind the original method was simply that vowels tend to be contacted by (i.e., tend to stand adjacent to)
more different letters than consonants do, and specifically that vowels thus are typically contacted by a relatively large
number of letters most of which are themselves contacted by relatively few letters.

To capitalize upon this idea, we may take a ciphertext and proceed as follows (see Figure 1 below):

First do a frequency count of the cipher letters and arrange them in order of decreasing frequency.
Then in a column beneath each cipher letter, list the letters that stand adjacent to the given letter,
including possibly the given cipher letter itself in the case of a doubled letter.
In the block immediately below each cipher letter, write the number of adjacencies that letter has;
we will call this the variety-of-contact count (VCC).
Next, for each cipher letter, in the column beneath it where the adjacencies are listed, copy in the VCC value
for each of the contacts.
Then, for each cipher letter, add the VCC values in the letter's column and divide by the number of items added,
to compute a simple average or mean. This will be called the Mean Associated Contact Count (MACC) for each cipher letter.
After the MACC value is computed for each cipher letter, RANK the cipher letters in order of smallest-to-largest value of MACC.
In general we will expect MACC values to be small for vowels and larger for consonants, so that
the tendency is for the cipher letters standing for vowels to gravitate to the head of the ranking.
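
For readers who prefer to automate the worksheet, the steps above can be sketched in Python. This is only a minimal sketch under stated assumptions, not a transcription of the worksheet in Figure 1: it assumes that contacts are counted only within words (never across spaces or punctuation) and that the VCC counts distinct contacting letters. The function names contact_sets, macc_values and rank_by_macc are hypothetical. Arranging the letters by frequency affects only the layout of the worksheet, so that step is omitted, and ties are broken alphabetically.

```python
import re
from collections import defaultdict


def contact_sets(ciphertext):
    """Map each cipher letter to the set of distinct letters adjacent to it.

    Assumption: adjacency is counted only inside words, so spaces and
    punctuation break contact. A doubled letter contacts itself.
    """
    contacts = defaultdict(set)
    for word in re.findall(r"[A-Z]+", ciphertext.upper()):
        for left, right in zip(word, word[1:]):
            contacts[left].add(right)
            contacts[right].add(left)
    return dict(contacts)


def macc_values(ciphertext):
    """Mean Associated Contact Count for every letter that has a contact."""
    contacts = contact_sets(ciphertext)
    # Variety-of-contact count (VCC): number of distinct contacts.
    vcc = {letter: len(adj) for letter, adj in contacts.items()}
    # MACC: mean of the VCC values of a letter's contacts.
    return {
        letter: sum(vcc[c] for c in adj) / len(adj)
        for letter, adj in contacts.items()
    }


def rank_by_macc(ciphertext):
    """Letters ordered from smallest to largest MACC (ties alphabetical)."""
    macc = macc_values(ciphertext)
    return sorted(macc, key=lambda letter: (macc[letter], letter))


# Toy input: B touches both A and C, which each touch only B.
print(rank_by_macc("ABA CB"))
```

Calling rank_by_macc on a full cryptogram gives a candidate vowel ordering; if the contact-counting conventions assumed here differ from those used on a hand worksheet, the resulting values and ranking may differ as well.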





Figure 1



As an illustration I will use the cryptogram A-21 from the September-October 2019 issue of The Cryptogram,
the official journal of the American Cryptogram Association.
(I employed this same example in the virtual slide show
presentation that I gave for the annual ACA convention in September 2020, but that presentation did not have time
or space to show the details of the enhancement/extension which follows.) The cryptogram reads as follows:


VEFXFVDMY HRQXDMY UXDI JVDAMY BXCSH, RMVHSI VFYXMY MU

HSRVDMY HSWZEMY YSHVAADUZEA ZE JXUS TXY QZH XVHUSY.


When the MACC procedure described above is done with this ciphertext, the result is the worksheet
shown in Figure 1, where the computed MACC values for the cipher letters are shown across the top of the page.

When the cipher letters are ranked from smallest MACC-value to largest MACC-value, the cipher letters' ranking is:

S X Z E V M W H R D A I U F Q Y J C B T

where underlining here represents letters with tied ranks.

After the solution is in hand, one sees that the corresponding plaintext ranking is

E O I N U A M S C L G D T P H R V K J F

and one sees that except for a slight irregularity (the plaintext letter N coming up with the vowels)
the process has resulted mostly in the vowels floating to the head of the distribution.

While the basic MACC technique just illustrated works remarkably well, in my experience of solving countless cryptograms
with its aid, it does admit a few irregularities, as the "floating up N" instance in the example shows. It is possible
to entertain various ideas for enhancing or extending the MACC method, and I would like here to describe a particular enhancement
with which I have experimented at length, with considerable success.

Here is how the enhancement works. For a given ciphertext (I will again use the above cryptogram as an example),
first do the basic MACC procedure just described.

Then when a MACC value has been computed for each cipher letter, make a new worksheet as before, with the cipher letters listed
in order of descending frequency, and construct a column beneath each cipher letter, with that letter's contacts listed as before.

Next, in the column below each cipher letter, where before we copied in that cipher letter's contacts' own variety-of-contact counts,
now we copy in, beside each contacting letter in the column, that letter's own MACC value. (See Figure 2.)

Now, similarly to what we originally did with the associated contact counts, for each cipher letter we add up the MACC
values for the contacting letters in the column and divide by the number added, computing again an ordinary average or mean,
which we will call the "mean mean associated contact count" (MMACC) for each cipher letter. In Figure 2, these MMACC values
are listed across the top of the page as the ordinary MACC values were before.
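
The MMACC step repeats the same averaging, only with MACC values in place of contact counts, and that can be sketched in Python as a single helper applied twice. This is a minimal sketch: the helper name mean_over_contacts and the three-letter contact table are hypothetical and are not taken from the cryptogram or from Figure 2, and the VCC is assumed to count distinct contacting letters.

```python
def mean_over_contacts(contacts, values):
    """For each letter, average `values` over that letter's contacts."""
    return {
        letter: sum(values[c] for c in adj) / len(adj)
        for letter, adj in contacts.items()
        if adj  # a letter with no contacts has no mean
    }


# Hypothetical contact table: B touches A and C; A and C touch only B.
contacts = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

# Variety-of-contact counts, then the two successive means.
vcc = {letter: len(adj) for letter, adj in contacts.items()}
macc = mean_over_contacts(contacts, vcc)    # first pass: MACC
mmacc = mean_over_contacts(contacts, macc)  # second pass: MMACC

print(macc)
print(mmacc)
```

In this toy table the widely contacted letter B receives the smallest MACC and the largest MMACC, which is the pattern one hopes to see for a vowel.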

The question now arises: how do we rank the cipher letters in terms of their MMACC values? A moment's reflection will show
that for a cipher letter probably standing for a plaintext vowel, the situation has in a sense reversed. Before, the variety-of-contact
counts listed in its column were expected to be low, since a vowel contacts mostly letters with relatively low contact counts. But now,
with MACC values themselves listed beside the contacts in the column below a given cipher letter, those MACC values can be expected
to be high for the contact letters listed below a probable vowel, since those contacts are mostly consonants.

My first inclination was to preserve the ranking character of the original MACC method, which is lowest-to-highest ranking, and one way to
do this would have been to take the reciprocal of each MMACC value and then rank those from lowest to highest. But a quicker way to
accomplish the same thing, if one is willing to employ a reversal of direction in the ranking technique, is to leave the MMACC values
as they are and simply rank the corresponding cipher letters from highest MMACC value to lowest MMACC value, again expecting then
the vowels to float to the top of the distribution.
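
That the two ranking options agree can be checked with a short Python sketch. It relies on the fact that MMACC values are positive (every letter with a contact has a contact count of at least 1), so taking reciprocals exactly reverses their order. The function names and the sample values below are hypothetical and are not taken from Figure 2; ties are broken alphabetically in both functions.

```python
def rank_high_to_low(values):
    """Letters ordered from highest to lowest value (ties alphabetical)."""
    return sorted(values, key=lambda letter: (-values[letter], letter))


def rank_by_reciprocal(values):
    """Letters ordered from lowest to highest reciprocal of the value."""
    return sorted(values, key=lambda letter: (1.0 / values[letter], letter))


# Made-up MMACC values for three letters, for illustration only.
example = {"A": 1.0, "B": 2.0, "C": 1.5}

print(rank_high_to_low(example))
print(rank_by_reciprocal(example))
```

Both calls return the same ordering, so ranking directly from highest to lowest saves the extra division.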

When one does this with the cryptogram in the example, one obtains a cipher letter ranking

X S Z V M A D E F R J H Y U I Q B T W C


which corresponds, after one has the solution in hand, to a plaintext letter ranking

O E I U A G L N P C V S R T D H J F M K .


As one will observe, the vowels have done a splendid job of floating to the top of the distribution.





Figure 2