Simple substitution ciphers

Section 3.1 Simple substitution ciphers

A substitution cipher is a permutation of the letters in an alphabet. Recall that an alphabet is a finite set of symbols, and that we will use two special set notations: \(\mathcal{E}\) to denote the set of capital letters of the English alphabet, and \([n]=\set{1,2,3,\dotsc,n}\text{.}\)

Subsection 3.1.1 Monoalphabetic substitution

Also called simple substitution, encryption of a message by performing a single permutation is a monoalphabetic substitution. The Caesar cipher is one of the earliest known uses, and was a simple three-character shift of the alphabet. Represented in two-line notation over a ten-element set range(10), this is

\begin{equation*} \sigma = \begin{bmatrix} 0 \amp 1 \amp 2 \amp 3 \amp 4 \amp 5 \amp 6 \amp 7 \amp 8 \amp 9 \\ 3 \amp 4 \amp 5 \amp 6 \amp 7 \amp 8 \amp 9 \amp 0 \amp 1 \amp 2 \end{bmatrix}\text{.} \end{equation*}

In this case, we could also write \(\sigma(x) = (x+3)\rem 10\text{,}\) using \(\rem\) in the mathematical equivalent of its Python usage.

Remark 3.1.1. Modular arithmetic and indexing.

Similarly to the convention that \(0\in\Nats\text{,}\) we should consider the canonical \(n\)-element set to be \([n]=\set{0,1,2,\dotsc,n-1}\) (which is the set of elements of range(n)) rather than \(\set{1,2,3,\dotsc,n}\text{.}\) This example shows one of the other reasons for doing so: modular arithmetic works correctly in \([n]\text{.}\) In arithmetic under modulus \(n\text{,}\) all calculations are carried out normally but reported as remainders from integer division by \(n\text{.}\)

Unfortunately this is not the convention of abstract algebra, where \(\sym{n}\) denotes the symmetric group of all permutations of \(\set{1,2,3,\dotsc,n}\text{.}\)

Substitution ciphers are a form of shared secret cryptography, where the message originator and message receiver must agree on a secret encryption method prior to transmission and share that secret first. In this case, the mathematical formula of the permutation is the shared secret.

Definition 3.1.2. Shift cipher.

The easiest type of encryption is a shift cipher of an alphabet \(\mathcal{A} = \set{a_0,a_1,a_2,\dotsc,a_{n-1}}\text{,}\) defined by \(\sigma(a_i) = a_{(i+s)\rem n}\) for some constant \(s\text{.}\)

Checkpoint 3.1.3. Breaking a shift cipher.

The message JYZWKTZGYVIJRIVVRJPKFSIVRB is encrypted with a shift cipher on the 26-letter alphabet \(\mathcal{E}\text{.}\) Decrypt it, and explain what the shared secret is for a shift cipher.

In general it is not quite so simple to break a substitution cipher on \(\mathcal{E}\) as it is to break a shift cipher; specifically, there are only 25 non-identity shift permutations of \(\mathcal{E}\text{,}\) but

\begin{equation*} \abs{\sym{\mathcal{E}}}= 26! = 403291461126605635584000000 \approx 4\times 10^{26}\text{.} \end{equation*}

If you were to try one million permutations per second and you had to test all \(26!\) permutations, it would take 12,788,288,341,153.146 years to test them all. There has to be a better way!

Subsection 3.1.2 Frequency analysis

Thankfully we are not unarmed when it comes to decrypting a monoalphabetic substitution. Even though the pool of possible permutations is vast, we can analyze the frequencies with which each letter is used in normal text to try to determine parts of the mapping. It is quite easy to write a program in Python counting the frequency of occurrence of each letter in a string. For instance, taking the original 1972 etext of the Declaration of Independence from Project Gutenberg

www.gutenberg.org/cache/epub/1/pg1.txt

and converting it into all capital letters (which is how it was produced in 1972 since computers didn’t have minuscule letters at that early date), we obtain the following results for the 8133 characters in the text:

{'\n': 165, 'T': 646, 'H': 352, 'E': 869, ' ': 1212, 'D': 256, 'C': 187, 'L': 229,
'A': 482, 'R': 425, 'I': 453, 'O': 519, 'N': 488, 'F': 184, 'P': 139, 'U': 210,
'S': 478, 'M': 145, 'W': 97, 'V': 74, ',': 107, 'B': 95, 'Y': 81, 'G': 130, 'Q': 6,
"'": 1, 'K': 14, '.': 36, '-': 3, 'J': 16, 'Z': 4, ';': 10, 'X': 9, ':': 10, '&': 1}

Listing 3.1.4. Dictionary of frequencies of each character in the United States of America’s Declaration of Independence

This was produced using the following simple program.

Technology 3.1.5. A Python program to count the frequencies of each character in a string.

def tally(some_text):
    tally_dict = { }
    for c in some_text:
        try:
            tally_dict[c] += 1
        except KeyError:
            tally_dict[c] = 1
    print(f"String Length: {len(some_text)}")
    return(tally_dict)

Listing 3.1.6. The tally program

Modifying the code slightly allows us to count just the alphabetic characters, and we find the following frequency results, listed from most common to least common.

Table 3.1.7. Letter frequencies in the Declaration of Independence of the United States of America

#	Letter	Frequency	#	Letter	Frequency	#	Letter	Frequency	#	Letter	Frequency
0	E	869	7	R	425	14	M	145	21	J	16
1	T	646	8	H	352	15	P	139	22	K	14
2	O	519	9	D	256	16	G	130	23	X	9
3	N	488	10	L	229	17	W	97	24	Q	6
4	A	482	11	U	210	18	B	95	25	Z	4
5	S	478	12	C	187	19	Y	81
6	I	453	13	F	184	20	V	74

A general analysis of text shows the most frequently occurring English letters to be ETAOIN SHRDLU, which the above table shows to be correct (up to reordering). Hence if these letters can be correctly matched in the mapping, there are only \(14!=87178291200\) remaining permutations. Even this is insurmountable to check by hand, but context can help greatly; looking over a long passage which has been partially decrypted often allows one to make reasonable guesses to other mappings of characters under a simple substitution. Unfortunately, this often requires a lot of guessing and checking, which is unreasonable without other tools from frequency analysis: there are only two one-letter words in English (A and I) relatively few common two-letter words (about 24 of them), and the frequency order for two-letter pairs (digraphs) and three-letter pairs (tri-graphs) can also be studied.

Unfortunately, all of that is, all of that is time consuming and requires a lot of guessing and checking. Having simple programs written to effectively tally the different types of frequencies, as well as allowing for simple trial-and-error substitions to be made, makes the cryptanalysis much smoother. Once whole words start to be identified, having a large vocabulary certainly helps fill in the rest, much as when playing Wheel of Fortune!

Example 3.1.8. Breaking a simple substitution by frequency analysis.

Consider the following text:

KP GP HMBHZVXKPGRRY IKV HEHPXPJ HGURY XP CTRY G YKTPJ OGP
BGOH KTV KW VIH JGUUHV XP FIXBI IH RKNJHN XP BGUZHPVHUD
ZRGBH GPN FGRSHN DRKFRY GD VIKTJI XP IHDXVGVXKP VKFGUND
SKSTDISXP QUXNJH IH IGN DTBBHDDWTRRY GEKXNHN OHHVXPJ IXD
RGPNRGNY KP VIH DVGXUBGDH IXD JGUUHV FGD TPNHU VIH UKKW KW
G IXJI WXEHDVKUXHN IKTDH GPN FGD OKUH RXSH G BTZQKGUN VIGP
G UKKO VIH RGPNRGNY FIK ZUKEXNHN IXO FXVI JGUUHV NXPPHUD
GPN GVVHPNGPBH RXEHN KP VIH WRKKU QHRKF GPN HEHUY VXOH IH
FHPV KTV IH FGD KQRXJHN VK ZGDD IHU SXVBIHP VIH NKKU KW
FIXBI XPEGUXGQRY DVKKN KZHP GPN HGBI VXOH IH ZGDDHN VIH
YKTPJ OGP IGN G DXBS WUXJIVHPHN WHHRXPJ FIXBI OGNH IXO
DBKFR GPN WHHR GDIGOHN IH FGD IKZHRHDDRY XP NHQV VK IXD
RGPNRGNY GPN FGD GWUGXN KW OHHVXPJ IHU VIXD FGD PKV QHBGTDH
IH FGD BKFGUNRY GPN GQCHBV ATXVH VIH BKPVUGUY QTV WKU DKOH
VXOH ZGDV IH IGN QHHP XP GP KEHUDVUGXPHN XUUXVGQRH BKPNXVXKP
EHUJXPJ KP IYZKBIKPNUXG IH IGN QHBKOH DK BKOZRHVHRY GQDKUQHN
XP IXODHRW GPN XDKRGVHN WUKO IXD WHRRKFD VIGV IH NUHGNHN
OHHVXPJ PKV KPRY IXD RGPNRGNY QTV GPYKPH GV GRR IH FGD
BUTDIHN QY ZKEHUVY QTV VIH GPMXHVXHD KW IXD ZKDXVXKP IGN KW
RGVH BHGDHN VK FHXJI TZKP IXO IH IGN JXEHP TZ GVVHPNXPJ VK
OGVVHUD KW ZUGBVXBGR XOZKUVGPBH IH IGN RKDV GRR NHDXUH VK NK
DK PKVIXPJ VIGV GPY RGPNRGNY BKTRN NK IGN G UHGR VHUUKU WKU
IXO QTV VK QH DVKZZHN KP VIH DVGXUD VK QH WKUBHN VK RXDVHP
VK IHU VUXEXGR XUUHRHEGPV JKDDXZ VK ZHDVHUXPJ NHOGPND WKU
ZGYOHPV VIUHGVD GPN BKOZRGXPVD GPN VK UGBS IXD QUGXPD WKU
HMBTDHD VK ZUHEGUXBGVH VK RXH  PK UGVIHU VIGP VIGV IH FKTRN
BUHHZ NKFP VIH DVGXUD RXSH G BGV GPN DRXZ KTV TPDHHP

Applying frequency analysis to this text, we see the following:

[(148, 'H'), (121, 'G'), (105, 'V'), (102, 'K'), (90, 'P'),
(87, 'X'), (82, 'I'), (76, 'N'), (74, 'D'), (70, 'U'),
(58, 'R'), (37, 'B'), (27, 'Z'), (27, 'O'), (26, 'T'),
(25, 'Y'), (24, 'J'), (24, 'F'), (23, 'W'), (21, 'Q'),
(14, 'E'), (9, 'S'), (3, 'M'), (2, 'C'), (1, 'A')]

A first guess we can make from this analysis is that the letter H in the ciphertext is probably an E in the original passage. However, the passage is actually too short to get much more than that from the frequencies of single letters, so instead we use additional clues. The frequencies of three-letter words (not trigraphs, but whole words) are as follows:

[(12, 'GPN'), (11, 'VIH'), (8, 'IGN'), (8, 'FGD'), (7, 'IXD'),
(4, 'WKU'), (4, 'QTV'), (4, 'IXO'), (3, 'KTV'), (3, 'IHU'),
(2, 'PKV'), (2, 'OGP'), (2, 'GRR'), (1, 'RXH'), (1, 'IKV'),
(1, 'GPY'), (1, 'FIK'), (1, 'BGV')]

Since we see GPN and VIH at the front of that list, and we already suspect that H has replaced E, we now have a strong argument that GPN is actually AND and VIH is actually THE. Here’s a code that allows for replacement of guesses in a pair of “decoder strings”, passed as a two-element list called decoder.

def test_decode(cmsg, decoder, chars=40):
    decoder_dict = {}
    for x,y in zip(decoder[0],decoder[1]):
        decoder_dict[x] = y
    msg_blanks = [ ]
    for x in cmsg:
        if x in decoder[0]:
            y = decoder_dict[x]
            msg_blanks += [y.lower()]
        else:
            msg_blanks += '_'
    out = [ ]
    for i in range(len(cmsg)//chars + 1):
        out.append( "".join(cmsg[i*chars:(i+1)*chars] ))
        out.append( "".join(msg_blanks[i*chars:(i+1)*chars] ))
        out.append("")
    return "\n".join(out)

trial = [' VIHGPN',' THEAND']
print(test_decode(cmsg, trial, 60))

Listing 3.1.9. A code to test the decoding guesses determined by frequency analysis of a monoalphabetic substitution

Assuming cmsg in the listing above is the encrypted message of this problem, here is the beginning of the output of test_decode(cmsg, trial, 60):

KP GP HMBHZVXKPGRRY IKV HEHPXPJ HGURY XP CTRY G YKTPJ OGP BG
_n an e__e_t__na___ h_t e_en_n_ ea___ _n ____ a ___n_ _an _a

OH KTV KW VIH JGUUHV XP FIXBI IH RKNJHN XP BGUZHPVHUD ZRGBH
_e __t __ the _a__et _n _h__h he __d_ed _n _a__ente__ __a_e

GPN FGRSHN DRKFRY GD VIKTJI XP IHDXVGVXKP VKFGUND SKSTDISXP
and _a__ed ______ a_ th___h _n he__tat__n t__a_d_ _____h__n

QUXNJH IH IGN DTBBHDDWTRRY GEKXNHN OHHVXPJ IXD RGPNRGNY KP V
___d_e he had ____e_______ a___ded _eet_n_ h__ _and_ad_ _n t

IH DVGXUBGDH IXD JGUUHV FGD TPNHU VIH UKKW KW G IXJI WXEHDVK
he _ta___a_e h__ _a__et _a_ _nde_ the ____ __ a h__h ___e_t_

UXHN IKTDH GPN FGD OKUH RXSH G BTZQKGUN VIGP G UKKO VIH RGPN
__ed h___e and _a_ ___e ___e a _____a_d than a ____ the _and

RGNY FIK ZUKEXNHN IXO FXVI JGUUHV NXPPHUD GPN GVVHPNGPBH RXE
_ad_ _h_ _____ded h__ __th _a__et d_nne__ and attendan_e ___

We have no guarantee that the decoding we’ve supplied is absolutely correct, but the last line shows a clear indication that it might be: the word attendan_e is most likely attendance; this tells us that B probably maps to C. Moreover, we see several times the undecrypted word KW; since the first word KP is decoded as _n, it is likely that K should decode to either I or O, but the most commonly used two-letter word in English is “of”. Moreover, we see the third word HMBHZVXKPGRRY of the encoded message ends with RRY. It is very common in English to see words ending “-lly”, so we reason that RY should decode to LY, giving us a fixed point: the letter Y appears to have been encoded to itself! But then that would make the translation of the third word (after all these substitutions) e_ce_t_onally. As a first guess to that, we suppose that it is exceptionally, giving us the decoding of MZX as XPI. Since X and P are nowhere near the most-frequently used letters, this is exceptional deduction on our part!

We change trial in the above code and try again:

>>> trial = [' VIHGPNBKWRYMZ',' THEANDCOFLYXP']
KP GP HMBHZVXKPGRRY IKV HEHPXPJ HGURY XP CTRY G YKTPJ OGP BG
on an exceptionally hot e_enin_ ea_ly in __ly a yo_n_ _an ca

OH KTV KW VIH JGUUHV XP FIXBI IH RKNJHN XP BGUZHPVHUD ZRGBH
_e o_t of the _a__et in _hich he lod_ed in ca_pente__ place

GPN FGRSHN DRKFRY GD VIKTJI XP IHDXVGVXKP VKFGUND SKSTDISXP
and _al_ed _lo_ly a_ tho__h in he_itation to_a_d_ _o___h_in

QUXNJH IH IGN DTBBHDDWTRRY GEKXNHN OHHVXPJ IXD RGPNRGNY KP V
__id_e he had __cce__f_lly a_oided _eetin_ hi_ landlady on t

IH DVGXUBGDH IXD JGUUHV FGD TPNHU VIH UKKW KW G IXJI WXEHDVK
he _tai_ca_e hi_ _a__et _a_ _nde_ the _oof of a hi_h fi_e_to

UXHN IKTDH GPN FGD OKUH RXSH G BTZQKGUN VIGP G UKKO VIH RGPN
_ied ho__e and _a_ _o_e li_e a c_p_oa_d than a _oo_ the land

RGNY FIK ZUKEXNHN IXO FXVI JGUUHV NXPPHUD GPN GVVHPNGPBH RXE
lady _ho p_o_ided hi_ _ith _a__et dinne__ and attendance li_

At this point, the human brain becomes wonderfully competent at filling in blanks, even without knowing the source text; making reasonable guesses allows you to deduce all of the remaining mappings of the encoding. The only letter not occuring in the original encrypted message was L. Make sure you know what character was mapped to L in the encryption!

Prev Top Next