cryptogram analyzer
by Daniel Parrott

Most people are familiar with cryptograms in some form or another; the daily newspaper often has a section devoted to them. However, only a handful of people are really good at solving them, and even they usually need some time to do so. Still, they find the process of substituting one character for another quite exciting, because they can see the hidden message of the puzzle gradually emerging.

solution

For the lazy among us, there is little hope that the puzzle will ever be solved. We lack either the patience or the motivation (perhaps both, for that matter). But maybe there is a glimmer of hope. Computers are quite good at crunching numbers and performing mathematical operations. Numbers are one thing, though - aren't English phrases quite another? Actually, the answer is no.

Computers represent characters using a system called ASCII, which assigns a number to each letter of the alphabet and to many other characters. Thus, the letter 'A' is represented by the number 65, the letter 'B' by 66, the letter 'C' by 67, and so on through 'Z' at 90. These are simply the upper-case characters. The computer also differentiates between case, so lower-case 'a' is 97.
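As a small illustration, Python exposes these character codes through its built-in ord and chr functions. The helper names to_codes and from_codes are made up for this sketch and are not part of the analyzer itself.

```python
def to_codes(text: str) -> list[int]:
    """Convert each character of text to its numeric character code."""
    return [ord(ch) for ch in text]


def from_codes(codes: list[int]) -> str:
    """Convert a list of character codes back into a string."""
    return "".join(chr(code) for code in codes)


print(to_codes("ABC"))        # [65, 66, 67]
print(to_codes("a"))          # [97]
print(from_codes([72, 105]))  # Hi
```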

Now that characters can be represented as numbers, we can see that the computer is able to perform character substitution just as humans do. The trick is in finding out which characters ought to be substituted for which, and then determining whether the substitutions actually make any sense.
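To make the idea of substitution concrete, here is a minimal Python sketch that applies a set of substitutions to a ciphered phrase. The function name apply_key, the dictionary representation of the substitutions, and the '_' placeholder for letters that have no substitution yet are all assumptions made for illustration.

```python
def apply_key(ciphertext: str, key: dict[str, str], unknown: str = "_") -> str:
    """Replace each ciphered letter using key; unknown letters become a placeholder."""
    return "".join(
        key.get(ch, unknown) if ch.isalpha() else ch  # spaces and punctuation pass through
        for ch in ciphertext
    )


partial_key = {"o": "e", "z": "s"}
print(apply_key("sqozzydr kouqz", partial_key))  # __ess___ _e__s

full_key = {"s": "p", "q": "r", "o": "e", "z": "s", "y": "i",
            "d": "n", "r": "g", "k": "w", "u": "a"}
print(apply_key("sqozzydr kouqz", full_key))     # pressing wears
```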

The first step - that of determining which substitutions to make - is actually one of the more difficult tasks. However, the process is rather easy to understand. My approach uses essentially the same method that Edgar Allan Poe described in his short story “The Gold-Bug”. In this story, Poe explains that the most frequently occurring character in English phrases is the letter 'E', which is followed by 'A', 'O', 'I', and so on.

As a result, the first step of the process involves finding out which character in the cryptogram occurs most frequently. We shall call this character 'X'. Once this is determined, the computer program assumes that 'X' must stand for the letter 'E'. The next step is a bit trickier, for the program must then determine which words in the English language have the letter 'E' in each place that the character 'X' is found. In my approach, the solution is this: find the two longest words in the phrase, and begin with the first of those words. Thus we are now working on a long word, and everywhere the ciphered word has the character 'X', the matching word must have the letter 'E'. The reason for working with a longer word is that it provides us with more initial substitutions to apply to the words that we encounter later on.
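The two measurements described here, the most frequent character and the longest words, can be sketched in a few lines of Python. This is only an illustration under my own naming (most_frequent_letter and words_longest_first are hypothetical helpers), and it assumes a non-empty phrase made of letters and spaces.

```python
import re
from collections import Counter


def most_frequent_letter(ciphertext: str) -> str:
    """Return the letter that occurs most often (the 'X' of the text above)."""
    counts = Counter(ch for ch in ciphertext.lower() if ch.isalpha())
    return counts.most_common(1)[0][0]


def words_longest_first(ciphertext: str) -> list[str]:
    """Split the phrase into words and order them from longest to shortest."""
    words = re.findall(r"[a-z]+", ciphertext.lower())
    return sorted(words, key=len, reverse=True)


phrase = "kouqz sqozzydr"
print(most_frequent_letter(phrase))  # z
print(words_longest_first(phrase))   # ['sqozzydr', 'kouqz']
```

In this tiny two-word sample the most frequent character is 'z', which actually stands for 's', so the initial guess of 'E' would be wrong here. Short phrases often behave this way, which is why the guess has to be treated as an assumption that can be revised.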

Now, the program begins finding matches that suit these criteria. Each matching word must be of the same length as the ciphered word. In addition, each matching candidate must use its substitutions consistently: a ciphered character must stand for the same letter everywhere it appears. For example, suppose the ciphered word is “sqozzydr”, which means “pressing”. In this instance, 'e' is substituted for the letter 'o', but note also that once a substitution is chosen for the 'z' in the ciphered word, the same substitution must be made wherever 'z' occurs in the matching candidate. As a result, 's' must be the substitute for 'z' in every part of this word, and consequently throughout the entire phrase. When the program comes across “kouqz”, the 'z' is assumed to be an 's', and therefore the word “wears” is a good candidate.
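The candidate test can be sketched in Python roughly as follows. The name extend_key is hypothetical, and the rule that two different ciphered letters may not share the same plain letter is my reading of "overlapping uses of substitutions", so treat it as an assumption rather than a description of the actual program.

```python
from typing import Optional


def extend_key(cipher_word: str, candidate: str,
               key: dict[str, str]) -> Optional[dict[str, str]]:
    """Return key extended so cipher_word decodes to candidate, or None on conflict."""
    if len(cipher_word) != len(candidate):
        return None                  # candidates must have the same length
    new_key = dict(key)              # work on a copy; the caller's key is untouched
    used = set(new_key.values())
    for c, p in zip(cipher_word, candidate):
        if c in new_key:
            if new_key[c] != p:
                return None          # this ciphered letter already means something else
        elif p in used:
            return None              # this plain letter is already taken (assumed rule)
        else:
            new_key[c] = p
            used.add(p)
    return new_key


key = extend_key("sqozzydr", "pressing", {"o": "e"})
print(key["z"])                                      # s
print(extend_key("kouqz", "wears", key) is not None)  # True
print(extend_key("kouqz", "weary", key))              # None
```

The last call fails because 'z' was already fixed as 's' by the longer word, so “weary” cannot be a match for “kouqz”.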

So for each word in the phrase, the computer program comes up with a list of matching candidates according to the current assumption that 'E' must be substituted everywhere the most frequently occurring character is found. At this point, no substitution rules are in effect, so when I mentioned that 's' must be substituted for 'z', this is only true when the program begins processing a matching candidate. Once the list of possible matching candidates is found, the program can then begin working on making substitution rules. It first establishes this list of rules based on the substitutions it finds for the longest word in the phrase, and then determines if these rules also work for a matching candidate found for the second longest word in the phrase. If there is no conflict in substitution rules, the program then proceeds onward with the next longest word, and so on. But, what if there is a conflict?

When that happens, the program tries a different matching word for the current word it is working on. Essentially, there are potentially hundreds of matching candidates for each word in the phrase, and so there are many possibilities that the program must work through. If it encounters a matching candidate that conflicts with the current set of substitution rules, it simply tries a different candidate. If it exhausts the entire list of matching candidates, then the program assumes that the current set of substitution rules is invalid and then steps back to the previous word. It then tries to go through and find a different matching candidate for this word that does not conflict with the substitution rules that worked earlier. If it manages to find one that does not conflict, then it tries the other word again. If, once again, it is unable to find a matching candidate that does not conflict, it goes back to the previous word again, and tries to find a different matching candidate.

If after all this it is unable to find a matching candidate for this word that does not cause a conflict for the word after it, the program steps back once more to the word before this one. This continues until the program has to try a different matching word for the first word (the longest one), at which point a completely different substitution ruleset is initialized. Keep in mind, however, that the program is assuming that the most frequently occurring character must be the letter 'E'. If this assumption yields no solution, the program makes another pass that finds matching words where the most frequently occurring character is the letter 'A', for instance. Then these matching candidates are used, a ruleset is established, and subsequent tests are applied to determine if there are conflicts along the way. If so, different matches are tried until there are no conflicts. Meanwhile, the program also tests each word with the substitutions applied to determine whether they all exist in the dictionary. If so, it has found a possible phrase. The user can specify how many possible phrases to find, in case the first one is not the intended solution. Usually, though, the first one is.
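The whole search, including the stepping back and the fallback from 'E' to other letters, can be sketched as a small recursive backtracking routine in Python. This is a simplified illustration, not the program's actual code: the names extend_key, search, and solve, the five-word dictionary, and the one-letter-per-letter rule are all assumptions. Because every candidate is drawn from the dictionary, the dictionary check happens implicitly.

```python
from collections import Counter

# Frequency order as given in "The Gold-Bug" (Poe's list omits 'j' and 'v').
POE_ORDER = "eaoidhnrstuycfglmwbkpqxz"


def extend_key(cipher_word, candidate, key):
    """Return key extended so cipher_word decodes to candidate, or None on conflict."""
    if len(cipher_word) != len(candidate):
        return None
    new_key = dict(key)
    used = set(new_key.values())
    for c, p in zip(cipher_word, candidate):
        if c in new_key:
            if new_key[c] != p:
                return None          # conflicts with an existing substitution rule
        elif p in used:
            return None              # plain letter already claimed (assumed rule)
        else:
            new_key[c] = p
            used.add(p)
    return new_key


def search(words, dictionary, key, limit, found):
    """Depth-first search; returning from a call is the 'step back' to the previous word."""
    if len(found) >= limit:
        return
    if not words:
        found.append(key)            # every word matched a dictionary entry
        return
    for candidate in dictionary:
        new_key = extend_key(words[0], candidate, key)
        if new_key is not None:
            search(words[1:], dictionary, new_key, limit, found)


def solve(ciphertext, dictionary, limit=1):
    """Return up to limit decoded phrases for a cryptogram of letters and spaces."""
    words = ciphertext.lower().split()
    ordered = sorted(words, key=len, reverse=True)      # longest word first
    top = Counter("".join(words)).most_common(1)[0][0]  # most frequent character
    found = []
    for guess in POE_ORDER:          # first assume top is 'e', then 'a', and so on
        search(ordered, dictionary, {top: guess}, limit, found)
        if len(found) >= limit:
            break
    return [" ".join("".join(key[c] for c in w) for w in words) for key in found]


tiny_dictionary = ["pressing", "pleasing", "wears", "bears", "weary"]
print(solve("sqozzydr kouqz", tiny_dictionary))           # ['pressing wears']
print(solve("sqozzydr kouqz", tiny_dictionary, limit=2))  # ['pressing wears', 'pressing bears']
```

With a real word list, scanning the entire dictionary for every word would be slow; grouping the dictionary by word length beforehand is one obvious refinement.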

diagram

I also came up with another means to find candidates. Rather than assuming the most frequently occurring character is 'E', or 'A', or anything else, the program can assume that 'E' is perhaps the second most frequently occurring character in the phrase, or the third most frequent. This assumes that the phrase uses the character 'E' at least once, but nearly all English phrases do. Thus, as an additional option, the user may choose this selection process. Sometimes it is faster, sometimes it is slower - it depends on the phrase.
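This alternative ordering of guesses can be sketched in Python as a generator of starting rulesets, each one pinning 'E' to the next most frequent character. The name e_hypotheses is hypothetical, and each ruleset it yields would be handed to whatever search routine does the matching.

```python
from collections import Counter


def e_hypotheses(ciphertext: str):
    """Yield starting rulesets that map the 1st, 2nd, 3rd ... most frequent letter to 'e'."""
    counts = Counter(ch for ch in ciphertext.lower() if ch.isalpha())
    for cipher_letter, _count in counts.most_common():
        yield {cipher_letter: "e"}


first_guess = next(e_hypotheses("sqozzydr kouqz"))
print(first_guess)  # {'z': 'e'}
```

For the sample phrase, the first guess ('z' as 'E') is wrong, and the correct one ('o' as 'E') comes up within the first three tries.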

Sources:
http://www.esg.montana.edu/meg/consbio/cryptogram/crypto.html

Copyright © 2007 by Daniel Parrott