Assembly Language for Beginners




stick which our visitor had left behind him the night before. It was a
fine, thick piece of wood, bulbous-headed, of the sort which is known as
a "Penang lawyer." Just under the head was a broad silver band nearly
an inch across. "To James Mortimer, M.R.C.S., from his friends of the
C.C.H.," was engraved upon it, with the date "1884." It was just such a
stick as the old-fashioned family practitioner used to carry--dignified,
solid, and reassuring.


"Well, Watson, what do you make of it?"


Holmes was sitting with his back to me, and I had given him no sign of
my occupation.


(The full version of the text is here.)


The text looks correct. Yes, I made up this example and chose a well-known text by Conan Doyle, but it’s
very close to what I had in my practice some time ago.


Other ideas to consider


If space counting were to fail, there are other ideas to try:



  • Take into consideration the fact that lowercase letters are much more frequent than uppercase ones.

  • Frequency analysis.

  • There is also a good technique for detecting the language of a text: trigrams. Each language has some
    very frequent letter triplets; for English these may be “the” and “tha”. Read more about it: N-Gram-Based
    Text Categorization, http://code.activestate.com/recipes/326576/. Interestingly enough, trigram
    detection can be used when you decrypt a ciphertext gradually, like in this example
    (you just have to test 3 adjacent decrypted characters).
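To make the trigram idea concrete, here is a minimal sketch of the counting step (my own illustration, not code from the book); the actual language detection would compare the most frequent trigrams against per-language frequency tables, as described in the N-Gram-Based Text Categorization paper:

```python
from collections import Counter

def trigrams(text):
    """Count letter trigrams (3-letter sequences inside words), lowercased."""
    # Replace everything that is not a letter with a space, then split into words
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    counts = Counter()
    for word in cleaned.split():
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts

sample = "It was a fine, thick piece of wood, bulbous-headed, of the sort"
print(trigrams(sample).most_common(5))
```

For gradual decryption, the same function could be run over each newly decrypted 3-character window and the result checked against the expected trigram set.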


For non-Latin writing systems encoded in UTF-8, things may be easier. For example, Russian text
encoded in UTF-8 has each byte interleaved with a 0xD0/0xD1 byte. This is because Cyrillic characters
are placed in the 4th block of the Unicode table. Other writing systems have their own blocks.
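As an illustration of this observation, here is a crude heuristic sketch (the function name and the 0.3 threshold are arbitrary choices of mine, not anything from the book) that counts 0xD0/0xD1 lead bytes:

```python
def looks_like_russian_utf8(data: bytes, threshold: float = 0.3) -> bool:
    """Heuristic: in UTF-8 Russian text, roughly every other byte is a
    0xD0 or 0xD1 lead byte, so a large share of them suggests Cyrillic."""
    if not data:
        return False
    lead = sum(1 for b in data if b in (0xD0, 0xD1))
    return lead / len(data) >= threshold

print(looks_like_russian_utf8("Привет, мир!".encode("utf-8")))  # True
print(looks_like_russian_utf8(b"Hello, world!"))                # False
```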

9.2 Information entropy


For the sake of simplification, I would say that information entropy is a measure of how tightly a piece of
data can be compressed. For example, it is usually not possible to compress an already compressed archive
file, so it has high entropy. On the other hand, 1MiB of zero bytes can be compressed to a tiny output file.
Indeed, in plain English, one million zero bytes can be described simply as “the resulting file is one million
zero bytes”. Compressed files are usually a list of instructions for the decompressor, like: “put 1000 zeros,
then the 0x23 byte, then the 0x45 byte, then put a block of 10 bytes which we saw 500 bytes back, etc.”
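The zero-byte example is easy to demonstrate with Python’s zlib module (an illustration of the point, not part of the original text):

```python
import os
import zlib

zeros = b"\x00" * (1024 * 1024)        # 1 MiB of zero bytes
random_data = os.urandom(1024 * 1024)  # 1 MiB from the OS RNG

# A run of zeros collapses to a tiny compressed output...
print(len(zlib.compress(zeros)))        # much smaller than 1 MiB
# ...while random (high-entropy) data does not shrink at all
print(len(zlib.compress(random_data)))  # about the size of the input
```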


Texts written in natural languages can also be compressed tightly, because natural languages have a lot
of redundancy (otherwise, a tiny typo would always lead to misunderstanding, just as any toggled bit in a
compressed archive makes decompression nearly impossible), and some words are used very often. In
everyday speech, it’s possible to drop up to half of the words and it will still be recognizable.


Code for CPUs can also be compressed, because some ISA instructions are used much more often than
others. In x86, the most used instructions are MOV/PUSH/CALL (5.11.2 on page 731).


Data compressors and ciphers tend to produce results with very high entropy. A good PRNG also produces
data which cannot be compressed (in fact, its quality can be measured this way).


So, in other words, entropy is a measure which can help to probe the contents of an unknown data block.
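One simple way to probe a block is to compute its byte-level Shannon entropy; here is a minimal Python sketch (my own illustration): values near 0 indicate uniform data, values near 8 bits/byte indicate compressed, encrypted, or random data.

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (all bytes equal) up to 8.0."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # H = -sum(p * log2(p)) over the frequencies of the 256 byte values
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(byte_entropy(b"\x00" * 1000))      # zero: a single repeated byte
print(byte_entropy(os.urandom(100000)))  # close to the 8.0 maximum
```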


9.2.1 Analyzing entropy in Mathematica


(This part first appeared in my blog on 13-May-2015. Some discussion: https://news.ycombinator.com/item?id=9545276.)
