9.4. FORTUNE PROGRAM INDEXING FILE
% od -t x1 --address-radix=x --skip-bytes=0x32 fortunes.dat
000032 01 48 00 00 01 7c 00 00 01 ab 00 00 01 e6 00 00
000042 02 20 00 00 02 3b 00 00 02 7a 00 00 02 c5 00 00
000052 03 04 00 00 03 3d 00 00 03 68 00 00 03 a7 00 00
000062 03 e1 00 00 04 19 00 00 04 2d 00 00 04 7f 00 00
000072 04 ad 00 00 04 d5 00 00 05 05 00 00 05 3b 00 00
000082 05 64 00 00 05 82 00 00 05 ad 00 00 05 ce 00 00
000092 05 f7 00 00 06 1c 00 00 06 61 00 00 06 7a 00 00
0000a2 06 d1 00 00 07 0a 00 00 07 53 00 00 07 9a 00 00
0000b2 07 f8 00 00 08 27 00 00 08 59 00 00 08 8b 00 00
0000c2 08 a0 00 00 08 c4 00 00 08 e1 00 00 08 f9 00 00
0000d2 09 27 00 00 09 43 00 00 09 79 00 00 09 a3 00 00
0000e2 09 e3 00 00 0a 15 00 00 0a 4d 00 00 0a 5e 00 00
...
If we interpret this array as little-endian, the first element is 0x4801, the second is 0x7C01, etc. The high
8-bit part of each of these 16-bit values looks random to us, while the low 8-bit part is
ascending.
But I’m sure that this is a big-endian array, because the very last 32-bit element of the file is big-endian
(00 00 5f c4 here):
% od -t x1 --address-radix=x fortunes.dat
...
000660 00 00 59 0d 00 00 59 55 00 00 59 7d 00 00 59 b5
000670 00 00 59 f4 00 00 5a 35 00 00 5a 5e 00 00 5a 9c
000680 00 00 5a cb 00 00 5a f4 00 00 5b 1f 00 00 5b 3d
000690 00 00 5b 68 00 00 5b ab 00 00 5b f9 00 00 5c 49
0006a0 00 00 5c ae 00 00 5c eb 00 00 5d 34 00 00 5d 7a
0006b0 00 00 5d a3 00 00 5d f5 00 00 5e 3a 00 00 5e 67
0006c0 00 00 5e a8 00 00 5e ce 00 00 5e f7 00 00 5f 30
0006d0 00 00 5f 82 00 00 5f c4
0006d8
Perhaps the developer of the fortune program had a big-endian computer, or maybe it was ported from
something like it.
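The same comparison can be sketched in Python. The eight bytes below are copied from the last row of the dump above; the variable names are mine and purely illustrative.

```python
import struct

# Last eight bytes of fortunes.dat, copied from the od dump (offsets 0x6d0..0x6d7).
tail = bytes.fromhex("00005f82" "00005fc4")

# Read the same bytes as two unsigned 32-bit integers in both byte orders.
big = struct.unpack(">2I", tail)      # big-endian
little = struct.unpack("<2I", tail)   # little-endian

print([hex(v) for v in big])
print([hex(v) for v in little])
```

Read as big-endian, the values are small and ascending, which is what file offsets should look like. Read as little-endian, they turn into huge numbers that cannot be offsets into a text file of this size.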
OK, so the array is big-endian, and, judging by common sense, the very first phrase in the text file must
start at offset zero. So a zero value should be present somewhere at the very beginning of the array.
We’ve got a couple of zero elements at the beginning. But the second one is the most appealing: 43 comes right
after it, and 43 is a valid offset to a valid English phrase in the text file.
The last array element is 0x5FC4, and there is no byte at this offset in the text file. So the last
array element points past the end of the file. This is supposedly done because the phrase length is calculated
as the difference between the offset of the current phrase and the offset of the next phrase. This can be faster than
scanning the phrase string for the percent character. But this wouldn’t work for the last phrase, so a dummy
element is also added at the end of the array.
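The offset-difference idea can be sketched like this in Python. This is only an illustration under assumptions: the sample text is made up, every phrase is assumed to be terminated by a line holding a single percent character, and build_offsets() and get_phrase() are hypothetical helpers, not anything taken from the fortune sources.

```python
DELIM = b"%\n"  # assumed terminator: a line holding only '%'

def build_offsets(text):
    """Return the start offset of every phrase plus one trailing dummy offset."""
    offsets = [0]
    pos = 0
    while True:
        # Simplification: this would also match a phrase line ending in '%'.
        hit = text.find(DELIM, pos)
        if hit == -1:
            break
        pos = hit + len(DELIM)
        offsets.append(pos)        # next phrase (or the end of the text) starts here
    if offsets[-1] != len(text):
        offsets.append(len(text))  # text did not end with a terminator
    return offsets

def get_phrase(text, offsets, i):
    """Slice phrase i using two neighbouring offsets, without scanning for '%'."""
    chunk = text[offsets[i]:offsets[i + 1]]
    if chunk.endswith(DELIM):
        chunk = chunk[:-len(DELIM)]
    return chunk

# Made-up sample text with three phrases.
sample = b"first\n%\nsecond phrase\n%\nthird\n%\n"
offs = build_offsets(sample)
print(offs)
print(get_phrase(sample, offs, 2))
```

Three phrases produce four offsets here: the last one equals the length of the text, so it points at a byte that doesn't exist, just like 0x5FC4 in the real index.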
So the first six 32-bit integer values are supposedly some kind of header.
Oh, I forgot to count the phrases in the text file:
% cat fortunes | grep % | wc -l
432
The number of phrases may be present in the index, or it may not. In the case of very simple index files, the number
of elements can easily be deduced from the index file size. Anyway, there are 432 phrases in the text file. And
we see something very familiar in the second element (value 431). I’ve checked other files (literature.dat
and riddles.dat in Ubuntu Linux) and yes, the second 32-bit element is indeed the number of phrases minus 1.
Why minus 1? Perhaps this is not the number of phrases, but rather the number of the last phrase (counting
from zero)?
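Deducing the number of elements from the file size can be sketched as follows. The six-word header is the guess made above, not a confirmed fact; element_count() and split_index() are hypothetical helpers, and the synthetic index uses placeholder zeros for the header because the meaning of its fields is not established yet.

```python
import struct

HEADER_WORDS = 6  # assumption from the text: six 32-bit header values
WORD = 4          # size of one 32-bit element in bytes

def element_count(file_size):
    """Number of 32-bit array elements that follow the assumed header."""
    return (file_size - HEADER_WORDS * WORD) // WORD

def split_index(data):
    """Split raw index bytes into (header, offsets), read as big-endian uint32."""
    count = len(data) // WORD
    words = struct.unpack(">%dI" % count, data[:count * WORD])
    return words[:HEADER_WORDS], words[HEADER_WORDS:]

# Synthetic index for illustration only (not the real fortunes.dat):
# six placeholder header words followed by four offsets.
fake = struct.pack(">10I", 0, 0, 0, 0, 0, 0, 0, 8, 24, 32)
header, offsets = split_index(fake)
print(header, offsets)

# od printed 0006d8 as the final address of fortunes.dat, i.e. the file size.
print(element_count(0x6d8))
```

For fortunes.dat this arithmetic gives 432 array elements after the header, counting the trailing dummy element.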
And there are some other elements in the header. In Mathematica, I load each of the three available
files and take a look at the header: