One estimate of the entropy of English text is 1.34 bits per letter [1]. This implies that, if the letters are coded into 5 bits, one needs to appropriately combine 4 text files in order to obtain bit sequences of full entropy, since 4*1.34 = 5.36 > 5. The method used in our software is to sum (mod 32) the coded values (a-z mapped to 0-25, taken as 5-bit numbers) of the corresponding letters of the text files.
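
The combining step described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the TEXTCOMBINE-SP code: the function names are hypothetical, and the choices to fold case, skip non-letters, and stop at the shortest text are assumptions that the post does not specify.

```python
import string

ALPHABET = string.ascii_lowercase  # 'a'..'z' mapped to 0..25


def letter_codes(text):
    """Return the 0-25 codes of the letters in text (assumption: case is
    folded and every non-letter character is skipped)."""
    return [ALPHABET.index(ch) for ch in text.lower() if ch in ALPHABET]


def combine_texts(texts):
    """Sum the codes of corresponding letters across all texts, mod 32."""
    streams = [letter_codes(t) for t in texts]
    # zip stops at the shortest stream, so each output value uses
    # exactly one letter from every text (assumption).
    return [sum(column) % 32 for column in zip(*streams)]


def to_bits(values):
    """Render each combined value as a fixed-width 5-bit binary string."""
    return "".join(format(v, "05b") for v in values)
```

For example, `combine_texts(["ab", "cd", "ef", "zz"])` gives `[31, 2]`, because 0+2+4+25 = 31 and 1+3+5+25 = 34, which is 2 mod 32.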
There are plenty of other schemes for obtaining high-quality pseudo-random sequences in practice, e.g. AES in counter mode. However, our scheme seems to be much simpler both in its underlying logic (understandability) and in its implementation, and is thus a viable alternative that one could use or need under certain circumstances.
The software, TEXTCOMBINE-SP, is available at mok-kong-shen.de
[1] T. M. Cover, R. C. King, A Convergent Gambling Estimate of the Entropy of English, IEEE Trans. Inf. Theory, vol. 24, 1978, pp. 413-421.
M. K. Shen
On 2017-07-09, Mok-Kong Shen <mok-kong.shen@t-online.de> wrote:
> An estimate of entropy of English texts is 1.34 bits per letter [1]. This implies that, if the letters are coded into 5 bits, one needs to appropriately combine 4 text files in order to obtain bit sequences of full entropy, since 4*1.34 = 5.36 > 5. The method used in our software is to sum (mod 32) the coded values of a-z (mapped to 0-25) as 5 bits of the corresponding letters of the text files.
That is a very bad estimate: it is basically the estimate of the entropy if you pick one letter out at random from the text file. It does NOT take into account correlations between the letters, of which there are loads and loads. I.e., if you pick three letters in sequence, there is a high probability that they are correlated, which would be disastrous for a pseudo-random number generator. Also, text is an extremely biased source. E.g., in English the letter z occurs with a somewhat different frequency than e. Exactly why you would want to do what you do is entirely unclear, since there are lots of extremely good pseudo-random number generators out there, ones not based on a half-assed theory.
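
The bias in letter frequencies can be measured directly. The following is a minimal Python sketch with a hypothetical function name; it computes the Shannon entropy of the single-letter frequency distribution only, so it says nothing about the correlations between successive letters.

```python
import math
from collections import Counter


def single_letter_entropy(text):
    """Shannon entropy, in bits per letter, of the a-z frequencies in text.

    Assumption: case is folded and non-letter characters are ignored.
    """
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    total = len(letters)
    # H = -sum(p * log2(p)) over the letters that actually occur
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A perfectly uniform source over 26 letters would give log2(26), about 4.70 bits per letter; any text whose letters are not equally frequent gives less.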
> There are plenty of other schemes for obtaining high quality pseudo-random sequences in practice, e.g. AES in counter mode. However our scheme seems to be much simpler both in the underlying logic (understandability) and in implementation and is thus a viable alternative that one could use/need under circumstances.
It is NOT viable, unless you want a complete cockup of a random number generator.
> The software, TEXTCOMBINE-SP, is available at mok-kong-shen.de
> [1] T. M. Cover, R. C. King, A Convergent Gambling Estimate of the Entropy of English, IEEE Trans. Inf. Theory, vol. 24, 1978, pp. 413-421.
> M. K. Shen
I am extremely sorry to say that I was unfortunately misled by some erroneous computations in the design stage, so I would like to retract this software (instead of attempting a more complicated redesign), and I sincerely ask the readers of this thread to pardon me for having wasted their precious time.
M. K. Shen