A source is an ordered pair $(S, P)$, where $S = \{s_1, \ldots, s_q\}$ is a finite set, known as the source alphabet, and $P$ is a probability distribution over $S$. Denote the probability of $s_i$ by $P(s_i)$ or $p_i$.
Theorem 1.1
Suppose $H(p_1, \ldots, p_n)$ satisfies the following properties:
1. $H$ is defined and continuous for all $p_1, \ldots, p_n$ satisfying $p_i \ge 0$ and $\sum_{i=1}^n p_i = 1$.
2. For $n \ge 1$, $H\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) < H\left(\frac{1}{n+1}, \ldots, \frac{1}{n+1}\right)$.
3. (grouping) $H(p_1, \ldots, p_n) = H(p_1 + \cdots + p_k, p_{k+1}, \ldots, p_n) + (p_1 + \cdots + p_k)\, H\!\left(\frac{p_1}{p_1 + \cdots + p_k}, \ldots, \frac{p_k}{p_1 + \cdots + p_k}\right)$.
A function $H$ satisfies properties 1-3 if and only if it has the form
$$H(p_1, \ldots, p_n) = -C \sum_{i=1}^n p_i \log p_i,$$
where $C > 0$ and $p_i \log p_i = 0$ if $p_i = 0$.
Let $P = (p_1, \ldots, p_n)$ be a probability distribution. Then $H_b(P) = -\sum_{i=1}^n p_i \log_b p_i$ is the $b$-ary entropy of the distribution $P$.
If $P$ is the distribution of a source, $H_b(P)$ is the entropy of the source.
Example 1.1.3
The entropy function of a two-outcome distribution $(p, 1-p)$ is $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$.
Definition: entropy of a random variable
Let $X$ be a random variable with range $\{x_1, \ldots, x_n\}$. If $p_i = P(X = x_i)$, then the entropy of $X$ is defined by $H(X) = -\sum_{i=1}^n p_i \log p_i$.
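A small Python sketch of this definition (the helper name `entropy` and the convention $0 \log 0 = 0$ are my additions):

```python
from math import log

def entropy(probs, b=2):
    """b-ary entropy: -sum of p_i * log_b(p_i), with 0 log 0 taken as 0."""
    return -sum(p * log(p, b) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is more predictable
```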
Lemma 1.2.1
For all $x > 0$, $\ln x \le x - 1$, with equality if and only if $x = 1$.
Lemma 1.2.2
Let $(p_1, \ldots, p_n)$ and $(q_1, \ldots, q_n)$ satisfy $p_i \ge 0$, $q_i > 0$, and $\sum p_i = \sum q_i = 1$. Then
$$-\sum_{i=1}^n p_i \log_b p_i \le -\sum_{i=1}^n p_i \log_b q_i,$$
with equality if and only if $p_i = q_i$ for all $i$.
Theorem 1.2.3 (the range of the entropy function)
Let $X$ be a discrete random variable with range of size $n$. Then $0 \le H(X) \le \log n$; $H(X) = 0$ if and only if $X$ takes a single value with probability 1, and $H(X) = \log n$ if and only if $X$ is uniformly distributed.
Theorem 1.2.4
Let $P = (p_1, \ldots, p_n)$ be a probability distribution.
If $p_i = 0$ for some $i$, then $H(p_1, \ldots, p_n) = H(p_1, \ldots, p_{i-1}, p_{i+1}, \ldots, p_n)$; outcomes of probability zero do not affect the entropy.
Theorem 1.2.5
If $X$ and $Y$ are random variables, then $H(X, Y) \le H(X) + H(Y)$. Equality holds if and only if $X$ and $Y$ are independent.
Theorem 1.2.9
The entropy function is convex down (concave): $H(\lambda P + (1 - \lambda) Q) \ge \lambda H(P) + (1 - \lambda) H(Q)$ for probability distributions $P$, $Q$ and $0 \le \lambda \le 1$.
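A quick numeric sanity check of Theorems 1.2.5 and 1.2.9 on random distributions (a sketch; the helper names are mine):

```python
import random
from math import log

def H(probs):
    return -sum(p * log(p, 2) for p in probs if p > 0)

random.seed(1)
for _ in range(1000):
    # Random joint distribution on a 3 x 3 range.
    joint = [[random.random() for _ in range(3)] for _ in range(3)]
    total = sum(map(sum, joint))
    joint = [[p / total for p in row] for row in joint]
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    hxy = H([p for row in joint for p in row])
    assert hxy <= H(px) + H(py) + 1e-9        # Theorem 1.2.5

    # Concavity (Theorem 1.2.9) on two-point distributions.
    p, q, lam = random.random(), random.random(), random.random()
    P, Q = [p, 1 - p], [q, 1 - q]
    mix = [lam * a + (1 - lam) * b for a, b in zip(P, Q)]
    assert H(mix) >= lam * H(P) + (1 - lam) * H(Q) - 1e-9
print("all checks passed")
```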
Definitions:
code alphabet: a finite set $A = \{a_1, \ldots, a_r\}$.
string or word: any finite sequence of elements of $A$.
The set of all strings over $A$ is denoted by $A^*$.
$r$-ary code: a nonempty subset $C$ of $A^*$.
The radix of the code: $r = |A|$.
E.g. $A = \{0, 1\}$ gives binary codes, and $A = \{0, 1, 2\}$ gives ternary codes.
Definition
An encoding scheme for the source $(S, P)$ is an ordered pair $(C, f)$, where $C$ is a code and $f : S \to C$ is an encoding function.
Definition
The average codeword length of an encoding scheme $(C, f)$ for the source $(S, P)$ is $\mathrm{AveLen}(C, f) = \sum_{i=1}^q p_i \,\mathrm{len}(f(s_i))$, where $p_i = P(s_i)$.
Definition
fixed length code / block code: all codewords in the code have the same length.
Otherwise, it is a variable length code.
A fixed length (resp. variable length) encoding scheme is one whose code is a fixed length (resp. variable length) code.
Definition
A code $C$ is uniquely decipherable if whenever $c_1, \ldots, c_m$ and $d_1, \ldots, d_n$ are codewords in $C$ and $c_1 c_2 \cdots c_m = d_1 d_2 \cdots d_n$, then $m = n$ and $c_i = d_i$ for all $i$.
Definition
A code is instantaneous if each codeword in any string of codewords can be decoded as soon as it is received.
If a code is instantaneous, then it is also uniquely decipherable; the converse is not true (e.g. $\{0, 01\}$ is uniquely decipherable but not instantaneous).
Definition
A code has the prefix property if no codeword is a prefix of any other codeword; that is, if whenever $c = x_1 x_2 \cdots x_n$ is a codeword, then $x_1 x_2 \cdots x_k$ is not a codeword for $1 \le k < n$.
Theorem 2.1.1
A code is instantaneous if and only if it has the prefix property.
Theorem 2.1.2 Kraft's Theorem
There exists an instantaneous $r$-ary code with codeword lengths $l_1, \ldots, l_q$ if and only if the lengths satisfy Kraft's inequality
$$\sum_{i=1}^q \frac{1}{r^{l_i}} \le 1.$$
Theorem 2.1.3 McMillan's Theorem
The codeword lengths of a uniquely decipherable $r$-ary code must satisfy Kraft's inequality.
Theorem 2.1.4
If a uniquely decipherable code exists with codeword lengths $l_1, \ldots, l_q$, then an instantaneous code also exists with the same codeword lengths.
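The proof of Kraft's theorem is constructive. Here is a sketch of one standard construction (function names are mine): given lengths satisfying Kraft's inequality, assign codewords in order of increasing length.

```python
def kraft_sum(lengths, r=2):
    """Left-hand side of Kraft's inequality."""
    return sum(r ** -l for l in lengths)

def to_base(n, r, width):
    """Base-r digits of n, padded to `width` digits."""
    digits = []
    for _ in range(width):
        n, d = divmod(n, r)
        digits.append(str(d))
    return "".join(reversed(digits))

def instantaneous_code(lengths, r=2):
    """Build a prefix-free r-ary code with the given codeword lengths,
    assumed to satisfy Kraft's inequality."""
    assert kraft_sum(lengths, r) <= 1
    code, val, prev = [], 0, 0
    for l in sorted(lengths):
        val *= r ** (l - prev)    # scale the counter up to length l
        code.append(to_base(val, r, l))
        val += 1
        prev = l
    return code

print(instantaneous_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```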
An optimal $r$-ary encoding scheme for a source is one with minimum average codeword length.
Theorem 2.3.1
Suppose $(C, f)$ is an instantaneous $r$-ary encoding scheme for a source with distribution $P$. Then $\mathrm{AveLen}(C, f) \ge H_r(P)$.
Theorem 2.3.2 The noiseless coding theorem
For any probability distribution $P$, we have $H_r(P) \le \mathrm{MinAveLen}_r(P) < H_r(P) + 1$, where $\mathrm{MinAveLen}_r(P)$ is the minimum average codeword length over all instantaneous $r$-ary encoding schemes for $P$.
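One way to check these bounds numerically is with binary Huffman coding, which is known to attain the minimum average length. A sketch (helper names mine):

```python
import heapq
from math import log

def huffman_lengths(probs):
    """Codeword lengths of an optimal binary (Huffman) code."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)   # merge the two least likely groups
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1            # each symbol in the merge gets 1 bit longer
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

P = [0.4, 0.2, 0.2, 0.1, 0.1]
L = huffman_lengths(P)
ave = sum(p * l for p, l in zip(P, L))
H = -sum(p * log(p, 2) for p in P)
print(H, ave, H + 1)   # H_2(P) <= MinAveLen_2(P) < H_2(P) + 1
```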
Definition: n-th extension
The $n$-th extension of a source $(S, P)$ is the source $(S^n, P^n)$ whose alphabet is the set of all strings of length $n$ over $S$, with $P^n(s_{i_1} s_{i_2} \cdots s_{i_n}) = p_{i_1} p_{i_2} \cdots p_{i_n}$.
Binary symmetric channel: each transmitted bit is received incorrectly (flipped) with probability $p$ and received correctly with probability $1 - p$.
Binary erasure channel: each transmitted bit is erased (received as $?$) with probability $p$ and received correctly with probability $1 - p$.
Definition
The conditional entropy of $X$ given $Y$ is defined by $H(X \mid Y) = \sum_j P(Y = y_j)\, H(X \mid Y = y_j)$, where $H(X \mid Y = y_j) = -\sum_i P(X = x_i \mid Y = y_j) \log P(X = x_i \mid Y = y_j)$.
Theorem 3.1.1
If $X$ and $Y$ are random variables, then $H(X, Y) = H(Y) + H(X \mid Y) = H(X) + H(Y \mid X)$.
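A worked check of this identity on a small joint distribution (the table values are made up for illustration):

```python
from math import log

def H(probs):
    return -sum(p * log(p, 2) for p in probs if p > 0)

# Hypothetical joint distribution P(X = i, Y = j).
joint = [[0.25, 0.25],
         [0.00, 0.50]]
py = [sum(col) for col in zip(*joint)]       # marginal of Y

# H(X|Y) = sum_j P(Y = y_j) H(X | Y = y_j)
h_x_given_y = sum(pyj * H([row[j] / pyj for row in joint])
                  for j, pyj in enumerate(py) if pyj > 0)

hxy = H([p for row in joint for p in row])   # joint entropy H(X, Y)
print(hxy, H(py) + h_x_given_y)              # equal: both 1.5
```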
channel matrix
In a channel matrix $M = (p_{ij})$, $p_{ij} = P(b_j \text{ received} \mid a_i \text{ sent})$.
Each row corresponds to a single input; each column corresponds to a single output.
lossless: the input is completely determined by the output (one-to-many).
deterministic: the output is completely determined by the input (many-to-one).
noiseless: both lossless and deterministic (one-to-one).
useless: the input and output are independent; knowledge of one tells nothing about the other (many-to-many).
row/column symmetric: each row/column contains the same collection of numbers, possibly in a different order.
symmetric: both row and column symmetric.
Theorem
For a row symmetric channel, $H(Y \mid X)$ does not depend on the input distribution.
Theorem
For a column symmetric channel, a uniform input distribution produces a uniform output distribution.
Definition
The mutual information of $X$ and $Y$ is $I(X; Y) = H(X) - H(X \mid Y)$. The definition is the same for random vectors $\mathbf{X}$ and $\mathbf{Y}$.
Properties
$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y)$; $I(X; Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.
Definition
The capacity of a channel is the maximum mutual information $I(X; Y)$, taken over all input distributions of $X$: $C = \max_{P(X)} I(X; Y)$.
Theorem
The capacity of a symmetric channel, with inputs $a_1, \ldots, a_r$ and outputs $b_1, \ldots, b_s$, is $C = \log s - H(R)$, where $H(R)$ is the entropy of any row $R$ of the channel matrix; it is achieved by the uniform input distribution.
capacity:
lossless: $C = \log r$, where $r$ is the number of inputs.
deterministic: $C = \log s$, where $s$ is the number of outputs.
noiseless: $C = \log r$.
useless: $C = 0$.
binary symmetric channel: $C = 1 - H(p)$, where $p$ is the crossover probability and $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$.
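A sketch that brute-forces the capacity definition for a binary symmetric channel and compares it with the closed form $1 - H(p)$ (function names mine):

```python
from math import log

def H(probs):
    return -sum(p * log(p, 2) for p in probs if p > 0)

def mutual_information(px, channel):
    """I(X;Y) = H(Y) - H(Y|X) for a channel matrix P(y_j | x_i)."""
    py = [sum(px[i] * channel[i][j] for i in range(len(px)))
          for j in range(len(channel[0]))]
    h_y_given_x = sum(px[i] * H(channel[i]) for i in range(len(px)))
    return H(py) - h_y_given_x

p = 0.1                               # crossover probability
bsc = [[1 - p, p], [p, 1 - p]]        # channel matrix of the BSC
cap = max(mutual_information([a, 1 - a], bsc)
          for a in (i / 1000 for i in range(1001)))
print(cap, 1 - H([p, 1 - p]))         # both ~0.531; the max is at uniform input
```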
A decision scheme is a partial function from the set of output strings to the set of codewords.
"partial" means the function is only defined for part of output strings.
decision error/decoding error: when the decision $f(y)$ is not the codeword that was sent.
Assume the codeword sent is $c$, and the output string received is $y$.
The probability of a decision error, given that $y$ is received, is $1 - P(f(y) \text{ sent} \mid y \text{ received})$.
Definition
An ideal observer is a decision scheme $f$ for which $f(y)$ satisfies $P(f(y) \text{ sent} \mid y \text{ received}) = \max_{c \in C} P(c \text{ sent} \mid y \text{ received})$ for every output string $y$.
Theorem
For any input distribution, an ideal observer decision scheme minimizes the probability of a decision error among all decision schemes.
Ideal observers depend on the input distribution.
For a uniform input distribution, the ideal observer is a maximum likelihood decision scheme, i.e., it maximizes the forward probability $P(y \text{ received} \mid c \text{ sent})$.
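A tiny example showing that the ideal observer and maximum likelihood decoding can disagree when the input distribution is not uniform (the codewords and priors are hypothetical):

```python
p = 0.1                                  # BSC crossover probability
priors = {"000": 0.95, "111": 0.05}      # hypothetical P(c sent)

def forward(y, c):
    """Forward probability P(y received | c sent) on the BSC."""
    errs = sum(a != b for a, b in zip(y, c))
    return p ** errs * (1 - p) ** (len(y) - errs)

y = "011"
ideal = max(priors, key=lambda c: priors[c] * forward(y, c))  # max P(c | y)
ml = max(priors, key=lambda c: forward(y, c))                 # max P(y | c)
print(ideal, ml)    # '000' vs '111': the two schemes decode differently
```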
Definition
An $r$-ary code of length $n$ and size $M$ is called an $(n, M)$-code.
The rate of the code is $R = \frac{1}{n} \log_r M$.
Theorem: The Noisy Coding Theorem
Consider a discrete memoryless channel with capacity $C$. For any positive number $R < C$, there exists a sequence $C_n$ of $r$-ary $(n, M_n)$-codes, and corresponding decision schemes $f_n$, with the following properties:
1. the rate of $C_n$ is at least $R$;
2. the probability of a decision error for $f_n$ tends to $0$ as $n \to \infty$.
Theorem: Fano's inequality
For any decision scheme for a code with $M$ codewords, and any input distribution, if $p_e$ denotes the probability of a decision error, then $H(X \mid Y) \le H(p_e) + p_e \log(M - 1)$, where $X$ is the codeword sent, $Y$ is the word received, and $H(p_e)$ is the entropy of the distribution $(p_e, 1 - p_e)$.
Theorem: Weak Converse to the Noisy Coding Theorem
Consider a discrete memoryless channel with capacity $C$. Suppose that $C_n$ is a sequence of $(n, M_n)$-codes, with corresponding decision schemes $f_n$, and that the average probability of error of $f_n$ is $p_e^{(n)}$. Then if the rates satisfy $\frac{1}{n} \log M_n \ge R > C$, there exists a constant $\beta > 0$ for which $p_e^{(n)} \ge \beta$ for all sufficiently large $n$.
Theorem: Strong Converse to the Noisy Coding Theorem
Consider a discrete memoryless channel, with capacity $C$. Suppose that $C_n$ is a sequence of $(n, M_n)$-codes, with corresponding decision schemes $f_n$, and that the average probability of error of $f_n$ is $p_e^{(n)}$. Then if the rates satisfy $\frac{1}{n} \log M_n \ge R > C$, we must have $p_e^{(n)} \to 1$ as $n \to \infty$.
Definition
code alphabet: a finite set $A = \{a_1, \ldots, a_q\}$ of size $q$.
$A^n$ is the set of all strings of length $n$ over $A$.
$q$-ary block code $C$: any nonempty subset of $A^n$.
codeword: each string in $C$.
an $(n, M)$-code: a block code of length $n$ and size $M = |C|$.
The rate of a $q$-ary $(n, M)$-code is $R(C) = \frac{1}{n} \log_q M$.
Definition
A discrete memoryless channel consists of an input alphabet $A = \{a_1, \ldots, a_q\}$, an output alphabet $B = \{b_1, \ldots, b_s\}$ containing $A$, and a set of channel probabilities, or transition probabilities, $P(b_j \text{ received} \mid a_i \text{ sent})$, satisfying $\sum_{j=1}^s P(b_j \mid a_i) = 1$ for all $i$.
other definitions:
binary symmetric channel, crossover probability, channel matrix
input distribution
decision scheme, decision error/decoding error, ideal observer, average probability of error, maximum likelihood decision, the Noisy Coding Theorem.
minimum distance decoding (MDD): decode a received word as a codeword closest to it in Hamming distance; on a binary symmetric channel with crossover probability $p < 1/2$, this coincides with maximum likelihood decoding.
An $(n, M, d)$-code is a code with length $n$, size $M$, and minimum distance $d$.
t-error-correcting: any pattern of at most $t$ errors is corrected by MDD.
exactly t-error-correcting: $t$-error-correcting but not $(t+1)$-error-correcting.
Theorem
A code is exactly $t$-error-detecting if and only if $d = t + 1$.
A code is exactly $t$-error-correcting if and only if $d = 2t + 1$ or $d = 2t + 2$.
Equivalently, a code with minimum distance $d$ is exactly $\lfloor (d-1)/2 \rfloor$-error-correcting.
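A brute-force check of a small code's minimum distance and the resulting error-correcting capability (the example code is hypothetical):

```python
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def min_distance(code):
    return min(hamming(x, y) for x, y in combinations(code, 2))

C = ["00000", "01101", "10110", "11011"]   # a (5, 4, 3)-code
d = min_distance(C)
print(d, (d - 1) // 2)   # d = 3, so C is exactly 1-error-correcting
```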
Definition
An $(n, M, d)$-code is maximal if it is not contained in any $(n, M + 1, d)$-code.
Theorem
Denote $t = \lfloor (d - 1)/2 \rfloor$ and $q = 1 - p$.
For the binary symmetric channel with crossover probability $p < 1/2$, the probability of a decoding error for a maximal $(n, M, d)$-code satisfies
$$\sum_{k=d}^{n} \binom{n}{k} p^k q^{n-k} \;\le\; P_e \;\le\; \sum_{k=t+1}^{n} \binom{n}{k} p^k q^{n-k}.$$
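A sketch evaluating the two binomial tails in this bound (names mine):

```python
from math import comb

def tail(n, lo, p):
    """P(at least `lo` bit errors in n uses of a BSC with crossover p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, n + 1))

n, d, p = 7, 3, 0.05
t = (d - 1) // 2
print(tail(n, d, p))      # lower bound: >= d errors force a decoding error
print(tail(n, t + 1, p))  # upper bound: <= t errors are always corrected
```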
Definition
Let $x$ be a word in $A^n$, where $|A| = q$, and let $\rho$ be a nonnegative real number. The sphere of radius $\rho$ about $x$ is the set $S(x, \rho) = \{ y \in A^n : d(x, y) \le \rho \}$.
Definition
Let $C$ be a code in $A^n$. The packing radius $\mathrm{pr}(C)$ of $C$ is the largest integer $\rho$ for which the spheres $S(c, \rho)$ about each codeword $c$ are disjoint.
The covering radius $\mathrm{cr}(C)$ of $C$ is the smallest integer $\rho$ for which the spheres $S(c, \rho)$ about each codeword cover $A^n$.
Theorem
A code $C$ is $t$-error-correcting if and only if the spheres $S(c, t)$ about each codeword are disjoint, if and only if $t \le \mathrm{pr}(C)$.
Definition
A code $C$ is perfect if $\mathrm{pr}(C) = \mathrm{cr}(C)$.
Equivalently, there exists a number $\rho$ for which the spheres $S(c, \rho)$ about each codeword are disjoint and cover $A^n$.
Theorem: The sphere-packing condition
Let $C$ be a $q$-ary $(n, M, d)$-code. Then $C$ is perfect if and only if $d$ is odd and
$$M \sum_{k=0}^{(d-1)/2} \binom{n}{k} (q - 1)^k = q^n.$$
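A sketch of the sphere-packing condition as a predicate; it confirms, for instance, the parameters of the binary Hamming and Golay codes (function names mine):

```python
from math import comb

def sphere_size(n, q, rho):
    """Number of q-ary words within Hamming distance rho of a fixed word."""
    return sum(comb(n, k) * (q - 1) ** k for k in range(rho + 1))

def is_perfect(n, M, d, q=2):
    """Sphere-packing condition: d odd and M * |S(x, (d-1)/2)| = q^n."""
    return d % 2 == 1 and M * sphere_size(n, q, (d - 1) // 2) == q ** n

print(is_perfect(7, 16, 3))      # True: the binary [7,4,3] Hamming code
print(is_perfect(23, 4096, 7))   # True: the binary [23,12,7] Golay code
print(is_perfect(7, 16, 4))      # False: even minimum distance
```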
Definition
A $q$-ary $(n, M)$-code is systematic if there are $k$ positions $i_1, \ldots, i_k$ with the property that, by restricting the codewords to these positions, we get all of the $q^k$ possible $q$-ary words of length $k$.
The set $\{i_1, \ldots, i_k\}$ is called an information set, and the corresponding codeword symbols are information symbols.
Theorem
If $p$ is a prime number and $n$ is a positive integer, there is (up to isomorphism) exactly one field of size $q = p^n$, denoted by $GF(q)$ or $\mathbb{F}_q$.
This extends to the vector space of all $n$-tuples over $\mathbb{F}_q$, denoted by $V(n, q) = \mathbb{F}_q^n$.
Linear codes
A code $C$ is a linear code if it is a subspace of $V(n, q)$.
If $C$ has dimension $k$ over $\mathbb{F}_q$ and minimum distance $d$, $C$ is an $[n, k, d]$-code.
Definition
The weight $w(x)$ of a word $x$ is the number of nonzero positions in $x$. The minimum weight $w(C)$ of a code $C$ is the minimum weight among all nonzero codewords in $C$.
For a linear code $C$, $d(C) = w(C)$.
A generator matrix $G$ for a linear $[n, k]$-code $C$ is a $k \times n$ matrix whose rows form a basis for $C$. The codewords in $C$ are the linear combinations of the rows of $G$.
Standard form: $G = [I_k \mid A]$, where $I_k$ is the identity matrix of size $k$.
A generator matrix in standard form is systematic on its first $k$ coordinate positions.
Definition
Let $C$ be an $[n, k]$-code. The set $C^\perp = \{ v \in V(n, q) : v \cdot c = 0 \text{ for all } c \in C \}$ is called the dual code of $C$.
Theorem
$C^\perp$ is a linear $[n, n - k]$-code, and $(C^\perp)^\perp = C$.
Definition
parity check matrix: a matrix $H$ that is a generator matrix for the dual code $C^\perp$; for an $[n, k]$-code $C$, $H$ is an $(n - k) \times n$ matrix.
$x \in C$ if and only if $H x^T = 0$.
The rows of $H$ are the coefficients of a system of linear equations (the parity check equations) whose solutions are precisely the codewords in $C$.
Theorem
Let $C$ be an $[n, k, d]$-code, with parity check matrix $H$. Then every set of $d - 1$ columns of $H$ is linearly independent, and some set of $d$ columns is linearly dependent.
Theorem: Gilbert-Varshamov Bound
There exists a $q$-ary linear $[n, k]$-code with minimum distance at least $d$ provided that
$$\sum_{i=0}^{d-2} \binom{n-1}{i} (q - 1)^i < q^{n-k}.$$
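The bound is easy to evaluate; a sketch (names mine):

```python
from math import comb

def gv_exists(n, k, d, q=2):
    """Gilbert-Varshamov condition for a q-ary linear [n, k] code with
    minimum distance at least d."""
    return sum(comb(n - 1, i) * (q - 1) ** i for i in range(d - 1)) < q ** (n - k)

print(gv_exists(7, 4, 3))   # True: consistent with the [7,4,3] Hamming code
print(gv_exists(7, 5, 3))   # False: the bound makes no promise here
```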
Definition
Let $C$ be an $[n, k]$-code, with parity check matrix $H$. For any $x \in V(n, q)$, $s(x) = H x^T$ is the syndrome of $x$.
The quotient space of $V(n, q)$ modulo $C$ is $V(n, q)/C = \{ x + C : x \in V(n, q) \}$.
Theorem
Let $C$ be an $[n, k]$-code, with parity check matrix $H$.
$u$ and $v$ in $V(n, q)$ have the same syndrome if and only if they are in the same coset of the quotient space $V(n, q)/C$.
Theorem
Let $C$ be an $[n, k]$-code, with parity check matrix $H$.
Minimum distance decoding is equivalent to decoding a received word $x$ as the codeword $x - e$, where $e$ is a word of smallest weight in the coset $x + C$, i.e., a word of smallest weight with the same syndrome as $x$.
standard array
$0$ | $c_2$ | $c_3$ | $\cdots$ | $c_M$
---|---|---|---|---
$e_2$ | $e_2 + c_2$ | $e_2 + c_3$ | $\cdots$ | $e_2 + c_M$
$\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$

The first row lists the codewords in $C$. The $i$-th row is formed by choosing a word $e_i$ of smallest weight that is not yet in the array, and adding it to each word of the first row, to form the coset $e_i + C$. The words in the first column are the coset leaders.
syndrome decoding: if $x$ is received, compute its syndrome $s(x)$, find the coset leader $e$ with the same syndrome, and decode $x$ as $x - e$.
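A sketch of syndrome decoding for the binary $[7,4,3]$ Hamming code (one common choice of parity check matrix; helper names mine):

```python
from itertools import combinations

# Parity check matrix of the binary [7,4,3] Hamming code.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

def syndrome(x):
    return tuple(sum(h * xi for h, xi in zip(row, x)) % 2 for row in H)

# Map each syndrome to a coset leader (a smallest-weight word with that
# syndrome), enumerating words in order of increasing weight.
leaders = {}
for wt in range(8):
    for pos in combinations(range(7), wt):
        e = [1 if i in pos else 0 for i in range(7)]
        leaders.setdefault(syndrome(e), e)

def decode(x):
    e = leaders[syndrome(x)]                     # coset leader of x + C
    return [(a - b) % 2 for a, b in zip(x, e)]   # decode x as x - e

sent = [1, 0, 1, 0, 1, 0, 1]      # a codeword: its syndrome is (0, 0, 0)
recv = sent[:]
recv[4] ^= 1                      # one transmission error
print(decode(recv) == sent)       # True: a single error is corrected
```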
Definition
A linear $[n, k, d]$-code must have $d \le n - k + 1$ (the Singleton bound).
A linear $[n, k, d]$-code with $d = n - k + 1$ is called a maximum distance separable (MDS) code.
Theorem
$C$ is an MDS code if and only if any $n - k$ columns of the parity check matrix $H$ are linearly independent.
If a code is MDS, then so is its dual code.
Theorem
$C$ is MDS if and only if any $k$ columns of the generator matrix $G$ are linearly independent.
An MDS code is systematic on any $k$ positions.
Theorem
Let $C$ be an $[n, k]$-code with generator matrix $G = [I_k \mid A]$ in standard form.
$C$ is an MDS code if and only if every square submatrix of $A$ is nonsingular.
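A brute-force version of this criterion over a prime field (the $[4, 2]$ example over $GF(5)$ is hypothetical; names mine):

```python
from itertools import combinations

def det_mod(M, q):
    """Determinant mod q by cofactor expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0] % q
    return sum((-1) ** j * M[0][j] *
               det_mod([row[:j] + row[j + 1:] for row in M[1:]], q)
               for j in range(len(M))) % q

def is_mds_standard_form(A, q):
    """Check that every square submatrix of A is nonsingular mod q,
    for a code with generator matrix [I | A] over GF(q), q prime."""
    m, n = len(A), len(A[0])
    for size in range(1, min(m, n) + 1):
        for rows in combinations(range(m), size):
            for cols in combinations(range(n), size):
                sub = [[A[i][j] for j in cols] for i in rows]
                if det_mod(sub, q) == 0:
                    return False
    return True

A = [[1, 1],
     [1, 2]]
print(is_mds_standard_form(A, 5))   # True: [I | A] generates a [4,2,3] MDS code
```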
The support of a vector $v$ is the set of all coordinate positions where $v$ is nonzero.