Why Go Deep?

Deep neural networks have achieved tremendous success in a wide range of applications and have shown clear advantages over shallow networks on complicated tasks.
Why do deep neural networks outperform shallow networks?
According to a well-known result, the universal approximation theorem, a feedforward neural network with a single (possibly huge) hidden layer is a universal approximator of Borel measurable functions. This theorem shows that shallow networks are already powerful, but it does not explain why deep networks are even more powerful.
Another line of research answers this question. Delalleau and Bengio (2011) showed that a shallow network requires exponentially more hidden units than a deep network in order to compute certain families of polynomials. However, their result is limited to a specific kind of neuron and a specific problem, so a more general study is needed.
In 2014, Bengio and his colleagues pushed this work forward, showing that deep networks are able to separate their input space into exponentially more linear regions than shallow networks with the same number of hidden units.
Personally, I think this is a good answer to the initial question. Here is my understanding of Bengio’s research.
The non-linear activation function used in most deep neural networks is ReLU, which is called a piecewise linear function in the paper. “Piecewise” means the function is linear on each piece of its input space, with a different linear response per piece. For ReLU, where $a = \max\{0, x\}$, only $x > 0$ leads to a non-zero activation.
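To see the two pieces directly, here is a tiny NumPy sketch (my own illustration, not code from the paper): ReLU has slope $0$ to the left of the critical point and slope $1$ to its right.

```python
import numpy as np

def relu(x):
    # Piecewise linear: zero on x <= 0, identity on x > 0.
    return np.maximum(0.0, x)

x = np.linspace(-2.0, 2.0, 9)
print(np.column_stack([x, relu(x)]))
# The slope is 0 left of the critical point x = 0 and 1 to its right.
```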
A piecewise linear function changes its behavior at a specific point, which I will call a critical point. For ReLU, this point is $0$. For the first hidden layer, the collection of inputs that is mapped to the critical point of the $i$-th hidden unit forms a hyperplane in the input space. Formally, let $W_{i,:}$ be the weight vector from the input to the $i$-th hidden unit of the first hidden layer, and let $b_i$ be the corresponding bias. The hyperplane associated with the critical point of this unit is $H_i = \{x \in R^{n_0} \mid W_{i,:}x + b_i = 0\}$ for $i = 1, \dots, n_1$.
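To make these hyperplanes concrete, here is a small sketch with made-up weights (the matrix `W`, the bias `b`, and the test points are hypothetical, purely for illustration): the sign of $W_{i,:}x + b_i$ tells which side of $H_i$ an input lies on, and the pattern of signs across all units identifies the linear region the input falls into.

```python
import numpy as np

# Toy first layer: n0 = 2 inputs, n1 = 3 hidden units (made-up values).
W = np.array([[1.0,  0.0],
              [0.0,  1.0],
              [1.0, -1.0]])    # W[i, :] is the weight vector of the i-th unit
b = np.array([0.0, 0.0, 0.5])

def region_signature(x):
    # Sign of W x + b for each unit, i.e. which side of each hyperplane H_i
    # the point x lies on. Points with the same signature share a linear region.
    return tuple(np.sign(W @ x + b).astype(int))

print(region_signature(np.array([0.3, 0.2])))    # (1, 1, 1)
print(region_signature(np.array([-0.3, 0.4])))   # (-1, 1, -1): a different region
```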
In a shallow network with $n_0$ input units and $n_1$ hidden units, there are $n_1$ such hyperplanes in total. This set of hyperplanes can split the input space into at most $\sum_{j=0}^{n_0} C_{n_1}^j$ regions (Zaslavsky, 1975), and these regions are mutually exclusive.
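As a quick sanity check of this bound (my own numerical sketch, not from the paper), the function below simply evaluates the sum; for example, $n_0 = 2$ and $n_1 = 3$ gives $1 + 3 + 3 = 7$ regions, which matches three lines in general position cutting the plane into seven parts.

```python
from math import comb

def max_regions_shallow(n0, n1):
    # Zaslavsky (1975): n1 hyperplanes in an n0-dimensional input space
    # cut it into at most sum_{j=0}^{n0} C(n1, j) regions.
    return sum(comb(n1, j) for j in range(n0 + 1))

print(max_regions_shallow(2, 3))    # 7  (= 1 + 3 + 3)
print(max_regions_shallow(2, 20))   # 211: grows only polynomially in n1
```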
These regions are called linear regions. The number of linear regions a network generates indicates the flexibility of the network: in classification, the more linear regions it can use, the more closely it can approximate the true decision boundary.

Another finding is that in deep networks, the intermediate layers are able to map several pieces of input from the previous layer onto the same output, so that computation from earlier layers can be re-used by the next layer.
This is like folding a piece of paper: each layer folds the paper already folded by the previous layer, generating more creases in the original, unfolded paper.
A shallow network only folds the original paper: each hidden unit creates one crease in the original space, so the number of creases equals the number of hidden units. The decision boundary is approximated by combining those creases.
A deep network folds the paper “recursively”: it folds the already-folded paper, so the number of creases in the original paper grows exponentially with the number of layers. Folding the folded paper is another way of expressing “re-use the low level computations”.
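The folding picture can be made concrete with a small 1-D sketch. This is my own construction in the spirit of the paper, not its exact argument: two ReLU units implement one “fold” of the interval $[0, 1]$ onto itself, and composing this fold layer after layer doubles the number of linear pieces each time, while a shallow network with the same total number of hidden units adds pieces only linearly.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def fold(x):
    # One "fold" of [0, 1] onto itself, built from two ReLU units:
    # it maps [0, 1/2] up onto [0, 1] and [1/2, 1] back down onto [0, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep(x, depth):
    # A deep network on a 1-D input: each layer re-folds the previous layer's output.
    y = x
    for _ in range(depth):
        y = fold(y)
    return y

def count_linear_pieces(f, n=1025):
    # Count linear pieces on [0, 1] by counting slope changes on a fine grid.
    x = np.linspace(0.0, 1.0, n)
    slopes = np.round(np.diff(f(x)) / np.diff(x), 3)
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

for L in range(1, 6):
    pieces = count_linear_pieces(lambda x: deep(x, L))
    print(f"depth {L} (2 ReLU units per layer): {pieces} linear pieces")
# Pieces double with every layer (2, 4, 8, 16, 32), whereas a shallow network
# with the same 2*L hidden units cuts a 1-D input into at most 2*L + 1 pieces.
```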

In conclusion, by re-using the results from previous layers, deep networks are able to separate the input space into exponentially more linear regions than shallow networks, making the model more powerful, capable, and flexible. This is why deeper is better.