Why different architectures are needed

A multilayer feedforward neural network can represent a huge hypothesis space and extract increasingly complex features in its nonlinear hidden layers. So, why do we need different architectures? Let's try to understand this.

Feature engineering is one of the most important aspects of machine learning (ML). With too few or irrelevant features, we may underfit; with too many features, we may overfit the data. Creating a good set of hand-crafted features is a tedious, time-consuming, and iterative task.

Deep learning comes with the promise that, given enough data, a deep learning model is capable of automatically figuring out the right set of features—that is, a hierarchy of features of increasing complexity. This promise is simultaneously true and slightly misleading. Deep learning has indeed simplified feature engineering in many scenarios, but it certainly hasn't eradicated the need for it entirely. As manual feature engineering has decreased, the architectures of neural network models themselves have become increasingly complex, and specific architectures are designed to solve specific problems. Architecture engineering is a much more generic approach than hand-crafted feature engineering. In architecture engineering, unlike in feature engineering, domain knowledge is not hardcoded as specific features but is instead used only at an abstract level. For example, if we were dealing with image data, one very high-level piece of information about the data is the two-dimensional locality of object pixels, and another is translation invariance. In other words, translating a cat's image by a few pixels still keeps it a cat.

In the feature engineering approach, we have to use very specific features, such as edge detectors, corner detectors, and various smoothing filters, to build a classifier for any image processing/computer vision task. Now, for neural networks, how can we encode the two-dimensional locality and translation invariance information? If a dense, fully connected layer is placed right after the input data layer, then every pixel in the image is connected to each unit in the dense layer. But pixels from two spatially distant objects need not be connected to the same hidden unit. A neural network with strong L1 regularization might be able to sparsify such weights after prolonged training with lots of data, but we can instead design the architecture to allow only local connections to the next layer. A small set of neighboring pixels (say, a 10 x 10 sub-image of pixels) may have connections to one unit of the hidden layer. Because of translation invariance, the weights used in these connections can be reused across the image. This is what a CNN does, as the sketch below illustrates. This weight-reuse strategy has other benefits, such as dramatically lowering the number of model parameters, which helps the model generalize.
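
The following is a minimal sketch of this idea using the Keras API; the 28 x 28 grayscale input, the 64 units/filters, and the 10 x 10 kernel size are illustrative assumptions, not prescriptions. It simply contrasts the parameter count of a fully connected layer (every pixel connected to every unit) with that of a convolutional layer (local 10 x 10 patches with shared weights):

```python
import tensorflow as tf

# Fully connected: every pixel connects to every hidden unit.
# Parameters: 28*28*64 + 64 = 50,240
dense_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
])

# Convolutional: each unit sees only a local 10 x 10 patch, and the same
# weights are reused at every spatial location (translation invariance).
# Parameters: 10*10*1*64 + 64 = 6,464
conv_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(64, kernel_size=(10, 10), activation='relu'),
])

dense_model.summary()
conv_model.summary()
```

Even for this tiny example, weight sharing cuts the parameter count by roughly a factor of eight while preserving the spatial structure of the input.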

Let's take another example of how abstract domain knowledge can be hardcoded into neural networks. Suppose we have temporal or sequential data. A normal feedforward network treats each input example as independent of the previous one. However, for such data, any hidden feature representation that is learned should also depend on the recent history of the data rather than just the current input. So, the neural network needs some form of feedback loop or memory. This key idea gave rise to the recurrent neural network (RNN) architecture and its modern, more robust variants, such as LSTM networks.
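
Here is a minimal sketch of a recurrent model in the same Keras API; the sequence feature size of 16 and the 32 LSTM units are illustrative assumptions. The LSTM layer carries a hidden state from one time step to the next, so the representation at step t depends on the recent history of the sequence, not just the current input:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Sequences of arbitrary length, 16 features per time step.
    tf.keras.Input(shape=(None, 16)),
    # The LSTM's hidden state is fed back at every time step (memory).
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.summary()
```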

Other advanced ML problems, such as speech translation, question-answering systems, and relationship modeling, have demanded the development of a variety of deep learning architectures.
