To this day, the models that we have discussed so far remain appropriate options when we are dealing with tabular data. By tabular, we mean that the data consist of rows corresponding to examples and columns corresponding to features. With tabular data, we might anticipate that the patterns we seek could involve interactions among the features, but we do not assume any structure a priori concerning how the features interact.

Sometimes, we truly lack knowledge to guide the construction of craftier architectures. In these cases, an MLP may be the best that we can do. However, for high-dimensional perceptual data, such structure-less networks can grow unwieldy.

For instance, let us return to our running example of distinguishing cats from dogs. Say that we do a thorough job in data collection, collecting an annotated dataset of one-megapixel photographs. This means that each input to the network has one million dimensions. Even an aggressive reduction to one thousand hidden dimensions would require a fully-connected layer characterized by \(10^6 \times 10^3 = 10^9\) parameters. Unless we have lots of GPUs, a talent for distributed optimization, and an extraordinary amount of patience, learning the parameters of this network may turn out to be infeasible.
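
The arithmetic above can be checked in a few lines. This is only a sketch: the helper name `dense_weight_count` is made up for illustration, biases are ignored, and the storage estimate assumes 4 bytes per parameter (32-bit floats), which the text does not specify.

```python
def dense_weight_count(num_inputs: int, num_hidden: int) -> int:
    """Number of weights in one fully-connected layer (biases ignored)."""
    return num_inputs * num_hidden


megapixel_inputs = 1000 * 1000  # one-megapixel image, one value per pixel
hidden_units = 1000             # the "aggressive reduction" from the text

weights = dense_weight_count(megapixel_inputs, hidden_units)

# Assumption: each parameter is stored as a 32-bit float (4 bytes).
gigabytes = weights * 4 / 1e9

print(f"{weights:,} weights, roughly {gigabytes} GB for a single layer")
```

Under these assumptions, a single layer already needs several gigabytes for its weights alone, before counting gradients or optimizer state.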

A careful reader might object to this argument on the basis that one-megapixel resolution may not be necessary. However, while we might be able to get away with one hundred thousand pixels, our hidden layer of size 1000 grossly underestimates the number of hidden units that it takes to learn good representations of images, so a practical system will still require billions of parameters. Moreover, learning a classifier by fitting so many parameters might require collecting an enormous dataset. And yet today both humans and computers are able to distinguish cats from dogs quite well, seemingly contradicting these intuitions. That is because images exhibit rich structure that can be exploited by humans and machine learning models alike. Convolutional neural networks (CNNs) are one creative way that machine learning has embraced for exploiting some of the known structure in natural images.

6.1.1. Invariance

Imagine that you want to detect an object in an image. It seems reasonable that whatever method we use to recognize objects should not be overly concerned with the precise location of the object in the image. Ideally, our system should exploit this knowledge. Pigs usually do not fly and planes usually do not swim. Nonetheless, we should still recognize a pig were one to appear at the top of the image. We can draw some inspiration here from the children's game 'Where's Waldo' (depicted in Fig. 6.1.1). The game consists of a number of chaotic scenes bursting with activities. Waldo shows up somewhere in each, typically lurking in some unlikely location. The reader's goal is to locate him. Despite his characteristic outfit, this can be surprisingly difficult, due to the large number of distractions. However, what Waldo looks like does not depend upon where Waldo is located. We could sweep the image with a Waldo detector that could assign a score to each patch, indicating the likelihood that the patch contains Waldo. CNNs systematize this idea of spatial invariance, exploiting it to learn useful representations with fewer parameters.

We can now make these intuitions more concrete by enumerating a few desiderata to guide our design of a neural network architecture suitable for computer vision:

  1. In the earliest layers, our network should respond similarly to the same patch, regardless of where it appears in the image. This principle is called translation invariance.

  2. The earliest layers of the network should focus on local regions, without regard for the contents of the image in distant regions. This is the locality principle. Eventually, these local representations can be aggregated to make predictions at the whole image level.

Let us see how this translates into mathematics.

6.1.2. Constraining the MLP

To start off, we can consider an MLP with two-dimensional images \(\mathbf{X}\) as inputs and their immediate hidden representations \(\mathbf{H}\) similarly represented as matrices in mathematics and as two-dimensional tensors in code, where both \(\mathbf{X}\) and \(\mathbf{H}\) have the same shape. Let that sink in. We now conceive of not only the inputs but also the hidden representations as possessing spatial structure.

Let \([\mathbf{X}]_{i, j}\) and \([\mathbf{H}]_{i, j}\) denote the pixel at location \((i, j)\) in the input image and hidden representation, respectively. Consequently, to have each of the hidden units receive input from each of the input pixels, we would switch from using weight matrices (as we did previously in MLPs) to representing our parameters as fourth-order weight tensors \(\mathsf{W}\). Supposing that \(\mathbf{U}\) contains biases, we could formally express the fully-connected layer as

\[\begin{aligned} [\mathbf{H}]_{i, j} &= [\mathbf{U}]_{i, j} + \sum_k \sum_l [\mathsf{W}]_{i, j, k, l} [\mathbf{X}]_{k, l} \\ &= [\mathbf{U}]_{i, j} + \sum_a \sum_b [\mathsf{V}]_{i, j, a, b} [\mathbf{X}]_{i+a, j+b}, \end{aligned} \tag{6.1.1}\]

where the switch from \(\mathsf{W}\) to \(\mathsf{V}\) is entirely cosmetic for now since there is a one-to-one correspondence between coefficients in both fourth-order tensors. We simply re-index the subscripts \((k, l)\) such that \(k = i+a\) and \(l = j+b\). In other words, we set \([\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i+a, j+b}\). The indices \(a\) and \(b\) run over both positive and negative offsets, covering the entire image. For any given location \((i, j)\) in the hidden representation \([\mathbf{H}]_{i, j}\), we compute its value by summing over pixels in \(\mathbf{X}\), centered around \((i, j)\) and weighted by \([\mathsf{V}]_{i, j, a, b}\).

Translation Invariance

Now let us invoke the first principle established above: translation invariance. This implies that a shift in the input \(\mathbf{X}\) should simply lead to a shift in the hidden representation \(\mathbf{H}\). This is only possible if \(\mathsf{V}\) and \(\mathbf{U}\) do not actually depend on \((i, j)\), i.e., we have \([\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}\) and \(\mathbf{U}\) is a constant, say \(u\). As a result, we can simplify the definition for \(\mathbf{H}\):

\[[\mathbf{H}]_{i, j} = u + \sum_a \sum_b [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}. \tag{6.1.2}\]

This is a convolution! We are effectively weighting pixels at \((i+a, j+b)\) in the vicinity of location \((i, j)\) with coefficients \([\mathbf{V}]_{a, b}\) to obtain the value \([\mathbf{H}]_{i, j}\). Note that \([\mathbf{V}]_{a, b}\) needs many fewer coefficients than \([\mathsf{V}]_{i, j, a, b}\) since it no longer depends on the location within the image. We have made significant progress!

Locality

Now let us invoke the second principle: locality. As motivated above, we believe that we should not have to look very far away from location \((i, j)\) in order to glean relevant information to assess what is going on at \([\mathbf{H}]_{i, j}\). This means that outside some range \(|a| > \Delta\) or \(|b| > \Delta\), we should set \([\mathbf{V}]_{a, b} = 0\). Equivalently, we can rewrite \([\mathbf{H}]_{i, j}\) as

\[[\mathbf{H}]_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a, b} [\mathbf{X}]_{i+a, j+b}. \tag{6.1.3}\]

Note that (6.1.3), in a nutshell, is a convolutional layer. Convolutional neural networks (CNNs) are a special family of neural networks that contain convolutional layers. In the deep learning research community, \(\mathbf{V}\) is referred to as a convolution kernel, a filter, or simply the layer's weights that are often learnable parameters. When the local region is small, the difference as compared with a fully-connected network can be dramatic. While previously, we might have required billions of parameters to represent just a single layer in an image-processing network, we now typically need just a few hundred, without altering the dimensionality of either the inputs or the hidden representations. The price paid for this drastic reduction in parameters is that our features are now translation invariant and that our layer can only incorporate local information when determining the value of each hidden activation. All learning depends on imposing inductive bias. When that bias agrees with reality, we get sample-efficient models that generalize well to unseen data. But of course, if those biases do not agree with reality, e.g., if images turned out not to be translation invariant, our models might struggle even to fit our training data.
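
To make (6.1.3) concrete, here is a minimal pure-Python sketch that evaluates the double sum directly. The function name `conv_layer`, the nested-list representation, and the choice to treat pixels outside the image as zero are assumptions made for illustration; the text has not yet said how boundaries should be handled.

```python
def conv_layer(X, V, u=0.0):
    """Evaluate H[i][j] = u + sum_{a,b} V[a][b] * X[i+a][j+b], as in (6.1.3).

    V is a (2*delta + 1) x (2*delta + 1) nested list, where
    V[a + delta][b + delta] holds the coefficient for offset (a, b).
    Assumption: pixels that fall outside X contribute zero.
    """
    height, width = len(X), len(X[0])
    delta = len(V) // 2
    H = [[u] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    k, l = i + a, j + b
                    if 0 <= k < height and 0 <= l < width:
                        H[i][j] += V[a + delta][b + delta] * X[k][l]
    return H


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

# delta = 1: the only nonzero coefficient sits at offset (a, b) = (0, 1),
# so each hidden unit copies the pixel immediately to its right.
V = [[0, 0, 0],
     [0, 0, 1],
     [0, 0, 0]]

H = conv_layer(X, V)
for row in H:
    print(row)

# The layer has (2*delta + 1)**2 weights plus one bias, whatever the image size.
num_parameters = len(V) * len(V[0]) + 1
print(num_parameters)
```

The same ten parameters would serve a 3 x 3 image or a one-megapixel image, which is where the savings over a fully-connected layer come from.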

6.1.3. Convolutions

Before going further, we should briefly review why the above operation is called a convolution. In mathematics, the convolution between two functions, say \(f, g: \mathbb{R}^d \to \mathbb{R}\), is defined as

\[(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x}-\mathbf{z}) \, d\mathbf{z}. \tag{6.1.4}\]

That is, we measure the overlap between \(f\) and \(g\) when one function is 'flipped' and shifted by \(\mathbf{x}\). Whenever we have discrete objects, the integral turns into a sum. For instance, for vectors from the set of square summable infinite dimensional vectors with index running over \(\mathbb{Z}\) we obtain the following definition:

\[(f * g)(i) = \sum_a f(a) g(i-a). \tag{6.1.5}\]

For two-dimensional tensors, we have a corresponding sum with indices \((a, b)\) for \(f\) and \((i-a, j-b)\) for \(g\), respectively:

\[(f * g)(i, j) = \sum_a \sum_b f(a, b) g(i-a, j-b). \tag{6.1.6}\]

This looks similar to (6.1.3), with one major difference. Rather than using \((i+a, j+b)\), we are using the difference instead. Note, though, that this distinction is mostly cosmetic since we can always match the notation between (6.1.3) and (6.1.6). Our original definition in (6.1.3) more properly describes a cross-correlation. We will come back to this in the following section.
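
One way to see that the distinction is mostly cosmetic is to check numerically that convolving with \(f\) equals cross-correlating with a flipped copy of \(f\). The sketch below is illustrative only: functions on \(\mathbb{Z}^2\) are represented as dictionaries that map an index pair to a value (zero wherever a key is absent), and the names `convolve`, `cross_correlate`, and `flip` are hypothetical helpers, not part of any library.

```python
def cross_correlate(f, g, i, j):
    """sum_{a,b} f(a, b) * g(i + a, j + b), the index pattern of (6.1.3)."""
    return sum(v * g.get((i + a, j + b), 0) for (a, b), v in f.items())


def convolve(f, g, i, j):
    """sum_{a,b} f(a, b) * g(i - a, j - b), the index pattern of (6.1.6)."""
    return sum(v * g.get((i - a, j - b), 0) for (a, b), v in f.items())


def flip(f):
    """Reflect f through the origin: flipped(a, b) = f(-a, -b)."""
    return {(-a, -b): v for (a, b), v in f.items()}


# Arbitrary small examples; any finitely supported f and g would do.
f = {(-1, 0): 1, (0, 0): 2, (0, 1): 3}
g = {(0, 0): 1, (1, 0): 4, (0, 1): 5, (1, 1): 6, (2, 1): -1}

# In general the two operations give different values at the same location...
print(convolve(f, g, 1, 1), cross_correlate(f, g, 1, 1))

# ...but convolution with f matches cross-correlation with the flipped f.
matches = all(
    convolve(f, g, i, j) == cross_correlate(flip(f), g, i, j)
    for i in range(-3, 4)
    for j in range(-3, 4)
)
print(matches)
```

Since a learned kernel can absorb the flip, a layer trained with either convention can represent the same set of functions.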

6.1.4. 'Where's Waldo' Revisited

Returning to our Waldo detector, let us see what this looks like. The convolutional layer picks windows of a given size and weighs intensities according to the filter \(\mathsf{V}\), as demonstrated in Fig. 6.1.2. We might aim to learn a model so that wherever the 'waldoness' is highest, we should find a peak in the hidden layer representations.

Channels

There is just one problem with this approach. So far, we blissfully ignored that images consist of 3 channels: red, green, and blue. In reality, images are not two-dimensional objects but rather third-order tensors, characterized by a height, width, and channel, e.g., with shape \(1024 \times 1024 \times 3\) pixels. While the first two of these axes concern spatial relationships, the third can be regarded as assigning a multidimensional representation to each pixel location. We thus index \(\mathsf{X}\) as \([\mathsf{X}]_{i, j, k}\). The convolutional filter has to adapt accordingly. Instead of \([\mathbf{V}]_{a,b}\), we now have \([\mathsf{V}]_{a,b,c}\).

Moreover, just as our input consists of a third-order tensor, it turns out to be a good idea to similarly formulate our hidden representations as third-order tensors \(\mathsf{H}\). In other words, rather than just having a single hidden representation corresponding to each spatial location, we want an entire vector of hidden representations corresponding to each spatial location. We could think of the hidden representations as comprising a number of two-dimensional grids stacked on top of each other. As in the inputs, these are sometimes called channels. They are also sometimes called feature maps, as each provides a spatialized set of learned features to the subsequent layer. Intuitively, you might imagine that at lower layers that are closer to inputs, some channels could become specialized to recognize edges while others could recognize textures.

To support multiple channels in both inputs (\(\mathsf{X}\)) and hidden representations (\(\mathsf{H}\)), we can add a fourth coordinate to \(\mathsf{V}\): \([\mathsf{V}]_{a, b, c, d}\). Putting everything together we have:

\[[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a, b, c, d} [\mathsf{X}]_{i+a, j+b, c}, \tag{6.1.7}\]

where \(d\) indexes the output channels in the hidden representations \(\mathsf{H}\). The subsequent convolutional layer will go on to take a third-order tensor, \(\mathsf{H}\), as the input. Being more general, (6.1.7) is the definition of a convolutional layer for multiple channels, where \(\mathsf{V}\) is a kernel or filter of the layer.
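
The triple sum in (6.1.7) can likewise be written out as nested loops. The following is a sketch under stated assumptions rather than an efficient implementation: the name `conv_layer_multichannel` is hypothetical, tensors are plain nested lists indexed as `X[i][j][c]` and `V[a + delta][b + delta][c][d]`, and pixels outside the image are assumed to contribute zero.

```python
def conv_layer_multichannel(X, V):
    """Evaluate (6.1.7): H[i][j][d] = sum_{a,b,c} V[a][b][c][d] * X[i+a][j+b][c].

    X[i][j][c] has shape height x width x input channels.
    V[a + delta][b + delta][c][d] stores the coefficient for spatial offset
    (a, b), input channel c, and output channel d.
    Assumption: pixels that fall outside X contribute zero.
    """
    height, width, in_channels = len(X), len(X[0]), len(X[0][0])
    delta = len(V) // 2
    out_channels = len(V[0][0][0])
    H = [[[0.0] * out_channels for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for d in range(out_channels):
                total = 0.0
                for a in range(-delta, delta + 1):
                    for b in range(-delta, delta + 1):
                        k, l = i + a, j + b
                        if not (0 <= k < height and 0 <= l < width):
                            continue
                        for c in range(in_channels):
                            total += V[a + delta][b + delta][c][d] * X[k][l][c]
                H[i][j][d] = total
    return H


# A 2 x 2 image with two input channels.
X = [[[1, 10], [2, 20]],
     [[3, 30], [4, 40]]]

# A 3 x 3 kernel (delta = 1) with two input and two output channels, all zeros...
V = [[[[0.0] * 2 for _ in range(2)] for _ in range(3)] for _ in range(3)]
# ...except: output channel 0 adds both input channels at the center pixel,
V[1][1][0][0] = 1.0
V[1][1][1][0] = 1.0
# and output channel 1 copies input channel 0 of the right-hand neighbor.
V[1][2][0][1] = 1.0

H = conv_layer_multichannel(X, V)
print(H)

# Kernel size: (2*delta + 1)**2 * input channels * output channels.
num_weights = len(V) * len(V[0]) * len(V[0][0]) * len(V[0][0][0])
print(num_weights)
```

Each output channel combines every input channel within the local window, so the weight count grows with the product of the two channel counts but remains independent of the image height and width.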

There are still many operations that we need to address. For instance, we need to figure out how to combine all the hidden representations to a single output, e.g., whether there is a Waldo anywhere in the image. We also need to decide how to compute things efficiently, how to combine multiple layers, appropriate activation functions, and how to make reasonable design choices to yield networks that are effective in practice. We turn to these issues in the remainder of the chapter.

6.1.5. Summary

  • Translation invariance in images implies that all patches of an image will be treated in the same manner.

  • Locality means that only a small neighborhood of pixels will be used to compute the corresponding hidden representations.

  • In image processing, convolutional layers typically require many fewer parameters than fully-connected layers.

  • CNNs are a special family of neural networks that contain convolutional layers.

  • Channels on input and output allow our model to capture multiple aspects of an image at each spatial location.

6.1.6. Exercises

  1. Assume that the size of the convolution kernel is \(\Delta = 0\). Show that in this case the convolution kernel implements an MLP independently for each set of channels.

  2. Why might translation invariance not be a good idea after all?

  3. What problems must we deal with when deciding how to treat hidden representations corresponding to pixel locations at the boundary of an image?

  4. Describe an analogous convolutional layer for audio.

  5. Do you think that convolutional layers might also be applicable for text data? Why or why not?

  6. Prove that \(f * g = g * f\).


