We can say that we have vectors in $n$-dimensional space, or $\mathbb{R}^n$.
We do have a function that can map these specific vectors to an output in the discrete set $\{0, 1, \dots, 9\}$.
The problem is, it cannot generalize. There's no rule or expression at play here that we can use to take a new image and map it to that same discrete set. This rule is the function we're trying to learn. But what on earth could possibly be the best fit?
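To make this concrete, here's a minimal sketch of what that "function over specific vectors" amounts to, using made-up toy vectors rather than real images. It maps the vectors we already have, but there is no rule in it to apply to anything new:

```python
import numpy as np

# Toy stand-ins for our data: a few flattened "image" vectors and their digit labels.
images = [np.array([0.0, 1.0, 0.5]), np.array([0.9, 0.1, 0.3])]
labels = [7, 2]

# The function we *do* have: a lookup table over these specific vectors.
lookup = {tuple(img): lbl for img, lbl in zip(images, labels)}

def classify(img):
    # Works only for vectors we've already stored; any new image raises a KeyError.
    # There is no rule here to generalize from -- that rule is what we want to learn.
    return lookup[tuple(img)]

print(classify(images[0]))               # 7
# classify(np.array([0.2, 0.8, 0.4]))    # KeyError: no rule, no generalization
```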
Neural networks are compositions of simple functions. Why do they work so well? How do we even get a function that can look at handwritten digits and recognize them?
This part will be covered in much better depth in January, but there's an idea that neural networks tap into called universality. To truly understand it, a session might need to be dedicated to just this property; for now, suffice it to say that neural networks are capable of universal approximation, and there are theorems you can look into that lay out all the conditions for this capability.
Back to neural networks: they're compositions of functions.
The reason they work so well is that we find or compute the right parameters such that the output of this composed function matches the expected output as closely as possible.
Now the question arises: what kind of simple functions?
The first answer could be a linear function which maps from $\mathbb{R}^n$ to $\mathbb{R}^{10}$. The output vector would simply be probabilities for the digits 0 through 9.
A linear function, in linear algebra terms, means a mapping between two vector spaces which preserves the operations of vector addition and scalar multiplication. That is, the function $f$ is such that $f(x + y) = f(x) + f(y)$ and $f(cx) = c\,f(x)$ for all vectors $x, y$ and scalars $c$.
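As a quick sanity check, here's a small sketch (using NumPy and an arbitrary random matrix, purely for illustration) verifying that a matrix map satisfies both of those properties:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear map f(x) = Ax from R^n to R^10 (n and the entries of A are arbitrary here).
n = 4
A = rng.normal(size=(10, n))
f = lambda x: A @ x

x, y = rng.normal(size=n), rng.normal(size=n)
c = 3.0

# The two defining properties: additivity and homogeneity.
print(np.allclose(f(x + y), f(x) + f(y)))   # True
print(np.allclose(f(c * x), c * f(x)))      # True
```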
Linear functions are unfortunately severely limited; what they offer is simplicity. Even though intuitively you can say that a 1 and a 0 can be combined to form a 6 or a 9, there's no well-defined rule for such combinations that will generalize. The input-output mapping is simply not linear.
This was such an issue that it resulted in an AI "winter".
The class of functions that turned out best is continuous piecewise linear functions. Now, the reason why this works is somewhat unclear. The best answer I found, at least to develop an intuition for it, was given by Gilbert Strang here:
"Linear for simplicity, continuous to model an unknown but reasonable rule and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data." – Gilbert Strang, Linear Algebra And Learning From Data (2019)
Another note: such functions were also used back in the 1940s to "model" biological neural networks by none other than the mathematician Householder. Perhaps there is some intuition behind these functions that we're missing, which we'll ideally cover when we dive deeper in January.
We're trying to fit a curve, and in many ways we need to "break" out of simply preserving those operations, as they limit what the output function can look like.
Here's a definition of such functions:
A function $f$ is said to be piecewise continuous on an interval $[a, b]$ if it is defined and continuous except possibly at a finite number of points. Furthermore, at each point of discontinuity, we require that the left- and right-hand limits exist.
At the ends of the domain, the left-hand limit is ignored at $a$ and the right-hand limit is ignored at $b$.
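A concrete instance of a continuous piecewise linear function, and the one I'll assume as the activation in the sketches below, is ReLU:

```python
import numpy as np

def relu(x):
    # Zero on (-inf, 0], identity on [0, inf); the two linear pieces
    # meet at 0, so the function is continuous everywhere.
    return np.maximum(0.0, x)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs))   # [0.  0.  0.  0.5 2. ]
```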
You will see as this progresses how such functions are used.
We use composition to create complex functions from simple ones.
In the form $F(x) = f_k(\dots f_2(f_1(x)))$.
Each function is either linear or affine, with our activation function applied on top.
An affine function is a function with a linear transformation as well as a translation.
This means that the functions are of the form $f(x) = Ax + b$, where $A$ is the linear transformation and $b$ is the translation.
These functions are then composed as mentioned above: each $f_i(x) = \sigma(A_i x + b_i)$, where $\sigma$ is the activation function.
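Here's a minimal sketch of such a composition: two affine maps with an activation in between. The sizes and random parameters are arbitrary assumptions, chosen just to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical sizes: n inputs (e.g. a flattened image), one hidden layer, 10 outputs.
n, hidden, out = 784, 32, 10
A1, b1 = 0.01 * rng.normal(size=(hidden, n)), np.zeros(hidden)
A2, b2 = 0.01 * rng.normal(size=(out, hidden)), np.zeros(out)

def network(x):
    # Composition of affine maps f_i(x) = A_i x + b_i with the activation on top;
    # the last layer is left affine (no activation) here.
    h = relu(A1 @ x + b1)
    return A2 @ h + b2

x = rng.normal(size=n)    # stand-in for one flattened image vector
print(network(x).shape)   # (10,)
```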
The goal of neural networks then becomes computing the $A_i$ and $b_i$ of the entire function such that the loss or error between the function's output and the true expected output is as small as possible.
A partial derivative is the derivative of a function of multiple variables with respect to one of the variables, while the others are kept constant. For a function $f(x, y)$,
$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h,\, y) - f(x,\, y)}{h}.$$
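As a quick sanity check, here's a sketch that approximates a partial derivative numerically for a made-up two-variable function, nudging one variable while holding the other fixed:

```python
def f(x, y):
    # A made-up two-variable function, just for illustration.
    return x**2 * y + y**3

def partial_x(f, x, y, h=1e-6):
    # Approximate df/dx at (x, y): nudge x by h while keeping y constant.
    return (f(x + h, y) - f(x, y)) / h

print(partial_x(f, 2.0, 3.0))   # ~12.0, since d/dx (x^2 y + y^3) = 2xy = 12 at (2, 3)
```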
The gradient operator is a linear operator that collects these partial derivatives into a vector, $\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$, and helps calculate the total derivative.
This gradient operator is very useful to us. If we can somehow compute the "gradient" of a loss function (a function that calculates how far off our network is from predicting the right output), i.e. its partial derivatives with respect to each weight and bias, we could use that information to update our $A_i$ and $b_i$ to better match the expected output.
If we move our weights and biases in the opposite direction of the gradient, then we get closer and closer to a minimum of the loss.
We"ll be defining this with a bit more depth and rigour in January, but this intuition is all that you"ll need so that you can feel like you learned something.
This, in some sense, is the intuition behind gradient descent.
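To make that intuition concrete, here's a minimal sketch of gradient descent on a toy linear model with a squared-error loss. The data, learning rate, and number of steps are all arbitrary assumptions, not part of the actual setup above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated by a known rule, so we can check that descent recovers it.
X = rng.normal(size=(100, 3))                  # 100 inputs with 3 features each
true_A, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ true_A + true_b

A, b = np.zeros(3), 0.0                        # weights and bias to be learned
lr = 0.1                                       # step size (arbitrary choice)

for step in range(200):
    pred = X @ A + b                           # the model's current output
    err = pred - y
    loss = np.mean(err**2)                     # squared-error loss
    grad_A = 2 * X.T @ err / len(y)            # d(loss)/dA
    grad_b = 2 * np.mean(err)                  # d(loss)/db
    A -= lr * grad_A                           # move opposite to the gradient
    b -= lr * grad_b

print(np.round(A, 2), round(b, 2))             # close to [1.5, -2.0, 0.5] and 0.7
```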