# TensorFlow Playground

## TensorFlow Playground

This post is an effort to understand how neural networks work. The visualizations are images obtained by experiments using TensorFlow Playground. Kudos to TensorFlow for making such an amazing framework!

Right now, I have added the experiments that I found the most interesting. Learning is an ongoing process and new interesting insights will be added. Most of these might be obvious to a lot of people, but, I am noting them here because all these nitty grittes are really important in knowing how to fine tune the actual network, or in other words, how the inputs get translated to the outputs and how the weights are being assigned. If there are any errors please point them out and I will fix them.

Note that the initialization of weights is completely random so it is not possible to reproduce these experiments under the exact same conditions but similar results might be possible.

### Legend for experiments

Blue is a positive weight or input. (Understand as overlap)

Orange is a negative weight or input. (Understand as intersection)

The dashed lines represent the weight matrix, in the interactive version one can hover over the inputs and outputs to see the magnitude of the weights learnt.

The tables give the learnt equations for:

- Input units (x1, x2, sin(x1), sin(x2), x1x2 which is denoted as x3, x1^2, x2^2)
- Weights for Input units (w1…..wn)
- Hidden units in First hidden layer (h11…h1n)
- Weights for First hidden layer (w11…..w1n)
- Hidden units in Second hidden layer (h21…..h2n)
- Weights for Second hidden layer (w21…..w2n)
- Output units (o1…..on)

These equations are useful for predicting what you think the output should look like and what the neural network outputs. This is a good thought experiment so I have included the equations that I came up with as well. I used the equations using simple digital logic rules and gates.

### Understanding Overlap

Here’s how I understood the overlap idea, manually using paper and pen(cil), draw the distributions, remove the ones that have negative weight by performing an intersection of sets, and similarly, when there is positive or overlap add those distributions.

### Experiments

#### 1

If we give input features as below, give abundant hidden units in two hidden layers and have two outputs as we have two kinds of inputs. Lets see what we can learn.

First Hidden Layer | Equation |
---|---|

h11 | x1w1 - x2w2 |

h12 | - x1w1 + x2w2 |

h13 | x1w1 - x2w2 |

h14 | x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | -h11w11 -h12w12 +h13w13 -h14w14 |

h22 | h11w11 +h12w12 -h13w13 +h14w14 |

Output Layer | Equation |
---|---|

o1 | -h21w21 + h22w22 |

#### 2

Making the test samples more visible. The test samples now have a solid outline of the same color which helps in better understanding.

First Hidden Layer | Equation |
---|---|

h11 | x1w1 - x2w2 |

h12 | - x1w1 + x2w2 |

h13 | x1w1 - x2w2 |

h14 | x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | -h11w11 -h12w12 +h13w13 -h14w14 |

h22 | h11w11 +h12w12 -h13w13 +h14w14 |

Output Layer | Equation |
---|---|

o1 | -h21w21 + h22w22 |

#### 3

Changing the regularization to L1 we notice that only one part of the data distribution can be identified by the classifier. This was expected because L1 acts like a rectifier and keeps only the positive parts alive. L1 is a measure of the least absolute deviation from the target values. L1 gives a smaller magnitude of error and is robust, however it will not try to fit the errors.

First Hidden Layer | Equation |
---|---|

h11 | - x1w1 + x2w2 |

h12 | x1w1 + x2w2 |

h13 | x1w1 + x2w2 |

h14 | - x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 -h12w12 -h13w13 -h14w14 |

h22 | - h11w11 - h12w12 -h13w13 +h14w14 |

Output Layer | Equation |
---|---|

o1 | +h21w21 - h22w22 |

#### 4

Changing the regularization to L2 which minimizes the least square error, we notice that the output is much better classified. This again was as expected because the L2 loss penalizes the outliers and because of the square term outputs an even larger loss value. So, L2 can be considered a good regularizer for this data distribution and this also serves as a confirmation that the L2 loss is sensitive to outliers.

First Hidden Layer | Equation |
---|---|

h11 | + x1w1 - x2w2 |

h12 | - x1w1 - x2w2 |

h13 | - x1w1 + x2w2 |

h14 | - x1w1 + x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 - h12w12 - h13w13 - h14w14 |

h22 | - h11w11 + h12w12 + h13w13 - h14w14 |

Output Layer | Equation |
---|---|

o1 | +h21w21 - h22w22 |

#### 5

Now, we decrease the number of hidden units to just one. Given the data distribution we dont expect it to learn much because one single line or a linear classifier cant distinguish four clusters.

First Hidden Layer | Equation |
---|---|

h11 | + x1w1 + x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 |

Output Layer | Equation |
---|---|

o1 | - h21w21 |

#### 6

By increasing the number of hidden units in the second layer we are not helping in learning and this is confirmed as shown in the diagram below.

First Hidden Layer | Equation |
---|---|

h11 | - x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 |

h22 | - h11w11 |

Output Layer | Equation |
---|---|

o1 | + h21w21 + h22w22 |

#### 7

A similar experiment by increasing the number of hidden units in the first hidden layer.

First Hidden Layer | Equation |
---|---|

h11 | - x1w1 - x2w2 |

h12 | - x1w1 + x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 - h12w12 |

h22 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | + h21w21 + h22w22 |

#### 8

Now, we experiment on a different data distribution. Slightly more interesting data distribution. The learned classifier is not bad. It shows where the different test data points are clearly.

First Hidden Layer | Equation |
---|---|

h11 | - sin(x1)w1 |

h12 | - sin(x1)w1 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 + h12w12 |

h22 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 + h22w22 |

#### 9

Now, with two sinusoidal input distributions which are perpendicular to each other. The result is not a linear classifier but it classifies the test datapoints. This I find very interesting because it confirms the fact that a neural network is like an complex network which can solve problems in which the data is not linearly separable.

First Hidden Layer | Equation |
---|---|

h11 | - sin(x1)w1 - sin(x2)w2 |

h12 | + sin(x1)w1 + sin(x2)w2 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 + h12w12 |

h22 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | + h21w21 + h22w22 |

#### 10

This data distribution is like a x1 * x2, so let this be denoted be x3 and its corresponding weight by w3.

First Hidden Layer | Equation |
---|---|

h11 | - x3w3 |

h12 | - x3w3 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 - h12w12 |

h22 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 - h22w22 |

#### 11

By increasing the number of hidden units, there is no increase or change in the classifiers which is as expected through the equations.

First Hidden Layer | Equation |
---|---|

h11 | - x3w3 |

h12 | + x3w3 |

h13 | + x3w3 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 + h12w12 + h13w13 |

h22 | + h11w11 + h12w12 - h13w13 |

Output Layer | Equation |
---|---|

o1 | - h21w21 + h22w22 |

#### 12

By increasing the focus on some parts of the data distribution we can get a better classification. Think about this as an overlap between the two inputs x2 and x3.

First Hidden Layer | Equation |
---|---|

h11 | - x2w2 + x3w3 |

h12 | - x2w2 + x3w3 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 |

#### 13

This is for a new series of experiments with a slightly cleaner data distribution. The inputs are x1 and x2, there are two hidden layers. The test loss and training loss are both 0.0! A good case of overfitting! The model is being fitted on an easy and linearly separable dataset with too many hidden units and hidden layers. Its good to remember that a general thought experiement is to overfit on the training set just to make sure that we are able to learn.

First Hidden Layer | Equation |
---|---|

h1 | + x1w1 + x2w2 |

h2 | + x1w1 + x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | -h1 -h2 |

Output Layer | Equation |
---|---|

o1 | -h21 |

#### 14

Changing the data distribution helps to add a little noise (I’m not sure if noise is the right word here), it adds some sort of disturbance but not noise in the actual sense of bad input.

First Hidden Layer | Equation |
---|---|

h1 | + x1w1 + x2w2 |

h2 | - x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | -h1 -h2 |

Output Layer | Equation |
---|---|

o1 | +h21 |

#### 15

With the same data distribution we aim at building a better classifier. The experiments for these are below:

#### 16

To aid in this process, lets use another hidden unit in the first layer.

First Hidden Layer | Equation |
---|---|

h11 | - x1w1 - x3w3 |

h12 | + x1w1 - x2w2 |

h13 | - x1w1 + x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 - h12w12 - h13w13 |

Output Layer | Equation |
---|---|

o1 | - h21w21 |

#### 17

Now, remove the added unit as it does not seem to help and add another unit to the second hidden layer. This might help because its like adding another level of knowledge on top of the first hidden layer. Using the visualizations we see that the hunch was correct and it has helped.

First Hidden Layer | Equation |
---|---|

h11 | - x1w1 + x2w2 |

h12 | + x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 + h12w12 |

h21 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | + h21w21 - h21w21 |

#### 18

But were the better results because of adding the extra hidden unit ? or just decreasing the hidden unit from the first hidden layer? Also what is the minimum number of units per layer that we can use to learn this data distribution.

First Hidden Layer | Equation |
---|---|

h11 | + x1w1 + x2w2 |

h12 | - x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 + h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 |

#### 19

Now we change the input features again. Focus on the training and test loss and see that both undergo oscillations but still the test error is still alright in comparison to the training error. What I mean by this is that the test error is almost near the ceiling created by the training error. The weight of x1^2 is given by w3.

First Hidden Layer | Equation |
---|---|

h11 | + x1w1 - x2w2 - x1^2w3 |

h12 | - x1w1 - x2w2 + x1^2w3 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 |

#### 20

Focus on the training and test loss and see that both undergo a lot of oscillations. This is expected because if we try to overlap the three inputs we see that the distribution of the blue weights or positive data points is almost everywhere which will confuse the classifier and cause it to oscillate a lot. Recall how a perceptron works and in the end each is a linear classifier.

First Hidden Layer | Equation |
---|---|

h11 | + x1w1 - x2w2 - x1^2w3 |

h12 | - x1w1 - x2w2 + x1^2w3 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 - h12w12 |

h22 | + h11w11 + h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 + h22w22 |

#### 21

Add x2^2 as well, weight denoted by w4.

First Hidden Layer | Equation |
---|---|

h11 | + x1w1 - x2w2 + x1^2w3 - x2^2w4 |

h12 | + x1w1 + x2w2 + x1^2w3 + x2^2w4 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 + h12w12 |

h22 | + h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | - h21w21 + h22w22 |

#### 22

Keeping the same distribution, we remove the x1^2 and x2^2. We should get something similar to the experiments we have already seen before.

First Hidden Layer | Equation |
---|---|

h11 | - x1w1 + x2w2 |

h12 | + x1w1 - x2w2 |

Second Hidden Layer | Equation |
---|---|

h21 | + h11w11 + h12w12 |

h22 | - h11w11 - h12w12 |

Output Layer | Equation |
---|---|

o1 | + h21w21 - h22w22 |

#### 23

This part was just to check for the possible outputs. The input distribution is *really* hard!!!! The test loss is much higher and does not decrease as much as the training loss does, which is as expected.

- x3 denotes x1x2 and weight is by w3
- w1 denotes weights for sin(x1)
- w2 denotes weights for sin(x2)

First Hidden Layer | Equation |
---|---|

h11 | - x3w3 - sin(x1)w1 - sin(x2)w2 |

h12 | - x3w3 + sin(x1)w1 + sin(x2)w2 |

h13 | - x3w3 - sin(x1)w1 - sin(x2)w2 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 - h12w12 - h13w13 |

h22 | + h11w11 - h12w12 - h13w13 |

Output Layer | Equation |
---|---|

o1 | + h21w21 + h22w22 |

#### 24

What happens if we decrease the hidden units from the first hidden layer, lets check it out. Intuitively I dont think it will make any difference because the data distribution is too hard to learn.

- x3 denotes x1x2 and weight is by w3
- w1 denotes weights for sin(x1)
- w2 denotes weights for sin(x2)

First Hidden Layer | Equation |
---|---|

h11 | + x3w3 - sin(x1)w1 + sin(x2)w2 |

h12 | - x3w3 + sin(x1)w1 + sin(x2)w2 |

Second Hidden Layer | Equation |
---|---|

h21 | - h11w11 + h12w12 |

h22 | - h11w11 + h12w12 |

Output Layer | Equation |
---|---|

o1 | + h21w21 - h22w22 |

As expected it does not learn as much and the training error is much higher in comparison to experiment 23.

I have more experiments by using just a single layer neural network which has one hidden layer. All the above experiments were for two hidden layers which I thought would be more interesting to post.

blog comments powered by Disqus