From 6971870d6072805377814044826323f2940133fe Mon Sep 17 00:00:00 2001
From: Tarun Chadha <tarunchadha23@gmail.com>
Date: Sun, 28 Apr 2019 23:44:12 +0200
Subject: [PATCH] updated theory

---
 neural_nets_intro.ipynb | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/neural_nets_intro.ipynb b/neural_nets_intro.ipynb
index 031aa98..95a8f4e 100644
--- a/neural_nets_intro.ipynb
+++ b/neural_nets_intro.ipynb
@@ -376,15 +376,20 @@
     "Generally **crossentropy** and **mean squared error** loss functions are used for classification and regression problems, respectively.\n",
     "\n",
     "### Gradient based learning\n",
-    "As mentioned above, once we have decided upon a loss function we want to solve an **optimization problem** which minimizes this loss by updating the weights of the network. This is how the learning actually happens.\n",
+    "As mentioned above, once we have decided upon a loss function, we want to solve an **optimization problem** that minimizes this loss by updating the weights of the network. This is how learning happens in a NN.\n",
     "\n",
-    "The most popular optimization methods used in Neural Network training are some sort of **Gradient-descent** type methods, for e.g. gradient-descent, RMSprop, adam etc. \n",
+    "The most popular optimization methods used in Neural Network training are **Gradient-descent (GD)** type methods, such as plain gradient descent, RMSprop, and Adam.\n",
     "**Gradient-descent** uses partial derivatives of the loss function with respect to the network weights and a learning rate to updates the weights such that the loss function decreases and hopefully after some iterations reaches its (Global) minimum.\n",
     "\n",
-    "First, the loss function and its derivative are computed at the output node and this signal is propogated backwards, using chain rule, in the network to compute the partial derivatives. Hence, this method is called **Backpropagation**.\n",
+    "First, the loss function and its derivative are computed at the output node, and this signal is propagated backwards through the network, using the chain rule, to compute the partial derivatives. Hence, this method is called **Backpropagation**.\n",
     "\n",
-    "Depending of\n",
+    "One way to perform a single GD pass is to compute the partial derivatives using all the samples in our data, average them, and use the averaged derivatives to update the weights. This is called **Batch gradient descent**. However, in deep learning we mostly work with very large datasets, and using batch gradient descent can make training very slow!\n",
     "\n",
+    "The other extreme is to randomly shuffle the dataset and perform a GD update with the gradients computed from only **one** sample at a time. This is called **Stochastic gradient descent**.\n",
+    "\n",
+    "In practice, an approach in between these two extremes is used. The entire dataset is divided into **m** batches, and these are used one by one to compute the derivatives and apply a GD update. This technique is called **Mini-batch gradient descent**.\n",
+    "\n",
+    "One pass through the entire training dataset is called **1 epoch** of training.\n",
     "\n",
     "\n",
     "### Activation Functions\n",
--
GitLab
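
The batch / stochastic / mini-batch gradient descent loop described in the added cell can be sketched in a few lines of NumPy. This is a minimal illustration only, not part of the notebook or the patch; the model (a linear fit trained with a mean-squared-error loss) and all names (`X`, `y`, `w`, `b`, `batch_size`, `n_epochs`) are assumptions chosen for the example.

```python
# Minimal sketch of mini-batch gradient descent (not from the notebook).
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3*x + 2 plus a little noise.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=1000)

w = np.zeros(1)      # weight (one feature)
b = 0.0              # bias
lr = 0.1             # learning rate
batch_size = 32      # mini-batch size; len(X) -> batch GD, 1 -> stochastic GD
n_epochs = 20        # one epoch = one full pass over the training data

for epoch in range(n_epochs):
    # Shuffle once per epoch so every mini-batch is a random subset.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_mb, y_mb = X[idx], y[idx]

        # Mean-squared-error loss gradient, averaged over the mini-batch.
        err = X_mb @ w + b - y_mb
        grad_w = 2.0 * (X_mb.T @ err) / len(idx)
        grad_b = 2.0 * err.mean()

        # Gradient-descent update: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b

print(f"learned w = {w[0]:.2f}, b = {b:.2f}")  # should be close to 3 and 2
```

Setting `batch_size = len(X)` recovers batch gradient descent and `batch_size = 1` recovers stochastic gradient descent, while each run of the outer loop is one epoch, matching the three variants and the epoch definition introduced in the cell.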