diff --git a/neural_nets_intro.ipynb b/neural_nets_intro.ipynb
index 031aa982320c0b38e6fcc4947b9feff0adcd02ad..95a8f4e02392ad459b0da86f843f3bb9bbbb18cd 100644
--- a/neural_nets_intro.ipynb
+++ b/neural_nets_intro.ipynb
@@ -376,15 +376,66 @@
     "Generally **crossentropy** and **mean squared error** loss functions are used for classification and regression problems, respectively.\n",
     "\n",
     "### Gradient based learning\n",
-    "As mentioned above, once we have decided upon a loss function we want to solve an **optimization problem** which minimizes this loss by updating the weights of the network. This is how the learning actually happens.\n",
+    "As mentioned above, once we have decided upon a loss function, we want to solve an **optimization problem** which minimizes this loss by updating the weights of the network. This is how learning happens in a NN.\n",
     "\n",
-    "The most popular optimization methods used in Neural Network training are some sort of **Gradient-descent** type methods, for e.g. gradient-descent, RMSprop, adam etc. \n",
+    "The most popular optimization methods used in Neural Network training are some **Gradient-descent (GD)** type methods, such as gradient-descent, RMSprop and Adam. \n",
     "**Gradient-descent** uses partial derivatives of the loss function with respect to the network weights and a learning rate to updates the weights such that the loss function decreases and hopefully after some iterations reaches its (Global) minimum.\n",
     "\n",
-    "First, the loss function and its derivative are computed at the output node and this signal is propogated backwards, using chain rule, in the network to compute the partial derivatives. Hence, this method is called **Backpropagation**.\n",
+    "First, the loss function and its derivative are computed at the output node, and this signal is propagated backwards, using the chain rule, in the network to compute the partial derivatives. Hence, this method is called **Backpropagation**.\n",
     "\n",
-    "Depending of\n",
+    "One way to perform a single GD pass is to compute the partial derivatives using all the samples in our data, computing average derivatives and using them to update the weights. This is called **Batch gradient descent**. However, in deep learning we mostly work with very big datasets and using batch gradient descent can make the training very slow!\n",
     "\n",
+    "The other extreme is to randomly shuffle the dataset and advance a pass of GD with the gradients computed using only **one** sample at a time. This is called **Stochastic gradient descent**.\n",
+    "\n",
+    "In practice, an approach in-between these two is used. The entire dataset is divided into **m** batches and these are used one by one to compute the derivatives and apply GD. This technique is called **Mini-batch gradient descent**. \n",
+    "\n",
+    "One pass through the entire training dataset is called **1 epoch** of training.\n",
     "\n",
     "\n",
     "### Activation Functions\n",