Given a dataset like:
$\{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \ldots, (x^{(m)},y^{(m)})\}$
We want:
$\hat{y}^{(i)} \approx y^{(i)}$
| Concept | Description |
|---|---|
| $m$ | Number of examples in the dataset |
| $x^{(i)}$ | Input of the $i$-th training example |
| $\hat{y}^{(i)}$ | Predicted output for the $i$-th example |
| Loss Function | A function that computes the error for a single training example |
| Cost Function | The average of the loss function over the entire training set |
| Convex Function | A function with a single local minimum, which is therefore the global minimum |
| Non-Convex Function | A function with many different local minima |
| Gradient Descent | An iterative optimization method that we use to converge to the minimum of the cost function |
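To make the loss/cost distinction concrete, here is a minimal sketch, assuming the logistic-regression cross-entropy loss (the specific loss is an assumption for illustration; the names `loss` and `cost` are hypothetical):

```python
import numpy as np

def loss(y_hat, y):
    # Loss: error for a SINGLE training example (cross-entropy, assumed here)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    # Cost: AVERAGE of the per-example losses over the whole training set
    return np.mean(loss(y_hat, y))

# Toy predictions and labels for three examples
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.8])
print(cost(y_hat, y))
```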
In other words: the Cost Function measures how well our parameters $w$ and $b$ are doing on the training set, so the best $w$ and $b$ are the values that minimize $J(w, b)$.
General Formula:
$w := w - \alpha\frac{dJ(w,b)}{dw}$

$b := b - \alpha\frac{dJ(w,b)}{db}$
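The update rule above can be sketched in a few lines. This is a toy example, not a real model: the convex cost $J(w,b) = (w-3)^2 + (b+1)^2$ and its gradients are assumptions chosen so the minimum is known to be at $w = 3$, $b = -1$:

```python
# Toy convex cost J(w, b) = (w - 3)^2 + (b + 1)^2 (assumed for illustration)
def dJ_dw(w, b):
    return 2 * (w - 3)

def dJ_db(w, b):
    return 2 * (b + 1)

def gradient_descent(alpha=0.1, steps=200):
    w, b = 0.0, 0.0
    for _ in range(steps):
        w = w - alpha * dJ_dw(w, b)  # w := w - alpha * dJ/dw
        b = b - alpha * dJ_db(w, b)  # b := b - alpha * dJ/db
    return w, b

w, b = gradient_descent()
print(w, b)  # converges toward the minimum at (3, -1)
```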
$\alpha$ (alpha) is the Learning Rate.

It is a positive scalar that determines the step size of each gradient-descent iteration, based on the estimated error each time the model weights are updated; it therefore controls how quickly or slowly a neural network model learns a problem.
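A quick sketch of how $\alpha$ affects convergence speed, reusing the same assumed toy cost $J(w) = (w-3)^2$ (the function `steps_to_converge` is hypothetical, for illustration only):

```python
# Count gradient-descent steps until w is within tol of the minimum at w = 3,
# for the assumed toy cost J(w) = (w - 3)^2 with gradient 2 * (w - 3).
def steps_to_converge(alpha, tol=1e-6, max_steps=10_000):
    w = 0.0
    for step in range(max_steps):
        if abs(w - 3) < tol:
            return step
        w = w - alpha * 2 * (w - 3)
    return max_steps

print(steps_to_converge(0.01), steps_to_converge(0.3))
```

A larger (but still stable) $\alpha$ reaches the minimum in far fewer steps; too large an $\alpha$, however, can overshoot and diverge.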
More on Learning Rate