@komalvenkatesh4527

If you are slightly lost, it's possible that you have stumbled onto this video directly without watching the previous three on exponentially weighted averages. Once you watch those and come back here, it will make perfect sense.

@klausdupont6335

Explaining the function of each parameter (especially the role of β in both the β and (1 - β) terms) is extremely useful for understanding the role of momentum in gradient descent. Really appreciate that!

@AdmMusicc

I directly jumped to this video for reference purposes. Can someone explain what Vdw and Vdb are? Is V our loss function?

@wifi-YT

Andrew’s great. Was he right when he said at 8:55 that under BOTH versions (with or without the (1 - β) term) 0.9 is the commonly used value for β? After all, in the version using the (1 - β) term, the left-term multiplier (0.9) is 9 times LARGER than the right-term multiplier (0.1), whereas in the alternative version without the (1 - β) term, the left-term multiplier (0.9) is actually SMALLER than the right-term multiplier (1), yielding a vastly different weighted average. Thus, it can't be correct that under both versions the commonly used value for β is the same 0.9. Does anyone agree? In fact, it would seem that under the latter version β should be 9.0 rather than 0.9, so that the weighting is the equivalent of Andrew's preferred former version.
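For what it's worth, here is a tiny numerical check in Python (with a made-up constant gradient, purely for illustration). It suggests both versions decay past gradients at the same rate β, and the version without (1 - β) just comes out larger by roughly a factor of 1/(1 - β), a difference that is usually absorbed by retuning the learning rate rather than by changing β:

beta = 0.9
dW = 1.0                         # pretend the gradient is constant, for illustration only
v_with, v_without = 0.0, 0.0
for t in range(50):
    v_with = beta * v_with + (1 - beta) * dW   # version with the (1 - beta) term
    v_without = beta * v_without + dW          # version without it
print(v_with)                    # approaches 1.0 (same scale as dW)
print(v_without)                 # approaches 10.0, i.e. about 1/(1 - beta) times larger
print(v_without * (1 - beta))    # matches v_with, so the difference is only a rescaling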

@jerekabi8480

This is really helpful. I can now fully understand SGD with momentum by understanding the concept of exponentially weighted averages. Thanks a lot.

@joshuasansom_software

Thanks so much! Love that feeling when it all just clicks.

@kareemjeiroudi1964

I like this analogy of a ball and a basket. Thanks for this generous video!

@deraktdar

The point about (1 - β) being excluded in some frameworks was very helpful. This kind of thing is why reading the methodology section of papers is totally useless - unscientific and unreproducible (the only way to reproduce is to copy their exact code for a specific framework). Often the exact hyper-parameter values are documented, but the details of how they are actually used in the framework are up in the air.
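To make the two conventions concrete, here is a minimal sketch (numpy, toy values; W, dW, V_dW and learning_rate are placeholder names, not any particular framework's API):

import numpy as np

beta, learning_rate = 0.9, 0.01
W = np.zeros(3)               # toy parameter vector, just for illustration
dW = np.ones(3)               # toy gradient
V_dW = np.zeros_like(W)

# Convention in the video: V_dW is an exponentially weighted average of dW.
V_dW = beta * V_dW + (1 - beta) * dW
W = W - learning_rate * V_dW

# Convention in some frameworks: the (1 - beta) factor is dropped, so V_dW ends up
# roughly 1/(1 - beta) times larger and the learning rate has to be retuned to compensate.
V_dW = beta * V_dW + dW
W = W - learning_rate * V_dW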

@abhishekbhatia651

Where is the bias correction for the initial steps? Or does it not matter, since we are going to take a large number of steps anyway?
Edit: My bad, he explains it at 6:50
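For anyone else looking for it, here is a minimal sketch of what the bias correction would look like if you did apply it (toy scalar values, placeholder names):

beta = 0.9
V_dW, dW = 0.0, 1.0            # toy values, just for illustration
for t in range(1, 11):
    V_dW = beta * V_dW + (1 - beta) * dW
    V_dW_corrected = V_dW / (1 - beta ** t)    # corrects the bias from initializing V_dW at 0
    print(t, round(V_dW, 3), round(V_dW_corrected, 3))
# The uncorrected value starts well below dW and only catches up after a few dozen steps,
# which is why the correction is often skipped when training runs for many iterations.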

@ahmedb2559

Thank you!

@forcescode4962

But where did we use the bias correction formula?

@thehazarika

I have learnt so much from you. I really want to thank you for that. I really appreciate your work. I would suggest you think about beginner learners like me, explain the concept with some analogy, and provide some aid to lead the trail of thinking in an appropriate direction.

@komalvenkatesh4527

Perfect, thank you; showing the vectors in the vertical direction cancelling out and the horizontal vectors adding up was the key point that cleared everything up for me.
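A tiny numerical sketch of that picture (made-up 2-D gradients where the vertical component flips sign every step and the horizontal component stays constant):

import numpy as np

beta = 0.9
V = np.zeros(2)
for t in range(20):
    dW = np.array([1.0, (-1.0) ** t])     # [horizontal, vertical] gradient components
    V = beta * V + (1 - beta) * dW        # exponentially weighted average of the gradients
print(V)   # the horizontal component heads toward 1.0, the vertical one stays small (about +/-0.05)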

@rajupowers

What are Vdw and Vdb? - @7:15

@sumithhh9379

In Gilbert Strang's lecture, he only mentioned beta, not (1 - beta). Will the learning rate be used as 1/(1 - alpha) if (1 - beta) is not used?
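If I'm reading the two recursions right, the version without the (1 - beta) term gives a V_without that is roughly V_with / (1 - beta), so alpha * V_without is roughly (alpha / (1 - beta)) * V_with. In other words, dropping (1 - beta) acts like multiplying the effective learning rate by 1/(1 - beta); beta itself can stay the same, but alpha typically has to be scaled down by (1 - beta) to get an equivalent step.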

@debajyotisg

Could someone explain this to me? For simplicity, let's consider only one hidden layer. For a given mini-batch (let's say t = 1), I can calculate the loss and compute the gradients 'dw' and 'db' ('dw' and 'db' are vectors whose size depends on the number of nodes in the layers). When I want to calculate Vdw (which I expect has the same dimensions as dw), do I average over the elements of dw? I.e., Vdw_0 = 0, Vdw_1 = beta*Vdw_0 + (1 - beta)*dw_1, ... and so on?
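If it helps, here is a minimal sketch of how I understand it (toy shapes, made-up names, not the course code): Vdw has the same shape as dw and the update is elementwise, so there is no averaging over the elements of dw; the averaging is over mini-batches (over t), with one running value per parameter.

import numpy as np

beta = 0.9
dW_shape = (4, 3)                 # toy layer: 4 units, 3 inputs (made-up numbers)
V_dW = np.zeros(dW_shape)         # Vdw_0 = 0, same shape as dW

for t in range(1, 101):           # loop over mini-batches
    dW = np.random.randn(*dW_shape)           # stand-in for the gradient on mini-batch t
    V_dW = beta * V_dW + (1 - beta) * dW      # elementwise running average, per parameter

print(V_dW.shape)   # (4, 3), the same shape as dW throughout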

@michaelscheinfeild9768

Very good explanation of gradient descent with momentum; the example of the ball with speed and acceleration is very intuitive. Thank you.

@shwonder

Is there a need for input normalisation when using gradient descent with momentum? My intuition says that normalisation will add very little value (if any) in this case. Is this correct?

@fatima-ezzahrabadaoui4198

What are dw and db?

@temisegun8631

Thank you, this was helpful.