---
title: "Understanding Optimizers in Machine Learning"
format:
  revealjs:
    auto-animate: true
editor: visual
---

# Understanding Optimizers in Machine Learning

## Overview

This presentation will dive into various optimizers used in training neural networks. We'll explore their paths on a loss landscape and understand their distinct behaviors through visual examples.


---

## What is an Optimizer?

Optimizers are algorithms that update a neural network's trainable parameters, such as its weights and biases, to minimize the loss function; adaptive optimizers also adjust the effective step size as training progresses. A well-chosen optimizer reaches a good solution faster and more reliably.
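
Conceptually, every optimizer repeats the same loop: compute the gradient of the loss and use it to nudge the parameters. A minimal illustrative sketch with a single parameter `w` and loss `w**2`:

```{python}
#| echo: true

# The loop every optimizer shares: compute the gradient of the loss,
# then use it to update the parameters (single illustrative parameter).
w = 5.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * w                  # gradient of w**2
    w = w - learning_rate * grad  # each optimizer refines this update step
```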


---

## Key Concepts

-   **Gradient Descent**
-   **Stochastic Gradient Descent (SGD)**
-   **Momentum**
-   **Adam** 

Each optimizer will be visualized to illustrate how they navigate the loss landscape during the training process.


---

## Gradient Descent

### Pros and Cons


::: {.columns}
::: {.column}
- **Pros**
    - Simple and easy to understand.
    - Effective for small datasets.
:::

::: {.column}
- **Cons**
    - Slow convergence.
    - Sensitive to the choice of learning rate.
    - Can get stuck in local minima.
:::
:::
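
The update itself is a single step against the gradient. A minimal sketch of one step on the quadratic loss used later in this presentation (the learning rate value is illustrative):

```{python}
#| echo: true

# One gradient-descent step on the quadratic loss f(x, y) = x**2 + y**2.
lr = 0.1                          # learning rate (illustrative value)
x, y = 2.0, 1.5                   # current parameters
gx, gy = 2 * x, 2 * y             # gradient of the loss
x, y = x - lr * gx, y - lr * gy   # move against the gradient
```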


---

## Stochastic Gradient Descent (SGD)

### Pros and Cons
::: {.columns}
::: {.column}
- **Pros**
    - Faster convergence than standard gradient descent.
    - Less memory intensive as it uses mini-batches.
:::

::: {.column}
- **Cons**
    - Variability in the training updates can lead to unstable convergence.
    - Requires careful tuning of learning rate.
:::
:::
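
The update rule is the same as in gradient descent, but the gradient is estimated from a single example or mini-batch. In the sketch below, a random perturbation stands in for that sampling noise (purely illustrative):

```{python}
#| echo: true

import numpy as np

# SGD step: same rule as gradient descent, but the gradient is a noisy
# estimate rather than the exact full-dataset gradient.
rng = np.random.default_rng(0)
w = np.array([2.0, 1.5])
noisy_grad = 2 * w + rng.normal(scale=0.5, size=2)  # illustrative noise term
w = w - 0.1 * noisy_grad
```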

---

## Momentum

### Pros and Cons
::: {.columns}
::: {.column}
- **Pros**
- Accelerates SGD along consistent gradient directions, leading to faster convergence.
    - Reduces oscillations.
:::

::: {.column}
- **Cons**
    - Introduces a new hyperparameter to tune (momentum coefficient).
    - Can overshoot if not configured properly.
:::
:::
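
A minimal sketch of the momentum update, with illustrative hyperparameter values (some formulations scale the gradient by `1 - beta`):

```{python}
#| echo: true

# Momentum accumulates a velocity that smooths successive gradients,
# so consistent directions build up speed while oscillations cancel out.
lr, beta = 0.1, 0.9   # learning rate and momentum coefficient (illustrative)
w, v = 2.0, 0.0       # parameter and velocity
for _ in range(5):
    grad = 2 * w      # gradient of w**2
    v = beta * v + grad
    w = w - lr * v
```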

---

## Adam (Adaptive Moment Estimation)

### Pros and Cons

::: {.columns}
::: {.column}
- **Pros**
    - Computationally efficient.
    - Works well with large datasets and high-dimensional spaces.
    - Adjusts the learning rate automatically.
:::

::: {.column}
- **Cons**
    - Can lead to suboptimal solutions in certain cases.
    - Requires additional memory and computation to maintain first- and second-moment estimates for every parameter.
:::
:::
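
A minimal sketch of a single Adam step, using the standard default hyperparameters for illustration:

```{python}
#| echo: true

# One Adam step: bias-corrected first and second moment estimates
# give each parameter its own effective step size.
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w, m, v, t = 2.0, 0.0, 0.0, 1
grad = 2 * w                              # gradient of w**2
m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentred variance)
m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)  # bias correction
w = w - lr * m_hat / (v_hat**0.5 + eps)
```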

---

## RMSprop

RMSprop is an adaptive learning-rate method designed to counteract Adagrad's radically diminishing learning rates.

### Pros and Cons

::: {.columns}
::: {.column}
- **Pros**
    - Balances the step size decrease, making it more robust.
    - Works well in online and non-stationary settings.
:::

::: {.column}
- **Cons**
    - Still requires careful tuning of learning rate.
    - Less often the default choice than Adam in modern practice.
:::
:::
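
A minimal sketch of one RMSprop step, with illustrative hyperparameter values:

```{python}
#| echo: true

# One RMSprop step: divide the gradient by a running root-mean-square
# of recent gradients, so the step size adapts per parameter.
lr, decay, eps = 0.01, 0.9, 1e-8
w, sq_avg = 2.0, 0.0
grad = 2 * w                                     # gradient of w**2
sq_avg = decay * sq_avg + (1 - decay) * grad**2  # running average of grad**2
w = w - lr * grad / (sq_avg**0.5 + eps)
```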

---

## AdaMax

AdaMax is a variant of Adam based on the infinity norm, which can be more stable than Adam's L2-norm-based update.

### Pros and Cons

::: {.columns}
::: {.column}
- **Pros**
    - Suitable for datasets with outliers and noise.
    - More stable than Adam in certain scenarios.
:::

::: {.column}
- **Cons**
    - Less commonly used and tested than Adam.
    - May require more hyperparameter tuning compared to Adam.
:::
:::
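
A minimal sketch of one AdaMax step, with illustrative hyperparameter values:

```{python}
#| echo: true

# One AdaMax step: Adam's second-moment term is replaced by an
# infinity-norm running maximum of past gradient magnitudes.
lr, beta1, beta2, t = 0.002, 0.9, 0.999, 1
w, m, u = 2.0, 0.0, 0.0
grad = 2 * w                              # gradient of w**2
m = beta1 * m + (1 - beta1) * grad        # first moment, as in Adam
u = max(beta2 * u, abs(grad))             # infinity-norm accumulator
w = w - (lr / (1 - beta1**t)) * m / u
```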

---



## Loss Function and Its Gradient

We will use a simple quadratic function as our loss landscape to visualize how different optimizers navigate towards the minimum.

```{python}
#| echo: true

# Define the quadratic loss function and its analytic gradient
def loss_function(x, y):
    return x**2 + y**2

def gradient(x, y):
    return 2 * x, 2 * y
```


---

## Simulating Optimizer Paths

Let's simulate the paths that different optimizers take on the loss surface.
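
One possible way to generate these paths is sketched below. The helper names (`run_optimizer`, `sgd`, `momentum`, `adam`) and hyperparameter values are illustrative choices, and the code relies on the `gradient` function defined in the previous chunk.

```{python}
#| echo: true

import numpy as np

def run_optimizer(step_fn, start=(-4.0, 3.5), steps=60):
    # Follow an optimizer from `start`, recording every point it visits.
    pos, state = np.array(start), {}
    path = [pos.copy()]
    for _ in range(steps):
        grad = np.array(gradient(*pos))        # gradient from the earlier chunk
        pos, state = step_fn(pos, grad, state)
        path.append(pos.copy())
    return np.array(path)

def sgd(pos, grad, state, lr=0.1):
    return pos - lr * grad, state

def momentum(pos, grad, state, lr=0.1, beta=0.9):
    v = beta * state.get("v", 0.0) + grad
    return pos - lr * v, {"v": v}

def adam(pos, grad, state, lr=0.3, b1=0.9, b2=0.999, eps=1e-8):
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * grad
    v = b2 * state.get("v", 0.0) + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return pos - lr * m_hat / (np.sqrt(v_hat) + eps), {"t": t, "m": m, "v": v}

paths = {name: run_optimizer(fn)
         for name, fn in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]}
```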


---

## Visualizing the Optimizer Paths

This visualization shows the paths taken by SGD, Momentum, and Adam through the loss landscape.
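
One possible way to draw this figure with matplotlib is sketched below; it relies on `loss_function` and the `paths` dictionary from the previous chunks, and hides its code (`echo: false`) so that only the figure appears on the slide.

```{python}
#| echo: false

import matplotlib.pyplot as plt

# Contour plot of the quadratic loss with each optimizer's path overlaid.
xs = np.linspace(-5, 5, 200)
X, Y = np.meshgrid(xs, xs)
Z = loss_function(X, Y)

fig, ax = plt.subplots(figsize=(7, 5))
ax.contour(X, Y, Z, levels=20, cmap="viridis")
for name, path in paths.items():
    ax.plot(path[:, 0], path[:, 1], marker="o", markersize=3, label=name)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Optimizer paths on the quadratic loss")
ax.legend()
plt.show()
```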


---

## Conclusion

Understanding these paths helps us choose the right optimizer based on the specific needs of our machine learning model.