Language Model

Recurrent Neural Networks (RNN)

\[h_t = \tanh (W_{hh}h_{t-1} + W_{xh}x_{t})\]

Backpropagation Through Time (BPTT)

๊ธฐ์กด์˜ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์—์„œ๋Š” backpropagation ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ด์šฉํ•œ๋‹ค. RNN์—์„œ๋Š” ์ด๋ฅผ ์‚ด์ง ๋ณ€ํ˜•์‹œํ‚จ ๋ฒ„์ „์ธ Backpropagation Through Time (BPTT) ์„ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ๊ทธ ์ด์œ ๋Š” ๊ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์ด ๋„คํŠธ์›Œํฌ์˜ ๋งค ์‹œ๊ฐ„ ์Šคํ…๋งˆ๋‹ค ๊ณต์œ ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ฆ‰, ๊ฐ ์‹œ๊ฐ„ ์Šคํ…์˜ ์ถœ๋ ฅ๋‹จ์—์„œ์˜ gradient๋Š” ํ˜„์žฌ ์‹œ๊ฐ„ ์Šคํ…์—์„œ์˜ ๊ณ„์‚ฐ์—๋งŒ ์˜์กดํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ด์ „ ์‹œ๊ฐ„ ์Šคํ…์—๋„ ์˜์กดํ•œ๋‹ค.

RNN tradeoffs

Vanishing Gradient๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ด์œ 

ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋กœ ํ”ํžˆ ์‚ฌ์šฉ๋˜๋Š” sigmoid ํ•จ์ˆ˜์™€ tanh ํ•จ์ˆ˜์˜ ๋„ํ•จ์ˆ˜๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ๋„ํ•จ์ˆ˜์˜ ์ตœ๋Œ€๊ฐ’์ด 1๋ณด๋‹ค ์ž‘์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

์—ญ์ „ํŒŒ (back-propagation) ๊ณผ์ •์—์„œ ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ๋ฏธ๋ถ„๊ฐ’์ด ์‚ฌ์šฉ๋˜๋Š”๋ฐ, 1๋ณด๋‹ค ์ž‘์€ ๊ฐ’์ด ๋ฐ˜๋ณตํ•ด์„œ ๊ณฑํ•ด์ง€๋ฉด 0์— ์ˆ˜๋ ดํ•˜๊ฒŒ ๋œ๋‹ค.

Why is Vanishing Gradient a Problem?

๋จผ ๊ฑฐ๋ฆฌ์˜ gradient signal์€ ๊ฐ€๊นŒ์šด ๊ฑฐ๋ฆฌ์˜ gradient signal๋ณด๋‹ค ํ›จ์”ฌ ์ž‘๊ธฐ ๋•Œ๋ฌธ์— ์†์‹ค๋œ๋‹ค.

๋”ฐ๋ผ์„œ ๋จผ ๊ฑฐ๋ฆฌ์˜ gradient signal์€ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์— ์˜ํ–ฅ์„ ๋ผ์น˜์ง€ ๋ชปํ•˜๊ณ , ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๋Š” ๊ฐ€๊นŒ์šด ๊ฑฐ๋ฆฌ์˜ gradient signal์— ๋Œ€ํ•ด์„œ๋งŒ ์—…๋ฐ์ดํŠธ๋œ๋‹ค.

Exploding Gradient๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ด์œ 

The back-propagation computation involves not only the partial derivatives of the activation function but also the weight values themselves.

๋งŒ์•ฝ, ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋“ค์ด ์ถฉ๋ถ„ํžˆ ํฐ ๊ฐ’์ด๋ผ๊ณ  ๊ฐ€์ •์„ ํ•˜๋ฉด, ๋ ˆ์ด์–ด๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก ์ถฉ๋ถ„ํžˆ ํฐ ๊ฐ€์ค‘์น˜๋“ค์ด ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ณฑํ•ด์ง€๋ฉด์„œ backward value๊ฐ€ ํญ๋ฐœ์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋œ๋‹ค. ์ด๋ฅผ Exploding gradient๋ผ ํ•œ๋‹ค.

Why is Exploding Gradient a Problem?

Gradient๊ฐ€ ๋„ˆ๋ฌด ํฌ๋ฉด, SGD update step์ด ์ปค์ง€๊ฒŒ ๋œ๋‹ค.

๋„ˆ๋ฌด ํฐ step์œผ๋กœ ์—…๋ฐ์ดํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ž˜๋ชป๋œ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ตฌ์„ฑ์— ๋„๋‹ฌํ•˜๋ฉด, ํฐ ์†์‹ค์ด ๋ฐœ์ƒํ•˜์—ฌ ์ž˜๋ชป๋œ ์—…๋ฐ์ดํŠธ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

How to fix the vanishing/exploding gradient problem?

Long Short-Term Memory (LSTM)

LSTM์˜ ๊ฐ Gate๋“ค์˜ ์—ญํ• 

1. Forget Gate ($f_{t}$): controls how much of the previous cell state $c_{t-1}$ is kept (values near 0 erase it, values near 1 retain it).

2. Input Gate ($i_{t}$): controls how much of the new candidate content $\tilde{c}_{t}$ is written into the cell state.

3. Output Gate ($o_{t}$): controls how much of the cell state is exposed as the hidden state $h_{t}$ (see the equations after this list).
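For reference, the standard LSTM cell equations (with $\sigma$ the sigmoid and $\odot$ element-wise multiplication) are:

\[f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \qquad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \qquad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\]

\[\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)\]

The additive update of the cell state $c_t$ gives gradients a much more direct path back through time than the repeated matrix multiplications of a vanilla RNN, which is the usual explanation of why the LSTM alleviates the vanishing gradient problem described above.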

References

  1. ์ธ๊ณต์ง€๋Šฅ ์‘์šฉ (ICE4104), ์ธํ•˜๋Œ€ํ•™๊ต ์ •๋ณดํ†ต์‹ ๊ณตํ•™๊ณผ ํ™์„ฑ์€ ๊ต์ˆ˜๋‹˜
  2. [Stanford CS224N - NLP w/ DL Winter 2021 Lecture 5 - Recurrent Neural networks (RNNs)](https://www.youtube.com/watch?v=PLryWeHPcBs&list=PLoROMvodv4rMFqRtEuo6SGjY4XbRIVRd4)
  3. [๋จธ์‹ ๋Ÿฌ๋‹/๋”ฅ๋Ÿฌ๋‹] 10-1. Recurrent Neural Network(RNN)
  4. RNN Tutorial Part 2 - Python, NumPy์™€ Theano๋กœ RNN ๊ตฌํ˜„ํ•˜๊ธฐ
  5. Neural Network Notes
  6. (2) ํ™œ์„ฑํ•จ์ˆ˜์™€ ์ˆœ์ „ํŒŒ ๋ฐ ์—ญ์ „ํŒŒ
  7. Vanishing gradient / Exploding gradient
  8. LSTM(Long-Short Term Memory)๊ณผ GRU(gated Recurrent Unit)(์ฝ”๋“œ ์ถ”๊ฐ€ํ•ด์•ผํ•จ)
  9. 07-07 ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค(Gradient Vanishing)๊ณผ ํญ์ฃผ(Exploding) - ๋”ฅ ๋Ÿฌ๋‹์„ ์ด์šฉํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์ž…๋ฌธ
  10. LSTM - ์ธ์ฝ”๋ค, ์ƒ๋ฌผ์ •๋ณด ์ „๋ฌธ์œ„ํ‚ค
  11. [๋”ฅ๋Ÿฌ๋‹] LSTM ์‰ฝ๊ฒŒ ์ดํ•ดํ•˜๊ธฐ - YouTube
  12. Loner์˜ ํ•™์Šต๋…ธํŠธ :: LSTM ๊ฐœ์ธ์ •๋ฆฌ