Su-minn edited this page Apr 24, 2022 · 8 revisions
  • ResNet was the first model to stack more than 100 layers, overcoming the limit where performance stopped improving as layers got deeper (the degradation problem)
  • It surpassed human-level performance and took 1st place not only in ImageNet classification but also in localization, detection, and segmentation

Model Keypoints

  • Architecture
  • Shortcut connection

Architecture

[Figure: overall ResNet architecture]

  • Stem (initial part)

    • One 7x7 conv layer
    • He initialization
      • An initialization well suited to ResNet
      • With a generic (non-He) initialization,
        the values added through the skip connections grow large from the very start
  • Residual Block part

    • Stack residual blocks
    • Every residual block has two 3x3 conv layers
      • Every layer inside a residual block is a 3x3 conv
      • This keeps the parameter count from growing sharply, which is why the computation stays relatively fast
    • Batch norm after every conv layer
    • Doubling the number of filters and spatially down-sampling by stride 2 instead of spatial pooling
      • Each time the network moves to the next residual stage,
        down-sampling halves the spatial size and the number of channels doubles
  • Final output

    • Only a single FC layer for output classes
    • After average pooling, the output is produced by a single FC layer
  • cf) Why down-sample spatially with stride 2

    • downsampling: the process of reducing the spatial size of an image

    • In convolutional neural networks, the most common ways to halve a feature map's resolution are stride-2 convolution or max/average pooling

    • Reducing resolution with a stride-2 convolution layer adds learnable parameters, so the down-sampling itself becomes learnable,
      but the parameter count and computation grow accordingly

    • Pooling reduces resolution with a fixed rule (max or average) and no learnable parameters, so it is detached from training; computation and training cost drop, but it is known to perform worse than strided convolution

    • Reference: Stride와 Pooling의 비교 - gaussian37

  • cf) He initialization

  • cf) Batch normalization

    • Batch normalization, like careful weight initialization, is one approach to the gradient vanishing and gradient exploding problems
    • Effects of batch normalization
      • Improves training speed (a higher learning rate can be used)
      • Reduces dependence on the choice of initial weights (activations are normalized at every training step)
      • Can lower the risk of overfitting (can substitute for techniques such as dropout)
      • Mitigates the gradient vanishing problem
  • Reference: 문과생도 이해하는 딥러닝 (10) - 배치 정규화 (tistory.com)
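The stage layout above (stride-2 down-sampling that halves the spatial size while the filter count doubles) can be sketched as simple shape arithmetic. This is a toy sketch assuming the ResNet-34 block layout of [3, 4, 6, 3] and a 224x224 input; the function name is ours:

```python
# Sketch of how spatial size and channel count evolve through the stages
# of a ResNet-34-style network (illustrative arithmetic only).
def resnet_stage_shapes(input_size=224, stage_blocks=(3, 4, 6, 3)):
    size = input_size // 2   # stem: 7x7 conv, stride 2 -> 112
    size //= 2               # 3x3 max pool, stride 2 -> 56
    channels = 64
    shapes = []
    for stage in range(len(stage_blocks)):
        if stage > 0:
            size //= 2       # stride-2 conv at the stage entry halves H and W
            channels *= 2    # while the number of filters doubles
        shapes.append((channels, size, size))
    return shapes

print(resnet_stage_shapes())
# [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]
```

The final 512x7x7 feature map is what the average pooling and single FC layer above consume.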

Shortcut connection

  • Also called a skip connection
  • Even if the gradient along the main path vanishes, the gradient carried by the shortcut (identity) connection survives,
    so the gradient vanishing problem is alleviated
  • This makes it possible to stack layers deeper, resolving the degradation problem
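The surviving-gradient claim can be checked with a toy scalar model, where `F(x) = w * x` stands in for the conv branch (an assumed setup, not the actual network):

```python
# Toy scalar residual unit: H(x) = F(x) + x with F(x) = w * x.
# Even if the residual branch is "dead" (w = 0), dH/dx stays at 1,
# so the gradient still flows through the identity shortcut.
def residual_forward(x, w):
    return w * x + x

def grad_wrt_input(x, w, eps=1e-6):
    # central finite difference of H with respect to x
    return (residual_forward(x + eps, w) - residual_forward(x - eps, w)) / (2 * eps)

print(grad_wrt_input(1.0, 0.0))  # ~1.0: the shortcut keeps the gradient alive
```

Analytically dH/dx = w + 1, so the identity term contributes a constant 1 no matter how small the residual branch's gradient becomes.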

[Figure: plain layer vs. residual block]

  • Plain Layer

    • As the layers get deeper, it is hard to learn good $H(x)$ directly
    • When learning the mapping $H(x)$,
      trying to learn the relation from $x$ to $H(x)$ directly through a tall stack of layers
      is hard because the mapping is complex
  • Residual block

    • The block is changed to model and learn only the residual part,
      i.e. what remains beyond the identity input $x$
    • Target function : $H(x) = F(x) + x$
    • Residual function : $F(x) = H(x) - x$
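A minimal forward pass for the formulation above, with a toy fully connected stack standing in for the block's two 3x3 convs (names, shapes, and the FC substitution are ours):

```python
import numpy as np

def residual_block(x, w1, w2):
    # F(x): two toy "layers" in place of the block's 3x3 conv + BN stack
    f = np.maximum(0.0, x @ w1) @ w2
    # H(x) = F(x) + x, followed by the block's final ReLU
    return np.maximum(0.0, f + x)

# When the residual weights are all zero, F(x) = 0 and the block
# reduces to the identity (for non-negative inputs):
x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(residual_block(x, np.zeros((4, 4)), np.zeros((4, 4))))  # [[1. 2. 3. 4.]]
```

This is why residual blocks are easy to optimize: learning the identity mapping only requires driving $F(x)$ toward zero.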

Degradation problem

  • As the network depth increases, accuracy gets saturated, then degrades rapidly
  • It was expected that a model with more parameters would simply be vulnerable to overfitting,
    but the deeper network (56-layer) showed worse training and test error than the shallower one (20-layer),
    leading to the conclusion that this is a degradation (optimization) problem, not overfitting
    • cf) If overfitting were the problem, the deeper network would have had lower training error
      but higher test error

[Figure: training/test error of 20-layer vs. 56-layer plain networks]

Analysis of residual connection

  • $2^n$ input-output paths are created that the gradient can travel through
    • The analysis is that complex mappings can be learned via this large variety of paths
    • Every additional residual block doubles the number of paths
  • Residual networks have $O(2^n)$ implicit paths connecting input and output,
    and adding a block doubles the number of paths
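The path count can be made concrete by enumerating routes through a few blocks (a toy enumeration; 'I' marks a block's identity branch, 'F' its residual branch):

```python
from itertools import product

def enumerate_paths(num_blocks):
    # each residual block lets the signal take either its identity
    # branch ('I') or its residual branch ('F'), so paths double per block
    return [''.join(p) for p in product('IF', repeat=num_blocks)]

print(len(enumerate_paths(3)))  # 8 = 2^3 distinct input-output paths
```

Adding one more block pairs every existing path with both branch choices of the new block, which is exactly the doubling noted above.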

[Figure: unraveled view of the paths through a residual network]
