
DenseNet


Model Keypoints

  • To guarantee maximal information flow, every layer is directly connected to every other layer
  • This yields L(L+1)/2 direct connections
โœ”๏ธ information preservation
  • ResNet adds the identity transformation via summation, which gives a direct gradient path from later layers back to early layers; however, because the identity and the layer output H(x_{l-1}) are summed, this can impede the information flow.

  • The direct gradient flow is helpful, but it also means that information that should be preserved through the forward pass can be altered by the summation and thus lost. (DenseNet preserves it intact via concatenation.)

  • DenseNet keeps the existing feature maps intact and concatenates each layer's new feature maps to them → information to be added to the network and information to be preserved are processed separately → information is preserved

โœ”๏ธ improved flow of information and gradient
  • Because every layer is directly connected to all preceding layers, each layer has direct access to the gradients of the loss function and to the original input signal, and gradient vanishing is alleviated → much deeper networks become trainable
โœ”๏ธ regularizing effect
  • ๋งŽ์€ connection์œผ๋กœ depth๊ฐ€ ์งง์•„์ง€๋Š” ํšจ๊ณผ โ†’ regularization ํšจ๊ณผ (overfitting ๋ฐฉ์ง€)
  • ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘์€ train set์„ ์ด์šฉํ•˜์—ฌ๋„ overfitting ๋ฌธ์ œ์—์„œ ์ž์œ ๋กœ์›€

Architecture

Residual block: adds the x (identity) mapping delivered through the skip connection & receives information only from the immediately preceding block
Dense block: concatenates along the channel axis & receives information from the immediately preceding block and from much earlier blocks as well

+ in the residual block vs. concatenation in the dense block

  • summation (+): merges the two signals into one
  • concatenation: the channel count grows, but the signals themselves are preserved, which makes it easy to reuse lower-level information
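
A minimal PyTorch sketch of this difference (the tensor shapes are illustrative, not from the paper): summation keeps the channel count but mixes the two signals, while concatenation grows the channel count and leaves both signals untouched.

```python
import torch

x = torch.randn(1, 64, 32, 32)  # identity / earlier feature map
h = torch.randn(1, 64, 32, 32)  # output of the current block

res = x + h                       # residual-style summation: shape stays (1, 64, 32, 32)
dense = torch.cat([x, h], dim=1)  # dense-style concatenation: shape (1, 128, 32, 32),
                                  # both signals preserved unchanged
print(res.shape, dense.shape)
```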

DenseNet

Dense Connectivity

  • In ResNet, gradients can flow directly through the identity function, but adding the identity and the output (summation) can impede the information flow → L connections
  • Instead of connecting layers by summation, DenseNet connects every layer directly to every following layer by concatenation → L(L+1)/2 connections ⇒ this dense connectivity is why the model is named DenseNet (Dense Convolutional Network)
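
A minimal sketch of dense connectivity in PyTorch, assuming a toy composite function H (BN → ReLU → 3×3 conv); the class names and channel sizes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One composite function H_l: BN -> ReLU -> 3x3 conv, emitting growth_rate channels."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.h(x)

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps."""
    def __init__(self, num_layers: int, in_channels: int, growth_rate: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # direct connection to every earlier layer via channel-wise concatenation
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```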

Pooling layers

  • When the feature map size changes, concatenation can no longer be performed (∵ maps of different sizes cannot be stacked channel-wise) ↔ yet down-sampling is essential in a CNN, so feature map sizes inevitably change across layers

  • DenseNet therefore divides the whole network into several dense blocks, grouping layers that share the same feature map size into the same block

  • In the figure above, the network is divided into three dense blocks

    • ๊ฐ™์€ ๋ธ”๋Ÿญ ๋‚ด์˜ ๋ ˆ์ด์–ด๋“ค์€ ์ „๋ถ€ ๊ฐ™์€ feature map size๋ฅผ ๊ฐ€์ง โ‡’ concatenation ์—ฐ์‚ฐ ๊ฐ€๋Šฅ

    • The transition layers (the pooling and convolution parts marked with red boxes) ⇒ perform the down-sampling (a minimal sketch follows this list)

      • Batch Normalization (BN)
      • 1×1 convolution → reduces the number of feature maps (= channel count)
      • 2×2 average pooling → halves the width/height of the feature maps
    • ex. if dense block 1 holds 100×100 feature maps, dense block 2 holds 50×50 feature maps

  • The very first convolution in the figure above → adjusts the input image size to fit the first dense block → it can be used or omitted depending on the image size
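
A minimal sketch of the transition layer described above (BN → 1×1 conv → 2×2 average pooling); the channel counts are left as parameters.

```python
import torch.nn as nn

class Transition(nn.Module):
    """Placed between dense blocks: shrinks channels with a 1x1 conv,
    then halves the spatial size with 2x2 average pooling."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.bn(x)))
```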

Bottleneck layers

  • A bottleneck layer is used to control the number of output feature maps (= channel count)
  • The paper denotes the model that uses a bottleneck layer in H(·) as DenseNet-B
    • Batch Norm → ReLU → Conv (1×1) → Batch Norm → ReLU → Conv (3×3) (sketched after this list)
    • In the paper, each 1×1 conv outputs 4k feature maps (the factor 4 in 4 × growth rate is a hyper-parameter, and the paper does not explain it in detail)
  • The 1×1 convolution reduces the channel count ⇒ fewer parameters in the 3×3 convolution that does the actual learning
  • To build its bottleneck, ResNet

    • applies a 1×1 convolution for dimension reduction, then another 1×1 convolution for expansion
  • To build its bottleneck, DenseNet

    • applies a 1×1 convolution for dimension reduction, but performs no expansion
    • instead, concatenating features achieves the same effect as the expansion operation
      • (Note) concatenation expands the channel count → ex. 6 + 4 + ... + 4
  • (In common) both pass the input through a 1×1 convolution before the 3×3 convolution to reduce the channel count

  • (Difference) instead of generating as many feature maps as the input channels again (ResNet), DenseNet generates only growth-rate-many feature maps ⇒ this reduces the computational cost

โœ”๏ธ Growth rate
  • The l-th layer H_l takes as input the k_0 channels of the network input plus the outputs of the preceding (l-1) layers, and produces k feature maps (k_0 : channel count of the input layer)

    • input : k_0 + k*(l-1) feature maps
    • output : k feature maps
  • Growth rate (= hyperparameter k) → the number of feature map channels each layer produces

  • ๊ฐ feature map๋ผ๋ฆฌ densely connection ๋˜๋Š” ๊ตฌ์กฐ์ด๋ฏ€๋กœ ์ž์นซ feature map์˜ channel ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„ ๊ฒฝ์šฐ, ๊ณ„์†ํ•ด์„œ channel-wise๋กœ concatenate ๋˜๋ฉด์„œ channel์ด ๋งŽ์•„์งˆ ์ˆ˜ ์žˆ์Œ โ‡’ DenseNet์—์„œ๋Š” ๊ฐ layer์˜ feature map์˜ channel ๊ฐœ์ˆ˜๋กœ ์ž‘์€ ๊ฐ’์„ ์‚ฌ์šฉ

  • For concatenation it is convenient that every layer outputs the same channel count → the convolutions are set up so that each layer emits exactly growth-rate-many channels

  • ์œ„์˜ ๊ทธ๋ฆผ 1์€ k(growth rate) = 4 ์ธ ๊ฒฝ์šฐ๋ฅผ ์˜๋ฏธ

    • a 6-channel input feature map passes through the dense block's four convolution blocks and ends up as an output feature map with 6 + 4 + 4 + 4 + 4 = 22 channels (see the sketch after this list)
    • within each dense block of DenseNet, the per-layer channel counts therefore form a simple arithmetic progression
  • DenseNet uses a small k → its layers are narrow (compared to other models) ⇒ why does DenseNet perform well even with such narrow layers?

    • Within a dense block, every layer can access all preceding feature maps (= access to the network's "collective knowledge")
      ⇒ (Note) preceding feature maps = the network's global state
    • The growth rate k → controls how much new information each layer contributes to this global state
    • ⇒ Because the global state is reachable from every layer, DenseNet, unlike traditional networks, never needs to copy a layer's feature maps and hand them on to another layer (= feature reuse)
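
A small sketch of the channel bookkeeping, using the assumed toy values from the example above (k_0 = 6 input channels, growth rate k = 4, four layers):

```python
k0, k, num_layers = 6, 4, 4  # input channels, growth rate, layers in the block

for l in range(1, num_layers + 1):
    in_ch = k0 + k * (l - 1)  # input : k_0 + k*(l-1)
    print(f"layer {l}: input channels = {in_ch}, output channels = {k}")

print("block output channels =", k0 + k * num_layers)  # 6 + 4*4 = 22

```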
โœ”๏ธ Compression
  • Compression is the ratio (hyperparameter θ) by which the 1×1 convolution of the pooling (transition) layer reduces the channel count (= number of feature maps)
    • The paper sets θ = 0.5 → passing through a transition layer halves the number of feature maps (channels), and the 2×2 average pooling also halves the feature maps' width and height
    • Setting θ = 1 → the number of feature maps is kept as-is
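
A one-line sketch of the compression arithmetic, reusing the hypothetical 22-channel block output from the growth-rate example above:

```python
import math

theta = 0.5        # compression factor from the paper
in_channels = 22   # e.g. the dense block output from the example above
out_channels = int(math.floor(theta * in_channels))  # transition 1x1 conv emits 11 channels
```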

DenseNet pros

1. Alleviates the gradient vanishing problem

  • As the figure above shows, DenseNet, like ResNet, can receive gradients through many different paths, which helps training.

2. Strengthens feature propagation

  • As the figure shows, features produced in the early layers are passed backward unchanged and concatenated, so features keep propagating all the way to the end of the network.

3. feature reuse

Feature reuse

  • ํ•™์Šต๋œ DenseNet์˜ ๊ฐ layer๊ฐ€ ์‹ค์ œ๋กœ preceding layer๋“ค์˜ feature map์„ ํ™œ์šฉํ•˜๋Š”์ง€๋ฅผ ์‹คํ—˜

    • For each dense block of the trained network, the average absolute weight assigned between convolution layer s and layer ℓ is computed (taking the absolute value presumably accounts for negative weights)
  • The figure above shows how the average weights of the convolution layers are distributed inside each dense block

  • The color of pixel (s, ℓ) encodes the average L1 norm of the weights connecting conv layers s and ℓ within a dense block ⇒ the weight magnitudes of each dense block normalized to the range 0–1

    • closer to red (1) = larger value ↔ closer to blue (0) = smaller value
  • Findings

    • ๊ฐ layer๋“ค์ด ๋™์ผํ•œ block ๋‚ด์— ์žˆ๋Š” preceding layer๋“ค์— weight๋ฅผ ๋ถ„์‚ฐ ์‹œํ‚ด (โˆต ๊ฐ ์—ด์—์„œ weight๊ฐ€ ๊ณจ๊ณ ๋ฃจ spread๋˜์–ด ์žˆ์Œ

      • โ‡’ Dense block ๋‚ด์—์„œ,ย ์‹ค์ œ๋กœ later layer๋Š” early layer์˜ feature map์„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Œ
    • Transition layer๋„ preceding layer๋“ค์— weight๋ฅผ ๋ถ„์‚ฐ ์‹œํ‚ด (โˆต ๊ฐ€์žฅ ์˜ค๋ฅธ์ชฝ ์—ด์—์„œ weight๊ฐ€ ๊ณจ๊ณ ๋ฃจ spread ๋˜์–ด ์žˆ์Œ)

      • โ‡’ Dense block ๋‚ด์—์„œ,ย 1๋ฒˆ์งธ layer์—์„œ ๊ฐ€์žฅ ๋งˆ์ง€๋ง‰ layer๊นŒ์ง€ information flow๊ฐ€ ํ˜•์„ฑ๋˜์–ด ์žˆ์Œ
    • 2, 3๋ฒˆ์งธ dense block์€ transition layer์˜ output์— ๋งค์šฐ ์ ์€ weight๋ฅผ ์ผ๊ด€๋˜๊ฒŒ ํ• ๋‹น (โˆต 2, 3๋ฒˆ์งธ dense block์˜ ์ฒซ๋ฒˆ์งธ ํ–‰์—์„œ weight๊ฐ€ ๊ฑฐ์˜ 0์— ๊ฐ€๊นŒ์›€)

      • โ‡’ 2, 3๋ฒˆ์งธ dense block์˜ transition layer output์€ redundant features๊ฐ€ ๋งŽ์•„์„œ ๋งค์šฐ ์ ์€ weight๋ฅผ ํ• ๋‹น(์ค‘๋ณต๋œ ์ •๋ณด๋“ค์ด ๋งŽ์•„ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•„๋„ ๋œ๋‹ค๋Š” ์˜๋ฏธ)
      • โ‡’ DenseNet-BC์—์„œ compressionย ฮธ๋กœ ์ด๋Ÿฌํ•œ redundant feature๋“ค์„ compressํ•˜๋Š” ํ˜„์ƒ๊ณผ ์ผ์น˜
      • (์ƒ๊ฐ) Compression์€ pooling layer(Transition layer)์˜ 1x1 Convolution layer ์—์„œ channel ๊ฐœ์ˆ˜(= feature map์˜ ๊ฐœ์ˆ˜)๋ฅผ ์ค„์—ฌ์ฃผ๋Š” ๋น„์œจ (hyperparameter ฮธ)์ด๋ฏ€๋กœ, ์ค‘๋ณต๋œ ์ •๋ณด๋“ค์ด transition layer์—์„œ ์ œ๊ฑฐ๋œ๋‹ค๋Š” ์˜๋ฏธ โ†’ channel ๊ฐœ์ˆ˜ ๊ฐ์†Œ
    • ๋งˆ์ง€๋ง‰ classification layer๋Š” ์ „์ฒด dense block์˜ weight๋ฅผ ์‚ฌ์šฉํ•˜๊ธด ํ•˜์ง€๋งŒ, early layer๋ณด๋‹ค later layer์˜ feature map์„ ๋” ๋งŽ์ด ์‚ฌ์šฉํ•จ (โˆต 3๋ฒˆ์งธ dense block์˜ ๊ฐ€์žฅ ๋งˆ์ง€๋ง‰ ์—ด์—์„œ weight๊ฐ€ ์•„๋ž˜์ชฝ์œผ๋กœ ์น˜์šฐ์ณ ์žˆ์Œ)

      • โ‡’ High-level feature๊ฐ€ later layer์— ๋” ๋งŽ์ด ์กด์žฌํ•จ
    • ์ฐธ๊ณ  : DenseNet (Densely connected convolution networks) - gaussian37

    https://gaussian37.github.io/assets/img/dl/concept/densenet/24.png

    • The figure above shows the distribution of weight values propagated from each source layer to each target layer
    • Vertical axis, source layer → the index of the layer from which the propagated weights originate
    • Horizontal axis, target layer → the destination of the weights propagated from the source
    • ex. in dense block 1, the small square where row 5 crosses column 8 represents the weights propagated from the 5th layer to the 8th layer

    https://gaussian37.github.io/assets/img/dl/concept/densenet/22.png

    • ex. ๊ฐ dense block์˜ย Source๊ฐ€ 1์ธ ๋ถ€๋ถ„๋“ค์„ ์‚ดํŽด ๋ณด๋ฉด ๊ฐ Block์˜ย ์ฒซ layer์—์„œ ํŽผ์ณ์ง„ propagation์— ํ•ด๋‹น (์œ„ ๊ทธ๋ฆผ์—์„œ ๋นจ๊ฐ„์ƒ‰ ๋™๊ทธ๋ผ๋ฏธ์— ํ•ด๋‹นํ•˜๋Š” ๋ถ€๋ถ„)

    https://gaussian37.github.io/assets/img/dl/concept/densenet/23.png

    • ex. ๊ฐ dense block์˜ย Target์ด 12์ธ ๋ถ€๋ถ„๋“ค์„ ์‚ดํŽด ๋ณด๋ฉดย ๋‹ค์–‘ํ•œ Source์—์„œ weight๋“ค์ด ๋ชจ์ด๊ฒŒย ๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Œ (์œ„ ๊ทธ๋ฆผ์—์„œ ๋นจ๊ฐ„์ƒ‰ ๋™๊ทธ๋ผ๋ฏธ์— ํ•ด๋‹นํ•˜๋Š” ๋ถ€๋ถ„)

4. Reduces the number of parameters

DenseNet cons

1. As channels accumulate through concatenation, memory use and computational complexity increase

DenseNet Experiments

Datasets

  • CIFAR
    • 32 x 32 pixels
    • CIFAR-10 : 10 classes / CIFAR-100 : 100 classes
    • training set : 50,000 images / test set : 10,000 images / validation set : 5,000 images held out from the training set
    • data augmentation : mirroring / shifting
    • preprocessing : normalize the data using channel means + standard deviations
  • SVHN
    • 32 x 32 digit images
    • training set : 73,257 images / test set : 26,032 images / validation set : 6,000 images
    • additional training set : 531,131 images
  • ImageNet
    • training set : 1.2 million images / validation set : 50,000 images
    • 1000 classes
    • data augmentation + 10-crop/single-crop
    • 224 x 224 images

Training

  • trained with stochastic gradient descent (SGD)

  • weight decay : 10^{-4}

  • Nesterov momentum : 0.9 without dampening

  • CIFAR, SVHN

    • batch size : 64
    • 300 epochs (CIFAR) / 40 epochs (SVHN)
    • learning rate : 0.1 → divided by 10 at 50% and 75% of the total training epochs
  • ImageNet

    • batch size : 256
    • 90 epochs
    • learning rate : 0.1 → divided by 10 at epochs 30 and 60

โ”momentum : parameter๋ฅผ updateํ•  ๋–„, ํ˜„์žฌ gradient์— ๊ณผ๊ฑฐ์— ๋ˆ„์ ํ–ˆ๋˜ gradient๋ฅผ ์–ด๋Š์ •๋„ ๋ณด์ •ํ•ด์„œ ๊ณผ๊ฑฐ์˜ ๋ฐฉํ–ฅ์„ ๋ฐ˜์˜ํ•˜๋Š” ๊ฒƒ

Classification Results on CIFAR and SVHN

Accuracy

  • DenseNet-BC with {L=190, k=40} → performs best on C10+ and C100+
  • On C10/C100, its error is roughly 30% lower than FractalNet with drop-path regularization
  • DenseNet-BC with {L=100, k=24} → performs well on C10, C100, and SVHN
  • Because SVHN is a comparatively easy task, deep models can overfit on it; DenseNet-BC with {L=250, k=24} therefore shows no further improvement

Capacity

  • Without compression or bottleneck layers, DenseNet performs better as L and k grow
    • Larger (k) and deeper (L) models can learn more numerous and richer representations
  • As the number of parameters grows → the error shrinks
    • Error : 5.24% → 4.10% → 3.74%
    • Number of parameters : 1.0M → 7.0M → 27.2M
    • No signs of overfitting or optimization (= parameter update) difficulty appear

Parameter Efficiency

  • DenseNet-BC's bottleneck structure plus the dimension reduction in the transition layers raises parameter efficiency
  • FractalNet and Wide ResNets use about 30M parameters while the 250-layer DenseNet uses 15.3M, yet DenseNet performs better

Overfitting

  • DenseNet is unlikely to overfit
  • DenseNet-BC's bottleneck structure and compression help prevent overfitting
  • Comparing the errors of ResNet-1001 and DenseNet-BC (L=100, k=12) (rightmost graph)
    • ResNet-1001 reaches a lower training loss than DenseNet-BC, but their test errors are similar, which shows that DenseNet tends to overfit less than ResNet

Classification Results on ImageNet

  • Table 3 (left) lists DenseNet's single-crop and 10-crop validation errors on ImageNet
  • Figure 3 (right) plots the single-crop top-1 validation errors of DenseNet and ResNet against parameter count and FLOPs
    • DenseNet-201 with 20M parameters performs on par with the 101-layer ResNet with more than 40M parameters

Reference

Paper | https://arxiv.org/abs/1608.06993
Additional material | https://csm-kr.tistory.com/10
