UPerNet

Proposes a framework that handles scene recognition, object and part segmentation, material recognition, and texture recognition simultaneously in a single network.

Model Key Points

FPN (Feature Pyramid Network)
- A feature extractor that uses multi-level feature representations
- Pyramidal hierarchical structure
- Top-down architecture + lateral connections -> fuses high-level semantic information with middle- or low-level information
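The top-down pathway with lateral connections can be sketched in NumPy as below. This is a minimal sketch under simplifying assumptions: 1x1 convolutions are written as channel-mixing matrix products, upsampling is nearest-neighbour by a factor of 2, and the 3x3 smoothing conv, normalization, and training are omitted. The function names, channel widths, and spatial sizes are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as channel mixing. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features, lateral_weights):
    """features: backbone maps ordered fine -> coarse, each half the resolution
    of the previous one. Returns pyramid maps (fine -> coarse) with a shared width."""
    # Start from the coarsest, most semantic level.
    p = conv1x1(features[-1], lateral_weights[-1])
    outputs = [p]
    for x, w in zip(reversed(features[:-1]), reversed(lateral_weights[:-1])):
        # Upsample the coarser map and add the lateral projection of the finer backbone map.
        p = upsample2x(p) + conv1x1(x, w)
        outputs.append(p)
    return outputs[::-1]  # reorder fine -> coarse

# Toy usage with hypothetical backbone widths and sizes.
rng = np.random.default_rng(0)
channels = [64, 128, 256, 512]
sizes = [32, 16, 8, 4]
feats = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
weights = [rng.standard_normal((256, c)) * 0.01 for c in channels]
pyramid = fpn_top_down(feats, weights)
print([p.shape for p in pyramid])  # [(256, 32, 32), (256, 16, 16), (256, 8, 8), (256, 4, 4)]
```

Each finer level ends up carrying the semantic signal from the coarser levels through the repeated upsample-and-add, which is the fusion the bullet above describes.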
PPM (Pyramid Pooling Module)
- Added on top of the last layer of the backbone network
- Placed before the features go into the FPN's top-down branch
- Provides an effective global prior representation
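A rough sketch of a pyramid pooling module follows, assuming PSPNet-style bin sizes of (1, 2, 3, 6) and nearest-neighbour resizing in place of bilinear interpolation; normalization, activations, and any conv applied after the concatenation are left out. All names and sizes are hypothetical.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool a (C, H, W) map into a (C, bins, bins) grid, using the
    floor/ceil window boundaries of adaptive average pooling."""
    c, h, w = x.shape
    out = np.empty((c, bins, bins))
    for i in range(bins):
        h0, h1 = (i * h) // bins, -(-((i + 1) * h) // bins)
        for j in range(bins):
            w0, w1 = (j * w) // bins, -(-((j + 1) * w) // bins)
            out[:, i, j] = x[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def upsample_nearest(x, h, w):
    """Nearest-neighbour resize of a (C, h0, w0) map to (C, h, w)."""
    _, h0, w0 = x.shape
    rows = (np.arange(h) * h0) // h
    cols = (np.arange(w) * w0) // w
    return x[:, rows][:, :, cols]

def pyramid_pooling(x, weights, bins=(1, 2, 3, 6)):
    """Pool at several grid sizes, project each with a 1x1 conv, resize back
    to the input size, and concatenate with the input along channels."""
    _, h, w = x.shape
    branches = [x]
    for b, wt in zip(bins, weights):
        pooled = adaptive_avg_pool(x, b)
        projected = np.einsum("oc,chw->ohw", wt, pooled)  # 1x1 conv
        branches.append(upsample_nearest(projected, h, w))
    return np.concatenate(branches, axis=0)

# Toy usage on a hypothetical last-stage backbone map.
rng = np.random.default_rng(0)
c5 = rng.standard_normal((512, 12, 12))
weights = [rng.standard_normal((128, 512)) * 0.01 for _ in range(4)]
out = pyramid_pooling(c5, weights)
print(out.shape)  # (1024, 12, 12)
```

The 1x1 bin is a global average over the whole map, so every output position sees an image-wide summary; this is where the "global prior" comes from.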
Head
- Makes it possible to handle scene, object, part, material, and texture recognition at the same time
  - Within a single network, visual attributes that exist at multiple levels can be parsed and unified
- Scene Head / Object Head / Part Head / Material Head / Texture Head
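To illustrate the idea of several task heads sharing one forward pass, here is a toy sketch. It assumes every head reads the same fused feature map, with the scene head working on its global average (an image-level prediction) and the other heads predicting per pixel. The real model may attach each head to a different feature level, and the texture head is left out here. Head names map to the list above; class counts and function names are hypothetical.

```python
import numpy as np

def image_head(global_feat, w):
    """Image-level head: global vector (C,) -> class logits (K,)."""
    return w @ global_feat

def pixel_head(feat, w):
    """Pixel-level head: 1x1 classifier over a (C, H, W) map -> (K, H, W) logits."""
    return np.einsum("kc,chw->khw", w, feat)

def run_heads(fused, heads):
    """Apply every head to the same shared features (one backbone pass, many tasks)."""
    global_feat = fused.mean(axis=(1, 2))  # global average pooling
    outputs = {}
    for name, (kind, w) in heads.items():
        outputs[name] = image_head(global_feat, w) if kind == "image" else pixel_head(fused, w)
    return outputs

# Toy usage: class counts are hypothetical.
rng = np.random.default_rng(0)
fused = rng.standard_normal((256, 32, 32))
num_classes = {"scene": 10, "object": 20, "part": 5, "material": 8}
heads = {
    name: ("image" if name == "scene" else "pixel", rng.standard_normal((k, 256)) * 0.01)
    for name, k in num_classes.items()
}
outs = run_heads(fused, heads)
print({k: v.shape for k, v in outs.items()})
# {'scene': (10,), 'object': (20, 32, 32), 'part': (5, 32, 32), 'material': (8, 32, 32)}
```

Because the heads only add small classifiers on top of shared features, adding a task costs little compared with running a separate network per task.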

Structure: Conv 3x3 -> Classifier
In experiments, fusing all of the FPN's feature maps gave better results than using only the highest-resolution feature map.
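The fusion step can be sketched as follows, assuming every pyramid level is resized to the finest resolution with nearest-neighbour upsampling, concatenated along channels, passed through a 3x3 conv with ReLU, and then through a 1x1 classifier. Normalization is omitted, and all names and sizes are hypothetical.

```python
import numpy as np

def upsample_to(x, h, w):
    """Nearest-neighbour resize of a (C, h0, w0) map to (C, h, w)."""
    _, h0, w0 = x.shape
    rows = (np.arange(h) * h0) // h
    cols = (np.arange(w) * w0) // w
    return x[:, rows][:, :, cols]

def conv3x3(x, w):
    """3x3 convolution with zero padding 1. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Accumulate the contribution of kernel tap (dy, dx) over all input channels.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], padded[:, dy:dy + h, dx:dx + wd])
    return out

def fuse_and_classify(pyramid, w_fuse, w_cls):
    """Resize every FPN level to the finest resolution, concatenate,
    apply a 3x3 conv + ReLU, then a 1x1 classifier -> per-pixel logits."""
    _, h, w = pyramid[0].shape
    fused = np.concatenate([upsample_to(p, h, w) for p in pyramid], axis=0)
    fused = np.maximum(conv3x3(fused, w_fuse), 0.0)  # ReLU
    return np.einsum("kc,chw->khw", w_cls, fused)

# Toy usage: four pyramid levels (fine -> coarse) with a hypothetical width of 64.
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((64, s, s)) for s in (32, 16, 8, 4)]
w_fuse = rng.standard_normal((64, 4 * 64, 3, 3)) * 0.01
w_cls = rng.standard_normal((10, 64)) * 0.01  # 10 hypothetical classes
logits = fuse_and_classify(pyramid, w_fuse, w_cls)
print(logits.shape)  # (10, 32, 32)
```

Using only the highest-resolution map would correspond to passing `pyramid[:1]` (with a matching `w_fuse` width); the comparison in the note above is between these two choices.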
