Number of ResNet block #2

Open
ThomasRochefortB opened this issue Jan 24, 2023 · 3 comments

Comments

@ThomasRochefortB

@OrigamiDream

The paper mentions on page 4 "Tokens belonging to image patches for any time-step are embedded using a single ResNet block".

So we could update the Readme.md and use a single ResNet block for image embedding.

@ThomasRochefortB
Author

Also, the ResNet embedding is applied to individual image patches rather than to the whole image, as hinted in the "Full Episode Sequence" figure (see Figure 15 in the appendix).

@OrigamiDream
Owner

Figure 5 shows that they use patches taken from the complete image for full episode sequences. (They explicitly mark them with commas and ellipses.)
Also, on page 3: "Images are first transformed into sequences of non-overlapping 16 x 16 patches in raster order, as done in ViT ...".
Therefore we must use the whole image for a single observation sequence.
As you said, however, we can update the code to use a single ResNet block for the 16 x 16 patches.
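
To illustrate the "non-overlapping 16 x 16 patches in raster order" step quoted above, here is a minimal sketch in plain Python. The function name `to_patches` and the nested-list image representation are assumptions made for illustration only; they are not taken from the paper or from this repository.

```python
def to_patches(image, patch=16):
    """Split an image into non-overlapping patch x patch tiles.

    `image` is a list of rows, each row a list of pixels (hypothetical
    representation chosen for this sketch). Tiles are returned in raster
    order: left to right within a row of tiles, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):          # rows of tiles, top to bottom
        for left in range(0, width, patch):      # tiles within a row, left to right
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(tile)
    return patches


# Toy 64x80 image whose "pixel" records its own (row, col) position.
image = [[(r, c) for c in range(80)] for r in range(64)]
patches = to_patches(image)
print(len(patches))      # 20 tiles: 4 rows of 5
print(patches[1][0][0])  # (0, 16): second tile is to the right of the first
print(patches[5][0][0])  # (16, 0): sixth tile starts the next row of tiles
```

The whole image is consumed, but each tile is a separate 16 x 16 input, which is how both readings in this thread fit together.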

@OrigamiDream
Owner

Now I understand what you meant.

I have reread the Gato and ViT papers to clarify how Gato handles image patches.
As in the Vision Transformer (the non-hybrid variant), the input images must be split into 16x16 patches before being embedded via ResNet.

My understanding of the process is:

  • For image captioning task:
    224x224x3 → 196x16x16x3 (patching) → 196x1x1x768 (using single ResNet block) → 196x768 (reshaping)
  • For Atari task:
    64x80x3 → 20x16x16x3 (patching) → 20x1x1x768 (using single ResNet block) → 20x768 (reshaping)
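
The shape arithmetic in the two cases above can be checked with a small sketch. The helper name `embedding_shapes` is hypothetical, the embedding width of 768 is taken from the shapes listed in this comment, and the ResNet block itself is not modeled; only the tensor shapes before and after each step are computed.

```python
def embedding_shapes(height, width, channels=3, patch=16, embed_dim=768):
    """Return the tensor shape after each step of the patch-embedding pipeline.

    Steps: input image -> patching -> per-patch embedding -> reshaping.
    Only shapes are tracked; no actual embedding is computed.
    """
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    num_patches = (height // patch) * (width // patch)
    return [
        (height, width, channels),              # input image
        (num_patches, patch, patch, channels),  # after patching
        (num_patches, 1, 1, embed_dim),         # after embedding each patch
        (num_patches, embed_dim),               # after reshaping
    ]


print(embedding_shapes(224, 224))
# [(224, 224, 3), (196, 16, 16, 3), (196, 1, 1, 768), (196, 768)]
print(embedding_shapes(64, 80))
# [(64, 80, 3), (20, 16, 16, 3), (20, 1, 1, 768), (20, 768)]
```

Both results match the numbers above: 224/16 = 14 and 14 x 14 = 196 patches for the captioning case, and 64/16 = 4, 80/16 = 5, so 4 x 5 = 20 patches for the Atari case.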

I'm gonna update the ResNet code
