Also download and tokenize datasets in pure C #230

matiasdelellis · 2024-04-23T12:57:10Z

"There is no need for 245MB of PyTorch or 107MB of cPython"... ?
I loved this statement, but when I went to install the dependencies, I found that it needs several gigabytes of Python dependencies just to download the datasets. 😞

matias@nube:~/llm.c$ pip install -r requirements.txt
Defaulting to user installation because normal site-packages is not writeable.
.................................             ..............................                 ..................
Collecting mpmath>=0.19 (from sympy->torch->-r requirements.txt (line 3))
  Obtaining dependency information for mpmath>=0.19 from https://files.pythonhosted.org/packages/43/e3/7d92a15f894aa0c9c4b49b8ee9ac9850d6e63b03c9c32c0367a13ae62209/mpmath-1.3.0-py3-none-any.whl.metadata
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading torch-2.2.2-cp312-cp312-manylinux1_x86_64.whl (755.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 755.5/755.5 MB 1.4 MB/s eta 0:00:00
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 3.0 MB/s eta 0:00:00
Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 9.2 MB/s eta 0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 9.1 MB/s eta 0:00:00
Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 10.4 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 731.7/731.7 MB 1.6 MB/s eta 0:00:00
Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 4.3 MB/s eta 0:00:00
Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB

I guess this could also be implemented in pure C. Of course, I say this without understanding how it all works 😅, but your projects are great, and I think this would be a good goal in line with the project. 😬

karpathy (Maintainer) · 2024-04-23T15:02:05Z

That's just because we're initializing from OpenAI GPT-2 weights and we're using Python to download and write them conveniently.
Soon we'll start training from scratch and this step will become optional.
