Replies: 6 comments 10 replies
-
The training scripts save the models to a default directory, but you can change that with the corresponding setting. I don't have any particular experience training on Colab, but I will say that training on CPU is slow compared to GPU.
-
I don't think the layout has changed in the past year. There are a couple of places to look for documentation: https://stanfordnlp.github.io/stanza/training.html and https://github.com/stanfordnlp/stanza-train/ . Ultimately it looks for the data processed into CoNLL format in the expected data directory.
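As a quick sanity check on the CoNLL data before training, a minimal sketch (the sample data and the helper are illustrative only, not Stanza's actual loader) that splits CoNLL-U text into sentence blocks on blank lines:

```python
# Minimal CoNLL-U sanity check: split text into sentence blocks.
# SAMPLE is a tiny illustrative snippet, not real training data.

SAMPLE = """\
# text = Hello world
1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_
2\tworld\tworld\tNOUN\t_\t_\t1\tobj\t_\t_

# text = Bye
1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_
"""

def split_sentences(conllu_text):
    """Split CoNLL-U text into sentences (lists of lines) on blank lines."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

sentences = split_sentences(SAMPLE)
print(len(sentences))  # 2
```

If this kind of splitting produces an unexpected sentence count on your file, the blank-line separation is likely off somewhere.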
-
I agree that is weird. I will note that the error is complaining about "Old English" with no underscore, whereas the long name was written as "Old_English". Perhaps that is the source of the problem. If not, please include a full stack trace if possible.
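For what it's worth, a mismatch like that can often be avoided by normalizing the language name before passing it along. A hypothetical helper (not part of Stanza) showing the idea:

```python
def normalize_lang_name(name):
    """Replace spaces with underscores so 'Old English' matches 'Old_English'."""
    return name.strip().replace(" ", "_")

print(normalize_lang_name("Old English"))  # Old_English
```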
-
Well, that does tell you exactly where it's looking... are the data files where it expects, or do you have them somewhere else?
-
Hi, thanks for bearing with me. I have managed to sort the issue, thanks for your help. When running the command
I'm getting the following error:
However, my dataset already presents that information for each sentence. I'm attaching an excerpt from my dataset, and another one from the dummy data in the repo.
Any ideas on how to sort this out? I'm guessing it has to do with the formatting of my dataset. It seems that in VSCode, the spacing is slightly different between the first and second instances. However, the tabs and spaces are exactly the same. Thanks for your help!
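One way to narrow this down is to scan the data and report any sentence block that is missing a `# text` comment. A rough sketch, assuming standard CoNLL-U blank-line sentence separation (the SAMPLE string stands in for your dataset file; in practice you would read it from disk):

```python
# Report 1-based indices of sentence blocks lacking a "# text" comment.
# SAMPLE is illustrative; the second sentence deliberately omits the comment.

SAMPLE = """\
# text = Good sentence
1\tGood\tgood\tADJ\t_\t_\t0\troot\t_\t_

1\tBad\tbad\tADJ\t_\t_\t0\troot\t_\t_
"""

def find_missing_text(conllu_text):
    """Return 1-based indices of sentences that have no '# text' comment."""
    missing, idx, has_text, in_sentence = [], 0, False, False
    # Append a trailing blank line so the last sentence is also flushed.
    for line in conllu_text.splitlines() + [""]:
        if line.strip():
            if not in_sentence:
                idx += 1
                in_sentence, has_text = True, False
            if line.startswith("# text"):
                has_text = True
        elif in_sentence:
            if not has_text:
                missing.append(idx)
            in_sentence = False
    return missing

print(find_missing_text(SAMPLE))  # [2]
```

Running something like this over the real file should point at exactly which sentence trips the error.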
-
I'm actually going to push back on the idea that each sentence already has
"# text" - perhaps there is a different format for some sentence, as you
suggest. I added a bit more information to the error message to hopefully
help you narrow down where it's happening. You can install Stanza with
the extra information in the error message as follows:
pip install --no-deps --upgrade --force -i https://test.pypi.org/simple/ stanza==1.6.1.1
-
Hi all!
I'm about to start training a new language model, and I have some hardware restrictions. I'm researching Google Colab and how to train models there, and I was wondering whether someone has already done this, and what the best approach would be.
I would also like to ask how to save the model architecture once training has finished. As far as I know, the model will be trained with the PyTorch library. Should I add the corresponding lines to the script for each of the processors? Is there a better way? Or is it already implemented in the code?
Thanks for your attention, and have a nice day!