These are the Python scripts I wrote for my Master's thesis on semantic diversity and the concreteness advantage. Here is a quick guide to the directories and the scripts they contain:
- The code_bottini directory contains all the scripts used for the calculations on Bottini et al.'s stimuli, specifically:
  - --> the script used to clean the itWaC corpus, train the word2vec (w2v) model, and create all the context files
  - --> the script used to compile the context files for each stimulus (for each stimulus, we create a directory and populate it with the contexts in which that stimulus was found)
  - --> the script used to select 100k random contexts for those stimuli that had more than 100k
  - --> finally, the script used to calculate sem_d (as well as cont_num) and to create the final csv file
- The code_database directory follows the same steps, applied to the English data:
  - --> cleaning ukWaC and creating the context files (no model training, because we used a pretrained model here)
  - --> making a folder for each stimulus and filling it with its contexts
  - --> selecting 100k random contexts for stimuli that have more than 100k
  - --> calculating sem_d and cont_num, and creating the csv file
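
The context-capping step described above (keeping 100k random contexts for stimuli that have more) can be sketched as follows. This is a minimal illustration, not the code from the scripts: the function name `sample_contexts`, the `limit` and `seed` parameters, and the assumption that contexts are held in a list are all hypothetical.

```python
import random

# Cap taken from the description above; the name is hypothetical.
MAX_CONTEXTS = 100_000


def sample_contexts(contexts, limit=MAX_CONTEXTS, seed=0):
    """Return at most `limit` contexts, sampled uniformly without replacement.

    `contexts` is assumed to be a list (for example, of context strings
    or of context file names). A fixed seed keeps the selection reproducible.
    """
    if len(contexts) <= limit:
        # Nothing to trim: keep every context.
        return list(contexts)
    rng = random.Random(seed)
    return rng.sample(contexts, limit)
```

For example, `sample_contexts(all_contexts)` returns the list unchanged when it holds 100k items or fewer, and a random subset of exactly 100k items otherwise.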
Additionally, the file database_semD contains the calculated semD scores for the Word Prevalence database (Brysbaert et al., 2014).
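
As a rough illustration of what a semD calculation can look like, the sketch below uses one common formulation of semantic diversity: the negative natural log of the mean pairwise cosine similarity between the vectors of the contexts in which a word appears. Whether the scripts here use exactly this formula is an assumption, as is the reading of cont_num as the number of contexts; the function name `sem_d_and_cont_num` is hypothetical.

```python
import math


def sem_d_and_cont_num(context_vectors):
    """Return (sem_d, cont_num) for one stimulus.

    Assumptions (not confirmed by the scripts): sem_d is the negative log
    of the mean pairwise cosine similarity between context vectors, and
    cont_num is simply the number of contexts.
    """
    n = len(context_vectors)
    if n < 2:
        raise ValueError("need at least two contexts to compare")

    # Sum of the unit-length versions of all context vectors.
    total = [0.0] * len(context_vectors[0])
    for vec in context_vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("zero vector has no defined cosine")
        for i, x in enumerate(vec):
            total[i] += x / norm

    # ||sum of unit vectors||^2 = n + 2 * (sum of pairwise cosines),
    # so the mean over the n*(n-1)/2 pairs is (sq - n) / (n * (n - 1)).
    sq = sum(t * t for t in total)
    mean_cos = (sq - n) / (n * (n - 1))
    if mean_cos <= 0.0:
        raise ValueError("mean cosine is not positive; log is undefined")
    return -math.log(mean_cos), n
```

Using the sum of normalised vectors avoids looping over every pair of contexts, which matters when a stimulus has up to 100k contexts.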