Here are some details about how the various files have been retrieved (or reprocessed, or created).
In the vocabularies/ directory you can find the files ready to be loaded into CKAN.
They may be either the original official files, or a processed version that only includes entries fitting DCATAPIT specs.
Copies of the original files are attached to a Release entry on GitHub (so that unused heavy files are not added to the repository checkout), and the scripts for processing the original files are in vocabularies/scripts.
To run the trimming scripts, you need to install the xmlstarlet package.
- Official page: https://op.europa.eu/it/web/eu-vocabularies/at-dataset/-/resource/dataset/data-theme/
- Download command:
wget "https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F2c758808-fdd6-11ea-b44f-01aa75ed71a1.0001.02%2FDOC_1&fileName=data-theme-skos.rdf" -O official/data-theme-skos.rdf
- Processing:
  - Script scripts/data-theme-trim.sh:
    - Read official/data-theme-skos.rdf
    - Filter out unused languages
  - Final processed file: data-theme-filtered.rdf
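The "filter out unused languages" step can be illustrated with a minimal Python sketch. This is not the trimming script itself (the scripts rely on xmlstarlet and may behave differently); the function name `filter_languages`, the sample RDF and the kept language list `it en fr de` are assumptions chosen for illustration. It drops every element whose `xml:lang` value is not in the kept set, using an exact match on the language code.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def filter_languages(rdf_text, keep=("it", "en", "fr", "de")):
    """Return the RDF/XML text without elements tagged with an unwanted xml:lang."""
    root = ET.fromstring(rdf_text)
    # Materialize the traversal first so that removing children is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            lang = child.get(XML_LANG)
            if lang is not None and lang not in keep:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input, not taken from the official vocabulary.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/concept/1">
    <skos:prefLabel xml:lang="it">Agricoltura</skos:prefLabel>
    <skos:prefLabel xml:lang="en">Agriculture</skos:prefLabel>
    <skos:prefLabel xml:lang="sv">Jordbruk</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

filtered = filter_languages(SAMPLE)
print(filtered)
```

Elements without an `xml:lang` attribute are left untouched, so the concept structure survives and only the unwanted labels are dropped.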
- Official page: https://op.europa.eu/it/web/eu-vocabularies/at-dataset/-/resource/dataset/place/
- Download command:
wget "https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F87ec948c-581c-11ec-91ac-01aa75ed71a1.0001.04%2FDOC_1&fileName=places-skos.rdf" -O official/places-skos.rdf
- Processing:
  - Script scripts/places-trim.sh:
    - Read official/places-skos.rdf
    - Filter out non-Italian places
    - Filter out unused languages
  - Final processed file: places-filtered.rdf
- Official page: https://op.europa.eu/it/web/eu-vocabularies/at-dataset/-/resource/dataset/language/
- Download command:
wget "https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F87f03e0d-581c-11ec-91ac-01aa75ed71a1.0001.05%2FDOC_1&fileName=languages-skos.rdf" -O official/languages-skos.rdf
- Processing:
  - Script scripts/languages-trim.sh:
    - Read official/languages-skos.rdf
    - Filter out unused languages
    - Filter out unused elements
  - Final processed file: languages-filtered.rdf
- Official page: https://op.europa.eu/it/web/eu-vocabularies/at-dataset/-/resource/dataset/frequency/
- Download command:
wget "https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2Fe20301fe-928e-11e9-9369-01aa75ed71a1.0001.02%2FDOC_1&fileName=frequencies-skos.rdf" -O official/frequencies-skos.rdf
- Processing:
  - Script scripts/frequencies-trim.sh:
    - Read official/frequencies-skos.rdf
    - Filter out unused languages
    - Filter out unused elements
  - Final processed file: frequencies-filtered.rdf
- Official page: https://op.europa.eu/it/web/eu-vocabularies/at-dataset/-/resource/dataset/file-type/
- Download command:
wget "https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F7c112635-581c-11ec-91ac-01aa75ed71a1.0001.04%2FDOC_1&fileName=filetypes-skos.rdf" -O official/filetypes-skos.rdf
- Processing:
  - Script scripts/filetypes-trim.sh:
    - Read official/filetypes-skos.rdf
    - Filter out unused languages
    - Filter out unused elements
    - Add missing it, de, fr, es entries
  - Final processed file: filetypes-filtered.rdf
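The op.europa.eu download commands above all share the same URL shape: a fixed download handler, a percent-encoded `cellarURI` parameter and a `fileName` parameter. As a minimal sketch of how such a URL is put together (the helper name `build_download_url` is an assumption made for illustration; the cellar URI and file name are the ones from the data-theme command):

```python
from urllib.parse import urlencode

# Handler endpoint as it appears in the wget commands above.
DOWNLOAD_HANDLER = "https://op.europa.eu/o/opportal-service/euvoc-download-handler"


def build_download_url(cellar_uri, file_name):
    """Percent-encode the cellar URI and file name into a download URL."""
    query = urlencode({"cellarURI": cellar_uri, "fileName": file_name})
    return f"{DOWNLOAD_HANDLER}?{query}"


url = build_download_url(
    "http://publications.europa.eu/resource/cellar/"
    "2c758808-fdd6-11ea-b44f-01aa75ed71a1.0001.02/DOC_1",
    "data-theme-skos.rdf",
)
print(url)
```

The cellar identifier changes whenever a new version of a vocabulary is published, so it has to be taken from the official page each time.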
The subthemes mapping file does not come from op.europa.eu and does not need any further processing.
- Download command:
wget https://github.com/italia/daf-ontologie-vocabolari-controllati/raw/master/VocabolariControllati/theme-subtheme-mapping/theme-subtheme-mapping.rdf
We use a subset of EUROVOC concepts to extract localized labels for subthemes, because the theme-subtheme-mapping.rdf file only provides them in it and en.
- Official page: https://op.europa.eu/it/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/eurovoc#
- Download command:
wget "https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2Fbcd714b1-5f05-11ec-9c6c-01aa75ed71a1.0001.05%2FDOC_1&fileName=eurovoc_skos.zip" -O official/eurovoc_skos.zip
- Processing:
  - Unzip the file to get eurovoc-skos-ap-eu.rdf:
    unzip official/eurovoc_skos.zip -d official/
  - Call the script:
    scripts/eurovoc-trim.sh official/eurovoc-skos-ap-eu.rdf
    It's a big file, so this will take a while.
  - Final processed file: eurovoc-skos-filtered.rdf (about 1% of the original size)
- Download command:
wget https://raw.githubusercontent.com/italia/daf-ontologie-vocabolari-controllati/master/VocabolariControllati/licences/licences.rdf
In the examples/ directory there are some sample files.
In the tests/files/ directory there are overly simplified files used in tests.
- /ckanext/dcatapit/tests/files/data-theme-skos.rdf:
  xmlstarlet ed -d "//*[@xml:lang][not(contains('it en fr de', @xml:lang))]" vocabularies/data-theme-skos.rdf > ckanext/dcatapit/tests/files/data-theme-skos.rdf
- /ckanext/dcatapit/tests/files/eurovoc_filtered.rdf
  This file contains the labels for only the used subthemes.
  To create this file you need a usable EUROVOC file (the eurovoc-filtered.rdf described above is perfectly fine) and the subtheme mapping.
  You can recreate this file using the create_eurovoc_for_test.py script:
  ./create_eurovoc_for_test.py ../../../../vocabularies/theme-subtheme-mapping.rdf ../../../../vocabularies/eurovoc-filtered.rdf
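To give an idea of what "labels for only the used subthemes" means, here is a rough Python sketch. It is not the create_eurovoc_for_test.py script and makes strong assumptions: that the mapping refers to EUROVOC concepts through `rdf:resource` attributes (`skos:narrowMatch` is used in the sample purely as an illustration), and that concepts are top-level elements identified by `rdf:about`. The function names and all sample URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_RESOURCE = f"{{{RDF_NS}}}resource"
RDF_ABOUT = f"{{{RDF_NS}}}about"


def referenced_uris(mapping_text):
    """Collect every rdf:resource value in the mapping (assumed shape)."""
    root = ET.fromstring(mapping_text)
    return {el.get(RDF_RESOURCE) for el in root.iter() if el.get(RDF_RESOURCE)}


def keep_used_concepts(eurovoc_text, used):
    """Drop top-level elements whose rdf:about is not among the used URIs."""
    root = ET.fromstring(eurovoc_text)
    for concept in list(root):
        if concept.get(RDF_ABOUT) not in used:
            root.remove(concept)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily simplified inputs.
MAPPING = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/theme/AGRI">
    <skos:narrowMatch rdf:resource="http://example.org/eurovoc/100"/>
  </skos:Concept>
</rdf:RDF>"""

EUROVOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/eurovoc/100">
    <skos:prefLabel xml:lang="fr">politique agricole</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/eurovoc/200">
    <skos:prefLabel xml:lang="fr">transport</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

subset = keep_used_concepts(EUROVOC, referenced_uris(MAPPING))
print(subset)
```

Only the concept referenced by the mapping survives, which keeps the test fixture small while still providing labels in languages the mapping file lacks.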