BTSpeech

Python scripts to extract textual data from unstructured parliamentary protocol files (XML) provided by the German Bundestag. For all protocols from the 19th legislative period onwards, please use my new repository, BTSpeech2.0.

These Python 3 scripts extract speech texts from XML data provided by the Bundestag Open Data Service for legislative periods in which the XML files do not follow a structured format (true for datasets of the 19th legislative period and earlier). Because the source data is unstructured, speaker and speech boundaries have to be inferred from the plain text, so the extraction is not always fully accurate. The quality of the generated dataset should nevertheless be sufficient for quantitative text analysis.

Usage

First, run:

python process_xmls.py <folder_with_xmls> <output_folder>

This extracts the individual speeches from every XML file in folder_with_xmls and writes the speaker's name, their parliamentary group (Fraktion), the date of the speech and the speech text to a JSON file in output_folder.
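To give a rough idea of what this step involves, here is a minimal sketch of splitting a protocol's plain text into speeches. It is not the code in process_xmls.py. The element names DATUM and TEXT, the "Name (Group):" speaker-line pattern, the record keys and the sample protocol are all assumptions made for illustration.

```python
import json
import re
import xml.etree.ElementTree as ET

# Made-up sample; real protocols are far longer and messier.
SAMPLE_XML = """<DOKUMENT>
  <DATUM>01.02.2017</DATUM>
  <TEXT>Beginn: 9.00 Uhr
Max Mustermann (SPD):
Sehr geehrte Damen und Herren!
Wir beraten heute den Haushalt.
Erika Musterfrau (CDU/CSU):
Vielen Dank.
</TEXT>
</DOKUMENT>"""

# Assumed heuristic: a speech starts with a line of the form "Name (Group):".
SPEAKER_LINE = re.compile(r"^(?P<name>[^():]+?)\s*\((?P<group>[^()]+)\):\s*$")


def extract_speeches(xml_string):
    """Return a list of speech records found in one protocol."""
    root = ET.fromstring(xml_string)
    date = root.findtext("DATUM", default="")
    text = root.findtext("TEXT", default="")

    speeches = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = SPEAKER_LINE.match(line)
        if match:
            # A new speaker line opens a new record.
            current = {
                "name": match.group("name"),
                "group": match.group("group"),
                "date": date,
                "text": "",
            }
            speeches.append(current)
        elif current is not None and line:
            # Any other non-empty line is attributed to the current speaker.
            current["text"] = (current["text"] + " " + line).strip()
    return speeches


if __name__ == "__main__":
    print(json.dumps(extract_speeches(SAMPLE_XML), indent=2, ensure_ascii=False))
```

A heuristic like this also shows where inaccuracies come from: any line that does not look like a speaker line is attributed to whoever spoke last, including interjections and remarks by the chair.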

Then, to generate a CSV dataset from these JSON files, run the second script:

python process_jsons.py <output_folder> <output_dataset.csv>

Here, output_folder is the folder in which the JSON files were generated in the previous step.
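Conceptually, this second step flattens all per-protocol JSON files into one table. The sketch below shows one way to do that with the standard library only. It is not the code in process_jsons.py, which relies on Pandas. It assumes that each JSON file holds a list of speech records, and the keys name, group, date and text are hypothetical.

```python
import csv
import json
from pathlib import Path

# Assumed record keys; the real JSON files may use different names.
FIELDS = ["name", "group", "date", "text"]


def jsons_to_csv(json_folder, csv_path):
    """Write every speech record found in json_folder to one CSV file.

    Returns the number of rows written.
    """
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in FIELDS.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        # Sort the files so the row order is reproducible between runs.
        for json_file in sorted(Path(json_folder).glob("*.json")):
            with open(json_file, encoding="utf-8") as source:
                speeches = json.load(source)
            for speech in speeches:
                writer.writerow(speech)
                rows_written += 1
    return rows_written
```

The csv module quotes fields that contain commas or line breaks, which matters here because speech texts routinely contain both.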

Requirements

Python >= 3.6
Pandas


jonasschm/BTSpeech

Folders and files

Latest commit

History

Repository files navigation

BTSpeech

Usage

Requirements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages