Merge pull request #68 from tanishq-ids/kpi_answering
Kpi answering
tanishq-ids authored Oct 29, 2024
2 parents 7f46f66 + 47d51a9 commit 61022fe
Showing 3 changed files with 183 additions and 123 deletions.
280 changes: 161 additions & 119 deletions README.rst

|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|



***********************************
OS-Climate Data Extraction Tool
***********************************


This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

1. **Relevance Detection**: Determines whether a given question-context pair is relevant.
2. **KPI Detection**: Fine-tunes models to extract key performance indicators (KPIs) from datasets such as annual reports and to perform inference.

Quick Start
^^^^^^^^^^^^^

To install the tool, use pip:

.. code-block:: shell

    $ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

.. code-block:: shell

    $ osc-transformer-based-extractor

This command will show the available commands and help via Typer, our CLI library.

Commands and Workflow
^^^^^^^^^^^^^^^^^^^^^^^

1. Relevance Detection
--------------------------

**Fine-tuning the Model:**

Assume your project structure looks like this:

.. code-block:: text

    project/
    ├── kpi_mapping.csv
    ├── training_data.csv
    ├── data/
    │   └── (JSON files for inference)
    ├── model/
    │   └── (Model-related files)
    ├── saved__model/
    │   └── (Output from training)
    ├── output/
    │   └── (Results from inference)

Use the following command to fine-tune the model:

.. code-block:: shell

    $ osc-transformer-based-extractor relevance-detector fine-tune \
        --data_path "project/training_data.csv" \
        --model_name "bert-base-uncased" \
        --num_labels 2 \
        --max_length 128 \
        --epochs 3 \
        --batch_size 16 \
        --output_dir "project/saved__model/" \
        --save_steps 500

**Running Inference:**

.. code-block:: shell

    $ osc-transformer-based-extractor relevance-detector perform-inference \
        --folder_path "project/data/" \
        --kpi_mapping_path "project/kpi_mapping.csv" \
        --output_path "project/output/" \
        --model_path "project/model/" \
        --tokenizer_path "project/model/" \
        --threshold 0.5
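
For orientation, relevance detection is a standard sequence-classification task: the model scores each question-context pair and ``--threshold`` decides which pairs count as relevant. The sketch below is illustrative only, assuming the fine-tuned model in ``project/model/`` is stored in Hugging Face format; it is not the package's internal code.

.. code-block:: python

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumed layout: the fine-tuned model and tokenizer live in project/model/
    # in Hugging Face format.
    model_path = "project/model/"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()

    question = "What is the company name?"
    context = "The Company is exposed to a risk..."

    # Score the pair and apply a cut-off, analogous to --threshold.
    # Label index 1 is assumed to be the "relevant" class.
    inputs = tokenizer(question, context, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    is_relevant = probs[0, 1].item() >= 0.5
    print(f"P(relevant)={probs[0, 1]:.3f} -> {is_relevant}")
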
2. KPI Detection
---------------------

The KPI detection functionality includes **fine-tuning** and **inference**.

**Fine-tuning the KPI Model:**

Assume your project structure looks like this:

.. code-block:: text

    project/
    ├── kpi_mapping.csv
    ├── training_data.csv
    ├── model/
    │   └── (Model-related files, e.g., tokenizer, config, checkpoints)
    ├── saved__model/
    │   └── (Folder to store output from fine-tuning)
    ├── output/
    │   └── (Output files, e.g., inference_results.xlsx)

Use the following command to fine-tune the KPI model:

.. code-block:: shell

    $ osc-transformer-based-extractor kpi-detection fine-tune \
        --data_path "project/training_data.csv" \
        --model_name "bert-base-uncased" \
        --max_length 128 \
        --epochs 3 \
        --batch_size 16 \
        --learning_rate 5e-5 \
        --output_dir "project/saved__model/" \
        --save_steps 500
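
Conceptually, KPI detection learns to produce the ``Answer`` for a question and a report passage (see the training-data columns below), which resembles extractive question answering. The following sketch uses a generic, publicly available QA model from Hugging Face purely as an illustration of that task; the package's own fine-tuned model and inference logic may differ.

.. code-block:: python

    from transformers import pipeline

    # Illustrative only: a generic extractive-QA model, not the checkpoint
    # produced by the fine-tune command above.
    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

    result = qa(
        question="In which year was the annual report published?",
        context="PAO NOVATEK published its annual report for the year 2016.",  # placeholder passage
    )
    print(result["answer"], result["score"])
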
**Performing Inference:**

.. code-block:: shell

    $ osc-transformer-based-extractor kpi-detection inference \
        --data_file_path "project/data/input_dataset.csv" \
        --output_path "project/output/inference_results.xlsx" \
        --model_path "project/model/"
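
The inference step writes its results to an Excel file (``inference_results.xlsx`` in the layout above), which can be inspected with pandas. The exact columns of the output are not assumed here; check the generated file.

.. code-block:: python

    import pandas as pd

    # Load the generated results; the exact column names depend on the tool's output.
    results = pd.read_excel("project/output/inference_results.xlsx")
    print(results.columns.tolist())
    print(results.head())
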
Training Data Requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Relevance Detection Training File:

The training file should have the following columns:

- ``Question``
- ``Context``
- ``Label``

Additionally, the output of the https://github.com/os-climate/osc-transformer-presteps module can be used as training data.

Example:

.. list-table:: Training Data Example
    :header-rows: 1

    * - Question
      - Context
      - Label
    * - What is the company name?
      - The Company is exposed to a risk...
      - 0
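
If you prepare the file yourself, a minimal placeholder CSV with exactly these columns can be written with pandas; the label convention noted in the comment is an assumption to verify against your own data.

.. code-block:: python

    import pandas as pd

    # Placeholder rows with the required columns; 1 = relevant, 0 = not relevant
    # is an assumed convention.
    training_data = pd.DataFrame(
        {
            "Question": ["What is the company name?"],
            "Context": ["The Company is exposed to a risk..."],
            "Label": [0],
        }
    )
    training_data.to_csv("project/training_data.csv", index=False)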

2. KPI Detection Training File:

For KPI detection, the dataset should have these additional columns:

.. list-table:: KPI Detection Training Example
    :header-rows: 1

    * - Question
      - Context
      - Label
      - Company
      - Source File
      - Source Page
      - KPI ID
      - Year
      - Answer
      - Data Type
      - Annotator
      - Index
    * - What is the company name?
      - The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash.
        ...
      - 0
      - NOVATEK
      - 04_NOVATEK_AR_2016_ENG_11.pdf
      - ['0']
      - 0
      - 2016
      - PAO NOVATEK
      - TEXT
      - train_anno_large.xlsx
      - 1022
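
Before fine-tuning, a quick pandas check can confirm that a prepared dataset carries all of these columns; the names below mirror the table above, so adjust them if your export spells them differently.

.. code-block:: python

    import pandas as pd

    # Column names mirror the table above.
    REQUIRED_COLUMNS = {
        "Question", "Context", "Label", "Company", "Source File", "Source Page",
        "KPI ID", "Year", "Answer", "Data Type", "Annotator", "Index",
    }

    df = pd.read_csv("project/training_data.csv")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Training data is missing columns: {sorted(missing)}")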

3. KPI Mapping File:

The inference command requires a ``kpi_mapping.csv`` file, which looks like this:

.. list-table:: KPI Mapping File Example
    :header-rows: 1

    * - kpi_id
      - question
      - sectors
      - add_year
      - kpi_category
    * - 1
      - In which year was the annual report or the sustainability report published?
      - OG, CM, CU
      - FALSE
      - TEXT
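
For reference, a matching mapping file can also be written with pandas. The ``question`` and ``sectors`` column names are assumptions to verify against your own ``kpi_mapping.csv``.

.. code-block:: python

    import pandas as pd

    # Placeholder row mirroring the table above; "question" and "sectors" are
    # assumed column names.
    kpi_mapping = pd.DataFrame(
        {
            "kpi_id": [1],
            "question": ["In which year was the annual report or the sustainability report published?"],
            "sectors": ["OG, CM, CU"],
            "add_year": [False],
            "kpi_category": ["TEXT"],
        }
    )
    kpi_mapping.to_csv("project/kpi_mapping.csv", index=False)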




************************
Developer Notes
************************

Use code directly without CLI via Github Repository
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Local Development
----------------------

Clone the repository:

.. code-block:: shell

    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use **pdm** for package management and **tox** for testing.

1. Install ``pdm`` (ideally in a virtual environment):

   .. code-block:: shell

       $ pip install pdm

2. Sync dependencies:

   .. code-block:: shell

       $ pdm sync

   Several demos on how to go on are available in the ``demo`` folder of the repository.

3. Add new packages (e.g., numpy):

   .. code-block:: shell

       $ pdm add numpy

   For details, see the pdm documentation: https://pdm-project.org/en/latest/

4. Run ``tox`` (outside of your virtual environment) for linting and testing:

   .. code-block:: shell

       $ pip install tox
       $ tox -e lint
       $ tox -e test

   This applies the configured checks and runs the provided pytests. For details, see the tox documentation: https://tox.wiki/en/4.16.0/

Contributing
^^^^^^^^^^^^^^

We welcome contributions! Please fork the repository and submit a pull request.
Ensure you sign off each commit with the **Developer Certificate of Origin (DCO)**.
Read more: http://developercertificate.org/.

Governance Transition
^^^^^^^^^^^^^^^^^^^^^^^^

On June 26, 2024, the **Linux Foundation** announced the merger of **FINOS** with OS-Climate.
Projects are now transitioning to the `FINOS governance framework <https://community.finos.org/docs/governance>`_.

Shields
^^^^^^^^^

|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

.. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
    :alt: An OS-Climate Project
    :target: https://os-climate.org/

.. |osc-climate-slack| image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
    :alt: Join OS-Climate on Slack
    :target: https://os-climate.slack.com

.. |osc-climate-github| image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
    :alt: Source code on GitHub
    :target: https://github.com/ModeSevenIndustrialSolutions/osc-data-extractor

.. |pypi| image:: https://img.shields.io/pypi/v/osc-data-extractor.svg
    :alt: PyPI package
    :target: https://pypi.org/project/osc-data-extractor/

.. |build-status| image:: https://api.cirrus-ci.com/github/os-climate/osc-data-extractor.svg?branch=main
    :alt: Build Status
    :target: https://cirrus-ci.com/github/os-climate/osc-data-extractor

.. |pdm| image:: https://img.shields.io/badge/PDM-Project-purple
    :alt: Built using PDM
    :target: https://pdm-project.org/latest/

.. |PyScaffold| image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
    :alt: Project generated with PyScaffold
    :target: https://pyscaffold.org/