Modular, scalable, and optimized deployment code for all the Transactional Voice AI modules, namely Automatic Speech Recognition (ASR), Inverse Text Normalization (ITN), and Intent and Entity Recognition, using Triton Inference Server and FastAPI.
For more info on the Transactional Voice AI development codebase, refer here.
Prerequisites • Setup • Schema • Test • Benchmark
Make sure that the system has NVIDIA GPU card(s) with the corresponding latest drivers installed.
- Docker (and Docker Compose) - Follow the steps outlined here.
- Install `nvidia-container-toolkit` - Official Guide (Note: it also contains steps for installing Docker!)
Clone the repository:
```
git clone https://github.com/AI4Bharat/transactional-voice-ai_serving.git
cd transactional-voice-ai_serving
```
Copy (and modify, if required) the `.env` file:
```
cp .env.example .env
```
Build and run the Docker containers using `docker compose`:
```
docker compose up --build
```
This will start two servers: the Triton Inference Server and its wrapper FastAPI server, which also acts as the entrypoint for every request.
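Once the containers are up, a quick readiness check can be done from Python. This is a minimal sketch: Triton's HTTP port defaults to 8000 and `/v2/health/ready` is its standard KServe v2 readiness endpoint, but the actual exposed ports depend on your `.env` / compose configuration, and the FastAPI port below is hypothetical.

```python
# Hedged sketch: verifying both servers respond. Ports are assumptions;
# adjust them to match your .env / docker-compose configuration.
import requests

# Triton's standard KServe v2 readiness endpoint (default HTTP port 8000)
triton = requests.get("http://localhost:8000/v2/health/ready", timeout=5)
print("Triton ready:", triton.ok)

# FastAPI serves interactive docs at /docs by default (port is hypothetical)
fastapi = requests.get("http://localhost:8080/docs", timeout=5)
print("FastAPI up:", fastapi.ok)
```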
The payload to the server must follow this schema:
```json
{
    "config": {
        "language": {
            "sourceLanguage": "en"
        },
        "transcriptionFormat": {
            "value": "transcript"
        },
        "audioFormat": "wav",
        "samplingRate": 8000,
        "postProcessors": [
            "tag_entities"
        ]
    },
    "audio": [
        {
            "audioUri": "https://t3638486.p.clickup-attachments.com/t3638486/b6f63475-a96f-4c25-be45-0495946d440e/8797501890_mobile_number440_08_09_2022_20_46_25.wav"
        }
    ]
}
```
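For instance, the payload can be sent with a simple Python client. This is a sketch only: the endpoint URL, port, and audio URL below are assumptions, not the actual deployment values.

```python
# Hedged sketch: POSTing the payload above to the FastAPI wrapper.
# The endpoint URL/port and the audio URL are assumptions.
import requests

payload = {
    "config": {
        "language": {"sourceLanguage": "en"},
        "transcriptionFormat": {"value": "transcript"},
        "audioFormat": "wav",
        "samplingRate": 8000,
        "postProcessors": ["tag_entities"],
    },
    "audio": [{"audioUri": "https://example.com/sample.wav"}],  # replace with a real WAV URL
}

response = requests.post("http://localhost:8080/inference", json=payload, timeout=60)  # hypothetical endpoint
response.raise_for_status()
print(response.json())
```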
The `audioUri` should contain a link to a WAV file. Instead of a URI, one can also provide the audio in base64 format using the `audioContent` key:
```json
{
    "config": {
        "language": {
            "sourceLanguage": "en"
        },
        "transcriptionFormat": {
            "value": "transcript"
        },
        "audioFormat": "wav",
        "samplingRate": 8000,
        "postProcessors": [
            "tag_entities"
        ]
    },
    "audio": [
        {
            "audioContent": "GkXfo59ChoEBQveBAULygQRC84EIQoKEd2VibUKHgQRChYECGFOAZw…"
        }
    ]
}
```
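To build the `audioContent` variant, the WAV bytes can be base64-encoded before being placed in the payload. A minimal sketch (the file path is illustrative):

```python
# Sketch: base64-encoding a local WAV file for the audioContent field.
import base64

with open("sample.wav", "rb") as f:  # illustrative path
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

audio_block = [{"audioContent": audio_b64}]  # drop-in replacement for the "audio" list
```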
The server responds with the ASR transcript along with intent and entity predictions. The intent is provided as a single string indicating the intent name, while the entities are a list containing one dictionary per predicted entity, giving the entity type, matched word, normalized value, and start and end indices.
```json
{
    "status": {
        "success": true,
        "message": ""
    },
    "output": [
        {
            "source": "please transfer 200 rupees to mobile number 9998887776 from my sbi account",
            "entities": [
                {
                    "entity": "bank_name",
                    "word": "sbi account",
                    "start": 42,
                    "end": 45,
                    "value": "state_bank"
                },
                {
                    "entity": "amount_of_money",
                    "word": "200 rupees",
                    "start": 17,
                    "end": 27,
                    "value": "200"
                },
                {
                    "entity": "mobile_number",
                    "word": "9998887776",
                    "start": 38,
                    "end": 48,
                    "value": "9998887776"
                }
            ],
            "id": "abcdefghij1234567890AB",
            "intent": "p2p_transfer"
        }
    ]
}
```
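A client might consume the response along these lines (a minimal sketch, assuming the JSON body shown above has already been parsed into a dict):

```python
# Minimal sketch: extracting the transcript, intent and entities
# from a parsed response body.
def summarize(response: dict) -> None:
    if not response["status"]["success"]:
        raise RuntimeError(response["status"]["message"])
    for item in response["output"]:
        print("transcript:", item["source"])
        print("intent:", item["intent"])
        for ent in item["entities"]:
            # "word" is the surface form; "value" is the normalized value
            print(f'  {ent["entity"]}: {ent["word"]!r} -> {ent["value"]}')
```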
Test the complete pipeline using the Python client on a sample `.wav` file:
```
cd fastapi_client/scripts
python single_file_inference.py <lang-code>  # lang-code can be "en", "hi" or "ta"
```
All the results shown below can be reproduced by running the following commands:
```
cd fastapi_client/scripts
```
For the Tamil language:
```
python benchmark_npci_concurrent.py --gt-file ../data/ta/npci_pipeline_benchmark/ta-benchmark-07-19.csv --audio-folder ../data/ta/npci_pipeline_benchmark/audio --savefile ../data/ta/npci_pipeline_benchmark/results/temp.csv --lang "ta" --batchsize 1
```
For the Hindi language:
```
python benchmark_npci_concurrent.py --gt-file ../data/hi/npci_pipeline_benchmark/hi-benchmark-v0109-fixed.csv --audio-folder ../data/hi/npci_pipeline_benchmark/audio --savefile ../data/hi/npci_pipeline_benchmark/results/temp.csv --lang "hi" --batchsize 1
```
For the English language:
```
python benchmark_npci_concurrent.py --gt-file ../data/en/npci_pipeline_benchmark/en-benchmark-v0109-fixed.csv --audio-folder ../data/en/npci_pipeline_benchmark/audio --savefile ../data/en/npci_pipeline_benchmark/results/temp.csv --lang "en" --batchsize 1
```
Data Source - NPCI's collection of samples through IVRS
Data Type - 8 kHz single-channel audio data with human-annotated entity/intent labels
| Language | # Utterances | Total Duration (hrs) | Average Length (sec) | # Entities | # Intents |
|---|---|---|---|---|---|
| en | 2584 | 9.50 | 13.23 | 2239 | 894 |
| hi | 4411 | 8.9 | 12.88 | 2671 | 1279 |
| ta | 665 | 1.75 | 9.48 | 676 | - |
Metrics - Accuracy for Intent Recognition and F1 Score for Entity Recognition
| Language | Intent Type | Intent Accuracy | Entity Type | Entity F1 Score |
|---|---|---|---|---|
| en | p2p_transfer | 85 | amount_of_money | 89 |
|  |  |  | bank_name | 86 |
|  |  |  | mobile_number | 88 |
| hi | p2p_transfer | 87 | amount_of_money | 90 |
|  |  |  | bank_name | 90 |
|  |  |  | mobile_number | 86 |
| ta | p2p_transfer | - | amount_of_money | 69 |
|  |  |  | bank_name | 82 |
|  |  |  | mobile_number | 75 |
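For reference, the entity F1 score combines precision and recall over entity-level matches in the usual way. The sketch below assumes counts of true positives, false positives, and false negatives are already available; the exact matching criterion used by the benchmark script is not shown here.

```python
# Standard F1 from entity-level counts. The matching criterion
# (e.g. exact type + value match) is an assumption.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```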
Setup 1 (Default) - A single instance of each model loaded in GPU memory using Triton Inference Server, except for the Pyctcdecode CPU module (8 instances)
| Lang | Hardware Type | Avg. GPU VRAM Usage (GB) | Avg. GPU Utilization (%) | Avg. CPU RAM Usage (GB) | Avg. CPU Utilization (%) | Total Time (s) / # Samples | Avg. Latency (s) | Avg. Throughput (RPS) |
|---|---|---|---|---|---|---|---|---|
| en | A100-80GB, 16-core-110GB | 9.23 | 6.89 | 21.15 | 34.75 | 393 / 2584 | 1.19 | 5.92 |
| en | T4-16GB, 4-core-28GB | 4.16 | 16.47 | 23 | 83.95 | 759 / 2584 | 2.37 | 3.25 |
| hi | A100-80GB, 16-core-110GB | 11.8 | 5.19 | 22.26 | 13.48 | 968 / 4411 | 1.73 | 4.32 |
| hi | T4-16GB, 4-core-28GB | 5.2 | 16.6 | 24.28 | 51.58 | 1483 / 4411 | 2.72 | 2.94 |
| ta | A100-80GB, 16-core-110GB | 14.6 | 5.45 | 22.23 | 44.84 | 172 / 665 | 1.91 | 3.52 |
| ta | T4-16GB, 4-core-28GB | 6.15 | 11.4 | 24.49 | 93.22 | 381 / 665 | 4.77 | 1.63 |
Setup 2 - Two instances of each model loaded in GPU memory using Triton Inference Server, except for the Pyctcdecode CPU module (8 instances)
| Lang | Hardware Type | Avg. GPU VRAM Usage (GB) | Avg. GPU Utilization (%) | Avg. CPU RAM Usage (GB) | Avg. CPU Utilization (%) | Total Time (s) / # Samples | Avg. Latency (s) | Avg. Throughput (RPS) |
|---|---|---|---|---|---|---|---|---|
| en | A100-80GB, 16-core-110GB | 13.3 | 11.35 | 25.04 | 51.86 | 273 / 2584 | 1.89 | 8.84 |
| hi | A100-80GB, 16-core-110GB | 17.1 | 10.94 | 26.7 | 25.6 | 968 / 4411 | 1.86 | 8.35 |
| ta | A100-80GB, 16-core-110GB | 13.3 | 6.3 | 24.73 | 53.91 | 147 / 665 | 3.22 | 4.09 |
For higher throughput, increase the `count` in the `instance_group` settings of Triton's model configuration and/or scale horizontally, as in the sketch below.
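For example, a model's `config.pbtxt` could declare two GPU instances as follows. This is an illustrative snippet of Triton's standard instance-group syntax; the device placement is an assumption and should match your hardware.

```
instance_group [
  {
    # Two copies of the model served concurrently (illustrative values)
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```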
Note: All the stats are for the end-to-end system, including the FastAPI wrapper on top of the Triton server.