In this example, we show how to serve a Hugging Face Transformers model locally with TorchServe using KServe.
Clone the pytorch/serve repository.
Copy the Transformer_kserve_handler.py handler file to the examples/Huggingface_Transformers folder.
Navigate to examples/Huggingface_Transformers
Run the following command to download the model
python Download_Transformer_models.py
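Download_Transformer_models.py is driven by setup_config.json, which specifies the pretrained model and how it is saved. For orientation, a sequence-classification setup_config.json looks roughly like the sketch below; the field values here are illustrative, so treat the file shipped in examples/Huggingface_Transformers as the source of truth.

{
  "model_name": "bert-base-uncased",
  "mode": "sequence_classification",
  "do_lower_case": true,
  "num_labels": "2",
  "save_mode": "pretrained",
  "max_length": "150"
}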
torch-model-archiver --model-name BERTSeqClassification --version 1.0 \
--serialized-file Transformer_model/pytorch_model.bin \
--handler ./Transformer_kserve_handler.py \
--extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json,./Transformer_handler_generalized.py"
The command creates a BERTSeqClassification.mar file in the current directory.
Move the .mar file to the model store:
sudo mv BERTSeqClassification.mar /mnt/models/model-store
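If the move fails because /mnt/models/model-store does not exist yet, create it with sudo mkdir -p /mnt/models/model-store and retry.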
and use the following config.properties file under /mnt/models/config:
inference_address=http://127.0.0.1:8085
management_address=http://127.0.0.1:8085
metrics_address=http://127.0.0.1:8082
enable_envvars_config=true
install_py_dep_per_model=true
enable_metrics_api=true
service_envelope=kservev2
metrics_mode=prometheus
NUM_WORKERS=1
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"BERTSeqClassification":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
Use bert_bytes_v2.json or bert_tensor_v2.json as the request payload.
To create a request for new sample text, follow the instructions below.
For bytes input, use the tobytes utility:
python tobytes.py --input_text "this year business is good"
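For reference, a tobytes-style utility essentially wraps the raw text as a BYTES input in a KServe v2 inference request. Below is a minimal Python sketch of that idea; the input name, shape, and output file name are assumptions, and the actual script in the repository may differ.

import json

text = "this year business is good"
request = {
    "inputs": [
        {
            "name": "input-0",   # input name is an assumption
            "shape": [-1],       # variable-length input
            "datatype": "BYTES",
            "data": [text],
        }
    ]
}

# Write the request so it can be posted with curl, as shown later.
with open("bert_bytes_v2.json", "w") as f:  # file name is an assumption
    json.dump(request, f)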
For tensor input, use the bert_tokenizer utility:
python bert_tokenizer.py --input_text "this year business is good"
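For reference, a bert_tokenizer-style utility tokenizes the text and emits the token ids as integer tensors in a KServe v2 request. A minimal sketch, assuming a bert-base-uncased tokenizer and a maximum length of 150 (both assumptions; the real script may differ):

import json
from transformers import AutoTokenizer

text = "this year business is good"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # model name is an assumption
encoded = tokenizer(text, max_length=150, padding="max_length", truncation=True)

# Each tensor becomes one v2 input with an explicit shape and datatype.
request = {
    "inputs": [
        {"name": "input_ids", "shape": [1, len(encoded["input_ids"])],
         "datatype": "INT64", "data": encoded["input_ids"]},
        {"name": "attention_mask", "shape": [1, len(encoded["attention_mask"])],
         "datatype": "INT64", "data": encoded["attention_mask"]},
    ]
}

with open("bert_tensor_v2.json", "w") as f:  # file name is an assumption
    json.dump(request, f)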
Start TorchServe
torchserve --start --ts-config /mnt/models/config/config.properties --ncs
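To verify that TorchServe is up, hit the ping endpoint on the inference address configured above; a healthy instance responds with {"status": "Healthy"}.

curl http://127.0.0.1:8085/ping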
To test locally, navigate to the kubernetes/kserve/kserve_wrapper folder in the cloned TorchServe repository.
Start the KServe wrapper:
python __main__.py
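The wrapper exposes the KServe V2 inference protocol on port 8080. Before sending inference requests, you can check that the model is ready:

curl http://localhost:8080/v2/models/BERTSeqClassification/ready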
Navigate to kubernetes/kserve/kf_request_json/v2/bert
Run the following command
curl -v -H "Content-Type: application/json" http://localhost:8080/v2/models/BERTSeqClassification/infer -d @./bert_bytes_v2.json
Expected output
{"id": "d3b15cad-50a2-4eaf-80ce-8b0a428bd298", "model_name": "BERTSeqClassification", "model_version": "1.0", "outputs": [{"name": "predict", "shape": [], "datatype": "BYTES", "data": ["Not Accepted"]}]}
Run the following command
curl -v -H "Content-Type: application/json" http://localhost:8080/v2/models/BERTSeqClassification/infer -d @./bert_tensor_v2.json
Expected output
{"id": "33abc661-7265-42fc-b7d9-44e5f79a7a67", "model_name": "BERTSeqClassification", "model_version": "1.0", "outputs": [{"name": "predict", "shape": [], "datatype": "BYTES", "data": ["Not Accepted"]}]}