PPML (Privacy Preserving Machine Learning) in BigDL 2.0 provides a Trusted Cluster Environment for secure Big Data & AI applications, even in an untrusted cloud environment. By combining Intel Software Guard Extensions (SGX) with several other security technologies (e.g., attestation, key management service, private set intersection, federated learning, homomorphic encryption, etc.), BigDL PPML ensures end-to-end security for entire distributed workflows, such as Apache Spark, Apache Flink, XGBoost, TensorFlow, PyTorch, etc.
PPML ensures security for all dimensions of the data lifecycle: data at rest, data in transit, and data in use. Data being transferred on a network is *in transit*, data in storage is *at rest*, and data being processed is *in use*.
PPML protects compute and memory with SGX Enclaves, storage (e.g., data and models) with encryption, and network communication with remote attestation and Transport Layer Security (TLS); it also provides optional Federated Learning support.
With BigDL PPML, you can run trusted Big Data & AI applications:
- Trusted Spark SQL & Dataframe: with trusted Big Data analytics and ML/DL support, users can run standard Spark data analysis (such as Spark SQL, Dataframe, MLlib, etc.) in a secure and trusted fashion.
- Trusted ML (Machine Learning): with trusted Big Data analytics and ML/DL support, users can run distributed machine learning (such as MLlib, XGBoost, etc.) in a secure and trusted fashion.
- Trusted DL (Deep Learning): with trusted Big Data analytics and ML/DL support, users can run distributed deep learning (such as BigDL, Orca, Nano, DLlib, etc.) in a secure and trusted fashion.
- Trusted FL (Federated Learning): with PSI (Private Set Intersection), Secured Aggregation and trusted federated learning support, users can build united models across different parties without compromising privacy, even if these parties have different datasets or features.
In this section, we take SimpleQuery as an example to go through the entire BigDL PPML end-to-end workflow. SimpleQuery is a simple example that queries developers between the ages of 20 and 40 from people.csv.
Prepare your environment first, including K8s cluster setup, K8s-SGX plugin setup, key/password preparation, key management service (KMS) and attestation service (AS) setup, and BigDL PPML client container preparation. Please follow the detailed steps in Prepare Environment.
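Before moving on, you can optionally sanity-check that SGX is visible to Kubernetes. This is a minimal sketch under the assumption that your SGX device plugin advertises SGX-named node resources; adjust it to your plugin:

```bash
# Check that K8s nodes advertise SGX resources (resource names depend on
# the SGX plugin you installed; this just greps for anything SGX-related).
kubectl describe nodes | grep -i sgx

# Check that the SGX device files exist on a worker node.
ls -l /dev/sgx/enclave /dev/sgx/provision
```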
Next, you are going to build a base image, and a custom image on top of it, to avoid leaving secrets (e.g., the enclave key) in images or containers. After that, you need to register the MRENCLAVE of your custom image with the Attestation Service before running your application; PPML will then verify the runtime MREnclave automatically at the backend. The chart below illustrates the whole workflow:
Start your application by following the guide below step by step:
To build a secure PPML image for a production environment, BigDL provides a public base image that does not contain any secrets. You can customize your own image on top of this base image.
- Prepare BigDL Base Image

  Users can pull the base image from Docker Hub or build it themselves.

  Pull the base image:

  ```bash
  docker pull intelanalytics/bigdl-ppml-trusted-big-data-ml-python-gramine-base:2.2.0-SNAPSHOT
  ```
- Build Custom Image

  When the base image is ready, you need to generate your enclave key, which will be used when building the custom image. Keep the enclave key safe for future remote attestations.
  Run the following command to generate the enclave key `enclave-key.pem`, which is used to launch and sign the SGX Enclave:

  ```bash
  cd bigdl-gramine
  openssl genrsa -3 -out enclave-key.pem 3072
  ```
  When the enclave key `enclave-key.pem` is generated, you are ready to build your custom image by running the following commands:

  ```bash
  # under the bigdl-gramine dir
  # modify custom parameters in build-custom-image.sh
  ./build-custom-image.sh
  cd ..
  ```
  Warning: If you want to skip DCAP (Data Center Attestation Primitives) attestation in runtime containers, you can set `ENABLE_DCAP_ATTESTATION` to `false` in `build-custom-image.sh`; this will generate a non-attestation image. Never do this unsafe operation in production!

  The sensitive enclave key will not be saved in the built image. Two values, `mr_enclave` and `mr_signer`, are recorded while the Enclave is built; you can find their hash values in the console log. They are used to register your MREnclave in the following attestation step.

  ```
  [INFO] Use the below hash values of mr_enclave and mr_signer to register enclave:
  mr_enclave       : c7a8a42af......
  mr_signer        : 6f0627955......
  ```
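  If you prefer to capture these values programmatically rather than copying them from the console, a small sketch (assuming the hash lines appear in the build output exactly as shown above) is:

  ```bash
  # Save the build output and pull out the two hash lines
  # (the grep pattern assumes the log format shown above).
  ./build-custom-image.sh 2>&1 | tee build-custom-image.log
  grep -E "mr_(enclave|signer)" build-custom-image.log
  ```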
  Note: you can also customize the image according to your own needs, e.g., install third-party Python libraries or JARs.
Then, start a client container:
```bash
export K8S_MASTER=k8s://$(sudo kubectl cluster-info | grep 'https.*6443' -o -m 1)
echo The k8s master is $K8S_MASTER .
export DATA_PATH=/YOUR_DIR/data
export KEYS_PATH=/YOUR_DIR/keys
export SECURE_PASSWORD_PATH=/YOUR_DIR/password
export KUBECONFIG_PATH=/YOUR_DIR/kubeconfig
export LOCAL_IP=$LOCAL_IP
export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-big-data-ml-python-gramine-reference:2.2.0-SNAPSHOT # or the custom image built by yourself

sudo docker run -itd \
    --privileged \
    --net=host \
    --name=bigdl-ppml-client-k8s \
    --cpuset-cpus="0-4" \
    --oom-kill-disable \
    --device=/dev/sgx/enclave \
    --device=/dev/sgx/provision \
    -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
    -v $DATA_PATH:/ppml/trusted-big-data-ml/work/data \
    -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
    -v $SECURE_PASSWORD_PATH:/ppml/trusted-big-data-ml/work/password \
    -v $KUBECONFIG_PATH:/root/.kube/config \
    -e RUNTIME_SPARK_MASTER=$K8S_MASTER \
    -e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
    -e LOCAL_IP=$LOCAL_IP \
    $DOCKER_IMAGE bash
```
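You can confirm the client container is up before proceeding:

```bash
# The container should appear with status "Up".
sudo docker ps --filter name=bigdl-ppml-client-k8s
```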
Encrypt the input data of your Big Data & AI applications (here we use SimpleQuery) and then upload the encrypted data to the Network File System (NFS) server. More details in Encrypt Your Data.
- Generate the input data `people.csv` for the SimpleQuery application using generate_people_csv.py. The usage of the script is:

  ```bash
  python generate_people_csv.py </save/path/of/people.csv> <num_lines>
  ```

- Encrypt `people.csv`:

  ```bash
  docker exec -i $KMSUTIL_CONTAINER_NAME bash -c "bash /home/entrypoint.sh encrypt $appid $apikey $input_file_path"
  ```
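  After encryption, it is worth confirming that the encrypted output actually landed on the NFS share before you submit any jobs. A minimal check, assuming a directory layout matching the paths used later in this guide (`$DATA_PATH/simplequery/...` is an assumption based on the mounts above):

  ```bash
  # The encrypted files should be visible under the host-side data dir
  # that is mounted into the containers ($DATA_PATH in the client setup).
  ls -l $DATA_PATH/simplequery/people_encrypted
  ```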
To build your own Big Data & AI applications, refer to develop your own Big Data & AI applications with BigDL PPML. The code of SimpleQuery is here; it has already been built into bigdl-ppml-spark_${SPARK_VERSION}-${BIGDL_VERSION}.jar, and the JAR is included in the PPML image.
Enter the client container:
```bash
sudo docker exec -it bigdl-ppml-client-k8s bash
```
- Disable attestation

  If you do not need attestation, you can disable the attestation service. Configure spark-driver-template.yaml and spark-executor-template.yaml to set the `ATTESTATION` value to `false`. By default, the attestation service is disabled.

  ```yaml
  apiVersion: v1
  kind: Pod
  spec:
    ...
    env:
      - name: ATTESTATION
        value: "false"
    ...
  ```
- Enable attestation

  Bi-attestation guarantees that the MREnclave in runtime containers is a secure one built by you. Its workflow is shown below:
To enable attestation, you should have a running Attestation Service in your environment.
2.1. Deploy EHSM KMS & AS
KMS (Key Management Service) and AS (Attestation Service) ensure that the customer's applications run in the SGX MREnclave signed above by the customer, rather than a fake one forged by an attacker.
BigDL PPML uses EHSM as a reference KMS & AS; you can follow the guide here to deploy EHSM in your environment.
2.2. Enroll in EHSM

Execute the following command to enroll yourself in EHSM. The `<kms_ip>` is the IP address you configured for the EHSM service in the deployment section:

```bash
curl -v -k -G "https://<kms_ip>:9000/ehsm?Action=Enroll"
......
{"code":200,"message":"successful","result":{"apikey":"E8QKpBB******","appid":"8d5dd3b*******"}}
```

You will get an `appid` and `apikey` pair. Please save them for later use.
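If `jq` is installed (an assumption; it is not necessarily present in the PPML images), you can parse the response and export the pair in one go; a hypothetical convenience sketch:

```bash
# Enroll and extract the credentials from the JSON response with jq.
response=$(curl -s -k -G "https://<kms_ip>:9000/ehsm?Action=Enroll")
export APP_ID=$(echo "$response" | jq -r '.result.appid')
export API_KEY=$(echo "$response" | jq -r '.result.apikey')
echo "appid=$APP_ID, apikey=$API_KEY"
```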
2.3. Attest EHSM Server (optional)
You can attest the EHSM server and verify that the service is trusted before running workloads, to avoid sending your secrets to a fake service.
To attest the EHSM server, start a BigDL container using the custom image built before. Note: this is a separate container, different from the client container.
```bash
export KEYS_PATH=YOUR_LOCAL_SPARK_SSL_KEYS_FOLDER_PATH
export LOCAL_IP=YOUR_LOCAL_IP
export CUSTOM_IMAGE=YOUR_CUSTOM_IMAGE_BUILT_BEFORE
export PCCS_URL=YOUR_PCCS_URL # format like https://1.2.3.4:xxxx, obtained from KMS services or a self-deployed one

sudo docker run -itd \
    --privileged \
    --net=host \
    --cpuset-cpus="0-5" \
    --oom-kill-disable \
    -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
    -v $KEYS_PATH:/ppml/trusted-big-data-ml/work/keys \
    --name=gramine-verify-worker \
    -e LOCAL_IP=$LOCAL_IP \
    -e PCCS_URL=$PCCS_URL \
    $CUSTOM_IMAGE bash
```
Enter the docker container:
```bash
sudo docker exec -it gramine-verify-worker bash
```
Set the variables in `verify-attestation-service.sh` before running it:

- `ATTESTATION_URL`: URL of the attestation service, in the format `<ip_address>:<port>`.
- `APP_ID`, `API_KEY`: the appid and apikey pair generated by your attestation service.
- `ATTESTATION_TYPE`: type of the attestation service. Currently `EHSMAttestationService` is supported.
- `CHALLENGE`: challenge used to get the quote of the attestation service, which will be verified by the local SGX SDK. Should be a BASE64 string; any BASE64 string works, for example one generated by the command `echo anystring | base64`.
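For example, the values might look like the following (all placeholders; whether you edit them inside `verify-attestation-service.sh` or export them in the shell depends on how your copy of the script reads them):

```bash
# Placeholder values; replace with your own deployment's settings.
export ATTESTATION_URL=<kms_ip>:9000
export APP_ID=your_appid
export API_KEY=your_apikey
export ATTESTATION_TYPE=EHSMAttestationService
export CHALLENGE=$(echo ppml | base64)   # any BASE64 string is acceptable
```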
In the container, execute `verify-attestation-service.sh` to verify the attestation service quote:

```bash
bash verify-attestation-service.sh
```
2.4. Register your MREnclave to EHSM
Register your MREnclave with EHSM, using the metadata obtained in the steps above (appid, apikey, mr_enclave, mr_signer), by running a Python script:

```bash
# At /ppml/trusted-big-data-ml inside the container now
python register-mrenclave.py --appid <your_appid> \
                             --apikey <your_apikey> \
                             --url https://<kms_ip>:9000 \
                             --mr_enclave <your_mrenclave_hash_value> \
                             --mr_signer <your_mrsigner_hash_value>
```
You will receive a response containing a `policyID`. Save it; it will be used to attest the runtime MREnclave when running distributed Kubernetes applications.

2.5. Enable Attestation in configuration
First, upload the `appid`, `apikey` and `policyID` obtained before to Kubernetes as secrets:

```bash
kubectl create secret generic kms-secret \
    --from-literal=app_id=YOUR_KMS_APP_ID \
    --from-literal=api_key=YOUR_KMS_API_KEY \
    --from-literal=policy_id=YOUR_POLICY_ID
```
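You can verify that the secret was created with the expected keys (values remain base64-encoded and are not printed):

```bash
# Should list app_id, api_key and policy_id with their byte sizes.
kubectl describe secret kms-secret
```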
Configure `spark-driver-template.yaml` and `spark-executor-template.yaml` to enable Attestation as follows:

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: spark-driver
    securityContext:
      privileged: true
    env:
      - name: ATTESTATION
        value: "true"
      - name: PCCS_URL
        value: your_pccs_url            ---> <set_the_value_to_your_pccs_url>
      - name: ATTESTATION_URL
        value: your_attestation_url
      - name: APP_ID
        valueFrom:
          secretKeyRef:
            name: kms-secret
            key: app_id
      - name: API_KEY
        valueFrom:
          secretKeyRef:
            name: kms-secret
            key: api_key
      - name: ATTESTATION_POLICYID
        valueFrom:
          secretKeyRef:
            name: kms-secret
            key: policy_id
  ...
```
You should get `Attestation Success!` in the logs after you submit a PPML job if the quote generated with `user_report` is verified successfully by the Attestation Service; otherwise, you will get `Attestation Fail! Application killed!` and the job will be stopped.
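To find the attestation verdict after submitting a job, you can grep the driver log; a minimal sketch assuming the driver pod-naming pattern used in the submission examples below:

```bash
# Search the Spark driver log for the attestation result
# (the "driver" pod-name pattern is an assumption; adjust to your job name).
sudo kubectl logs $( sudo kubectl get pod | grep "driver" -m 1 | cut -d " " -f1 ) | grep -i "Attestation"
```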
When the Big Data & AI application and its input data are prepared, you are ready to submit BigDL PPML jobs. You first need to choose the deploy mode and the way to submit the job.
- There are 4 modes to submit a job:

  - local mode: run jobs locally without connecting to a cluster. It is exactly the same as using spark-submit to run your application: `$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target.jar`. The driver and executors are not protected by SGX.

  - local SGX mode: run jobs locally with SGX guarded. As the picture shows, the client JVM runs in an SGX Enclave so that the driver and executors can be protected.

  - client SGX mode: run jobs in K8s client mode with SGX guarded. In K8s client mode, the driver is deployed locally as an external client to the cluster. With client SGX mode, the executors running in the K8s cluster are protected by SGX, and the driver running in the client is also protected by SGX.

  - cluster SGX mode: run jobs in K8s cluster mode with SGX guarded. In K8s cluster mode, the driver is deployed on the K8s worker nodes like the executors. With cluster SGX mode, the driver and executors running in the K8s cluster are protected by SGX.

- There are two options to submit PPML jobs:

  - use PPML CLI to submit jobs manually
  - use helm chart to submit jobs automatically
Here we use K8s client mode and the PPML CLI to run SimpleQuery. For other modes, please see PPML CLI Usage Examples. Alternatively, you can also use Helm to submit jobs automatically; see the details in Helm Chart Usage.
- Enter the PPML container:

  ```bash
  docker exec -it bigdl-ppml-client-k8s bash
  ```
- Run SimpleQuery in K8s client mode:

  ```bash
  #!/bin/bash
  export secure_password=`openssl rsautl -inkey /ppml/trusted-big-data-ml/work/password/key.txt -decrypt </ppml/trusted-big-data-ml/work/password/output.bin`

  bash bigdl-ppml-submit.sh \
      --master $RUNTIME_SPARK_MASTER \
      --deploy-mode client \
      --sgx-enabled true \
      --sgx-driver-jvm-memory 12g \
      --sgx-executor-jvm-memory 12g \
      --driver-memory 32g \
      --driver-cores 8 \
      --executor-memory 32g \
      --executor-cores 8 \
      --num-executors 2 \
      --conf spark.kubernetes.container.image=$RUNTIME_K8S_SPARK_IMAGE \
      --name simplequery \
      --verbose \
      --class com.intel.analytics.bigdl.ppml.examples.SimpleQuerySparkExample \
      --jars local:///ppml/trusted-big-data-ml/spark-encrypt-io-0.3.0-SNAPSHOT.jar \
      local:///ppml/trusted-big-data-ml/work/data/simplequery/spark-encrypt-io-0.3.0-SNAPSHOT.jar \
      --inputPath /ppml/trusted-big-data-ml/work/data/simplequery/people_encrypted \
      --outputPath /ppml/trusted-big-data-ml/work/data/simplequery/people_encrypted_output \
      --inputPartitionNum 8 \
      --outputPartitionNum 8 \
      --inputEncryptModeValue AES/CBC/PKCS5Padding \
      --outputEncryptModeValue AES/CBC/PKCS5Padding \
      --primaryKeyPath /ppml/trusted-big-data-ml/work/data/simplequery/keys/primaryKey \
      --dataKeyPath /ppml/trusted-big-data-ml/work/data/simplequery/keys/dataKey \
      --kmsType EHSMKeyManagementService \
      --kmsServerIP your_ehsm_kms_server_ip \
      --kmsServerPort your_ehsm_kms_server_port \
      --ehsmAPPID your_ehsm_kms_appid \
      --ehsmAPIKEY your_ehsm_kms_apikey
  ```
- Check the runtime status: exit the container or open a new terminal.

  To check the logs of the Spark driver, run:

  ```bash
  sudo kubectl logs $( sudo kubectl get pod | grep "simplequery.*-driver" -m 1 | cut -d " " -f1 )
  ```

  To check the logs of a Spark executor, run:

  ```bash
  sudo kubectl logs $( sudo kubectl get pod | grep "simplequery-.*-exec" -m 1 | cut -d " " -f1 )
  ```
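  You can also watch the pods while the job runs, for example:

  ```bash
  # Watch SimpleQuery driver/executor pods start and complete.
  sudo kubectl get pods -w | grep simplequery
  ```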
- If you have set up PPML Monitoring, you can check the PPML Dashboard to monitor the status at http://kubernetes_master_url:3000
You can monitor Spark events using a history server. The history server provides an interface to watch and log Spark performance and metrics.
First, create a shared directory that can be accessed by both the client and the other worker containers in your cluster. For example, you can create an empty directory under the mounted NFS path or on HDFS. The Spark drivers and executors will write their event logs to this destination, and the history server will read logs from there as well.
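For instance, a minimal sketch assuming your NFS share is mounted at `/mnt/nfs` (a placeholder path) or that you use HDFS:

```bash
# Create the shared event-log directory on the NFS mount...
mkdir -p /mnt/nfs/spark-events
# ...or, if you use HDFS instead:
hdfs dfs -mkdir -p /spark-events
```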
Second, enter your client container and edit `$SPARK_HOME/conf/spark-defaults.conf`, where the history server reads its configuration:

```
spark.eventLog.enabled           true
spark.eventLog.dir               <your_shared_dir_path>    ---> e.g. file://<your_nfs_dir_path> or hdfs://<your_hdfs_dir_path>
spark.history.fs.logDirectory    <your_shared_dir_path>    ---> similar to spark.eventLog.dir
```
Third, run the command below and the history server will start watching automatically:

```bash
$SPARK_HOME/sbin/start-history-server.sh
```
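You can confirm the history server process is up, for example:

```bash
# The history server runs as a JVM process named HistoryServer;
# its startup log lands under $SPARK_HOME/logs.
jps | grep HistoryServer
tail $SPARK_HOME/logs/*HistoryServer*.out
```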
Next, when you run Spark jobs, enable writing driver and executor event logs in your java/spark-submit commands by setting the Spark conf like below:

```
...
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=<your_shared_dir_path> \
...
```
After starting Spark jobs, you can find the event log files at `<your_shared_dir_path>`, like:

```bash
$ ls
local-1666143241860  spark-application-1666144573580

$ cat spark-application-1666144573580
......
{"Event":"SparkListenerJobEnd","Job ID":0,"Completion Time":1666144848006,"Job Result":{"Result":"JobSucceeded"}}
{"Event":"SparkListenerApplicationEnd","Timestamp":1666144848021}
```
You can use these logs to analyze Spark jobs. Moreover, you can also browse the web UI provided by the history server at http://localhost:18080:
When the job is done, you can decrypt and read the results of the job. More details in Decrypt Job Result.
```bash
docker exec -i $KMSUTIL_CONTAINER_NAME bash -c "bash /home/entrypoint.sh decrypt $appid $apikey $input_path"
```
Demo video: simplequery.e2e.mp4
The hardware below is recommended for use with this reference implementation.
3rd Gen Intel® Xeon® Scalable processors or later
- Is SGX supported on CentOS 6/7? No. Please upgrade your OS if possible.

- Do we need an Internet connection for the SGX node? No. You can use PCCS for registration and certificate download; only the PCCS node needs an Internet connection.

- Does PCCS require SGX? No. PCCS can be installed on any server with an Internet connection.

- Can we turn off SGX attestation? Of course, but turning off attestation will break the integrity provided by SGX. Attestation is turned off only to simplify installation for a quick start.

- Do we need to rewrite our applications? No. In most cases, you don't have to rewrite your applications.