TensorFlow Serving is an open-source system developed by Google for serving machine learning models, particularly those built with TensorFlow, in production environments. Its primary purpose is to provide a flexible, high-performance serving system for deploying new algorithms and experiments while keeping the same server architecture and APIs. The framework is widely adopted for model deployment because it can manage multiple models and model versions and handle inference requests efficiently.
Introduction to TensorFlow Serving
TensorFlow Serving supports the deployment of trained models for inference (prediction) in a scalable and efficient way. It is designed to handle real-time predictions (online serving) and offers features such as model version management, hot-swapping of models, and advanced configuration options for model deployment.
The system is typically used in scenarios where a trained model needs to be exposed as a service, accessible via API calls. This enables seamless integration into production applications where predictions are required.
Step 1: Preparing a Trained Model
Before using TensorFlow Serving, a model must be trained and exported in the TensorFlow SavedModel format. The SavedModel is the universal serialization format for TensorFlow models, containing the graph, variables, and metadata necessary for serving.
Suppose a simple estimator model is built using TensorFlow’s high-level Estimator API:
```python
import tensorflow as tf
import numpy as np

# Define a simple linear regression estimator
feature_columns = [tf.feature_column.numeric_column("x", shape=[1])]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

# Prepare training data
x_train = np.array([[1.], [2.], [3.], [4.]])
y_train = np.array([[0.], [-1.], [-2.], [-3.]])

input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=1, num_epochs=None, shuffle=True
)

# Train the estimator
estimator.train(input_fn=input_fn, steps=1000)

# Define the serving input signature and export the trained model
def serving_input_receiver_fn():
    inputs = {"x": tf.compat.v1.placeholder(shape=[None, 1], dtype=tf.float32)}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

export_dir = estimator.export_saved_model('exported_model', serving_input_receiver_fn)
```
After training, the exported model is available in the `exported_model` directory, typically with a timestamped subdirectory representing the model version.
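Before serving, it is useful to confirm the signature the model was exported with. A minimal sketch, assuming the timestamped subdirectory name below is replaced by the one actually created:

```python
import tensorflow as tf

# Load the exported SavedModel; replace the timestamp with the actual
# subdirectory created by export_saved_model
loaded = tf.saved_model.load("exported_model/1700000000")

# Print the default serving signature to verify the expected input
# name ("x"), dtype, and shape before deploying the model
infer = loaded.signatures["serving_default"]
print(infer.structured_input_signature)
print(infer.structured_outputs)
```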
Step 2: Installing TensorFlow Serving
TensorFlow Serving can be installed and run in several ways: natively on Linux, via Docker containers, or by building from source. The Docker approach is the most convenient and is officially supported.
To install Docker on your system, refer to the official Docker documentation. Once Docker is available, the TensorFlow Serving image can be pulled:
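```sh
# Pull the latest TensorFlow Serving image from Docker Hub
docker pull tensorflow/serving
```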
Step 3: Serving the Model with TensorFlow Serving
Assuming the exported model is located in `/models/my_model/1/`, where `1` is the version number, run the TensorFlow Serving Docker container as follows:

```sh
docker run -p 8501:8501 --name=tf_serving_linear \
  --mount type=bind,source=/models/my_model,target=/models/my_model \
  -e MODEL_NAME=my_model -t tensorflow/serving
```

Explanation of the parameters:
- `-p 8501:8501` maps the container’s port 8501 to the host, exposing the REST API.
- `--name=tf_serving_linear` assigns a name to the container.
- `--mount type=bind,source=...,target=...` mounts the local model directory into the Docker container.
- `-e MODEL_NAME=my_model` specifies the model name TensorFlow Serving will serve.
- `-t tensorflow/serving` specifies the TensorFlow Serving image.

The directory structure for models should be:
```
/models/
  my_model/
    1/
      saved_model.pb
      variables/
```

TensorFlow Serving automatically detects the version subdirectory (`1`), allowing for easy model versioning and upgrades.
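To go from the timestamped export produced in Step 1 to this layout, the exported directory can simply be copied into a numbered version folder. A minimal sketch in Python, assuming the paths shown above:

```python
import glob
import shutil

# Pick the most recent timestamped export created in Step 1
latest_export = sorted(glob.glob("exported_model/*"))[-1]

# Copy it into the versioned layout that TensorFlow Serving expects
# (the target directory must not already exist)
shutil.copytree(latest_export, "/models/my_model/1")
```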
Step 4: Making Predictions via REST API
Once the server is running, predictions can be made via HTTP POST requests to the REST API endpoint.
Here is an example using `curl` to send a prediction request:
```sh
curl -d '{"instances": [{"x": [1.0]}, {"x": [2.0]}]}' \
  -H "Content-Type: application/json" \
  http://localhost:8501/v1/models/my_model:predict
```

- The `instances` key contains a list of input examples, each matching the input signature expected by the model (`x` in this case).
TensorFlow Serving returns a prediction response in JSON format:
```json
{
  "predictions": [[output_1], [output_2]]
}
```

where `output_1` and `output_2` are the predicted values for the inputs 1.0 and 2.0, respectively.
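The same REST call can also be issued from Python. A minimal sketch using the `requests` library against the endpoint shown above:

```python
import json
import requests

# Same payload as the curl example: two instances for the "x" input
payload = {"instances": [{"x": [1.0]}, {"x": [2.0]}]}

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json()["predictions"])
```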
Step 5: Making Predictions via gRPC API
TensorFlow Serving also supports gRPC, which provides better performance and is commonly used in high-throughput production environments.
Example Python code using the gRPC API:
```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the TensorFlow Serving server
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Prepare the request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
request.inputs['x'].CopyFrom(
    tf.make_tensor_proto([[1.0], [2.0]], shape=[2, 1])
)

# Make the prediction (10-second timeout)
result = stub.Predict(request, 10.0)
print(result)
```

- The `predict_pb2` and `prediction_service_pb2_grpc` modules are provided by the `tensorflow-serving-api` pip package.
- The server must be started with the gRPC port published (`-p 8500:8500`); port 8500 is the default gRPC port of the Docker image.
Step 6: Model Versioning and Management
TensorFlow Serving is designed to efficiently handle model versioning. The directory structure allows multiple versions of a model to coexist. For example:
```
/models/
  my_model/
    1/
    2/
```

If a new version (e.g., `2`) is added, TensorFlow Serving can automatically switch to the new version without downtime, depending on configuration. By default, the highest numbered version is served.
To specify the model version in a request, the REST API provides an endpoint:
```
http://localhost:8501/v1/models/my_model/versions/2:predict
```

This enables canarying, blue-green deployments, and rollback strategies.
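By default only the latest version stays loaded. To keep several versions loaded at once (so that version-specific requests like the one above also succeed for older versions), a `model_version_policy` can be added to a model config file of the kind shown in Step 7. A sketch, assuming versions `1` and `2` exist under the base path:

```
model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model'
    model_platform: 'tensorflow'
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
}
```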
Step 7: Advanced Configuration
TensorFlow Serving supports more advanced features, such as serving multiple models simultaneously, monitoring, and custom batching.
- Serving Multiple Models:
Create a `models.config` file:
```
model_config_list {
  config {
    name: 'model1'
    base_path: '/models/model1'
    model_platform: 'tensorflow'
  }
  config {
    name: 'model2'
    base_path: '/models/model2'
    model_platform: 'tensorflow'
  }
}
```

Start the server with:
```sh
docker run -p 8501:8501 \
  --mount type=bind,source=/models,target=/models \
  -t tensorflow/serving \
  --model_config_file=/models/models.config
```

- Monitoring:
TensorFlow Serving can expose metrics in a Prometheus-compatible format at the `/monitoring/prometheus/metrics` endpoint once monitoring is enabled through a monitoring configuration file (passed with `--monitoring_config_file`).
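A minimal monitoring configuration sketch (the file can live alongside the models and is passed to the server with `--monitoring_config_file`):

```
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
```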
- Batching:
Enabling request batching can substantially improve throughput for high-volume workloads by grouping individual requests into larger batches before inference. Batching is turned on with the `--enable_batching` flag and tuned through a batching parameters file, as sketched below.
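A sketch of a batching parameters file, passed with `--batching_parameters_file`; the values are illustrative and should be tuned for the actual workload:

```
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
```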
Step 8: Security and Production Considerations
In production, ensure that TensorFlow Serving endpoints are secured using authentication and authorization layers, as well as network-level protections (for example, running behind a reverse proxy or API gateway).
Logging, monitoring, and alerting are critical for production deployments. Integrate TensorFlow Serving with centralized logging and monitoring solutions to track usage, performance, and failures.
Example End-to-End Workflow
Training and Exporting a Model:
1. Train a simple estimator as shown in Step 1.
2. Export the model to the SavedModel format, e.g., `/models/linear/1/`.

Starting TensorFlow Serving:
```sh
docker run -p 8501:8501 --name=tf_serving_example \
  --mount type=bind,source=/models/linear,target=/models/linear \
  -e MODEL_NAME=linear -t tensorflow/serving
```

Making a Prediction:
```sh
curl -d '{"instances": [{"x": [5.0]}]}' \
  -H "Content-Type: application/json" \
  http://localhost:8501/v1/models/linear:predict
```

Response:
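The response has the same `predictions` structure as in Step 4. The exact number depends on training, but for the data above (which follows y = 1 - x) it will be close to:

```json
{
  "predictions": [[-4.0]]
}
```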
Troubleshooting Common Issues
1. Model Not Found: Ensure the model directory structure is correct and the `MODEL_NAME` environment variable corresponds to the correct directory.
2. Signature Mismatch: The exported model’s input signature must match the input provided during prediction requests. Use the `saved_model_cli` tool to inspect the SavedModel signature.
3. Port Conflicts: Ensure the specified ports (8501 for REST, 8500 for gRPC) are not in use by other processes.
4. File Permissions: Verify that Docker has permission to access the model files on the host machine.

Integration with Google Cloud
TensorFlow Serving can be integrated with Google Cloud AI Platform for managed deployments. However, the fundamental principles of exporting models, serving, and querying remain consistent. Google Cloud AI Platform provides a managed service for serving TensorFlow models, abstracting away the infrastructure management.
TensorFlow Serving provides a robust and flexible solution for serving TensorFlow models in a production environment. It supports both REST and gRPC interfaces, enables version management, and integrates smoothly into scalable deployment architectures. Starting with model export, proceeding through Docker-based serving, and culminating in API-based inference, TensorFlow Serving streamlines the transition from model development to real-world deployment. Its compatibility with both simple estimators and complex models makes it a versatile tool in the machine learning deployment toolkit.