Deploying a custom container on Google Cloud AI Platform (now part of Vertex AI) is a process that allows practitioners to leverage their own software environments, dependencies, and frameworks for training and prediction tasks. This approach is particularly beneficial when default environments do not meet the requirements of a project, such as when custom libraries, proprietary code, or unsupported frameworks are needed.
1. Overview of Custom Containers in Cloud AI Platform
A custom container is a Docker image containing all the code, packages, and dependencies necessary for a machine learning task. Google Cloud AI Platform supports custom containers for both training and serving models. This flexibility ensures that developers can maintain control over their runtime environment, implement advanced workflows, and use any language or ML framework.
The process broadly involves creating a Docker image, uploading it to Google Container Registry (GCR) or Artifact Registry, and configuring an AI Platform job to use the image.
2. Prerequisites
– A Google Cloud Project with billing enabled.
– The Google Cloud SDK (`gcloud`) installed and authenticated.
– Docker installed locally.
– Permissions: At minimum, `roles/ml.admin` and `roles/storage.admin` on the project (for Vertex AI jobs, `roles/aiplatform.user` plus storage access is the closer fit).
– (Optional) Service account with necessary permissions for programmatic access.
3. Constructing the Dockerfile
The Dockerfile defines the environment for training or serving. It should:
– Start from a suitable base image (e.g., a Python image, TensorFlow, PyTorch, or a custom environment).
– Copy source code into the image.
– Install required system libraries and Python packages.
– Define the entry point for training or prediction.
Example Dockerfile for a Custom Training Job (Python-based):
```dockerfile
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy source code
WORKDIR /app
COPY . /app/

# Install Python dependencies
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Set entry point for training
ENTRYPOINT ["python", "train.py"]
```
In this example, `train.py` is the main script that initiates the training logic.
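The Dockerfile above installs dependencies from a `requirements.txt` in the build context. A minimal illustrative file might look as follows; the packages and pinned versions are placeholders, not requirements of AI Platform itself:

```text
# requirements.txt -- pin versions for reproducible image builds
scikit-learn==1.3.2
joblib==1.3.2
```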
4. Building and Pushing the Docker Image
After writing the Dockerfile and placing the source code in the build context, the image can be built and pushed to a container registry accessible by AI Platform.
Steps:
– Set your Google Cloud project:

```bash
gcloud config set project [PROJECT_ID]
```

– Build the Docker image:

```bash
docker build -t gcr.io/[PROJECT_ID]/custom-ml-image:latest .
```

– Authenticate Docker to the Google Container Registry:

```bash
gcloud auth configure-docker
```

– Push the image to the registry:

```bash
docker push gcr.io/[PROJECT_ID]/custom-ml-image:latest
```
5. Preparing Training Code and Entry Point
The training code must conform to certain standards:
– Accept command-line arguments for hyperparameters, data paths, and output directories.
– Write model artifacts to the directory specified by the `--model-dir` (or similar) argument.
– Log progress and errors using standard output (stdout) and standard error (stderr).
Sample `train.py`:
```python
import argparse
import logging

from model import train_model  # Assume train_model is defined in model.py


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-dir', type=str, required=True)
    parser.add_argument('--model-dir', type=str, required=True)
    parser.add_argument('--epochs', type=int, default=10)
    return parser.parse_args()


def main():
    args = parse_args()
    logging.basicConfig(level=logging.INFO)  # without this, INFO logs are suppressed
    logging.info(f"Training with data from {args.data_dir}")
    # Training logic
    train_model(data_dir=args.data_dir, output_dir=args.model_dir, epochs=args.epochs)


if __name__ == "__main__":
    main()
```
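The `train_model` function imported above lives in a separate `model.py`. As a hedged sketch of the contract it must satisfy — read inputs from `data_dir`, write artifacts to `output_dir` — here is a stdlib-only stand-in (the file name `train.json`, the artifact format, and the "training" itself are illustrative, not part of any AI Platform requirement):

```python
# model.py -- illustrative stand-in for the train_model used by train.py.
# A real implementation would call an ML framework; this only demonstrates
# the expected interface: consume data_dir, emit an artifact in output_dir.
import json
import os


def train_model(data_dir, output_dir, epochs):
    """Toy 'training': average the numbers in data_dir/train.json."""
    with open(os.path.join(data_dir, "train.json")) as f:
        values = json.load(f)
    mean = sum(values) / len(values)  # stands in for real model fitting
    os.makedirs(output_dir, exist_ok=True)
    artifact = {"mean": mean, "epochs": epochs}
    with open(os.path.join(output_dir, "model.json"), "w") as f:
        json.dump(artifact, f)
    return artifact
```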
6. Configuring and Submitting the Training Job
Training jobs can be submitted using `gcloud` or via the AI Platform API.
Using `gcloud` CLI:
```bash
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=custom-container-job \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=gcr.io/[PROJECT_ID]/custom-ml-image:latest
```

(Note: the `local-package-path` and `python-module` keys belong to gcloud's auto-packaging workflow for local source code; they are not combined with a prebuilt `container-image-uri`.)
Key Parameters:
– `--region`: The region for the training job.
– `--display-name`: Friendly name for identification.
– `--worker-pool-spec`: Specifies the compute resources, image, and entry point.
– `container-image-uri`: URI of the pushed Docker image (a key within `--worker-pool-spec`).
For distributed training, adjust `replica-count` and specify additional worker pool specs if needed.
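For more than one worker pool, it is often clearer to pass the job spec as a YAML file via `gcloud ai custom-jobs create --config=config.yaml`. A sketch follows; the field names follow the Vertex AI `CustomJobSpec` schema, and the machine types, counts, and image URI are placeholders:

```yaml
# config.yaml -- illustrative two-pool spec: one chief, two workers
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/[PROJECT_ID]/custom-ml-image:latest
  - machineSpec:
      machineType: n1-standard-4
    replicaCount: 2
    containerSpec:
      imageUri: gcr.io/[PROJECT_ID]/custom-ml-image:latest
```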
7. Data and Artifact Management
Datasets and model artifacts should be stored in Google Cloud Storage (GCS). The training script should read data from GCS and write outputs back to GCS.
References in the training script:
– Input data path: `gs://[BUCKET_NAME]/data/`
– Output model path: `gs://[BUCKET_NAME]/models/`
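In practice the training script would hand these `gs://` paths to the `google-cloud-storage` client or a framework's file I/O layer; as a small stdlib-only sketch of the path handling involved, a URI can be split into bucket and object prefix like this (the function name is a hypothetical helper, not a library API):

```python
# Sketch: split a gs:// URI into (bucket, object_prefix) before handing the
# pieces to a storage client. Pure stdlib; raises on non-GCS URIs.
from urllib.parse import urlparse


def split_gcs_uri(uri):
    parsed = urlparse(uri)  # scheme='gs', netloc=bucket, path='/prefix'
    if parsed.scheme != "gs":
        raise ValueError(f"not a GCS URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")
```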
8. Monitoring and Logging
Once the job is submitted, its progress can be monitored from the Google Cloud Console under Vertex AI > Training, or via `gcloud ai custom-jobs describe [JOB_ID]`. Logs are streamed to Stackdriver Logging (now Cloud Logging), where stdout and stderr from the container are accessible.
9. Deploying a Custom Container for Prediction
Serving with a custom container is similar. The container must run an HTTP server that answers both prediction requests and health checks on the routes Vertex AI is configured to use; these routes can be set when uploading the model and are exposed to the container through environment variables such as `AIP_PREDICT_ROUTE`, `AIP_HEALTH_ROUTE`, and `AIP_HTTP_PORT`. By default, the container should listen on port 8080.
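Since Vertex AI communicates the serving port to the container via the `AIP_HTTP_PORT` environment variable, a server can honor it while still defaulting to 8080 for local runs. A minimal sketch (the helper name is illustrative):

```python
# Sketch: resolve the port a custom serving container should bind.
# Vertex AI sets AIP_HTTP_PORT; falling back to 8080 keeps the same
# code runnable on a developer machine.
import os


def serving_port(default=8080):
    return int(os.environ.get("AIP_HTTP_PORT", default))
```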
Dockerfile for Prediction:
```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY . /app/

RUN pip install -r requirements.txt

EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "predict:app"]
```
Here, `predict:app` refers to a Python module `predict.py` with a WSGI app named `app` (for example, built with Flask; an ASGI framework such as FastAPI would instead need an ASGI worker, e.g. `gunicorn -k uvicorn.workers.UvicornWorker`).
Sample Flask-Based Model Server:
```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.joblib')


@app.route('/health', methods=['GET'])
def health():
    # Vertex AI probes a health route before routing traffic to the container
    return jsonify({'status': 'healthy'})


@app.route('/v1/endpoints/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict(data['instances'])
    return jsonify({'predictions': prediction.tolist()})
```
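The JSON bodies exchanged with the prediction route follow a fixed shape: Vertex AI sends `{"instances": [...]}` and expects `{"predictions": [...]}` back. A stdlib-only sketch of that round trip (the `len`-based "prediction" is a stand-in for a real `model.predict` call):

```python
# Sketch of the request/response payload shape for the prediction protocol.
import json

# What Vertex AI would POST to the predict route
request_body = json.dumps({"instances": [[5.1, 3.5, 1.4, 0.2]]})

payload = json.loads(request_body)
# Stand-in for model.predict(payload["instances"])
predictions = [len(instance) for instance in payload["instances"]]

# What the server should return
response_body = json.dumps({"predictions": predictions})
```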
10. Deploying the Prediction Service
Once the image is pushed to GCR, deploy the model:
– Register the model:
```bash
gcloud ai models upload \
  --region=us-central1 \
  --display-name=custom-container-model \
  --container-image-uri=gcr.io/[PROJECT_ID]/custom-ml-image:latest
```
– Create an endpoint:

```bash
gcloud ai endpoints create \
  --region=us-central1 \
  --display-name=custom-endpoint
```
– Deploy the model to the endpoint:
```bash
gcloud ai endpoints deploy-model [ENDPOINT_ID] \
  --region=us-central1 \
  --model=[MODEL_ID] \
  --display-name=custom-container-deployment \
  --machine-type=n1-standard-4
```
11. Best Practices
– Automate Image Builds: Use Cloud Build or CI/CD pipelines for consistent, repeatable container builds.
– Parameterize Scripts: Design code to accept parameters via command-line or environment variables for flexibility and reproducibility.
– Handle Errors Gracefully: Ensure the container exits with non-zero status codes on failure for proper job monitoring.
– Testing: Test the container locally with representative data before deploying to production.
– Security: Use least-privilege IAM roles for the service account running the job, and scan images for vulnerabilities.
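On the error-handling point above: the container's exit status is what the job scheduler sees, so an unhandled exception should be logged and converted to a non-zero code. A minimal sketch of that pattern (the `run` wrapper is a hypothetical helper; in `train.py` one would pass its result to `sys.exit`):

```python
# Sketch: convert unhandled exceptions in the training entry point into a
# logged failure and a non-zero exit code that the platform can detect.
import logging


def run(job):
    """Run a zero-argument callable; return 0 on success, 1 on failure."""
    try:
        job()
        return 0
    except Exception:
        logging.exception("training job failed")
        return 1

# In train.py one would end with: sys.exit(run(main))
```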
12. Example Workflow: End-to-End
Suppose a team develops a PyTorch-based image classification model that requires custom C++ dependencies and a particular version of torchvision not available in managed environments.
Steps:
1. Write Dockerfile: Install required system libraries, PyTorch, and custom dependencies.
2. Prepare Code: Ensure `train.py` reads from GCS and writes outputs to GCS, accepts hyperparameters as arguments.
3. Build and Push Image: Build the image locally and push to GCR.
4. Submit Training Job: Use `gcloud ai custom-jobs create` with the custom image, specifying data and output directories.
5. Monitor Progress: Use Google Cloud Console and Cloud Logging for job monitoring.
6. Model Serving: Write a Flask app exposing a `/predict` endpoint (listening on port 8080), package it in a Docker image, push to GCR, and deploy via Vertex AI endpoints.
13. Troubleshooting
– Job Fails to Start: Check logs for errors in Dockerfile or entry point.
– Container Not Found: Ensure correct image URI and that the image is in a registry accessible to Vertex AI.
– Port Not Exposed: For prediction, the container must listen on port 8080.
– Data Access Issues: Verify GCS paths and service account permissions.
14. Further Reading
– [Vertex AI Custom Containers Documentation](https://cloud.google.com/vertex-ai/docs/training/custom-containers-training)
– [Docker Reference](https://docs.docker.com/engine/reference/builder/)
– [gcloud AI Platform Commands](https://cloud.google.com/sdk/gcloud/reference/ai/)