Deploying a custom container on Google Cloud AI Platform (now part of Vertex AI) is a process that allows practitioners to leverage their own software environments, dependencies, and frameworks for training and prediction tasks. This approach is particularly beneficial when default environments do not meet the requirements of a project, such as when custom libraries, proprietary code, or unsupported frameworks are needed.
1. Overview of Custom Containers in Cloud AI Platform
A custom container is a Docker image containing all the code, packages, and dependencies necessary for a machine learning task. Google Cloud AI Platform supports custom containers for both training and serving models. This flexibility ensures that developers can maintain control over their runtime environment, implement advanced workflows, and use any language or ML framework.
The process broadly involves creating a Docker image, uploading it to Google Container Registry (GCR) or Artifact Registry, and configuring an AI Platform job to use the image.
2. Prerequisites
– A Google Cloud Project with billing enabled.
– The Google Cloud SDK (`gcloud`) installed and authenticated.
– Docker installed locally.
– Permissions: At minimum, `roles/ml.admin` and `roles/storage.admin` on the project (for Vertex AI jobs, `roles/aiplatform.user` plus storage access is the closer fit).
– (Optional) Service account with necessary permissions for programmatic access.
3. Constructing the Dockerfile
The Dockerfile defines the environment for training or serving. It should:
– Start from a suitable base image (e.g., a Python image, TensorFlow, PyTorch, or a custom environment).
– Copy source code into the image.
– Install required system libraries and Python packages.
– Define the entry point for training or prediction.
Example Dockerfile for a Custom Training Job (Python-based):
```dockerfile
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy source code
WORKDIR /app
COPY . /app/

# Install Python dependencies
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

# Set entry point for training
ENTRYPOINT ["python", "train.py"]
```
In this example, `train.py` is the main script that initiates the training logic.
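The Dockerfile above installs dependencies from a `requirements.txt` in the build context. A minimal illustrative file might look as follows; the packages and pinned versions are placeholders, not requirements of AI Platform itself:

```text
# requirements.txt -- pin versions for reproducible image builds
scikit-learn==1.3.2
joblib==1.3.2
```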
4. Building and Pushing the Docker Image
After writing the Dockerfile and placing the source code in the build context, the image can be built and pushed to a container registry accessible by AI Platform.
Steps:
– Set your Google Cloud project:

```bash
gcloud config set project [PROJECT_ID]
```

– Build the Docker image:

```bash
docker build -t gcr.io/[PROJECT_ID]/custom-ml-image:latest .
```

– Authenticate Docker to the Google Container Registry:

```bash
gcloud auth configure-docker
```

– Push the image to the registry:

```bash
docker push gcr.io/[PROJECT_ID]/custom-ml-image:latest
```
5. Preparing Training Code and Entry Point
The training code must conform to certain standards:
– Accept command-line arguments for hyperparameters, data paths, and output directories.
– Write model artifacts to the directory specified by the `--model-dir` (or similar) argument.
– Log progress and errors using standard output (stdout) and standard error (stderr).
Sample `train.py`:
```python
import argparse
import logging

from model import train_model  # Assume train_model is defined in model.py


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-dir', type=str, required=True)
    parser.add_argument('--model-dir', type=str, required=True)
    parser.add_argument('--epochs', type=int, default=10)
    return parser.parse_args()


def main():
    args = parse_args()
    logging.basicConfig(level=logging.INFO)  # without this, INFO logs are suppressed
    logging.info(f"Training with data from {args.data_dir}")
    # Training logic
    train_model(data_dir=args.data_dir, output_dir=args.model_dir, epochs=args.epochs)


if __name__ == "__main__":
    main()
```
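The `train_model` function imported above lives in a separate `model.py`. As a hedged sketch of the contract it must satisfy — read inputs from `data_dir`, write artifacts to `output_dir` — here is a stdlib-only stand-in (the file name `train.json`, the artifact format, and the "training" itself are illustrative, not part of any AI Platform requirement):

```python
# model.py -- illustrative stand-in for the train_model used by train.py.
# A real implementation would call an ML framework; this only demonstrates
# the expected interface: consume data_dir, emit an artifact in output_dir.
import json
import os


def train_model(data_dir, output_dir, epochs):
    """Toy 'training': average the numbers in data_dir/train.json."""
    with open(os.path.join(data_dir, "train.json")) as f:
        values = json.load(f)
    mean = sum(values) / len(values)  # stands in for real model fitting
    os.makedirs(output_dir, exist_ok=True)
    artifact = {"mean": mean, "epochs": epochs}
    with open(os.path.join(output_dir, "model.json"), "w") as f:
        json.dump(artifact, f)
    return artifact
```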
6. Configuring and Submitting the Training Job
Training jobs can be submitted using `gcloud` or via the AI Platform API.
Using `gcloud` CLI:
```bash
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=custom-container-job \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=gcr.io/[PROJECT_ID]/custom-ml-image:latest
```

(Note: the `local-package-path` and `python-module` keys belong to gcloud's auto-packaging workflow for local source code; they are not combined with a prebuilt `container-image-uri`.)
Key Parameters:
– `--region`: The region for the training job.
– `--display-name`: Friendly name for identification.
– `--worker-pool-spec`: Specifies the compute resources, image, and entry point.
– `container-image-uri`: URI of the pushed Docker image (a key within `--worker-pool-spec`).
For distributed training, adjust `replica-count` and specify additional worker pool specs if needed.
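For more than one worker pool, it is often clearer to pass the job spec as a YAML file via `gcloud ai custom-jobs create --config=config.yaml`. A sketch follows; the field names follow the Vertex AI `CustomJobSpec` schema, and the machine types, counts, and image URI are placeholders:

```yaml
# config.yaml -- illustrative two-pool spec: one chief, two workers
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-4
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/[PROJECT_ID]/custom-ml-image:latest
  - machineSpec:
      machineType: n1-standard-4
    replicaCount: 2
    containerSpec:
      imageUri: gcr.io/[PROJECT_ID]/custom-ml-image:latest
```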
7. Data and Artifact Management
Datasets and model artifacts should be stored in Google Cloud Storage (GCS). The training script should read data from GCS and write outputs back to GCS.
References in the training script:
– Input data path: `gs://[BUCKET_NAME]/data/`
– Output model path: `gs://[BUCKET_NAME]/models/`
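In practice the training script would hand these `gs://` paths to the `google-cloud-storage` client or a framework's file I/O layer; as a small stdlib-only sketch of the path handling involved, a URI can be split into bucket and object prefix like this (the function name is a hypothetical helper, not a library API):

```python
# Sketch: split a gs:// URI into (bucket, object_prefix) before handing the
# pieces to a storage client. Pure stdlib; raises on non-GCS URIs.
from urllib.parse import urlparse


def split_gcs_uri(uri):
    parsed = urlparse(uri)  # scheme='gs', netloc=bucket, path='/prefix'
    if parsed.scheme != "gs":
        raise ValueError(f"not a GCS URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")
```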
8. Monitoring and Logging
Once the job is submitted, its progress can be monitored from the Google Cloud Console under Vertex AI > Training, or via `gcloud ai custom-jobs describe [JOB_ID]`. Logs are streamed to Stackdriver Logging (now Cloud Logging), where stdout and stderr from the container are accessible.
9. Deploying a Custom Container for Prediction
Serving with a custom container is similar. The container must run an HTTP server that answers both prediction requests and health checks on the routes Vertex AI is configured to use; these routes can be set when uploading the model and are exposed to the container through environment variables such as `AIP_PREDICT_ROUTE`, `AIP_HEALTH_ROUTE`, and `AIP_HTTP_PORT`. By default, the container should listen on port 8080.
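Since Vertex AI communicates the serving port to the container via the `AIP_HTTP_PORT` environment variable, a server can honor it while still defaulting to 8080 for local runs. A minimal sketch (the helper name is illustrative):

```python
# Sketch: resolve the port a custom serving container should bind.
# Vertex AI sets AIP_HTTP_PORT; falling back to 8080 keeps the same
# code runnable on a developer machine.
import os


def serving_port(default=8080):
    return int(os.environ.get("AIP_HTTP_PORT", default))
```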
Dockerfile for Prediction:
```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY . /app/

RUN pip install -r requirements.txt

EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "predict:app"]
```
Here, `predict:app` refers to a Python module `predict.py` with a WSGI app named `app` (for example, built with Flask; an ASGI framework such as FastAPI would instead need an ASGI worker, e.g. `gunicorn -k uvicorn.workers.UvicornWorker`).
Sample Flask-Based Model Server:
```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.joblib')


@app.route('/health', methods=['GET'])
def health():
    # Vertex AI probes a health route before routing traffic to the container
    return jsonify({'status': 'healthy'})


@app.route('/v1/endpoints/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict(data['instances'])
    return jsonify({'predictions': prediction.tolist()})
```
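The JSON bodies exchanged with the prediction route follow a fixed shape: Vertex AI sends `{"instances": [...]}` and expects `{"predictions": [...]}` back. A stdlib-only sketch of that round trip (the `len`-based "prediction" is a stand-in for a real `model.predict` call):

```python
# Sketch of the request/response payload shape for the prediction protocol.
import json

# What Vertex AI would POST to the predict route
request_body = json.dumps({"instances": [[5.1, 3.5, 1.4, 0.2]]})

payload = json.loads(request_body)
# Stand-in for model.predict(payload["instances"])
predictions = [len(instance) for instance in payload["instances"]]

# What the server should return
response_body = json.dumps({"predictions": predictions})
```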
10. Deploying the Prediction Service
Once the image is pushed to GCR, deploy the model:
– Register the model:
```bash
gcloud ai models upload \
  --region=us-central1 \
  --display-name=custom-container-model \
  --container-image-uri=gcr.io/[PROJECT_ID]/custom-ml-image:latest
```
– Create an endpoint:

```bash
gcloud ai endpoints create \
  --region=us-central1 \
  --display-name=custom-endpoint
```
– Deploy the model to the endpoint:
```bash
gcloud ai endpoints deploy-model [ENDPOINT_ID] \
  --region=us-central1 \
  --model=[MODEL_ID] \
  --display-name=custom-container-deployment \
  --machine-type=n1-standard-4
```
11. Best Practices
– Automate Image Builds: Use Cloud Build or CI/CD pipelines for consistent, repeatable container builds.
– Parameterize Scripts: Design code to accept parameters via command-line or environment variables for flexibility and reproducibility.
– Handle Errors Gracefully: Ensure the container exits with non-zero status codes on failure for proper job monitoring.
– Testing: Test the container locally with representative data before deploying to production.
– Security: Use least-privilege IAM roles for the service account running the job, and scan images for vulnerabilities.
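On the error-handling point above: the container's exit status is what the job scheduler sees, so an unhandled exception should be logged and converted to a non-zero code. A minimal sketch of that pattern (the `run` wrapper is a hypothetical helper; in `train.py` one would pass its result to `sys.exit`):

```python
# Sketch: convert unhandled exceptions in the training entry point into a
# logged failure and a non-zero exit code that the platform can detect.
import logging


def run(job):
    """Run a zero-argument callable; return 0 on success, 1 on failure."""
    try:
        job()
        return 0
    except Exception:
        logging.exception("training job failed")
        return 1

# In train.py one would end with: sys.exit(run(main))
```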
12. Example Workflow: End-to-End
Suppose a team develops a PyTorch-based image classification model that requires custom C++ dependencies and a particular version of torchvision not available in managed environments.
Steps:
1. Write Dockerfile: Install required system libraries, PyTorch, and custom dependencies.
2. Prepare Code: Ensure `train.py` reads from GCS and writes outputs to GCS, accepts hyperparameters as arguments.
3. Build and Push Image: Build the image locally and push to GCR.
4. Submit Training Job: Use `gcloud ai custom-jobs create` with the custom image, specifying data and output directories.
5. Monitor Progress: Use Google Cloud Console and Cloud Logging for job monitoring.
6. Model Serving: Write a Flask app exposing a `/predict` endpoint (listening on port 8080), package it in a Docker image, push to GCR, and deploy via Vertex AI endpoints.
13. Troubleshooting
– Job Fails to Start: Check logs for errors in Dockerfile or entry point.
– Container Not Found: Ensure correct image URI and that the image is in a registry accessible to Vertex AI.
– Port Not Exposed: For prediction, the container must listen on port 8080.
– Data Access Issues: Verify GCS paths and service account permissions.
14. Further Reading
– [Vertex AI Custom Containers Documentation](https://cloud.google.com/vertex-ai/docs/training/custom-containers-training)
– [Docker Reference](https://docs.docker.com/engine/reference/builder/)
– [gcloud AI Platform Commands](https://cloud.google.com/sdk/gcloud/reference/ai/)