Google Cloud AI Platform (whose functionality now lives in Vertex AI) offers a comprehensive environment to build, train, and deploy machine learning models at scale on Google Cloud's robust infrastructure. Through the GUI of the Google Cloud Console, users can orchestrate model development workflows without needing to interact directly with command-line tools. The step-by-step tutorial below demonstrates how to train and deploy a simple AI model, specifically a neural network for classification, using the graphical interface, highlighting best practices and providing didactic value throughout.
Prerequisites
Before proceeding, ensure you have:
1. A Google Cloud Platform (GCP) account with billing enabled.
2. Adequate permissions (such as Project Editor or Owner) to use AI Platform services.
3. A Cloud Storage bucket in your GCP project to store data and models.
4. The AI Platform, Compute Engine, and Cloud Storage APIs enabled for your project.
Step 1: Prepare Your Data
The quality and format of your data significantly influence model performance. For demonstration, consider the well-known Iris dataset, a simple multi-class classification problem.
1. Obtain the Dataset
– Download the Iris dataset in CSV format from a reputable source (such as UCI Machine Learning Repository).
2. Upload Data to Cloud Storage
– Log into the GCP Console.
– Navigate to "Cloud Storage" > "Buckets" (labelled "Storage" > "Browser" in older versions of the console).
– Click "Create Bucket" if you don't have one, or select an existing bucket.
– Click "Upload files" and select your CSV file.
3. Check Data Schema
– Use the "Preview" tab in Cloud Storage to visualize your CSV and verify the integrity of your data.
Didactic Note: Storing input data in Cloud Storage is a standard practice in distributed cloud training. It decouples data and compute resources, enabling seamless access from multiple training workers.
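The same schema check can be reproduced locally with pandas before uploading; a minimal sketch, assuming a CSV with four numeric feature columns followed by a string `species` label (the column names are illustrative):

```python
import io

import pandas as pd

# A few sample rows standing in for the downloaded iris.csv
# (sepal/petal measurements plus a string species label).
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)

df = pd.read_csv(sample_csv)

# Verify the schema the training script will expect:
# four numeric feature columns followed by one label column.
assert df.shape[1] == 5
assert df.iloc[:, :-1].dtypes.apply(pd.api.types.is_numeric_dtype).all()
assert df['species'].notna().all()
print(df.dtypes)
```

Catching a stray header row or a non-numeric feature column here is far cheaper than debugging a failed training job later.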
Step 2: Create a Training Application
For distributed training or custom models, AI Platform expects your training code in Python, typically packaged as a Python module. For simpler use cases it also offers AutoML and pre-built training containers, which require little or no code. This tutorial focuses on using a custom Python script for maximum flexibility.
1. Write Your Model Code
– Create a Python script (e.g., `trainer/task.py`) that:
– Loads the dataset from Cloud Storage.
– Preprocesses the data.
– Defines a simple Keras Sequential model for classification.
– Trains the model and saves the output to Cloud Storage.
– Example code excerpt (simplified):
```python
import argparse
import os

import pandas as pd
from tensorflow import keras


def load_data(file_path):
    # pandas can read gs:// paths directly when gcsfs is installed.
    return pd.read_csv(file_path)


def create_model(input_shape):
    model = keras.Sequential([
        keras.layers.Dense(10, activation='relu', input_shape=(input_shape,)),
        keras.layers.Dense(3, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-path', required=True,
                        help='Cloud Storage path to the training CSV')
    args = parser.parse_args()

    # Vertex AI sets AIP_MODEL_DIR to the Cloud Storage directory
    # where trained model artifacts should be written.
    output_dir = os.environ['AIP_MODEL_DIR']

    df = load_data(args.input_path)
    X = df.iloc[:, :-1].values
    # Encode string class labels (e.g., Iris species) as integer codes,
    # as required by sparse_categorical_crossentropy.
    y = pd.Categorical(df.iloc[:, -1]).codes

    model = create_model(X.shape[1])
    model.fit(X, y, epochs=10)
    model.save(output_dir)


if __name__ == '__main__':
    main()
```
– Note: For production or larger datasets, incorporate best practices such as shuffling, train-test split, and data normalization.
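Those practices can be sketched in a few lines of NumPy; the array sizes below mirror the Iris data (150 rows, 4 features, 3 classes) but the values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy stand-in for the Iris features/labels: 150 rows, 4 features, 3 classes.
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)

# 1. Shuffle: permute features and labels together.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# 2. Train/test split (80/20).
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# 3. Normalize using statistics from the *training* split only,
#    to avoid leaking test-set information into evaluation.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

Computing the normalization statistics on the training split alone is the detail most often missed; applying them unchanged to the test split keeps the evaluation honest.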
2. Package the Code
– Structure your code directory as follows:
trainer/
    __init__.py
    task.py
– Compress the `trainer` directory into a `.tar.gz` file for uploading.
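The archive step can also be scripted with Python's standard library; a minimal sketch that builds a throwaway `trainer/` package and compresses it (all paths here are illustrative):

```python
import os
import tarfile
import tempfile

# Build a throwaway trainer/ package to demonstrate the archive step;
# in practice you would point this at your real source directory.
workdir = tempfile.mkdtemp()
pkg_dir = os.path.join(workdir, 'trainer')
os.makedirs(pkg_dir)
for name in ('__init__.py', 'task.py'):
    open(os.path.join(pkg_dir, name), 'w').close()

archive = os.path.join(workdir, 'trainer.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    # arcname keeps the paths inside the archive relative (trainer/...).
    tar.add(pkg_dir, arcname='trainer')

with tarfile.open(archive) as tar:
    print(sorted(tar.getnames()))
    # ['trainer', 'trainer/__init__.py', 'trainer/task.py']
```

Note that Vertex AI's pre-built training containers generally expect a pip-installable source distribution (for example one built with setuptools), so check the current packaging requirements in the documentation before uploading.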
3. Upload Code to Cloud Storage
– Navigate to your storage bucket.
– Upload the `.tar.gz` file containing your training code.
Didactic Note: Packaging code enables reproducibility and version control, both of which are important in collaborative and distributed ML workflows.
Step 3: Create a Training Job via GCP Console
1. Navigate to Vertex AI
– In the GCP Console, go to "Vertex AI" > "Training".
2. Start a New Training Job
– Click "Create" to start a new training job.
– Choose "Custom training".
3. Configure the Training Job
– Display Name: Enter a recognizable name for your job.
– Region: Select a region close to your data location for performance and cost-efficiency.
– Python Package Location: Enter the Cloud Storage path to your `.tar.gz` file (e.g., `gs://your-bucket/trainer.tar.gz`).
– Python Module Name: Specify the module entry point (e.g., `trainer.task`).
4. Specify Training Container
– Select "TensorFlow" as the framework (e.g., TensorFlow 2.8) if your code relies on it.
– The system auto-fills compatible Docker containers.
5. Set Input Arguments and Hyperparameters (Optional)
– You may add arguments for hyperparameters, paths, or other runtime variables.
– Example: `--input-path=gs://your-bucket/iris.csv --output-dir=gs://your-bucket/model-output`
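These flags arrive in the training container as ordinary command-line arguments; a minimal sketch of parsing them with `argparse` (the flag names follow the example above, and `--epochs` is an illustrative addition):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-path', required=True,
                        help='Cloud Storage path to the training CSV')
    parser.add_argument('--output-dir', required=True,
                        help='Cloud Storage path for model artifacts')
    parser.add_argument('--epochs', type=int, default=10)
    return parser.parse_args(argv)

# Simulate the argument list Vertex AI would pass to the container.
args = parse_args([
    '--input-path=gs://your-bucket/iris.csv',
    '--output-dir=gs://your-bucket/model-output',
])
print(args.input_path, args.epochs)  # gs://your-bucket/iris.csv 10
```

Note that argparse converts the dashes in flag names to underscores, so `--input-path` becomes `args.input_path` inside the script.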
6. Configure Compute Resources
– Machine Type: For simple models, `n1-standard-4` is sufficient.
– Accelerator: None required unless training deep or complex models.
– Worker Pool Size: Set to 1 for single-node training, or more for distributed training (see distributed section below).
7. Output Model Directory
– Specify a Cloud Storage path for the trained model artifacts (e.g., `gs://your-bucket/model-output`).
8. Create and Run the Job
– Click "Create" to start the job. Monitor progress in the Console UI under the "Training jobs" tab.
Didactic Note: Using the GUI abstracts away command line complexity, making the workflow more accessible to new practitioners and those focused on prototyping.
Step 4: (Optional) Enable Distributed Training
For larger datasets or deep neural networks, training can be distributed across multiple machines.
1. In the "Training job" configuration, locate the "Worker pool configuration."
2. Add additional worker pools:
– Chief: 1 (main node)
– Workers: Set number based on dataset/model size.
– Parameter servers: Used for model parameter coordination, relevant for distributed TensorFlow jobs.
3. For each pool, specify machine type and Docker image (should match your framework and version).
4. Ensure your code supports distributed training (e.g., using `tf.distribute.Strategy`).
Didactic Note: Distributed training can significantly reduce training time for large-scale problems. It introduces considerations such as data sharding, synchronization, and network overhead. For simple models and datasets, single-node training suffices.
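One of those considerations, data sharding, can be illustrated without any framework: each worker reads only its own slice of the dataset. A minimal sketch of round-robin sharding, the same assignment `tf.data.Dataset.shard(num_workers, worker_index)` performs:

```python
def shard_indices(num_examples, num_workers, worker_index):
    """Return the example indices assigned to one worker:
    every num_workers-th example, offset by the worker's index."""
    return list(range(worker_index, num_examples, num_workers))

# 10 examples split across 3 workers: every example is assigned
# to exactly one worker, so no data is duplicated or dropped.
shards = [shard_indices(10, 3, w) for w in range(3)]
print(shards)
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
assert sorted(i for s in shards for i in s) == list(range(10))
```

With a single worker the shard is simply the whole dataset, which is why the same training code can run unchanged in both single-node and distributed configurations.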
Step 5: Deploy the Model via the GCP Console GUI
Once training completes, the model artifacts are available in your specified Cloud Storage bucket. Next, deploy the model for online prediction.
1. Register the Model
– In "Vertex AI" > "Models", click "Upload Model".
– Select "From trained model artifacts".
– Specify the Cloud Storage path to your saved model directory (e.g., `gs://your-bucket/model-output`).
2. Model Framework and Format
– Specify the framework (e.g., TensorFlow) and version.
– Choose the pre-built serving container matching your artifact format (TensorFlow SavedModel, scikit-learn pickle, etc.).
3. Model Display Name
– Enter a unique display name for your model.
4. Region
– Choose the region matching your training and storage location.
5. Create the Model
– Click "Create" and wait for the registration to complete.
6. Deploy to an Endpoint
– Once the model is registered, click "Deploy to endpoint."
– Create a new endpoint or select an existing one.
– Configure traffic splitting if deploying multiple versions.
– Machine Type: Select an appropriate instance type for serving; for small models, `n1-standard-2` is sufficient.
– Minimum/Maximum Replicas: Set scaling parameters based on expected request volume.
– Optionally enable GPU/TPU acceleration for faster inference.
7. Deploy
– Click "Deploy". The deployment process may take several minutes.
Didactic Note: Separating model registration and deployment allows robust version control and A/B testing, supporting MLOps best practices.
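The replica bounds can be estimated with simple arithmetic; a hedged sketch in which the per-replica throughput figure is a made-up assumption (measure your own model's latency before relying on it):

```python
import math

def replicas_needed(peak_qps, qps_per_replica, min_replicas=1, max_replicas=10):
    """Estimate how many serving replicas a given peak load requires,
    clamped to the configured autoscaling bounds."""
    needed = math.ceil(peak_qps / qps_per_replica)
    return max(min_replicas, min(needed, max_replicas))

# Assumption: one n1-standard-2 replica sustains ~50 predictions/sec
# for a small model (an illustrative figure, not a benchmark).
print(replicas_needed(peak_qps=180, qps_per_replica=50))  # 4
print(replicas_needed(peak_qps=10, qps_per_replica=50))   # 1
```

The minimum bound keeps the endpoint warm during idle periods, while the maximum bound caps cost if traffic spikes unexpectedly.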
Step 6: Test the Deployed Model
1. Prepare Test Data
– Format input data as required by the model (e.g., as a JSON object with the same features as the training data).
2. Use the Console for Testing
– In Vertex AI > Endpoints, select your deployed endpoint.
– Click "Test Endpoint".
– Paste your test data into the request body.
– Click "Send Request" and observe the prediction results.
Didactic Note: Testing via the GUI facilitates quick validation before integrating the model into production applications via REST API or client libraries.
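For reference, the JSON body expected by a Vertex AI prediction endpoint wraps the inputs in an `instances` array; a minimal sketch building one for the four Iris features (the feature values are illustrative):

```python
import json

# One prediction instance per row; each row lists the four Iris
# features in the same order the model was trained on.
request_body = {
    "instances": [
        [5.1, 3.5, 1.4, 0.2],
        [6.7, 3.0, 5.2, 2.3],
    ]
}

payload = json.dumps(request_body)
print(payload)
```

The same payload pasted into the "Test Endpoint" request body, or sent via the REST API or client libraries, should return one prediction (a vector of three class probabilities) per instance.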
Step 7: Monitor and Manage Model Performance
1. View Predictions and Logs
– Access logs from the endpoint for prediction requests, latency, and errors.
– Use "Monitoring" in Vertex AI to set up alerts and track resource utilization.
2. Update or Retrain the Model
– When improved data or model versions are available, repeat the training and deployment process.
– Update the endpoint to direct traffic to the new model version without downtime.
Step 8: Clean Up Resources
To avoid unnecessary charges:
1. Delete unused models, endpoints, and training jobs from Vertex AI.
2. Remove large files from Cloud Storage buckets if no longer needed.
3. Release any reserved IP addresses or compute resources.
Examples and Didactic Value
– Example 1: Single-Node Training with Iris Dataset
– Training a Keras model with the Iris dataset via the GUI demonstrates the principles of cloud-based machine learning, including data storage decoupling, module packaging, and reproducible training.
– Example 2: Multi-Worker Distributed Training
– When scaling to large tabular datasets, configuring multiple worker pools via the GUI enables distributed data and model parallelism. This introduces students to advanced ML engineering concepts such as synchronization barriers and parameter servers.
The GUI-driven workflow on Google Cloud AI Platform is pedagogically valuable as it:
– Lowers barriers to entry for beginners by providing a visual, stepwise process.
– Illustrates the separation between data, model, and infrastructure.
– Reinforces the iterative nature of model development—preparing data, training, evaluating, deploying, monitoring, and retraining.
– Demonstrates industry-standard practices for MLOps, including version control, monitoring, and scalable deployment.
Common Pitfalls and Troubleshooting
– Permissions Errors: Ensure your service account has the necessary permissions to access Cloud Storage, AI Platform, and Compute Engine.
– Resource Limits: If training jobs fail due to quota issues, check your project’s quotas and request increases if necessary.
– Data Format Mismatch: Always verify the CSV schema and preprocessing steps to match model input expectations.
– Model Not Deploying: Ensure the model is saved in a format compatible with Vertex AI (e.g., TensorFlow SavedModel).
Advanced Topics for Exploration
– Hyperparameter Tuning: Use Vertex AI’s built-in hyperparameter tuning feature to automate optimal parameter search.
– Pipelines: Orchestrate multi-step ML workflows for reproducibility and automation.
– Model Monitoring: Set up continuous evaluation and drift detection for models in production.
This step-by-step approach through the GUI builds foundational competence in cloud-based machine learning, preparing users for more sophisticated, automated, or code-driven workflows as their expertise grows.