What is the simplest, step-by-step procedure to practice distributed AI model training in Google Cloud?

by EITCA Academy / Sunday, 11 May 2025 / Published in Artificial Intelligence, EITC/AI/GCML Google Cloud Machine Learning, Further steps in Machine Learning, Distributed training in the cloud

Distributed training is an advanced technique in machine learning that uses multiple computing resources to train large models more efficiently and at greater scale. Google Cloud Platform (GCP) provides robust support for distributed model training, particularly through Vertex AI (formerly AI Platform), Compute Engine, and Google Kubernetes Engine, with support for popular frameworks such as TensorFlow and PyTorch. Below is a comprehensive, step-by-step procedure for practicing distributed AI model training in Google Cloud, covering both the practical steps and the underlying concepts.

1. Understanding Distributed Training Paradigms

Distributed training generally falls into two primary paradigms:
– Data Parallelism: The dataset is split among multiple replicas of the model, each processing a subset of data, with periodic synchronization of weights.
– Model Parallelism: The model itself is split across different computing nodes, suitable for extremely large models that cannot fit into a single device's memory.

Most introductory distributed training exercises in the cloud employ data parallelism due to its relative simplicity and wide framework support.
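Stripped of any framework, the data-parallel update can be sketched in a few lines of plain Python. This is a conceptual illustration only, not a GCP or TensorFlow API; the toy model, learning rate, and data are all made up:

```python
# Framework-free sketch of data parallelism: each "worker" computes
# gradients on its own shard of the batch; the gradients are then
# averaged (the all-reduce step) and every replica applies the same
# update, so all model copies stay synchronized.

def gradient(w, shard):
    # d/dw of mean squared error for the toy model y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, batch, num_workers, lr=0.01):
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    grads = [gradient(w, s) for s in shards]   # run in parallel in reality
    avg_grad = sum(grads) / num_workers        # all-reduce (averaging)
    return w - lr * avg_grad                   # identical update everywhere

# Toy data for the target function y = 3x
data = [(x, 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = train_step(w, data, num_workers=4)
print(round(w, 2))  # → 3.0
```

Because every replica applies the same averaged gradient, the result is mathematically equivalent to training on the full batch on one machine, which is why data parallelism is the default starting point.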

2. Preparing Your Environment

Before proceeding, ensure you have:
– A Google Cloud account with billing enabled.
– The Google Cloud SDK (gcloud CLI) installed and authenticated on your local machine.
– Permissions to access and create resources in your Google Cloud project.

3. Setting Up Google Cloud Storage

Distributed training requires that data and model artifacts be accessible to all training nodes. Cloud Storage provides a unified, high-performance storage layer.

Steps:
– Create a Cloud Storage bucket:

```sh
gsutil mb gs://your-bucket-name
```

– Upload your dataset and, optionally, your model code:

```sh
gsutil cp local-data-path gs://your-bucket-name/data/
gsutil cp local-model-code-path gs://your-bucket-name/code/
```

4. Choosing the Right Compute Infrastructure

The main GCP options for distributed training are:
– Vertex AI (formerly AI Platform): Managed service for ML workflows, supporting distributed training with minimal setup.
– Compute Engine VM Instances: Allows custom environments for more control.
– Google Kubernetes Engine (GKE): Container orchestration for complex workflows.

Vertex AI is recommended for most users due to its managed nature, ease of use, and integration with other Google Cloud services.

5. Preparing the Training Code for Distribution

Frameworks like TensorFlow and PyTorch offer APIs for distributed training:
– TensorFlow: `tf.distribute.Strategy` API. For multi-worker distributed training, use `tf.distribute.MultiWorkerMirroredStrategy`.
– PyTorch: `torch.nn.parallel.DistributedDataParallel` and `torch.distributed.launch`.

Example for TensorFlow:

```python
import tensorflow as tf

# Every worker runs this same script; the strategy handles coordination.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def build_and_compile_model():
    # Illustrative architecture; substitute your own layers.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])
    return model

# Model variables must be created inside the strategy's scope.
with strategy.scope():
    model = build_and_compile_model()

# train_dataset is a tf.data.Dataset batched with the global batch size;
# NUM_EPOCHS is defined elsewhere in your script.
model.fit(train_dataset, epochs=NUM_EPOCHS)
```

You must also ensure that your code can read data from Cloud Storage, e.g., using TensorFlow I/O or GCS Python libraries.
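To see why consistent data access matters, the splitting behaviour of `tf.data.Dataset.shard(num_workers, index)` can be mimicked in plain Python; the bucket and file names below are placeholders:

```python
# Sketch of how shard(num_workers, index) splits a file list so each
# worker reads a disjoint subset of the gs:// objects: worker i takes
# every num_workers-th file.
files = [f"gs://your-bucket-name/data/train-{i:05d}.tfrecord"
         for i in range(8)]

def shard(file_list, num_workers, worker_index):
    # Same stride-based split that Dataset.shard applies.
    return file_list[worker_index::num_workers]

per_worker = [shard(files, 4, i) for i in range(4)]
print([len(s) for s in per_worker])  # → [2, 2, 2, 2]
```

The shards are disjoint and together cover the full dataset, which is exactly the property multi-worker input pipelines rely on.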

6. Packaging Your Training Application

To run on Google Cloud, your code should be packaged with a `setup.py` (if running as a Python package) or as a Docker container (for more portability). For Vertex AI, a Python package is sufficient for standard jobs.

Directory structure:

```
your_training_app/
├── trainer/
│   ├── __init__.py
│   └── task.py
└── setup.py
```

Sample `setup.py`:

```python
from setuptools import find_packages
from setuptools import setup

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    install_requires=['tensorflow==2.11.0'],
    entry_points={
        'console_scripts': [
            'task = trainer.task:main',
        ],
    },
)
```

7. Configuring Distributed Training on Vertex AI

Vertex AI allows you to specify the number and type of worker and parameter server instances for distributed jobs.

– Chief: The main worker responsible for orchestration.
– Worker(s): Additional replicas that execute training steps.
– Parameter server(s): Nodes holding model parameters (for certain distributed strategies).

Submit a distributed training job with the following command:

```sh
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=distributed-training-job \
  --python-package-uris=gs://your-bucket-name/code/trainer-0.1.tar.gz \
  --python-module=trainer.task \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,executor-image-uri=gcr.io/cloud-aiplatform/training/tf-cpu.2-11:latest \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=3,executor-image-uri=gcr.io/cloud-aiplatform/training/tf-cpu.2-11:latest
```
This launches a distributed job with one chief and three workers.

8. Monitoring Training and Retrieving Results

Vertex AI provides a web interface for monitoring job status, viewing logs, and examining resource utilization. Logs and model artifacts can be written to Cloud Storage for easy retrieval.

– Monitor logs:
  – Vertex AI Console: Navigate to your project and open the job's details.
  – Command line: `gcloud ai custom-jobs describe JOB_ID`
– Retrieve model artifacts:
  – Models are typically saved to a Cloud Storage bucket specified in your code, e.g., `gs://your-bucket-name/models/model_name/`.

9. Autoscaling and Hyperparameter Tuning

Distributed training can be combined with hyperparameter tuning using Vertex AI’s hyperparameter tuning service. You define the search space and Vertex AI launches multiple distributed jobs with different parameters.

Example configuration (this YAML follows the legacy AI Platform `trainingInput` schema, which predates Vertex AI but illustrates the same concepts):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-4
  workerType: n1-standard-4
  parameterServerType: n1-standard-4
  workerCount: 3
  parameterServerCount: 2
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 10
    maxParallelTrials: 2
    hyperparameterMetricTag: accuracy
    params:
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.001
        maxValue: 0.1
```

10. Example Workflow for a Distributed TensorFlow Job

Let us illustrate the above steps with a practical example: distributed training of an image classification model using TensorFlow on Vertex AI.

A. Prepare the dataset
– Assume an image dataset is stored in `gs://your-bucket-name/data/`.

B. Write distributed training code
– Use `tf.distribute.MultiWorkerMirroredStrategy`.
– Set TensorFlow’s `TF_CONFIG` environment variable for multi-worker coordination. On Vertex AI, this is handled automatically.
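For reference, `TF_CONFIG` is a JSON document describing the cluster and the current task. The sketch below builds one by hand to show its shape; the hostnames are made up, and on Vertex AI you never set this yourself:

```python
import json
import os

# Shape of the TF_CONFIG variable that MultiWorkerMirroredStrategy reads.
# "cluster" lists every replica's address; "task" identifies which entry
# in that list the current process is.
tf_config = {
    "cluster": {
        "worker": [
            "worker-0.example:2222", "worker-1.example:2222",
            "worker-2.example:2222", "worker-3.example:2222",
        ],
    },
    "task": {"type": "worker", "index": 0},  # differs on each replica
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

parsed = json.loads(os.environ["TF_CONFIG"])
print(len(parsed["cluster"]["worker"]))  # → 4
```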

C. Package and upload the application
– Build the package:

```sh
python setup.py sdist
```

– Upload to Cloud Storage:

```sh
gsutil cp dist/trainer-0.1.tar.gz gs://your-bucket-name/code/
```

D. Submit the distributed job
– Use the `gcloud` command as above, or configure via the Vertex AI Console.

E. Monitor and retrieve results
– Check progress in the Vertex AI Console.
– Download the trained model from the specified Cloud Storage location for evaluation or deployment.

11. Further Considerations

– Networking: Distributed training requires communication between nodes. Vertex AI handles networking, but when using custom infrastructure (e.g., GKE), you must configure firewalls and networking appropriately.
– GPU/TPU Support: GCP supports distributed training on GPU and TPU nodes. Specify the appropriate machine types and images to leverage these accelerators.
– Custom Containers: For advanced use cases, package your code and dependencies as Docker containers and submit custom jobs to Vertex AI.
– Resource Management: Monitor costs and utilization, as distributed jobs can incur significant resource consumption.
– Fault Tolerance: Distributed frameworks often support checkpointing and recovery. Ensure your code saves checkpoints to Cloud Storage.
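The fault-tolerance point can be made concrete with a minimal checkpoint-and-resume sketch. In a real job the checkpoint would go to Cloud Storage (e.g., via `tf.keras.callbacks.BackupAndRestore` pointed at a `gs://` path); here a plain dict stands in for the storage layer and the update rule is a placeholder:

```python
# Minimal checkpoint-and-resume sketch: state is saved periodically so a
# restarted worker continues from the last checkpoint instead of step 0.
checkpoint_store = {}

def save_checkpoint(step, weights):
    checkpoint_store["latest"] = {"step": step, "weights": list(weights)}

def restore_checkpoint():
    return checkpoint_store.get("latest", {"step": 0, "weights": [0.0]})

def train(total_steps, checkpoint_every=10):
    state = restore_checkpoint()          # resume if a checkpoint exists
    step, weights = state["step"], list(state["weights"])
    while step < total_steps:
        weights[0] += 0.1                 # stand-in for one training update
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(step, weights)
    return step, weights

train(25)                  # "crashes" after 25 of 50 steps; checkpoints at 10 and 20
step, weights = train(50)  # restart resumes from step 20, not step 0
print(step)  # → 50
```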

12. Example of a Distributed PyTorch Job Using GKE

For more control or when working outside Vertex AI, GKE can be used with Kubernetes-native tools such as Kubeflow.

– Step 1: Containerize your PyTorch application.
– Step 2: Push the container image to Google Container Registry (GCR).
– Step 3: Define a Kubernetes manifest for a distributed PyTorch job (using e.g., Kubeflow PyTorchJob CRD).
– Step 4: Submit the job to your GKE cluster.
– Step 5: Monitor using Kubernetes and GCP tools.
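As a rough sketch of Step 3 (assuming the Kubeflow training operator is installed on the cluster; image names, arguments, and replica counts are placeholders), a PyTorchJob manifest for one master and three workers might look like:

```yaml
# Hedged sketch of a Kubeflow PyTorchJob manifest (kubeflow.org/v1 CRD).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-pytorch-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # the operator expects this container name
              image: gcr.io/your-project/pytorch-trainer:latest
              args: ["--epochs", "10"]
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/your-project/pytorch-trainer:latest
              args: ["--epochs", "10"]
```

The operator injects the rendezvous environment variables (such as `MASTER_ADDR` and `RANK`) that `torch.distributed` needs, so the container code can simply call the standard initialization routines.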

13. Best Practices

– Use managed services like Vertex AI for ease, reliability, and scalability.
– Prefer data parallelism for most practical use cases.
– Store datasets and artifacts in Cloud Storage for unified access.
– Monitor job metrics and logs to identify bottlenecks.
– Use pre-built containers/images for supported frameworks to avoid dependency issues.
– Clean up unused resources to avoid unnecessary costs.

14. Common Pitfalls and Troubleshooting

– Ensure all training nodes can access Cloud Storage and required data.
– Match software versions (e.g., TensorFlow, CUDA, cuDNN) across all nodes.
– Watch for out-of-memory errors; adjust batch sizes and model architectures accordingly.
– Check Google Cloud IAM permissions for storage, compute, and AI Platform APIs.
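On the batch-size point: with a mirrored strategy, the batch passed to `model.fit` is the global batch, which `tf.distribute` splits evenly across replicas, so per-replica memory pressure scales as global batch divided by replica count. A quick sanity check on that arithmetic (the numbers are illustrative):

```python
# Per-replica memory use scales with global_batch / num_replicas, so a
# global batch that does not divide evenly is a common source of errors.
def per_replica_batch(global_batch, num_replicas):
    if global_batch % num_replicas != 0:
        raise ValueError("choose a global batch divisible by the replica count")
    return global_batch // num_replicas

print(per_replica_batch(256, 4))  # → 64
```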

15. Documentation and Learning Resources

– [Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs)
– [Distributed TensorFlow Guide](https://www.tensorflow.org/guide/distributed_training)
– [Distributed PyTorch Documentation](https://pytorch.org/tutorials/intermediate/dist_tuto.html)
– [Google Cloud Storage Documentation](https://cloud.google.com/storage/docs)
– [Kubeflow on GKE](https://www.kubeflow.org/docs/gke/)

This structured procedure provides a didactic roadmap for practicing distributed model training on Google Cloud, from setup to execution, monitoring, and beyond. With foundational understanding and careful attention to the steps outlined, practitioners can effectively leverage Google Cloud’s infrastructure to accelerate and scale their machine learning workflows.
