How to label data that should not affect model training (e.g., important only for humans)?

by Michał Otoka / Monday, 29 September 2025 / Published in Artificial Intelligence, EITC/AI/GCML Google Cloud Machine Learning, Google Cloud AI Platform, Cloud AI Data labeling service

When preparing datasets for supervised machine learning tasks on the Google Cloud AI Platform, it is common to encounter metadata or annotations that serve informational or organizational purposes for human users but are not intended to influence the training process of a machine learning model. Properly managing these data points is important to prevent unintentional data leakage, maintain reproducibility, and ensure the clarity of the dataset for both machine and human consumers.

Understanding Data Labeling Distinctions

In the context of Google Cloud AI Data Labeling Service, data labeling typically refers to the process of assigning ground truth annotations to data instances (such as images, text, or audio) that will be used as targets during model training or evaluation. These annotations might include class labels, bounding boxes, segmentation masks, or transcriptions, depending on the task.

However, datasets often contain additional fields or metadata that provide informational value. Examples include:

– Reviewer comments or notes
– Quality control flags
– Human-readable descriptions or rationales
– Annotator or reviewer identifiers
– Timestamps of labeling events
– Confidence scores entered by a human, not a model
– Internal tags for workflow management

While these attributes can help with data management, auditing, or interpretability, they are not intended as features or targets for the machine learning model.

Strategies for Labeling Non-Training Data

To ensure that certain annotations do not influence model training when using Google Cloud AI Platform, several strategies can be employed during dataset preparation, storage, and ingestion.

1. Schema Design and Separation

Design the dataset schema such that non-training fields are clearly separated from features and labels. For example, if using CSV, JSONL, or TFRecord formats to store data, group fields into:

– Features: Used as input to the model
– Labels: Used as ground truth for supervised training
– Metadata: Used for human reference or workflow

Example (JSONL format):

{
  "image_uri": "gs://bucket/path/image1.jpg",
  "label": "cat",
  "annotator_comment": "Blurry image, but still recognizable.",
  "reviewer_id": "user_23",
  "created_at": "2023-04-12T10:23:34Z"
}

In this example, only `image_uri` and `label` would be used for model training. `annotator_comment`, `reviewer_id`, and `created_at` are metadata fields for human consumption.

2. Use of Custom Annotation Fields

Within Google Cloud AI Data Labeling Service, custom annotation fields can be defined. These fields can be marked for internal use and excluded from the exported training dataset. For instance, the service allows the creation of task-specific instructions and custom attributes that assist labelers but are not exported to the final dataset schema used for model training.

3. Explicit Exclusion During Export

When exporting labeled datasets from Google Cloud AI Data Labeling Service, configure export settings to exclude metadata fields not intended for training. The platform allows selection of specific annotation fields to include in the exported dataset, enabling a clean separation between training data and auxiliary information.

4. Data Ingestion Pipeline Filtering

In the data ingestion pipeline that feeds data to the training process (e.g., Dataflow, Apache Beam, custom Python scripts), apply explicit filtering to ensure only relevant fields are passed to the training job. This can be done by specifying which columns to read from the dataset, or by transforming the dataset to a format (e.g., TensorFlow Examples, CSV with only feature/label columns) that omits metadata.

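The filtering step can be sketched in plain Python, assuming newline-delimited JSON records shaped like the JSONL example above (here `image_uri` and `label` are taken to be the only training fields; all other keys are treated as metadata):

```python
import json

# Fields allowed to reach the training job; everything else
# (comments, reviewer IDs, timestamps) is treated as metadata.
TRAINING_FIELDS = {"image_uri", "label"}

def to_training_record(record: dict) -> dict:
    """Keep only the feature/label fields of a labeled record."""
    return {k: v for k, v in record.items() if k in TRAINING_FIELDS}

def filter_jsonl(lines):
    """Yield training-ready records from raw JSONL lines, skipping blanks."""
    for line in lines:
        if line.strip():
            yield to_training_record(json.loads(line))

raw = '{"image_uri": "gs://bucket/path/image1.jpg", "label": "cat", "reviewer_id": "user_23"}'
print(list(filter_jsonl([raw])))
# → [{'image_uri': 'gs://bucket/path/image1.jpg', 'label': 'cat'}]
```

The same projection can be applied inside a Dataflow/Apache Beam transform or any custom ingestion script; the essential point is that the allow-list of training fields is defined in one place and enforced before data reaches the training job.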
5. Documentation and Data Contracts

Maintain clear data documentation or data contracts specifying which fields are used for model training and which are for informational purposes only. This helps both current and future stakeholders understand the intended use of each field, minimizing the risk of unintentionally including irrelevant data in the training process.

Use Cases and Examples

Consider an image classification task where labelers are asked to classify images as "cat" or "dog" and to provide a comment on any ambiguities they encounter.

– The "label" field is the ground truth the model will learn to predict.
– The "comment" field is for audit and review, helping data scientists understand labeling challenges or ambiguities.
– The "annotator_id" field helps track who labeled each image for quality management.

If the "comment" or "annotator_id" fields are included as features during model training, the model might inadvertently learn patterns based on the annotator's behavior or comments, leading to data leakage and reduced generalization. By isolating these fields and ensuring that only the "label" is used as the target, and only relevant features (such as the image pixels) are used as model inputs, the integrity of the training process is preserved.

Preventing Data Leakage

Data leakage occurs when information that would not be available at prediction time is included in the training data, resulting in over-optimistic model performance during training and evaluation, but poor generalization in production. Including human-only fields (such as reviewer comments or internal tags) as model features is a common source of data leakage. This risk can be mitigated by:

– Rigorous data review processes before training
– Automated schema validation and pipeline checks
– Continuous education of the data engineering and data science teams about leakage risks
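One way to automate such a schema check is a fail-fast validation step in the pipeline. The sketch below assumes the training schema is a fixed set of column names (the field names are illustrative, matching the earlier examples):

```python
# Assumed training schema; adjust to the actual feature/label columns.
ALLOWED_TRAINING_COLUMNS = {"image_uri", "label"}

def validate_training_record(record: dict) -> None:
    """Raise if a record carries fields outside the training schema."""
    unexpected = set(record) - ALLOWED_TRAINING_COLUMNS
    if unexpected:
        raise ValueError(
            f"Non-training fields leaked into the pipeline: {sorted(unexpected)}"
        )

validate_training_record({"image_uri": "gs://bucket/img.jpg", "label": "dog"})  # passes
# validate_training_record({"label": "dog", "reviewer_id": "user_23"})  # raises ValueError
```

Running this check at the start of every training run turns silent leakage into an immediate, visible failure.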

Recommendations for Google Cloud AI Data Labeling Service

– Task Configuration: When setting up a labeling task, define which fields are to be used as labels for model training and which are for auxiliary purposes.
– Export Templates: Customize export templates to ensure only the relevant subset of fields is included in the dataset used for downstream training tasks.
– Access Control: Use Google Cloud IAM policies to restrict access to sensitive metadata fields as needed, especially if the metadata includes personally identifiable information (PII) or other sensitive content.
– Data Versioning: Version both the raw labeled data and the filtered training datasets to ensure reproducibility and traceability.

Storing and Tracking Metadata

While metadata fields should not be included in model training, they can be valuable for:

– Auditing: Tracking labeling quality, reviewing individual labeler performance, or investigating labeling inconsistencies.
– Workflow Management: Managing labeling progress, assigning tasks, or tracking review status.
– Error Analysis: Understanding model errors in the context of challenging labeling cases.

Google Cloud AI Platform supports storing such metadata fields in BigQuery tables or as part of the source files in Cloud Storage, separate from the datasets ingested into Vertex AI or AutoML services.
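Keeping metadata available for auditing while excluding it from training can be as simple as splitting each record into two destinations, e.g. one file for the training set and one table for metadata. A minimal sketch, using the illustrative field names from the earlier examples:

```python
# Assumed metadata fields; everything else is treated as training data.
METADATA_FIELDS = {"annotator_comment", "reviewer_id", "created_at"}

def split_record(record: dict) -> tuple[dict, dict]:
    """Return (training_row, metadata_row) for separate storage,
    e.g. Cloud Storage for training data and BigQuery for metadata."""
    training = {k: v for k, v in record.items() if k not in METADATA_FIELDS}
    metadata = {k: v for k, v in record.items() if k in METADATA_FIELDS}
    return training, metadata

row = {
    "image_uri": "gs://bucket/img1.jpg",
    "label": "cat",
    "reviewer_id": "user_23",
}
training, metadata = split_record(row)
print(training)  # {'image_uri': 'gs://bucket/img1.jpg', 'label': 'cat'}
print(metadata)  # {'reviewer_id': 'user_23'}
```

A shared key (such as `image_uri`) in both outputs lets auditors join metadata back to training rows later without ever exposing it to the model.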

Example: Cloud AI Data Labeling Service Annotation Export

A typical annotation export for an image classification task might look like the following JSON object:
{
  "input_gcs_uri": "gs://bucket/images/img1.jpg",
  "classification_annotations": [
    {
      "display_name": "cat"
    }
  ],
  "annotation_metadata": {
    "labeler_notes": "Blurry but likely a cat.",
    "created_by": "labeler_123",
    "timestamp": "2023-04-12T10:23:34Z"
  }
}

In this example, only the `classification_annotations` field is used as the ground truth label for training. The `annotation_metadata` object is kept for human reference and should be excluded from the training dataset.
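Converting such an export record into a minimal training example can be sketched as follows, dropping `annotation_metadata` and flattening the first classification annotation into a plain label (field names are taken from the export above; a multi-label task would need to keep all annotations):

```python
def export_to_training_example(export_record: dict) -> dict:
    """Convert a labeling-service export record into a minimal
    {image_uri, label} training example, discarding annotation_metadata."""
    annotations = export_record.get("classification_annotations", [])
    if not annotations:
        raise ValueError("Record has no classification annotation")
    return {
        "image_uri": export_record["input_gcs_uri"],
        "label": annotations[0]["display_name"],
    }

export = {
    "input_gcs_uri": "gs://bucket/images/img1.jpg",
    "classification_annotations": [{"display_name": "cat"}],
    "annotation_metadata": {"labeler_notes": "Blurry but likely a cat."},
}
print(export_to_training_example(export))
# → {'image_uri': 'gs://bucket/images/img1.jpg', 'label': 'cat'}
```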

Managing Data in Vertex AI

When using Vertex AI on Google Cloud, datasets are often registered within the platform, and schema management is handled explicitly. Vertex AI allows users to define which columns are used as features and which as labels. Metadata or auxiliary columns can be included in the dataset for reference, but must not be marked as features or labels in the model configuration.

Best Practices

1. Clearly Separate Training Data and Metadata: Maintain distinct storage and schema definitions for data intended for model consumption and for human-only fields.
2. Automate Filtering: Use automated tools or scripts to filter out non-training fields before ingesting data into the training pipeline.
3. Document Data Usage: Maintain comprehensive documentation for each dataset, explaining the role of each field.
4. Review and Validate Schema: Before each training run, validate the dataset schema to confirm that only the intended fields are included.
5. Enable Traceability: Keep raw data and metadata accessible for audit, but ensure only filtered data feeds into training.

Proper management of data labeling for fields not intended to affect model training is a key aspect of building robust machine learning pipelines on Google Cloud AI Platform. By designing clear data schemas, using explicit export and filtering mechanisms, and maintaining thorough documentation, it is possible to ensure that only valid training data influences the model, while still capturing valuable metadata for human use. Adhering to these practices helps prevent data leakage, supports reproducibility, and enhances both the trustworthiness and maintainability of machine learning workflows.

Other recent questions and answers regarding Cloud AI Data labeling service:

  • How does an AI data labeling service ensure that labelers are not biased?
  • In what way should data related to time series prediction be labeled, where the result is the last x elements in a given row?
  • What is the recommended approach for ramping up data labeling jobs to ensure the best results and efficient use of resources?
  • What security measures are in place to protect the data during the labeling process in the data labeling service?
  • How does the data labeling service ensure high labeling quality when multiple labelers are involved?
  • What are the different types of labeling tasks supported by the data labeling service for image, video, and text data?
  • What are the three core resources required to create a labeling task using the data labeling service?

More questions and answers:

  • Field: Artificial Intelligence
  • Programme: EITC/AI/GCML Google Cloud Machine Learning
  • Lesson: Google Cloud AI Platform
  • Topic: Cloud AI Data labeling service
Tagged under: Artificial Intelligence, Data Labeling, Data Leakage, Model Training, Schema Management, Vertex AI
