
What are the techniques for handling missing data? How do I realize I am missing data? Are there general references on pretraining treatment of data?

by Francesco Spanò / Sunday, 10 May 2026 / Published in Artificial Intelligence, EITC/AI/GCML Google Cloud Machine Learning, First steps in Machine Learning, The 7 steps of machine learning

Handling missing data effectively is a foundational part of preparing datasets for machine learning, because the quality and completeness of data directly influence model performance and the validity of predictions. Missing data can originate from various sources, including equipment malfunctions, human error, data corruption, or intentional omission. Techniques for handling missing values, methods for detecting missingness, and the available literature all belong to the data preprocessing workflow, usually placed in the "Data Preparation" (or "Data Cleaning") step of the canonical seven steps of machine learning.

Recognition and Detection of Missing Data

Before applying any technique to handle missing data, it is necessary to accurately identify where and how missingness occurs. This process typically involves:

1. Data Exploration and Profiling:
Conducting exploratory data analysis (EDA) is the first step. By examining summary statistics, shape, and structure of the dataset, one can identify variables with missing entries. In Python, the pandas methods `isnull()` (alias `isna()`), `info()`, and `describe()` are routinely used for this; in R, calling `summary()` on a data frame reports `NA` counts for numeric and factor columns.

2. Visualization:
Visualization techniques provide an intuitive understanding of the pattern and extent of missingness. Heatmaps (e.g., via `seaborn.heatmap`), bar plots, or dedicated missing value visualization packages (such as `missingno` in Python) are instrumental in revealing whether missing data are randomly distributed or exhibit systematic structure.

3. Statistical Tests:
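Statistical analysis can help assess the mechanism of missingness, which is usually classified as:
– MCAR (Missing Completely at Random): The probability that a value is missing is independent of both observed and unobserved data.
– MAR (Missing at Random): The probability of missingness depends only on observed data, not on the missing values themselves.
– MNAR (Missing Not at Random): The probability of missingness depends on the unobserved values themselves, even after accounting for observed data.
Little’s MCAR test, or a logistic regression of a missingness indicator on the observed variables, can provide evidence against MCAR. MAR and MNAR cannot be distinguished from the observed data alone, so that judgment generally rests on domain knowledge.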
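The profiling step described above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical toy table (the column names `age`, `income`, and `city` are assumptions, not part of any real dataset). The last step is only a rough check of whether missingness in one column varies with an observed variable; it is not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan, 38],
    "income": [32000, 45000, np.nan, 58000, 61000, np.nan],
    "city": ["Rome", "Rome", "Milan", None, "Milan", "Rome"],
})

# Count and share of missing entries per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()
print(pd.DataFrame({"count": missing_count, "share": missing_share.round(2)}))

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isna().any(axis=1).sum())
print("rows with any missing value:", rows_with_missing)

# Rough structure check: does the rate of missing 'income' differ
# across the observed values of 'city'? (Rows with unknown city are skipped.)
income_missing_by_city = df["income"].isna().groupby(df["city"]).mean()
print(income_missing_by_city)
```

A large difference between groups in the last output would suggest that the data are not MCAR, although a sample this small proves nothing by itself.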

Techniques for Handling Missing Data

Several strategies exist for dealing with missing data, each with its own assumptions, advantages, and trade-offs. The choice of method depends on the nature of the dataset, the proportion of missing data, the missingness mechanism, and the downstream machine learning model requirements.

1. Deletion Methods

– Listwise Deletion (Complete Case Analysis):
This approach involves removing entire records (rows) where any value is missing. It is straightforward and often implemented as a default in many tools. However, it is only appropriate when missingness is MCAR, as it can otherwise introduce bias and significantly reduce data size, leading to loss of statistical power.
*Example*: In a medical dataset with 10,000 patient records, if 2,000 have at least one missing value, listwise deletion would result in a working dataset of 8,000 patients.

– Pairwise Deletion:
Rather than removing entire rows, pairwise deletion uses all available data for each analysis. For example, pairwise correlations between variables are computed using all cases where both variables are observed. This preserves more data but can lead to inconsistencies in sample sizes across analyses.
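Both deletion strategies can be illustrated with pandas. The sketch below uses made-up values; it relies on the fact that `DataFrame.dropna()` performs listwise deletion and that `DataFrame.corr()` computes each correlation from the rows where both columns are observed.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [5.0, 4.0, 3.0, np.nan, 1.0],
})

# Listwise deletion: keep only rows without any missing value.
complete_cases = df.dropna()
print(len(df), "rows ->", len(complete_cases), "complete rows")

# Pairwise deletion: each correlation uses all rows where both columns are observed.
pairwise_corr = df.corr()
print(pairwise_corr)

# Number of rows actually used for each pair of columns.
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed
print(pairwise_n)
```

Here listwise deletion keeps two of five rows, while every pairwise correlation is based on three rows. The `pairwise_n` table makes the differing sample sizes explicit.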

2. Imputation Methods

– Mean/Median/Mode Imputation:
For numerical data, replacing missing values with the mean or median of the observed data is common, while categorical data often use the mode. This method is simple but can underestimate variability and distort relationships between variables.
*Example*: If the ‘age’ variable has missing values, one might replace them with the median age of the observed data.

– Constant Value Imputation:
Sometimes a special value (e.g., -999 or "Unknown") is used to indicate missingness, allowing models to treat these cases distinctly. However, this may introduce artificial outliers or bias if not handled appropriately.

– K-Nearest Neighbors (KNN) Imputation:
KNN imputation fills missing values by averaging the values of the k nearest data points, determined by similarity on other observed variables. This can preserve local data structure but may be computationally expensive on large datasets.

– Regression Imputation:
A regression model predicts the missing value based on other observed variables. For example, if income is missing, a regression using age, education, and occupation can estimate the missing income value. This method can reflect relationships in the data, but because imputed values fall exactly on the regression line, it tends to overstate correlations between variables and understate variance.

– Multiple Imputation:
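Multiple imputation creates several plausible imputed datasets by drawing values from a predictive distribution, analyzes each one, and then combines the results. This approach reflects the uncertainty inherent in the missing data and is widely considered a robust method for handling MAR scenarios. The `mice` package in R implements it directly. In scikit-learn, `IterativeImputer` is inspired by `mice` but returns a single imputation by default; several imputed datasets can be obtained by running it repeatedly with `sample_posterior=True` and different random seeds.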

– Model-Based Imputation:
More advanced models, such as Expectation-Maximization (EM) algorithms, probabilistic graphical models, or deep learning (e.g., autoencoders), can be used to infer missing values, especially in complex, high-dimensional data.
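Several of the imputation methods above are available in scikit-learn. The following sketch applies median, KNN, and iterative imputation to a tiny made-up array with a single missing entry. The data and the choice of five seeds are assumptions for illustration, and the repeated `IterativeImputer` runs only approximate the drawing step of multiple imputation, not the full procedure with pooled analyses.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Made-up data: the second feature is missing in the third row.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
])

# Median imputation: the gap is filled with the median of the observed column.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: the gap is filled with the mean of the 2 nearest rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative imputation with posterior sampling: each seed yields a different
# plausible value, which is the starting point for multiple imputation.
draws = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)[2, 1]
    for seed in range(5)
]

print("median:", median_filled[2, 1])
print("knn:", knn_filled[2, 1])
print("iterative draws:", np.round(draws, 2))
```

The spread of the iterative draws gives a rough sense of the uncertainty that single-value methods such as median imputation hide.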

3. Indicator Methods

– Missingness Indicator Variables:
Creation of binary indicators (e.g., “is_missing”) flags which values are missing. These can be fed to machine learning models to capture any predictive power associated with the fact that data are missing.
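A missingness indicator can be added with pandas before the original column is imputed, as in this minimal sketch (the `income` column and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
df = pd.DataFrame({"income": [32000.0, np.nan, 58000.0, np.nan, 41000.0]})

# Record which rows were missing BEFORE imputing, so the information is not lost.
df["income_is_missing"] = df["income"].isna().astype(int)

# Then fill the gaps, here with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

In scikit-learn, `SimpleImputer(add_indicator=True)` produces equivalent indicator columns as part of the imputation step.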

4. Domain-Specific Methods

– Data Augmentation:
In cases where missingness is significant, one might use domain knowledge to simulate or synthesize missing data points, although this is highly context-dependent.

– Temporal or Spatial Interpolation:
For time series or spatial data, imputation methods that consider the temporal or spatial continuity (such as linear interpolation, forward-fill, or spatial kriging) are frequently used.
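For time series, pandas offers linear interpolation and forward-fill directly. The sketch below uses a hypothetical sensor series sampled once per minute; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings, one per minute, with gaps.
index = pd.date_range("2024-01-01", periods=6, freq="min")
temperature = pd.Series([20.0, np.nan, np.nan, 23.0, np.nan, 25.0], index=index)

# Linear interpolation: gaps are filled along a straight line between neighbours.
linear = temperature.interpolate(method="linear")

# Forward-fill: each gap repeats the last observed value.
forward = temperature.ffill()

print(pd.DataFrame({"raw": temperature, "linear": linear, "ffill": forward}))
```

Linear interpolation suits smoothly varying signals, whereas forward-fill is the more natural choice for values that stay constant until the next reading, such as a device state.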

Practical Examples

– In a retail transaction dataset, suppose the ‘customer_age’ column is missing for 5% of records. If missingness is random, mean or median imputation may suffice. If age is missing more frequently for certain store locations, one might stratify imputation by location or use regression models incorporating store features.
– In an IoT sensor dataset, where missing values occur due to transmission errors, interpolation may be used for time series features, while more complex methods like KNN imputation could be employed for cross-sensor data.
– For survey data with skipped questions, indicator variables might be introduced to capture the information that a respondent chose not to answer, which itself may have predictive value.
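The retail example can be sketched as follows, comparing a single global median with a median computed per store. The `store` and `customer_age` values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: two stores with clearly different customer age profiles.
df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "customer_age": [30.0, np.nan, 40.0, 60.0, 70.0, np.nan],
})

# Global median imputation ignores the store.
global_median = df["customer_age"].median()
df["age_global"] = df["customer_age"].fillna(global_median)

# Stratified imputation: fill each gap with the median of the same store.
store_median = df.groupby("store")["customer_age"].transform("median")
df["age_by_store"] = df["customer_age"].fillna(store_median)

print(df)
```

With the global median both gaps receive 50, while the stratified version assigns 35 in store A and 65 in store B, which is closer to each store's observed customers.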

Considerations When Selecting a Method

– Proportion of Missing Data:
If a variable is missing a high proportion of its values (commonly cited thresholds range from 20% to 50%), it may be prudent to drop the variable entirely, unless it holds significant domain importance.
– Downstream Algorithm Sensitivity:
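Some implementations of tree-based methods handle missing values natively (e.g., XGBoost, LightGBM, and scikit-learn's histogram-based gradient boosting), while many other models (e.g., linear regression, SVM) require imputation or deletion beforehand. Support varies by library and version, so it is worth checking the documentation of the specific implementation.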
– Assumptions about Missingness:
Understanding whether data are MCAR, MAR, or MNAR is critical, as the appropriateness and impact of each technique differ accordingly.
– Data Distribution Preservation:
Methods like mean/median imputation can distort the original data distribution, especially with skewed variables. More sophisticated imputation (regression, multiple imputation) better preserves statistical properties.
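The proportion-based rule can be expressed in a few lines of pandas. The 50% threshold below is an assumption taken from the upper end of the range mentioned above, not a recommended value, and the data are made up.

```python
import numpy as np
import pandas as pd

# Made-up data: column 'b' is 20% missing, column 'c' is 60% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "c": [np.nan, np.nan, np.nan, 4.0, 5.0],
})

threshold = 0.5  # assumed cut-off; choose it per project

# Share of missing values per column.
missing_share = df.isna().mean()

# Keep columns at or below the threshold, and list those that were dropped.
kept = df.loc[:, missing_share <= threshold]
dropped = missing_share[missing_share > threshold].index.tolist()

print("dropped columns:", dropped)
print(kept)
```

Listing the dropped columns explicitly makes it easier to review the decision with domain experts before a variable is discarded.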

General References on Pretraining Treatment of Data

Numerous authoritative texts and research articles address data preprocessing and the treatment of missing data, providing theoretical foundations and practical guidance. Key references include:

– "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman:
This classic text covers the statistical aspects of missing data and imputation in the context of predictive modeling.

– "Data Preparation for Data Mining" by Dorian Pyle:
A comprehensive resource focusing on practical aspects of data cleaning, including missing data treatment.

– "Applied Predictive Modeling" by Kuhn and Johnson:
Contains a dedicated section on handling missing data during the model-building pipeline and illustrates approaches with code examples.

– "Statistical Analysis with Missing Data" by Little and Rubin:
The definitive monograph on the statistical theory and methodology for handling missing data, including MCAR, MAR, and MNAR frameworks.

– Scikit-learn Documentation:
Provides practical implementation details for various imputation techniques, including `SimpleImputer`, `KNNImputer`, and `IterativeImputer` (multivariate imputation by iterated regression), with code samples.

– Google Cloud AI Platform Documentation:
Offers best practices for preparing data for cloud-based machine learning workflows, including recommendations for missing data management.

– Research Articles:
– Schafer, J.L., & Graham, J.W. (2002). Missing data: Our view of the state of the art. *Psychological Methods*, 7(2), 147–177.
– Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. *Journal of Statistical Software*, 45(3), 1-67.

– Online Tutorials and Guides:
– Kaggle’s Data Cleaning and Preprocessing tutorials.
– Google’s "Machine Learning Crash Course" section on data preparation.

Integration into Machine Learning Pipelines

Missing data management is not an isolated task but an integral part of the broader machine learning pipeline. Its impact reverberates through feature engineering, model fitting, evaluation, and even deployment. Modern platforms such as Google Cloud AI Platform, TensorFlow Extended (TFX), and Kubeflow Pipelines facilitate modular integration of data cleaning steps, including missing value imputation, as discrete, reproducible pipeline components.

– Automated Data Validation:
Tools such as TensorFlow Data Validation (TFDV) can detect missing values and distributional anomalies as part of automated data pipeline checks.
– Feature Store Integration:
A feature store such as Google’s Vertex AI Feature Store provides a single source of feature values for training and serving, which helps ensure that defaults and imputation applied during feature engineering stay consistent across both environments.

Best Practices

1. Always Document Data Cleaning Steps:
Maintain rigorous records of which imputation or deletion strategies were applied, along with rationale and statistical impact, to ensure reproducibility and facilitate model auditing.
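2. Evaluate Several Imputation Strategies:
Empirically compare the impact of different imputation techniques on downstream model performance using hold-out validation or cross-validation. Fit the imputer on the training data only, so that statistics from validation or test data do not leak into the model.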
3. Leverage Domain Knowledge:
Engage subject matter experts to assess whether certain missing values indicate data errors, meaningful absence, or require special treatment.
4. Use Automated Tools Judiciously:
While automated imputation tools save time, they should be complemented with careful validation and statistical scrutiny.
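One way to compare imputation strategies empirically is to place each imputer in a scikit-learn pipeline and score it with cross-validation, so that the imputer is refit on the training part of every fold. The sketch below uses synthetic data with values removed completely at random; the data, the 15% missing rate, and the choice of linear regression are all assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Remove roughly 15% of the entries completely at random (MCAR by construction).
mask = rng.random(X.shape) < 0.15
X_missing = X.copy()
X_missing[mask] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

# The pipeline fits the imputer inside each training fold, which avoids leakage.
scores = {}
for name, imputer in candidates.items():
    pipeline = make_pipeline(imputer, LinearRegression())
    scores[name] = cross_val_score(pipeline, X_missing, y, cv=5, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Differences between strategies on synthetic MCAR data tend to be small; the same loop applied to real data is where the comparison becomes informative.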

Challenges and Research Directions

The field continues to evolve, with research focusing on:

– Deep learning-based imputation methods (e.g., using Generative Adversarial Networks or Variational Autoencoders).
– Handling missing data in streaming or real-time settings.
– Causal inference approaches to missing data.
– Improved diagnostics to distinguish between MCAR, MAR, and MNAR in high-dimensional datasets.

Dealing with missing data is a multifaceted task that requires a blend of statistical insight, domain expertise, and practical engineering. The correct choice of technique depends on the data, the context, and the goals of the machine learning project. A robust data preparation phase, documented and tested with multiple techniques, is indispensable for building reliable machine learning models and ensuring that results are interpretable, reproducible, and actionable.

Tagged under: Artificial Intelligence, Data Cleaning, Data Preprocessing, EDA, Google Cloud, Imputation, Machine Learning, MAR, MCAR, MISSING DATA, MNAR