Hyperparameter tuning is a critical phase in the machine learning workflow, directly impacting the performance and generalization ability of models. Understanding when to calibrate hyperparameters requires a solid grasp of both the machine learning process and the function of hyperparameters within it.
Hyperparameters are configuration variables that are set before training begins. Unlike model parameters, such as the weights of a neural network, which are learned during training, hyperparameters govern the training process itself. Examples include the learning rate for gradient descent, the number of trees in a random forest, the maximum depth of a decision tree, the regularization strength, and the batch size, among many others. The chosen hyperparameter values determine how the model learns from data and can significantly influence performance on unseen data.
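The distinction is easiest to see in code. The following minimal sketch (using scikit-learn purely for illustration, with a synthetic dataset and arbitrary example values) fixes hyperparameters by hand before training, then reads back the parameters the model learned:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Hyperparameters: fixed by the practitioner before fitting.
clf = SGDClassifier(
    alpha=1e-4,                  # regularization strength
    learning_rate="constant",
    eta0=0.01,                   # learning rate for gradient descent
    max_iter=1000,
)

# Parameters: learned from the data during fitting.
clf.fit(X, y)
print(clf.coef_.shape, clf.intercept_)  # the learned weights and bias
```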
In the context of the widely recognized 7-step machine learning workflow, hyperparameter tuning aligns with the model training and evaluation phases. The seven steps typically follow this sequence:
1. Data collection
2. Data preparation and cleaning
3. Feature engineering
4. Model selection
5. Model training
6. Model evaluation
7. Model deployment
Hyperparameter calibration occurs after initial data preparation, feature engineering, and model selection. At this point, the practitioner has selected a model architecture or algorithm (for example, a support vector machine, random forest, or a neural network) and is ready to train it on the data. The next consideration is configuring the hyperparameters that govern the learning strategy and model complexity.
Hyperparameter tuning specifically begins after a baseline model has been trained and evaluated with default or heuristic hyperparameter values. This initial training provides a reference point and a performance benchmark. Once baseline metrics are established (such as accuracy, F1-score, or mean squared error, depending on the problem type), hyperparameter tuning commences, systematically exploring the impact of different configurations on model performance.
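A baseline of this kind can be as simple as fitting a model with its library defaults and recording a validation score. A minimal sketch, assuming scikit-learn and synthetic data (in scikit-learn's SVC, for instance, the defaults are C=1.0 and an RBF kernel):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: default hyperparameters, no tuning yet.
baseline = SVC().fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_val, baseline.predict(X_val)))
```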
The rationale for this ordering is twofold:
– Hyperparameter tuning is computationally expensive and time-consuming, particularly with large datasets or complex models. Beginning with a baseline using default settings prevents unnecessary expenditure of resources on models or configurations that are fundamentally unsuited to the problem.
– The insights gained from baseline performance can guide hyperparameter search strategies, helping to focus efforts on the most promising areas of the hyperparameter space.
A practical example can further clarify this process. Consider a practitioner building a classification model using a support vector machine (SVM). After preparing the data and performing feature engineering, the practitioner selects the SVM algorithm. Initially, the model is trained using default values for the regularization parameter C and the kernel type (e.g., linear kernel). Suppose the baseline accuracy is acceptable but not optimal.
At this stage, the practitioner proceeds to hyperparameter tuning. This involves systematically varying the values of C (e.g., trying values such as 0.1, 1, 10), experimenting with different kernel types (e.g., linear, polynomial, radial basis function), and adjusting other relevant hyperparameters. The model is retrained and re-evaluated for each configuration, typically using cross-validation to assess generalization performance. The hyperparameter combination that yields the best cross-validation performance is selected, resulting in a better-optimized model.
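In scikit-learn terms, this procedure corresponds to a grid search with cross-validation. A sketch under the same synthetic-data assumption, using the C values and kernels mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    "C": [0.1, 1, 10],                    # regularization values from the text
    "kernel": ["linear", "poly", "rbf"],  # linear, polynomial, radial basis function
}

# Retrain and re-evaluate each of the 9 configurations with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best configuration:", search.best_params_)
print("best cross-validation accuracy:", round(search.best_score_, 3))
```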
A similar process applies when using other algorithms. For random forests, the number of trees, maximum tree depth, and minimum samples per leaf are common hyperparameters to tune. For deep neural networks, learning rate, batch size, number of layers, and activation functions are typical targets for optimization.
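Expressed as search spaces, those targets might look as follows (the ranges are common starting points, not universal recommendations):

```python
# Typical hyperparameter search spaces; values are illustrative only.
rf_grid = {
    "n_estimators": [100, 300, 500],   # number of trees
    "max_depth": [None, 10, 30],       # maximum tree depth (None = unlimited)
    "min_samples_leaf": [1, 5, 10],    # minimum samples per leaf
}

nn_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "num_layers": [2, 4, 8],
    "activation": ["relu", "tanh"],
}
```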
Hyperparameter calibration is not performed before the model architecture or algorithm has been selected and a baseline has been established. Nor is it typically performed after the model has been deployed, except in scenarios involving online or continual learning, where models are periodically retrained or updated with new data.
There are several strategies for hyperparameter tuning. The most common approaches include:
– Grid search: Exhaustively tries all combinations of specified hyperparameter values. While thorough, it is computationally expensive and may not be feasible for high-dimensional hyperparameter spaces.
– Random search: Samples random combinations of hyperparameter values, which has been demonstrated to be more efficient than grid search in many cases, particularly when only a subset of hyperparameters significantly impacts performance (see the sketch after this list).
– Bayesian optimization: Utilizes probabilistic models to predict the performance of hyperparameter configurations, iteratively refining its search to focus on promising regions of the hyperparameter space.
– Automated machine learning (AutoML): Leverages advanced algorithms to automate the hyperparameter tuning process, often combining multiple search techniques.
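As an example of the second strategy, random search can draw hyperparameter values from distributions rather than enumerating a fixed grid. A sketch assuming scikit-learn, SciPy, and synthetic data:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Sample C log-uniformly over four orders of magnitude.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "kernel": ["linear", "rbf"],
}

# Evaluate 20 random configurations instead of an exhaustive grid.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```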
The choice of tuning strategy is determined by factors such as the complexity of the model, the size of the dataset, available computational resources, and the nature of the hyperparameters themselves.
An important aspect to consider during hyperparameter tuning is the risk of overfitting to the validation set. Excessive tuning can lead to models that perform well on the validation data but generalize poorly to new, unseen data. To mitigate this risk, practitioners often use a three-way split of the available data: a training set for fitting the model, a validation set for hyperparameter tuning, and a test set for final model evaluation. The test set is reserved exclusively for the final assessment, ensuring an unbiased estimate of generalization performance.
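A common way to realize the three-way split is two successive calls to a splitting utility; the proportions below (roughly 60/20/20) are only one conventional choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Carve off the test set first, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test)
# only once, for the final unbiased estimate of generalization performance.
```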
Calibrating hyperparameters is also closely tied to the goals and constraints of the project. For example, in applications where model interpretability is paramount, such as in healthcare or finance, hyperparameter tuning may prioritize settings that yield simpler, more transparent models. Conversely, in tasks where predictive performance is the primary objective, more complex models and extensive hyperparameter searches may be warranted.
The process of tuning hyperparameters is iterative and may require several cycles of experimentation and evaluation. Each iteration provides insights into the relationship between hyperparameter values and model performance, informing subsequent search directions. Tools and platforms, including those provided by Google Cloud Machine Learning, facilitate this process by enabling distributed hyperparameter tuning and automatic experiment tracking, allowing practitioners to efficiently manage and analyze large numbers of model training runs.
It is also important to consider the reproducibility of hyperparameter tuning experiments. Documenting hyperparameter configurations, random seeds, and evaluation metrics is necessary to ensure that results can be validated and compared across different runs and by different team members.
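At its simplest, such documentation can be a structured record written alongside each run; the file name, fields, and metric value below are placeholders, not a standard format:

```python
import json
import random

import numpy as np

SEED = 42
random.seed(SEED)      # fix Python's built-in RNG
np.random.seed(SEED)   # fix NumPy's RNG

run_record = {
    "model": "SVC",
    "hyperparameters": {"C": 10, "kernel": "rbf"},
    "random_seed": SEED,
    "cv_folds": 5,
    "metric": {"name": "accuracy", "value": 0.91},  # placeholder result
}
with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```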
Hyperparameter tuning is revisited whenever there is a significant change in the data, feature engineering process, or model architecture. For instance, if new features are added or the model is switched from a decision tree to a neural network, previous hyperparameter settings may no longer be optimal, necessitating a new round of tuning. Similarly, hyperparameter calibration may be repeated periodically in production environments where data distributions evolve over time (a phenomenon known as data drift).
In summary, hyperparameter calibration is performed after initial data preparation, feature engineering, and a baseline model have been established, and before the model is finalized and deployed. It is an iterative, resource-intensive process that involves systematically searching for the combination of hyperparameters that yields the best model performance, as measured by validation metrics. The process is informed by baseline results, guided by project objectives and constraints, and facilitated by a range of search strategies and tooling.