19.4. Multi-Fidelity Hyperparameter Optimization
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in SageMaker Studio Lab

Training neural networks can be expensive even on moderate size datasets. Depending on the configuration space (Section, hyperparameter optimization requires tens to hundreds of function evaluations to find a well-performing hyperparameter configuration. As we have seen in Section 19.3, we can significantly speed up the overall wall-clock time of HPO by exploiting parallel resources, but this does not reduce the total amount of compute required.

In this section, we will show how the evaluation of hyperparameter configurations can be sped up. Methods such as random search allocate the same amount of resources (e.g., number of epochs, training data points) to each hyperparameter evaluation. Fig. 19.4.1 depicts learning curves of a set of neural networks trained with different hyperparameter configurations. After a few epochs we are already able to visually distinguish between well-performing and suboptimal configurations. However, the learning curves are noisy, and we might still require the full amount of 100 epochs to identify the best performing one.


Fig. 19.4.1 Learning curves of random hyperparameter configurations

Multi-fidelity hyperparameter optimization allocates more resources to promising configurations and stop evaluations of poorly performing ones early. This speeds up the optimization process, since we can try a larger number of configurations for the same total amount of resources.

More formally, we expand our definition in Section 19.1.1, such that our objective function \(f(\mathbf{x}, r)\) gets an additional input \(r \in [r_{\mathrm{min}}, r_{max}]\), specifying the amount of resources that we are willing to spend for the evaluation of configuration \(\mathbf{x}\). We assume that the error \(f(\mathbf{x}, r)\) decreases with \(r\), whereas the computational cost \(c(\mathbf{x}, r)\) increases. Typically, \(r\) represents the number of epochs for training the neural network, but it could also be the training subset size or the number of cross-validation folds.

from collections import defaultdict
import numpy as np
from scipy import stats
from d2l import torch as d2l


19.4.1. Successive Halving

One of the simplest ways to adapt random search to the multi-fidelity setting is successive halving (Jamieson and Talwalkar, 2016, Karnin et al., 2013). The basic idea is to start with \(N\) configurations, for example randomly sampled from the configuration space, and to train each of them for \(r_{\mathrm{min}}\) epochs only. We then discard a fraction of the worst performing trials and train the remaining ones for longer. Iterating this process, fewer trials run for longer, until at least one trial reaches \(r_{max}\) epochs.

More formally, consider a minimum budget \(r_{\mathrm{min}}\) (for example 1 epoch), a maximum budget \(r_{max}\), for example max_epochs in our previous example, and a halving constant \(\eta\in\{2, 3, \dots\}\). For simplicity, assume that \(r_{max} = r_{\mathrm{min}} \eta^K\), with \(K \in \mathbb{I}\) . The number of initial configurations is then \(N = \eta^K\). Let us define the set of rungs \(\mathcal{R} = \{ r_{\mathrm{min}}, r_{\mathrm{min}}\eta, r_{\mathrm{min}}\eta^2, \dots, r_{max} \}\).

One round of successive halving proceeds as follows. We start with running \(N\) trials until the first rung \(r_{\mathrm{min}}\). Sorting the validation errors, we keep the top \(1 / \eta\) fraction (which amounts to \(\eta^{K-1}\) configurations) and discard all the rest. The surviving trials are trained for the next rung (\(r_{\mathrm{min}}\eta\) epochs), and the process is repeated. At each rung, a \(1 / \eta\) fraction of trials survives and their training continues with a \(\eta\) times larger budget. With this particular choice of \(N\), only a single trial will be trained to the full budget \(r_{max}\). Once such a round of successive halving is done, we start the next one with a new set of initial configurations, iterating until the total budget is spent.


Fig. 19.4.2 Learning curves of random hyperparameter configurations.

We subclass the HPOScheduler base class from Section 19.2 in order to implement successive halving, allowing for a generic HPOSearcher object to sample configurations (which, in our example below, will be a RandomSearcher). Additionally, the user has to pass the minimum resource \(r_{\mathrm{min}}\), the maximum resource \(r_{max}\) and \(\eta\) as input. Inside our scheduler, we maintain a queue of configurations that still need to be evaluated for the current rung \(r_i\). We update the queue every time we jump to the next rung.

class SuccessiveHalvingScheduler(d2l.HPOScheduler):  #@save
    def __init__(self, searcher, eta, r_min, r_max, prefact=1):
        # Compute K, which is later used to determine the number of configurations
        self.K = int(np.log(r_max / r_min) / np.log(eta))
        # Define the rungs
        self.rung_levels = [r_min * eta ** k for k in range(self.K + 1)]
        if r_max not in self.rung_levels:
            # The final rung should be r_max
            self.K += 1
        # Bookkeeping
        self.observed_error_at_rungs = defaultdict(list)
        self.all_observed_error_at_rungs = defaultdict(list)
        # Our processing queue
        self.queue = []

In the beginning our queue is empty, and we fill it with \(n = \textrm{prefact} \cdot \eta^{K}\) configurations, which are first evaluated on the smallest rung \(r_{\mathrm{min}}\). Here, \(\textrm{prefact}\) allows us to reuse our code in a different context. For the purpose of this section, we fix \(\textrm{prefact} = 1\). Every time resources become available and the HPOTuner object queries the suggest function, we return an element from the queue. Once we finish one round of successive halving, which means that we evaluated all surviving configurations on the highest resource level \(r_{max}\) and our queue is empty, we start the entire process again with a new, randomly sampled set of configurations.

@d2l.add_to_class(SuccessiveHalvingScheduler)  #@save
def suggest(self):
    if len(self.queue) == 0:
        # Start a new round of successive halving
        # Number of configurations for the first rung:
        n0 = int(self.prefact * self.eta ** self.K)
        for _ in range(n0):
            config = self.searcher.sample_configuration()
            config["max_epochs"] = self.r_min  # Set r = r_min
    # Return an element from the queue
    return self.queue.pop()

When we collected a new data point, we first update the searcher module. Afterwards we check if we already collect all data points on the current rung. If so, we sort all configurations and push the top \(\frac{1}{\eta}\) configurations into the queue.

@d2l.add_to_class(SuccessiveHalvingScheduler)  #@save
def update(self, config: dict, error: float, info=None):
    ri = int(config["max_epochs"])  # Rung r_i
    # Update our searcher, e.g if we use Bayesian optimization later
    self.searcher.update(config, error, additional_info=info)
    self.all_observed_error_at_rungs[ri].append((config, error))
    if ri < self.r_max:
        # Bookkeeping
        self.observed_error_at_rungs[ri].append((config, error))
        # Determine how many configurations should be evaluated on this rung
        ki = self.K - self.rung_levels.index(ri)
        ni = int(self.prefact * self.eta ** ki)
        # If we observed all configuration on this rung r_i, we estimate the
        # top 1 / eta configuration, add them to queue and promote them for
        # the next rung r_{i+1}
        if len(self.observed_error_at_rungs[ri]) >= ni:
            kiplus1 = ki - 1
            niplus1 = int(self.prefact * self.eta ** kiplus1)
            best_performing_configurations = self.get_top_n_configurations(
                rung_level=ri, n=niplus1
            riplus1 = self.rung_levels[self.K - kiplus1]  # r_{i+1}
            # Queue may not be empty: insert new entries at the beginning
            self.queue = [
                dict(config, max_epochs=riplus1)
                for config in best_performing_configurations
            ] + self.queue
            self.observed_error_at_rungs[ri] = []  # Reset

Configurations are sorted based on their observed performance on the current rung.

@d2l.add_to_class(SuccessiveHalvingScheduler)  #@save
def get_top_n_configurations(self, rung_level, n):
    rung = self.observed_error_at_rungs[rung_level]
    if not rung:
        return []
    sorted_rung = sorted(rung, key=lambda x: x[1])
    return [x[0] for x in sorted_rung[:n]]

Let us see how successive halving is doing on our neural network example. We will use \(r_{\mathrm{min}} = 2\), \(\eta = 2\), \(r_{max} = 10\), so that rung levels are \(2, 4, 8, 10\).

min_number_of_epochs = 2
max_number_of_epochs = 10
eta = 2

config_space = {
    "learning_rate": stats.loguniform(1e-2, 1),
    "batch_size": stats.randint(32, 256),
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,

We just replace the scheduler with our new SuccessiveHalvingScheduler.

searcher = d2l.RandomSearcher(config_space, initial_config=initial_config)
scheduler = SuccessiveHalvingScheduler(
tuner = d2l.HPOTuner(
    error = 0.17762434482574463, runtime = 53.576584339141846

We can visualize the learning curves of all configurations that we evaluated. Most of the configurations are stopped early and only the better performing configurations survive until \(r_{max}\). Compare this to vanilla random search, which would allocate \(r_{max}\) to every configuration.

for rung_index, rung in scheduler.all_observed_error_at_rungs.items():
    errors = [xi[1] for xi in rung]
    d2l.plt.scatter([rung_index] * len(errors), errors)
d2l.plt.xlim(min_number_of_epochs - 0.5, max_number_of_epochs + 0.5)
    np.arange(min_number_of_epochs, max_number_of_epochs + 1),
    np.arange(min_number_of_epochs, max_number_of_epochs + 1)
d2l.plt.ylabel("validation error")
Text(0.5, 0, 'epochs')

Finally, note some slight complexity in our implementation of SuccessiveHalvingScheduler. Say that a worker is free to run a job, and suggest is called when the current rung has almost been completely filled, but another worker is still busy with an evaluation. Since we lack the metric value from this worker, we cannot determine the top \(1 / \eta\) fraction to open up the next rung. On the other hand, we want to assign a job to our free worker, so it does not remain idle. Our solution is to start a new round of successive halving and assign our worker to the first trial there. However, once a rung is completed in update, we make sure to insert new configurations at the beginning of the queue, so they take precedence over configurations from the next round.

19.4.2. Summary

In this section, we introduced the concept of multi-fidelity hyperparameter optimization, where we assume to have access to cheap-to-evaluate approximations of the objective function, such as validation error after a certain number of epochs of training as proxy to validation error after the full number of epochs. Multi-fidelity hyperparameter optimization allows to reduce the overall computation of the HPO instead of just reducing the wall-clock time.

We implemented and evaluated successive halving, a simple yet efficient multi-fidelity HPO algorithm.