.. _sec_sh_async:

Asynchronous Successive Halving
===============================


As we have seen in :numref:`sec_rs_async`, we can accelerate HPO by
distributing the evaluation of hyperparameter configurations across
either multiple instances or multiples CPUs / GPUs on a single instance.
However, compared to random search, it is not straightforward to run
successive halving (SH) asynchronously in a distributed setting. Before
we can decide which configuration to run next, we first have to collect
all observations at the current rung level. This requires to synchronize
workers at each rung level. For example, for the lowest rung level
:math:`r_{\mathrm{min}}`, we first have to evaluate all
:math:`N = \eta^K` configurations, before we can promote the
:math:`\frac{1}{\eta}` of them to the next rung level.

In any distributed system, synchronization typically implies idle time
for workers. First, we often observe high variations in training time
across hyperparameter configurations. For example, assuming the number
of filters per layer is a hyperparameter, then networks with less
filters finish training faster than networks with more filters, which
implies idle worker time due to stragglers. Moreover, the number of
slots in a rung level is not always a multiple of the number of workers,
in which case some workers may even sit idle for a full batch.

Figure :numref:`synchronous_sh` shows the scheduling of synchronous SH
with :math:`\eta=2` for four different trials with two workers. We start
with evaluating Trial-0 and Trial-1 for one epoch and immediately
continue with the next two trials once they are finished. We first have
to wait until Trial-2 finishes, which takes substantially more time than
the other trials, before we can promote the best two trials, i.e.,
Trial-0 and Trial-3 to the next rung level. This causes idle time for
Worker-1. Then, we continue with Rung 1. Also, here Trial-3 takes longer
than Trial-0, which leads to an additional ideling time of Worker-0.
Once, we reach Rung-2, only the best trial, Trial-0, remains which
occupies only one worker. To avoid that Worker-1 idles during that time,
most implementaitons of SH continue already with the next round, and
start evaluating new trials (e.g Trial-4) on the first rung.

.. _synchronous_sh:

.. figure:: ../img/sync_sh.svg

   Synchronous successive halving with two workers.


Asynchronous successive halving (ASHA) :cite:`li-arxiv18` adapts SH to
the asynchronous parallel scenario. The main idea of ASHA is to promote
configurations to the next rung level as soon as we collected at least
:math:`\eta` observations on the current rung level. This decision rule
may lead to suboptimal promotions: configurations can be promoted to the
next rung level, which in hindsight do not compare favourably against
most others at the same rung level. On the other hand, we get rid of all
synchronization points this way. In practice, such suboptimal initial
promotions have only a modest impact on performance, not only because
the ranking of hyperparameter configurations is often fairly consistent
across rung levels, but also because rungs grow over time and reflect
the distribution of metric values at this level better and better. If a
worker is free, but no configuration can be promoted, we start a new
configuration with :math:`r = r_{\mathrm{min}}`, i.e the first rung
level.

:numref:`asha` shows the scheduling of the same configurations for
ASHA. Once Trial-1 finishes, we collect the results of two trials (i.e
Trial-0 and Trial-1) and immediately promote the better of them
(Trial-0) to the next rung level. After Trial-0 finishes on rung 1,
there are too few trials there in order to support a further promotion.
Hence, we continue with rung 0 and evaluate Trial-3. Once Trial-3
finishes, Trial-2 is still pending. At this point we have 3 trials
evaluated on rung 0 and one trial evaluated already on rung 1. Since
Trial-3 performs worse than Trial-0 at rung 0, and :math:`\eta=2`, we
cannot promote any new trial yet, and Worker-1 starts Trial-4 from
scratch instead. However, once Trial-2 finishes and scores worse than
Trial-3, the latter is promoted towards rung 1. Afterwards, we collected
2 evaluations on rung 1, which means we can now promote Trial-0 towards
rung 2. At the same time, Worker-1 continues with evaluating new trials
(i.e., Trial-5) on rung 0.

.. _asha:

.. figure:: ../img/asha.svg

   Asynchronous successive halving (ASHA) with two workers.


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import logging
    from d2l import torch as d2l
    
    logging.basicConfig(level=logging.INFO)
    import matplotlib.pyplot as plt
    from syne_tune import StoppingCriterion, Tuner
    from syne_tune.backend.python_backend import PythonBackend
    from syne_tune.config_space import loguniform, randint
    from syne_tune.experiments import load_experiment
    from syne_tune.optimizer.baselines import ASHA


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
       pip install 'syne-tune[extra]'
    AWS dependencies are not imported since dependencies are missing. You can install them with
       pip install 'syne-tune[aws]'
    or (for everything)
       pip install 'syne-tune[extra]'
    AWS dependencies are not imported since dependencies are missing. You can install them with
       pip install 'syne-tune[aws]'
    or (for everything)
       pip install 'syne-tune[extra]'
    INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
       pip install 'syne-tune[raytune]'
    or (for everything)
       pip install 'syne-tune[extra]'


Objective Function
------------------

We will use *Syne Tune* with the same objective function as in
:numref:`sec_rs_async`.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
        from syne_tune import Reporter
        from d2l import torch as d2l
    
        model = d2l.LeNet(lr=learning_rate, num_classes=10)
        trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
        data = d2l.FashionMNIST(batch_size=batch_size)
        model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
        report = Reporter()
        for epoch in range(1, max_epochs + 1):
            if epoch == 1:
                # Initialize the state of Trainer
                trainer.fit(model=model, data=data)
            else:
                trainer.fit_epoch()
            validation_error = trainer.validation_error().cpu().detach().numpy()
            report(epoch=epoch, validation_error=float(validation_error))

We will also use the same configuration space as before:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    min_number_of_epochs = 2
    max_number_of_epochs = 10
    eta = 2
    
    config_space = {
        "learning_rate": loguniform(1e-2, 1),
        "batch_size": randint(32, 256),
        "max_epochs": max_number_of_epochs,
    }
    initial_config = {
        "learning_rate": 0.1,
        "batch_size": 128,
    }

Asynchronous Scheduler
----------------------

First, we define the number of workers that evaluate trials
concurrently. We also need to specify how long we want to run random
search, by defining an upper limit on the total wall-clock time.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    n_workers = 2  # Needs to be <= the number of available GPUs
    max_wallclock_time = 12 * 60  # 12 minutes

The code for running ASHA is a simple variation of what we did for
asynchronous random search.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    mode = "min"
    metric = "validation_error"
    resource_attr = "epoch"
    
    scheduler = ASHA(
        config_space,
        metric=metric,
        mode=mode,
        points_to_evaluate=[initial_config],
        max_resource_attr="max_epochs",
        resource_attr=resource_attr,
        grace_period=min_number_of_epochs,
        reduction_factor=eta,
    )


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
    INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 3140976097


Here, ``metric`` and ``resource_attr`` specify the key names used with
the ``report`` callback, and ``max_resource_attr`` denotes which input
to the objective function corresponds to :math:`r_{\mathrm{max}}`.
Moreover, ``grace_period`` provides :math:`r_{\mathrm{min}}`, and
``reduction_factor`` is :math:`\eta`. We can run Syne Tune as before
(this will take about 12 minutes):

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    trial_backend = PythonBackend(
        tune_function=hpo_objective_lenet_synetune,
        config_space=config_space,
    )
    
    stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)
    tuner = Tuner(
        trial_backend=trial_backend,
        scheduler=scheduler,
        stop_criterion=stop_criterion,
        n_workers=n_workers,
        print_update_interval=int(max_wallclock_time * 0.6),
    )
    tuner.run()


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
    INFO:root:Detected 4 GPUs
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/0/checkpoints
    INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.44639554136672527 --batch_size 196 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/1/checkpoints
    INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.44639554136672527, 'batch_size': 196, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.011548051321691994 --batch_size 254 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/2/checkpoints
    INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.011548051321691994, 'batch_size': 254, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.14942487313193167 --batch_size 132 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/3/checkpoints
    INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.14942487313193167, 'batch_size': 132, 'max_epochs': 10}
    INFO:syne_tune.tuner:Trial trial_id 1 completed.
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.06317157191455719 --batch_size 242 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/4/checkpoints
    INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.06317157191455719, 'batch_size': 242, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.48801815412811467 --batch_size 41 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/5/checkpoints
    INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.48801815412811467, 'batch_size': 41, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5904067586747807 --batch_size 244 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/6/checkpoints
    INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.5904067586747807, 'batch_size': 244, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08812857364095393 --batch_size 148 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/7/checkpoints
    INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.08812857364095393, 'batch_size': 148, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.012271314788363914 --batch_size 235 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/8/checkpoints
    INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.012271314788363914, 'batch_size': 235, 'max_epochs': 10}
    INFO:syne_tune.tuner:Trial trial_id 5 completed.
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08845692598296777 --batch_size 236 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/9/checkpoints
    INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.08845692598296777, 'batch_size': 236, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0825770880068151 --batch_size 75 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/10/checkpoints
    INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.0825770880068151, 'batch_size': 75, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.20235201406823256 --batch_size 65 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/11/checkpoints
    INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.20235201406823256, 'batch_size': 65, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3359885631737537 --batch_size 58 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/12/checkpoints
    INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.3359885631737537, 'batch_size': 58, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7892434579795236 --batch_size 89 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/13/checkpoints
    INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.7892434579795236, 'batch_size': 89, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1233786579597858 --batch_size 176 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/14/checkpoints
    INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.1233786579597858, 'batch_size': 176, 'max_epochs': 10}
    INFO:syne_tune.tuner:Trial trial_id 13 completed.
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13707981127012328 --batch_size 141 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/15/checkpoints
    INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.13707981127012328, 'batch_size': 141, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02913976299993913 --batch_size 116 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/16/checkpoints
    INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.02913976299993913, 'batch_size': 116, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.033362897489792855 --batch_size 154 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/17/checkpoints
    INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.033362897489792855, 'batch_size': 154, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.29442952580755816 --batch_size 210 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/18/checkpoints
    INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.29442952580755816, 'batch_size': 210, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10214259921521483 --batch_size 239 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/19/checkpoints
    INFO:syne_tune.tuner:(trial 19) - scheduled config {'learning_rate': 0.10214259921521483, 'batch_size': 239, 'max_epochs': 10}
    INFO:syne_tune.tuner:tuning status (last metric is reported)
     trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
            0    Stopped     4       0.100000         128          10    4.0          0.430578    29.093798
            1  Completed    10       0.446396         196          10   10.0          0.205652    72.747496
            2    Stopped     2       0.011548         254          10    2.0          0.900570    13.729115
            3    Stopped     8       0.149425         132          10    8.0          0.259171    58.980305
            4    Stopped     4       0.063172         242          10    4.0          0.900579    27.773950
            5  Completed    10       0.488018          41          10   10.0          0.140488   113.171314
            6    Stopped    10       0.590407         244          10   10.0          0.193776    70.364757
            7    Stopped     2       0.088129         148          10    2.0          0.899955    14.169738
            8    Stopped     2       0.012271         235          10    2.0          0.899840    13.434274
            9    Stopped     2       0.088457         236          10    2.0          0.899801    13.034437
           10    Stopped     4       0.082577          75          10    4.0          0.385970    35.426524
           11    Stopped     4       0.202352          65          10    4.0          0.543102    34.653495
           12    Stopped    10       0.335989          58          10   10.0          0.149558    90.924182
           13  Completed    10       0.789243          89          10   10.0          0.144887    77.365970
           14    Stopped     2       0.123379         176          10    2.0          0.899987    12.422906
           15    Stopped     2       0.137080         141          10    2.0          0.899983    13.395153
           16    Stopped     4       0.029140         116          10    4.0          0.900532    27.834111
           17    Stopped     2       0.033363         154          10    2.0          0.899996    13.407285
           18 InProgress     1       0.294430         210          10    1.0          0.899878     6.126259
           19 InProgress     0       0.102143         239          10      -                 -            -
    2 trials running, 18 finished (3 until the end), 437.07s wallclock-time
    
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02846298236356246 --batch_size 115 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/20/checkpoints
    INFO:syne_tune.tuner:(trial 20) - scheduled config {'learning_rate': 0.02846298236356246, 'batch_size': 115, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.037703019195187606 --batch_size 91 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/21/checkpoints
    INFO:syne_tune.tuner:(trial 21) - scheduled config {'learning_rate': 0.037703019195187606, 'batch_size': 91, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0741039859356903 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/22/checkpoints
    INFO:syne_tune.tuner:(trial 22) - scheduled config {'learning_rate': 0.0741039859356903, 'batch_size': 192, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3032613031191755 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/23/checkpoints
    INFO:syne_tune.tuner:(trial 23) - scheduled config {'learning_rate': 0.3032613031191755, 'batch_size': 252, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.019823425532533637 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/24/checkpoints
    INFO:syne_tune.tuner:(trial 24) - scheduled config {'learning_rate': 0.019823425532533637, 'batch_size': 252, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.8203370335228594 --batch_size 77 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/25/checkpoints
    INFO:syne_tune.tuner:(trial 25) - scheduled config {'learning_rate': 0.8203370335228594, 'batch_size': 77, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2960420911378594 --batch_size 104 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/26/checkpoints
    INFO:syne_tune.tuner:(trial 26) - scheduled config {'learning_rate': 0.2960420911378594, 'batch_size': 104, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2993874715754653 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/27/checkpoints
    INFO:syne_tune.tuner:(trial 27) - scheduled config {'learning_rate': 0.2993874715754653, 'batch_size': 192, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08056711961080017 --batch_size 36 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/28/checkpoints
    INFO:syne_tune.tuner:(trial 28) - scheduled config {'learning_rate': 0.08056711961080017, 'batch_size': 36, 'max_epochs': 10}
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.26868380288030347 --batch_size 151 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/29/checkpoints
    INFO:syne_tune.tuner:(trial 29) - scheduled config {'learning_rate': 0.26868380288030347, 'batch_size': 151, 'max_epochs': 10}
    INFO:syne_tune.tuner:Trial trial_id 29 completed.
    INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9197404791177789 --batch_size 66 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/30/checkpoints
    INFO:syne_tune.tuner:(trial 30) - scheduled config {'learning_rate': 0.9197404791177789, 'batch_size': 66, 'max_epochs': 10}
    INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
    INFO:syne_tune.tuner:Stopping trials that may still be running.
    INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046
    --------------------
    Resource summary (last result is reported):
     trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
            0    Stopped     4       0.100000         128          10      4          0.430578    29.093798
            1  Completed    10       0.446396         196          10     10          0.205652    72.747496
            2    Stopped     2       0.011548         254          10      2          0.900570    13.729115
            3    Stopped     8       0.149425         132          10      8          0.259171    58.980305
            4    Stopped     4       0.063172         242          10      4          0.900579    27.773950
            5  Completed    10       0.488018          41          10     10          0.140488   113.171314
            6    Stopped    10       0.590407         244          10     10          0.193776    70.364757
            7    Stopped     2       0.088129         148          10      2          0.899955    14.169738
            8    Stopped     2       0.012271         235          10      2          0.899840    13.434274
            9    Stopped     2       0.088457         236          10      2          0.899801    13.034437
           10    Stopped     4       0.082577          75          10      4          0.385970    35.426524
           11    Stopped     4       0.202352          65          10      4          0.543102    34.653495
           12    Stopped    10       0.335989          58          10     10          0.149558    90.924182
           13  Completed    10       0.789243          89          10     10          0.144887    77.365970
           14    Stopped     2       0.123379         176          10      2          0.899987    12.422906
           15    Stopped     2       0.137080         141          10      2          0.899983    13.395153
           16    Stopped     4       0.029140         116          10      4          0.900532    27.834111
           17    Stopped     2       0.033363         154          10      2          0.899996    13.407285
           18    Stopped     8       0.294430         210          10      8          0.241193    52.089688
           19    Stopped     2       0.102143         239          10      2          0.900002    12.487762
           20    Stopped     2       0.028463         115          10      2          0.899995    14.100359
           21    Stopped     2       0.037703          91          10      2          0.900026    14.664848
           22    Stopped     2       0.074104         192          10      2          0.901730    13.312770
           23    Stopped     2       0.303261         252          10      2          0.900009    12.725821
           24    Stopped     2       0.019823         252          10      2          0.899917    12.533380
           25    Stopped    10       0.820337          77          10     10          0.196842    81.816103
           26    Stopped    10       0.296042         104          10     10          0.198453    81.121330
           27    Stopped     4       0.299387         192          10      4          0.336183    24.610689
           28 InProgress     9       0.080567          36          10      9          0.203052   104.303746
           29  Completed    10       0.268684         151          10     10          0.222814    68.217289
           30 InProgress     1       0.919740          66          10      1          0.900037    10.070776
    2 trials running, 29 finished (4 until the end), 723.70s wallclock-time
    
    validation_error: best 0.1404876708984375 for trial-id 5
    --------------------


Note that we are running a variant of ASHA where underperforming trials
are stopped early. This is different to our implementation in
:numref:`sec_mf_hpo_sh`, where each training job is started with a
fixed ``max_epochs``. In the latter case, a well-performing trial which
reaches the full 10 epochs, first needs to train 1, then 2, then 4, then
8 epochs, each time starting from scratch. This type of pause-and-resume
scheduling can be implemented efficiently by checkpointing the training
state after each epoch, but we avoid this extra complexity here. After
the experiment has finished, we can retrieve and plot results.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    d2l.set_figsize()
    e = load_experiment(tuner.name)
    e.plot()


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.


.. figure:: output_sh-async_bb0ea6_13_1.svg


Visualize the Optimization Process
----------------------------------

Once more, we visualize the learning curves of every trial (each color
in the plot represents a trial). Compare this to asynchronous random
search in :numref:`sec_rs_async`. As we have seen for successive
halving in :numref:`sec_mf_hpo`, most of the trials are stopped at 1
or 2 epochs (:math:`r_{\mathrm{min}}` or
:math:`\eta * r_{\mathrm{min}}`). However, trials do not stop at the
same point, because they require different amount of time per epoch. If
we ran standard successive halving instead of ASHA, we would need to
synchronize our workers, before we can promote configurations to the
next rung level.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    d2l.set_figsize([6, 2.5])
    results = e.results
    for trial_id in results.trial_id.unique():
        df = results[results["trial_id"] == trial_id]
        d2l.plt.plot(
            df["st_tuner_time"],
            df["validation_error"],
            marker="o"
        )
    d2l.plt.xlabel("wall-clock time")
    d2l.plt.ylabel("objective function")


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Text(0, 0.5, 'objective function')


.. figure:: output_sh-async_bb0ea6_15_1.svg


Summary
-------

Compared to random search, successive halving is not quite as trivial to
run in an asynchronous distributed setting. To avoid synchronisation
points, we promote configurations as quickly as possible to the next
rung level, even if this means promoting some wrong ones. In practice,
this usually does not hurt much, and the gains of asynchronous versus
synchronous scheduling are usually much higher than the loss of the
suboptimal decision making.

`Discussions <https://discuss.d2l.ai/t/12101>`__