.. _sec_sh_async:

Asynchronous Successive Halving
===============================

As we have seen in :numref:`sec_rs_async`, we can accelerate HPO by distributing the evaluation of hyperparameter configurations across either multiple instances or multiple CPUs / GPUs on a single instance. However, compared to random search, it is not straightforward to run successive halving (SH) asynchronously in a distributed setting. Before we can decide which configuration to run next, we first have to collect all observations at the current rung level. This requires synchronizing the workers at each rung level. For example, for the lowest rung level :math:`r_{\mathrm{min}}`, we first have to evaluate all :math:`N = \eta^K` configurations before we can promote the best :math:`\frac{1}{\eta}` fraction of them to the next rung level.

In any distributed system, synchronization typically implies idle time for workers. First, we often observe high variation in training time across hyperparameter configurations. For example, assuming the number of filters per layer is a hyperparameter, networks with fewer filters finish training faster than networks with more filters, which implies idle worker time due to stragglers. Moreover, the number of slots in a rung level is not always a multiple of the number of workers, in which case some workers may even sit idle for a full batch.

:numref:`synchronous_sh` shows the scheduling of synchronous SH with :math:`\eta=2` for four different trials with two workers. We start by evaluating Trial-0 and Trial-1 for one epoch and immediately continue with the next two trials once they are finished. We then have to wait until Trial-2 finishes, which takes substantially more time than the other trials, before we can promote the best two trials, i.e., Trial-0 and Trial-3, to the next rung level. This causes idle time for Worker-1. Next, we continue with rung 1. Here, too, Trial-3 takes longer than Trial-0, which leads to additional idle time for Worker-0. Once we reach rung 2, only the best trial, Trial-0, remains, and it occupies only one worker. To avoid Worker-1 sitting idle during that time, most implementations of SH already continue with the next round and start evaluating new trials (e.g., Trial-4) on the first rung.

.. _synchronous_sh:

.. figure:: ../img/sync_sh.svg

    Synchronous successive halving with two workers.

Asynchronous successive halving (ASHA) :cite:`li-arxiv18` adapts SH to the asynchronous parallel scenario. The main idea of ASHA is to promote configurations to the next rung level as soon as we have collected at least :math:`\eta` observations on the current rung level. This decision rule may lead to suboptimal promotions: configurations can be promoted to the next rung level which, in hindsight, do not compare favorably against most others at the same rung level. On the other hand, we get rid of all synchronization points this way. In practice, such suboptimal initial promotions have only a modest impact on performance, not only because the ranking of hyperparameter configurations is often fairly consistent across rung levels, but also because rungs grow over time and reflect the distribution of metric values at their level better and better. If a worker is free, but no configuration can be promoted, we start a new configuration with :math:`r = r_{\mathrm{min}}`, i.e., the first rung level.
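
To make this decision rule concrete, the following sketch shows how a free worker could be assigned work under ASHA. This is a simplified illustration only: the function name ``asha_next_task``, the ``rungs`` data structure, and ``sample_new_config`` are made up for exposition and do not correspond to Syne Tune's internal API, which we use below.

.. code:: python

    import numpy as np

    def asha_next_task(rungs, eta, sample_new_config):
        """Decide what a free worker should run next (simplified ASHA sketch).

        ``rungs`` is a list with one dict per rung level (lowest level first).
        Each dict maps a trial id to ``(metric, already_promoted)``, where
        lower metric values are better.
        """
        # Scan rungs from the highest level downwards: promotions take
        # priority over starting new trials
        for level in reversed(range(len(rungs) - 1)):
            recorded = rungs[level]
            if len(recorded) < eta:
                continue  # not enough observations on this rung yet
            # A trial can be promoted if it lies in the best 1/eta fraction
            # of all metric values recorded on this rung so far
            cutoff = np.quantile(
                [metric for metric, _ in recorded.values()], 1 / eta
            )
            for trial_id, (metric, promoted) in recorded.items():
                if not promoted and metric <= cutoff:
                    recorded[trial_id] = (metric, True)  # mark as promoted
                    return "promote", trial_id, level + 1
        # No promotion is possible: start a new configuration on the lowest rung
        return "start_new", sample_new_config(), 0

For instance, with :math:`\eta = 2`, a rung becomes eligible for a promotion as soon as two trials have reported there, and the better of the two is moved up. This is exactly what happens at the beginning of the schedule discussed next.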

:numref:`asha` shows the scheduling of the same configurations for ASHA. Once Trial-1 finishes, we collect the results of two trials (i.e., Trial-0 and Trial-1) and immediately promote the better of them (Trial-0) to the next rung level. After Trial-0 finishes on rung 1, there are too few trials there to support a further promotion. Hence, we continue with rung 0 and evaluate Trial-3. Once Trial-3 finishes, Trial-2 is still pending. At this point we have three trials evaluated on rung 0 and one trial already evaluated on rung 1. Since Trial-3 performs worse than Trial-0 on rung 0, and :math:`\eta=2`, we cannot promote any new trial yet, so Worker-1 starts Trial-4 from scratch instead. However, once Trial-2 finishes and scores worse than Trial-3, the latter is promoted to rung 1. At that point, we have collected two evaluations on rung 1, which means we can now promote Trial-0 to rung 2. At the same time, Worker-1 continues with evaluating new trials (i.e., Trial-5) on rung 0.

.. _asha:

.. figure:: ../img/asha.svg

    Asynchronous successive halving (ASHA) with two workers.

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    import logging
    from d2l import torch as d2l

    logging.basicConfig(level=logging.INFO)
    import matplotlib.pyplot as plt
    from syne_tune import StoppingCriterion, Tuner
    from syne_tune.backend.python_backend import PythonBackend
    from syne_tune.config_space import loguniform, randint
    from syne_tune.experiments import load_experiment
    from syne_tune.optimizer.baselines import ASHA

.. raw:: latex

    \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
        pip install 'syne-tune[extra]'
    AWS dependencies are not imported since dependencies are missing. You can install them with
        pip install 'syne-tune[aws]'
    or (for everything)
        pip install 'syne-tune[extra]'
    AWS dependencies are not imported since dependencies are missing. You can install them with
        pip install 'syne-tune[aws]'
    or (for everything)
        pip install 'syne-tune[extra]'
    INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
        pip install 'syne-tune[raytune]'
    or (for everything)
        pip install 'syne-tune[extra]'

Objective Function
------------------

We will use *Syne Tune* with the same objective function as in :numref:`sec_rs_async`.

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
        from syne_tune import Reporter
        from d2l import torch as d2l

        model = d2l.LeNet(lr=learning_rate, num_classes=10)
        trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
        data = d2l.FashionMNIST(batch_size=batch_size)
        model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
        report = Reporter()
        for epoch in range(1, max_epochs + 1):
            if epoch == 1:
                # Initialize the state of Trainer
                trainer.fit(model=model, data=data)
            else:
                trainer.fit_epoch()
            validation_error = trainer.validation_error().cpu().detach().numpy()
            report(epoch=epoch, validation_error=float(validation_error))

We will also use the same configuration space as before:

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    min_number_of_epochs = 2
    max_number_of_epochs = 10
    eta = 2

    config_space = {
        "learning_rate": loguniform(1e-2, 1),
        "batch_size": randint(32, 256),
        "max_epochs": max_number_of_epochs,
    }
    initial_config = {
        "learning_rate": 0.1,
        "batch_size": 128,
    }
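
While ``max_number_of_epochs`` enters ``config_space`` as ``max_epochs``, the values ``min_number_of_epochs`` and ``eta`` are only used by the ASHA scheduler defined below, where they play the roles of :math:`r_{\mathrm{min}}` and :math:`\eta`, respectively. As a quick sanity check (the following snippet is only an illustration of how successive halving spaces its rung levels, not a Syne Tune API call), we can work out where the rung levels will lie for these settings:

.. code:: python

    # Illustration only: rung levels implied by r_min = 2, eta = 2 and a
    # maximum of 10 epochs (each rung level is eta times the previous one)
    rung_levels = []
    level = min_number_of_epochs
    while level < max_number_of_epochs:
        rung_levels.append(level)
        level *= eta
    print(rung_levels)  # [2, 4, 8]

Each trial therefore faces a continue-or-stop decision after 2, 4, and 8 epochs, which is consistent with the epochs at which most trials end up being stopped in the results further below.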

Asynchronous Scheduler
----------------------

First, we define the number of workers that evaluate trials concurrently. We also need to specify how long we want to run ASHA, by defining an upper limit on the total wall-clock time.

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    n_workers = 2  # Needs to be <= the number of available GPUs
    max_wallclock_time = 12 * 60  # 12 minutes

The code for running ASHA is a simple variation of what we did for asynchronous random search.

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    mode = "min"
    metric = "validation_error"
    resource_attr = "epoch"

    scheduler = ASHA(
        config_space,
        metric=metric,
        mode=mode,
        points_to_evaluate=[initial_config],
        max_resource_attr="max_epochs",
        resource_attr=resource_attr,
        grace_period=min_number_of_epochs,
        reduction_factor=eta,
    )

.. raw:: latex

    \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
    INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 3140976097

Here, ``metric`` and ``resource_attr`` specify the key names used with the ``report`` callback, and ``max_resource_attr`` denotes which input to the objective function corresponds to :math:`r_{\mathrm{max}}`. Moreover, ``grace_period`` provides :math:`r_{\mathrm{min}}`, and ``reduction_factor`` is :math:`\eta`. We can run Syne Tune as before (this will take about 12 minutes):

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    trial_backend = PythonBackend(
        tune_function=hpo_objective_lenet_synetune,
        config_space=config_space,
    )

    stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)
    tuner = Tuner(
        trial_backend=trial_backend,
        scheduler=scheduler,
        stop_criterion=stop_criterion,
        n_workers=n_workers,
        print_update_interval=int(max_wallclock_time * 0.6),
    )
    tuner.run()

.. raw:: latex

    \diilbookstyleoutputcell

..
parsed-literal:: :class: output INFO:syne_tune.tuner:results of trials will be saved on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046 INFO:root:Detected 4 GPUs INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/0/checkpoints INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.44639554136672527 --batch_size 196 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/1/checkpoints INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.44639554136672527, 'batch_size': 196, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.011548051321691994 --batch_size 254 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/2/checkpoints INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.011548051321691994, 'batch_size': 254, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.14942487313193167 --batch_size 132 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/3/checkpoints INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.14942487313193167, 'batch_size': 132, 'max_epochs': 10} INFO:syne_tune.tuner:Trial trial_id 1 completed. 
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.06317157191455719 --batch_size 242 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/4/checkpoints INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.06317157191455719, 'batch_size': 242, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.48801815412811467 --batch_size 41 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/5/checkpoints INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.48801815412811467, 'batch_size': 41, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5904067586747807 --batch_size 244 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/6/checkpoints INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.5904067586747807, 'batch_size': 244, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08812857364095393 --batch_size 148 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/7/checkpoints INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.08812857364095393, 'batch_size': 148, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.012271314788363914 --batch_size 235 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/8/checkpoints INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.012271314788363914, 'batch_size': 235, 'max_epochs': 10} INFO:syne_tune.tuner:Trial trial_id 5 completed. 
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08845692598296777 --batch_size 236 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/9/checkpoints INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.08845692598296777, 'batch_size': 236, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0825770880068151 --batch_size 75 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/10/checkpoints INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.0825770880068151, 'batch_size': 75, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.20235201406823256 --batch_size 65 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/11/checkpoints INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.20235201406823256, 'batch_size': 65, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3359885631737537 --batch_size 58 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/12/checkpoints INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.3359885631737537, 'batch_size': 58, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7892434579795236 --batch_size 89 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/13/checkpoints INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.7892434579795236, 'batch_size': 89, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1233786579597858 --batch_size 176 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/14/checkpoints INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.1233786579597858, 'batch_size': 
176, 'max_epochs': 10} INFO:syne_tune.tuner:Trial trial_id 13 completed. INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13707981127012328 --batch_size 141 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/15/checkpoints INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.13707981127012328, 'batch_size': 141, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02913976299993913 --batch_size 116 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/16/checkpoints INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.02913976299993913, 'batch_size': 116, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.033362897489792855 --batch_size 154 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/17/checkpoints INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.033362897489792855, 'batch_size': 154, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.29442952580755816 --batch_size 210 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/18/checkpoints INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.29442952580755816, 'batch_size': 210, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10214259921521483 --batch_size 239 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/19/checkpoints INFO:syne_tune.tuner:(trial 19) - scheduled config {'learning_rate': 0.10214259921521483, 'batch_size': 239, 'max_epochs': 10} INFO:syne_tune.tuner:tuning status (last metric is reported) trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time 0 Stopped 4 0.100000 128 10 4.0 0.430578 29.093798 1 Completed 10 0.446396 196 10 10.0 0.205652 72.747496 2 Stopped 2 0.011548 254 10 2.0 0.900570 13.729115 3 Stopped 8 0.149425 132 10 8.0 0.259171 58.980305 4 Stopped 4 0.063172 242 10 4.0 0.900579 27.773950 5 Completed 10 0.488018 41 10 10.0 0.140488 113.171314 6 Stopped 10 
0.590407 244 10 10.0 0.193776 70.364757 7 Stopped 2 0.088129 148 10 2.0 0.899955 14.169738 8 Stopped 2 0.012271 235 10 2.0 0.899840 13.434274 9 Stopped 2 0.088457 236 10 2.0 0.899801 13.034437 10 Stopped 4 0.082577 75 10 4.0 0.385970 35.426524 11 Stopped 4 0.202352 65 10 4.0 0.543102 34.653495 12 Stopped 10 0.335989 58 10 10.0 0.149558 90.924182 13 Completed 10 0.789243 89 10 10.0 0.144887 77.365970 14 Stopped 2 0.123379 176 10 2.0 0.899987 12.422906 15 Stopped 2 0.137080 141 10 2.0 0.899983 13.395153 16 Stopped 4 0.029140 116 10 4.0 0.900532 27.834111 17 Stopped 2 0.033363 154 10 2.0 0.899996 13.407285 18 InProgress 1 0.294430 210 10 1.0 0.899878 6.126259 19 InProgress 0 0.102143 239 10 - - - 2 trials running, 18 finished (3 until the end), 437.07s wallclock-time INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02846298236356246 --batch_size 115 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/20/checkpoints INFO:syne_tune.tuner:(trial 20) - scheduled config {'learning_rate': 0.02846298236356246, 'batch_size': 115, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.037703019195187606 --batch_size 91 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/21/checkpoints INFO:syne_tune.tuner:(trial 21) - scheduled config {'learning_rate': 0.037703019195187606, 'batch_size': 91, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.0741039859356903 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/22/checkpoints INFO:syne_tune.tuner:(trial 22) - scheduled config {'learning_rate': 0.0741039859356903, 'batch_size': 192, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.3032613031191755 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/23/checkpoints INFO:syne_tune.tuner:(trial 23) - scheduled config {'learning_rate': 0.3032613031191755, 'batch_size': 252, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.019823425532533637 --batch_size 252 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 
--st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/24/checkpoints INFO:syne_tune.tuner:(trial 24) - scheduled config {'learning_rate': 0.019823425532533637, 'batch_size': 252, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.8203370335228594 --batch_size 77 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/25/checkpoints INFO:syne_tune.tuner:(trial 25) - scheduled config {'learning_rate': 0.8203370335228594, 'batch_size': 77, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2960420911378594 --batch_size 104 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/26/checkpoints INFO:syne_tune.tuner:(trial 26) - scheduled config {'learning_rate': 0.2960420911378594, 'batch_size': 104, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2993874715754653 --batch_size 192 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/27/checkpoints INFO:syne_tune.tuner:(trial 27) - scheduled config {'learning_rate': 0.2993874715754653, 'batch_size': 192, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08056711961080017 --batch_size 36 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/28/checkpoints INFO:syne_tune.tuner:(trial 28) - scheduled config {'learning_rate': 0.08056711961080017, 'batch_size': 36, 'max_epochs': 10} INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.26868380288030347 --batch_size 151 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/29/checkpoints INFO:syne_tune.tuner:(trial 29) - scheduled config {'learning_rate': 0.26868380288030347, 'batch_size': 151, 'max_epochs': 10} INFO:syne_tune.tuner:Trial trial_id 29 completed. 
INFO:root:running subprocess with command: /usr/bin/python /home/ci/.local/lib/python3.8/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.9197404791177789 --batch_size 66 --max_epochs 10 --tune_function_root /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/tune_function --tune_function_hash e03d187e043d2a17cae636d6af164015 --st_checkpoint_dir /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046/30/checkpoints INFO:syne_tune.tuner:(trial 30) - scheduled config {'learning_rate': 0.9197404791177789, 'batch_size': 66, 'max_epochs': 10} INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there. INFO:syne_tune.tuner:Stopping trials that may still be running. INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/ci/syne-tune/python-entrypoint-2023-08-18-20-01-52-046 -------------------- Resource summary (last result is reported): trial_id status iter learning_rate batch_size max_epochs epoch validation_error worker-time 0 Stopped 4 0.100000 128 10 4 0.430578 29.093798 1 Completed 10 0.446396 196 10 10 0.205652 72.747496 2 Stopped 2 0.011548 254 10 2 0.900570 13.729115 3 Stopped 8 0.149425 132 10 8 0.259171 58.980305 4 Stopped 4 0.063172 242 10 4 0.900579 27.773950 5 Completed 10 0.488018 41 10 10 0.140488 113.171314 6 Stopped 10 0.590407 244 10 10 0.193776 70.364757 7 Stopped 2 0.088129 148 10 2 0.899955 14.169738 8 Stopped 2 0.012271 235 10 2 0.899840 13.434274 9 Stopped 2 0.088457 236 10 2 0.899801 13.034437 10 Stopped 4 0.082577 75 10 4 0.385970 35.426524 11 Stopped 4 0.202352 65 10 4 0.543102 34.653495 12 Stopped 10 0.335989 58 10 10 0.149558 90.924182 13 Completed 10 0.789243 89 10 10 0.144887 77.365970 14 Stopped 2 0.123379 176 10 2 0.899987 12.422906 15 Stopped 2 0.137080 141 10 2 0.899983 13.395153 16 Stopped 4 0.029140 116 10 4 0.900532 27.834111 17 Stopped 2 0.033363 154 10 2 0.899996 13.407285 18 Stopped 8 0.294430 210 10 8 0.241193 52.089688 19 Stopped 2 0.102143 239 10 2 0.900002 12.487762 20 Stopped 2 0.028463 115 10 2 0.899995 14.100359 21 Stopped 2 0.037703 91 10 2 0.900026 14.664848 22 Stopped 2 0.074104 192 10 2 0.901730 13.312770 23 Stopped 2 0.303261 252 10 2 0.900009 12.725821 24 Stopped 2 0.019823 252 10 2 0.899917 12.533380 25 Stopped 10 0.820337 77 10 10 0.196842 81.816103 26 Stopped 10 0.296042 104 10 10 0.198453 81.121330 27 Stopped 4 0.299387 192 10 4 0.336183 24.610689 28 InProgress 9 0.080567 36 10 9 0.203052 104.303746 29 Completed 10 0.268684 151 10 10 0.222814 68.217289 30 InProgress 1 0.919740 66 10 1 0.900037 10.070776 2 trials running, 29 finished (4 until the end), 723.70s wallclock-time validation_error: best 0.1404876708984375 for trial-id 5 -------------------- Note that we are running a variant of ASHA where underperforming trials are stopped early. This is different to our implementation in :numref:`sec_mf_hpo_sh`, where each training job is started with a fixed ``max_epochs``. In the latter case, a well-performing trial which reaches the full 10 epochs, first needs to train 1, then 2, then 4, then 8 epochs, each time starting from scratch. This type of pause-and-resume scheduling can be implemented efficiently by checkpointing the training state after each epoch, but we avoid this extra complexity here. After the experiment has finished, we can retrieve and plot results. .. raw:: latex \diilbookstyleinputcell .. code:: python d2l.set_figsize() e = load_experiment(tuner.name) e.plot() .. raw:: latex \diilbookstyleoutputcell .. 
parsed-literal::
    :class: output

    WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

.. figure:: output_sh-async_bb0ea6_13_1.svg

Visualize the Optimization Process
----------------------------------

Once more, we visualize the learning curves of every trial (each color in the plot represents a trial). Compare this to asynchronous random search in :numref:`sec_rs_async`. As we have seen for successive halving in :numref:`sec_mf_hpo`, most of the trials are stopped at 2 or 4 epochs (:math:`r_{\mathrm{min}}` or :math:`\eta \cdot r_{\mathrm{min}}`). However, trials do not stop at the same point in time, because they require different amounts of time per epoch. If we ran standard successive halving instead of ASHA, we would need to synchronize our workers before we could promote configurations to the next rung level.

.. raw:: latex

    \diilbookstyleinputcell

.. code:: python

    d2l.set_figsize([6, 2.5])
    results = e.results
    for trial_id in results.trial_id.unique():
        df = results[results["trial_id"] == trial_id]
        d2l.plt.plot(
            df["st_tuner_time"],
            df["validation_error"],
            marker="o"
        )
    d2l.plt.xlabel("wall-clock time")
    d2l.plt.ylabel("objective function")

.. raw:: latex

    \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Text(0, 0.5, 'objective function')

.. figure:: output_sh-async_bb0ea6_15_1.svg

Summary
-------

Compared to random search, successive halving is not quite as trivial to run in an asynchronous distributed setting. To avoid synchronization points, we promote configurations as quickly as possible to the next rung level, even if this means promoting some wrong ones. In practice, this usually does not hurt much, and the gains of asynchronous over synchronous scheduling usually far outweigh the losses due to such suboptimal decisions.

`Discussions `__