# WHAT TO EXPECT OF HARDWARE METRIC PREDICTORS IN NEURAL ARCHITECTURE SEARCH
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Modern Neural Architecture Search (NAS) focuses on finding the best performing architectures in hardware-aware settings; e.g., those with an optimal tradeoff
of accuracy and latency. Due to many advantages of prediction models over live
measurements, the search process is often guided by estimates of how well each
considered network architecture performs on the desired metrics. Typical prediction models range from operation-wise lookup tables to gradient-boosted trees
and neural networks, with little known about how they compare. We
evaluate 18 different performance predictors on ten combinations of metrics, devices, network types, and training tasks, and find that MLP models are the most
promising. We then simulate and evaluate how the guidance of such prediction
models affects the subsequent architecture selection. Due to inaccurate predictions, the selected architectures are generally suboptimal, which we quantify as
an expected reduction in accuracy and hypervolume. We show that simply verifying the predictions of just the selected architectures can lead to substantially
improved results. Under a time budget, we find it preferable to use a fast and
inaccurate prediction model over accurate but slow live measurements.
1 INTRODUCTION
Modern neural network architectures are designed not only considering their primary objective,
such as accuracy. While existing architectures can be scaled down to work with the limited available
memory and computational power of, e.g., mobile phones, they are significantly outperformed by
specifically designed architectures (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018;
Ma et al., 2018). Standard hardware metrics include memory usage, number of model parameters,
Multiply-Accumulate operations, energy consumption, latency, and more; each of which may be
limited by the hardware platform or network task. As the range of tasks and target platforms grows,
specialized architectures and the methods to find them efficiently are gaining importance.
The automated design and discovery of specialized architectures is the main intent of Neural Architecture Search (NAS). This recent field of study repeatedly broke state of the art records (Zoph et al.,
2018; Real et al., 2018; Cai et al., 2019; Tan & Le, 2019; Chu et al., 2019a; Hu et al., 2020) while
aiming to reduce the researchers’ involvement with this tedious and time-consuming process to a
minimum. As the performance of each considered architecture needs to be evaluated, the hardware
metrics need to be either measured live or guessed by a trained prediction model. While measuring live has the advantage of not suffering from inaccurate predictions, the corresponding hardware
needs to be available during the search process. Measuring on-demand may also significantly slow
down the search process and necessitates further measurements for each new architecture search.
On the other hand, a prediction model abstracts the hardware from the search code and simplifies
changes to the optimization targets, such as metrics or devices. The data set to train the predictor
also has to be collected only once so that a trained predictor then works in the absence of the hardware it is predicting for, e.g., in a cloud environment. Furthermore, a differentiable predictor can be
used for gradient-based architecture optimization of typically non-differentiable metrics (Cai et al.,
2019; Xu et al., 2020; Nayman et al., 2021).
While the many advantages make predictors a popular choice of hardware-aware NAS (e.g. Xu
et al. (2020); Wu et al. (2019); Wan et al. (2020); Dai et al. (2020); Nayman et al. (2021)), there
are no guidelines on which predictors perform best, how many training samples are required, or
what happens when a predictor is inaccurate. This work investigates the above points. As a first
contribution, we conduct large-scale experiments on ten hardware-metric datasets chosen from HW-NAS-Bench (Li et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). We explore how
powerful the different predictors are when using different amounts of training data and whether
these results generalize across different network architecture types. As a second contribution, we
extensively simulate the subsequent architecture selection to investigate the impact of inaccurate
predictors. Our results demonstrate the effectiveness of network-based prediction models and provide
insights into predictor mistakes and what to expect from them. To facilitate reproducibility and
further research, our experimental results and code are made available in Appendix A.
2 RELATED WORK
**NAS Benchmarks:** As the search spaces of NAS methods often differ from one another and lack
extensive studies, the difficulty of fair comparisons and reproducibility have become a major concern
(Yang et al., 2019; Li & Talwalkar, 2020). To alleviate this problem, researchers have exhaustively
evaluated search spaces of several thousand architectures to create benchmarks (Ying et al., 2019;
Dong & Yang, 2020; Dong et al., 2020; Siems et al., 2020), containing detailed statistics for each
architecture. TransNAS-Bench-101 (Duan et al., 2021) evaluates several thousand architectures
across seven diverse tasks and finds that the best task-specific architectures may vary significantly.
The popular NAS-BENCH 201 benchmark (Dong & Yang, 2020) has been further extended with ten
different hardware metrics for all 15625 architectures on each of the three data sets CIFAR10, CIFAR100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017). Major findings of
this HW-NAS Bench (Li et al., 2021a) include that FLOPs and the number of parameters are a poor
approximation for other metrics such as latency. Many existing NAS methods use such inadequate
substitutes for their simplicity and would benefit from their replacement with better prediction models. Li et al. also find that hardware-specific costs do not correlate well across hardware platforms.
While accounting for each device’s characteristics improves the NAS results, it is also expensive.
Predictors can reduce costs by requiring fewer measurements and shorter query times.[1]
**Predictors in NAS:** Aside from real-time measurements (Tan et al., 2019; Yang et al., 2018),
hardware metric estimation in NAS is commonly performed via Lookup Table (Wu et al., 2019),
Analytical Estimation or a Prediction Model (Dai et al., 2020; Xu et al., 2020). While an operation- and layer-wise Lookup Table can accurately estimate hardware-agnostic metrics, such as FLOPs or
the number of parameters (Cai et al., 2019; Guo et al., 2020; Chu et al., 2019a), it may be suboptimal for device-dependent metrics. Latency and energy consumption have non-obvious factors that
depend on hardware specifics such as memory, cache usage, the ability to parallelize each operation,
and an interplay between different network operations. Such details can be captured with neural
networks (Dai et al., 2020; Mendoza & Wang, 2020; Ponomarev et al., 2020; Xu et al., 2020) or
other specialized models (Yao et al., 2018; Wess et al., 2021).
Of particular interest is the correct prediction of the model loss or accuracy, possibly reducing the
architecture search time by orders of magnitude (Mellor et al., 2020; Wang et al., 2021; Li et al.,
2021b). In addition to common predictors such as Linear Regression, Random Forests (Liaw et al.,
2002) or Gaussian Processes (Rasmussen, 2003), specialized techniques may exploit training curve
extrapolation, network weight sharing or gradient information. Our experiments follow the recent
large-scale study of White et al. (2021), who compare 31 diverse accuracy prediction methods based
on initialization and query time, using three NAS benchmarks.
3 PREDICTING HARDWARE METRICS
Our methods follow the large-scale study of White et al. (2021), who compared a total of 31 accuracy prediction methods. The differences between accuracy and hardware-metric prediction, our
selection of predictors, and the general training pipeline are described in this section. In our experiments on HW-NAS-Bench and TransNAS-Bench-101, described in Section 4, we then compare
these predictors across different training set sizes.
[1] For further reading, we recommend a recent survey on hardware-aware NAS (Benmeziane et al., 2021).
-----
**Differences to accuracy predictors:** There are fundamental differences when predicting hardware metrics and the accuracy of network topologies. The most essential is the cost to obtain a
helpful predictor, which may vary widely for accuracy prediction methods. While determining the
test accuracy requires the costly and lengthy training of networks, measuring hardware metrics does
not necessitate any network training. Consequently, specialized accuracy-estimation methods that
rely on trained networks, loss history, learning curve extrapolation, or early stopping do not apply to
hardware metrics. Furthermore, so-called zero-cost proxies that predict metrics from the gradients
of a single batch are dependent on the network topology but not on the hardware the network is
placed on. Therefore, the dominant hardware-metric predictor family is model-based.
Since all relevant predictors are model-based, they can be compared by their training set size. This
reduces the initialization time of a predictor to the number of previously measured architectures on
which it is trained. In stark contrast, some accuracy predictors do not need any training data,
while others require several partially or fully trained networks. Since an untrained network and a
few batches suffice to measure a hardware-metric, the collection of such a training set is comparably
inexpensive.
Additionally, hardware predictors are generally used as a supplement to a one-shot network optimized
for loss or accuracy. Depending on the NAS method, a fully differentiable predictor is required in
order to guide the gradient-based architecture selection. Typical choices are Lookup Tables (Cai
et al., 2019; Nayman et al., 2021) and neural networks (Xu et al., 2020).
**Model-based predictors:** The goal of a predictor $f_p(a)$ is to accurately approximate the function
$f(a)$, which may be, e.g., the latency of an architecture $a$ from the search space $\mathcal{A}$. A model-based
predictor is trained via supervised learning on a set $D_{train}$ of datapoints $(a, f(a))$, after which it can
be inexpensively queried for estimates on further architectures. The collection of the dataset and the
duration of the training are referred to as initialization time and training time respectively.
The quality of such a trained predictor is generally determined by the (ranking) correlation between
measurements $\{f(a) \mid a \in \mathcal{A}_{test}\}$ and predictions $\{f_p(a) \mid a \in \mathcal{A}_{test}\}$ on the unseen architectures
$\mathcal{A}_{test} \subset \mathcal{A}$. Common correlation metric choices are Pearson (PCC), Spearman (SCC) and Kendall’s
Tau (KT) (Chu et al., 2019b; Yu et al., 2020; Siems et al., 2020).
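To make the ranking metric concrete, the following minimal sketch computes Kendall's Tau in plain Python. It implements the simple tau-a variant without tie correction, which is a simplification chosen here for brevity; library implementations such as `scipy.stats.kendalltau` default to the tie-aware tau-b. The function name `kendall_tau_a` and the toy latency values are illustrative and not taken from the paper.

```python
from itertools import combinations


def kendall_tau_a(measured, predicted):
    """Kendall's Tau-a between two equally long sequences.

    Counts concordant and discordant pairs; tied pairs count as neither.
    """
    if len(measured) != len(predicted):
        raise ValueError("sequences must have the same length")
    n = len(measured)
    if n < 2:
        raise ValueError("need at least two architectures")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (measured[i] - measured[j]) * (predicted[i] - predicted[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy example: measured values f(a) and predictions f_p(a) for four architectures.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 3.2, 2.9, 4.5]  # swaps the ranking of the 2nd and 3rd architecture
tau = kendall_tau_a(measured, predicted)
```

With one of the six architecture pairs ranked in the wrong order, five pairs are concordant and one is discordant, so the correlation is (5 - 1) / 6.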
Our experiments include 18 model-based predictors from different families: Linear Regression,
Ridge Regression (Saunders et al., 1998), Bayesian Linear Regression (Bishop, 2007), Support
Vector Machines (Cortes & Vapnik, 1995), Gaussian Process (Rasmussen, 2003), Sparse Gaussian Process (Candela & Rasmussen, 2005), Random Forests (Liaw et al., 2002), XGBoost (Chen &
Guestrin, 2016), NGBoost (Duan et al., 2020), LGBoost (Ke et al., 2017), BOHAMIANN (Springenberg et al., 2016), BANANAS (White et al., 2019), BONAS (Shi et al., 2020), GCN (Wen
et al., 2020), small and large Multi-Layer-Perceptrons (MLP), NAO (Luo et al., 2018), and a layer- and operation-wise Lookup Table model. We provide further descriptions and implementation details in
Appendix B.
**Hyper-parameter tuning:** The default hyperparameters of the used predictors vary significantly
in their levels of hyper-parameter tuning, especially in the context of NAS. Additionally, some predictors may internally make use of cross-validation, while others do not. Following White et al.
(2021), we attempt to level the playing field by running a cross-validation random-search over hyperparameters each time a predictor is fit to data. Each search is limited to 5000 iterations and a total
run time of 15 minutes and naturally excludes any test data. The predictor-specific parameter details
are given in Appendix C.
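The budgeted search described above can be sketched as follows. This is not the NASLib implementation: the one-dimensional ridge model, the interleaved three-fold split, the log-uniform sampling range, and all function names are assumptions chosen to keep the example self-contained. Only the two stopping criteria (5000 iterations, 15 minutes) and the exclusion of test data come from the text.

```python
import math
import random
import time


def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge regression for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, folds=3):
    """Mean squared validation error over `folds` interleaved folds."""
    errors = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        valid = [i for i in range(len(xs)) if i % folds == k]
        w = ridge_fit_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in valid) / len(valid))
    return sum(errors) / folds


def random_search(xs, ys, max_iters=5000, max_seconds=15 * 60, seed=0):
    """Sample hyperparameters at random until either budget is exhausted."""
    rng = random.Random(seed)
    deadline = time.monotonic() + max_seconds
    best_lam, best_score, iters = None, math.inf, 0
    while iters < max_iters and time.monotonic() < deadline:
        lam = 10 ** rng.uniform(-4, 2)  # log-uniform sampling range (assumption)
        score = cv_mse(xs, ys, lam)
        if score < best_score:
            best_lam, best_score = lam, score
        iters += 1
    return best_lam, best_score, iters


# Toy training data only; test data is never touched during the search.
xs = [float(i) for i in range(1, 13)]
ys = [2.0 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
best_lam, best_score, iters = random_search(xs, ys, max_iters=200, max_seconds=5)
```

Checking both budgets in the loop condition means that cheap predictors are limited by the iteration count, while expensive ones are cut off by the wall-clock limit.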
**Training pipeline** To make a reliable comparison, we use the NASLib library (Ruchte et al.
(2020), see Appendix A). We fit each predictor on each dataset and training size 50 times, using
seeds {0, ..., 49}.
Some predictors internally normalize the training values (subtract mean, divide by standard deviation). We choose to explicitly do this for all predictors and datasets, which reduces the dependency
of hyper-parameters (e.g. learning rate) on the dataset and allows us to analyze and compare the
prediction errors across datasets more effectively.
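A minimal sketch of this normalization step is shown below. Computing the statistics on the training targets only and using the population standard deviation are assumptions made here; the latency values and function names are hypothetical.

```python
import statistics


def fit_normalizer(train_values):
    """Return (mean, std) computed on the training targets."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (assumption)
    if std == 0:
        raise ValueError("cannot normalize constant targets")
    return mean, std


def normalize(values, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Map normalized predictions back to the original unit."""
    return [v * std + mean for v in values]


train_latency_ms = [12.0, 15.5, 9.8, 20.1, 14.2]  # hypothetical measurements
mean, std = fit_normalizer(train_latency_ms)
z = normalize(train_latency_ms, mean, std)
restored = denormalize(z, mean, std)
```

Because every dataset then has zero mean and unit variance, a prediction error of, say, 0.5 has the same meaning for latency, energy, and arithmetic intensity.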
-----
4 PREDICTOR EXPERIMENTS
We compare the different predictor models based on two NAS benchmarks, HW-NAS-Bench (Li
et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). They differ considerably by their
network tasks, hardware devices, and architecture designs.
**HW-NAS-Bench architecture design and datasets** In HW-NAS-Bench, each architecture is
solely defined by the topology of a building block ("cell"), which is stacked multiple times to create
a complete network. Each cell is completely defined by choosing six candidate operations. Since
each operation is selected from five different candidates, there are $5^6 = 15625$ unique cell topologies.
These cells are not fully sequential but contain paths of different lengths, which is visualized in
Appendix D.
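The size of this search space can be verified by enumeration. In the sketch below, the operation names are those commonly used for NAS-Bench-201 and are an assumption, since the text does not list them. The one-hot encoding is a generic way to turn a cell into predictor input, not necessarily the encoding used in the experiments.

```python
from itertools import product

# Operation names as used by NAS-Bench-201 (assumption: not listed in the text).
CANDIDATE_OPS = ("none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3")
NUM_CHOICES = 6  # six operation choices per cell

# Every cell is one assignment of a candidate operation to each of the six slots.
cells = list(product(CANDIDATE_OPS, repeat=NUM_CHOICES))


def one_hot(cell):
    """Flatten a cell into a 6 x 5 = 30-dimensional binary feature vector."""
    features = []
    for op in cell:
        features.extend(1 if op == candidate else 0 for candidate in CANDIDATE_OPS)
    return features


example_features = one_hot(cells[0])
```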
HW-NAS-Bench provides ten hardware statistics on CIFAR10, CIFAR100 (Krizhevsky et al., 2009)
and ImageNet16-120 (Chrabaszcz et al., 2017), of which we exclude the incomplete EdgeTPU metric. Thus there are 27 data sets of varying difficulty. As detailed in Appendix E, 12 of them can
be accurately fit with Linear Regression and only 25 training samples. Many are also very similar
since their measured networks differ only by the number of image classes. We therefore select five
datasets that (1) are not trivial to learn, as they are non-linear, and (2) are not redundant:
_• ImageNet16-120, raspi4, latency_
_• CIFAR100, pixel3, latency_
_• CIFAR10, edgegpu, latency_
_• CIFAR100, edgegpu, energy consumption_
_• ImageNet16-120, eyeriss, arithmetic intensity_
**TransNAS-Bench-101 architecture design and datasets** TransNAS-Bench-101 contains information for 7,352 different network architectures, used as backbones in seven diverse vision tasks.
Since 4,096 are also a subset of HW-NAS-Bench, we focus on the remaining 3,256 architectures
with a macro-level search space. Unlike a micro-level search space, where a cell is stacked multiple
times to create a network, each network layer and block is considered individually. In particular, the
TransNAS-Bench-101 networks consist of four to six pairs of ResNet blocks (He et al., 2016), which
may modify the image size and channels in four ways: not at all, double the channel count, halve the
spatial size, and both. Every network has to double the channel count 1 to 3 times, resulting in 3,256
unique architectures. The networks may consequently differ in their number of layers (depth), the
number of channels (width), and image size at any layer.
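The count of 3,256 can be reproduced by brute-force enumeration under one extra constraint. The text states only that the channel count is doubled 1 to 3 times; the sketch below additionally assumes, following the benchmark's own description rather than the text above, that the spatial size is halved 1 to 4 times. Without this second constraint the enumeration yields more architectures. All names are illustrative.

```python
from itertools import product

# Per-block options encoded as (doubles_channels, halves_spatial_size).
BLOCK_OPTIONS = ((0, 0), (1, 0), (0, 1), (1, 1))


def is_valid(arch):
    """Check the constraints on channel doublings and downsamplings."""
    channel_doublings = sum(c for c, _ in arch)
    downsamplings = sum(d for _, d in arch)
    # 1 to 3 channel doublings is stated in the text; 1 to 4 downsamplings
    # is an additional assumption taken from the benchmark's description.
    return 1 <= channel_doublings <= 3 and 1 <= downsamplings <= 4


def macro_search_space(min_blocks=4, max_blocks=6):
    """Yield every valid architecture with four to six blocks."""
    for depth in range(min_blocks, max_blocks + 1):
        for arch in product(BLOCK_OPTIONS, repeat=depth):
            if is_valid(arch):
                yield arch


num_architectures = sum(1 for _ in macro_search_space())
```

The depths 4, 5, and 6 contribute 210, 750, and 2,296 valid architectures respectively.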
As done for HW-NAS-Bench, we select five of the seven available datasets for their latency measurements. Aside from the self-supervised Jigsaw task, there is little difference between the cross-task
latency measurements (see Appendix E). We evaluate the possibly redundant datasets nonetheless,
since latency predictions in macro-level search spaces are an important domain for NAS on image
classification and object detection tasks:
_• Object classification_
_• Scene classification_
_• Room layout_
_• Jigsaw_
_• Semantic segmentation_
**Fitting results and comparison** The results, averaged over all selected HW-NAS-Bench and
TransNAS-Bench-101 datasets, are presented in Figures 1a and 1b, respectively. The left plots
present the absolute predictor performance, the right ones make relative comparisons easier.
Unsurprisingly, more training samples (i.e., evaluated architectures) generally lead to better prediction results, even until the entire search space is known (aside from the test set). This is true for
most of the predictors, although e.g. Gaussian Processes and BOHAMIANN saturate early. The
simple Linear Regression and Ridge Regression models also fail to make proper use of hundreds
of data points but perform decently when only a few training samples are available. Interestingly,
the same is true for the graph-encoding network-based predictors BONAS and GCN. While knowing how the different paths within each cell connect (see Appendix B) is especially useful given
fewer than fifty training samples, the advantage disappears afterward. In contrast, the graph-encoding
encoder-decoder approach of NAO performs decently at all times.
[Figure 1 plots omitted. Each subfigure has two panels over the training set size (log scale): Kendall's Tau (absolute) on the left and Kendall's Tau (centered on average) on the right. Panel titles: "Average over HW-NAS datasets" for (a) and "Average over TransNAS datasets" for (b). Curves: Lin. Reg., Bayes. Lin. Reg., Ridge Reg., XGBoost, NGBoost, LGBoost, Random Forests, Sparse GP, GP, BOHAMIANN, SVM Reg., NAO, GCN, BONAS, BANANAS, MLP (large), MLP (small), Lookup Table.]
(a) Results on HW-NAS-Bench. NAO performs decently at all times, and none of the prediction models requires
more than 60 training samples to improve over a Lookup Table model.
(b) Results on TransNAS-Bench-101. Since all network architectures are purely sequential by design, we do not
evaluate predictors that specifically encode the architecture connectivity (BANANAS, BONAS, GCN, NAO).
After as few as 20 training samples, MLP models outclass all other predictors.
Figure 1: How well the different predictors rank the test architectures, depending on the training
set size and averaged over the five selected datasets. Left plots: absolute Kendall’s Tau ranking
correlation, higher is better. Right plots: same as left, but centered on the predictor-average.
Due to their powerful rule-based approach, tree-based models perform much better given many
training samples. Under such circumstances, LGBoost is a candidate for the best predictor model.
Similarly, the predictions of Support Vector Machines also benefit strongly from more samples.
The models we find to perform best for most training set sizes are MLPs. They are among the top
predictors at almost all times in the HW-NAS-Bench, although tree-based models are competitive
given enough data. After around 3,000 training samples, thinner and deeper MLPs improve over the
wider and smaller ones. The path-encoding BANANAS model behaves similarly to a regular large
MLP but requires more samples to reach the same performance. This is interesting since, aside from
the data encoding, BANANAS is an ensemble of three large MLP models. Even though only the first
network layer is affected by the data encoding, the more complicated path-encoding proves harmful
when the connectivity of the architectures in the search space is fixed. On TransNAS-Bench-101,
MLPs perform exceptionally well. They are much better than any other tested predictor once more
than just 20 training samples are available. The small MLP model can achieve a KT correlation
of 80% with just 200 training samples, which takes the best non-network-based predictor (Support
Vector Machine) four times as many. They are also the only models that achieve a KT correlation of
over 90%, about 5% higher than the next best model (LGBoost).

HW-NAS-Bench devices: Raspi4, FPGA, Eyeriss, Pixel3, EdgeGPU. TransNAS-Bench-101 device: Tesla V100.

| Metric | Raspi4 | FPGA | Eyeriss | Pixel3 | EdgeGPU | Tesla V100 |
|---|---|---|---|---|---|---|
| latency | 0.45 (0.75) | 0.99 (0.97) | 0.99 (0.96) | 0.49 (0.78) | 0.21 (0.79) | 0.60 (0.70) |
| energy | | 0.99 (0.97) | 1.00 (0.99) | | 0.23 (0.79) | |
| arithmetic intensity | | | 0.84 (0.81) | | | |

Table 1: The Kendall’s Tau correlation of Lookup Tables and Linear Regression (in brackets, using
only 124 training samples) across metrics and devices. Lookup Tables perform only marginally
better on the FPGA and Eyeriss devices, but considerably worse in all other cases. More detailed
statistics are available in Appendix E.
Finally, the Lookup Table models (black horizontal lines) perform poorly in comparison to any other
predictor. Even though building such a model for HW-NAS-Bench datasets requires only 25 neighboring architectures, NAO and GCN perform better after just ten random samples. More than half
of the predictor models require fewer than 25 random samples, while the worst need at most 60. On
TransNAS-Bench-101, Lookup Tables perform comparatively better. Building one requires only 21
neighboring architectures, and it takes most models between 50 and 100 random training samples to
achieve better performance. When measured on a per-dataset basis, we find that the Lookup Table
models display a severe performance difference ranging from about 20% KT correlation (cifar10-edgegpu latency and Jigsaw) to over 70% (ImageNet16-120-eyeriss arithmetic intensity and Semantic Segmentation, see Appendix E). Other models prove to be much more stable.
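The figure of 25 neighboring architectures is consistent with one baseline cell plus every cell that differs from it in exactly one of the six operation slots (1 + 6 × 4 = 25). The sketch below builds an additive Lookup Table in that way. This construction is an assumption made for illustration; the construction actually used is described in Appendix B and may differ. The `toy_latency` function is a hypothetical, purely additive stand-in for a real measurement.

```python
from itertools import product

NUM_POSITIONS, NUM_OPS = 6, 5


def neighbours(baseline):
    """The baseline plus every architecture differing from it in one position."""
    yield tuple(baseline)
    for pos in range(NUM_POSITIONS):
        for op in range(NUM_OPS):
            if op != baseline[pos]:
                arch = list(baseline)
                arch[pos] = op
                yield tuple(arch)


def build_lookup_table(measure, baseline):
    """Store the cost change of each single-operation swap relative to the baseline."""
    base_cost = measure(baseline)
    deltas = {(pos, baseline[pos]): 0.0 for pos in range(NUM_POSITIONS)}
    for arch in neighbours(baseline):
        for pos in range(NUM_POSITIONS):
            if arch[pos] != baseline[pos]:
                deltas[(pos, arch[pos])] = measure(arch) - base_cost
    return base_cost, deltas


def predict(arch, base_cost, deltas):
    """Sum the stored per-position cost changes on top of the baseline cost."""
    return base_cost + sum(deltas[(pos, op)] for pos, op in enumerate(arch))


def toy_latency(arch):
    """Hypothetical, purely additive stand-in for a real measurement."""
    op_cost = (0.0, 0.1, 1.0, 3.0, 0.5)
    return 2.0 + sum(op_cost[op] * (pos + 1) for pos, op in enumerate(arch))


baseline = (0,) * NUM_POSITIONS
num_measurements = len(list(neighbours(baseline)))
base_cost, deltas = build_lookup_table(toy_latency, baseline)
max_error = max(
    abs(predict(a, base_cost, deltas) - toy_latency(a))
    for a in product(range(NUM_OPS), repeat=NUM_POSITIONS)
)
```

Such a table is exact only when the metric is additive over operations, as in the toy function. Interactions between operations on real hardware would break this additivity, which may explain the wide spread of Lookup Table correlations reported above.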
**Devices and Metrics** The previously described results are based on a specific selection of HW-NAS-Bench and TransNAS-Bench-101 datasets that are hard to fit for Lookup Table models. As
shown in Table 1, that is not always the case. The FPGA and Eyeriss hardware devices are very suitable for Lookup Tables, which can achieve an almost perfect ranking correlation on them. Nonetheless,
Linear Regression requires only 124 training samples to compete even there and is significantly better in every other case. We finally observe that the difficulty of fitting predictors primarily depends
on the hardware device, much more than on the measured metric.
5 EVALUATING THE PREDICTOR-GUIDED ARCHITECTURE SELECTION
Although the experiments in Section 4 greatly assist us in selecting a predictor, it is not clear what a
specific Kendall’s Tau correlation implies for the subsequent architecture selection. Given a perfect
hardware metric predictor (Kendall’s Tau = 1.0), we can expect that an ideal architecture search
process will select the architectures with the best tradeoff of accuracy and the hardware metric, i.e.,
the true Pareto front. On the other hand, imperfect predictions result in the selection of supposedly-best architectures that are wrongly believed to be better.
To study how hardware predictors affect NAS results, we extensively evaluate the selection of such
supposedly-best architectures in simulation. This approach can evaluate any combination of predictor quality, test set size, and dataset, without the technical difficulties of obtaining actual predictor
models that precisely match such requirements. Since the hardware and accuracy prediction models
are usually independent and can be studied in isolation, we use ground-truth accuracy values in all
cases.
**Simulating predictors** The main challenge of the simulation is to quickly and accurately model
predictor outputs. We base our simulation on how predictor-generated values deviate from their
ground-truth targets on the test set, which is explained in Figure 2 and further detailed in Appendix G. Since the simulated deviations are similar to those of actual predictors, simulated predictions are obtained by drawing random values from this deviation distribution and adding them to
the ground-truth hardware measurements.
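The core of this simulation step can be sketched in a few lines. For simplicity, the sketch draws deviations from a single normal distribution, whereas the paper uses a mixed distribution detailed in Appendix G; this substitution, the synthetic ground-truth values, and the function name are assumptions.

```python
import random


def simulate_predictions(true_values, std, seed=0):
    """Add zero-mean Gaussian deviations to normalized ground-truth values."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, std) for v in true_values]


# Hypothetical normalized latencies; real values would come from the benchmark.
data_rng = random.Random(1)
true_latency = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]

perfect = simulate_predictions(true_latency, std=0.0)  # ideal predictor
noisy = simulate_predictions(true_latency, std=0.5)    # inaccurate predictor
deviations = [p - t for p, t in zip(noisy, true_latency)]
```

Since the targets are normalized, the chosen standard deviation directly controls the predictor quality, and a ranking correlation such as Kendall's Tau can be computed between `noisy` and `true_latency` to relate the two.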
[Figure 2 plots omitted. Left: "Predictions and targets" (KT=0.73, SCC=0.90, PCC=0.88), predicted values over true values. Center: "Predictor deviations", deviation of the predictions with a normal fit (std=0.477). Right: "Simulated predictor deviations", deviation of the simulated predictions from a mixed distribution generated with std=0.5, with a normal fit (std=0.500).]
Figure 2: A trained XGBoost prediction model on normalized ImageNet16-120 raspi4-latency test
data. Left: The latency prediction (y-axis) for every architecture (blue dot) is approximately correct
(red line). Center: The same data as on the left, the distribution of deviations made by the predictor
(blue) and a normal distribution fit to them (orange). Right: A mixed distribution can simulate
real deviation distributions such as that in the center plot.
[Figure 3 plots omitted. Left: all architectures, selected architectures, true Pareto front, and predicted Pareto front over the normalized ImageNet16-120-raspi4_latency. Right: true Pareto front (HV=2.93), discovered Pareto front (HV=2.86), and selected architectures (MRAall = 1.06%, MRApareto = 0.43%) over the normalized ImageNet16-120-raspi4_latency.]
Figure 3: An example of predictor-guided architecture selection, std=0.5. Left: The simulated predictor makes an inaccurate latency prediction for each architecture (blue), resulting in the selection
of the supposedly-best architectures (orange dots). Even though the predicted Pareto front (orange
line) may differ significantly from the ground-truth Pareto front (red line), most selected architectures are close to optimal. Right: Same data as left. The true Pareto front (red) and that of the
selected architectures (orange). Simply accepting all selected architectures results in a Mean Reduction of Accuracy (MRA) of 1.06%, while verifying the predictions and discarding inferior results
improves that to 0.43%. The hypervolume (HV, area under the Pareto-fronts) is reduced by 0.07.
A single example of a simulation can be seen in Figure 3. Although most selected architectures
(orange) are close to the true optimum (red Pareto front), there almost always exists an architecture that has superior accuracy and, at most, the same latency. Simply accepting the 13 selected
architectures in this particular example results in a mean reduction of accuracy (MRAall) of 1.06%.
In other words, the average selected architecture has 1.06% lower accuracy than a comparable one
on the true Pareto front. However, simply verifying the hardware metric predictions through actual
measurements reveals that some selected architectures are suboptimal. By choosing only the Pareto
subset of the selection, the opportunity loss can be reduced to 0.43% (MRApareto).
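The selection and verification steps can be illustrated with a toy example. The MRA computation below follows the verbal description above (the accuracy gap between a selected architecture and the best architecture of equal or lower true latency, averaged over the selection); the exact definition is given in Appendix H and may differ in detail. The data and function names are hypothetical.

```python
def pareto_indices(latency, accuracy, candidates=None):
    """Indices of non-dominated architectures (lower latency, higher accuracy)."""
    if candidates is None:
        candidates = range(len(latency))
    # Sort by latency; break latency ties by descending accuracy.
    order = sorted(candidates, key=lambda i: (latency[i], -accuracy[i]))
    front, best_acc = [], float("-inf")
    for i in order:
        if accuracy[i] > best_acc:
            front.append(i)
            best_acc = accuracy[i]
    return front


def mean_reduction_of_accuracy(selected, latency, accuracy):
    """Average accuracy gap to the best architecture of equal or lower latency."""
    gaps = []
    for i in selected:
        best = max(a for l, a in zip(latency, accuracy) if l <= latency[i])
        gaps.append(best - accuracy[i])
    return sum(gaps) / len(gaps)


# Toy data: true latency, inaccurate predicted latency, ground-truth accuracy.
true_lat = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_lat = [1.0, 2.5, 1.5, 4.0, 5.0]  # architecture 2 is predicted far too fast
accuracy = [0.60, 0.70, 0.65, 0.75, 0.80]

selected = pareto_indices(pred_lat, accuracy)            # believed to be optimal
verified = pareto_indices(true_lat, accuracy, selected)  # after re-measuring
mra_all = mean_reduction_of_accuracy(selected, true_lat, accuracy)
mra_pareto = mean_reduction_of_accuracy(verified, true_lat, accuracy)
```

Here the predictor wrongly places architecture 2 on the front, so accepting the whole selection costs accuracy on average, while re-measuring the five selected architectures exposes and removes it. In this small example the verified subset happens to coincide with the true Pareto front; in general it does not, which is why MRApareto stays above zero in the paper's results.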
An important property of this approach is that it is independent of any particular optimization
method. The supposedly-best architectures are always correctly identified, which is an upper bound
on how well Bayesian Optimization, Evolutionary Algorithms, and other approaches can perform.
The exemplary MRApareto opportunity loss of 0.43% is therefore unavoidable and depends solely
on the hardware metric predictor, the dataset, and the number of considered architectures.
**Results** We simulate 1,000 architecture selections for each of the five chosen HW-NAS-Bench
datasets, six different test set sizes, and eleven distribution standard deviations between 0.0 and 1.0.
As exemplarily shown in Figure 3, each such simulation allows us to compute the mean reduction
in accuracy (MRA) and the hypervolume (HV) under the Pareto fronts. The most important insights
are visualized in Figure 4 and summarized below.
[Figure 4 plots omitted. X-axis of every panel: std. of prediction deviations / Kendall's Tau, with ticks 0.0/1.00, 0.2/0.88, 0.4/0.78, 0.6/0.69, 0.8/0.62, 1.0/0.57. Panel titles: "mean over all datasets" and "mean over any number of architectures". Left: all selected architectures and the Pareto set of the selected architectures. Center: ImageNet16-120-eyeriss_arithmetic_intensity, ImageNet16-120-raspi4_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, cifar100-pixel3_latency, and their mean. Right: 100, 500, 1000, 2000, 5000, and 15625 considered architectures.]
Figure 4: Simulation results, with the standard deviation of the predictor deviations and the resulting
KT correlation on the x-axis. Left: Verifying the hardware predictions can significantly improve the
results, even more so for better predictors. Center: The drops in average accuracy are dependent on
the dataset and hardware metric. Right: Considering more candidate architectures and using better
prediction models improves the results; larger values are better.
Verifying the predicted results matters (Figure 4, left). The best prediction models achieve a KT
correlation of almost 0.9, which translates to a mean reduction in accuracy of MRAall ≈ 1.5%. That
means, for each selected architecture, there exists an architecture of equal or lower latency in the
true Pareto set (if latency is the hardware metric) that improves the average accuracy by 1.5%. Even
though all selected architectures are believed to form a Pareto set, that is not the case. Their optimal
subset has a reduction of only MRApareto ≈ 0.5%, a significant improvement. However, finding
this optimal subset requires actually measuring the hardware metrics of the architectures selected by
the used NAS method.
Furthermore, the left of Figure 4 aids in anticipating the MRA given a specific predictor. If one used
e.g. BOHAMIANN (KT ≈ 0.8, see Figure 1a) instead of MLPs or LGBoost (KT ≈ 0.9), MRApareto
increases from around 0.5% to roughly 1.2%. The average accuracy of the selected architectures is
thus reduced by another 0.7%, just by using an unsuitable hardware metric predictor. Lookup Tables
(KT ≈ 0.45) are not even visualized anymore; they have an MRApareto of over 2.5%.
Another interesting observation is that the gap between MRAall and MRApareto is wider for better
predictors. This is a shortcoming of the MRA metric that we elaborate on in Appendix H.
The dataset and metric matter (Figure 4, center). While we generally present the results averaged
over datasets, there exists some discrepancy among them. Most interestingly, predicting hardware
metrics on harder classification problems (ImageNet16-120 is harder than CIFAR10) also results in
a higher MRA. This is especially important since MRA is an absolute accuracy reduction. Even
though the CIFAR10 networks achieve twice the accuracy of ImageNet16-120 networks, they lose
less absolute accuracy to imperfect predictions. The ordering of the datasets by MRA is mostly stable for
any predictor KT correlation. Finally, as visualized by the shaded areas, the standard deviation
of the MRA is generally huge. Consequently, predictor-guided NAS is very likely to produce
results of varying quality for each different predictor or search attempt, especially with less accurate
predictors.
The number of considered architectures matters (Figure 4, right). We measure the hypervolume of
the discovered Pareto front (i.e., the area beneath it, see Appendix H), which, unlike MRA, also
considers the hardware metric. Quite obviously, if the architectures from the true Pareto set are not
considered, they cannot be selected. To achieve the highest possible hypervolume of around 4.2
(i.e., find the true Pareto set), every architecture in the search space must be evaluated with a perfect
predictor. This is impossible in most real-world cases, where only a tiny fraction of all possible
architectures can ever be considered.
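To make the hypervolume concrete, the sketch below computes the area beneath a two-dimensional front of (accuracy, latency) points. It is only an illustration under stated assumptions: the reference point (`ref_latency`, `ref_accuracy`), the function name, and the toy values are hypothetical, and the definition used for the reported results is the one in Appendix H.

```python
from typing import Iterable, Tuple


def hypervolume_2d(
    archs: Iterable[Tuple[float, float]],
    ref_latency: float,
    ref_accuracy: float = 0.0,
) -> float:
    """Area dominated by (accuracy, latency) points w.r.t. a reference point."""
    # Sweep from low to high latency and track the best accuracy seen so far.
    points = sorted(
        (lat, acc) for acc, lat in archs
        if lat < ref_latency and acc > ref_accuracy
    )
    area, best_acc, prev_lat = 0.0, ref_accuracy, None
    for lat, acc in points:
        if prev_lat is not None:
            area += (lat - prev_lat) * (best_acc - ref_accuracy)
        best_acc = max(best_acc, acc)
        prev_lat = lat
    if prev_lat is not None:
        area += (ref_latency - prev_lat) * (best_acc - ref_accuracy)
    return area


# Hypothetical (accuracy, latency) candidates.
candidates = [(0.5, 1.0), (0.75, 2.0), (0.625, 3.0), (0.875, 4.0)]
print(hypervolume_2d(candidates, ref_latency=5.0))      # 2.875
print(hypervolume_2d(candidates[:3], ref_latency=5.0))  # 2.75
```

Dropping the most accurate architecture from the candidates reduces the area from 2.875 to 2.75 in this toy example, which mirrors the point that architectures that are never considered cannot contribute to the front.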
For HW-NAS-Bench, considering 5000 architectures with perfect live measurements and predicting
the metrics for all 15625 with ranking correlation KT ≈ 0.73 results in selecting equivalent sets of
architectures. As seen in Figure 1a, Ridge Regression can achieve this performance with fewer
than 100 training samples. Thus, a worse predictor leads to better results if it enables considering
more architectures. This insight is especially crucial for live measurements, which are accurate but
slow. Similarly, estimating network accuracies with super-networks takes much more time than
predicting them with a neural predictor (Wen et al., 2020). If measuring any
metric is the limiting factor, a selection guided by a cheap predictor is likely to do better.
-----
6 DISCUSSION
**Chosen prediction methods** Given the nature of hardware-metric prediction, only the subset of
model-based predictors evaluated by White et al. (2021) is suitable. We extended this subset with
four models, including the popular Lookup Table. We abstained from evaluating layer-wise predictors (e.g., Wess et al. (2021)) since the layer-wise data they require is not available, and meta-learning predictors (Lee
et al., 2021) due to the vast number of ways to configure them. A separate and specialized comparison
between classic and meta-learning predictors seems preferable to us.
**Simulation limitations** In contrast to evaluating real predictors, the simulation allows us to
quickly make statements for any test set size and predictor inaccuracy. However, the results are
naturally only approximations. While they are close to the actual values, they are generally slightly
pessimistic (see Appendix I). We also limit the simulation to HW-NAS-Bench since the changes to
classification results are easier to interpret than changes to loss values across different problem
types. Finally, the current simulation approach cannot investigate methods that
absolutely require a trained one-shot network, such as gradient-based approaches. Including such
methods is an interesting direction for future research.
**Transferability of the results** Our evaluation includes five challenging and diverse datasets
based on the micro-level search space of HW-NAS-Bench and five latency-based datasets of various macro-level search space architectures in TransNAS-Bench-101. Nonetheless, we find shared
trends: All tested prediction models improve over Lookup Tables with small amounts of training
data. Furthermore, most predictors benefit from more training data, even until the entire search
space (aside from the test set) is known. We also find that network-based predictors are generally
best but may be challenged by tree-based predictors if enough training data is available. Given only
a few samples, Ridge Regression performs better than most other models.
**Recommendations** While Lookup Tables are a cheap, simple, and popular model in gradient-based architecture selection, we find a significant variance in performance across tasks and devices
(see Table 1 and Appendix E). We recommend replacing such models with either MLPs or Ridge
Regression, which are more stable, fully differentiable, and often take less than 100 training samples
to achieve better results.
For most realistic scenarios where more than 100 training samples are available, MLP models are
the most promising. They are among the top predictors on HW-NAS-Bench and demonstrate outstanding performance on the TransNAS-Bench-101 datasets. We found that specialized architecture
encodings are primarily beneficial when little training data is available but suspect that they enjoy an additional
advantage when network topologies are more complex and diverse (White et al., 2021).
While the query time for all predictors is less than 0.05s and thus negligible, there is a notable
difference in training time (see Appendix F), primarily due to the hyper-parameter optimization. If
training time is an important factor, we recommend Ridge Regression for very small amounts of
training data and LGBoost otherwise.
If a NAS method selects architectures based on hardware metric predictions, we strongly suggest
verifying the results by measuring the true metric value afterward. Doing so may eliminate inferior
candidates and improve the average result substantially. Finally, if the limiting factor to a NAS
method is the slow measurement of hardware metrics, using a much faster predictor may lead to an
improvement, even if the prediction model is less accurate.
7 CONCLUSIONS
This work evaluated various hardware-metric prediction models on ten problems of different metrics, devices, and network architecture types. We then simulated the selection process for different
test set sizes and predictor inaccuracies to improve our understanding of predictor-based architecture selection. We find that even imperfect predictors may improve NAS if their low query time
enables considering more candidate architectures. Finally, verifying the predictions for the selected
candidates can lead to a drastic improvement of their average performance. The code and results are
made available, serving both as a recommendation and as a baseline for future work.
-----
REFERENCES
Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smaïl Niar, Martin Wistuba, and
Naigang Wang. A Comprehensive Survey on Hardware-Aware Neural Architecture Search.
_CoRR_, abs/2101.09336, 2021. URL https://arxiv.org/abs/2101.09336.
Christopher M. Bishop. _Pattern recognition and machine learning, 5th Edition_. Information science
and statistics. Springer, 2007. ISBN 9780387310732. URL
https://www.worldcat.org/oclc/71008143.
Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on
Target Task and Hardware. In _7th International Conference on Learning Representations,_
_ICLR 2019, New Orleans, LA, USA, May 6-9, 2019_. OpenReview.net, 2019. URL
https://openreview.net/forum?id=HylVB3AqYm.
Joaquin Quiñonero Candela and Carl Edward Rasmussen. A Unifying View of Sparse Approximate
Gaussian Process Regression. _J. Mach. Learn. Res._, 6:1939–1959, 2005. URL
http://jmlr.org/papers/v6/quinonero-candela05a.html.
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In _Proceedings of the_
_22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining_, KDD
’16, pp. 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi:
10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.
Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an
alternative to the cifar datasets. _arXiv preprint arXiv:1707.08819_, 2017.
Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, and Ruijun Xu. ScarletNAS: Bridging the
Gap Between Scalability and Fairness in Neural Architecture Search. _CoRR_, abs/1908.06022,
2019a. URL http://arxiv.org/abs/1908.06022.
Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. FairNAS: Rethinking Evaluation Fairness
of Weight Sharing Neural Architecture Search. _CoRR_, abs/1907.01845, 2019b. URL
http://arxiv.org/abs/1907.01845.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. _Machine learning_,
20(3):273–297, 1995.
Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong
Tian, Matthew Yu, Peter Vajda, and Joseph E. Gonzalez. FBNetV3: Joint Architecture-Recipe
Search using Neural Acquisition Function. _CoRR_, abs/2006.02049, 2020. URL
https://arxiv.org/abs/2006.02049.
Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the Scope of Reproducible Neural
Architecture Search. In _8th International Conference on Learning Representations, ICLR 2020,_
_Addis Ababa, Ethiopia, April 26-30, 2020_. OpenReview.net, 2020. URL
https://openreview.net/forum?id=HJxyZkBKDr.
Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. NATS-Bench: Benchmarking NAS
Algorithms for Architecture Topology and Size. _arXiv preprint arXiv:2009.00437_, 2020.
Tony Duan, Avati Anand, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Y. Ng, and
Alejandro Schuler. Ngboost: Natural gradient boosting for probabilistic prediction. In _Proceedings of_
_the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual_
_Event_, volume 119 of _Proceedings of Machine Learning Research_, pp. 2690–2700. PMLR, 2020.
URL http://proceedings.mlr.press/v119/duan20a.html.
Yawen Duan, Xin Chen, Hang Xu, Zewei Chen, Xiaodan Liang, Tong Zhang, and Zhenguo Li.
TransNAS-Bench-101: Improving Transferability and Generalizability of Cross-Task Neural
Architecture Search. _CoRR_, abs/2105.11871, 2021. URL https://arxiv.org/abs/2105.11871.
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single
Path One-Shot Neural Architecture Search with Uniform Sampling. In _European Conference_
_on Computer Vision_, pp. 544–560. Springer, 2020. URL http://arxiv.org/abs/1904.00420.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image
Recognition. In _Proceedings of the IEEE conference on computer vision and pattern recognition_,
pp. 770–778, 2016. URL http://arxiv.org/abs/1512.03385.
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for
Mobile Vision applications. _CoRR_, abs/1704.04861, 2017. URL
http://arxiv.org/abs/1704.04861.
Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, and Dahua Lin.
DSNAS: Direct Neural Architecture Search without Parameter Retraining. In _Proceedings of the_
_IEEE/CVF Conference on Computer Vision and Pattern Recognition_, pp. 12084–12092, 2020.
URL http://arxiv.org/abs/2002.09128.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and
Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Isabelle Guyon, Ulrike
von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman
Garnett (eds.), _Advances in Neural Information Processing Systems 30: Annual Conference on_
_Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA_, pp.
3146–3154, 2017. URL
https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html.
Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced
Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. HELP: Hardware-Adaptive Efficient
Latency Predictor for NAS via Meta-Learning. _CoRR_, abs/2106.08630, 2021. URL
https://arxiv.org/abs/2106.08630.
Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue
Wang, and Yingyan Lin. HW-NAS-Bench: Hardware-Aware Neural Architecture Search
Benchmark. _CoRR_, abs/2103.10584, 2021a. URL https://arxiv.org/abs/2103.10584.
Guihong Li, Sumit K. Mandal, Ümit Y. Ogras, and Radu Marculescu. FLASH: Fast Neural
Architecture Search with Hardware Optimization. _CoRR_, abs/2108.00568, 2021b. URL
https://arxiv.org/abs/2108.00568.
Liam Li and Ameet Talwalkar. Random Search and Reproducibility for Neural Architecture Search.
In _Uncertainty in Artificial Intelligence_, pp. 367–377. PMLR, 2020.
Andy Liaw, Matthew Wiener, et al. Classification and Regression by randomForest. _R news_,
2(3):18–22, 2002.
Marius Lindauer and Frank Hutter. Best Practices for Scientific Research on Neural Architecture
Search. _Journal of Machine Learning Research_, 21(243):1–18, 2020.
Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural Architecture Optimization.
In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi,
and Roman Garnett (eds.), _Advances in Neural Information Processing Systems 31: Annual_
_Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,_
_Montréal, Canada_, pp. 7827–7838, 2018. URL
https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html.
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical Guidelines
for Efficient CNN Architecture Design. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu,
and Yair Weiss (eds.), _Computer Vision - ECCV 2018 - 15th European Conference,_
_Munich, Germany, September 8-14, 2018, Proceedings, Part XIV_, volume 11218 of _Lecture Notes_
_in Computer Science_, pp. 122–138. Springer, 2018. doi: 10.1007/978-3-030-01264-9_8. URL
https://doi.org/10.1007/978-3-030-01264-9_8.
Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. Neural Architecture Search
without Training, 2020. URL http://arxiv.org/abs/2006.04647.
Daniel M. Mendoza and Sijin Wang. Predicting Latency of Neural Network Inference, 2020.
URL http://cs230.stanford.edu/projects_fall_2020/reports/55793069.pdf.
Niv Nayman, Yonathan Aflalo, Asaf Noy, and Lihi Zelnik. HardCoRe-NAS: Hard Constrained
diffeRentiable Neural Architecture Search. In Marina Meila and Tong Zhang (eds.), _Proceedings_
_of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual_
_Event_, volume 139 of _Proceedings of Machine Learning Research_, pp. 7979–7990. PMLR, 2021.
URL http://proceedings.mlr.press/v139/nayman21a.html.
Evgeny Ponomarev, Sergey A. Matveev, and Ivan V. Oseledets. LETI: Latency Estimation Tool and
Investigation of Neural Networks inference on Mobile GPU. _CoRR_, abs/2010.02871, 2020. URL
https://arxiv.org/abs/2010.02871.
Carl Edward Rasmussen. Gaussian Processes in Machine Learning. In Olivier Bousquet, Ulrike
von Luxburg, and Gunnar Rätsch (eds.), _Advanced Lectures on Machine Learning, ML Summer_
_Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16,_
_2003, Revised Lectures_, volume 3176 of _Lecture Notes in Computer Science_, pp. 63–71.
Springer, 2003. doi: 10.1007/978-3-540-28650-9_4. URL
https://doi.org/10.1007/978-3-540-28650-9_4.
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized Evolution for Image
Classifier Architecture Search, 2018. URL http://arxiv.org/abs/1802.01548.
Michael Ruchte, Arber Zela, Julien Siems, Josif Grabocka, and Frank Hutter. Naslib: A modular
and flexible neural architecture search library. https://github.com/automl/NASLib, 2020.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.
Mobilenetv2: Inverted residuals and linear bottlenecks. In _Proceedings of the IEEE conference on_
_computer vision and pattern recognition_, pp. 4510–4520, 2018.
Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm
in Dual Variables. In _Proceedings of the Fifteenth International Conference on Machine Learning_,
ICML ’98, pp. 515–521, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN
1558605568.
Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James T. Kwok, and Tong Zhang. Bridging
the Gap between Sample-based and One-shot Neural Architecture Search with BONAS. In
Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien
Lin (eds.), _Advances in Neural Information Processing Systems 33: Annual Conference on_
_Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual_, 2020.
URL https://proceedings.neurips.cc/paper/2020/hash/13d4635deccc230c944e4ff6e03404b5-Abstract.html.
Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter.
NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search, 2020.
Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian
Optimization with Robust Bayesian Neural Networks. In Daniel D. Lee, Masashi
Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), _Advances_
_in Neural Information Processing Systems 29: Annual Conference on Neural Information_
_Processing Systems 2016, December 5-10, 2016, Barcelona, Spain_, pp. 4134–4142, 2016. URL
https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html.
Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks. _CoRR_, abs/1905.11946, 2019. URL http://arxiv.org/abs/1905.11946.
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and
Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In _IEEE_
_Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,_
_USA, June 16-20, 2019_, pp. 2820–2828. Computer Vision Foundation / IEEE, 2019. doi:
10.1109/CVPR.2019.00293. URL
http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html.
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu,
Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, and Joseph E. Gonzalez. FBNetV2: Differentiable
Neural Architecture Search for Spatial and Channel Dimensions. In _2020 IEEE/CVF Conference_
_on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020_,
pp. 12962–12971. IEEE, 2020. doi: 10.1109/CVPR42600.2020.01298. URL
https://doi.org/10.1109/CVPR42600.2020.01298.
Ruochen Wang, Xiangning Chen, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh.
RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving.
_CoRR_, abs/2108.08019, 2021. URL https://arxiv.org/abs/2108.08019.
Wei Wen, Hanxiao Liu, Yiran Chen, Hai Helen Li, Gabriel Bender, and Pieter-Jan Kindermans.
Neural Predictor for Neural Architecture Search. In Andrea Vedaldi, Horst Bischof, Thomas
Brox, and Jan-Michael Frahm (eds.), _Computer Vision - ECCV 2020 - 16th European Conference,_
_Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX_, volume 12374 of _Lecture Notes in_
_Computer Science_, pp. 660–676. Springer, 2020. doi: 10.1007/978-3-030-58526-6_39. URL
https://doi.org/10.1007/978-3-030-58526-6_39.
Matthias Wess, Matvey Ivanov, Christoph Unger, Anvesh Nookala, Alexander Wendt, and Axel
Jantsch. ANNETTE: Accurate Neural Network Execution Time Estimation With Stacked Models.
_IEEE Access_, 9:3545–3556, 2021. ISSN 2169-3536. doi: 10.1109/access.2020.3047259. URL
http://dx.doi.org/10.1109/ACCESS.2020.3047259.
Colin White, Willie Neiswanger, and Yash Savani. BANANAS: Bayesian Optimization with Neural
Architectures for Neural Architecture Search. _arXiv preprint arXiv:1910.11858_, 2019.
Colin White, Arber Zela, Binxin Ru, Yang Liu, and Frank Hutter. How Powerful are Performance
Predictors in Neural Architecture Search? _CoRR_, abs/2104.01177, 2021. URL
https://arxiv.org/abs/2104.01177.
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong
Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-Aware Efficient ConvNet
Design via Differentiable Neural Architecture Search. In _IEEE Conference on Computer Vision_
_and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019_, pp. 10734–10742.
Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01099. URL
http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html.
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong.
Latency-Aware Differentiable Neural Architecture Search. _CoRR_, abs/2001.06392, 2020. URL
https://arxiv.org/abs/2001.06392.
Antoine Yang, Pedro M Esperança, and Fabio M Carlucci. Nas evaluation is frustratingly hard.
_arXiv preprint arXiv:1912.12522_, 2019.
Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze,
and Hartwig Adam. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile
Applications. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.),
_Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14,_
_2018, Proceedings, Part X_, volume 11214 of _Lecture Notes in Computer Science_, pp. 289–304.
Springer, 2018. doi: 10.1007/978-3-030-01249-6_18. URL
https://doi.org/10.1007/978-3-030-01249-6_18.
Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek
Abdelzaher. FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time
on Mobile and Embedded Devices. In _Proceedings of the 16th ACM Conference on Embedded_
_Networked Sensor Systems_, pp. 278–291, 2018.
Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter.
NAS-Bench-101: Towards Reproducible Neural Architecture Search. In Kamalika Chaudhuri and
Ruslan Salakhutdinov (eds.), _Proceedings of the 36th International Conference on Machine_
_Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA_, volume 97 of _Proceedings of_
_Machine Learning Research_, pp. 7105–7114. PMLR, 2019. URL
http://proceedings.mlr.press/v97/ying19a.html.
Kaicheng Yu, René Ranftl, and Mathieu Salzmann. How to Train Your Super-Net: An Analysis
of Training Heuristics in Weight-Sharing NAS. _CoRR_, abs/2003.04276, 2020. URL
https://arxiv.org/abs/2003.04276.
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely
Efficient Convolutional Neural Network for Mobile Devices. In _2018 IEEE Conference_
_on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA,_
_June 18-22, 2018_, pp. 6848–6856. IEEE Computer Society, 2018. doi:
10.1109/CVPR.2018.00716. URL
http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures
for scalable image recognition. In _Proceedings of the IEEE conference on computer vision and_
_pattern recognition_, pp. 8697–8710, 2018. URL http://arxiv.org/abs/1707.07012.
-----
A BEST PRACTICES FOR NAS, CODE AND DATA
To improve reproducibility and facilitate fair experimental comparisons, we follow the best-practices checklist (Lindauer & Hutter, 2020):
• _Release Code for the Training Pipeline(s) you use._ Our experiments are based on White
et al. (2021), who use NASLib (Ruchte et al., 2020) to compare 31 methods for accuracy
prediction. Our NASLib fork, extending the framework for HW-NAS-Bench, TransNAS-Bench,
some performance predictors, and the hypervolume simulations, is provided in the
supplementary materials. We intend to either make our fork available on GitHub or submit
the changes upstream once this paper is accepted/published.
• _Use the Same Evaluation Protocol for the Methods Being Compared._ Aside from the
implementation of each predictor, all experiments use the same pipeline.
• _Validate The Results Several Times._ We ran each predictor 50 times, with seeds
{0, ..., 49}. The reductions in hypervolume are simulated 1000 times, each time using a
different subset of the data set, for each combination of {iteration, HW-NAS data set, noise
on HW metric}.
• _Control Confounding Factors._ While all experiments used the same software libraries
and hardware resources, they were run on different machines to speed up the evaluation.
We found hardly any benefit in using a GPU even for the network-based predictors, which
is why every method only used two CPU cores. The OS is Ubuntu 18.04; notable software
packages are PyTorch 1.9.0, numpy 1.19.5, scikit-learn 0.24.2, pybnn 0.0.5, ngboost 0.3.11,
and xgboost 1.4.2.
• _Report the Use of Hyperparameter Optimization._ See Appendix C.
In addition to the code in the supplementary materials, we also provide the experimental results as
csv files. Running the predictors and hypervolume simulations takes some time, but the easy access
to the data of the finished experiments may prove useful for future research. Please see readme.md
in the accompanying code zip file for instructions.
B ENCODINGS AND PREDICTORS
B.1 DATA ENCODINGS
Every architecture a ∈ A requires a unique representation, which depends on the used predictor.
The common encoding types are:
**Adjacency one-hot:** Each architecture a is uniquely defined by the chosen candidate operation on
every path. For example, each architecture in NAS-Bench-201 consists of a repeated cell structure,
which has five candidate operations on each of the six paths. Therefore there are $5^6 = 15625$ unique
architectures, which can each be referenced by a sequence of operation-indices such as [0 1 2 3 4 0].
Many predictors perform better if the sequence is presented as a one-hot encoding, which is in this
case [10000 01000 00100 00010 00001 10000].
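The conversion from operation indices to the adjacency one-hot encoding can be sketched in a few lines. The function name `adjacency_one_hot` is hypothetical; the five candidate operations and six paths follow the NAS-Bench-201 description above.

```python
from typing import List, Sequence

NUM_OPERATIONS = 5  # candidate operations per path in NAS-Bench-201
NUM_PATHS = 6       # paths per cell


def adjacency_one_hot(
    op_indices: Sequence[int], num_ops: int = NUM_OPERATIONS
) -> List[int]:
    """Concatenate one one-hot vector of length `num_ops` per path."""
    encoding: List[int] = []
    for idx in op_indices:
        if not 0 <= idx < num_ops:
            raise ValueError(f"operation index {idx} out of range")
        one_hot = [0] * num_ops
        one_hot[idx] = 1
        encoding.extend(one_hot)
    return encoding


encoded = adjacency_one_hot([0, 1, 2, 3, 4, 0])
groups = [
    "".join(str(bit) for bit in encoded[i:i + NUM_OPERATIONS])
    for i in range(0, len(encoded), NUM_OPERATIONS)
]
print(" ".join(groups))             # 10000 01000 00100 00010 00001 10000
print(NUM_OPERATIONS ** NUM_PATHS)  # 15625 unique architectures
```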
Similarly, the path-encoding (used by BANANAS) is a one-hot representation over the candidate operations used on all possible paths. Since the connectivity within cells for HW-NAS-Bench and
TransNAS-Bench-101 is fixed, it provides no more information than the adjacency one-hot encoding. If the connectivity can be adjusted more freely, as in the NAS-Bench-101 search space, the
additional information may improve the fit.
The encodings for BONAS, GCN, and NAO each provide further information in addition to the
adjacency one-hot vector, most notably the adjacency matrix. This $\{0, 1\}^{(N+2) \times (N+2)}$ matrix
describes which of the $N$ architecture paths (rows) serve as inputs for each other path (columns),
and also includes the input and output.
-----
B.2 PREDICTORS
We briefly describe the 18 predictor methods in our experiments. We adopt their implementations
from the NASLib library (see Appendix A), which we extend with Linear Regression, Ridge Regression, and Support Vector Machines from the scikit-learn package; and a simple Lookup Table
implementation. Unless specified otherwise, the methods use the adjacency one-hot encoding.
• **BANANAS:** An ensemble of three MLP models with five to 20 layers, each using the path encoding (White et al., 2019).
• **Bayesian Linear Regression:** A Bayesian model that assumes (1) a linear dependency between inputs and outputs, and (2) that the samples are normally distributed (Bishop, 2007).
• **BOHAMIANN:** A Bayesian inference predictor using stochastic gradient Hamiltonian Monte Carlo (SGHMC) to sample from a Bayesian neural network (Springenberg et al., 2016).
• **BONAS:** Bayesian Optimization for NAS (Shi et al., 2020) uses a GCN predictor within an outer loop of Bayesian optimization, as a meta-learning task. The GCN requires encoding the adjacency matrix of each architecture.
• **Gaussian Process:** A simple model that assumes a joint Gaussian distribution underlying the training data (Rasmussen, 2003).
• **GCN:** A Graph Convolutional Network that makes use of an adjacency-matrix encoding of each architecture (Wen et al., 2020).
• **Linear Regression:** A simple model that assumes an independent value/cost for each operation/layer, which only need to be summed up. Unlike the Lookup Table model, it uses a least-squares fit on the training data.
• **Lookup Table:** The simplest and perhaps most widely used model for differentiable architecture selection. It generally assumes a single baseline architecture (e.g. [001 001] in adjacency one-hot encoding), and a lookup matrix in $\mathbb{R}^{(\text{num layers}) \times (\text{num candidates})}$ that contains the increases/reductions in the metric for each layer and candidate operation. The metric value of a new architecture can be predicted with a simple sum over the respective matrix entries and the baseline value. The model is obtained either by measuring each candidate operation in isolation, or by computing the differences between the baseline architecture and specific variations (e.g. [010 001] or [100 001], to measure the first candidates). This model always requires 1 + (num layers) · (num candidates − 1) neighboring architectures to fit. We detail the resulting correlation values for each used dataset in Appendix E.
• **LGBoost:** Light Gradient Boosting Machine (LightGBM or LGBoost, Ke et al. (2017)) is a lightweight gradient-boosted decision tree model.
• **MLP:** We use fully-connected Multi-Layer Perceptrons in two size categories.
• **NAO:** NAO (Luo et al., 2018) uses an encoder-decoder topology, which encodes/compresses an architecture to a continuous representation, and decodes it again. This representation is further used to make architecture predictions.
• **NGBoost:** Natural Gradient Boosting (NGBoost, Duan et al. (2020)) is a gradient-boosted decision tree model that uses natural gradients to estimate uncertainty.
• **Ridge Regression:** Ridge Regression (Saunders et al., 1998) extends the Linear Regression least-squares fit with a regularization term that serves as a bias-variance tradeoff.
• **Random Forests:** An ensemble of decision trees (Liaw et al., 2002).
• **Sparse Gaussian Process:** An approximation of Gaussian Processes that summarizes training data (Candela & Rasmussen, 2005).
• **Support Vector Machine:** A model that maps its inputs to a high-dimensional space, where training samples are used as support vectors for decision boundaries (Cortes & Vapnik, 1995).
• **XGBoost:** eXtreme Gradient Boosting (XGBoost, Chen & Guestrin (2016)) is a gradient-boosted decision tree model.
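To make the Lookup Table description more concrete, the following is a minimal sketch and not the implementation used in the experiments. It assumes architectures are given as tuples with one candidate index per layer (rather than the one-hot vectors used above), and `measure`, `toy_metric`, and `OP_COST` are hypothetical stand-ins for a benchmark query. The table is fit from the baseline and all of its one-operation neighbors, and a prediction sums the respective entries on top of the baseline value.

```python
def fit_lookup_table(measure, baseline, num_candidates):
    """Fit a lookup table from the baseline and its one-operation neighbors.

    measure: callable mapping an architecture (tuple with one candidate index
             per layer) to a metric value (hypothetical benchmark query).
    Returns the baseline value, the difference table, and the number of
    architectures that had to be measured.
    """
    base_value = measure(tuple(baseline))
    num_measured = 1
    table = [[0.0] * num_candidates for _ in baseline]
    for layer, base_op in enumerate(baseline):
        for op in range(num_candidates):
            if op == base_op:
                continue  # the baseline entry keeps a difference of 0
            neighbor = list(baseline)
            neighbor[layer] = op
            table[layer][op] = measure(tuple(neighbor)) - base_value
            num_measured += 1
    return base_value, table, num_measured


def predict(base_value, table, arch):
    """Baseline value plus the per-layer differences of the chosen operations."""
    return base_value + sum(table[layer][op] for layer, op in enumerate(arch))


# Hypothetical, perfectly additive metric: a fixed overhead plus one cost per operation.
OP_COST = [0.0, 0.1, 1.0, 3.0, 0.5]


def toy_metric(arch):
    return 2.0 + sum(OP_COST[op] for op in arch)


base_value, table, num_measured = fit_lookup_table(
    toy_metric, baseline=(0,) * 6, num_candidates=5
)
print(num_measured)  # 1 + 6 * (5 - 1) = 25
print(predict(base_value, table, (3, 3, 0, 1, 2, 4)), toy_metric((3, 3, 0, 1, 2, 4)))
```

On this additive toy metric the prediction matches the measurement up to floating-point error. On real hardware metrics, interactions between operations are not captured by the per-layer differences.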
-----
C HYPERPARAMETERS
We list the default values and sampling ranges of our hyper-parameters in Table 2. For comparability with White
et al. (2021), we only change the values of newly introduced parameterized predictors: Ridge Regression, Support Vector Machines, and small MLPs.
| Model | Hyper-parameter | Range/Choice | Log-transform | Default |
|---|---|---|---|---|
| BANANAS | Num. layers | [5, 25] | false | 20 |
| | Layer width | [5, 25] | false | 20 |
| | Learning rate | [0.0001, 0.1] | true | 0.001 |
| BONAS | Num. layers | [16, 128] | true | 64 |
| | Batch size | [32, 256] | true | 128 |
| | Learning rate | [0.00001, 0.1] | true | 0.0001 |
| GCN | Num. layers | [64, 200] | true | 144 |
| | Batch size | [5, 32] | true | 7 |
| | Learning rate | [0.00001, 0.1] | true | 0.0001 |
| | Weight decay | [0.00001, 0.1] | true | 0.0003 |
| LGBoost | Num. leaves | [10, 100] | false | 31 |
| | Learning rate | [0.001, 0.1] | true | 0.05 |
| | Feature fraction | [0.1, 1] | false | 0.9 |
| MLP (small) | Num. layers | [2, 5] | false | 3 |
| | Layer width | [16, 128] | true | 32 |
| | Learning rate | [0.0001, 0.1] | true | 0.001 |
| | Activation function | {relu, tanh, hardswish} | | relu |
| MLP (huge) | Num. layers | [5, 25] | false | 20 |
| | Layer width | [5, 25] | false | 20 |
| | Learning rate | [0.0001, 0.1] | true | 0.001 |
| NAO | Num. layers | [16, 128] | true | 64 |
| | Batch size | [32, 256] | true | 100 |
| | Learning rate | [0.00001, 0.1] | true | 0.001 |
| NGBoost | Num. estimators | [128, 512] | true | 64 |
| | Learning rate | [0.001, 0.1] | true | 0.081 |
| | Max depth | [1, 25] | false | 6 |
| | Max features | [0.1, 1] | false | 0.79 |
| Ridge Regression | Regularization α | [0.25, 2.5] | false | 1.0 |
| Random Forests | Num. estimators | [16, 128] | true | 116 |
| | Max features | [0.1, 0.9] | true | 0.17 |
| | Min samples (leaf) | [1, 20] | false | 2 |
| | Min samples (split) | [2, 20] | true | 2 |
| Support Vector Machine | Regularization C | [0.5, 1.5] | false | 1.0 |
| | Kernel | {linear, poly, rbf, sigmoid} | | rbf |
| XGBoost | Max depth | [1, 15] | false | 6 |
| | Min child weight | [1, 10] | false | 1 |
| | Col sample (tree) | [0, 1] | false | 1 |
| | Learning rate | [0.001, 0.5] | true | 0.3 |
| | Col sample (level) | [0, 1] | false | 1 |
Table 2: Hyper-parameter ranges and default values of the configurable predictors
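One plausible reading of the Log-transform column is that such parameters are sampled uniformly in log space, so that every order of magnitude within the range is equally likely. The sketch below illustrates this reading only; the function name `sample_value` and the sample counts are assumptions, and the actual sampling code is the one referenced in Appendix A.

```python
import math
import random


def sample_value(low, high, log_transform, rng):
    """Sample from [low, high], uniformly in log space if log_transform is set."""
    if log_transform:
        value = math.exp(rng.uniform(math.log(low), math.log(high)))
    else:
        value = rng.uniform(low, high)
    # Guard against floating-point overshoot at the interval borders.
    return min(max(value, low), high)


rng = random.Random(0)
# Learning rate of the small MLP: range [0.0001, 0.1], log-transformed.
learning_rates = [sample_value(0.0001, 0.1, True, rng) for _ in range(2000)]
# Feature fraction of LGBoost: range [0.1, 1], not log-transformed.
feature_fractions = [sample_value(0.1, 1.0, False, rng) for _ in range(2000)]

# With log-uniform sampling, about half of the values fall below the geometric mean.
geometric_mid = math.sqrt(0.0001 * 0.1)
share_below = sum(lr < geometric_mid for lr in learning_rates) / len(learning_rates)
print(f"share of learning rates below {geometric_mid:.5f}: {share_below:.2f}")
```

Without the log transform, a range such as [0.0001, 0.1] would place roughly 90% of the samples above 0.01, which rarely suits learning rates.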
-----
D NAS-BENCH-201 / HW-NAS-BENCH CELL DESIGN
[Diagram: shared cell topology with six numbered paths (1 to 6).]
Figure 5: Basic NAS-Bench-201 / HW-NAS cell design. Each of the six orange paths is finalized
with exactly one out of five candidate operations {Zero, Skip, Convolution 1×1, Convolution 3×3,
Average Pooling 3×3}.
E SELECTION OF DATASETS
| Dataset | Lin. Reg. (11) | Lin. Reg. (25) | Lin. Reg. (55) | Lin. Reg. (124) | Lin. Reg. (276) | Lin. Reg. (614) | Lin. Reg. (1366) | Lin. Reg. (3036) | Lin. Reg. (6748) | Lin. Reg. (15000) | XGBoost (15000) | LUT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet16-120-raspi4 latency | 0.324 | 0.205 | 0.606 | 0.676 | 0.705 | 0.716 | 0.715 | 0.723 | 0.728 | 0.729 | 0.757 | 0.443 |
| cifar100-pixel3 latency | 0.392 | 0.292 | 0.732 | 0.780 | 0.797 | 0.803 | 0.806 | 0.809 | 0.812 | 0.812 | 0.877 | 0.484 |
| cifar10-edgegpu latency | 0.370 | 0.258 | 0.724 | 0.790 | 0.806 | 0.819 | 0.820 | 0.822 | 0.830 | 0.829 | 0.926 | 0.175 |
| cifar100-edgegpu energy | 0.376 | 0.275 | 0.732 | 0.793 | 0.812 | 0.821 | 0.821 | 0.823 | 0.831 | 0.831 | 0.920 | 0.221 |
| ImageNet16-120-eyeriss arith. int. | 0.369 | 0.293 | 0.748 | 0.805 | 0.817 | 0.827 | 0.825 | 0.832 | 0.843 | 0.846 | 0.970 | 0.861 |
| cifar10-pixel3 latency | 0.388 | 0.300 | 0.733 | 0.780 | 0.797 | 0.805 | 0.805 | 0.810 | 0.813 | 0.813 | 0.878 | 0.475 |
| cifar10-raspi4 latency | 0.393 | 0.315 | 0.740 | 0.787 | 0.799 | 0.805 | 0.807 | 0.810 | 0.813 | 0.813 | 0.890 | 0.462 |
| cifar100-raspi4 latency | 0.393 | 0.308 | 0.744 | 0.786 | 0.801 | 0.807 | 0.810 | 0.810 | 0.814 | 0.814 | 0.888 | 0.445 |
| ImageNet16-120-pixel3 latency | 0.398 | 0.312 | 0.739 | 0.786 | 0.799 | 0.807 | 0.809 | 0.812 | 0.815 | 0.816 | 0.884 | 0.509 |
| cifar100-edgegpu latency | 0.375 | 0.268 | 0.728 | 0.793 | 0.810 | 0.821 | 0.820 | 0.822 | 0.831 | 0.831 | 0.924 | 0.191 |
| cifar10-edgegpu energy | 0.375 | 0.284 | 0.728 | 0.792 | 0.810 | 0.821 | 0.823 | 0.824 | 0.831 | 0.831 | 0.922 | 0.183 |
| ImageNet16-120-edgegpu energy | 0.377 | 0.281 | 0.733 | 0.797 | 0.814 | 0.825 | 0.825 | 0.826 | 0.834 | 0.833 | 0.926 | 0.280 |
| ImageNet16-120-edgegpu latency | 0.379 | 0.264 | 0.737 | 0.799 | 0.817 | 0.826 | 0.826 | 0.828 | 0.836 | 0.835 | 0.938 | 0.277 |
| cifar10-eyeriss arith. int. | 0.384 | 0.296 | 0.757 | 0.811 | 0.826 | 0.835 | 0.832 | 0.843 | 0.854 | 0.854 | 0.969 | 0.826 |
| cifar100-eyeriss arith. int. | 0.384 | 0.297 | 0.757 | 0.811 | 0.826 | 0.835 | 0.833 | 0.844 | 0.855 | 0.856 | 0.971 | 0.830 |
| ImageNet16-120-fpga latency | 0.443 | 0.494 | 0.904 | 0.936 | 0.947 | 0.951 | 0.948 | 0.951 | 0.952 | 0.952 | 0.983 | 0.965 |
| ImageNet16-120-fpga energy | 0.443 | 0.494 | 0.905 | 0.935 | 0.947 | 0.951 | 0.948 | 0.951 | 0.952 | 0.952 | 0.983 | 0.965 |
| ImageNet16-120-eyeriss latency | 0.457 | 0.937 | 0.953 | 0.954 | 0.954 | 0.954 | 0.953 | 0.953 | 0.954 | 0.954 | 0.952 | 0.989 |
| cifar10-eyeriss latency | 0.461 | 0.943 | 0.959 | 0.959 | 0.960 | 0.960 | 0.959 | 0.960 | 0.960 | 0.960 | 0.958 | 0.995 |
| cifar100-eyeriss latency | 0.462 | 0.946 | 0.963 | 0.963 | 0.963 | 0.963 | 0.963 | 0.963 | 0.964 | 0.963 | 0.962 | 0.998 |
| cifar10-eyeriss energy | 0.456 | 0.967 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.975 | 0.996 |
| ImageNet16-120-eyeriss energy | 0.458 | 0.967 | 0.985 | 0.985 | 0.986 | 0.985 | 0.986 | 0.985 | 0.985 | 0.986 | 0.972 | 0.998 |
| cifar100-eyeriss energy | 0.457 | 0.967 | 0.985 | 0.985 | 0.985 | 0.986 | 0.985 | 0.986 | 0.986 | 0.986 | 0.976 | 0.998 |
| cifar10-fpga energy | 0.458 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
| cifar100-fpga energy | 0.458 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
| cifar100-fpga latency | 0.457 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
| cifar10-fpga latency | 0.457 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
Table 3: Kendall’s Tau test correlation for Linear Regression, XGBoost, and Lookup Table (LUT)
on all HW-NAS-Bench datasets (rows), for different amounts of available training data (columns),
tested on the remaining 625 samples. The Lookup Table model is tested on all 15625 architectures.
We selected the five data sets at the top.
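| Dataset | Lin. Reg. (9) | Lin. Reg. (18) | Lin. Reg. (34) | Lin. Reg. (65) | Lin. Reg. (123) | Lin. Reg. (234) | Lin. Reg. (442) | Lin. Reg. (837) | Lin. Reg. (1585) | Lin. Reg. (2999) | XGBoost (2999) | LUT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jigsaw | 0.201 | 0.227 | 0.410 | 0.535 | 0.586 | 0.605 | 0.616 | 0.624 | 0.631 | 0.632 | 0.661 | 0.201 |
| class object | 0.268 | 0.262 | 0.518 | 0.646 | 0.711 | 0.741 | 0.759 | 0.771 | 0.780 | 0.780 | 0.828 | 0.701 |
| room layout | 0.275 | 0.271 | 0.527 | 0.653 | 0.721 | 0.753 | 0.768 | 0.780 | 0.789 | 0.789 | 0.896 | 0.685 |
| class scene | 0.275 | 0.268 | 0.527 | 0.653 | 0.721 | 0.755 | 0.768 | 0.782 | 0.789 | 0.790 | 0.907 | 0.710 |
| segmentsemantic | 0.282 | 0.259 | 0.545 | 0.684 | 0.746 | 0.780 | 0.798 | 0.809 | 0.816 | 0.818 | 0.871 | 0.726 |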
Table 4: Kendall’s Tau test correlation for Linear Regression and XGBoost on the five used
TransNAS datasets (rows), for different amounts of available training data (columns), tested on
the remaining 256 samples. The Lookup Table model (LUT) is tested on all 3256 architectures.
**HW-NAS-Bench:** To select five datasets that are (1) non-linear and (2) different from one another,
we first fit Linear Regression to every available dataset, with the results listed in Table 3. The bottom
12 datasets can be accurately fit with only 25 training samples, so they are not very interesting as a
challenge. On these datasets, the Lookup Table model achieves exceptional performance. Since the
networks for CIFAR10, CIFAR100 and ImageNet16-120 only differ slightly, their measurements on
the same device and metric (e.g. raspi4 latency) are very similar. To improve the generalizability of
our results, we thus select datasets on different devices and metrics, which are listed at the top of
Table 3. As displayed in Figure 6, their data distributions are generally different.
**TransNAS-Bench-101:** Since the latency measurements of the architectures are generally very similarly distributed (see Figure 7), it is not necessary to train the predictors on all of them. We select
all datasets that provide the test loss and inference time attributes for all architectures, resulting in
exactly the five datasets listed in Section 4 (the other two datasets contain more specific test losses).
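Tables 3 and 4 report Kendall's Tau, which compares how two sequences rank the same items. As a minimal, dependency-free sketch, the function below computes the tau-a variant, which ignores ties. This is an illustrative assumption: library implementations such as `scipy.stats.kendalltau` default to the tie-corrected tau-b, and the values in the tables were not produced with this code. The example values are made up.

```python
def kendall_tau_a(x, y):
    """Kendall's Tau (variant a): (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1  # both sequences order the pair the same way
            elif sign < 0:
                discordant += 1  # the sequences disagree on the pair
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical measured latencies and predictions for four architectures.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 3.2, 2.9, 4.5]  # the second and third architecture are swapped
print(kendall_tau_a(measured, predicted))  # 5 concordant, 1 discordant: 4/6
```

Because only the ordering matters, a predictor with a constant offset or a wrong scale can still reach a correlation of 1.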
[Plot panels: histograms of occurrences (y axis) over measurements (x axis) for ImageNet16-120-raspi4_latency, cifar100-pixel3_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, and ImageNet16-120-eyeriss_arithmetic_intensity.]
Figure 6: How the data of each selected HW-NAS-Bench dataset is distributed (not normalized).
[Plot panels: histograms of occurrences (y axis) over measurements (x axis) for class_object, class_scene, jigsaw, room_layout, and segmentsemantic.]
Figure 7: How the data of each selected TransNAS-Bench-101 dataset is distributed (not normalized). Since all architectures are measured for latency on the same hardware, the resulting datasets
are much less diverse than the HW-NAS-Bench ones.
-----
F PREDICTOR FIT TIME
[Plot panels: time to fit the predictor (absolute, and centered on average) over the training set size (log scale, 10^1 to 10^4), averaged over the HW-NAS datasets and over the TransNAS datasets. Legend: Lin. Reg., Bayes. Lin. Reg., Ridge Reg., XGBoost, NGBoost, LGBoost, Random Forests, Sparse GP, GP, BOHAMIANN, SVM Reg., NAO, GCN, BONAS, BANANAS, MLP (large), MLP (small).]
Figure 8: Fit time (in seconds) of predictors to data, depending on the training set size. By far the
most expensive methods are network-based. However, a significant portion of this time is spent on
the hyper-parameter optimization prior to the actual fitting.
G APPROXIMATING PREDICTOR MISTAKES
[Plot panels: three density histograms of the deviation of the predictions, each overlaid with a normal fit (std=0.309, std=0.348, and std=0.456).]
Figure 9: Further examples of predictor deviation distributions, as visualized in the center of Figure 2. Left: Linear Regression on CIFAR100, edgegpu, energy consumption. Center: Support
Vector Machine on Jigsaw. Right: small MLP on ImageNet16-120, raspi4, latency.
Intuitively, the predictor deviation distributions (see Figures 2 and 9) generally resemble a normal
distribution. However, most predictors:
(1) Have a notable peak, sometimes off-center (e.g. at x=0.2)
(2) Have less density than a normal distribution almost everywhere else
(3) Have some outliers (e.g. at x>1.5) that are extremely unlikely for a normal distribution
We measured the p-value for different distributions on the first 100 test samples using a t-test, every
time we evaluated a predictor. The average statistics can be found in Table 5. Since a large number
of empirical observations generally pushes the p-value to 0, this only serves to compare them to each
other. We find that the outliers (3) appear often enough, and are so unlikely to happen for a normal
distribution, that even a uniform distribution has higher statistical support. Consequently, we
approximate the common predictor deviations by sampling from a mixed distribution that addresses
(1) to (3).
-----
| Distribution | p-value |
|---|---|
| normal | 0.028 |
| cauchy | 0.030 |
| lognorm | 0.028 |
| t | 0.028 |
| uniform | 0.037 |
Table 5: P-values of different distributions, trying to fit the distribution of all predictor mistakes
according to a t-test. Larger values are better, but comparing many empirically sampled points with
a true density function tends to push the p-values to 0.
This mixed distribution consists of two normal distributions ($N_1$, $N_2$) and one uniform distribution
($U$), from which we sample with probabilities of 72.5%, 26.5%, and 1%, respectively. For some constant $v$:
• We uniformly sample a shift $c$ from $[0, 2 \cdot v]$, which is used to push the centers of $N_1$ and $N_2$
to $x > 0$ and $x < 0$, respectively.
• We sample each value from $N_1(c, v)$, $N_2(-c, 3 \cdot v)$, and $U(-15 \cdot v, 15 \cdot v)$ randomly, with
the weighting given above.
• We normalize (subtract mean, divide by standard deviation) our sampled distribution and
then scale it to the desired standard deviation.
• The predictors produce non-smooth distributions. We simulate that by sampling 15 times
fewer values than needed and repeating each of them 15 times.
The code for the simulation is also provided (see Appendix A). As seen in Figure 10, the resulting
simulated deviation distributions generally resemble a common predictor pattern. We do not account
for differences in predictors, training set sizes or more, since that may become too specific and overengineered.
Appendix I visualizes simulation sanity checks. We find that the simulation is slightly pessimistic
and simplified, but resembles the results of actual predictors.
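The sampling steps listed above can be sketched in a few lines. This is an illustrative reconstruction and not the released simulation code: the function name `simulate_deviations`, the default `v = 1.0`, and the reading of the second argument of $N_1$ and $N_2$ as a standard deviation are assumptions.

```python
import random
import statistics


def simulate_deviations(n, target_std, v=1.0, repeat=15, seed=0):
    """Sample n simulated predictor deviations with the given standard deviation."""
    rng = random.Random(seed)
    c = rng.uniform(0.0, 2.0 * v)  # shift that separates the centers of N1 and N2
    n_unique = -(-n // repeat)  # ceil(n / repeat): sample `repeat` times fewer values
    values = []
    for _ in range(n_unique):
        r = rng.random()
        if r < 0.725:  # N1, weight 72.5%
            values.append(rng.gauss(c, v))
        elif r < 0.725 + 0.265:  # N2, weight 26.5%
            values.append(rng.gauss(-c, 3.0 * v))
        else:  # U, weight 1%: rare but large outliers
            values.append(rng.uniform(-15.0 * v, 15.0 * v))
    # Normalize (zero mean, unit std), then scale to the desired std.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    values = [(x - mean) / std * target_std for x in values]
    # Repeat every value to imitate the non-smooth distributions of real predictors.
    repeated = [x for x in values for _ in range(repeat)]
    rng.shuffle(repeated)
    return repeated[:n]


deviations = simulate_deviations(n=1500, target_std=0.5)
print(len(deviations), len(set(deviations)))
print(round(statistics.pstdev(deviations), 6))
```

When `n` is a multiple of `repeat`, the returned values have exactly the requested standard deviation. Otherwise the final truncation makes it approximate.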
[Plot panels: three density histograms of simulated predictor deviations (x axis: deviation of the simulated predictions), each showing a normal fit (std=0.500) and the mixed distribution generated with std=0.5.]
Figure 10: The values sampled from the Gaussian+uniform mixture fit the measured predictor mistakes better than
a single distribution, as they are roughly normally distributed, but include outliers.
-----
H MEASURING SIMULATED MISTAKES
[Plot panels: two scatter plots over the normalized ImageNet16-120-raspi4_latency (x axis), showing the true Pareto front (HV=2.93), the predicted/discovered Pareto front (HV=2.67), all architectures, and the selected architectures (MRAall = 3.22%, MRApareto = 3.77%).]
Figure 11: Similar to Figure 3. When the discovered Pareto set is considerably worse than the true
Pareto set, it is possible for the Mean Reduction of Accuracy of the Pareto subset (MRApareto) to be
worse than the average over all architectures (MRAall). This naturally happens more frequently for
worse predictors with a high sampling std. and low KT correlation. Consequently, the difference
between MRAall and MRApareto is wider for better predictors (see Figure 4). Additionally, all of
the selected non-Pareto-front members are clustered in a high-latency area and redundant with each
other. This emphasizes the limitations of just considering drops in accuracy, as the hardware metric
aspect is ignored. In this case, the predictor-guided selection failed to find a low-latency solution.
In this regard, hypervolume is a better but less intuitive metric.
[Plot panels. Left: accuracy [%] over the hardware metric, showing the true Pareto front (A1 to A5), the actually selected architecture C1, and its accuracy difference and HW metric difference to the front. Right: accuracy [%] over the hardware metric, showing a Pareto front, its hypervolume, and the reference point (hardware metric +10%, accuracy to 0).]
Figure 12: Examples to explain measurement methods.
**Left:** The distance of each selected candidate architecture C1 to the true Pareto front is measured,
for accuracy and the hardware metric. C1 is dominated by A2, A3, and A4 of the true Pareto set. A2
has a slightly higher accuracy than C1 while being much better on the hardware metric, e.g. latency.
A4 has a slightly better hardware metric value, but much higher accuracy. Given several candidate
architectures, their differences are averaged.
**Right:** We compute the reference point for the hypervolume (for two objectives: the area under a
Pareto front) by multiplying the highest hardware metric value from the true Pareto front by 1.1,
and setting the accuracy to 0. While we are consistent throughout all experiments, this choice is arbitrary, as there
is no obviously correct choice for the reference point. When the hypervolume of a supposed Pareto
front is computed, the reference point of the true Pareto front is reused. Thus, choosing inferior
architectures will always reduce the hypervolume. We chose the multiplier m = 1.1
as a middle ground between making the rightmost point of the Pareto front irrelevant (m = 1.0) and
overemphasizing it (m ≫ 1.0).
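The hypervolume computation described above can be sketched for the two-objective case, where accuracy is maximized and the hardware metric is minimized. This is a minimal illustration under assumptions, not the code used for the experiments: the function names and all point coordinates are hypothetical, and points are given as (hardware metric, accuracy) pairs.

```python
def reference_point(true_front, multiplier=1.1):
    """Highest hardware metric value of the true Pareto front times m, accuracy 0."""
    return multiplier * max(hw for hw, _ in true_front), 0.0


def hypervolume_2d(points, ref):
    """Area dominated by the points and bounded by the reference point."""
    ref_hw, ref_acc = ref
    # Points that are not better than the reference in both objectives add no area.
    usable = sorted((hw, acc) for hw, acc in points if hw < ref_hw and acc > ref_acc)
    area, best_acc = 0.0, ref_acc
    for i, (hw, acc) in enumerate(usable):
        best_acc = max(best_acc, acc)  # best accuracy reachable at this metric value
        next_hw = usable[i + 1][0] if i + 1 < len(usable) else ref_hw
        area += (next_hw - hw) * (best_acc - ref_acc)
    return area


# Hypothetical true Pareto front and predictor-guided selection.
true_front = [(20.0, 46.0), (22.0, 50.0), (25.0, 54.0), (30.0, 62.0)]
selected = [(22.0, 48.0), (30.0, 60.0)]

ref = reference_point(true_front)  # reused for every front that is compared
hv_true = hypervolume_2d(true_front, ref)
hv_selected = hypervolume_2d(selected, ref)
print(hv_true, hv_selected)
```

Because the reference point is taken from the true Pareto front and reused, the hypervolume of any selected subset can be compared directly with that of the true front.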
-----
I SIMULATION SANITY CHECK
[Plot panels: one panel (title: mean over any number of architectures) with curves for ImageNet16-120-eyeriss_arithmetic_intensity, ImageNet16-120-raspi4_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, cifar100-pixel3_latency, and their mean; x axis: Std. of prediction deviations / Kendall's Tau, with ticks 0.0/1.00, 0.2/0.88, 0.4/0.78, 0.6/0.69, 0.8/0.62, 1.0/0.57. One panel over the Std. of prediction deviations (x axis, 0.0 to 1.0), annotated with KT=-0.75, SCC=-0.88, PCC=0.77.]
Figure 13: Standard deviation over the predictor deviations (x axis) and Kendall’s Tau correlation
(y axis), for the trained predictors on HW-NAS-Bench (left) and in simulation (right). The simulated
predictor inaccuracies are slightly pessimistic (low KT), but still match the true values.
-----
[Plot grid: density histograms of the predictor deviations (x axis: deviation of the predictions), each overlaid with a normal fit. The top plot covers all candidates (normal fit, std=0.445). The grid below has one row per candidate operation and one column per number of candidate occurrences in the architecture. The standard deviations of the normal fits are:]

| Candidate | Not at all | Exactly once | Exactly twice |
|---|---|---|---|
| Zero | 0.541 | 0.443 | 0.356 |
| Skip | 0.532 | 0.436 | 0.412 |
| Conv1x1 | 0.462 | 0.470 | 0.393 |
| Conv3x3 | 0.146 | 0.403 | 0.565 |
| Pool | 0.446 | 0.411 | 0.477 |
Table 6: How a trained XGB predictor deviates from the ground-truth values for different architecture subsets, akin to Figure 2. While they are not exactly the same, they still resemble the distribution
over the entire test set (top plot, 625 samples). One noteworthy exception is when no Conv3x3 operations are used at all, in which case the standard deviation is considerably smaller.
-----