Overview of DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch.
Motivation: Why is zeroth-order (ZO) optimization needed for model training?
Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain:

Disciplines like physics and chemistry: ML models may interact with intricate simulators or experiments where the underlying systems are non-differentiable.

Black-box learning scenarios: when deep learning (DL) models are integrated with third-party APIs, e.g., adversarial attack and defense against black-box DL models, or black-box prompt learning for language-model-as-a-service.

Limited hardware: The principled backpropagation (BP) mechanism for calculating FO gradients may also not be supported when implementing DL models on hardware systems.
However, the scalability of ZO optimization remains an open problem: its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. As problem dimensionality increases, the accuracy and efficiency of traditional ZO methods deteriorate. This is because ZO finite-difference-based gradient estimates are biased estimators of FO gradients, and the bias becomes more pronounced in higher-dimensional spaces. These challenges motivate the central question addressed in this work:
How to scale up ZO optimization for training deep models?
ZO Gradient Estimator: RGE or CGE?
The ZO optimizer interacts with the objective function \(\ell\) only by submitting inputs (i.e., realizations of \(\boldsymbol \theta\)) and receiving the corresponding function values. There are two main ZO gradient estimation schemes: deterministic coordinate-wise gradient estimation (CGE) and randomized vector-wise gradient estimation (RGE), as shown below:
\[\hat{\nabla}_{\boldsymbol \theta} \ell(\boldsymbol \theta) = \frac{1}{q} \sum_{i=1}^q \left [ \frac{\ell(\boldsymbol \theta + \mu \mathbf u_i) - \ell(\boldsymbol \theta)}{\mu} \mathbf u_i \right ]; ~~~~~~~~ (\mathbf{RGE})\] \[\hat{\nabla}_{\boldsymbol \theta} \ell(\boldsymbol \theta) = \sum_{i=1}^d \left [ \frac{\ell(\boldsymbol \theta + \mu \mathbf e_i) - \ell(\boldsymbol \theta)}{\mu} \mathbf e_i \right ], ~~~~~~~~~~~~ (\mathbf{CGE})\]where \(\hat{\nabla}_{\boldsymbol \theta} \ell\) denotes an estimate of the FO gradient \(\nabla_{\boldsymbol \theta}\ell\) with respect to the optimization variables \(\boldsymbol \theta \in \mathbb R^d\) (e.g., model parameters of a neural network).
In (RGE), \(\mathbf u_i\) denotes a randomized perturbation vector, e.g., drawn from the standard Gaussian distribution \(\mathcal N(\mathbf 0, \mathbf I)\), \(\mu > 0\) is a perturbation size (a.k.a. smoothing parameter), and \(q\) is the number of random directions used to acquire finite differences.
In (CGE), \(\mathbf e_i\) denotes a standard basis vector, and \(\frac{\ell(\boldsymbol \theta + \mu \mathbf e_i) - \ell(\boldsymbol \theta)}{\mu}\) provides the finite-difference estimate of the partial derivative of \(\ell(\boldsymbol \theta)\) at the \(i\)-th coordinate \(\boldsymbol \theta_i\).
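As a concrete illustration, both estimators can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; `loss` is any black-box function of the parameter vector:

```python
import numpy as np

def rge(loss, theta, mu=1e-3, q=10, rng=None):
    """Randomized vector-wise estimate: q Gaussian directions, q + 1 loss queries."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    f0 = loss(theta)
    for _ in range(q):
        u = rng.standard_normal(theta.size)
        grad += (loss(theta + mu * u) - f0) / mu * u
    return grad / q

def cge(loss, theta, mu=1e-3):
    """Coordinate-wise estimate: one extra loss query per coordinate (d + 1 total)."""
    grad = np.zeros_like(theta)
    f0 = loss(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = 1.0
        grad[i] = (loss(theta + mu * e) - f0) / mu
    return grad
```

Note the query counts: RGE costs \(q + 1\) evaluations regardless of \(d\), while CGE costs \(d + 1\), which is exactly the trade-off discussed next.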
Compared to CGE, RGE has the flexibility to specify \(q < d\) to reduce the number of function evaluations. Despite this query efficiency, it remains uncertain whether RGE can deliver satisfactory accuracy when training a deep model from scratch. To this end, we undertake a preliminary investigation in which we train a basic convolutional neural network (CNN) of different sizes on CIFAR-10, employing both RGE and CGE. As the two figures below show, CGE achieves test accuracy comparable to FO training, significantly outperforms RGE, and is also more time-efficient than RGE.
Based on the advantages of CGE over RGE in both accuracy and computational efficiency, we choose CGE as the preferred ZO gradient estimator. However, the query complexity of CGE remains a bottleneck, as it scales with the model size \(d\).
Proposed ZO DL Framework: DeepZero
To the best of our knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch.

Model pruning via a ZO oracle (ZO-GraSP): A randomly initialized, dense neural network contains a high-quality sparse subnetwork. However, most effective pruning methods incorporate model training as an intermediate step and are therefore ill-suited to finding sparsity via a ZO oracle. To address this challenge, we draw inspiration from training-free pruning methods, known as pruning-at-initialization. Within this family, gradient signal preservation (GraSP) identifies the sparsity prior of a DNN through the gradient flows of a randomly initialized network.
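A much-simplified sketch of the idea behind ZO-GraSP: the GraSP score \(-\boldsymbol \theta \odot (\mathbf H \mathbf g)\) involves a gradient \(\mathbf g\) and a Hessian-gradient product \(\mathbf H \mathbf g\), and both can be replaced by finite-difference estimates so that only function queries are needed. Function names here are illustrative, and the actual ZO-GraSP differs in details such as the estimator used:

```python
import numpy as np

def cge_grad(loss, theta, mu=1e-5):
    # Coordinate-wise finite-difference gradient estimate (as in CGE).
    f0 = loss(theta)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = 1.0
        g[i] = (loss(theta + mu * e) - f0) / mu
    return g

def zo_grasp_scores(loss, theta, mu=1e-5):
    """GraSP-style score -theta * (H g), with both the gradient g and the
    Hessian-gradient product H g obtained from ZO estimates only."""
    g = cge_grad(loss, theta, mu)
    # Finite-difference Hessian-vector product: (grad(theta + mu*g) - grad(theta)) / mu.
    Hg = (cge_grad(loss, theta + mu * g, mu) - g) / mu
    return -theta * Hg
```

Weights are then ranked by these scores to form the sparsity mask, without any backpropagation.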

Sparse Gradient: To retain the accuracy benefits of training dense models, we incorporate gradient sparsity (in CGE) rather than weight sparsity. This ensures that we train a dense model in the weight space, rather than training a sparse model. Specifically, we leverage ZO-GraSP to determine layer-wise pruning ratios (LPRs) that capture DNN compressibility; ZO optimization then trains the dense model by iteratively updating only a subset of model parameters with their CGE gradient estimates, where the sparse gradient ratio is determined by the LPRs.
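The resulting sparse variant of CGE can be sketched as follows, assuming `active_idx` holds the coordinates selected according to the LPRs (a hedged sketch; names are illustrative):

```python
import numpy as np

def sparse_cge(loss, theta, active_idx, mu=1e-3):
    """Sparse-CGE: finite differences only over the active coordinates;
    inactive coordinates receive a zero gradient entry. The weights
    themselves stay dense -- only the gradient is sparsified."""
    f0 = loss(theta)
    grad = np.zeros_like(theta)
    for i in active_idx:
        e = np.zeros_like(theta)
        e[i] = 1.0
        grad[i] = (loss(theta + mu * e) - f0) / mu
    return grad
```

The query count drops from \(d + 1\) to \(|{\rm active}| + 1\), which is what makes CGE affordable at scale.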

Feature Reuse: Since CGE perturbs each parameter element-wise, it can reuse the features immediately preceding the perturbed layer and carry out only the remaining forward-pass operations, instead of starting from the input layer. Empirically, CGE with feature reuse yields roughly a 2× reduction in training time.
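A toy illustration of the idea with a two-layer network (all names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

# Toy two-layer model: h = relu(W1 @ x), y = W2 @ h, squared-error loss.
# When CGE perturbs an entry of W2, the hidden feature h is unchanged, so we
# compute h once and rerun only the second layer for every perturbation.

def loss_and_hidden(W1, W2, x, target):
    h = np.maximum(W1 @ x, 0.0)
    return float(np.sum((W2 @ h - target) ** 2)), h

def cge_layer2_with_reuse(W1, W2, x, target, mu=1e-4):
    f0, h = loss_and_hidden(W1, W2, x, target)  # one full forward pass; cache h
    grad = np.zeros_like(W2)
    for idx in np.ndindex(*W2.shape):
        W2p = W2.copy()
        W2p[idx] += mu
        fp = float(np.sum((W2p @ h - target) ** 2))  # reuse cached h
        grad[idx] = (fp - f0) / mu
    return grad
```

Only the second layer is re-executed per perturbation; in deep networks the savings grow with the length of the reusable prefix.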

Forward Parallelization: CGE enables parallelization of model training because each parameter perturbation corresponds to an independent forward pass. This decoupling allows forward passes to be scaled out across distributed machines, which can significantly improve ZO training speed.
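A minimal sketch of this decoupling, with threads standing in for distributed machines (illustrative only):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_cge(loss, theta, mu=1e-3, workers=4):
    """Each coordinate's finite difference is an independent forward pass, so
    the d evaluations can be dispatched to separate workers (threads here as a
    stand-in for distributed machines)."""
    f0 = loss(theta)

    def one_coord(i):
        e = np.zeros_like(theta)
        e[i] = 1.0
        return (loss(theta + mu * e) - f0) / mu

    with ThreadPoolExecutor(max_workers=workers) as pool:
        grad = np.array(list(pool.map(one_coord, range(theta.size))))
    return grad
```

Because the workers never exchange gradients during the evaluations, the scheme is embarrassingly parallel: only the scalar function values need to be gathered.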
Performance Comparison
Image classification
We compare the accuracy of a DeepZero-trained ResNet-20 with two variants trained by FO recipes:
 (1) a dense ResNet-20 acquired through FO training,
 (2) a sparse ResNet-20 acquired through FO training under the FO-GraSP sparsity pattern.
As shown in the figure below, an accuracy gap still exists between (1) and the model trained with DeepZero in the sparsity regime of 80% to 99%. This highlights the challenge of ZO optimization for deep model training, where high sparsity is desired to reduce the number of model queries in Sparse-CGE when scaling to ResNet-20. Notably, in the sparsity regime of 90% to 99%, DeepZero outperforms (2), showcasing the superiority of gradient sparsity in DeepZero over weight sparsity (i.e., directly training a sparse model).
Black-box defense
The black-box defense problem arises when the owner of an ML model is unwilling to share the model details with a defender against adversarial attacks. This poses a challenge for existing robustness-enhancement algorithms, which directly robustify white-box ML models using FO training. To overcome this challenge, ZO-AE-DS [1] was proposed; it introduces an autoencoder (AE) between the white-box denoised smoothing (DS) defense operation (to be learned) and the black-box image classifier to address the dimensionality challenge of ZO training. The downside of ZO-AE-DS is poor scaling to high-resolution datasets (e.g., ImageNet), since the AE compromises the fidelity of the image input to the black-box classifier and leads to inferior defense performance. In contrast, DeepZero can directly learn the defense operation integrated with the black-box classifier, without needing an AE. As shown in the table below, DeepZero consistently outperforms ZO-AE-DS in certified accuracy (CA) for all values of the input perturbation radius \(r > 0\).
Simulation-coupled DL
Numerical methods, while instrumental in providing physics-informed simulations, come with their own challenge: the discretization unavoidably produces numerical errors. Prior work [2] demonstrated the feasibility of training a corrective neural network through looping interactions with an iterative partial differential equation (PDE) solver, coined 'solver-in-the-loop' (SOL). While existing work focused on using or developing differentiable simulators for model training, we extend SOL by leveraging DeepZero, enabling its use with non-differentiable or black-box simulators. The table below compares the test-error-correction performance of ZO-SOL (via DeepZero) with three baseline methods:
 SRC (low-fidelity simulation without error correction),
 NON (non-interactive training outside the simulation loop, using pre-generated low- and high-fidelity simulation data),
 FO-SOL (FO training for SOL given a differentiable simulator).
The error for each test simulation is computed as the mean absolute error (MAE) between the corrected simulation and the high-fidelity simulation, averaged across all simulation timesteps. The results demonstrate that ZO-SOL, achieved by DeepZero, outperforms the SRC and NON baselines and narrows the performance gap with FO-SOL, despite having only query-based access to the simulator. Comparing ZO-SOL with NON highlights the promise of ZO-SOL even when integrated with black-box simulators.
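The metric can be sketched as follows (array shapes and names are illustrative):

```python
import numpy as np

def correction_error(corrected, high_fidelity):
    """MAE between the corrected and high-fidelity trajectories, averaged over
    all timesteps. Arrays are shaped (timesteps, *field_dims)."""
    # Per-timestep MAE over the field dimensions, then average over timesteps.
    per_step = np.mean(np.abs(corrected - high_fidelity),
                       axis=tuple(range(1, corrected.ndim)))
    return float(np.mean(per_step))
```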
Citation
@article{chen2023deepzero,
title={DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training},
author={Chen, Aochuan and Zhang, Yimeng and Jia, Jinghan and Diffenderfer, James and Liu, Jiancheng and Parasyris, Konstantinos and Zhang, Yihua and Zhang, Zheng and Kailkhura, Bhavya and Liu, Sijia},
journal={arXiv preprint arXiv:2310.02025},
year={2023}
}
References
[1] Zhang et al. (ICLR 2022). How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective.
[2] Um et al. (NeurIPS 2020). Solver-in-the-Loop: Learning from Differentiable Physics to Interact with Iterative PDE-Solvers.