pypose.optim.LevenbergMarquardt
- class pypose.optim.LevenbergMarquardt(model, solver=None, strategy=None, kernel=None, corrector=None, weight=None, reject=16, min=1e-06, max=1e+32, vectorize=True, sparse=False)
The Levenberg-Marquardt (LM) algorithm for solving non-linear least squares problems, also known as the damped least squares (DLS) method. This implementation optimizes the model parameters to approximate the target, which can be a Tensor/LieTensor or a tuple of Tensors/LieTensors.
\[\bm{\theta}^* = \arg\min_{\bm{\theta}} \sum_i \rho\left((\bm{f}(\bm{\theta},\bm{x}_i)-\bm{y}_i)^T \mathbf{W}_i (\bm{f}(\bm{\theta},\bm{x}_i)-\bm{y}_i)\right), \]where \(\bm{f}()\) is the model, \(\bm{\theta}\) are the parameters to be optimized, \(\bm{x}\) is the model input, \(\mathbf{W}_i\) is a square positive definite weight matrix, and \(\rho\) is a robust kernel function that reduces the effect of outliers. \(\rho(x) = x\) is used by default.
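To make the objective concrete, here is a minimal pure-Python sketch that evaluates \(\sum_i \rho(\bm{r}_i^T \mathbf{W}_i \bm{r}_i)\) for two residual blocks. The helper names (`huber`, `quad_form`, `robust_cost`) are hypothetical and not part of PyPose, and the Huber-style kernel is written out by hand as an illustration rather than taken from `kernel.Huber()`.

```python
import math

def huber(x, delta=1.0):
    """Illustrative Huber-style kernel acting on a squared residual x = r^T W r."""
    root = math.sqrt(x)
    return x if root < delta else 2.0 * delta * root - delta ** 2

def quad_form(r, W):
    """Compute r^T W r for a residual vector r and a square matrix W."""
    n = len(r)
    return sum(r[i] * W[i][j] * r[j] for i in range(n) for j in range(n))

def robust_cost(residuals, weights, rho=lambda x: x):
    """Sum rho(r_i^T W_i r_i) over all residual blocks."""
    return sum(rho(quad_form(r, W)) for r, W in zip(residuals, weights))

residuals = [[0.1, -0.2], [3.0, 4.0]]   # the second block plays the outlier
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = [identity, identity]

plain = robust_cost(residuals, weights)          # default rho(x) = x
robust = robust_cost(residuals, weights, huber)  # outlier is down-weighted
```

With the identity kernel the outlier contributes its full squared norm to the cost, while the Huber-style kernel lets it grow only linearly in the residual norm.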
\[\begin{aligned} &\rule{113mm}{0.4pt} \\ &\textbf{input}: \lambda~\text{(damping)}, \bm{\theta}_0~\text{(params)}, \bm{f}~\text{(model)}, \bm{x}~(\text{input}), \bm{y}~(\text{target}) \\ &\hspace{12mm} \rho~(\text{kernel}), \epsilon_{s}~(\text{min}), \epsilon_{l}~(\text{max}) \\ &\rule{113mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm} \mathbf{J} \leftarrow {\dfrac {\partial \bm{f}} {\partial \bm{\theta}_{t-1}}} \\ &\hspace{5mm} \mathbf{A} \leftarrow (\mathbf{J}^T \mathbf{W} \mathbf{J}) .\mathrm{diagonal\_clamp(\epsilon_{s}, \epsilon_{l})} \\ &\hspace{5mm} \mathbf{R} = \bm{f(\bm{\theta}_{t-1}, \bm{x})}-\bm{y} \\ &\hspace{5mm} \mathbf{R}, \mathbf{J}=\mathrm{corrector}(\rho, \mathbf{R}, \mathbf{J})\\ &\hspace{5mm} \textbf{while}~\text{first iteration}~\textbf{or}~ \text{loss not decreasing} \\ &\hspace{10mm} \mathbf{A} \leftarrow \mathbf{A} + \lambda \mathrm{diag}(\mathbf{A}) \\ &\hspace{10mm} \bm{\delta} = \mathrm{solver}(\mathbf{A}, -\mathbf{J}^T \mathbf{W} \mathbf{R}) \\ &\hspace{10mm} \lambda \leftarrow \mathrm{strategy}(\lambda,\text{model information})\\ &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} + \bm{\delta} \\ &\hspace{10mm} \textbf{if}~\text{loss not decreasing}~\textbf{and}~ \text{maximum reject step not reached} \\ &\hspace{15mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \bm{\delta} ~(\text{reject step}) \\ &\rule{113mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{113mm}{0.4pt} \\[-1.ex] \end{aligned} \]
- Parameters:
model (nn.Module) – a module containing learnable parameters.
solver (nn.Module, optional) – a linear solver. If None, solver.Cholesky() is used. Default: None.
strategy (object, optional) – strategy for adjusting the damping factor. If None, strategy.TrustRegion() is used. Default: None.
kernel (nn.Module, or list, optional) – the robust kernel function. If a list, each element must be an nn.Module or None, and the length must be 1 or the number of residuals. Default: None.
corrector (nn.Module, or list, optional) – the Jacobian and model residual corrector to fit the kernel function. If a list, each element must be an nn.Module or None, and the length must be 1 or the number of residuals. If a kernel is given but a corrector is not specified, auto correction is used. Auto correction can be unstable when the robust model has an indefinite Hessian. Default: None.
weight (Tensor, or list, optional) – the square positive definite matrix defining the weight of the model residual. If a list, each element must be a Tensor and the length must equal the number of residuals. The corresponding residual and weight should be broadcastable. For example, if the shape of a residual is B*M*N*R, the shape of its weight can be R*R, N*R*R, M*N*R*R or B*M*N*R*R. Use this only when all inputs share the same weight matrices. This is ignored when a weight is passed to the step() or optimize() method. Default: None.
reject (integer, optional) – the maximum number of rejected unsuccessful steps. Default: 16.
min (float, optional) – the lower bound of the Hessian diagonal. Default: 1e-6.
max (float, optional) – the upper bound of the Hessian diagonal. Default: 1e32.
vectorize (bool, optional) – the method of computing the Jacobian. If True, the gradient of each scalar in the output with respect to the model parameters is computed in parallel with "reverse-mode". See pypose.optim.functional.modjac() for more details. Default: True.
sparse (bool, optional) – if True, use the sparse LM path based on sparse Jacobians and sparse normal equations. This mode requires the optional sparse backend bae and is intended to be used with sparse linear solvers such as solver.CG(). Default: False.
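The role of min and max can be pictured with a small pure-Python sketch of the diagonal clamp applied to \(\mathbf{J}^T\mathbf{J}\) in the algorithm above. The helper `clamp_diagonal` is hypothetical and only mirrors the pseudocode; it is not the library's implementation. The point it illustrates is an assumption about intent: a parameter that no residual observes has a zero diagonal entry, so the damping term \(\lambda\,\mathrm{diag}(\mathbf{A})\) would also vanish there unless the diagonal is bounded from below.

```python
def clamp_diagonal(A, lo=1e-6, hi=1e32):
    """Return a copy of the square matrix A with its diagonal clamped to [lo, hi]."""
    out = [row[:] for row in A]
    for i in range(len(out)):
        out[i][i] = min(max(out[i][i], lo), hi)
    return out

# Jacobian whose second column is zero: the second parameter is unobserved.
J = [[1.0, 0.0],
     [2.0, 0.0]]

# Approximate Hessian J^T J (unit weights).
JtJ = [[sum(row[i] * row[j] for row in J) for j in range(2)] for i in range(2)]

A = clamp_diagonal(JtJ)  # the zero diagonal entry is lifted to the lower bound
```

After clamping, the damped system stays non-singular for any positive damping factor, while the off-diagonal entries are left untouched.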
Available solvers: solver.PINV(); solver.LSTSQ(); solver.Cholesky(); solver.CG(); solver.PCG().
Available kernels: kernel.Huber(); kernel.PseudoHuber(); kernel.Cauchy().
Available correctors: corrector.FastTriggs(); corrector.Triggs().
Available strategies: strategy.Constant(); strategy.Adaptive(); strategy.TrustRegion().
Note
Setting sparse=True enables the sparse Jacobian / sparse LM backend. It should be used when the underlying optimization problem exhibits a large, structured sparse Jacobian, where each residual depends only on a small subset of parameters. Please cite the following paper, which implements the sparse LM backend:
Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang, Bundle Adjustment in the Eager Mode, arXiv preprint arXiv:2409.12190, 2024.
See the full, clean runnable example with sparse=True for bundle adjustment.
Warning
The output of model \(\bm{f}(\bm{\theta},\bm{x}_i)\) and target \(\bm{y}_i\) can be any shape, while their last dimension \(d\) is always taken as the dimension of model residual, whose inner product will be input to the kernel function. This is useful for residuals like re-projection error, whose last dimension is 2.
Note that auto correction is equivalent to the method of ‘square-rooting the kernel’ mentioned in Section 3.3 of the following paper. It replaces the \(d\)-dimensional residual with a one-dimensional one, which loses residual-level structural information.
Christopher Zach, Robust Bundle Adjustment Revisited, European Conference on Computer Vision (ECCV), 2014.
Therefore, users need to keep the last dimension of the model output and target as 1, even if the model residual is a scalar. If the model output has only one dimension, the model Jacobian will be a row vector instead of a matrix, which loses sample-level structural information, although computing a Jacobian vector is faster.
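The "square-rooting" idea mentioned in this warning can be sketched in a few lines of pure Python: a \(d\)-dimensional residual \(\bm{r}\) is collapsed into a single scalar whose square equals \(\rho(\|\bm{r}\|^2)\). The names `cauchy` and `square_root_kernel` are hypothetical, and the Cauchy-style kernel is written by hand for illustration; this is a reading of the description above, not PyPose's auto-correction code.

```python
import math

def cauchy(x, delta=1.0):
    """Illustrative Cauchy-style kernel acting on a squared residual norm x."""
    return delta ** 2 * math.log(1.0 + x / delta ** 2)

def square_root_kernel(r, rho):
    """Collapse a d-dimensional residual into one scalar whose square is rho(||r||^2)."""
    x = sum(v * v for v in r)   # inner product over the last (residual) dimension
    return math.sqrt(rho(x))

r = [3.0, 4.0]                           # e.g. a 2-D re-projection error
scalar = square_root_kernel(r, cauchy)   # one number now stands in for the 2-D residual
```

The two components of `r` can no longer be told apart once they are folded into `scalar`, which is the loss of residual-level structure the warning refers to.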
- step(input, target=None, weight=None)
Performs a single optimization step.
- Parameters:
input (Tensor/LieTensor, tuple or a dict of Tensors/LieTensors) – the input to the model.
target (Tensor/LieTensor) – the model target to optimize. If not given, the squared model output is minimized. Default: None.
weight (Tensor, or list, optional) – the square positive definite matrix defining the weight of the model residual. If a list, each element must be a Tensor and the length must equal the number of residuals. This argument is currently not supported when sparse=True. Default: None.
- Returns:
the minimized model loss.
- Return type:
Tensor
Note
The (non-negative) damping factor \(\lambda\) can be adjusted at each iteration. If the residual reduces rapidly, a smaller value can be used, bringing the algorithm closer to the Gauss-Newton algorithm, whereas if an iteration gives insufficient residual reduction, \(\lambda\) can be increased, giving a step closer to the gradient descent direction.
See Wikipedia for more details on the Levenberg-Marquardt (LM) algorithm.
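The interplay between step acceptance and the damping factor can be illustrated with a small, self-contained LM loop in pure Python. It fits \(y = a\,e^{bx}\) with two parameters and solves the damped 2×2 system by Cramer's rule. Everything here is an assumption made for illustration: the function names are hypothetical, the damping rule (divide by 10 on success, multiply by 10 on rejection) is a simple textbook choice rather than any of the PyPose strategies, and the damped matrix is rebuilt from the undamped \(\mathbf{A}\) on every retry.

```python
import math

def residuals(theta, xs, ys):
    a, b = theta
    return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]

def jacobian(theta, xs):
    a, b = theta
    return [[math.exp(b * x), a * x * math.exp(b * x)] for x in xs]

def sq_loss(theta, xs, ys):
    return sum(r * r for r in residuals(theta, xs, ys))

def lm_fit(theta, xs, ys, damping=1e-2, iters=50, reject=16):
    loss = sq_loss(theta, xs, ys)
    for _ in range(iters):
        R, J = residuals(theta, xs, ys), jacobian(theta, xs)
        # Normal equations with unit weights: A = J^T J, g = -J^T R.
        A = [[sum(row[i] * row[j] for row in J) for j in range(2)] for i in range(2)]
        g = [-sum(row[i] * r for row, r in zip(J, R)) for i in range(2)]
        for _ in range(reject):
            # Damped system (A + damping * diag(A)) delta = g via Cramer's rule.
            a00 = A[0][0] * (1.0 + damping)
            a11 = A[1][1] * (1.0 + damping)
            det = a00 * a11 - A[0][1] * A[1][0]
            delta = [(g[0] * a11 - A[0][1] * g[1]) / det,
                     (a00 * g[1] - A[1][0] * g[0]) / det]
            trial = [theta[0] + delta[0], theta[1] + delta[1]]
            trial_loss = sq_loss(trial, xs, ys)
            if trial_loss < loss:
                # Accept: move closer to Gauss-Newton by shrinking the damping.
                theta, loss, damping = trial, trial_loss, damping / 10.0
                break
            # Reject: keep theta and lean toward gradient descent.
            damping *= 10.0
        else:
            break  # every retry was rejected, so stop
        if loss < 1e-20:
            break
    return theta, loss

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-x) for x in xs]      # noise-free data from a=2, b=-1
theta, loss = lm_fit([1.0, 0.0], xs, ys)
```

On this noise-free problem the accepted steps quickly drive the damping toward zero, so the later iterations behave like Gauss-Newton.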
Note
Different from PyTorch optimizers like SGD, where the model error has to be a scalar, the output of model \(\bm{f}\) can be a Tensor/LieTensor or a tuple of Tensors/LieTensors.
Note
When sparse=True, only a single residual tensor is currently supported. If the model returns multiple residuals, only the first one is used.
Example
Optimizing a simple module to approximate pose inversion.
>>> import torch, pypose as pp
>>> from torch import nn
>>> class PoseInv(nn.Module):
...     def __init__(self, *dim):
...         super().__init__()
...         self.pose = pp.Parameter(pp.randn_se3(*dim))
...
...     def forward(self, input):
...         # the last dimension of the output is 6,
...         # which will be the residual dimension.
...         return (self.pose.Exp() @ input).Log()
...
>>> posinv = PoseInv(2, 2)
>>> input = pp.randn_SE3(2, 2)
>>> strategy = pp.optim.strategy.Adaptive(damping=1e-6)
>>> optimizer = pp.optim.LM(posinv, strategy=strategy)
...
>>> for idx in range(10):
...     loss = optimizer.step(input)
...     print('Pose Inversion loss %.7f @ %d it'%(loss, idx))
...     if loss < 1e-5:
...         print('Early Stopping with loss:', loss.item())
...         break
...
Pose Inversion loss 1.6600330 @ 0 it
Pose Inversion loss 0.1296970 @ 1 it
Pose Inversion loss 0.0008593 @ 2 it
Pose Inversion loss 0.0000004 @ 3 it
Early Stopping with loss: 4.443569991963159e-07
Optimizing a tiny pose graph with sparse LM (requires the optional sparse backend bae and CUDA).
Here, parallel_for_sparse_jacobian marks the relative-pose residual so the sparse backend can assemble sparse Jacobians for sparse LM. Use it on factorwise residual functions that take batch inputs and return one residual block per batch item. When you call the function normally, it behaves the same as before; the decorator only helps the sparse backend build sparse Jacobians. In the example, the root pose is fixed, and the remaining poses are optimized only from relative-pose edge errors.
>>> from pypose.autograd.function import parallel_for_sparse_jacobian
>>> torch.manual_seed(0)
>>> device = torch.device("cuda")
>>> dtype = torch.float64
>>>
>>> @parallel_for_sparse_jacobian
... def edge_error(node1, node2, relpose):
...     return (relpose.Inv() @ node1.Inv() @ node2).Log().tensor()
...
>>> class PoseGraph(nn.Module):
...     def __init__(self, root, nodes):
...         super().__init__()
...         self.register_buffer('root', root)
...         self.nodes = pp.Parameter(nodes, sjac=True)
...
...     def forward(self, edges, relposes):
...         nodes = torch.cat((self.root, self.nodes), dim=0)
...         return edge_error(nodes[edges[:, 0]], nodes[edges[:, 1]], relposes)
...
>>> gt_nodes = pp.SE3(torch.tensor([
...     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
...     [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
...     [2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
... ], device=device, dtype=dtype))
>>> edges = torch.tensor([[0, 1], [1, 2]], device=device)
>>> relposes = gt_nodes[edges[:, 0]].Inv() @ gt_nodes[edges[:, 1]]
>>> init = gt_nodes[1:] * pp.randn_SE3(2, sigma=0.1, device=device, dtype=dtype)
>>> graph = PoseGraph(gt_nodes[:1], init).to(device)
>>> strategy = pp.optim.strategy.Constant(damping=1e-4)
>>> optimizer = pp.optim.LM(
...     graph,
...     solver=pp.optim.solver.PCG(),
...     strategy=strategy,
...     sparse=True)
...
>>> for idx in range(5):
...     loss = optimizer.step(input=(edges, relposes))
...     print('Sparse chain PGO loss %.7f @ %d it'%(loss, idx))
...     if loss < 1e-5:
...         print('Early Stopping with loss:', loss.item())
...         break
...
Sparse chain PGO loss 0.0265935 @ 0 it
Sparse chain PGO loss 0.0000001 @ 1 it
Early Stopping with loss: 6.876693949595198e-08
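Why a chain like the one above suits a sparse backend can be seen on a one-dimensional analogue, sketched here in pure Python. Each edge residual \(r_k = (x_{k+1} - x_k) - z_k\) depends on at most two node positions, so almost every Jacobian entry is zero. The function `chain_jacobian` is hypothetical and stores the matrix densely only to count its nonzeros; it says nothing about how the bae backend represents sparse Jacobians.

```python
def chain_jacobian(num_nodes):
    """Jacobian of r_k = (x_{k+1} - x_k) - z_k w.r.t. the free nodes x_1..x_{n-1}.

    Node x_0 is held fixed (like the root pose), so it has no column.
    """
    edges = [(k, k + 1) for k in range(num_nodes - 1)]
    J = [[0.0] * (num_nodes - 1) for _ in edges]
    for row, (i, j) in enumerate(edges):
        if i > 0:
            J[row][i - 1] = -1.0   # derivative w.r.t. the edge's first node
        J[row][j - 1] = 1.0        # derivative w.r.t. the edge's second node
    return J

J = chain_jacobian(1000)
nonzeros = sum(v != 0.0 for row in J for v in row)
density = nonzeros / (len(J) * len(J[0]))   # fraction of entries that are nonzero
```

For 1000 nodes only about 0.2% of the entries are nonzero, and the fraction shrinks further as the chain grows, which is the regime the sparse=True note describes.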