BaseBoostDepth: Exploiting Larger Baselines for Self-Supervised Monocular Depth Estimation

Aston University1, Loughborough University2

We demonstrate significant edge-based depth improvements over the previous state of the art.

Abstract

In the domain of multi-baseline stereo, the conventional understanding is that increasing baseline separation substantially enhances the accuracy of depth estimation. However, prevailing self-supervised depth estimation architectures primarily use minimal frame separation and a constrained stereo baseline. Larger frame separations can be employed, but we show that doing so degrades depth quality due to several factors, including significant changes in brightness and increased areas of occlusion. In response to these challenges, our proposed method, BaseBoostDepth, incorporates a curriculum learning-inspired optimization strategy to effectively leverage larger frame separations. We show, however, that this strategy alone does not suffice, as larger baselines still cause pose estimation to drift. We therefore introduce incremental pose estimation to improve pose accuracy, yielding significant improvements across all depth metrics. Additionally, to improve the robustness of the model, we introduce error-induced reconstructions, which optimize reconstructions computed with deliberately perturbed pose estimates. Ultimately, our final depth network achieves state-of-the-art performance on the KITTI and SYNS-patches datasets across image-based, edge-based, and point cloud-based metrics, without increasing computational complexity at test time.
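The abstract's error-induced reconstructions perturb the estimated camera pose before view synthesis, so the photometric loss is also optimized under pose error. The paper's exact noise model is not reproduced here; the following is a minimal sketch assuming poses are 4x4 SE(3) matrices, with illustrative function names (`perturb_pose`, `rotation_from_axis_angle`) and arbitrary noise magnitudes.

```python
import numpy as np

def rotation_from_axis_angle(axis_angle):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def perturb_pose(T, rot_std=0.01, trans_std=0.01, rng=None):
    """Left-multiply a pose by a small random SE(3) perturbation.

    During training, the view synthesized with this perturbed pose is
    also optimized, encouraging robustness to pose-estimation error.
    The noise scales here are placeholders, not the paper's values.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = np.eye(4)
    noise[:3, :3] = rotation_from_axis_angle(rng.normal(0.0, rot_std, 3))
    noise[:3, 3] = rng.normal(0.0, trans_std, 3)
    return noise @ T
```

The perturbation stays a valid rigid transform (orthonormal rotation, unit determinant), so the warped reconstruction remains geometrically meaningful.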

Method

We present BaseBoostDepth, a two-stage optimization strategy capable of accurately estimating depth with clearly defined object boundaries. Unlike previous methods, we effectively exploit wider baselines and their stronger brightness and contrast cues, resulting in state-of-the-art depth estimations. Enhanced object boundaries are achieved through our boosting stage, which is agnostic to the depth backbone, enabling any pretrained self-supervised model to be boosted to attain improved object boundary definition. The overall framework is illustrated below.
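Incremental pose estimation, mentioned in the abstract, counters the drift that arises when pose is regressed directly between widely separated frames: short-range poses between adjacent frames are chained to reach the wide-baseline pose. The paper's network details are not given here; this is a minimal sketch assuming 4x4 SE(3) matrices and the convention T_{0->n} = T_{n-1->n} · ... · T_{0->1}, with an illustrative function name.

```python
import numpy as np

def incremental_pose(pairwise_poses):
    """Compose adjacent-frame poses T_{t->t+1} into one wide-baseline pose.

    Each short-range estimate is easier for a pose network than a direct
    wide-baseline regression, so chaining them accumulates less drift.
    """
    T = np.eye(4)
    for T_step in pairwise_poses:  # ordered T_{0->1}, T_{1->2}, ...
        T = T_step @ T
    return T
```

For example, three unit translations along x compose to a single 3-unit translation, matching the pose a direct wide-baseline estimate would have to recover in one shot.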


Video Presentation

Results on KITTI Eigen

We achieve state-of-the-art results on the KITTI test set when using the SQLdepth backbone at a resolution of 640x192, using only one frame at inference with no post-processing. Our base method, BaseBoostDepth, employs the same backbone as Monodepth2 and begins training from ImageNet pre-trained weights. In contrast, † uses MonoViT's backbone, starting from MonoViT's pre-trained weights, and * uses SQLdepth's depth backbone, also starting from MonoViT's pre-trained weights. Both of these variants begin training from the boosting stage.

Experiments on KITTI

Results on SYNS

We also achieve state-of-the-art results on the SYNS test set when using MonoViT's backbone at a resolution of 640x192, using only one frame at inference with no post-processing.

Experiments on SYNS

Comparison with MiDaSv3.1

BaseBoostDepth demonstrates competitive performance against state-of-the-art supervised methods, and excels at recovering fine details of intricate structures.