VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection

1VITA@EPFL, 2Zhejiang University

Too Long; Didn't Read

VoxDet addresses semantic occupancy prediction with an instance-centric formulation inspired by dense object detection: a Voxel-to-Instance (VoxNT) trick freely transfers voxel-level class labels to instance-level offset labels.

Key Features

  • Versatile: Adaptable to various voxel-based scenarios, such as camera and LiDAR settings.
  • Powerful: Achieves joint state-of-the-art results on camera-based and LiDAR-based benchmarks.
  • Efficient: Fast (~1.3× speed-up) and lightweight (~57.9% fewer parameters).
  • Leaderboard Topper: Achieves 63.0 IoU (single-frame, single-model, no extra data/labels), securing 1st place on the online SemanticKITTI leaderboard.

Observation: Free Lunch in Voxel Labels

Free Lunch: Voxel-level class labels inherently provide instance-level insights, which have been overlooked by the community. This holds even when only class labels (and not instance labels) are available:
Left: Pixel-level class labels fail to discover or regress instances due to 2D occlusion.
Right: Voxel-level class labels can discover and regress instances thanks to their occlusion-free nature in 3D.

VoxNT Trick: Generate Free Offset Labels

The VoxNT trick freely transfers voxel-level class labels to instance-level offset labels by exploiting the free lunch observed above: it densely scans each voxel along 6 directions (x⁺, x⁻, y⁺, y⁻, z⁺, z⁻) and stops when the voxel label changes, which indicates an approaching object border.
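The scanning step above can be sketched in NumPy. This is an illustrative reconstruction, not the paper's released code: `voxnt_offsets` is a hypothetical name, and the offset convention (distance in voxels to the nearest label change along each direction, 0 at a border) is an assumption.

```python
# Sketch of the VoxNT trick (assumed implementation): convert voxel-level
# class labels into per-voxel offset labels by scanning along the six axis
# directions (x+, x-, y+, y-, z+, z-) until the class label changes.
import numpy as np

def voxnt_offsets(labels: np.ndarray) -> np.ndarray:
    """labels: (X, Y, Z) int array of class ids. Returns an (X, Y, Z, 6)
    array of distances (in voxels) to the nearest label change per direction."""
    offsets = np.zeros(labels.shape + (6,), dtype=np.int32)
    # (axis, step) pairs for x+, x-, y+, y-, z+, z-
    directions = [(0, 1), (0, -1), (1, 1), (1, -1), (2, 1), (2, -1)]
    for d, (axis, step) in enumerate(directions):
        n = labels.shape[axis]
        dist = np.zeros_like(labels, dtype=np.int32)  # 0 at the volume edge
        # sweep from the far end toward the near end along `axis`
        idx = range(n - 2, -1, -1) if step == 1 else range(1, n)
        for i in idx:
            cur = np.take(labels, i, axis=axis)
            nxt = np.take(labels, i + step, axis=axis)
            prev = np.take(dist, i + step, axis=axis)
            # same label as the neighbour: extend the run; else border is adjacent
            run = np.where(cur == nxt, prev + 1, 0)
            sl = [slice(None)] * 3
            sl[axis] = i
            dist[tuple(sl)] = run
        offsets[..., d] = dist
    return offsets
```

Because the sweep reuses the already-computed distance of the neighbouring voxel, each direction costs a single pass over the volume, so the labels really are "free" to generate.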

VoxDet: Fully Using the Generated Free Offset Labels

VoxDet reformulates voxel-level occupancy prediction as instance-level dense object detection to achieve instance-centric learning, decoupling the task into two sub-tasks: offset regression and semantic prediction. This decoupling builds on the offset labels generated by the VoxNT trick.
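One way the regressed offsets can guide the classification branch is by pooling features from the six estimated object borders, so each voxel's prediction sees its instance extent. The sketch below is a simplified NumPy illustration of this idea, not the paper's exact predictor; `offset_guided_aggregation` and the uniform averaging scheme are assumptions.

```python
# Illustrative sketch: use a regressed 6-direction offset field (integer voxel
# distances) to aggregate features over the instance extent before classifying.
import numpy as np

def offset_guided_aggregation(feat: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """feat: (X, Y, Z, C) feature volume; offsets: (X, Y, Z, 6) integer voxel
    distances to the object border along x+, x-, y+, y-, z+, z-.
    Returns an instance-aware (X, Y, Z, C) feature volume."""
    X, Y, Z, _ = feat.shape
    xs, ys, zs = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
    agg = feat.copy()  # start from the centre voxel's own feature
    directions = [(xs, 0, 1), (xs, 0, -1), (ys, 1, 1), (ys, 1, -1), (zs, 2, 1), (zs, 2, -1)]
    for d, (coord, axis, step) in enumerate(directions):
        # index of the estimated border voxel in this direction, clipped to the grid
        border = np.clip(coord + step * offsets[..., d], 0, feat.shape[axis] - 1)
        idx = [xs, ys, zs]
        idx[axis] = border
        agg += feat[idx[0], idx[1], idx[2]]  # sample the feature at the border voxel
    return agg / 7.0  # average over the centre voxel and six border samples
```

A classifier applied to the aggregated volume then predicts each voxel's class from instance-level context rather than the voxel in isolation, which is the instance-aware prediction the decoupled design targets.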

Qualitative Comparison

Camera/LiDAR-based Benchmarks

VoxDet achieves state-of-the-art performance on camera-based benchmarks, including SemanticKITTI and SSCBench-KITTI-360 (the 1st and 2nd tables), and on the LiDAR-based SemanticKITTI benchmark (the 3rd table).

Model Efficiency & Monocular Adaptation

VoxDet is highly efficient (Left) in terms of model parameters and inference speed. In addition, VoxDet achieves the best results when using monocular depth (Right).

Abstract

3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction. Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also gives 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.

BibTeX

If you find our work helpful, please consider citing:


@article{li2025voxdet,
  title={VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection},
  author={Li, Wuyang and Yu, Zhu and Alahi, Alexandre},
  journal={arXiv preprint arXiv:2506.04623},
  year={2025}
}