3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding
environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task,
independently classifying each voxel. However, this paradigm neglects the critical instance-centric discriminability, leading to incomplete instances and ambiguity between adjacent objects. To address this, we highlight a free lunch of occupancy labels: the voxel-level class labels implicitly carry instance-level cues that have so far been overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that converts voxel-level class labels into instance-level offset labels at no extra cost.
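The conversion itself is detailed in the paper body; purely as a minimal sketch of the idea, one can read six-directional border distances directly off the class-label grid, treating any label change (and, by a convention chosen here for illustration, the grid edge) as an object border. The function name and conventions below are ours, not the authors' reference implementation:

```python
import numpy as np

def voxel_to_offset_labels(labels: np.ndarray) -> np.ndarray:
    """Convert a (D, H, W) voxel class-label grid into a (6, D, H, W)
    field of per-voxel distances (in voxels) to the nearest label change
    along the +/- direction of each grid axis. A label change is treated
    as an object border; the grid edge also counts as a border here."""

    def dist_to_change(lab: np.ndarray) -> np.ndarray:
        # Distance to the next label change along dim 0, scanning forward.
        dist = np.zeros(lab.shape, dtype=np.int32)
        for i in range(lab.shape[0] - 2, -1, -1):
            same = lab[i] == lab[i + 1]          # still inside the same region?
            dist[i] = np.where(same, dist[i + 1] + 1, 0)
        return dist

    offsets = np.empty((6,) + labels.shape, dtype=np.int32)
    for axis in range(3):
        fwd = np.moveaxis(labels, axis, 0)       # put the scanned axis first
        offsets[2 * axis] = np.moveaxis(dist_to_change(fwd), 0, axis)
        bwd = np.flip(fwd, axis=0)               # opposite direction
        offsets[2 * axis + 1] = np.moveaxis(
            np.flip(dist_to_change(bwd), axis=0), 0, axis)
    return offsets
```

The six resulting distance maps can then serve as the instance-level regression targets described above.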
Building on this, we further propose VoxDet, an instance-centric framework that reformulates voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, starting from the lifted 3D volume, VoxDet first uses (a) a Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, learning task-specific spatial deformations in the densely projected tri-perceptive space.
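Only as an illustration of what such decoupling could look like (the actual encoder may differ), the sketch below projects the volume onto the three axis-aligned planes, applies a per-task deformable convolution on each plane, and lifts the result back to 3D; mean-pooling stands in for the dense projection, and all class names are hypothetical:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TriPlaneDeformBranch(nn.Module):
    """One task-specific branch (hypothetical): project the 3D volume onto
    the three axis-aligned planes, deform each plane with a learned offset
    field, then lift the refined planes back to 3D."""

    def __init__(self, c: int):
        super().__init__()
        # One offset predictor + deformable conv per plane (3x3 kernel -> 18 offset channels).
        self.offsets = nn.ModuleList([nn.Conv2d(c, 18, 3, padding=1) for _ in range(3)])
        self.deforms = nn.ModuleList([DeformConv2d(c, c, 3, padding=1) for _ in range(3)])

    def forward(self, vol: torch.Tensor) -> torch.Tensor:  # vol: (B, C, D, H, W)
        B, C, D, H, W = vol.shape
        planes = [vol.mean(dim=d) for d in (2, 3, 4)]      # dense projections (mean-pooled here)
        refined = [dcn(p, off(p)) for p, off, dcn in zip(planes, self.offsets, self.deforms)]
        lifted = (refined[0].unsqueeze(2).expand(B, C, D, H, W)    # (B, C, H, W) plane
                  + refined[1].unsqueeze(3).expand(B, C, D, H, W)  # (B, C, D, W) plane
                  + refined[2].unsqueeze(4).expand(B, C, D, H, W)) # (B, C, D, H) plane
        return vol + lifted

class SpatiallyDecoupledEncoder(nn.Module):
    """Two parallel branches yield disentangled volumes for the offset
    and semantic sub-tasks."""

    def __init__(self, c: int):
        super().__init__()
        self.offset_branch = TriPlaneDeformBranch(c)
        self.semantic_branch = TriPlaneDeformBranch(c)

    def forward(self, vol):
        return self.offset_branch(vol), self.semantic_branch(vol)
```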
Then, we deploy (b) a Task-decoupled Dense Predictor that addresses occupancy prediction via dense detection. Here, we first regress a 4D offset field estimating, for each voxel, the distances to its object borders along six directions in voxel space. The regressed offsets then guide the instance-level aggregation in the classification branch, yielding instance-aware predictions.
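One plausible instantiation of this guidance, sketched below in PyTorch, uses the six regressed distances to locate each voxel's implied instance centre and fuses the centre feature into the local one, so all voxels of an object share instance-level evidence; the function name and the additive fusion are assumptions, not VoxDet's exact design:

```python
import torch

def offset_guided_aggregation(feat: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Gather, for every voxel, the semantic feature at the instance centre
    implied by its six regressed border distances, and fuse it with the
    local feature.

    feat:    (B, C, D, H, W) semantic feature volume
    offsets: (B, 6, D, H, W) border distances along +/-d, +/-h, +/-w
    """
    B, C, D, H, W = feat.shape
    dev = feat.device
    z, y, x = torch.meshgrid(torch.arange(D, device=dev),
                             torch.arange(H, device=dev),
                             torch.arange(W, device=dev), indexing="ij")
    # Centre of the box spanned by each pair of opposing border distances.
    cz = (z + (offsets[:, 0] - offsets[:, 1]) / 2).round().long().clamp(0, D - 1)
    cy = (y + (offsets[:, 2] - offsets[:, 3]) / 2).round().long().clamp(0, H - 1)
    cx = (x + (offsets[:, 4] - offsets[:, 5]) / 2).round().long().clamp(0, W - 1)
    idx = ((cz * H + cy) * W + cx).flatten(1)          # (B, D*H*W) flat centre indices
    flat = feat.flatten(2)                             # (B, C, D*H*W)
    centre = flat.gather(2, idx.unsqueeze(1).expand(B, C, -1)).view(B, C, D, H, W)
    return feat + centre   # additive fusion is an assumption chosen for simplicity
```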
Experiments show that VoxDet can be deployed on both camera and LiDAR input, achieving state-of-the-art results on the benchmarks of both modalities. VoxDet is not only highly efficient but also accurate, reaching 63.0 IoU on the SemanticKITTI test set and ranking 1st on the online leaderboard.