Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction

Shuliang Zhu, Tomiwa Adey, Jinjia Zhou · arXiv:2605.16832. A 0.2M-parameter semantic interface that makes point-cloud language models see doors, windows, and cluttered furniture — and lets you steer their 3D predictions by clicking.

The problem: pooling is a vote, and small things lose

LLM-conditioned indoor prediction — the recipe behind SpatialLM-style systems — works like this: a point cloud goes in, a sparse encoder tokenises it, and a language model decodes a structured description of the scene: walls, doors, windows, and oriented boxes for furniture. It is an elegant pipeline, and the backbone of how DecorLM reasons about space.

It also has a consistent blind spot. Sparse voxelisation is a pooling operation, and pooling is a vote: small things lose. A door is a handful of points on a plane of thousands; a window is a hole the encoder mostly averages away. The result is a model that draws confident walls where there are openings, and misses individual furniture instances in clutter — precisely the details that matter if you want to reason about how people move through and live in a space.
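A toy calculation makes the dilution concrete. The sketch below assumes plain mean pooling over one pooled region and a hypothetical one-dimensional "opening" indicator feature; the point counts are illustrative and are not taken from the paper.

```python
def mean_pool(features):
    """Average a list of per-point feature values, as a mean-pooling step would."""
    return sum(features) / len(features)

# Hypothetical pooled region straddling a door frame:
# 1000 wall points and 5 door points (illustrative counts).
# The feature is 1.0 for points on an opening and 0.0 otherwise.
wall_points = [0.0] * 1000
door_points = [1.0] * 5

pooled = mean_pool(wall_points + door_points)
print(f"pooled opening signal: {pooled:.4f}")
```

Under these assumptions the door contributes less than one percent of the pooled value, which is the sense in which small things lose the vote.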

The method: four colours, painted into the points

The fix is almost embarrassingly direct. Instead of redesigning the decoder or trying to align latent tokens after the fact, we inject coarse semantics into the point cloud itself, before tokenisation. Every point is assigned to one of four groups and gets a fixed colour for it: red for furniture, green for walls, blue for openings, black for everything else. That colour is appended to the point’s raw attributes, so semantic evidence and geometric evidence travel the same sparse pathway through the same encoder. No new interface, no decoder changes, no retokenisation — which also means any improvement is attributable to the semantics, not to a different token budget.
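The painting step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each point's raw attributes are just (x, y, z), that colours live on a 0 to 1 scale, and that a group label per point is already available from some upstream source. The names `GROUP_COLOURS` and `paint_points` are hypothetical.

```python
# Fixed colour per coarse group, following the four-way code in the text.
GROUP_COLOURS = {
    "furniture": (1.0, 0.0, 0.0),  # red
    "wall": (0.0, 1.0, 0.0),       # green
    "opening": (0.0, 0.0, 1.0),    # blue
    "other": (0.0, 0.0, 0.0),      # black
}

def paint_points(points, groups):
    """Append each point's group colour to its raw attributes.

    points: list of raw attribute tuples (assumed here to be x, y, z)
    groups: list of group names, one per point
    """
    if len(points) != len(groups):
        raise ValueError("need exactly one group label per point")
    return [tuple(p) + GROUP_COLOURS[g] for p, g in zip(points, groups)]

points = [(0.0, 0.0, 0.0), (1.2, 0.0, 1.0), (2.0, 0.5, 0.4)]
groups = ["wall", "opening", "furniture"]
painted = paint_points(points, groups)
print(painted[1])  # xyz followed by the blue "opening" colour
```

Because the colour is just three more per-point attributes, the tokeniser and everything downstream of it see the same number of points and tokens as before.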

Because voxel pooling attenuates whatever signal you inject, the method adds one small counterweight: a semantic shift module with 0.2M trainable parameters. For each non-background group, the points of that class are encoded into a prototype token; a small per-class network turns each prototype into a delta vector confined to its own block of channels; and a geometry-side router decides, token by token, how much of each class’s delta to mix back in as a residual. The decoder never changes.
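A rough sketch of that data flow, under loud assumptions: the prototype encoder is replaced by a plain mean, each per-class network is a single linear map, and the router uses one sigmoid gate per class. None of these choices is confirmed by the text, and every name and dimension below is hypothetical. The sketch only shows how block-confined deltas and token-wise gates combine into a residual.

```python
import math

CLASSES = ["furniture", "wall", "opening"]  # the non-background groups
DIM = 6                                     # token width (illustrative)
BLOCK = DIM // len(CLASSES)                 # channels reserved per class

def prototype(class_points):
    """Stand-in encoder: average a class's point features into one token."""
    n = len(class_points)
    width = len(class_points[0])
    return [sum(p[i] for p in class_points) / n for i in range(width)]

def class_delta(proto, weights, class_index):
    """Linear map whose output fills only this class's block of channels."""
    delta = [0.0] * DIM
    for j, row in enumerate(weights):  # one weight row per block channel
        delta[class_index * BLOCK + j] = sum(w * x for w, x in zip(row, proto))
    return delta

def route(token, router_weights):
    """Router stand-in: one sigmoid gate per class, computed from the token."""
    return [1.0 / (1.0 + math.exp(-sum(w * t for w, t in zip(row, token))))
            for row in router_weights]

def shift(token, deltas, gates):
    """Residual update: add each class delta scaled by its gate."""
    return [t + sum(g * d[i] for g, d in zip(gates, deltas))
            for i, t in enumerate(token)]

points_by_class = {
    "furniture": [[1.0, 0.0], [3.0, 2.0]],
    "wall": [[0.0, 1.0]],
    "opening": [[2.0, 2.0]],
}
# One toy weight matrix per class (identity here, purely for readability).
weights = {c: [[1.0, 0.0], [0.0, 1.0]] for c in CLASSES}
deltas = [class_delta(prototype(points_by_class[c]), weights[c], k)
          for k, c in enumerate(CLASSES)]

token = [0.0] * DIM
router_weights = [[0.0] * DIM for _ in CLASSES]  # all-zero router: every gate is 0.5
gates = route(token, router_weights)
shifted = shift(token, deltas, gates)
print(shifted)
```

Keeping each class's delta in its own channel block means one class's shift cannot overwrite another's, and the residual form leaves the token untouched wherever the gates are near zero.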

Training stays almost vanilla: the standard autoregressive loss plus three lightweight constraints that keep the router honest — a KL term matching predicted semantic ratios to the empirical ones, a budget penalty on total routing mass, and an entropy term that stops the routing from collapsing onto one class.
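One plausible reading of the three constraints is sketched below. The exact forms are assumptions: the direction of the KL term, the squared hinge used for the budget, and the per-token entropy are illustrative choices rather than the paper's definitions, and `router_regularisers` and its `budget` parameter are hypothetical names.

```python
import math

EPS = 1e-8  # guards against division by zero and log(0)

def normalise(values):
    """Scale non-negative values so they sum to (almost exactly) one."""
    total = sum(values)
    return [v / (total + EPS) for v in values]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the classes."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))

def entropy(gate_row):
    """Entropy of one token's routing weights after normalisation."""
    return -sum(pi * math.log(pi + EPS) for pi in normalise(gate_row))

def router_regularisers(gates, class_counts, budget=1.0):
    """gates: per-token lists of per-class routing weights.
    class_counts: number of points observed in each class."""
    n_tokens = len(gates)
    mean_mass = [sum(g[c] for g in gates) / n_tokens
                 for c in range(len(class_counts))]
    # Ratio term: predicted class ratios should match the empirical ones.
    ratio_kl = kl_divergence(normalise(class_counts), normalise(mean_mass))
    # Budget term: penalise average total routing mass above the allowance.
    budget_penalty = max(0.0, sum(mean_mass) - budget) ** 2
    # Entropy term: negated, so minimising it spreads routing across classes.
    neg_entropy = -sum(entropy(g) for g in gates) / n_tokens
    return ratio_kl, budget_penalty, neg_entropy

gates = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
class_counts = [2, 1, 1]
ratio_kl, budget_penalty, neg_entropy = router_regularisers(gates, class_counts)
print(ratio_kl, budget_penalty, neg_entropy)
```

In a training loop these three values would be added to the autoregressive loss with small weights; the weights themselves are hyperparameters this sketch does not try to guess.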

Results

We evaluate on three benchmarks against the SpatialLM baseline: synthetic layouts (Structured3D), joint layout-and-furniture prediction (the SpatialLM benchmark), and real-world RGB-D scans (ARKitScenes). The method beats the baseline on all three, and the margin is largest at strict thresholds, which is where thin structures live.

Benchmark · metric	Coarse Semantic Injection	SpatialLM baseline
Structured3D · Layout F1 @ 0.5	0.599	0.498
ARKitScenes · 3D Box F1 @ 0.5	0.342	0.231
SpatialLM benchmark · Full-house @ 0.25	0.452	0.335

At the looser 0.25 threshold the gains hold (Structured3D 0.592 → 0.634; ARKitScenes 0.366 → 0.430), and on the failure modes the method was designed for, the lift is concentrated exactly where it should be:

Targeted failure mode	Δ F1
Thin openings (doors, windows)	+0.08
Cluttered furniture	+0.05
Low-support scenes	+0.22

The gains also transfer across point encoders — Sonata picks up +0.035 to +0.042 F1 across the three benchmarks, the newer Utonia +0.012 to +0.021 — so this is a property of the interface, not a quirk of one backbone.

The controls that make us believe it

Headline numbers are cheap; controls are where a method earns trust. Each Δ below is a descriptive difference between reported point estimates: the control's score minus the full method's score on the same metric.

Control	Metric	Control score	Full method score	Δ
Random four-colour code	Structured3D F1@0.25	0.595	0.634	−0.039
Fusion after the encoder	ARKitScenes F1@0.25	0.309	0.401	−0.092
Colours without shift module	Structured3D F1@0.25	0.623	0.634	−0.011
Without the ratio constraint	ARKitScenes F1@0.25	0.148	0.401	−0.253

Read top to bottom: it is the meaning doing the work, not the extra channels; inject before tokenisation, because latent tokens misalign after sparse pooling; the 0.2M counterweight earns its place; and the routing constraints are not decoration — drop the ratio term and box prediction collapses. The ablations also surface an honest tension: removing the entropy term helps layout (+0.014) while hurting furniture boxes. Layout and furniture are partially competing objectives, not a free lunch.
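The Δ column in the controls table can be recomputed directly from the two score columns. The snippet below copies the four reported rows and does nothing more than subtract; the variable names are hypothetical.

```python
# (control name, control score, full-method score), copied from the table.
controls = [
    ("Random four-colour code", 0.595, 0.634),
    ("Fusion after the encoder", 0.309, 0.401),
    ("Colours without shift module", 0.623, 0.634),
    ("Without the ratio constraint", 0.148, 0.401),
]

# Delta is control minus full method, rounded to the table's three decimals.
deltas = {name: round(control - full, 3) for name, control, full in controls}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")

# The control that costs the most when changed or removed.
largest_drop = min(deltas, key=deltas.get)
```

Sorting by Δ rather than by table order puts the ratio constraint first, which matches the observation that box prediction collapses without it.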

A model you can point at things

Because semantics enter through the points, they become a handle you can hold at inference time. The protocol: click an object in an RGB view; SAM 3 lifts the click to an instance mask; the mask transfers onto the aligned point map; the target’s points get their active semantic colour while everything else goes black; and the unchanged, frozen decoder does the rest. No retraining, no new parsing.

Click-to-steer, 42 clicked targets	Score
Target hit rate at full intensity	95.2%
Single matched box, no false positives	88.1%
Target F1@0.25 at 100% / 50% / 25% intensity	0.842 / 0.345 / 0.211

Scaling the colour intensity scales the influence, though not smoothly: halving the intensity cuts target F1@0.25 from 0.842 to 0.345, a steeper drop than the intensity itself. Steering power first, perfect dials later. Pointing at an object and having a structured 3D predictor attend to it is a small preview of what controllable specialist models will feel like to use.
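The recolouring half of the click-to-steer protocol can be sketched as follows. The segmentation step is taken as given: the code starts from an instance mask and an aligned pixel-to-point map, both supplied as plain nested lists. Treating intensity as a multiplicative scale on the colour is an assumption, and the function names are hypothetical.

```python
GROUP_COLOURS = {
    "furniture": (1.0, 0.0, 0.0),  # red
    "wall": (0.0, 1.0, 0.0),       # green
    "opening": (0.0, 0.0, 1.0),    # blue
}
BLACK = (0.0, 0.0, 0.0)

def mask_to_point_indices(mask, pixel_to_point):
    """Transfer a 2D instance mask onto the point cloud.

    mask: 2D list of booleans from the segmenter
    pixel_to_point: 2D list giving each pixel's point index, or None
    """
    hits = set()
    for mask_row, map_row in zip(mask, pixel_to_point):
        for selected, point_index in zip(mask_row, map_row):
            if selected and point_index is not None:
                hits.add(point_index)
    return sorted(hits)

def steer_colours(num_points, target_indices, group, intensity=1.0):
    """Give target points their group colour, scaled; everything else goes black."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    scaled = tuple(intensity * c for c in GROUP_COLOURS[group])
    targets = set(target_indices)
    return [scaled if i in targets else BLACK for i in range(num_points)]

mask = [[True, False], [True, True]]
pixel_to_point = [[0, 1], [None, 2]]  # one pixel has no aligned point
target = mask_to_point_indices(mask, pixel_to_point)
colours = steer_colours(4, target, "furniture", intensity=0.5)
print(target, colours)
```

The recoloured points would then pass through the unchanged encoder and frozen decoder; nothing in this sketch touches model weights.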

Limitations

As stated in the paper: the four-way code is deliberately coarse; quality depends on mask quality and view coverage; and the disentanglement between semantic meaning and raw colour-channel effects is not yet complete. Future work points at richer but still robust semantic codes, confidence-aware multi-view fusion, and stronger controls.

Why this matters to us

DecorLM’s job is interior space: design, floor-plan reasoning, furniture. Doors, windows, and openings are not edge cases for that job — they are the job. A general 3D model can afford to average them away. A specialist cannot.

There is also a meta-lesson, and it is one we will keep repeating: find the narrow failure, fix it with the smallest intervention that preserves the interfaces, and measure relentlessly. Four colours and 0.2M parameters. This is what frontier specialist research looks like — the kind of result you only find when a model has one job.
