We propagate user-provided, single-view 2D edits to the multi-view representation of a 3D asset. This enables fast, mask-free, high-fidelity, and consistent 3D editing with intuitive control.
[Teaser figure: five pairs of source views and their edited counterparts]
We present EditP23, a method for mask-free 3D editing that propagates 2D image edits to multi-view representations in a 3D-consistent manner. In contrast to traditional approaches that rely on text-based prompting or explicit spatial masks, EditP23 enables intuitive edits by conditioning on a pair of images: an original view and its user-edited counterpart. These prompts guide an edit-aware flow in the latent space of a pre-trained multi-view diffusion model, coherently propagating the edit across views. Operating in a feed-forward manner, without per-edit optimization, our method preserves the object's identity in both structure and appearance. We demonstrate its effectiveness across diverse object categories and editing tasks, achieving high fidelity to the source object without requiring masks.
Results of our method across diverse object categories. Each block compares a source object (top) with its edited versions (below). The leftmost column shows the conditioning views (source and target) used to prompt the edit, while the remaining columns present novel viewpoints of the result. Our approach consistently applies the desired edit while preserving the object's structure and identity across all viewpoints.
[Result panel labels: Original → Pixar style, Wings · Original → Old, Zombie · Original → Fantasy · Original → Donut · Original → Tail · Original → Oreo · Original → Vintage, Modern · Original → Plush, Pixar]
The EditP23 pipeline propagates your 2D edit into a full, 3D-consistent object modification. The process is designed to be intuitive, requiring only a single edited view to guide the entire 3D update. Here's how it unfolds:
The process begins with a 3D object, which is rendered into a multi-view grid (mv-grid) of six different viewpoints, along with an additional fixed prompt view.
The user can take the prompt view and modify it with any preferred 2D editing tool, such as painting or generative AI. This user-edited image becomes the target view, which guides the 3D edit.
The core of the method is a technique called "edit-aware denoising". At each denoising step, the system runs two parallel predictions within a multi-view diffusion model: one conditioned on the original prompt view and one conditioned on the user-edited target view. The difference between these predictions defines an edit-aware direction in latent space, steering the multi-view latents so the edit is propagated coherently across all views (see the sketch below).
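To make this step concrete, here is a minimal sketch of what one edit-aware denoising step could look like, assuming a diffusers-style scheduler (whose step() call returns an object with a prev_sample field) and a denoiser that takes the multi-view latents, a timestep, and a conditioning image. The names, signature, and guidance form are illustrative assumptions, not the exact formulation used by EditP23.

import torch


@torch.no_grad()
def edit_aware_step(denoiser, scheduler, latents, t, source_view, edited_view, edit_scale=1.0):
    # Two parallel predictions from the same multi-view diffusion model:
    # one conditioned on the original prompt view (identity-preserving) and
    # one conditioned on the user-edited view (edit-carrying).
    eps_source = denoiser(latents, t, cond=source_view)
    eps_edited = denoiser(latents, t, cond=edited_view)
    # Their difference acts as an edit-aware direction that steers the
    # multi-view latents toward the user's edit while keeping views consistent.
    eps = eps_source + edit_scale * (eps_edited - eps_source)
    # Standard reverse-diffusion update (diffusers-style scheduler assumed).
    return scheduler.step(eps, t, latents).prev_sample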
Once the diffusion process completes, the final edited multi-view grid is obtained. This grid is then converted into a textured 3D mesh using a reconstruction module such as InstantMesh. The output is a fully edited 3D object.
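For orientation, the overall flow might be orchestrated roughly as follows. Every helper here (render_mv_grid, apply_user_edit, init_latents, decode_latents, reconstruct_mesh) is a hypothetical placeholder for the rendering, 2D editing, latent initialization, decoding, and InstantMesh-style reconstruction stages, and edit_aware_step is the sketch shown above.

def editp23_pipeline(mesh, render_mv_grid, apply_user_edit, init_latents,
                     denoiser, scheduler, decode_latents, reconstruct_mesh,
                     num_views=6):
    # Render the source object into a six-view mv-grid plus a fixed prompt view.
    mv_grid, prompt_view = render_mv_grid(mesh, num_views)
    # The user edits only the prompt view with any preferred 2D tool.
    edited_view = apply_user_edit(prompt_view)
    # Initialize latents from the source mv-grid (hypothetical helper).
    latents = init_latents(mv_grid)
    # Feed-forward edit-aware denoising: no per-edit optimization.
    for t in scheduler.timesteps:
        latents = edit_aware_step(denoiser, scheduler, latents, t,
                                  prompt_view, edited_view)
    # Decode the edited mv-grid and reconstruct a textured mesh (e.g., InstantMesh-style).
    edited_grid = decode_latents(latents)
    return reconstruct_mesh(edited_grid)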
[Results gallery: Cartoonish car · Car with wings · Fox with open eyes · Golden R2D2 · Robot with sunglasses · Terrier with beanie · Batman with backpack · Grogu with red robe · Fox with tuxedo · Gothic cathedral · Terrier with Paddington's hat · Vespa · Grogu in the-force pose · Superman in Superman pose · Batmobile · Lego figure of Grogu]