Watch Your Steps 👀👣

Local Image and Scene Editing by Text Instructions

¹Samsung AI Centre Toronto, ²University of Toronto, ³York University, ⁴Vector Institute for AI

TL;DR: A localized image and scene editing method based on denoising diffusion models


We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edit, and is used to guide the modifications. This guidance ensures that irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields (NeRFs). A field, denoted the relevance field, is trained on the relevance maps of the training views and defines the 3D region that should be edited. We then iteratively update the training views, guided by relevance maps rendered from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks.
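As a rough illustration of the relevance-map idea, the sketch below takes two noise predictions, one conditioned on the instruction and one not, and turns their per-pixel discrepancy into a map in [0, 1]. It is a minimal NumPy sketch, not the authors' implementation: the function names `relevance_map` and `masked_blend`, the channel-averaged absolute difference, the max normalization, and the 0.5 threshold are all assumptions made for illustration. In practice the two predictions would come from IP2P's denoiser; here they are plain arrays.

```python
import numpy as np

def relevance_map(pred_with, pred_without):
    """Per-pixel relevance from two predictions of shape (C, H, W).

    Assumption: relevance is the channel-averaged absolute difference,
    scaled so that the largest value is 1.
    """
    diff = np.abs(pred_with - pred_without).mean(axis=0)  # (H, W)
    peak = diff.max()
    if peak == 0:
        return np.zeros_like(diff)
    return diff / peak

def masked_blend(edited, original, relevance, threshold=0.5):
    """Keep edited pixels where relevance >= threshold, original elsewhere.

    `edited` and `original` are images of shape (H, W, C).
    The hard threshold is a hypothetical choice for this sketch.
    """
    mask = (relevance >= threshold).astype(edited.dtype)[..., None]
    return mask * edited + (1.0 - mask) * original

# Toy example: the instruction only changes a 2x2 patch of the prediction.
rng = np.random.default_rng(0)
pred_without = rng.normal(size=(4, 8, 8))
pred_with = pred_without.copy()
pred_with[:, 2:4, 2:4] += 1.0

rel = relevance_map(pred_with, pred_without)
original = np.zeros((8, 8, 3))
edited = np.ones((8, 8, 3))
result = masked_blend(edited, original, rel)
```

In this toy case only the 2x2 patch where the two predictions disagree takes values from `edited`; every other pixel is copied from `original`, which mirrors the goal of leaving irrelevant pixels unchanged.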

Image Editing Results

Iterative Dataset Updates

Scene Editing Results

Relevance Field Visualizations


          title={Watch Your Steps: Local Image and Scene Editing by Text Instructions}, 
          author={Ashkan Mirzaei and Tristan Aumentado-Armstrong and Marcus A. Brubaker and Jonathan Kelly and Alex Levinshtein and Konstantinos G. Derpanis and Igor Gilitschenski},

Some website elements are borrowed from InstructNeRF2NeRF.