  1. Few-shot Object Reasoning for Robot Instruction Following • Yoav Artzi • Workshop on Spatial Language Understanding, EMNLP 2020

  2. Task • Navigation between landmarks • Agent: quadcopter drone • Inputs: poses, raw RGB camera images, and natural language instructions

  3. Task • Example instructions: "go straight and stop before reaching the planter" • "turn left towards the globe and go forward until just before it"

  4. Mapping Instructions to Control • The drone maintains a configuration of target velocities (v, ω): linear forward velocity v and angular yaw rate ω • Each action updates the configuration or stops • Goal: learn a mapping f from the inputs (instruction, observations, pose) to configuration updates, f(·, ·, ·) = (v_t, ω_t) or STOP
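A minimal sketch of the configuration-update interface this slide describes; the class and function names below are illustrative, not taken from the talk.

```python
# Minimal sketch of the velocity-configuration interface described above.
# Names are illustrative, not taken from the talk.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class VelocityConfig:
    v: float = 0.0      # linear forward velocity
    omega: float = 0.0  # angular yaw rate

def step(config: VelocityConfig,
         action: Optional[Tuple[float, float]]) -> Optional[VelocityConfig]:
    """Apply one predicted action: a new (v_t, omega_t) pair, or None for STOP."""
    if action is None:   # STOP ends the episode
        return None
    v_t, omega_t = action
    return VelocityConfig(v=v_t, omega=omega_t)
```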

  5. Modular Approach • Build/train separate components • Symbolic meaning representation • Complex integration (pipeline diagram: Instruction, Language Understanding, Planning, Control, Perception, Mapping)

  6. Single-model Approach (a.k.a. end-to-end) • Instruction → f → Action • How to think about extensibility, interpretability, and modularity when packing everything into a single model?

  7. Single-model Approach • Extensibility: extending the model to reason about new objects after training • Interpretability: viewing how the model reasons about object grounding and trajectories • Modularity: re-using parts of the model • All within a representation learning framework

  8. Representation: Design vs. Learning • Systems that use symbolic representations are interpretable and (potentially) extensible • However: designing a representation for every possible concept is brittle and hard to scale • Instead: design the most general concepts and let representation learning fill them with content • Today, two such concepts: objects and trajectories

  9. Today Few-shot instruction following: • Few-shot language-conditioned object segmentation • Object context mapping • Integration into a visitation-prediction policy for mapping instructions to drone control

  10. Language-conditioned Object Segmentation • Input: instruction and observation images • Goal: identify and align objects and references

  11. Few-shot Version • Input: instruction, observation images, and a database • Goal: identify previously unseen objects and mentions and align them (database exemplars: blue ball, planet earth, orange cup, plant pot)

  12. Alignment via a Database • Approach: align observations and references through the database (exemplars: blue ball, planet earth, orange cup, plant pot) • Adding objects to the database extends the alignment ability • Requires only adding a few image and language exemplars

  13. Alignment via a Database • Approach: align observations and references through the database • Adding objects to the database extends the alignment ability • Requires only adding a few image and language exemplars (figure: new database entries added, e.g., melon: "the fruit", "wedge", "slice", "watermelon"; red cube: "the red cube", "red lego", "red brick")

  14. Alignment Score • Align(b, r) = Σ_o P(b | o) P(o | r) • b: bounding box • r: reference in the instruction (e.g., "the planter" in "go straight and stop before reaching the planter", "the globe" in "turn left towards the globe and go forward until just before it") • o: database object record (exemplars: blue ball, planet earth, orange cup, plant pot)

  15. Alignment Score • Align(b, r) = Σ_o P(o | b) P(b) P(o | r) / P(o) • b: bounding box • r: reference in the instruction • o: database object record
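The move from slide 14's score to this one is Bayes' rule applied to P(b | o) inside the sum (same notation as the slides):

```latex
\mathrm{Align}(b, r)
  = \sum_{o} P(b \mid o)\, P(o \mid r)
  = \sum_{o} \frac{P(o \mid b)\, P(b)}{P(o)}\, P(o \mid r)
```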

  16. Alignment Score • Align(b, r) = Σ_o P(o | b) P(b) P(o | r) / P(o) • b: bounding box • r: reference • o: database object • A region proposal network gives the bounding boxes b and P(b) • P(o) is uniform

  17. Alignment Score • Align(b, r) = Σ_o P(o | b) P(b) P(o | r) / P(o) • b: bounding box • r: reference • o: database object • P(o | b) is computed using visual similarity, via kernel density estimation with a symmetric multivariate Gaussian kernel • P(o | r) is computed similarly, using text similarity with pre-trained embeddings
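A sketch of how these terms could be computed, assuming hypothetical per-object exemplar embeddings stored in the database. The Gaussian-kernel density estimate for P(o | b) and the uniform P(o) follow the slides; the normalization over objects and the omission of P(b) (which the region proposal network supplies) are simplifications for illustration.

```python
import numpy as np

def kde_score(query, exemplars, bandwidth=1.0):
    """Density of `query` under one object's exemplar embeddings,
    using a symmetric (isotropic) Gaussian kernel."""
    sq_dists = np.sum((exemplars - query) ** 2, axis=1)   # (num_exemplars,)
    return np.mean(np.exp(-sq_dists / (2.0 * bandwidth ** 2)))

def posterior_over_objects(query, per_object_exemplars, bandwidth=1.0):
    """P(o | query): per-object KDE scores normalized over the database,
    assuming a uniform prior P(o) as on slide 16."""
    scores = np.array([kde_score(query, ex, bandwidth)
                       for ex in per_object_exemplars])
    return scores / (scores.sum() + 1e-12)

def alignment_score(p_o_given_b, p_o_given_r):
    """Align(b, r) = sum_o P(o | b) P(o | r) / P(o), with P(o) uniform.
    P(b) from the region proposal network would scale this sum; omitted here."""
    p_o = 1.0 / len(p_o_given_b)
    return float(np.sum(p_o_given_b * p_o_given_r / p_o))
```

Here the query for P(o | b) would be a visual embedding of the box crop, and the query for P(o | r) a pre-trained text embedding of the reference.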

  18. Mask Refinement • Refine each bounding box with a UNet model • Gives a tight object mask • Paired with a bounded alignment score to a reference in the text (figure example: a refined mask with Align = 0.7 to a reference in "go straight and stop before reaching the planter" / "turn left towards the globe and go forward until just before it")
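A hedged sketch of the crop-and-refine idea, assuming a PyTorch UNet-style module that maps a 3-channel crop to a 1-channel logit map; the crop size and threshold are illustrative choices, not the talk's implementation.

```python
import torch
import torch.nn.functional as F

def refine_mask(unet, image, box, out_size=64, threshold=0.5):
    """Crop `image` (3, H, W) to a proposed bounding box, run a UNet-style
    module on the resized crop, and threshold its output into a tight
    binary object mask of shape (out_size, out_size)."""
    x0, y0, x1, y1 = box
    crop = image[:, y0:y1, x0:x1].unsqueeze(0)              # (1, 3, h, w)
    crop = F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    logits = unet(crop)                                      # (1, 1, S, S)
    return torch.sigmoid(logits)[0, 0] > threshold           # (S, S) bool mask
```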

  19. Learning • Align(b, r) = Σ_o P(o | b) P(b) P(o | r) / P(o) • Learned: region proposal network parameters for bounding box proposals; the image similarity measure for P(o | b); UNet parameters for mask refinement • Text similarity uses pre-trained embeddings • Challenge: needs large-scale, heavily annotated visual data

  20. Augmented Reality Training Data (figure panels: FPV, Overlay, Composite, Mask labels)

  21. Augmented Reality Training Data • Large-scale generation with ShapeNet objects • Learned representations generalize beyond the specific objects used, for: the region proposal network for bounding boxes; the image similarity measure for P(o | b); UNet parameters for mask refinement (figure panels: Composite, Mask labels)
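A minimal sketch of the compositing step behind this kind of data generation, purely illustrative: the real pipeline renders ShapeNet objects into the scene with appropriate pose and lighting, whereas here a pre-rendered RGBA sprite is alpha-blended onto an FPV frame to produce both the composite image and its mask label.

```python
import numpy as np

def composite(fpv_frame, render_rgba, top_left):
    """Overlay one rendered object (h, w, 4 RGBA) onto an FPV frame
    (H, W, 3), assuming the overlay fits inside the frame. Returns the
    composite image and the corresponding binary mask label."""
    h, w = render_rgba.shape[:2]
    y, x = top_left
    alpha = render_rgba[..., 3:4].astype(np.float32) / 255.0      # (h, w, 1)
    out = fpv_frame.astype(np.float32).copy()
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * render_rgba[..., :3] + (1.0 - alpha) * region
    mask = np.zeros(fpv_frame.shape[:2], dtype=np.uint8)
    mask[y:y + h, x:x + w] = (alpha[..., 0] > 0.5).astype(np.uint8)
    return out.astype(np.uint8), mask
```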

  22. Today Few-shot instruction following: • Few-shot language-conditioned object segmentation • Object context mapping • Integration into a visitation-prediction policy for mapping instructions to drone control

  23. Object Context Mapping Goal: create maps that capture object location and the instruction behavior around objects 1. Identify and align object mentions to observations 2. Compute abstract contextual representations for object references 3. Project and aggregate masks over time 4. Combine aggregated masks with contextual representations to create a map
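A compact sketch of steps 3 and 4 above, assuming the per-timestep object masks have already been projected from first-person view into an allocentric top-down grid using the drone poses. The aggregation rule (element-wise max) and the combination rule (broadcasting each reference's context vector over its aggregated mask) are illustrative simplifications, not the talk's exact operators.

```python
import numpy as np

def aggregate_masks(projected_masks):
    """Step 3 sketch: combine per-timestep top-down masks for one object
    (each an (H, W) array in [0, 1]) into a single accumulated mask."""
    return np.maximum.reduce(list(projected_masks))

def build_context_map(aggregated_masks, context_vectors, grid_hw, dim):
    """Step 4 sketch: write each reference's contextual representation into
    the map cells covered by its aggregated mask (summed where objects overlap)."""
    H, W = grid_hw
    feature_map = np.zeros((H, W, dim), dtype=np.float32)
    for mask, ctx in zip(aggregated_masks, context_vectors):   # ctx: (dim,)
        feature_map += mask[..., None] * ctx[None, None, :]
    return feature_map
```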

  24. Object Context Mapping, Step I: Identify and Align • Bounding box proposals from the region proposal network • Object references from a tagger • Align with language-conditioned segmentation and the database • To compute: first-person masks aligned to instruction references (figure: database exemplars blue ball, planet earth, orange cup, plant pot)
