

  1. Proposal of a Hierarchical Architecture for Multimodal Interactive Systems
     Masahiro Araki*1, Tsuneo Nitta*2, Kouichi Katsurada*2, Takuya Nishimoto*3, Tetsuo Amakasu*4, Shinnichi Kawamoto*5
     *1 Kyoto Institute of Technology, *2 Toyohashi University of Technology, *3 The University of Tokyo, *4 NTT Cyber Space Labs., *5 ATR
     2007/11/16 W3C MMI ws

  2. Outline
     • Background
       – Introduction of the Speech Interface Committee under ITSCJ
       – Introduction to the Galatea toolkit
     • Problems of the W3C MMI Architecture
       – The Modality Component is too large
       – Fragile modality fusion and fission functionality
       – How to deal with the user model?
     • Our Proposal
       – Hierarchical MMI architecture
       – "Convention over Configuration" in various layers

  3. Background (1)
     • What is ITSCJ?
       – Information Technology Standards Commission of Japan
       – under IPSJ (Information Processing Society of Japan)
     • Speech Interface Committee under ITSCJ
       – Mission: publish TS (Trial Standard) documents concerning multimodal dialogue systems

  4. Background (2)
     • Themes of the committee
       – Architecture of the MMI system
       – Requirements of each component
     • Future directions
       – Guidelines for implementing practical MMI systems
       – Specify a markup language

  5. Our Aim
     1. Propose an MMI architecture that can be used for advanced MMI research
        (cf. W3C: from the practical point of view (mobile, accessibility))
     2. Examine the validity of the architecture through system implementation (Galatea Toolkit)
     3. Develop a framework and release it as open source towards a de facto standard

  6. Galatea Toolkit (1)
     • Platform for developing MMI systems
     • Speech recognition
     • Speech synthesis
     • Face image synthesis

  7. Galatea Toolkit (2)
     [Diagram: Dialogue Manager (Galatea DM) connected to ASR (Julian), TTS (Galatea talk), and Face (FSM)]

  8. Galatea Toolkit (3)
     [Diagram: Dialogue Manager (Phoenix) → Agent Manager Macro Control Layer (AM-MCL) → Agent Manager Direct Control Layer (AM-DCL) → ASR (Julian), TTS (Galatea talk), Face (FSM)]

  9. Problems of W3C MMI (1)
     • The "size" of the Modality Component does not suit life-like agent control
     [Diagram: Interaction Manager, Data Component, and Delivery Context Component on the Runtime Framework; under the Modality Component API, a Speech Modality (ASR, TTS) and a Face Image Modality (FSM)]

  10. Problems of W3C MMI (1)
     • Lip synchronization with speech output
     [Diagram: (1) the Interaction Manager sets Text="ohayou" on the Speech Modality; (2) TTS starts; (3) the phoneme/duration sequence o[65] h[60] a[65] ... is reported; (4) the lip-moving sequence is set on the Face Image Modality (FSM)]
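
The flow on this slide can be sketched as a small script: a stand-in TTS reports a phoneme/duration sequence, and the face module turns it into a timed lip-shape schedule. All names here (`fake_tts`, `lip_schedule`, `PHONEME_TO_LIP`) are illustrative assumptions, not Galatea's actual API.

```python
# Hypothetical sketch of the lip-synchronization flow: the TTS engine reports
# (phoneme, duration_ms) pairs, e.g. o[65] h[60] a[65] ..., and the face module
# converts them into (start_ms, lip_shape) keyframes.

PHONEME_TO_LIP = {"o": "round", "h": "open", "a": "wide", "y": "narrow", "u": "round"}

def fake_tts(text):
    """Stand-in for a TTS engine: returns (phoneme, duration_ms) pairs."""
    durations = {"o": 65, "h": 60, "a": 65, "y": 50, "u": 70}
    return [(p, durations[p]) for p in text if p in durations]

def lip_schedule(phonemes):
    """Convert phoneme durations into (start_ms, lip_shape) keyframes."""
    t, frames = 0, []
    for phoneme, dur in phonemes:
        frames.append((t, PHONEME_TO_LIP.get(phoneme, "neutral")))
        t += dur
    return frames

schedule = lip_schedule(fake_tts("ohayou"))
# schedule[0] is (0, "round"): the mouth rounds for the initial "o" at time 0
```

The point of the slide is that this coupling is awkward when ASR, TTS, and the face synthesizer sit inside one large Modality Component: the phoneme timing must cross a component boundary that the W3C architecture does not expose.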

  11. Problems of W3C MMI (1)
     • Back-channeling mechanism
     [Diagram: (1) the Interaction Manager sets Text="hai" on the Speech Modality (TTS) and starts it; (2) after a short pause, a nod is triggered on the Face Image Modality (FSM)]

  12. Problems of W3C MMI (2)
     • Fragile modality fusion and fission functionality
     [Diagram: the user says "from here to there" (Speech Modality, ASR) while pointing at (120,139) and (200,300) on a touch sensor (Tactile Modality). How should a multimodal grammar be defined? Is simple unification enough?]
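
The fusion problem on this slide can be made concrete with a toy sketch: a naive strategy binds each deictic word ("here", "there") to the next unused touch point in temporal order. The slide's question is precisely whether such simple unification is enough; the function name and token set below are hypothetical.

```python
# Naive multimodal fusion sketch: pair deictic speech tokens with touch
# coordinates in the order they arrived.

def fuse(speech_tokens, touch_points):
    """Bind each deictic token to the next unused touch point, in order."""
    deictics = {"here", "there", "this", "that"}
    points = iter(touch_points)
    bound = {}
    for token in speech_tokens:
        if token in deictics:
            bound[token] = next(points, None)  # None: more deictics than touches
    return bound

result = fuse(["from", "here", "to", "there"], [(120, 139), (200, 300)])
# result: {"here": (120, 139), "there": (200, 300)}
```

Temporal order breaks down as soon as touches and words overlap or arrive out of order, which is why the slide argues for a real multimodal grammar rather than point-wise unification.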

  13. Problems of W3C MMI (2)
     • Fragile modality fusion and fission functionality
     [Diagram: the system says "this is the route map" (Speech Modality, TTS) while showing graphics (Graphic Modality, SVG viewer). Contents planning is suitable for adapting to various devices.]

  14. Problems of W3C MMI (3)
     • How to deal with the user model?
     [Diagram: ASR fails many times. Where is the user model information stored?]

  15. Solution
     • Back to the multimodal framework
       – smaller modality components
     • Separate the state transition descriptions
       – task flow
       – interaction flow
       – modality fusion/fission
     → hierarchical architecture
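
The separation proposed here can be sketched with two tiny finite-state machines, under the assumption that each flow is its own transition table: the task flow decides *what* to do next without knowing how a slot gets filled, and the interaction flow decides *how* one exchange proceeds without knowing why. All state and event names are illustrative.

```python
# Two separated state-transition descriptions, each a (state, event) -> state table.

TASK_FLOW = {        # task-level transitions: what to do next
    ("ask_origin", "filled"): "ask_destination",
    ("ask_destination", "filled"): "confirm",
    ("confirm", "yes"): "done",
}

INTERACTION_FLOW = { # interaction-level transitions: how one exchange proceeds
    ("prompt", "user_input"): "interpret",
    ("interpret", "understood"): "report",
    ("interpret", "not_understood"): "prompt",  # re-prompt on failure
}

def step(flow, state, event):
    """Advance one FSM; unknown (state, event) pairs leave the state unchanged."""
    return flow.get((state, event), state)
```

Because the two tables share nothing, a re-prompt loop in the interaction flow never has to be duplicated inside every task state, which is the maintainability argument behind the hierarchical architecture.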

  16. Investigation procedure
     Phase 1: use case analysis → requirements for the overall system → working draft of the MMI architecture

  17. Use case analysis
     Name                                   Input modality          Output modality
     a. on-line shopping                    mouse, speech           display, speech, animated agent
     b. voice search                        mouse, speech           display, speech
     c. site search                         mouse, speech, key      display, speech
     d. interaction with robot              speech, image, sensor   speech, display
     e. negotiation with interactive agent  speech                  speech, face image
     f. kiosk terminal                      touch, speech           speech, display

  18. Example of use case: interaction with robot
     [Speech bubbles: "Nishijin Kasuri is a traditional textile in Kyoto." / "What is Kasuri?"]

  19. Requirements
     • In common with W3C:
       1. general
       2. input modality
       3. output modality
       4. architecture, integration and synchronization points
       5. runtimes and deployments
     • Extensions:
       6. dialogue management
       7. handling of forms and fields
       8. connection with outside applications
       9. user model and environment information
       10. from the viewpoint of the developer

  20. [Diagram: six-layer hierarchical architecture
     layer 6: data model (user model, application logic, device model; set/get)
     layer 5: task control (control events / results ↔ commands)
     layer 4: interaction control (integrated results / events ↔ commands)
     layer 3: modality integration / understanding (interpreted results / events ↔ commands)
     layer 2: modality component: control / interpret (results / events ↔ commands)
     layer 1: I/O device: ASR, pen / touch input, TTS / graphical / audio output]
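
One simple way to model the layered message flow above is a chain of handler functions: device events rise layer by layer, each layer abstracting the event further, while commands would descend the same chain in reverse. The handler names and event shapes below are assumptions for illustration, not part of the proposed standard.

```python
# Sketch of the upward event path through layers 1-4: each handler re-describes
# the event at its own level of abstraction.

def layer1_device(signal):          # I/O device: raw signal -> recognition result
    return {"type": "asr_result", "text": signal}

def layer2_modality(event):         # modality component: normalize the result
    return {"type": "input", "modality": "speech", "value": event["text"]}

def layer3_integration(event):      # modality integration: interpret / unify
    return {"type": "interpreted", "intent": "say:" + event["value"]}

def layer4_interaction(event):      # interaction control: build the turn
    return {"type": "turn", "content": event["intent"]}

UPWARD_CHAIN = [layer1_device, layer2_modality, layer3_integration, layer4_interaction]

def propagate(signal):
    """Push a device signal up through every layer in order."""
    event = signal
    for handler in UPWARD_CHAIN:
        event = handler(event)
    return event

turn = propagate("hello")
```

Because each layer only sees its neighbors' event format, a layer can be swapped (e.g. pen input for speech at layer 1) without touching the layers above, which is the modularity the hierarchy is meant to buy.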

  21. Investigation procedure
     Phase 2: detailed analysis of use cases → requirements for each layer → publish trial standard → release reference implementation

  22. Detailed use case analysis

  23. Requirements of each layer
     • Clarify input/output with adjacent layers
     • Define events
     • Clarify inner-layer processing
     • Investigate markup languages

  24. 1st layer: Input/Output module
     • Function
       – uni-modal recognition/synthesis module
     • Input module
       – Input: (from outside) signal; (from the 2nd layer) information used for recognition
       – Output: (to the 2nd layer) recognition result
       – Examples: ASR, touch input, face detection, ...
     • Output module
       – Input: (from the 2nd layer) output contents
       – Output: (to outside) signal
       – Examples: TTS, face image synthesizer, Web browser, ...
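
The 1st-layer contract above can be sketched as two abstract interfaces, one per direction. The class and method names are assumptions for illustration, not the committee's actual API.

```python
# Hypothetical 1st-layer interfaces: an input module consumes outside signals and
# emits recognition results; an output module renders contents into signals.

from abc import ABC, abstractmethod

class InputModule(ABC):
    """Input module: signal in, recognition result out (to the 2nd layer)."""

    @abstractmethod
    def configure(self, info):
        """Receive information used for recognition (e.g. a grammar) from the 2nd layer."""

    @abstractmethod
    def recognize(self, signal):
        """Consume an outside signal and return a recognition result."""

class OutputModule(ABC):
    """Output module: output contents in (from the 2nd layer), signal out."""

    @abstractmethod
    def render(self, contents):
        """Turn output contents into an outside signal."""

class EchoRecognizer(InputModule):
    """Toy stand-in for ASR: 'recognizes' the signal verbatim."""
    def configure(self, info):
        self.grammar = info
    def recognize(self, signal):
        return {"result": signal}

recognizer = EchoRecognizer()
recognizer.configure("toy-grammar")
```

Keeping the 1st layer this thin is what lets the 2nd layer treat ASR, touch input, and face detection uniformly.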

  25. 2nd layer: Modality component
     • Function
       – a wrapper that absorbs the differences between 1st-layer modules
         ex) speech recognition component: grammar in SRGS, semantic analysis in SISR, results in EMMA
       – provides multimodal synchronization
         ex) TTS with lip synchronization: a 2nd-layer LS-TTS modality component wrapping the 1st-layer TTS and FSM input/output modules
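
The LS-TTS example can be sketched as a 2nd-layer wrapper that hides two 1st-layer modules (a TTS engine and a face synthesizer) behind a single call, issuing the speech and the matching lip motion from one phoneme sequence. All class names are illustrative, not Galatea's real interfaces.

```python
# 2nd-layer modality component wrapping two 1st-layer modules: the caller sees one
# speak() operation; the component handles the synchronization internally.

class LipSyncTTS:
    """LS-TTS modality component: wraps 1st-layer TTS and face modules."""

    def __init__(self, tts, face):
        self.tts = tts          # 1st layer: speech synthesis
        self.face = face        # 1st layer: face image synthesis (FSM)

    def speak(self, text):
        phonemes = self.tts.synthesize(text)   # phoneme/duration sequence
        self.face.animate(phonemes)            # drive the lips from the same sequence
        return phonemes

class StubTTS:
    """Toy TTS: one 60 ms phoneme per character."""
    def synthesize(self, text):
        return [(ch, 60) for ch in text]

class StubFace:
    """Toy face module: records the animation it was asked to play."""
    def __init__(self):
        self.played = []
    def animate(self, phonemes):
        self.played.extend(phonemes)

face = StubFace()
component = LipSyncTTS(StubTTS(), face)
spoken = component.speak("ai")
```

This is the architectural payoff of the proposal: the upper layers talk to one modality component, while the tight TTS-to-lip timing that the flat W3C layout struggled with stays encapsulated below.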
