Homework 3: Dialog
• Part 1
  • Call TellMe and get two sets of driving directions
  • Call CMU’s Let’s Go
  • Call Amtrak
• Part 2
  • Build your own pizza ordering systems
  • Register with Tell Me Studio
  • Use VoiceXML to build a system
• Results are due 17th November, 3:30pm
Speech Processing 15-492/18-492 Spoken Dialog Systems Beyond VoiceXML: the Olympus Spoken Dialog Framework
Spoken Dialog - VoiceXML
• Write (several) vxml “pages” and resources
  • Your dialog application control
  • Provide grammar for understanding
  • Define what your system says
• Generally just use provided ASR/TTS
• Great for basic form-filling applications
  • What if your application can’t be made into a form-filling one?
Olympus Spoken Dialog Framework
• A general dialog system architecture
• Modular, open source framework
  • Provides components needed to build SDS: ASR/TTS, Language Understanding/Generation, Dialog Management, etc.
  • Can replace components with other options, e.g., use a different ASR engine
  • Tied together via Galaxy message-passing communication infrastructure
• http://wiki.speech.cs.cmu.edu/olympus
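The hub-and-spoke idea behind a message-passing infrastructure can be sketched in a few lines of Python. This is not the Galaxy API: the `Hub` class, its `register` and `send` methods, and the two toy recognizers are all hypothetical, chosen only to show how one component can be swapped without touching the rest of the system.

```python
from typing import Callable, Dict


class Hub:
    """Hypothetical stand-in for a message-passing hub; NOT the Galaxy API."""

    def __init__(self) -> None:
        self._servers: Dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        """Attach (or replace) the component that serves `name`."""
        self._servers[name] = handler

    def send(self, name: str, frame: dict) -> dict:
        """Route a message frame to the named component and return its reply."""
        return self._servers[name](frame)


def toy_asr(frame: dict) -> dict:
    # Pretend recognizer: upper-cases the "audio" text.
    return {"hyp": frame["audio"].upper()}


def other_asr(frame: dict) -> dict:
    # A different engine exposing the same message interface.
    return {"hyp": frame["audio"].lower()}


hub = Hub()
hub.register("asr", toy_asr)
first = hub.send("asr", {"audio": "When is the next bus"})
hub.register("asr", other_asr)  # swap the engine; callers are unchanged
second = hub.send("asr", {"audio": "When is the next bus"})
print(first, second)
```

Because callers address a component by name and exchange plain frames, replacing the recognizer is a one-line change at registration time.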
Example Olympus Systems
• Let’s Go! (bus information)
• TeamTalk (robot interaction)
  • http://wiki.speech.cs.cmu.edu/teamtalk/
• Vera
  • http://www.speech.cs.cmu.edu/~awb/vera.wmv
• Many others
Organization of Olympus Systems
• Core components
  • Generic, useful in multiple different systems
• Application components
  • System-specific, useful for a single application
Olympus Core Directory Structure
[Figure: directory tree with the following annotations]
• Source code for all system-independent Galaxy servers
• Binaries
• Scripts to compile Olympus
• Generic system configuration includes
• External dependencies
• System-independent resources (ASR and VAD acoustic models)
• Tools and scripts for LM training, log mining…
System Directory Structure
[Figure: directory tree with the following annotations]
• Source code for system-specific Galaxy servers
• System-specific binaries
• System configurations
• System documentation
• Dialog logs
• System-specific resources (grammars, language models, …)
Typical Pipeline Architecture
[Figure: typical spoken dialog system pipeline]
Pipeline Architecture in Olympus
[Figure: the pipeline as realized in Olympus. Legible labels: Phone / Desktop, Recog. Engine (SPHINX), Knowledge Source, Backend, Synth. Engine (SAPI/FLITE)]
The Olympus Architecture
[Figure: Olympus pipeline with Phone / Desktop, Recog. Engine (SPHINX), Knowledge Source, Backend, and Synth. Engine (SAPI/FLITE); callouts on the modules read as follows]
• Acoustic/Language models
• Suitable for channel/domain
• Allows multiple recognition engines
• Grammar based
• Robust parser
• Fast and small
• Interface between real world and dialog manager
• Manages timing/turn-taking
• Controls dialog
• Plan-based
• Slot-filling templates
• Allows for random variations
• Interface to external engines (SAPI, Swift, Flite)
• Does playback
Olympus Architecture Modules
[Figure: Olympus module diagram. Legible labels: Phone / Desktop, Recog. Engine (SPHINX), Knowledge Source, Backend, Synth. Engine (SAPI/FLITE)]
Grammar
• Used for two things:
  • Parsing
  • ASR language model if one isn’t available
• The Phoenix Parser
  • Context-Free Grammar
  • Robust parser
Phoenix Parser / Grammar
• CFG Grammar
  • Manually-generated domain-specific grammar rules
  • Reusable, generic sub-grammars: [Yes], [No], [Number], [DateTime], [Help], [Repeat], [Suspend], etc…
• Parses all incoming hypotheses and passes all parses along…

Grammar rules:

    [room_size_spec]
        ([rss_large])
        ([rss_small])
        ([rss_larger])
        ([rss_smaller])
        ([rss_smallest])
        ([rss_largest])
    ;
    [rss_large]
        (large)
        (big)
        (huge)
    ;
    [rss_larger]
        (*the larger)
        (*the bigger)
        (too small)
    ;
    [rss_largest]
        (*the largest)
        (*the biggest)
    ;
    [rss_small]
        (small)
        (little)
    ;

Example parse of "DO YOU HAVE SOMETHING A BIT LARGER?":

    [NeedRoom] ( [_i_want] (DO YOU HAVE SOMETHING) )
    [RoomSizeSpec] ( [room_size_spec] ( [rss_larger] (LARGER)))
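A rough Python sketch of what "robust" means here: the parser spots phrases that match a rule and skips words it cannot account for, instead of rejecting the whole utterance. The rule table is a subset of the room-size rules on this slide, with the optional `the` dropped. The `spot_slots` function and its greedy left-to-right matching are illustrative assumptions, not how Phoenix is actually implemented.

```python
from typing import Dict, List, Tuple

# Subset of the slide's room-size rules (optional "the" omitted for brevity).
RULES: Dict[str, List[str]] = {
    "rss_large": ["large", "big", "huge"],
    "rss_larger": ["larger", "bigger", "too small"],
    "rss_small": ["small", "little"],
}


def spot_slots(utterance: str) -> List[Tuple[str, str]]:
    """Greedily match rule phrases left to right, skipping unmatched words."""
    words = utterance.lower().replace("?", "").split()
    # Try longer phrases first so "too small" wins over "small".
    phrases = sorted(
        ((p.split(), net) for net, ps in RULES.items() for p in ps),
        key=lambda item: -len(item[0]),
    )
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(words):
        for phrase, net in phrases:
            if words[i:i + len(phrase)] == phrase:
                found.append((net, " ".join(phrase)))
                i += len(phrase)
                break
        else:
            i += 1  # word not covered by any rule: skip it
    return found


print(spot_slots("Do you have something a bit larger?"))
print(spot_slots("this one is too small"))
```

Words such as "a bit" match no rule and are simply passed over, which is the behaviour that lets a grammar-based parser cope with spontaneous speech and recognition errors.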
Example Phoenix Grammar

    [Place]
        (carnegie mellon university)
        (downtown)
        (robinson towne center)
        (the airport)
        (south hills junction)
        (mount oliver)
        (the south side)
        (oakland)
        (bloomfield)
        (polish hill)
        (the strip district)
        (the north side)
    ;

    [NextBus]
        (*WHEN_IS *the next *BUS)
        (*WHEN_IS *the BUS after that *BUS)
    WHEN_IS
        (when is)
        (when's)
    BUS
        (bus)
        (one)
    ;
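To make the notation in the [NextBus] rule concrete, here is a small Python sketch that translates each pattern into a regular expression. It assumes that a leading `*` marks an optional token and that upper-case names such as WHEN_IS and BUS are macros expanding to the listed alternatives. The helper names `pattern_to_regex` and `matches_next_bus` are hypothetical, and a regex is only a convenient stand-in for a real CFG parser.

```python
import re
from typing import Dict, List

# Macros from the slide: each expands to one of several word sequences.
MACROS: Dict[str, List[str]] = {
    "WHEN_IS": ["when is", "when's"],
    "BUS": ["bus", "one"],
}

# The two [NextBus] patterns from the slide.
NEXT_BUS = ["*WHEN_IS *the next *BUS", "*WHEN_IS *the BUS after that *BUS"]


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate one pattern into a regex (assumes '*' marks an optional token)."""
    parts = []
    for token in pattern.split():
        optional = token.startswith("*")
        name = token.lstrip("*")
        alternatives = MACROS.get(name, [name])
        group = "(?:" + "|".join(re.escape(a) for a in alternatives) + ")"
        # Each token consumes its trailing whitespace so optional tokens can vanish.
        parts.append(f"(?:{group}\\s+)?" if optional else f"{group}\\s+")
    return re.compile("".join(parts))


def matches_next_bus(utterance: str) -> bool:
    """True if the whole utterance matches either [NextBus] pattern."""
    text = utterance.lower().strip() + " "  # trailing space for the last token
    return any(pattern_to_regex(p).fullmatch(text) for p in NEXT_BUS)


for u in ["when is the next bus", "the one after that", "where is the bus"]:
    print(u, "->", matches_next_bus(u))
```

The optional tokens are what let a short follow-up like "the one after that" match the same rule as a fully spelled-out question.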
Confidence Annotation - Helios
• Builds accurate confidence scores using features from 3 sources of knowledge:
  • Speech recognition
  • Language understanding
  • Dialog management
• Selects hypothesis with maximum confidence score
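As a sketch of the selection step, assume each hypothesis carries one feature from each of the three knowledge sources and that a logistic model turns them into a confidence score. The feature names, weights, and bias below are invented for illustration; the slide does not specify the model or feature set Helios uses.

```python
import math
from typing import Dict, List

# Invented weights for illustration; a real model would be trained on labeled data.
WEIGHTS = {"asr_score": 2.0, "parse_coverage": 1.5, "matches_expectation": 1.0}
BIAS = -2.0


def confidence(features: Dict[str, float]) -> float:
    """Logistic combination of ASR, understanding, and dialog-state features."""
    z = BIAS + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def select_best(hypotheses: List[dict]) -> dict:
    """Return the hypothesis with the maximum confidence score."""
    return max(hypotheses, key=lambda h: confidence(h["features"]))


hyps = [
    {"text": "when is the next bus",
     "features": {"asr_score": 0.8, "parse_coverage": 1.0, "matches_expectation": 1.0}},
    {"text": "when is the next boss",
     "features": {"asr_score": 0.9, "parse_coverage": 0.6, "matches_expectation": 0.0}},
]
best = select_best(hyps)
print(best["text"])
```

In this toy example the second hypothesis has the higher recognizer score, but the first wins once parse coverage and dialog expectation are taken into account, which is the point of drawing on all three sources.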