Voice Based Information Retrieval System How far is it from text based retrieval system? PRAJNA BHANDARY CMSC 676
MOTIVATION
● Ever-increasing Internet bandwidth, ever-decreasing storage costs, and the fast development of multimedia technologies have paved the way for more and more multimedia network content.
● The main motivation for many researchers in this area is to help visually challenged individuals get information through a device equipped with a speech recognition system.
INTRODUCTION
There are three different tasks for a voice-based retrieval system:
● Using text queries to retrieve spoken documents
○ Referred to as Spoken Document Retrieval
○ It was found that queries need to be long for retrieval to be more efficient
● Using spoken queries to retrieve text documents
○ Referred to as Voice Search
○ The information to be retrieved is usually an existing text database, such as those in directory assistance applications; it may contain lexical variations but is primarily free of recognition uncertainty.
● Using spoken queries to retrieve spoken documents
○ Here speech recognition uncertainty exists on both sides, the queries and the documents, so this is naturally the more difficult task.
COMPARISON (Text-Based vs. Voice-Based)
● Resources
○ Text-based: Rich resources; huge quantities of text documents are available over the Internet, and the quantity continues to increase exponentially due to convenient access.
○ Voice-based: Spoken/multimedia content is the new trend; it can be realized even sooner given mature technologies.
● Accuracy
○ Text-based: Retrieval accuracy is acceptable to users, and retrieved documents are properly ranked and filtered.
○ Voice-based: Problems with speech recognition errors, especially for spontaneous speech under adverse environments.
● User-System Interaction
○ Text-based: Retrieved documents are easily summarised on-screen and thus easily scanned and selected by the user. The user may easily select query terms suggested for the next iteration of retrieval in an interactive process.
○ Voice-based: Spoken/multimedia documents are not easily summarised on-screen and thus difficult to scan and select. Efficient user-system interaction is lacking.
RETRIEVAL ACCURACY
● Lattice-based approaches
○ Position Specific Posterior Lattices (PSPL)
○ Confusion Networks (CN)
○ Time-based Merging for Indexing (TMI)
○ Time-anchored Lattice Expansion (TALE)
● Position Specific Posterior Lattices (PSPL)
○ Locate a word in a segment according to the position (or sequence ordering) of the word in a path, stored as a tuple (W, d, pos, prob).
● Confusion Networks (CN)
○ Cluster several words in a segment according to similar time spans and word pronunciation.
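The PSPL tuples (W, d, pos, prob) can be illustrated with a minimal sketch. It assumes the lattice has already been expanded into a short list of complete paths with their posterior probabilities; a real PSPL index computes the same quantities directly on the lattice. The function name `position_posteriors` and the sample paths are hypothetical.

```python
from collections import defaultdict

def position_posteriors(paths, segment_id):
    """Build PSPL-style tuples (W, d, pos, prob) for one spoken segment.

    paths: list of (word_sequence, path_posterior) pairs. This is an assumed,
    simplified stand-in for a recognition lattice.
    """
    acc = defaultdict(float)
    for words, prob in paths:
        for pos, word in enumerate(words):
            # The posterior of word W at position pos is the total
            # probability of all paths that have W at that position.
            acc[(word, pos)] += prob
    return sorted((w, segment_id, pos, p) for (w, pos), p in acc.items())

# Hypothetical recognition hypotheses for segment "d1".
paths = [
    (["voice", "search"], 0.5),
    (["voice", "starch"], 0.25),
    (["boys", "search"], 0.25),
]
index = position_posteriors(paths, "d1")
```

Each returned tuple says how likely a word is to occupy a given position in the segment, which is the information a position-aware index needs for proximity scoring.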
RETRIEVAL ACCURACY (Cont’d)
Relevance ranking computes relevance scores between the segments and a query Q, which is a sequence of words {W_j, j = 1, 2, ..., Q}.
● First, calculate the expected tapered count for each N-gram {W_i ... W_{i+N−1}} of the query within a spoken segment d, written S(d, W_i ... W_{i+N−1}).
● Then aggregate these results to produce a score S_{N-gram}(d, Q) for each order N. Here L is the lattice obtained from d, and k is the cluster number in the PSPL or CN structures.
● Finally, combine the different proximity types, one for each N-gram order allowed by the query length Q, by a weighted sum to give the final relevance score S(d, Q).
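As a hedged reconstruction, not the slide's own notation, the three scores described above are commonly written as follows for PSPL/CN indexing (see [2]). The weight symbol w_N and the exact form of the posterior P(W | k, L), read as the posterior probability of word W in cluster k of lattice L, are assumptions.

```latex
% Expected tapered count of one query N-gram in segment d
S(d, W_i \dots W_{i+N-1}) =
  \log\Bigl[\, 1 + \sum_{k} \prod_{l=0}^{N-1} P(W_{i+l} \mid k + l,\, L) \Bigr]

% Aggregate over all N-grams of order N in the query
S_{N\text{-gram}}(d, Q) = \sum_{i=1}^{Q-N+1} S(d, W_i \dots W_{i+N-1})

% Weighted sum over all N-gram orders allowed by the query length Q
S(d, Q) = \sum_{N=1}^{Q} w_N \, S_{N\text{-gram}}(d, Q)
```

The logarithm is what makes the count "tapered": repeated occurrences of the same N-gram add progressively less to the score.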
USER-SYSTEM INTERACTION
● Multi-modal dialogue: for a query given by the user, the retrieval system produces a topic hierarchy, constructed from the retrieved spoken documents, to be shown on the screen.
● Semantic analysis of spoken documents
USER-SYSTEM INTERACTION (Cont’d)
● Key term extraction from spoken documents, based on latent topic significance
● Automatic generation of summaries and titles for spoken documents
● Query-based local semantic structuring of spoken documents
● Semantic structuring of spoken documents
● Interactive retrieval in a dialogue loop
PROPOSED MODEL
This is a three-step process:
1. Speech to text
2. Pattern matching
3. Text to speech
[Flowchart: Voice → Voice to text → Keyword BoW (Bag of Words) → Pattern matching → match with DB? → if yes: Voice-based reply → Voice reply]
VOICE TO TEXT
● Fuzzy logic can be used to match speech in different accents; e.g. the word “Vector” has different pronunciations.
● Thus a single word can be represented by a fuzzy set.
● Since this is too specific to fit a generic model of speech recognition, a more general model based on the fuzzification of phonemes can be used.
● This model is applied to spoken sentences. One fuzzy set is based on accents, the second on the speed of pronunciation, and the third on emphasis.
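The idea that a single word can be represented by a fuzzy set can be sketched as follows. This is only an illustration: the pronunciation strings, the membership degrees, the threshold, and the names `FUZZY_WORDS`, `membership`, and `best_word` are all hypothetical, and the sketch works on whole words rather than on phonemes.

```python
# Each word is a fuzzy set: pronunciation -> degree of membership in [0, 1].
# All values below are made up for illustration.
FUZZY_WORDS = {
    "vector": {"VEK-ter": 1.0, "VEK-tor": 0.9, "WEK-tar": 0.6},
    "victor": {"VIK-ter": 1.0, "VEK-ter": 0.3},
}

def membership(word, pronunciation):
    """Degree to which a heard pronunciation belongs to the word's fuzzy set."""
    return FUZZY_WORDS[word].get(pronunciation, 0.0)

def best_word(pronunciation, threshold=0.5):
    """Pick the word with the highest membership, or None if it is too low."""
    word = max(FUZZY_WORDS, key=lambda w: membership(w, pronunciation))
    return word if membership(word, pronunciation) >= threshold else None
```

With these sample values, an accented "WEK-tar" still resolves to "vector" because its membership degree (0.6) is above the threshold, while an unknown pronunciation returns `None`.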
BAG-OF-WORDS
● A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
○ A vocabulary of known words.
○ A measure of the presence of known words.
● The steps followed:
○ Collect data
○ Create vocabulary
○ Create document vectors
○ Manage vocabulary
○ Score words
■ Word hashing
■ TF-IDF
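The bag-of-words steps can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and the basic TF-IDF form tf × log(N / df); the function names and sample documents are hypothetical, and practical systems usually add smoothing and vocabulary pruning.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_vocabulary(documents):
    """Create the vocabulary of known words from the collected data."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def count_vector(document, vocabulary):
    """Document vector: how often each known word occurs in the document."""
    counts = Counter(tokenize(document))
    return [counts[w] for w in vocabulary]

def tf_idf_vector(document, documents, vocabulary):
    """Score words by term frequency times inverse document frequency."""
    tokens = tokenize(document)
    counts = Counter(tokens)
    vector = []
    for w in vocabulary:
        tf = counts[w] / len(tokens) if tokens else 0.0
        df = sum(1 for d in documents if w in tokenize(d))
        idf = math.log(len(documents) / df) if df else 0.0
        vector.append(tf * idf)
    return vector

docs = ["voice search", "voice reply"]
vocab = build_vocabulary(docs)
counts = count_vector(docs[0], vocab)
scores = tf_idf_vector(docs[0], docs, vocab)
```

In this sample, "voice" occurs in every document, so its TF-IDF score is zero, while "search" is distinctive for the first document and gets a positive score.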
PATTERN MATCHING
● The Boyer-Moore (BM) algorithm can be used. It positions the pattern over the leftmost characters in the text and attempts to match it from right to left. If no mismatch occurs, the pattern has been found; otherwise, the algorithm computes a shift, the amount by which the pattern is moved to the right before a new matching attempt is made.
● The shift is computed using two heuristics:
○ Match heuristic: when the pattern is moved to the right, it must
i. match all the characters previously matched, and
ii. bring a different character to the position in the text that caused the mismatch.
○ Occurrence heuristic: for every symbol x of the alphabet, with m the length of the pattern,
$d[x] = \min\{\, s \mid s = m \ \text{or}\ (0 \le s < m \ \text{and}\ \mathrm{pattern}[m - s] = x) \,\}$
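The right-to-left comparison and the shift can be sketched as below. This is a simplified Boyer-Moore that uses only the occurrence (bad-character) heuristic, stored as a last-occurrence table, and omits the match heuristic, so it is not the full algorithm. The function names are hypothetical.

```python
def build_last_occurrence(pattern):
    """Map each character to the index of its rightmost occurrence in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}

def boyer_moore_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    last = build_last_occurrence(pattern)
    pos = 0  # current alignment of the pattern against the text
    while pos <= n - m:
        j = m - 1
        # Compare from right to left.
        while j >= 0 and pattern[j] == text[pos + j]:
            j -= 1
        if j < 0:
            return pos  # no mismatch: pattern found
        # Align the mismatching text character with its rightmost
        # occurrence in the pattern; always move at least one position.
        bad_char = text[pos + j]
        pos += max(1, j - last.get(bad_char, -1))
    return -1
```

For example, `boyer_moore_search("information retrieval", "retrieval")` returns 12, and a pattern that does not occur returns -1.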
TEXT TO VOICE
● After the text is obtained, it must be analysed and then transformed into a phonetic description.
● NLP module: produces the phonetic description through the following operations:
○ Text analysis: first the text is segmented into tokens. Token-to-word conversion then creates the orthographic form of each token; for example, "Mr" becomes "mister" and numbers like 2 are transformed to "two".
○ Application of pronunciation rules: after text analysis is completed, pronunciation rules can be applied, covering cases such as silent letters in a word (h in caught) or several phonemes (m in maximum).
■ Dictionary-based solution: a dictionary can be used in which all possible forms of words are stored.
■ Rule-based solution: rules are generated from the phonological knowledge of dictionaries. Only words with some exception in pronunciation are included.
● Digital Signal Processing (DSP) module: transforms the symbolic information it receives into audible speech.
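The text analysis step (tokenization followed by token-to-word conversion) can be sketched as below. The abbreviation table, the digit-by-digit reading of numbers, and the function names are assumptions made for illustration; a real text-to-speech front end uses much larger tables and reads numbers as full quantities.

```python
import re

# Assumed sample tables, for illustration only.
ABBREVIATIONS = {"mr": "mister", "dr": "doctor"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def token_to_words(token):
    """Convert one token to its orthographic (spelled-out) form."""
    lower = token.lower()
    if lower.isdigit():
        # Simplification: numbers are read digit by digit.
        return [DIGITS[int(ch)] for ch in lower]
    return [ABBREVIATIONS.get(lower, lower)]

def normalize(text):
    """Segment the text into tokens, then expand each token into words."""
    tokens = re.findall(r"[A-Za-z]+|[0-9]+", text)
    words = []
    for token in tokens:
        words.extend(token_to_words(token))
    return words
```

For example, `normalize("Mr Smith has 2 cats")` returns `["mister", "smith", "has", "two", "cats"]`. The resulting word list is what the pronunciation rules would then be applied to.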
CONCLUSION & FUTURE SCOPE
● It can be concluded that this approach is efficient in terms of reduced computational complexity and reduced time.
● Research is under way to make the whole process telephonic.
● Limitations of Bag-of-Words:
○ Vocabulary
○ Sparsity
○ Meaning
REFERENCES
[1] R. Uma, B. Latha. “An efficient voice based information retrieval using bag of words based indexing”, International Journal of Engineering & Technology.
[2] Lin-shan Lee and Yi-cheng Pan. “Voice-based Information Retrieval: how far are we from the text-based information retrieval?”, 2009 IEEE.
[3] Kiruthika M, Priyadarsini S, Rishwana Roshan K, Shifana Parvin V.M, Dr. G. Umamaheshwari. “Voice Based Information Retrieval System”, International Journal of Innovative Research in Science, Engineering and Technology.
[4] Personal Voice Based Information Retrieval System, patent.
[5] Lakra, Sachin, et al. “Application of fuzzy mathematics to speech-to-text conversion by elimination of paralinguistic content.” arXiv preprint arXiv:1209.4535 (2012).
[6] Knuth, D., J. Morris, and V. Pratt. 1977. “Fast Pattern Matching in Strings.” SIAM J. on Computing, 6, 323-50.
[7] Boyer, R., and S. Moore. 1977. “A Fast String Searching Algorithm.” CACM, 20, 762-72.
[8] Ondrej Chum, James Philbin, Josef Sivic, Michael Isard, and Andrew Zisserman. “Total recall: Automatic query expansion with a generative feature model for object retrieval.” In ICCV, pages 1–8, 2007.
[9] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. “Improving bag-of-features for large-scale image search.” International Journal of Computer Vision, 87(3):316–336, 2010.