CHAIR PROF. BÖHM
Compression and Similarity Indexing for Time Series
Master’s Thesis
Marco Neumann | 19th of August 2016
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
Outline
1. Google n-gram data
2. Clean-up
3. Similarity
4. Baseline
5. CASINO TIMES
6. Final Words
Marco Neumann – CASINO TIMES, 19th of August 2016
Google n-gram data
Public Data Set
Information Provided by the Data Set
- Similarities = hints for a common cause
- Warning: similarity ≠ causality
Current problems
- "similarity" is not precisely defined
- data is "big" (for interactive analysis)
- choosing possible candidates is subject to frame confirmation bias
- slow manual analysis
Goals
- exact description of "similarity"
- enable interactive nearest neighbor queries
- design & evaluation of a baseline
- design & evaluation of an own approach
Clean-up
Steps
1. string filtering: OCR errors, numbers, word classes
2. string normalization: NFKC Unicode normalization, lowercase
3. word normalization: stemming, lemmatisation
4. pruning: rare words, only last 256 years
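The filtering, normalization and pruning steps can be sketched as follows. This is a minimal illustration, not the thesis code: the function names, the `isalpha` filter (a crude stand-in for the OCR-error and number filters) and the `min_total` threshold are assumptions, and stemming/lemmatisation are left out because they need a language-specific library.

```python
import unicodedata
from collections import Counter


def normalize_token(token):
    """String normalization: NFKC Unicode normalization, then lowercase."""
    return unicodedata.normalize("NFKC", token).lower()


def is_plain_word(token):
    """String filtering (assumed rule): keep tokens made of letters only.

    This drops numbers and tokens with digits mixed in, which is only a
    rough approximation of a real OCR-error filter.
    """
    return token.isalpha()


def clean_counts(raw_counts, min_total=2):
    """Merge counts of tokens that normalize to the same string,
    then prune words whose total count is below min_total (hypothetical)."""
    merged = Counter()
    for token, count in raw_counts.items():
        if is_plain_word(token):
            merged[normalize_token(token)] += count
    return {word: c for word, c in merged.items() if c >= min_total}


raw = {"Time": 3, "time": 2, "\ufb01sh": 1, "fish": 1,
       "1984": 10, "t1me": 4, "rare": 1}
print(clean_counts(raw))  # {'time': 5, 'fish': 2}
```

NFKC maps the ligature in "ﬁsh" to plain "fi", so both spellings end up in the same time series before pruning is applied.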
Results
- 1-grams: ≈ 800 000
- 2-grams: ≈ 6 400 000
Similarity
Input Data
Normalization
(Smooth) Gradients
DTW
Similar structure, but sometimes slightly off ⇒ use Dynamic Time Warping (DTW), limited by a Sakoe-Chiba band of radius r
VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.
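A minimal sketch of band-limited DTW, assuming equal-length series, squared point differences as local cost and a square root over the accumulated cost (one common convention; the slides do not state which cost the thesis uses). The function name `dtw_distance` is illustrative.

```python
import math


def dtw_distance(a, b, r):
    """DTW distance between equal-length sequences a and b, with the
    warping path limited to a Sakoe-Chiba band of radius r."""
    n = len(a)
    if len(b) != n:
        raise ValueError("this sketch assumes equal-length series")
    inf = math.inf
    # cost[i][j]: cheapest accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # only cells with |i - j| <= r lie inside the band
        for j in range(max(1, i - r), min(n, i + r) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][n])


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]      # same peak, one step earlier
print(dtw_distance(a, b, 0))   # 2.0 (radius 0 is the Euclidean distance)
print(dtw_distance(a, b, 1))   # 0.0 (the band absorbs the one-step shift)
```

The band keeps the cost at O(n·r) cells per comparison instead of O(n²), which matters when many candidates have to be compared.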
Final Order
pre-calculation:
1. log(x + 1)
2. Gauss-smoothing using σ
3. gradient calculation
on demand:
4. DTW with warping radius of r
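The first three steps of this order can be sketched in plain Python. Several details are assumptions made only for illustration: the kernel is truncated at about 3σ, edge values are repeated at the borders, and the gradient is taken as the difference of consecutive values. The name `preprocess` is hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights, truncated at about 3 sigma (assumption)."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-0.5 * (k / sigma) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma):
    """Gauss-smoothing; border samples are repeated (assumption)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def gradient(values):
    """Difference between consecutive values (assumed gradient definition)."""
    return [nxt - cur for cur, nxt in zip(values, values[1:])]


def preprocess(counts, sigma):
    """Steps 1-3: log(x + 1), Gauss-smoothing, gradient calculation."""
    logged = [math.log(x + 1) for x in counts]
    return gradient(smooth(logged, sigma))
```

The output of `preprocess` would be stored once per time series, so that only the DTW comparison remains to be done when a query arrives.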
Sanity Check
Examples of Philosophic Institute
Baseline
R-tree-based index
VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.
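The credited paper (Exact Indexing of Dynamic Time Warping) bases its index on the LB_Keogh lower bound: a candidate is compared against the envelope of the query, and the result never exceeds the band-limited DTW distance. The sketch below assumes the baseline follows that paper; the slides do not show the actual implementation, and the function names are illustrative.

```python
import math


def envelope(q, r):
    """Upper and lower envelope of q over a sliding window of radius r."""
    n = len(q)
    upper = [max(q[max(0, i - r): min(n, i + r + 1)]) for i in range(n)]
    lower = [min(q[max(0, i - r): min(n, i + r + 1)]) for i in range(n)]
    return upper, lower


def lb_keogh(q, c, r):
    """Lower bound on DTW(q, c) with a Sakoe-Chiba band of radius r.

    Only the parts of candidate c that leave the envelope of query q
    contribute to the bound.
    """
    upper, lower = envelope(q, r)
    total = 0.0
    for x, u, l in zip(c, upper, lower):
        if x > u:
            total += (x - u) ** 2
        elif x < l:
            total += (x - l) ** 2
    return math.sqrt(total)


q = [0, 0, 1, 2, 1, 0, 0]
c = [0, 1, 2, 1, 0, 0, 0]
print(lb_keogh(q, c, 1))  # 0.0 (c stays inside the envelope of q)
print(lb_keogh(q, c, 0))  # 2.0 (radius 0 degenerates to Euclidean distance)
```

Because the bound is cheap (linear in the series length), candidates whose bound already exceeds the best distance found so far can be discarded without running DTW.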
Index Inefficiency
Performance
CASINO TIMES
Goals
primary:
- use normal hardware
- slow pre-processing, fast search
- enable subrange queries without re-indexing
secondary:
- compress data
- speed up nearest neighbor (nn) queries using an index
Wavelet decomposition
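The slide does not say which wavelet is used. As an illustration only, the sketch below assumes the (unnormalized) Haar wavelet, which turns a series of power-of-two length into a binary tree of averages and detail coefficients; the 256 years kept during clean-up give exactly such a length (2^8).

```python
def haar_decompose(values):
    """Return [overall average, coarsest detail, ..., finest details]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # details of the current level go in front of the finer ones
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose level by level, coarsest first."""
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos: pos + len(approx)]
        pos += len(approx)
        approx = [v for a, d in zip(approx, detail) for v in (a + d, a - d)]
    return approx


coeffs = haar_decompose([4, 6, 10, 12])
print(coeffs)                    # [8.0, -3.0, -1.0, -1.0]
print(haar_reconstruct(coeffs))  # [4.0, 6.0, 10.0, 12.0]
```

Each detail coefficient covers a fixed time range, so the coefficients form a tree whose subtrees correspond to subranges of the series.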
Information Merging
- search similar subtrees (of different time series)
- process one whole tree at a time
- greedy method, node-by-node
- merge node if:
  - same children
  - difference of coefficients is small (= compression error is below threshold)
Example
Example (zoomed)
Weakness
Failed Improvements
- merge entire subtrees (same index structure)
- DB seeding
- information / subtree pruning
- drop time constraint for leaves
- DTW for leaves
- random boosting
- merge entire subtrees (FLANN)
Final Words
Conclusion
foundation for future research (starting collaboration with Prof. Dr. Sanders):
- definition of similarity
- fast baseline algorithm
- knowledge about tree-like methods ⇒ not promising
Possible Ideas
compression using:
- IEEE-half floating point
- non-IEEE data types (e.g. A-law and μ-law)
- general purpose compression of chunks (e.g. snappy, lz4, gzip, xz, brotli)
- static/dynamic downsampling
- locality-preserving hashing
- time series encoding using functions (e.g. cubic splines)
+ patching
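Two of the listed ideas can be tried with the Python standard library alone. The sketch below round-trips a value through IEEE 754 half precision (2 bytes per value) and through an 8-bit μ-law code. The value μ = 255 is borrowed from telephony and the input is assumed to be scaled to [-1, 1]; neither choice comes from the slides, and the function names are illustrative.

```python
import math
import struct

MU = 255.0  # telephony value for mu-law; an assumption for this sketch


def to_half_and_back(x):
    """Round-trip a value through IEEE 754 half precision (2 bytes)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mu_law_encode(x, bits=8):
    """Compand x in [-1, 1] logarithmically, then quantize to an integer."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** (bits - 1) - 1
    return round(y * levels)


def mu_law_decode(code, bits=8):
    """Invert mu_law_encode up to the quantization error."""
    levels = 2 ** (bits - 1) - 1
    y = code / levels
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


print(to_half_and_back(0.5))              # 0.5 (exactly representable)
print(mu_law_decode(mu_law_encode(0.0)))  # 0.0
```

Both encodings are lossy: half precision keeps roughly three significant decimal digits, while μ-law spends its few code values mostly on small magnitudes.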
Thanks
- Dr.-Ing. Martin Schäler
- Prof. Dr.-Ing. Klemens Böhm
- IPD IT team
- Philosophic Friends
- Miguel Angel Meza Martínez
Title picture: "Casino Royale" by Rebecca Siegel, 2013 (CC BY)