
Alternative Data in Finance



  1. Alternative Data in Finance

  2. Example: Lodging • Key metrics: Occupancy x Room Rate ~ Revenues • Alternative data: number of lights on (proxy for occupancy) • Online room rates (proxy for room rate)
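
The lodging relationship above can be sketched in a few lines of Python. This is only an illustration: the function names, the lit-window occupancy proxy, and the sample numbers are assumptions made for the example, not figures from the presentation.

```python
def occupancy_from_lights(lit_windows, total_windows):
    """Rough occupancy proxy: share of room windows with lights on."""
    if total_windows <= 0:
        raise ValueError("total_windows must be positive")
    return lit_windows / total_windows


def estimate_room_revenue(rooms_available, occupancy_rate, avg_room_rate, nights):
    """Occupancy x room rate ~ revenues, scaled by rooms and nights."""
    if not 0.0 <= occupancy_rate <= 1.0:
        raise ValueError("occupancy_rate must be between 0 and 1")
    return rooms_available * occupancy_rate * avg_room_rate * nights


# Hypothetical inputs: 150 of 200 windows lit, $120 scraped online rate, 90 nights.
occupancy = occupancy_from_lights(150, 200)
revenue = estimate_room_revenue(200, occupancy, 120.0, 90)
```

In practice both inputs are noisy samples, so the output is better read as a directional estimate than as a revenue figure.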

  3. Alternative Data • Point of sale transactions • Online behavior • Purchases (online; brick and mortar) • Obscure public records • Drone footage analysis ;) • Etc.

  4. Supply Chain 1. Data Vendors / Suppliers 2. Aggregators and Analysts 3. Clients / Funds

  5. Outline • Basic Example (done) • What's Alternative Big Data (done) • Sourcing • Compliance and ethics • Predicting revenue and other uses • Walkthrough of common technical challenges • Basic trading strategy • Q & A

  6. Data Sourcing • Direct data gathering • Data vendors • Just download the data (JDD)

  7. Data gathering / Sourcing • Harvest the web • Primary Research

  8. Harvesting: Build or Buy? • Build: control over compliance procedures; all IP and harvesting target information stays in house; complete control over costs • Buy: faster to scale; back data; risk mitigated by an intermediary; some structuring of the data done by vendor; leverage vendors’ expertise in the data and spidering * Tip for finding web harvesting firms: look on LinkedIn for people with web scraping skills and see who they work for.

  9. Harvesting: Semantic web • Diffbot recognizes the content of web pages • Compares against schema.org’s structures • Automatically collects structured data without explicit structure definitions • Adjusts for changes in page layouts

  10. Primary Research • Expert networks • Surveys • New ways to look at the world • Receipts • Serial numbers • Alexa or other web monitoring tools • Google trends • Classified • Drone footage

  11. Evaluating Datasets • Scarcity • How widely used or marketed is it? • Granularity • Time • Aggregation levels • How structured is it? • Coverage • Sectors / Stocks – Hedge fund motels? • Geo * Creating a standardized quantitative scoring system or ROI matrix to evaluate datasets based on these criteria is a worthwhile endeavor
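
One way to start on the standardized scoring system suggested above is a weighted score per dataset. The sketch below is an assumption-laden illustration: the criterion keys, the integer weights, and the 1 to 5 rating scale are all hypothetical choices, not values given in the presentation.

```python
# Hypothetical weights per criterion (higher = more important).
CRITERIA_WEIGHTS = {"scarcity": 3, "granularity": 2, "structure": 2, "coverage": 3}


def score_dataset(ratings, weights=CRITERIA_WEIGHTS):
    """Weighted average of 1-5 ratings across the evaluation criteria."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_datasets(candidates):
    """Return dataset names ordered from highest to lowest score."""
    return sorted(candidates, key=lambda name: score_dataset(candidates[name]), reverse=True)


# Hypothetical candidates rated by an analyst.
candidates = {
    "card_panel": {"scarcity": 5, "granularity": 3, "structure": 2, "coverage": 4},
    "web_prices": {"scarcity": 2, "granularity": 5, "structure": 5, "coverage": 2},
}
ranking = rank_datasets(candidates)
```

Dividing the score by the annual cost of each dataset would be one simple route from this score to the ROI matrix the slide mentions.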

  12. Evaluating Vendors • Companies monetizing their exhaust data • High quality high margin revenue • Upstream insights from buyer • Traditional data vendors • Survey data • Financial data aggregation • Hybrids • 1010 / ITG

  13. Free Datasets • http://aws.amazon.com/datasets • http://databib.org • http://datacite.org • http://figshare.com • http://linkeddata.org • http://reddit.com/r/datasets • http://thedatahub.org (alias http://ckan.net) • http://quandl.com • http://enigma.io • Hundreds more: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

  14. High opportunity datasets • International • Asia • Latam • Insight into margins • Companies are more EPS surprise sensitive than revenue surprise sensitive • COGS • SG&A • Etc • B2B

  15. Compliance overview • Intent / Ethics • Regulatory

  16. Compliance overview [Diagram: data vendor → restricted environment (PII scrubbing process / encrypted archiving) → production environment (organization)]

  17. Compliance overview: Guidelines / Control Frameworks • NIST 800-122 • GLBA (Gramm-Leach-Bliley Act) • COBIT 5 • COSO 2013

  18. Compliance overview • Just use regular expressions ^(?:(?=.*\d)(?=.*[A-Z])(?=.*[a-z])|(?=.*\d)(?=.*[^A-Za-z0-9])(?=.*[a-z])|(?=.*[^A-Za-z0-9])(?=.*[A- Z])(?=.*[a-z])|(?=.*\d)(?=.*[A-Z])(?=.*[^A-Za-z0-9]))(?!.*(.)\1{2,})[A-Za-z0- 9!~<>,;:_=?*+#."&§%°()\|\[\]\-\$\^\@\/]{8,32} [a-zA-Z]:|\\)\\)?(((\.)|(\.\.)|([^\\/:*?"|<>. ](([^\\/:*?"|<>. ])|([^\\/:*?"|<>]*[^\\/:*?"|<>. ]))?))\\)*[^\\/:*?"|<>. ](([^\\/:*?"|<>. ])|([^\\/:*?"|<>]*[^\\/:*?"|<>. ]))? ((25[0-5]|2[0-4][0-9]|19[0-1]|19[3-9]|18[0-9]|17[0-1]|17[3- 9]|1[0-6][0-9]|1[1-9]|[2-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9%[0-9A-Fa- f]{2}|[-()_.!~*';/?:@&=+$,A-Za-z0- 9])+)([).!';/?:^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0- 9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$ * Use Regexp Buddy.
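
For the PII scrubbing step, pattern-based redaction can be sketched as follows. The three patterns and the placeholder labels are simplified assumptions chosen for illustration; they are far narrower than the expressions on the slide and would miss many real-world formats, so this is a starting point rather than a compliance-grade scrubber.

```python
import re

# Simplified, illustrative patterns (assumptions, not the slide's expressions).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def scrub_pii(text):
    """Replace each match of a known PII pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


cleaned = scrub_pii("Contact jane.doe@example.com, SSN 123-45-6789")
```

Running the scrubber inside the restricted environment, before data reaches production, keeps raw identifiers out of the systems analysts work in.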

  19. Compliance overview: Web Harvesting Precedent Cases • Major cases (and the majority of them); this is uncharted territory • Feist Publications, Inc. v. Rural Telephone Service Co. • Ryanair scraping cases • eBay v. Bidder's Edge • Intel v. Hamidi • Cases discussing browsewrap vs. clickwrap • Cvent, Inc. v. Eventbrite, Inc. • Craigslist v. 3Taps • These do not apply to investment research

  20. Compliance overview • Respect the website’s TOS, especially if it is in a clickwrap • Sensible web harvesting policy • Address incoming complaints • Limit number of HTTP requests • Stay current on laws and cases • Explicitly address headline risk and regulatory risk; create a cost-benefit analysis for headline risk
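
Limiting the number of HTTP requests can be as simple as enforcing a minimum gap between calls to the same site. The class below is a minimal sketch under that assumption; `RequestThrottle` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class RequestThrottle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Sleep until the interval has elapsed; return the seconds slept."""
        delay = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            delay = max(0.0, self.min_interval_s - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_request = self._clock()
        return delay
```

A harvester would call `wait()` immediately before each request to a given host, typically with one throttle instance per host.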

  21. Generating value with alternative data • Revenue surprise estimates • Operating GAAP measures • Non GAAP measures • Churn, etc • Fully or partially automated quant strategies • Non equity asset classes • PE could benefit from the same operating metrics for diligence • PM Development and Big Data Thought Leadership • Strategic Investments • Marketing Tool for Raising Capital and Talent Recruitment

  22. Workflow and Process [Flow diagram, flattened in extraction. Recoverable labels: • Sources: data partners, web collection, third party data vendors • Data acquisition: storage optimization, cleansing, benchmarking, de-biasing, enrichment, normalization • Raw data → high performance computing → production • Data analysts and quant research: GAAP / operating metrics, quant signals • Sector teams and interpretive modeling: investment thesis insights • R&D: metrics reporting, published signal deliverable • L/S portfolio teams, visualizations]

  23. The shifting bias longitudinal panel problem [Figure: two user-by-month grids, Users 1 to 10 across Jan to Dec. Top: full panel. Bottom: panel with user add and churn (missing data, MAR)]

  24. The 200k and the ~800k are different • Solutions: imputation • Complete case analysis • Weighting methods [Chart: Total Spend Index by month, Jan to Dec, y-axis 0 to 5. Series: the complete panel (~200k users); users who have the second year of data, but not the first. Dashed line: 95% confidence, N(μ, σ²)]
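
The difference between the complete panel and everyone else can be seen by computing the same spend index two ways. The sketch below covers only complete case analysis and a plain available-case average; the data layout (one list of monthly spend per user, with `None` for unobserved months) and the function names are assumptions made for this example.

```python
def complete_case_users(panel):
    """Keep only users observed in every month."""
    return {u: s for u, s in panel.items() if all(v is not None for v in s)}


def spend_index(panel):
    """Mean spend per month over users observed that month, first month = 1.0."""
    n_months = len(next(iter(panel.values())))
    means = []
    for month in range(n_months):
        observed = [s[month] for s in panel.values() if s[month] is not None]
        if not observed:
            raise ValueError(f"no users observed in month {month}")
        means.append(sum(observed) / len(observed))
    return [m / means[0] for m in means]


# Hypothetical panel: u3 joins in month 1, u4 churns after month 0.
panel = {
    "u1": [100, 100, 100],
    "u2": [200, 200, 200],
    "u3": [None, 600, 600],
    "u4": [300, None, None],
}
index_all = spend_index(panel)
index_complete = spend_index(complete_case_users(panel))
```

No individual in this toy panel changes spending, yet the all-user index rises by half purely because panel membership changed. The complete-case index stays flat, at the cost of discarding half the users.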

  25. Complete Panel and the rest of users are different [Figure: Panels 1 to 10 across Jan to Dec; Panel 1 labeled ">200K Users (680K)"]

  26. Complete Panel and the rest of users are different [Figure build: Panels 1 to 10 across Jan to Dec; Panel 1: 680K users (>200K complete), Panel 2: 720K]

  27. Complete Panel and the rest of users are different [Figure build: Panels 1 to 10 across Jan to Dec; Panel 1: 680K users (>200K complete), Panel 2: 720K, Panel 3: 740K, Panel 4: 760K] • Many users are the same, ~90% overlap • The further apart the panels, the less user overlap: P1 – P22 only ~32% overlap, most users different
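
The overlap figures quoted for the panels correspond to a simple set calculation. A minimal sketch, assuming each panel is available as a collection of user IDs (the function names and toy panels are hypothetical):

```python
def overlap_share(panel_a, panel_b):
    """Share of panel_a's users who also appear in panel_b."""
    a, b = set(panel_a), set(panel_b)
    if not a:
        raise ValueError("panel_a is empty")
    return len(a & b) / len(a)


def overlap_vs_first(panels):
    """Overlap of every panel with the first one, in order."""
    return [overlap_share(panels[0], p) for p in panels]


# Toy panels of 10 users each; later panels drift away from the first.
panels = [set(range(0, 10)), set(range(1, 11)), set(range(7, 17))]
shares = overlap_vs_first(panels)
```

Tracking this share over time gives a quick read on how far a panel's composition has drifted before any index built on it is compared across periods.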

  28. Multivariate Time Series Clustering [Charts: four panels, User A to User D, each plotting "Sum of Cnt" and "Sum of DPT" for that user on dual axes]

  29. Multivariate Time Series Clustering • The pdc package takes a permutation distribution, which is a measure of the complexity of a time series. • Similarity of time series is constructed as the distance between their permutation distributions. • It allows us to make groupings, based on multiple variables, over time. [Figure: clustering plot with leaves User A, User B, User C, User D]
      library(pdc)
      clust <- pdclust(datamatrix, m = 4)
      plot(clust, cols = c("red", "blue", "red", "blue"))
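
To make the permutation distribution idea concrete, here is a small Python sketch of the underlying calculation for a single series. It is not the pdc implementation: it handles one variable only, and the use of the Hellinger distance between distributions is an assumption chosen for the example.

```python
from collections import Counter
from itertools import permutations
from math import sqrt


def permutation_distribution(series, m=3):
    """Relative frequency of each ordinal pattern of length m in the series."""
    n_windows = len(series) - m + 1
    if n_windows < 1:
        raise ValueError("series is shorter than the embedding dimension m")
    # The ordinal pattern of a window is the index order that sorts its values.
    counts = Counter(
        tuple(sorted(range(m), key=lambda i: series[start + i]))
        for start in range(n_windows)
    )
    return {p: counts.get(p, 0) / n_windows for p in permutations(range(m))}


def hellinger_distance(p, q):
    """Distance between two distributions over the same patterns (0 to 1)."""
    return sqrt(sum((sqrt(p[k]) - sqrt(q[k])) ** 2 for k in p) / 2)


rising = permutation_distribution([1, 2, 3, 4, 5, 6])
also_rising = permutation_distribution([10, 20, 35, 50])
falling = permutation_distribution([6, 5, 4, 3, 2, 1])
```

Because only the ordering of values inside each window matters, two users with very different spend levels but the same shape of behavior end up close together, which suits comparing users across a panel.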

  30. User dropout in a longitudinal panel • We cluster each panel • Can use multivariate time series clustering like pdclust • Cluster on number of transactions and average transaction amount, which are low-covariance features • Each panel’s cluster boundaries are independently defined [Figure: January to October timeline; Panel 1 and Panel 2 each divided into Clusters 1 to 5]
