
USING MACHINE LEARNING TO AUTOMATE CONTENT METADATA Gareth Seneque



  1. USING MACHINE LEARNING TO AUTOMATE CONTENT METADATA Gareth Seneque seneque@gmail.com @garethseneque https://search.abc.net.au

  2. THE PLAN! 1. What is the ABC/Search at the ABC 2. An overview of metadata – the what/why 3. A platform: what have we built? 4. Automating transcription of audio content 5. Automated generation of keywords/synopses 6. Some fun! 7. The future?

  3. THE ABC • We make lots of things! • You may have seen these things on one of your many screens :) • A trusted source in 2019 across the political spectrum – imagine that! • “Majority (68%) of respondents think the ABC is more important in an age of social media and fake news, including 64% of LNP and 61% of One Nation voters; • The results show 57% of respondents do not trust social media, while just 12% said they do trust social media; • Over three times more voters trust the ABC (52%) than trust commercial media (14%)” Source: The Australia Institute http://www.tai.org.au/content/abc-still-australia-s-most-trusted-news-source

  4. THE ABC • But for our purposes today: The ABC is in the business of words and pixels! • To come: lots of words about words, some words about pixels, too

  5. SEARCH @ THE ABC • https://search.abc.net.au • Algolia back-end (also used by Twitch, Stripe) • ~600k objects in our primary index • Covering all major content types from the last decade • ~230k articles • ~270k audio • ~85k video • Other things like recipes – very popular and worthy of their own content type!

  6. SEARCH @ THE ABC • ~500k searches/month • Peaks during weekdays, traffic nearly halves on weekends • (Aussies love a good weekend!) • Two challenges: • How do we get people to use our search? • Expectation of what search can do is set by Google etc. • How do we deliver the most relevant results to those using our search? • Ensure high-quality metadata!

  7. METADATA: AN OPPORTUNITY! • Article: News • Synopsis is first sentence of article • 4 keywords for lengthy article • Audio: episode of Life Matters podcast • No show name, just episode name • Spelling mistake in a keyword • Total of 5 keywords • No synopsis • No transcript • Video: episode of BTN • Missing keywords • Missing synopsis

  8. METADATA: OVERVIEW • Article → BodyText → Keywords/Synopsis • Audio → Transcript → Keywords/Synopsis • Video → Closed Captions/Transcript → Keywords/Synopsis Looking to the future… • Images & Video → ‘Individual interacting with object’ → New attributes • Geoff Hinton in 2015: “I will be disappointed if in five years time we do not have something that can watch a YouTube video and tell a story about what happened” • Somehow it is already 2019, so…

  9. AN AUTOMATED METADATA PLATFORM Credit: 20th Century Fox

  10. AN AUTOMATED METADATA PLATFORM • Transcript pipeline • 2x Lambda functions • Monitoring content notifiers, picking up Podcasts, generating transcription requests, picking up the results, separating transcript from word-confidence scores/timestamps and pushing to our search index • S3 buckets for storing transcripts • General attribute infrastructure & tools (for now, keywords/synopsis) • Load-balanced/distributed EC2 instances, CI/CD the things • Deploys models, API • Tools to update search objects (bulk/incremental) • All written in Go!
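The step of separating the transcript from word-confidence scores and timestamps could look roughly like the Go sketch below. It is not the ABC's code. It assumes the result JSON has the commonly documented Amazon Transcribe shape (`results.transcripts[].transcript` plus `results.items[]` with string-valued `start_time`, `end_time` and `confidence`); the names `transcribeOutput`, `Word`, `splitTranscript` and `sampleJSON` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// transcribeOutput mirrors only the subset of the result JSON used here.
// The field layout is an assumption about the service's output format.
type transcribeOutput struct {
	Results struct {
		Transcripts []struct {
			Transcript string `json:"transcript"`
		} `json:"transcripts"`
		Items []struct {
			StartTime    string `json:"start_time"`
			EndTime      string `json:"end_time"`
			Type         string `json:"type"`
			Alternatives []struct {
				Confidence string `json:"confidence"`
				Content    string `json:"content"`
			} `json:"alternatives"`
		} `json:"items"`
	} `json:"results"`
}

// Word is a hypothetical per-word record stored apart from the transcript text.
type Word struct {
	Text       string
	Start, End float64 // seconds from the start of the audio
	Confidence float64
}

// sampleJSON is a tiny hand-written example, not real service output.
const sampleJSON = `{"results":{"transcripts":[{"transcript":"Hello world."}],"items":[
{"start_time":"0.04","end_time":"0.45","type":"pronunciation","alternatives":[{"confidence":"0.98","content":"Hello"}]},
{"start_time":"0.45","end_time":"0.90","type":"pronunciation","alternatives":[{"confidence":"0.91","content":"world"}]},
{"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]}]}}`

// splitTranscript returns the plain transcript and the timed, scored words.
func splitTranscript(raw []byte) (string, []Word, error) {
	var out transcribeOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return "", nil, err
	}
	if len(out.Results.Transcripts) == 0 {
		return "", nil, fmt.Errorf("no transcript in result")
	}
	var words []Word
	for _, it := range out.Results.Items {
		// Punctuation items carry no timestamps, so skip them.
		if it.Type != "pronunciation" || len(it.Alternatives) == 0 {
			continue
		}
		start, err := strconv.ParseFloat(it.StartTime, 64)
		if err != nil {
			return "", nil, fmt.Errorf("bad start_time %q: %w", it.StartTime, err)
		}
		end, err := strconv.ParseFloat(it.EndTime, 64)
		if err != nil {
			return "", nil, fmt.Errorf("bad end_time %q: %w", it.EndTime, err)
		}
		conf, err := strconv.ParseFloat(it.Alternatives[0].Confidence, 64)
		if err != nil {
			return "", nil, fmt.Errorf("bad confidence: %w", err)
		}
		words = append(words, Word{it.Alternatives[0].Content, start, end, conf})
	}
	return out.Results.Transcripts[0].Transcript, words, nil
}

func main() {
	text, words, err := splitTranscript([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
	for _, w := range words {
		fmt.Printf("%-6s %.2f-%.2f conf=%.2f\n", w.Text, w.Start, w.End, w.Confidence)
	}
}
```

Keeping the plain text separate from the per-word records means the search index only needs the text, while the timestamps and scores can stay in storage for later use.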

  11. WHY TRANSCRIPTS? • ~130 Podcasts in the Listen app alone • ~19 million Podcast downloads every month (!!) • Less than 5% have transcripts • Transcripts are expensive when produced by humans! • All that great content not easily discoverable • Hypothesis: we can increase audience engagement with searchable transcripts

  12. EXPERIMENTS WITH DEEPSPEECH • Mozilla’s open-source implementation of Baidu’s paper Deep Speech: Scaling up end-to-end speech recognition (2014) • A Recurrent Neural Network w/LSTM units & CTC (Connectionist Temporal Classification) • CTC: an optimisation method that controls for different patterns of speech • You want your network to understand when the slurring drunk and the impatient teetotaller ask their phone for directions • You’ll see the co-inventor of the LSTM in the next slide • Pre-trained models: limited to ~30 second clips at 16kHz/mono • Need to build a system to manage slicing up inputs/reconstructing outputs
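A minimal sketch of the slicing and reconstruction bookkeeping for the ~30 second limit, in Go. It assumes fixed-length windows over 16kHz mono samples and simple in-order joining of per-clip text. The names `Chunk`, `sliceSamples` and `joinTranscripts` are hypothetical, and the speech-to-text call itself is left out.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	sampleRate  = 16000 // Hz, mono, matching the pre-trained models' input
	maxClipSecs = 30    // approximate clip limit quoted on the slide
)

// Chunk is a half-open range of sample indices [Start, End).
type Chunk struct{ Start, End int }

// sliceSamples splits total samples into consecutive windows of at most
// clipLen samples. The last window may be shorter.
func sliceSamples(total, clipLen int) []Chunk {
	if total <= 0 || clipLen <= 0 {
		return nil
	}
	chunks := make([]Chunk, 0, (total+clipLen-1)/clipLen)
	for start := 0; start < total; start += clipLen {
		end := start + clipLen
		if end > total {
			end = total
		}
		chunks = append(chunks, Chunk{start, end})
	}
	return chunks
}

// joinTranscripts stitches per-chunk text back together in chunk order.
func joinTranscripts(parts []string) string {
	return strings.Join(parts, " ")
}

func main() {
	// A 75 second recording yields two full windows and one short one.
	for i, c := range sliceSamples(75*sampleRate, maxClipSecs*sampleRate) {
		fmt.Printf("chunk %d: samples [%d, %d)\n", i, c.Start, c.End)
	}
	fmt.Println(joinTranscripts([]string{"on may thirteenth", "students and workers"}))
}
```

Fixed windows can cut a word in half at a boundary, so a production system would more likely split on silence or overlap adjacent windows. That choice is outside what the slide describes.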

  13. ?????

  14. DEEPSPEECH 🤗 VS. HUMAN TRANSCRIPT 😭 VS: “On May 13, 1968, students and workers joined together in Paris in one of the largest protests the France had seen. They threatened the stability of the national government and arguably shifted the way we think about protests and political demonstrations forever. Yet that is just a small part of the story of 1968. On every continent, in almost every nation on earth”

  15. MLAAS (PRONOUNCED MRMMLYAAAS) • Recurrent what? Maths? Who cares! • As of 2017 this stuff has been available as an AWS service w/a simple API call – AWS Transcribe

  16. HUMAN TRANSCRIPT 😭 A lot of this is taken directly from the example of the 1960s. And so this question of when the 1960s ends and its legacy I think is most apparent in the fact that in a lot of ways the 1960s hasn't ended yet, we are still grappling with many of these most basic ideas. Annabelle Quince: Zachary Scarlett, co-editor of The Third World in the Global 1960s. You also heard from: Gerard De Groot, author of Student Protest: The Sixties and After; Heike Becker, Professor of Anthropology at the University of the Western Cape; and Gerd-Rainer Horn, author of The Spirit of '68: Rebellion in Western Europe and North America, 1956-1976. The sound engineer is Russel Stapleton. I'm Annabelle Quince and you've been listening to Rear Vision on RN.

  17. AWS TRANSCRIBE 🤗 A lot of this is taken directly from the example of the nineteen sixties, and so this question of when the nineteen sixties ends and it's legacies, i think, is most apparent in the fact that in a lot of ways the nineteen sixties hasn't ended yet. We're still grappling with many of these most basic ideas. Sekeras scarlet, co editor of the third world in the global nineteen sixties. You also heard from gerard degroot, the author of student protest the sixties and after heika bika, professor of anthropology at the university of the western cape, and god rainer horn, author of the spirit of sixty eight. The sound engineer is russell stapleton. I'm annabelle quints, and this is revision on our end.

  18. TRANSCRIBE OUTPUT: UNDER THE HOOD

  19. KEYWORD/SUMMARY METADATA • Not enough cold-drip in the world for our team to create keyword/summary metadata for 600k objects • Two attributes suitable for NER and extractive/abstractive summarization • 2018/2019 has seen major breakthroughs in NLP/large language models – SOTA results across a range of tasks (Google’s BERT, OpenAI’s GPT-2, AllenAI’s ELMo) • Can any of these breakthroughs help us? How do they compare to more mature/minimal approaches?
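For a sense of what a "minimal approach" to keywords might look like, here is a plain term-frequency baseline in Go. It is neither NER nor any of the models named above, and the deck does not say which baseline was actually used. The stopword list and the name `topKeywords` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// stopwords is a deliberately tiny illustrative list, not a real lexicon.
var stopwords = map[string]bool{
	"the": true, "and": true, "are": true, "for": true, "that": true,
	"with": true, "this": true, "have": true, "has": true, "from": true,
	"says": true,
}

// topKeywords returns up to k of the most frequent non-stopword terms,
// ordered by count (descending) and then alphabetically for stable output.
func topKeywords(text string, k int) []string {
	counts := map[string]int{}
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for _, t := range tokens {
		if len(t) < 3 || stopwords[t] {
			continue
		}
		counts[t]++
	}
	keys := make([]string, 0, len(counts))
	for t := range counts {
		keys = append(keys, t)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if k < 0 {
		k = 0
	}
	if len(keys) > k {
		keys = keys[:k]
	}
	return keys
}

func main() {
	body := "Water levels in Sydney dams are dropping. " +
		"Sydney Water says dams are dropping fast."
	fmt.Println(topKeywords(body, 3))
}
```

A baseline like this gives a cheap point of comparison: if a 1.2GB fine-tuned model does not clearly beat it on real articles, the extra cost is hard to justify.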

  20. EXPERIMENTS WITH BERT • Fine-tuning BERT for NER on CoNLL-2003 • Viz to the right is the PCA of embeddings • Tensorboard! • That little cluster there is the label [unused] • doh • 1.2GB model, trained on K80 cloud GPUs • Overkill!

  21. GENERATED VS EXISTING KEYWORDS Article: “Fact checking key claims of the 2019 federal election leaders' debate” – ABC News – 29/04/19

  22. GENERATED VS EXISTING SUMMARIES Article: “Water restrictions loom for Sydney as drought continues to impact on dam levels” – ABC News – 05/05/19 • Left column: “Water restrictions will be introduced in Sydney if drought conditions don't ease in the next three months, according a report on dwindling dam levels in New South Wales. The latest research from Sydney Water reveals levels across 11 dams in Greater Sydney are dropping faster than they have in decades. NSW Water Minister Melinda Pavey said Water Rise Rules — which recommend reducing shower time and fixing tap leaks — applied to everyone in Sydney, the Blue Mountains and Illawarra.” • Right column: “Water levels in dams are dropping faster than they have in decades, according to new research by Sydney Water — edging Sydney closer to the re-introduction of water restrictions.”

  23. SOME (EARLY) RESULTS • Across the top News articles over the past week • 280% average increase in number of keywords • 22% increase in audio content availability across a range of popular terms • 3-14% increase in CTR in A/B tests for combinations of ordered/unordered keywords/extractive summaries • Tests running as we speak! • Abstractive summarization experiments continuing!
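As a worked illustration of how a CTR increase from an A/B test can be computed, here is a short Go sketch. It assumes the quoted figures are relative lifts of the variant over the control, which the slide does not state. The click and search counts are made up for the example, and `ctr` and `relativeLift` are hypothetical names.

```go
package main

import "fmt"

// ctr returns clicks per search as a fraction, or 0 when there were no searches.
func ctr(clicks, searches int) float64 {
	if searches <= 0 {
		return 0
	}
	return float64(clicks) / float64(searches)
}

// relativeLift returns the percentage change of variant over control.
// The boolean is false when the control rate is zero and no lift is defined.
func relativeLift(control, variant float64) (float64, bool) {
	if control == 0 {
		return 0, false
	}
	return (variant - control) / control * 100, true
}

func main() {
	// Made-up counts: control shows existing metadata, variant shows generated.
	control := ctr(250, 1000)
	variant := ctr(275, 1000)
	if lift, ok := relativeLift(control, variant); ok {
		fmt.Printf("control %.3f, variant %.3f, relative lift %.1f%%\n", control, variant, lift)
	}
}
```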
