Ingesting 35M images with Python in the cloud

Ingesting 35M images with Python in the cloud. Àlex Vinyals


  1. Ingesting 35M images with Python in the cloud. Àlex Vinyals, Software Engineer @ Hotels Data

  2. Challenges of a metasearch: unify all the data

  5. Three partners describe the same hotel:
     Partner A: Hotel ID 123; Name: Euskalduna Center; Street address: Avda. Abandoibarra3, 48009; Coordinates: 1.23, 2.43
     Partner B: Hotel ID $abc; Name: Euskalduna Conference Center; Street address: Avenida Abandoibarra 3; Coordinates: 1.23754, 2.43123
     Partner C: Hotel ID bilbao-hot1; Name: Euskalduna CC; Street address: Av. Abandoibarra 3; Coordinates: 1.238, 2.431
     Magic happens
     Skyscanner: Hotel ID 123456; Name: Euskalduna Conference Center; Street address: Av. Abandoibarra 3; Coordinates: 1.23754, 2.43123

  6. Same partner records as slide 5; the unified Skyscanner record (Hotel ID 123456, Euskalduna Conference Center, Av. Abandoibarra 3, 1.23754, 2.43123) is labelled "Data Release"

  7. So what about the images?

  8. Partner A (Hotel ID 123), Partner B (Hotel ID $abc), Partner C (Hotel ID bilbao-hot1): magic happens, giving Skyscanner Hotel ID 123456

  13. With more than 200 partners, 800,000 hotels reach production

  14. Images to process = K * M * N ~ 35M images, where
      K = number of partners
      M = avg number of hotels per partner
      N = avg number of images per partner hotel

  15. Resizing is a thing, and we have 14 different configurations
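
The figures on slides 13 to 15 can be turned into a rough back-of-the-envelope check. The sketch below only uses numbers stated on the slides (about 35M images, more than 200 partners, 14 resize configurations); the function names are illustrative, and the rendition count is an upper bound that assumes every image is resized to every configuration, which the slides do not state.

```python
TOTAL_IMAGES = 35_000_000  # slide 14: K * M * N ~ 35M
PARTNERS = 200             # slide 13: "more than 200", so a lower bound on K
RESIZE_CONFIGS = 14        # slide 15


def images_per_partner(total_images, partners):
    """Return M * N: the average number of images contributed by one partner."""
    return total_images // partners


def max_renditions(total_images, configs):
    """Upper bound on generated files if every image got every configuration."""
    return total_images * configs


if __name__ == "__main__":
    # With K >= 200, M * N is at most this value.
    print(images_per_partner(TOTAL_IMAGES, PARTNERS))
    print(max_renditions(TOTAL_IMAGES, RESIZE_CONFIGS))
```

Because K is only known as "more than 200", the per-partner figure is a ceiling rather than an estimate, and deduplication would lower the number of files that actually need generating.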

  16. Tale of an image processing pipeline

  17. Tech stack: riding on AWS

  18. Tech stack: riding on AWS. SQS (Simple Queue Service)

  19. Tech stack: riding on AWS. Compute resources; SQS (Simple Queue Service)

  20. Libraries (*with DjangoRestFramework, *without Django ORM)

  22. Libraries (*with DjangoRestFramework, *without Django ORM): Kombu (messaging / queues / AMQP)

  23. Libraries (*with DjangoRestFramework, *without Django ORM): Boto (Amazon stuff), Kombu (messaging / queues / AMQP)

  24. Libraries (*with DjangoRestFramework, *without Django ORM): Pillow (image processing), Boto (Amazon stuff), Kombu (messaging / queues / AMQP)

  25. Libraries (*with DjangoRestFramework, *without Django ORM): Python 2.7, Pillow (image processing), Boto (Amazon stuff), Kombu (messaging / queues / AMQP)

  26. Tale of an image processing pipeline: Triggering, Downloading, Fingerprinting, Deduplicating, Prioritising, Generating

  27. Tale of an image processing pipeline
      Asynchronous (always running): Triggering, Downloading, Fingerprinting
      Triggered by the Data Release: Deduplicating, Prioritising, Generating

  28. Pipeline stage: Triggering (of Triggering, Downloading, Fingerprinting, Deduplicating, Prioritising, Generating)

  29. Image Release (from the Catalogues)
      Partner A, Hotel ID 123. Images: http:/.../image.png, http://…, http://…
      Partner B, Hotel ID $abc. Images: http://…, http://…
      Partner C, Hotel ID bilbao-hot-1. Images: http://…, http://…
      Images API computes the diff (Images DB):
        These urls are new: http://…
        These urls are updated: http://…
        Those urls are deleted: http://…
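
The diff step on slide 29 (new, updated, deleted URLs) can be sketched with plain set operations. This is a minimal illustration, not the Images API itself: the slide does not say how an "updated" URL is detected, so the sketch assumes each URL carries a hypothetical version token (for example an ETag or checksum) that changes when the image changes. The function name `compute_diff` and the dict layout are also assumptions.

```python
def compute_diff(stored, released):
    """Compare stored images with a new release.

    Both arguments map image URL -> version token. The token is a
    hypothetical stand-in for whatever signals that an image changed.
    """
    stored_urls = set(stored)
    released_urls = set(released)
    return {
        # In the release but not yet known to the database.
        "new": sorted(released_urls - stored_urls),
        # Known URL whose version token differs from the stored one.
        "updated": sorted(
            url for url in stored_urls & released_urls
            if stored[url] != released[url]
        ),
        # Known to the database but absent from the release.
        "deleted": sorted(stored_urls - released_urls),
    }


if __name__ == "__main__":
    stored = {"http://a/1.png": "v1", "http://a/2.png": "v1"}
    released = {"http://a/2.png": "v2", "http://a/3.png": "v1"}
    print(compute_diff(stored, released))
```

Only the "new" and "updated" URLs would need to be queued for downloading; "deleted" URLs only require cleanup.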

  33. Pipeline stage: Downloading (of Triggering, Downloading, Fingerprinting, Deduplicating, Prioritising, Generating)

  35. Downloading callback and size filter:

      import io

      import boto
      import requests
      from PIL import Image

      s3 = boto.connect_s3()
      bucket = s3.get_bucket('available-images')

      # reliable_callback, fingerprinting_producer and the thresholds
      # minimum_short, minimum_long and max_pixels are defined elsewhere
      # (not shown on the slide).

      @reliable_callback()
      def downloader_callback(queued_image):
          """ Overly simplified downloading callback without
          error handling logic """
          response = requests.get(queued_image.url)
          blob = response.content
          # Store the raw bytes in S3 under the image's basename.
          key = bucket.new_key(queued_image.basename)
          key.set_contents_from_string(blob)
          image = Image.open(io.BytesIO(blob))
          if should_filter(image):
              return
          fingerprinting_producer.publish(queued_image)

      def should_filter(image):
          # PIL reports size as (width, height).
          width, height = image.size
          short_size = min(width, height)
          if short_size < minimum_short:
              return True
          long_size = max(width, height)
          if long_size < minimum_long:
              return True
          total_pixels = width * height
          if total_pixels > max_pixels:
              return True
          return False

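  38. The reliable_callback decorator:

      import functools
      import logging
      import warnings

      from PIL import Image

      logger = logging.getLogger(__name__)

      def reliable_callback():
          def decorator(func):
              @functools.wraps(func)
              def wrapper(*args, **kwargs):
                  # Turn Pillow's decompression bomb warning into an exception
                  # so oversized images are rejected instead of decoded.
                  warnings.simplefilter('error', Image.DecompressionBombWarning)
                  try:
                      return func(*args, **kwargs)
                  except BaseException:
                      # Log and swallow everything so the worker keeps running.
                      logger.error("Critical worker error", exc_info=True)
              return wrapper
          return decorator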
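
To see the decorator's two effects in isolation (a warning promoted to an exception, and the exception logged rather than propagated), here is a stdlib-only sketch. `OversizedImageWarning` is a hypothetical stand-in for Pillow's `Image.DecompressionBombWarning`, the 1000-pixel threshold is made up, and the sketch catches `Exception` rather than the slide's `BaseException` so that Ctrl-C still stops the demo.

```python
import functools
import logging
import warnings

logger = logging.getLogger("worker")


class OversizedImageWarning(UserWarning):
    """Hypothetical stand-in for PIL's Image.DecompressionBombWarning."""


def reliable_callback(warning_category=OversizedImageWarning):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Promote the warning to an exception for every later warn() call.
            warnings.simplefilter("error", warning_category)
            try:
                return func(*args, **kwargs)
            except Exception:
                # Log with traceback and return None instead of crashing.
                logger.error("Critical worker error", exc_info=True)
                return None
        return wrapper
    return decorator


@reliable_callback()
def handle(pixels):
    """Toy callback: warns when the (made-up) pixel budget is exceeded."""
    if pixels > 1000:
        warnings.warn("image too large", OversizedImageWarning)
    return pixels


if __name__ == "__main__":
    print(handle(10))    # normal path: the value is returned
    print(handle(5000))  # warning becomes an exception, is logged, None returned
```

Note that `warnings.simplefilter` changes process-wide state, which is acceptable for a dedicated worker process but worth keeping in mind if the decorator is reused inside a larger application.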

  42. Kombu consumer and a simplified worker:

      from kombu import Connection, Consumer, Exchange, Queue, eventloop

      class KombuConsumer(common.BaseConsumer):
          # ... bla bla

          def callback(self, body, message):
              # Run the handler first; acknowledge only once it has returned.
              self.handler(body)
              message.ack()

          def listen(self):
              with Connection(self.backend.broker,
                              transport_options={'region': self.backend.region}) as connection:
                  with Consumer(connection, self.queue,
                                callbacks=[self.callback],
                                accept=[self.backend.serializer]):
                      # Block forever, dispatching messages to the callback.
                      for _ in eventloop(connection):
                          pass

      # What a simplified worker looks like
      # Broker URI stored on Backend object, looks like:
      # sqs://{s3_key}:{s3_secret}@
      consumer = KombuConsumer(backend, handler=downloader.downloader_callback)
      consumer.listen()
