Ingesting 35M images with Python, in the cloud. Àlex Vinyals, Software Engineer @ Hotels Data
Unify all the data: the challenges of a metasearch
Partner A: Hotel ID 123 · Name "Euskalduna Center" · Street address "Avda. Abandoibarra3, 48009" · Coordinates 1.23, 2.43
Partner B: Hotel ID $abc · Name "Euskalduna Conference Center" · Street address "Avenida Abandoibarra 3" · Coordinates 1.23754, 2.43123
Partner C: Hotel ID bilbao-hot1 · Name "Euskalduna CC" · Street address "Av. Abandoibarra 3" · Coordinates 1.238, 2.431

→ Magic Happens →

Skyscanner: Hotel ID 123456 · Name "Euskalduna Conference Center" · Street address "Av. Abandoibarra 3" · Coordinates 1.23754, 2.43123

The unified record is then published as a Data Release.
So what about the images?
Partner A (Hotel ID 123), Partner B (Hotel ID $abc), Partner C (Hotel ID bilbao-hot1) → Magic Happens → Skyscanner Hotel ID 123456
With more than 200 partners, 800,000 hotels reach production.
Images to process = K * M * N ≈ 35M images
  K = number of partners
  M = average number of hotels per partner
  N = average number of images per partner hotel
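As a rough back-of-the-envelope check of that formula (the per-partner averages below are illustrative assumptions, not figures from the talk):

K = 200   # partners
M = 4000  # avg hotels per partner (hypothetical)
N = 44    # avg images per partner hotel (hypothetical)
print(K * M * N)  # 35200000, i.e. roughly 35M images to process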
Resizing is a thing, and we have 14 different configurations.
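The resizing code itself isn't shown in the slides; a minimal Pillow sketch of applying one such configuration might look like the following, where the configuration names and sizes are assumptions, not the real 14:

from PIL import Image

# Hypothetical resize configurations: name -> bounding box (width, height).
# The real pipeline has 14 of them; these names and sizes are made up.
RESIZE_CONFIGURATIONS = {
    'thumbnail': (180, 120),
    'gallery': (1024, 768),
}

def generate_resized(source_path, configuration):
    """Shrink an image so it fits inside the configuration's bounding box."""
    image = Image.open(source_path)
    image.thumbnail(RESIZE_CONFIGURATIONS[configuration], Image.ANTIALIAS)
    return image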
Tale of an image processing pipeline
Tech Stack, riding on AWS: SQS (Simple Queue Service) and compute resources.
Libraries, on Python 2.7:
  DjangoRestFramework (*without the Django ORM)
  Kombu: messaging / queues / amqp
  Boto: Amazon stuff
  Pillow: image processing
Tale of an image processing pipeline:
  Triggering → Downloading → Fingerprinting → Deduplicating → Prioritising → Generating
Triggering is triggered by the Data Release; the remaining stages are asynchronous (always running).
Pipeline stage: Triggering
The catalogues from the Data Release (Partner A: Hotel ID 123, Partner B: Hotel ID $abc, Partner C: Hotel ID bilbao-hot-1, each with its list of image URLs) are diffed against the Images DB through the Images API: these URLs are new, these URLs are updated, those URLs are deleted. The result is published as an Image Release.
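A minimal sketch of what that diff could look like; the dict shape and the checksum comparison are assumptions standing in for whatever the real Images API exposes:

def compute_image_diff(catalogue, stored):
    """Diff a hotel's catalogue URLs against what the Images DB already has.

    catalogue and stored are dicts {url: checksum}; the checksum is a
    hypothetical stand-in for however updated images are really detected.
    """
    catalogue_urls, stored_urls = set(catalogue), set(stored)
    new_urls = catalogue_urls - stored_urls
    deleted_urls = stored_urls - catalogue_urls
    updated_urls = {url for url in catalogue_urls & stored_urls
                    if catalogue[url] != stored[url]}
    return new_urls, updated_urls, deleted_urls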
Pipeline stage: Downloading
import io

import boto
import requests
from PIL import Image

s3 = boto.connect_s3()
bucket = s3.get_bucket('available-images')


def should_filter(image):
    width, height = image.size  # PIL reports size as (width, height)
    short_size = min(width, height)
    if short_size < minimum_short:
        return True
    long_size = max(width, height)
    if long_size < minimum_long:
        return True
    total_pixels = width * height
    if total_pixels > max_pixels:
        return True
    return False


@reliable_callback()
def downloader_callback(queued_image):
    """Overly simplified downloading callback without error handling logic."""
    response = requests.get(queued_image.url)
    blob = response.content
    # Store the raw bytes in S3...
    key = bucket.new_key(queued_image.basename)
    key.set_contents_from_string(blob)
    # ...then decide whether the image is worth keeping at all.
    image = Image.open(io.BytesIO(blob))
    if should_filter(image):
        return
    fingerprinting_producer.publish(queued_image)
import functools
import logging
import warnings

from PIL import Image

logger = logging.getLogger(__name__)


def reliable_callback():
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Turn decompression bombs into exceptions so they are caught below.
            warnings.simplefilter('error', Image.DecompressionBombWarning)
            try:
                return func(*args, **kwargs)
            except BaseException:
                # Never let a single bad image kill the worker.
                logger.error("Critical worker error", exc_info=True)
        return wrapper
    return decorator
from kombu import Connection, Consumer, Exchange, Queue, eventloop


class KombuConsumer(common.BaseConsumer):
    # ... bla bla

    def callback(self, body, message):
        self.handler(body)
        message.ack()

    def listen(self):
        with Connection(self.backend.broker,
                        transport_options={'region': self.backend.region}) as connection:
            with Consumer(connection, self.queue,
                          callbacks=[self.callback],
                          accept=[self.backend.serializer]):
                for _ in eventloop(connection):
                    pass


# What a simplified worker looks like.
# Broker URI stored on the Backend object, looks like:
#   sqs://{s3_key}:{s3_secret}@
consumer = KombuConsumer(backend, handler=downloader.downloader_callback)
consumer.listen()
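The matching producer side (for instance the fingerprinting_producer the downloader publishes to) isn't shown in the slides; a minimal Kombu sketch, assuming the same Backend object, could look like this:

from kombu import Connection, Queue


class KombuProducer(object):
    """Hypothetical counterpart to KombuConsumer: pushes messages onto an
    SQS queue through Kombu. Not the talk's actual implementation."""

    def __init__(self, backend, queue_name):
        self.backend = backend
        self.queue = Queue(queue_name)

    def publish(self, body):
        with Connection(self.backend.broker,
                        transport_options={'region': self.backend.region}) as connection:
            producer = connection.Producer(serializer=self.backend.serializer)
            producer.publish(body,
                             routing_key=self.queue.name,
                             declare=[self.queue])


# e.g. fingerprinting_producer = KombuProducer(backend, 'fingerprinting')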