Pinterest’s Engineering team recently published research on fashion recommendation that caught my attention because of its practicality. In the post, they introduce a new task in fashion recommendation called Complete The Look. Exploring this task is a great exercise for any machine learning practitioner. However, some of the terms involved are ambiguous and easy to misunderstand, so I will explain them in Part 1 and show how I reimplement the fashion recommender system in the next part.
By the end of this tutorial, I hope you will understand what the Complete The Look task is, how the Style Embedding network represents scenes and products, and how the compatibility distances and the triplet loss fit together.
The first thing I want to point out is that Complete The Look is not a method; it is a new task for fashion recommendation.
According to the Pinterest team, Complete The Look is defined as follows:
Given a scene image and a product image, compute a quantitative measure of distance such that the distance measure reflects visual complementarity between the scene and the product. Such a distance measure can be used either by a binary classifier or by a re-ranker
To put this definition in plain words: Complete The Look is a task that finds products compatible with a scene, rather than measuring compatibility among products themselves. That is the intuition behind the name “Complete The Look”.
Traditional approaches have a big disadvantage: they only compare product images from the store. These products are usually shot on a blank background, so they do not show us what the products look like in real life. Moreover, it would be a huge waste not to take advantage of the images that users post on social networks. Complete The Look is the task Pinterest proposed to address this problem.
Style Embedding
To measure the compatibility between products and scenes, Pinterest proposed a new neural network called Style Embedding. After training, Style Embedding gives us a unified representation of scenes and products.
Style Embedding uses features from intermediate layers of ResNet50, which are said to transfer well and to be strong at capturing local features.
The ResNet50 features are passed through a feed-forward network that maps scenes and products into the same space. There are three things we need embeddings for: the entire scene image, the local regions of the scene, and the product image.
Accordingly, there are three networks that give us these three types of embeddings. They share a similar architecture but use different hyperparameters.
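To make this more concrete, here is a minimal PyTorch sketch of what one such embedding branch could look like. The class name, the embedding size, and the choice of pooled ResNet50 features are my assumptions for illustration, not Pinterest’s exact architecture:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class StyleEmbedding(nn.Module):
    """Hypothetical sketch of one Style Embedding branch:
    ResNet50 features projected into a shared style space."""
    def __init__(self, embed_dim=128):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to the global average pool; drop the classifier.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Feed-forward head mapping ResNet features into the shared space.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2048, embed_dim),
        )

    def forward(self, x):
        feats = self.backbone(x)   # (B, 2048, 1, 1)
        emb = self.head(feats)     # (B, embed_dim)
        return nn.functional.normalize(emb, dim=-1)  # unit-length embedding
```

The scene, region, and product embeddings would each come from a branch like this one, trained jointly.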
Compatibility Distance
Once we have the visual embeddings of the images, we can compute the compatibility distance.
There are three types of distance we need to consider:
Global distance
Local distance
Hybrid distance
Global Distance
First, we need to see how compatible the product is with the scene globally. This is like putting the two images side by side and taking a quick glance. The global distance is simply the distance between the scene embedding and the product embedding.
Here s and p denote the scene and product images, while fs and fp are the embeddings of the scene and the product.
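Written out as a formula, with the squared Euclidean distance as the concrete choice of norm (my assumption; the post itself only says “difference”):

```latex
d_{\mathrm{global}}(s, p) = \lVert f_s - f_p \rVert_2^2
```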
Local Distance
After seeing the two images globally, we should look at the details.
The local distance is an attention-based metric for measuring compatibility at the region level.
We use a cropping layer to crop the scene into several local regions.
The attention weights help the system focus on where the item is likely to appear. These weights are computed from the distance between the category embedding and the embedding of each region of the scene, and then scaled to the range (0, 1) by the softmax function.
One thing we should notice is that ê is the embedding associated with the correct product.
The local distance is the weighted sum of the distances between the product embedding and the embedding of each region of the scene.
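Putting the two previous paragraphs into symbols, a plausible formalization looks like this (the softmax over negated distances is my assumption, chosen so that regions closer to the query get larger weights):

```latex
a_i = \frac{\exp\left(-\lVert f_{s_i} - \hat{e} \rVert_2^2\right)}
           {\sum_j \exp\left(-\lVert f_{s_j} - \hat{e} \rVert_2^2\right)},
\qquad
d_{\mathrm{local}}(s, p) = \sum_i a_i \, \lVert f_{s_i} - f_p \rVert_2^2
```

Here f_{s_i} is the embedding of the i-th scene region and ê is the attention query described above.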
Hybrid Distance
The compatibility should depend on how well the product goes with both the local regions and the entire scene. Therefore, the hybrid distance is the mean of the global distance and the local distance.
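Here is a small PyTorch sketch tying the three distances together; the function and argument names are hypothetical, and the distance forms follow the assumptions above:

```python
import torch

def hybrid_distance(f_s, f_p, f_regions, e_hat):
    """Hypothetical sketch of the hybrid compatibility distance.
    f_s:       (D,)    embedding of the whole scene
    f_p:       (D,)    embedding of the product
    f_regions: (N, D)  embeddings of the N scene regions
    e_hat:     (D,)    embedding used as the attention query
    """
    # Global distance: squared Euclidean distance between scene and product.
    d_global = torch.sum((f_s - f_p) ** 2)

    # Attention: regions closer to e_hat get larger weights (softmax of
    # negated distances), focusing on where the item is likely to appear.
    region_logits = -torch.sum((f_regions - e_hat) ** 2, dim=-1)
    attn = torch.softmax(region_logits, dim=0)            # (N,)

    # Local distance: attention-weighted sum of region-product distances.
    d_region = torch.sum((f_regions - f_p) ** 2, dim=-1)  # (N,)
    d_local = torch.sum(attn * d_region)

    # Hybrid distance: the mean of global and local distances.
    return 0.5 * (d_global + d_local)
```

At inference time, you would compute this distance between the scene and every candidate product, then rank the products by ascending distance, exactly as the re-ranker in the definition above suggests.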
Triplet Loss
Pinterest uses a triplet loss to train the Style Embedding networks. This loss works as follows:
To train the Style Embedding model, we need a triplet of a scene image, a positive image, and a negative image. The positive image is a product compatible with the scene, whereas the negative image is an incompatible product. The training process tries to minimize the distance between the scene and the positive product while pushing the negative product far from the scene. The margin α controls the required gap between the positive distance and the negative distance, and the loss is clamped at zero so it can never be negative.
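Written as a formula, this is the standard hinge-style triplet loss, where p⁺ and p⁻ are the positive and negative products and d is the hybrid distance above:

```latex
\mathcal{L}(s, p^{+}, p^{-}) = \max\left(0,\; d(s, p^{+}) - d(s, p^{-}) + \alpha\right)
```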
Conclusion
That is all for the first part. I hope you now have a basic understanding of how Pinterest solves the Complete The Look problem. Fashion recommendation is an active topic right now, and the pie is far too big for any single player. The playground is still waiting for more efficient methods, and Complete The Look is Pinterest’s latest attempt. In the next part, I will show you how to implement Pinterest’s fashion recommender.