How to make a pizza: Learning a compositional layer-based GAN model


Dim P. Papadopoulos
MIT

Youssef Tamaazousti
MIT

Ferda Ofli
QCRI

Ingmar Weber
QCRI

Antonio Torralba
MIT

Abstract

A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure. To do so, we learn composable module operations that are able to either add or remove a particular ingredient. Each operator is designed as a Generative Adversarial Network (GAN). Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image. The proposed model is able to decompose an image into an ordered sequence of layers by sequentially applying the corresponding removing modules in the right order. Experimental results on synthetic and real pizza images demonstrate that our proposed model is able to: (1) segment pizza toppings in a weakly-supervised fashion, (2) remove them by revealing what is occluded underneath them (i.e., inpainting), and (3) infer the ordering of the toppings without any depth ordering supervision.

pizzaGAN image


Paper


       Dim P. Papadopoulos, Youssef Tamaazousti, Ferda Ofli, Ingmar Weber, Antonio Torralba

       How to make a pizza: Learning a compositional layer-based GAN model

       In Computer Vision and Pattern Recognition (CVPR), 2019

       Also available on arXiv

       [Bibtex]


Dataset


Synthetic pizza dataset

To evaluate our proposed pizzaGAN method, we created a synthetic pizza dataset with clip-art-style pizza images. There are two main advantages of creating a dataset with synthetic pizzas. First, it allows us to generate an arbitrarily large set of pizza examples at zero human annotation cost. Second, and more importantly, we have access to accurate ground-truth ordering information and multi-layer pixel segmentation of the toppings. Examples of the synthetic pizzas are shown below.




Together with the final synthetic pizza images used in the experiments of our paper (about 5,500 images), we also provide the ground-truth segmentation masks for each topping, the ground-truth ordering of the toppings, and the RGB images of all the intermediate steps of creating the final synthetic pizza. A synthetic pizza example is shown below, followed by a sketch of how these annotations can be composed back into the final image:

Base pizza → + pepperoni → + mushrooms → + olives → + basil → + tomatoes → + bacon (with ground-truth segmentation masks per layer)
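To make the structure of these annotations concrete, here is a minimal sketch of how the per-layer masks, the per-layer RGB images, and the ordering could be composed back into the final pizza. The file names and metadata layout below are hypothetical placeholders, not the actual layout of the downloaded dataset.

```python
# Sketch only: file names and the ordering.json layout are assumptions.
import json
import numpy as np
from PIL import Image

def compose_pizza(example_dir):
    # Hypothetical metadata file listing toppings in bottom-to-top order.
    with open(f"{example_dir}/ordering.json") as f:
        toppings = json.load(f)["order"]   # e.g. ["pepperoni", "mushrooms", ...]

    canvas = np.array(Image.open(f"{example_dir}/base.png").convert("RGB"), dtype=np.float32)

    for name in toppings:
        layer = np.array(Image.open(f"{example_dir}/{name}_rgb.png").convert("RGB"), dtype=np.float32)
        mask = np.array(Image.open(f"{example_dir}/{name}_mask.png").convert("L"), dtype=np.float32) / 255.0
        mask = mask[..., None]             # (H, W, 1), broadcasts over the RGB channels
        # Paint each topping on top of everything composed so far.
        canvas = mask * layer + (1.0 - mask) * canvas

    return Image.fromarray(canvas.astype(np.uint8))
```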


Download the synthetic pizza dataset (1.8G)



Real pizza dataset

Pizza is the most photographed food on Instagram, with over 38 million posts using the hashtag #pizza. We first downloaded half a million images from Instagram using several popular pizza-related hashtags. Then, we filtered out undesired images using a CNN-based classifier trained on a small set of manually labeled pizza/non-pizza images.
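As an illustration, a filter of this kind can be as simple as a fine-tuned ImageNet backbone with a two-way head. The sketch below is our own minimal version, not the exact classifier used to build the dataset; the ResNet-18 backbone and the 0.5 decision threshold are assumptions.

```python
# Minimal sketch of a pizza/non-pizza filter (backbone and threshold are assumptions).
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard ImageNet backbone with a 2-way head (non-pizza vs. pizza).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.to(device).eval()  # assume the head has already been fine-tuned

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def is_pizza(path, threshold=0.5):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        prob = torch.softmax(model(x), dim=1)[0, 1].item()  # class 1 = pizza
    return prob >= threshold
```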

We crowd-sourced image-level labels for the pizza toppings on Amazon Mechanical Turk (AMT) for 9,213 pizza images. Given a pizza image, the annotators were instructed to label all the toppings visible on top of the pizza. To ensure high quality, every image was annotated by five different annotators, and the final image labels were obtained by majority vote.
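The majority-vote step is straightforward: a topping label is kept if at least three of the five annotators selected it. A minimal sketch with made-up annotations is shown below.

```python
# Majority-vote aggregation over five annotators (example annotations are made up).
from collections import Counter

def aggregate_labels(annotations, num_annotators=5):
    votes = Counter(t for ann in annotations for t in ann)
    return {t for t, v in votes.items() if v > num_annotators / 2}

# Example: five annotators labeling one image.
anns = [{"pepperoni", "olives"}, {"pepperoni"}, {"pepperoni", "olives"},
        {"pepperoni", "mushrooms"}, {"pepperoni", "olives"}]
print(aggregate_labels(anns))   # {'pepperoni', 'olives'}
```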


Download the real pizza dataset (2.8G)


PizzaGAN Code


Training the pizzaGAN model

pizzaGAN model architecture

Module operators trained to add and remove pepperoni on a given image. Each operator is a GAN that generates the appearance A and the mask M of the layer to be added or removed. The composite image is synthesized by combining the input image with the generated residual image.
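The compositing step in the caption can be written as out = M ⊙ A + (1 − M) ⊙ x. The PyTorch-style sketch below illustrates only this compositing; the toy encoder-decoder stands in for the actual generator architecture, and the adversarial and classification losses used for training are omitted.

```python
# Minimal sketch of one module operator's compositing step; the backbone is a
# stand-in, not the generator architecture from the paper.
import torch
import torch.nn as nn

class ModuleOperator(nn.Module):
    """One add/remove operator, e.g. 'add pepperoni' or 'remove pepperoni'."""
    def __init__(self, channels=64):
        super().__init__()
        # Toy encoder-decoder; the real generator is a full image-to-image GAN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.appearance_head = nn.Conv2d(channels, 3, 3, padding=1)  # A
        self.mask_head = nn.Conv2d(channels, 1, 3, padding=1)        # M

    def forward(self, x):
        h = self.backbone(x)
        appearance = torch.tanh(self.appearance_head(h))   # residual layer A
        mask = torch.sigmoid(self.mask_head(h))             # soft mask M in [0, 1]
        # Composite: new layer where the mask is on, original image elsewhere.
        return mask * appearance + (1.0 - mask) * x, mask

x = torch.randn(1, 3, 256, 256)          # input pizza image
composite, mask = ModuleOperator()(x)    # composite image and predicted mask
```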




Test time inference

pizzaGAN inference

Test time inference. Given a test image, our proposed model first detects the toppings appearing on the pizza (classification). Then, we predict the depth order of the toppings as they appear in the input image, from top to bottom (ordering). The green circles in the image highlight the predicted top ingredient to remove. Using this ordering, we apply the corresponding removing modules sequentially to reconstruct, in reverse, the step-by-step procedure for making the input pizza.
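A hedged sketch of that loop is given below. The classifier, the ordering network, and the per-topping removing operators are placeholders for trained models; their interfaces here are assumptions made for illustration only.

```python
# Sketch of the test-time loop; classifier, ordering_net, and remove_ops are
# placeholders for trained models with assumed interfaces.
import torch

def undo_pizza(image, classifier, ordering_net, remove_ops):
    """Peel a pizza back towards its base, one predicted topping at a time."""
    with torch.no_grad():
        toppings = classifier(image)             # e.g. {"pepperoni", "olives"}
        order = ordering_net(image, toppings)    # toppings sorted top -> bottom

        steps = [image]
        for topping in order:
            image = remove_ops[topping](image)   # apply the removing module
            steps.append(image)
    return steps  # reversed step-by-step procedure, ending near the base pizza
```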



Code

The code for our proposed pizzaGAN model is coming soon!

Results



Removing operators




Adding operators




Cooking operators



Contact

Reach out to dimpapa@mit.edu with any questions, suggestions, or feedback.