A beginner's guide to winning a Deep Learning challenge

Our team, from Rakuten Technology Institute Singapore, was among the top 3 in the Rakuten Deep Learning Challenge for recipe image classification. This article elaborates on how we trained a decent model for image classification. Although we have deep learning experience in NLP, this was our first time dealing with image data, so we believe this write-up provides a good reference for beginners.

First of all, model selection.

Model selection is key in a competition or a project. You can either design your own model or use an existing model whose performance has already been demonstrated. Deep residual networks (ResNet) [1] led to the 1st-place entries in the ImageNet and COCO 2015 competitions for image classification, which promises great performance in the context of this competition.

Fig 1. The simple network I designed vs. ResNet-50
(only the top of the network is shown).

To the left of Figure 1 is the network I designed (without solid reasoning, as a beginner). It reached a top-1 accuracy of 25%. The right side of Figure 1 shows the first few layers of ResNet-50; the rest of this sophisticated architecture can be found here. When ResNet-50 was applied, the top-1 accuracy immediately increased to 57%. ResNet [1] is fantastic, but given another chance I would also try the even more promising ResNeXt [2], “a simple architecture which adopts VGG/ResNets’ strategy of repeating layers, while exploiting the split-transform-merge strategy in an easy, extensible way”.

Training data, validation data and testing data

Training data Augmentation

Enriching the training data with augmentation is very important for making the model robust: the object should be recognized across a spectrum of variations of the image. Taking an image of salmon salad as an example, the model should still recognize it as salmon salad given a random crop of the image, or when the image is flipped, rotated, or altered in brightness, contrast, and saturation. Each transform is applied with an assigned probability, so in each epoch the model sees a different variation of the original salmon salad image.

Fig 2. Training data augmentation demonstration.

Figure 2 demonstrates the training data augmentation process using the salmon salad image as an example.

Testing data with TenCrop

It’s necessary to use TenCrop in inference, especially when you evaluate top-n accuracy. This is consistent with how ResNet was applied in the ImageNet and COCO 2015 competitions. With TenCrop applied, each image gets 10 sets of predictions. If the number of classes is N, each set has N dimensions, each referring to one class. Average the 10 vectors; the highest n values then correspond to the n most probable classes for the input. It’s also encouraged to use TenCrop in validation while training, so that the validation accuracy is consistent with the test accuracy. In point 7 of the summary below you’ll see how we applied TenCrop in the model ensemble.

To do TenCrop, you need a special test/validation transform (note that it is resource-consuming, since every image is processed ten times).

Without TenCrop, you can normally use a single resize and center crop.

Loading the data and running the model then follows the standard PyTorch training routine.

Summary
  1. Overfitting doesn’t pay off. Training on all the given data and validating on (part of) that same data yields accuracy much higher (by 15-20%) than the true accuracy, and the final performance is not guaranteed to beat a clean split, say training and testing on a 9:1 split of the given data, where the true accuracy is directly measurable.
  2. We did not measure the aggregated impact of all augmentations, but simply adding ColorJitter and RandomRotation to the training data boosted top-1 accuracy by 0.5%, so together the augmentations should count for more than 1%.
  3. Applying TenCrop to the testing data boosted top-1 accuracy by 1%.
  4. Performance improves (1-2% on top-1 accuracy) with a larger batch size: the more the model sees per step, the better it optimizes. One way is to train on a better machine (we went from a GTX GPU to a V100). It’s also possible to train with a larger batch size on a smaller machine; some optimization is needed to keep the waiting time sane, but a larger GPU is not really necessary if time isn’t a concern.
  5. A more complex model is not necessarily better. ResNet-152 was not better than ResNet-50 on either top-1 or top-3 accuracy; possibly the training data was not sufficient to exploit the extra depth.
  6. Is an ensemble of models from different architectures better than one from the same architecture? Not in our experience. We tried ResNet-152 and ResNet-50 and found that if the single-model performance is good, ensembling those models gives a better result regardless of whether they share an architecture.
  7. How is the ensemble prediction made? Keep the ten sets of predictions from each model; with m models in the ensemble, you have 10×m sets of predictions. Calculate the mean across those 10×m sets and take the indices of the highest n values: those are the n most probable classes for the input.
  8. We did not handle the imbalanced class distribution. It was highly skewed in the training data, but this didn’t draw our special attention because the same distribution held in the testing data.
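The ensemble procedure in point 7 can be sketched as follows. Random tensors stand in for real model outputs, and the sizes (m models, N classes, top-n) are hypothetical.

```python
import torch

m, N, n = 3, 20, 3  # models, classes, top-n (hypothetical sizes)

# Each model contributes 10 crop-level prediction sets for one image.
per_model = [torch.softmax(torch.randn(10, N), dim=1) for _ in range(m)]

all_preds = torch.cat(per_model, dim=0)   # (10*m, N) sets of predictions
mean_pred = all_preds.mean(dim=0)         # average across all 10*m sets
topn = mean_pred.topk(n).indices          # indices of the n most probable classes
```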

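One common way to emulate a large batch on a smaller machine, as point 4 suggests, is gradient accumulation: gradients from several small batches are summed before one optimizer step. This is a generic sketch with a toy model, not our exact training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # toy model for the sketch
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4  # effective batch = batch_size * accum_steps
optimizer.zero_grad()
for step in range(8):  # 8 small batches of random stand-in data
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale so the sum matches the mean
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # update once per effective batch
        optimizer.zero_grad()
```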

This work is by Liling Tan and me. We thank Kobagapu Rao and Ali Cevahir for joining a follow-up discussion of the challenge and sharing their experiences.


[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

[2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. "Aggregated Residual Transformations for Deep Neural Networks". CVPR 2017.