Bad buying experiences can leave a lasting impression. Remember the time you were checking out a shirt online whose description said it had a pocket, but the pictures clearly showed it did not? Or the time you ordered a wallet or purse because the listing mentioned a coin pocket, only to discover after delivery that it had none?
The reasons for such data inaccuracies can be numerous. Still, they hardly make for a good shopping experience, or build the confidence to order anything online again anytime soon, right? A mismatch may not be a big deal for clothes or accessories, but when you are buying a car online for the first time, much more is at stake. So, let us look at one possible data discrepancy and how we are ensuring it never reaches you to spoil your online car-buying experience!
Car buyers look for specific features when purchasing a car, and one of the most important is the transmission. If you live in a metropolitan city and drive mostly in city traffic, you will probably lean towards an automatic transmission, whereas if you drive more on highways or want full control of your car, you may pick a manual. Features like these matter a lot, and any data discrepancy about them would be a big disappointment to any customer. We have trained a machine-learning pipeline to identify this data inaccuracy.
If you have ever visited the CARS24 website, you will have seen that we have a vehicle details page with images, just like a product details page on Amazon or Flipkart. One of the interior images includes a picture of the gear lever. Using this image, our machine learning model identifies the transmission type of the car and flags any data discrepancy. The key observation is that a manual car always has the gear pattern engraved on the gear knob, whereas an automatic car has nothing engraved on the knob but instead has the P-R-N-D gear pattern marked at the base of the gear lever.
You might be wondering how a machine learning model can correctly identify the transmission from a gear lever image. In reality, it is not one model but a pipeline of two machine learning models that work in sequence: an object detection model that localizes the gear knob, followed by an image classification model that identifies the transmission type.
Why not a single image classification model? The reason is that the gear lever image contains the gear lever, its base, and the gear knob, which are useful for the task, but it also contains a lot of other visual information: part of the dashboard, the AC vents, part of the infotainment system, seats, cup holders, and much more. All of this is too much information for an image classification model. Since we have no control over what the deeper layers of a CNN learn, this extra content acts as noise and may prevent the network from learning the difference between manual and automatic.
To avoid this, we placed an object detection model first to localize the gear knob region. This reduced the burden on the image classification model and helped us achieve higher overall accuracy.
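The two-stage flow can be sketched in plain Python. This is an illustrative stand-in, not CARS24's actual code: the function names, the `knob_box`/`engraved` fields, and the rule used by the stub classifier are all hypothetical placeholders for the real Faster R-CNN and MobileNetV2 models described below.

```python
def detect_gear_knob(image):
    """Stage 1 stand-in for the Faster R-CNN detector: crop the
    gear-knob region out of the full interior image."""
    box = image.get("knob_box")               # (x1, y1, x2, y2) or None
    if box is None:
        return None                           # no gear knob detected
    return {"box": box, "engraved": image.get("engraved", False)}

def classify_transmission(knob_crop):
    """Stage 2 stand-in for the MobileNetV2 classifier: a manual knob
    has the shift pattern engraved on it, an automatic one does not."""
    return "manual" if knob_crop["engraved"] else "automatic"

def predict_transmission(image):
    """Run the two models in sequence, as the pipeline above does."""
    crop = detect_gear_knob(image)
    return None if crop is None else classify_transmission(crop)
```

The design point is the hand-off: the classifier only ever sees the detector's crop, never the full cluttered interior image.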
Localization to the gear knob region in the gear lever image is achieved with Detectron2's Faster R-CNN implementation. Detectron2 is Facebook AI Research's next-generation platform providing state-of-the-art detection and segmentation algorithms. For more information on Detectron2's capabilities beyond object detection and the Faster R-CNN algorithm, please visit the Detectron2 GitHub page.
Detectron2's object detection model detects objects in three main stages. The first stage extracts feature maps from the input image, the second proposes object regions at multiple scales, and the third obtains fine-tuned box locations and classification results. At the end, you are left with a maximum of 100 boxes containing detected and classified objects. These three stages happen in three blocks, namely the backbone network, the region proposal network, and the ROI heads, respectively. You can find these three blocks in Detectron2's meta-architecture with the Base-RCNN-FPN network below.
As discussed above, the backbone network extracts features from the input image, and it does so at different scales. The output features of Base-RCNN-FPN are at 1/4, 1/8, 1/16, 1/32, and 1/64 of the input image scale and are called P2, P3, P4, P5, and P6, respectively. This is done using a ResNet50 network. Next, the region proposal network detects object regions from the multi-scale features and produces 1,000 box proposals (by default) with confidence scores. Finally, the box head crops and warps the feature maps using the proposal boxes into multiple fixed-size features and obtains fine-tuned box locations and classification results via fully-connected layers, after which at most 100 boxes (by default) are kept. You can take a closer look at the components of these three blocks of Detectron2's Base-RCNN-FPN network below:
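The relationship between the input size and the P2–P6 feature maps is simple integer arithmetic, as a minimal sketch (the helper function name is mine, not Detectron2's):

```python
def fpn_feature_sizes(height, width):
    """Spatial sizes of the P2..P6 FPN feature maps for a given input,
    at strides 4, 8, 16, 32, and 64 (i.e. 1/4 .. 1/64 of input scale)."""
    strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}
    return {name: (height // s, width // s) for name, s in strides.items()}
```

For example, a 1024×768 input yields a 256×192 map at P2 down to a 16×12 map at P6, which is why small objects are matched to the finer levels and large objects to the coarser ones.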
If you want to learn more about each component in-depth, I would strongly recommend you to go through Hiroto Honda’s five-part blog series on Detectron2’s architecture here.
Coming back to our gear knob detection, we used the base Faster R-CNN model, as the gear knob is easily identifiable. We trained with 500 warm-up iterations and about 100,000 overall training iterations.
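A training setup like this can be expressed as a Detectron2 config fragment. This is a hedged sketch, not the actual CARS24 training script: the dataset name `gear_knob_train` is hypothetical, and the choice of the COCO `faster_rcnn_R_50_FPN_3x` baseline is my assumption based on the ResNet50 FPN backbone described above. Only the warm-up and iteration counts come from the text.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Start from a COCO-pretrained Faster R-CNN with a ResNet50-FPN backbone.
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("gear_knob_train",)   # hypothetical dataset name
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1         # a single class: the gear knob
cfg.SOLVER.WARMUP_ITERS = 500               # warm-up iterations (from text)
cfg.SOLVER.MAX_ITER = 100_000               # overall iterations (from text)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```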
All the working technicalities of the object detection model apart, the bottom line is this: give an image to the model and it returns a cropped image containing the gear knob. The accuracy of this object detection model is about 99%, which is near perfect; it is even more interesting that this was achieved with a fairly small amount of training data.
Once we had the localized gear knob regions from the object detection model, we trained a binary image classification model to identify whether the gear knob is manual or automatic. Note that the object detection model from the previous step performed extremely well with relatively little data. With this heavy lifting done by the detector, the classification task became much simpler, and we could get away with a lightweight model for gear knob classification. So, we went ahead with MobileNetV2 with pre-trained weights for this purpose.
In MobileNetV2, there are two types of blocks: a residual block with a stride of 1, and a block with a stride of 2 for downsizing. Both types have three layers. The first layer is a 1×1 convolution with ReLU6, the second is a depth-wise convolution, and the third is another 1×1 convolution, but without any non-linearity. The authors claim that if ReLU were used again there, deep networks would only have the power of a linear classifier on the non-zero-volume part of the output domain.
The cropped gear knob regions are passed to the MobileNetV2 classification model, which returns the transmission type of the car. This predicted transmission is then compared with the transmission listed on the car's vehicle details page to check for any data discrepancy. This way, we can easily identify the transmission of a car from its image with this two-step machine learning pipeline.
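The final comparison step is straightforward; a minimal sketch (the function name and the rule of skipping cars with no prediction are my assumptions, not stated in the text):

```python
def has_discrepancy(listed_transmission, predicted_transmission):
    """Flag a mismatch between the transmission on the vehicle details
    page and the pipeline's prediction from the gear lever image."""
    if predicted_transmission is None:
        return False   # no prediction available: nothing to compare
    # Case-insensitive comparison of e.g. "Manual" vs "manual".
    return listed_transmission.lower() != predicted_transmission.lower()
```

Flagged cars can then be routed for manual review before the listing goes live.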
The machine learning pipeline is benchmarked on unseen data, and the numbers are as follows: accuracy 98%, precision 98%, recall 97%.
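For readers less familiar with these metrics, here is how they are computed from a binary confusion matrix. The counts below are illustrative only (they are chosen to roughly reproduce the reported numbers), and treating "manual" as the positive class is my assumption, not something the text states.

```python
def benchmark_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion-matrix
    counts: true/false positives and false/true negatives."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction correct overall
    precision = tp / (tp + fp)                   # of predicted positives, how many are right
    recall = tp / (tp + fn)                      # of actual positives, how many are found
    return accuracy, precision, recall
```

For example, 97 true positives, 2 false positives, 3 false negatives, and 98 true negatives give roughly 98% accuracy, 98% precision, and 97% recall.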
You might have crossed paths with machine learning models while browsing Netflix recommendations, auto-completing Google search phrases, interacting with virtual assistants, checking "people you may know" on Facebook, or receiving alerts about suspected fraudulent online transactions. All these models make our lives easier. In the same spirit, we are making your life easier by using machine learning models in our own way: not just in these traditional roles but also in novel ones, identifying data discrepancies beforehand to offer you a seamless online car-buying experience.
Written By: Krishna Chaitanya, Senior Machine Learning Engineer, CARS24