Just imagine: A computer that recognizes new objects after seeing just a few examples. Sounds like science fiction? Few-Shot Object Detection (FSOD) makes this vision a reality. FSOD is an exciting area of research that is pushing the boundaries of artificial intelligence.


Innovative FSOD: Fast Object Detection with Minimal Training Data

In simple terms, FSOD is about teaching computers to recognise objects in images or videos, even if they have seen only a few examples of these objects. Traditional object detection methods require huge amounts of training data, which is time-consuming and costly to collect and annotate. FSOD, in contrast, makes it possible to recognise new objects with minimal effort.

In XReco, we integrate an innovative FSOD approach to search for arbitrary objects in image and video databases. For this purpose, we have built an analysis service based on a client-server architecture. In addition, we have implemented a web-based application for incremental training of object classes that is suitable for anyone without machine learning expertise.

How does FSOD work?

With few-shot learning, algorithms learn from a limited number of examples. For object detection, this means learning to localise the objects of interest and to label them with their type (class). One of the biggest challenges in FSOD is overfitting: since only a few training examples are available, models tend to adapt too closely to the specific characteristics of these examples and therefore fail to recognise new, slightly different instances of the same objects.
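The core idea of learning from a handful of examples can be illustrated with a minimal sketch: average the feature vectors of the few support examples into a class "prototype" and classify a query by cosine similarity. This is a simplified toy model of the prototype idea, not the actual DE-ViT or XReco pipeline; all names, dimensions, and data here are illustrative assumptions.

```python
import numpy as np

def build_prototype(support_features):
    # Average the few support feature vectors into one class
    # prototype and L2-normalise it (illustrative sketch).
    proto = np.mean(support_features, axis=0)
    return proto / np.linalg.norm(proto)

def classify(query_feature, prototypes):
    # Assign the query to the class whose prototype has the
    # highest cosine similarity.
    q = query_feature / np.linalg.norm(query_feature)
    scores = {name: float(q @ p) for name, p in prototypes.items()}
    return max(scores, key=scores.get), scores

# Toy features: two classes with only three support examples each.
rng = np.random.default_rng(0)
prototypes = {
    "cat": build_prototype(rng.normal(loc=1.0, size=(3, 8))),
    "dog": build_prototype(rng.normal(loc=-1.0, size=(3, 8))),
}
label, scores = classify(rng.normal(loc=1.0, size=8), prototypes)
print(label)
```

With so few examples per class, the prototype is noisy, which is exactly why naive approaches overfit and why methods like DE-ViT add further machinery on top of this basic idea.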


Figure 1: Few Shot Object Detection – A video or an image sequence can be loaded in the web application to test the few shot object detection.

Various innovative approaches have been developed to overcome these challenges. One of them is DE-ViT [Zhang, 2024], a state-of-the-art FSOD method that has achieved outstanding results in recent studies, especially for objects of which very few examples are available. DE-ViT is built on DINOv2 [Oquab, 2024], a self-supervised Vision Transformer (ViT) trained on large image datasets that generates robust visual features. No retraining of the ViT backbone is required to add a new object class.

To combat the problem of overfitting, the DE-ViT method uses a low-rank subspace representation. This technique compresses the high-dimensional image features into a lower-dimensional subspace, which reduces the complexity of the model and improves its generalisation capability, making it better able to recognise new objects.
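The general mechanism behind such a low-rank projection can be sketched with a truncated SVD: find an orthonormal basis for the subspace spanned by the strongest directions of the feature matrix and project features onto it. The feature dimension and rank below are toy values chosen for illustration; this is not DE-ViT's actual construction.

```python
import numpy as np

# Stack toy high-dimensional support features as rows.
rng = np.random.default_rng(1)
features = rng.normal(size=(20, 256))  # 20 examples, 256-dim features

# Truncated SVD: the top right-singular vectors form an
# orthonormal basis of a low-rank subspace.
rank = 8
_, _, vt = np.linalg.svd(features, full_matrices=False)
basis = vt[:rank]                      # shape (8, 256)

# Project into the subspace (compression) and map back
# (rank-8 approximation of the original features).
projected = features @ basis.T         # shape (20, 8)
reconstructed = projected @ basis      # shape (20, 256)
print(projected.shape)
```

Working in the 8-dimensional subspace instead of the full 256 dimensions discards directions that a handful of examples cannot reliably constrain, which is the intuition behind the improved generalisation.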


Figure 2: Few Shot Object Detection – The analysis results are displayed based on the selected object classes and detection threshold.


Figure 3: Few Shot Object Detection – Annotating retrieved images generates training images along with their corresponding object image masks.

The web application: FSOD training for everyone

We make FSOD usable not only for AI experts. Thanks to a user-friendly web application developed in XReco, anyone can now harness the power of FSOD. The application provides an intuitive graphical user interface that allows users to train new object classes and to analyse images and videos.

To train a new object class, simply search for sample images using the application’s built-in image search, which taps into XReco’s Neural Media Repository, draw bounding boxes around the objects and start the training process. The application even uses an image segmentation method based on visual prompting to automatically refine the object boundaries, ensuring the accuracy of the training data. Training is very fast, taking only around two seconds per example on a consumer graphics card. After training, you can test recognition of the new object class on image sequences or videos.
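The training step described above, turning a few annotated examples with refined object masks into a recognisable class, can be sketched as masked feature pooling. Everything below (function names, shapes, the toy data) is an illustrative assumption, not the XReco service API.

```python
import numpy as np

def add_class(prototypes, name, feature_maps, masks):
    # For each annotated example, average the patch features that
    # fall inside the refined object mask, then average across
    # examples into one normalised class prototype.
    pooled = []
    for fmap, mask in zip(feature_maps, masks):
        # fmap: (H, W, D) patch features; mask: (H, W) binary mask
        pooled.append(fmap[mask.astype(bool)].mean(axis=0))
    proto = np.mean(pooled, axis=0)
    prototypes[name] = proto / np.linalg.norm(proto)
    return prototypes

# Two annotated training examples for a hypothetical new class.
rng = np.random.default_rng(2)
fmaps = [rng.normal(size=(16, 16, 32)) for _ in range(2)]
masks = [np.zeros((16, 16)) for _ in range(2)]
for m in masks:
    m[4:12, 4:12] = 1  # the object occupies the centre region

prototypes = add_class({}, "statue", fmaps, masks)
print(prototypes["statue"].shape)
```

Because adding a class amounts to pooling features rather than retraining the backbone, this kind of incremental training can plausibly run in seconds, consistent with the timings reported above.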

Watch the video below to see how easy and fast it is to add a new object class to the object recognition system:

About JOANNEUM RESEARCH

JOANNEUM RESEARCH is an applied research organisation based in Graz, Austria, participating in XReco with the Intelligent Vision Applications group of its Institute for Digital Technologies. Its work in XReco focuses on methods for visual content understanding and description, in order to facilitate the search for and use of content for 3D reconstruction.

Stay tuned and become part of the XReco community for exclusive information.
Never miss an update. Subscribe to the XReco newsletter.

References

[Zhang, 2024] Xinyu Zhang, Yuhan Liu, Yuting Wang and Abdeslam Boularias, “Detect Everything with Few Examples”, arXiv 2309.12969, 2024, https://arxiv.org/abs/2309.12969

[Oquab, 2024] Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al., “DINOv2: Learning Robust Visual Features without Supervision”, Transactions on Machine Learning Research, ISSN 2835-8856, 2024
