The demand for efficient content retrieval has surged, thanks to the exponential rise in multimedia data. We introduce (MR)², a new concept denoting Mixed Reality Multimedia Retrieval. (MR)² capitalizes on the transformative capacities of Mixed Reality (MR), featuring a live query function that empowers users to initiate queries intuitively by interacting with real-world objects. Within this framework, we seamlessly integrate cutting-edge technologies such as object detection (YOLOv8), semantic similarity search (CLIP), and data management (Cottontail DB [30]) within vitrivr. Through the autonomous generation of queries based on object recognition in the user’s field of view, (MR)² enables immersive retrieval of comparable multimedia content from a connected database. This research aims to redefine the user experience with multimedia databases, harmoniously uniting the physical and digital domains. The success of our iOS prototype application signals promising results, setting the stage for immersive and context-aware multimedia retrieval in the era of MR.


As technology evolves rapidly, it unveils novel and captivating avenues for interacting with digital data, leading to an overwhelming influx of multimedia content. Traditional retrieval techniques, however, struggle to manage this vast data volume. This section delves into the convergence of Artificial Intelligence (AI), Mixed Reality (MR), and multimedia retrieval, culminating in the creation of (MR)²—a transformative concept seamlessly uniting the physical and digital realms.

The motivation behind our research arises from the growing demand for seamless user interactions with multimedia content. Conventional retrieval systems, reliant on text-based queries, often fail to deliver users’ desired immersive experience. In MR environments, our goal is to facilitate effortless engagement with multimedia content by harnessing the capabilities of AI-powered object detection.

To exemplify the potential of (MR)², we present a use case featuring a user in a city centre wearing an MR headset. Besides menu navigation, our system empowers users to engage directly with historical buildings, which are recognized through object detection. When the user concentrates on a historical artefact, (MR)² can dynamically provide additional information about it and suggest similar artworks, transforming art exploration into an immersive journey.

Our investigation strives to redefine multimedia retrieval in MR environments through a robust framework integrating AI-driven object detection, XR technologies, and multimedia retrieval. This section introduces (MR)² and illustrates the revolutionary impact of AI-powered live queries on user interactions within MR environments.


This section delves into multimedia retrieval, object detection, and visual-text co-embedding. Multimedia retrieval focuses on efficient content search across diverse datasets using AI-generated ranked lists. Object detection in mixed reality relies on advanced AI techniques like YOLOv8 for real-time identification. Ultralytics’ YOLOv8 stands out in applications like autonomous driving. As exemplified by CLIP, visual-text co-embedding enhances multimedia retrieval robustness through AI-driven integration of visual and textual features. CLIP’s transformative impact extends to tasks like zero-shot image classification.

Object Detection

Detecting and interacting with physical objects in real time in mixed reality (MR) environments demands a specialized approach to object detection. Advanced AI techniques like YOLOv8 [32] and Faster R-CNN [33], particularly in deep learning, have revolutionized object detection in images and video streams. These techniques form the foundation for object detection in MR scenarios, allowing swift and accurate identification of objects in the user’s surroundings. Ultralytics [34] has introduced YOLOv8, which strategically divides input images into grid cells for predicting objects. This version employs a deep neural network with convolutional layers and feature fusion techniques to enhance its ability to detect objects of varying sizes and contexts. YOLOv8’s efficient architecture and feature fusion make it a go-to choice in applications like autonomous driving, surveillance, and object recognition due to its speed and reliability.
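Detectors of this family typically emit many overlapping candidate boxes per object, which are then pruned by non-maximum suppression based on intersection over union (IoU). The following is a minimal, illustrative Python sketch of that post-processing step — not YOLOv8’s actual implementation:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """Keep the highest-confidence box among heavily overlapping candidates."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

In practice, frameworks apply this per class and on GPU, but the greedy suppress-by-overlap logic is the same.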

Visual-Text Co-Embedding

Visual-text co-embedding is a powerful technique that fuses visual and textual features to enhance multimedia retrieval systems. The groundbreaking Contrastive Language-Image Pre-training (CLIP) [35, 36] architecture represents a significant advance in this field. CLIP uses AI to create a unified embedding space that compares text and images directly. This integration allows textual metadata to seamlessly blend with visual content, resulting in more robust and context-aware retrieval systems. The AI-driven CLIP architecture features a jointly trained vision and text encoder, allowing images and their textual descriptions to be encoded in the same space. Using a contrastive loss function, CLIP effectively brings similar image-text pairs closer while pushing dissimilar pairs apart in the shared space. CLIP’s versatility extends to tasks such as zero-shot image classification, illustrating the transformative impact of AI on multimedia retrieval.
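Once images and texts live in the same embedding space, cross-modal comparison reduces to cosine similarity between normalized vectors. The toy Python sketch below illustrates this principle with hand-picked three-dimensional vectors; real CLIP embeddings have hundreds of dimensions and come from the trained encoders:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (1.0 = identical direction)."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

# Hypothetical embeddings in a shared image-text space.
image_emb = [0.9, 0.1, 0.0]
captions = {
    "a cathedral facade": [0.8, 0.2, 0.1],
    "a red sports car": [0.0, 0.1, 0.95],
}
# Zero-shot matching: pick the caption closest to the image.
best = max(captions, key=lambda c: cosine_similarity(image_emb, captions[c]))
```

This is exactly the mechanism behind zero-shot classification: the class whose text embedding is nearest to the image embedding wins.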


(MR)² stands at the forefront of revolutionising multimedia retrieval within mixed reality environments, setting the stage for the harmonisation of the natural and virtual world. This groundbreaking system harnesses the power of advanced machine learning technologies to enable real-time interaction and a user-centric design, reshaping the way users engage with their surroundings.

Key Principles

  • Immersion:

(MR)² is crafted to immerse users in a mixed reality environment, seamlessly blending physical and digital realms. The overarching goal is to transport users into an augmented space, allowing them to interact with their surroundings while effortlessly accessing and engaging with digital content. The immersive experience aims to captivate users, making them feel fully present within the mixed-reality environment.

  • Real-time Interaction:

Central to (MR)² is a robust emphasis on real-time interaction. This principle ensures that users can perform actions and receive responses without perceptible delays. Whether capturing images, selecting objects or retrieving multimedia content, (MR)² prioritises immediate and fluid interactions. This commitment enhances the overall user experience, making it dynamic and engaging.

  • User-Centric Design:

(MR)² adopts a user-centric design approach, placing the user’s needs and preferences at the forefront. The system is meticulously crafted to be intuitive, user-friendly, and adaptable to individual requirements. This user-centric design spans the entire user journey within the mixed reality environment, aiming to cater to a diverse user base and ensure the concept is accessible and enjoyable for all.

  • Integration of Cutting-Edge ML Models:

To achieve accurate object detection and relevance in content retrieval, (MR)² integrates cutting-edge machine learning models. These models represent the pinnacle of ML and computer vision research standards, underscoring (MR)²’s commitment to leveraging technological advancements to provide users with precise and meaningful results.


The architecture of (MR)², illustrated in Figure 1, comprises two integral components:

  • Frontend:

The frontend is designed to capture user interactions and perform computations on the device. Offering three query modalities—object detection, area selection, and text queries—the frontend ensures that query results are not only accurate but also presented in an engaging manner. This approach facilitates user comprehension and utilization of the information retrieved.

  • Backend:

The backend handles data from queries, processes inputs using a sophisticated machine learning model, and conducts similarity searches using feature vectors. This cohesive model aligns seamlessly with (MR)²’s principles, facilitating the merger of real and digital worlds, ensuring real-time functionality, and prioritizing a user-centric design. Moreover, the flexibility in ML model selection positions (MR)² to adapt to future breakthroughs in the field.

Figure 1: Conceptual Architecture of (MR)²


This chapter comprehensively explores the prototype implementation of (MR)², providing insights into its intricate architecture and critical components. As an iOS application tailored for iPhones and iPads, (MR)² establishes seamless communication with the NMR backend, encompassing the vitrivr engine and Cottontail DB. This collaborative integration forms a robust foundation for the system’s operations.

  • iOS Application:

At the core of (MR)²’s functionality is the iOS application, meticulously crafted in Swift. This component drives mixed-reality interactions and robust object detection. Users experience a responsive interface that effortlessly adapts to real-world surroundings, thanks to the integration of the camera feed using AVFoundation. This integration allows for a smooth transition between front and back cameras, enhancing the interactive experience.

  • NMR Backend (vitrivr-engine and Cottontail DB):

The NMR backend, housing vitrivr-engine and Cottontail DB, is the retrieval centre for (MR)². The backend manages CLIP feature extraction through a RESTful API and executes similarity searches in Cottontail DB. This real-time process ensures the prompt retrieval of the top 100 similar objects, showcasing the system’s efficiency in delivering meaningful results to users.

(MR)²’s integration of the camera feed using AVFoundation goes beyond mere visual capture. It captures the essence of real-world surroundings, offering users an interactive and immersive experience. The app’s commitment to a responsive interface allows users to effortlessly switch between front and back cameras, contributing to the fluidity of the overall interaction.

The powerful combination of Apple’s Vision and CoreML frameworks, featuring the YOLOv8 model, empowers (MR)²’s object detection capabilities. This dynamic approach ensures the identification of objects in the live camera feed, providing users with real-time bounding boxes for a comprehensive understanding of their surroundings, seen in Figure 2a.

The backend performs CLIP feature extraction on captured images, followed by real-time similarity searches in Cottontail DB. This meticulous process ensures the prompt retrieval of the top 100 similar objects, solidifying (MR)²’s commitment to delivering efficient and meaningful user results.
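Conceptually, this similarity search is a top-k nearest-neighbour ranking of stored feature vectors against the query vector. The Python sketch below illustrates that ranking with cosine similarity over an in-memory dictionary — an illustrative stand-in, not Cottontail DB’s actual indexed implementation:

```python
import heapq
import math

def top_k_similar(query_vec, database, k=100):
    """Return the k database entries most similar to the query vector.

    `database` maps object IDs to feature vectors; results are
    (similarity, object_id) pairs, most similar first.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    scored = ((cosine(query_vec, vec), obj_id) for obj_id, vec in database.items())
    return heapq.nlargest(k, scored)
```

A vector database replaces this linear scan with index structures, but the contract — query vector in, top-k ranked IDs out — is the same one the (MR)² frontend consumes.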

The app doesn’t just retrieve results; it presents them in real-time within a dedicated ViewController. The scrollable grid, strategically designed to display the most similar images with the highest similarity in the top-left corner, offers users a visually intuitive way to navigate through results, shown in Figure 2d. Users can enlarge individual objects for closer inspection, enhancing the overall user experience.

Beyond live queries, (MR)² caters to diverse user needs with additional query options. Users can initiate region-based queries through a touch input, positioning a rectangle on the screen within the camera feed (Figure 2b). For text queries, presented in Figure 2c, users enter descriptions, and CLIP enables cross-modal similarity comparisons with image content. This flexibility ensures that (MR)² is responsive and adaptable to varying user preferences and query types.
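For region-based queries, the selected rectangle is cut out of the current camera frame before feature extraction. A minimal Python sketch of that cropping step, treating the frame as a row-major grid of pixel values (the prototype works on native camera buffers instead):

```python
def crop_region(frame, rect):
    """Extract the user-selected rectangle from a frame.

    `frame` is a row-major list of pixel rows; `rect` is
    (x, y, width, height) in pixel coordinates.
    """
    x, y, w, h = rect
    return [row[x:x + w] for row in frame[y:y + h]]
```

The cropped region is then embedded like any captured image and fed into the same similarity search as the other query modalities.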

Figure 2: Mixed reality search user interface. a) Object detection, b) Manual area selection, c) Text input, d) Result presentation.


This section offers a thorough evaluation of (MR)², covering both analytical and user-centric aspects. We analyse functional performance, including object detection inference time, query response time, and real-world usability. The section concludes with a focused discussion on overall performance and future advancements.

Performance and User Evaluation

To evaluate (MR)²’s functional performance, we undertook a comprehensive two-phase assessment: an analytical evaluation followed by a user-centric study.

The analytical evaluation examined pure performance metrics, measuring the inference time for object detection and the query response time. With a median inference time of 24.8 ms, (MR)² ensures real-time applicability. The average query time of 4191 ms remains practical, particularly given the extensive ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset [37] stored in Cottontail DB.

Simultaneously, the user evaluation engaged 14 participants in real-world scenarios, spotlighting practical usability. Participants with moderate to high technology affinity (average ATI score of 4.43 [38]) found (MR)² to be user-friendly and efficient. Reinforcing this positive perception, the System Usability Scale (SUS) score of 87 reflects the overall favourable views on usability [39].

Open feedback from users echoed the encouraging SUS results, highlighting (MR)²’s intuitiveness and practicality. While minor concerns surfaced, such as overlapping bounding boxes when multiple objects are detected, participants expressed a strong inclination to continue using (MR)², underscoring its user-friendliness and potential for widespread adoption.


While (MR)² has garnered positive feedback, we understand the perpetual need for improvement. Our commitment to enhancing user experience involves continuous exploration, especially in broadening the range of supported objects. By incorporating user feedback, (MR)² remains on a trajectory of evolution and improvement. Currently, we’re actively pursuing three exciting pathways that promise a more immersive future in multimedia retrieval:

  • Advancements are underway in query modes, immersive result presentation, device integration, and cutting-edge machine learning techniques, such as OCR, ASR, and tag integration.
  • An exploration into complexity and context awareness is unfolding, focusing on temporal queries to enrich the search landscape. Pioneering immersive result presentation in MR contexts, seamlessly overlaying search results, is a key area of interest.
  • Beyond enhancing iOS accessibility, we are venturing into compatibility with MR glasses like Meta Quest Pro or future Apple Vision Pro devices, opening new possibilities for more immersive applications. Crucial to shaping the future of MR multimedia retrieval are advancements in object detection models and self-training approaches.

Related Work

In the dynamic landscape of XR, several multimedia retrieval systems laid the groundwork for (MR)², offering diverse perspectives on integrating XR with multimedia and enriching user interactions in immersive environments.

Prioritizing text input in VR, vitrivr-VR [40, 41], intricately connected to Cineast [42], spearheads innovation in VR interfaces. This system diverges from (MR)²’s live queries, carving its own path in virtual reality exploration. Meanwhile, ViRMA [43] ventures into projecting multimedia objects for visual analytics, utilizing the multi-dimensional multimedia model (M³) [44] within the VR realm. Although it provides advanced visual analysis support, ViRMA lacks (MR)²’s object detection and automated query approach.

Shifting the focus to complete cultural heritage exploration, GoFind! [45] seamlessly blends content-based multimedia retrieval with Augmented Reality (AR). In contrast to (MR)², GoFind! places a spotlight on historical exploration, embracing varied query modalities to enhance user engagement in augmented environments.


We presented (MR)², a new Mixed Reality (MR) Multimedia Retrieval concept that uses MR technology’s power to transform how users interact with multimedia content. At the core of (MR)² lies an innovative approach to query formulation and a live query option that seamlessly connects the digital and physical worlds. Our prototype on iOS devices demonstrated how (MR)² can enhance the user’s multimedia retrieval experience.

(MR)² utilizes the YOLOv8 model for object detection. While we acknowledge the opportunities for expanding the supported object types, (MR)² exemplifies the potential for object recognition in MR environments. Additionally, we leverage the CLIP machine learning model for similarity searches to enhance retrieval accuracy.

The positive user evaluations have underscored (MR)²’s potential for engaging multimedia retrieval experiences. The live query option and seamless MR integration received particular acclaim, validating our user-centred design.

As we look to the future, (MR)² lays the foundation for further advancements. An important next step is to use the frontend for selecting specific multimedia data on-site in the XReco use cases; the retrieval functionality should therefore be incorporated there.