Gemini API File Search is now multimodal

Google has expanded the Gemini API's File Search tool to accept multimodal input, so developers can query indexed files with natural-language prompts, images, and audio. The tool serves retrieval-augmented generation (RAG) workflows, and the update could make search more flexible for teams that work with large document archives.

Tom H (AI-assisted) | May 10, 2026 | 1 min read

Based on reporting from Source.

Google has expanded the Gemini API's File Search tool to support multimodal data — images, audio, and text — alongside custom metadata filtering and page-level citations. The update is designed to make retrieval-augmented generation (RAG) systems more flexible and accurate for developers building search into their applications.

What changed

Previously, File Search was limited to text-based queries. The new version processes images and text together natively, using the Gemini Embedding 2 model. This means you can search an archive of visual assets using natural language descriptions of tone, style, or content — without relying on filenames or keywords. Audio inputs are also supported, though the documentation does not specify which audio formats or length limits apply.

Custom metadata is another addition. You can attach key-value labels to files — for example, department: Legal or status: Final — and then filter search results by those labels at query time. This reduces noise from irrelevant documents and speeds up retrieval in large archives.
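To make the filtering idea concrete, here is a minimal local sketch in Python. It does not call the Gemini API: the `IndexedFile` class and `filter_by_metadata` function are hypothetical names invented for illustration, and the real service may express filters differently. It only shows the general technique of narrowing candidates by key-value labels before any ranking happens.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedFile:
    """Hypothetical record for one uploaded file and its key-value labels."""
    name: str
    metadata: dict[str, str] = field(default_factory=dict)


def filter_by_metadata(files: list[IndexedFile], required: dict[str, str]) -> list[IndexedFile]:
    """Keep only files whose metadata matches every required key-value pair."""
    return [
        f for f in files
        if all(f.metadata.get(key) == value for key, value in required.items())
    ]


# Illustrative archive; file names and labels are made up.
archive = [
    IndexedFile("nda_v3.pdf", {"department": "Legal", "status": "Final"}),
    IndexedFile("nda_v2.pdf", {"department": "Legal", "status": "Draft"}),
    IndexedFile("q3_campaign.png", {"department": "Marketing", "status": "Final"}),
]

matches = filter_by_metadata(archive, {"department": "Legal", "status": "Final"})
print([f.name for f in matches])
```

In this toy archive only the final Legal document survives the filter, which is the "less noise" effect described above: drafts and other departments never reach the ranking step.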

Page citations are now included in responses. When the model pulls an answer from a PDF, it returns the exact page number, allowing users to verify the source directly. This improves transparency and is useful for fact-checking in regulated or document-heavy workflows.
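As a rough illustration of how an application might surface page-level citations to users, the Python sketch below formats a list of citation records for display. The `Citation` class and its fields are assumptions made for this example, not the response schema of the Gemini API, which the source does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: the file and page an answer came from."""
    file_name: str
    page: int


def format_citations(citations: list[Citation]) -> str:
    """Render unique citations as 'file, p. N', sorted by file name then page."""
    unique = sorted(set(citations), key=lambda c: (c.file_name, c.page))
    return "; ".join(f"{c.file_name}, p. {c.page}" for c in unique)


# Illustrative data; in practice these would be parsed from the API response.
cited = [
    Citation("policy.pdf", 12),
    Citation("contract.pdf", 4),
    Citation("policy.pdf", 12),  # duplicate, collapsed by the set() above
]
print(format_citations(cited))  # contract.pdf, p. 4; policy.pdf, p. 12
```

Deduplicating before display keeps the citation line short when several retrieved passages come from the same page.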

How it works

File Search handles the infrastructure for indexing and retrieval. Developers upload files and then query them through the Gemini API. The tool acts as the retrieval layer of a retrieval-augmented generation (RAG) pipeline: it fetches the most relevant content so the model can ground its answer in it. The underlying embedding model is Gemini Embedding 2, which handles both text and image data natively.
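The general idea behind embedding-based retrieval can be sketched in a few lines of Python. This is a toy under stated assumptions, not how File Search is implemented: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers. The point is that once text and images are mapped into one vector space, a single similarity ranking can return both.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k entries most similar to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for embeddings of an image, a PDF, and a text file.
index = {
    "sunset_beach.jpg": [0.9, 0.1, 0.0],
    "quarterly_report.pdf": [0.0, 0.2, 0.9],
    "beach_trip_notes.txt": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # stand-in for an embedded "beach at dusk" query

for name, score in top_k(query_vector, index):
    print(f"{name}: {score:.3f}")
```

With these stand-in vectors, the image and the text note about the beach rank above the unrelated report, even though they are different media types.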

Google provides code snippets in its developer guide and API documentation. The basic workflow is:

1. Upload files (text, images, or audio) to the File Search tool.
2. Optionally attach custom metadata as key-value pairs.
3. Query using natural language, images, or audio.
4. Receive results with page citations where applicable.

Tradeoffs

File Search is a managed service — you don't need to set up your own vector database or embedding pipeline. That reduces operational overhead but also means you're tied to Google's infrastructure and pricing. The tool is part of the Gemini API, so costs depend on usage volume and the specific model tier.

Custom metadata filtering is a significant improvement for production RAG systems, but it requires upfront labeling effort. If your data lacks consistent metadata, the feature adds little value.

Page citations are described only for PDFs. For other file types (images, audio, plain text), the source does not explain how citations work.

When to use it

This update is relevant for any application that needs to search across mixed media — creative agencies looking for visual assets, legal teams scanning document archives, or customer support tools that need to retrieve information from PDFs and images alike. The multimodal search and metadata filtering make it suitable for larger, more complex datasets where simple keyword search falls short.

Bottom line

Google's Gemini API File Search now offers a practical, managed RAG solution with multimodal input, metadata filtering, and source citations. It removes the need to build custom embedding and retrieval infrastructure, at the cost of vendor lock-in and potential usage fees. For teams already using the Gemini API, it's a straightforward upgrade.
