Stop Thinking, Just Do!

Sungsoo Kim's Blog

Building Multimodal AI Agents From Scratch

25 September 2025


Building Multimodal AI Agents From Scratch — Apoorva Joshi, MongoDB

Abstract

In this hands-on workshop, you will build a multimodal AI agent capable of processing mixed-media content, from analyzing charts and diagrams to extracting insights from documents with embedded visuals. Using MongoDB as a vector database and memory store and Google’s Gemini for multimodal reasoning, you will implement the core components directly in good ol’ Python, gaining practical experience with multimodal data processing pipelines and agent orchestration patterns.

You will be provided with a GitHub repository containing the learning materials and resources needed to complete the hands-on portions of the workshop.
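
To give a feel for the pattern the abstract describes (retrieve relevant content from a vector store, pass it to a model, and keep a memory of the exchange), here is a minimal sketch in plain Python. It is not the workshop's code. Every name in it (`InMemoryVectorStore`, `fake_model`, `Agent`) is hypothetical. The in-memory store stands in for MongoDB, the word-count "embedding" stands in for a real multimodal embedding model, and `fake_model` stands in for a call to Gemini.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database such as MongoDB."""
    docs: list = field(default_factory=list)

    def add(self, text: str, kind: str) -> None:
        # `kind` records the modality, e.g. "text" or "image" (via a caption).
        self.docs.append({"text": text, "kind": kind, "embedding": embed(text)})

    def search(self, query: str, k: int = 1) -> list:
        # Rank every stored document by similarity to the query, best first.
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d["embedding"]), reverse=True)
        return ranked[:k]


def fake_model(question: str, context: list) -> str:
    """Placeholder for the multimodal model call; it only echoes its inputs."""
    sources = "; ".join(f"[{d['kind']}] {d['text']}" for d in context)
    return f"Q: {question} | context: {sources}"


@dataclass
class Agent:
    """Retrieve, then generate, then remember."""
    store: InMemoryVectorStore
    memory: list = field(default_factory=list)

    def ask(self, question: str) -> str:
        context = self.store.search(question, k=1)   # retrieval step
        answer = fake_model(question, context)       # reasoning step (stubbed)
        self.memory.append((question, answer))       # memory step
        return answer


demo_store = InMemoryVectorStore()
demo_store.add("quarterly revenue chart showing growth", kind="image")
demo_store.add("installation guide for the database driver", kind="text")
demo_agent = Agent(demo_store)
print(demo_agent.ask("revenue chart"))
```

In a real build, the store's `search` would be replaced by a vector query against the database, `embed` by an embedding model that handles both text and images, and `fake_model` by an actual model call. The three-step shape of `Agent.ask` is the part this sketch is meant to illustrate.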