Intern-S1: Multimodal LLM for Science
Abstract
In this AI Research Roundup episode, Alex discusses the paper ‘Intern-S1: A Scientific Multimodal Foundation Model’. Intern-S1 introduces a multimodal Mixture-of-Experts (MoE) LLM family targeting complex scientific domains that require understanding of images, molecular sequences, and time-series signals. Built on a Qwen3-235B MoE backbone (241B total parameters, 28B active), it is trained on roughly 5T tokens, of which more than 2.5T come from scientific domains. The system integrates an InternViT-6B vision encoder, a dynamic tokenizer for formats such as SMILES and FASTA that achieves roughly 70% better compression, and a time-series encoder for long signals. A three-pronged strategy combines data mining and filtering to raise the purity of scientific data, page-level PDF parsing with vision-language models (VLMs), and scalable infrastructure for long-range reasoning.
Paper URL: https://arxiv.org/abs/2508.15763
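To make the tokenizer claim concrete, here is a minimal, illustrative sketch rather than the paper's actual implementation: it contrasts naive character-level tokenization of a SMILES string with a toy domain-aware tokenizer that merges multi-character chemical units, which is the kind of effect a compression improvement for SMILES refers to. The regex, function names, example string, and resulting numbers are assumptions for illustration only.

```python
# Illustrative sketch only: NOT Intern-S1's dynamic tokenizer. It shows why a
# domain-aware tokenizer yields fewer tokens per SMILES string than a naive
# character-level baseline. Regex and example string are assumptions.
import re

# Group multi-character SMILES units (bracket atoms like [Si], two-letter
# elements like Cl/Br, stereo markers, ring-closure labels) into single tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|%\d{2}|[BCNOSPFIbcnosp]|\d|[=#\-\+\(\)/\\\.])"
)

def char_level(smiles: str) -> list[str]:
    """Baseline: one token per character, what a generic tokenizer can degrade to."""
    return list(smiles)

def domain_aware(smiles: str) -> list[str]:
    """Toy domain-aware tokenizer: keeps chemically meaningful units intact."""
    return SMILES_TOKEN.findall(smiles)

if __name__ == "__main__":
    smiles = "CC(=O)Oc1ccccc1C(=O)O[Si](Cl)Br"  # toy molecule-like string
    base = char_level(smiles)
    merged = domain_aware(smiles)
    reduction = 1 - len(merged) / len(base)
    print(f"char-level tokens  : {len(base)}")
    print(f"domain-aware tokens: {len(merged)}")
    print(f"token reduction    : {reduction:.0%}")
```

On this toy example the reduction is modest; the paper's ~70% figure comes from its own dynamic tokenizer and corpus, and is not reproduced by this sketch.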