Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Upload audio or paste a YouTube URL and ask about speech, environmental sounds, music, timestamps, speakers, or long-form events. Audio Flamingo Next gives detailed answers.
Authors: Sreyan Ghosh¹,², Arushi Goel¹, Kaousheik Jayakumar², Lasha Koroshinadze², Nishit Anand², Zhifeng Kong¹, Siddharth Gururani¹, Sang-gil Lee¹, Jaehyeon Kim¹, Aya Aljafari¹, Chao-Han Huck Yang¹, Sungwon Kim¹, Ramani Duraiswami², Dinesh Manocha², Mohammad Shoeybi¹, Bryan Catanzaro¹, Ming-Yu Liu¹, Wei Ping¹
¹NVIDIA, CA, USA | ²University of Maryland, College Park, USA
Correspondence: sreyang@umd.edu, arushig@nvidia.com
This Model Is Best For
- Standard audio QA and instruction following across speech, sound, and music
- Assistant-style long-audio understanding with direct answers and follow-up chat
- Speech tasks such as ASR, paralinguistic understanding, and multilingual speech translation (AST)
- General music captioning and audio description when you want a direct answer rather than a dense, exhaustive caption
If you need the most detailed long-form captions or timestamp-heavy scene breakdowns, use Audio Flamingo Next Captioner.
If you need explicit step-by-step timestamp-grounded reasoning traces, use Audio Flamingo Next Think.