Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Upload audio or paste a YouTube URL and ask about speech, environmental sounds, music, timestamps, speakers, or long-form events. Audio Flamingo Next gives detailed answers.

Authors: Sreyan Ghosh¹,², Arushi Goel¹, Kaousheik Jayakumar², Lasha Koroshinadze², Nishit Anand², Zhifeng Kong¹, Siddharth Gururani¹, Sang-gil Lee¹, Jaehyeon Kim¹, Aya Aljafari¹, Chao-Han Huck Yang¹, Sungwon Kim¹, Ramani Duraiswami², Dinesh Manocha², Mohammad Shoeybi¹, Bryan Catanzaro¹, Ming-Yu Liu¹, Wei Ping¹

¹NVIDIA, CA, USA | ²University of Maryland, College Park, USA

Correspondence: sreyang@umd.edu, arushig@nvidia.com

This Model Is Best For

Standard audio QA and instruction following across speech, sound, and music
Assistant-style long-audio understanding with direct answers and follow-up chat
Speech tasks such as automatic speech recognition (ASR), paralinguistic understanding, and multilingual automatic speech translation (AST)
Broad music captioning and audio description when you want an answer rather than a dense caption

If you need the most detailed long-form captions or timestamp-heavy scene breakdowns, use Audio Flamingo Next Captioner.

If you need explicit step-by-step timestamp-grounded reasoning traces, use Audio Flamingo Next Think.

Prompting note: AF-Next-Instruct is strongest when the task is explicit. Ask directly for QA, ASR, AST, timestamps, or speaker labels instead of relying on a generic prompt.

Prompt Guide

Task	Prompt	Recommended Checkpoint(s)
ASR	`Transcribe the input speech.`	`Instruct`, `Think`
AST	`Translate any speech you hear from <src_lang> into <tgt_lang>.`	`Instruct`, `Think`
Short Audio Captioning	`Generate a caption for the input audio.`	`Captioner`, `Think`
Long Audio Captioning	`Generate a detailed caption for the input audio. In the caption, transcribe all spoken content by all speakers in the audio precisely.`	`Captioner`, `Think`
Music Captioning	`Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys.`	`Captioner`, `Instruct`, `Think`
Lyrics	`Generate a lyrics transcription from the input song.`	`Instruct`, `Captioner`, `Think`
QA	`What precise description did the commentator use for the punch that ended the fight?`	`Instruct`, `Think`
Timestamped Multi-Talker ASR	`Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels.` `[Speaker 1] ...` `[Speaker 2] ...`	`Instruct`, `Think`
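
The prompt guide above can be captured in a small lookup so that each request uses the exact task wording. The following is a minimal sketch, not an official API: `PROMPT_GUIDE`, `build_prompt`, and the task keys are hypothetical names, only a few rows of the table are included, and treating the first-listed checkpoint as the default is an assumption made for illustration.

```python
from string import Template

# Prompt wording and recommended checkpoints copied from the table above.
# The dictionary layout and task keys are hypothetical, chosen for this sketch.
PROMPT_GUIDE = {
    "asr": ("Transcribe the input speech.", ["Instruct", "Think"]),
    "ast": (
        "Translate any speech you hear from $src_lang into $tgt_lang.",
        ["Instruct", "Think"],
    ),
    "short_caption": (
        "Generate a caption for the input audio.",
        ["Captioner", "Think"],
    ),
    "lyrics": (
        "Generate a lyrics transcription from the input song.",
        ["Instruct", "Captioner", "Think"],
    ),
}


def build_prompt(task, **fields):
    """Return (prompt, checkpoint) for a task key.

    Placeholders such as $src_lang are filled from keyword arguments;
    a missing placeholder raises KeyError instead of sending a
    half-filled prompt. Picking the first-listed checkpoint is an
    assumption, not a documented preference.
    """
    template, checkpoints = PROMPT_GUIDE[task]
    prompt = Template(template).substitute(**fields)
    return prompt, checkpoints[0]


if __name__ == "__main__":
    prompt, checkpoint = build_prompt("ast", src_lang="German", tgt_lang="English")
    print(checkpoint, "|", prompt)
```

Keeping the wording in one place makes it harder to drift toward the generic prompts that the prompting note advises against.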
