Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music
Upload audio or paste a YouTube URL and ask about speech, environmental sounds, music, timestamps, speakers, or long-form events. Audio Flamingo Next gives detailed answers.
Authors: Sreyan Ghosh¹,², Arushi Goel¹, Kaousheik Jayakumar², Lasha Koroshinadze², Nishit Anand², Zhifeng Kong¹, Siddharth Gururani¹, Sang-gil Lee¹, Jaehyeon Kim¹, Aya Aljafari¹, Chao-Han Huck Yang¹, Sungwon Kim¹, Ramani Duraiswami², Dinesh Manocha², Mohammad Shoeybi¹, Bryan Catanzaro¹, Ming-Yu Liu¹, Wei Ping¹
¹NVIDIA, CA, USA | ²University of Maryland, College Park, USA
Correspondence: sreyang@umd.edu, arushig@nvidia.com
This Model Is Best For
- Standard audio QA and instruction following across speech, sound, and music
- Assistant-style long-audio understanding with direct answers and follow-up chat
- Speech tasks such as ASR, paralinguistic understanding, and multilingual AST / speech translation
- Broad music captioning and audio description when you want an answer rather than a dense caption
If you need the most detailed long-form captions or timestamp-heavy scene breakdowns, use Audio Flamingo Next Captioner.
If you need explicit step-by-step timestamp-grounded reasoning traces, use Audio Flamingo Next Think.
Prompting note: AF-Next-Instruct is strongest when the task is explicit. Ask directly for QA, ASR, AST, timestamps, or speaker labels instead of relying on a generic prompt.
Prompt Guide
| Task | Prompt | Recommended Checkpoint(s) |
|---|---|---|
| ASR | Transcribe the input speech. | Instruct, Think |
| AST | Translate any speech you hear from <src_lang> into <tgt_lang>. | Instruct, Think |
| Short Audio Captioning | Generate a caption for the input audio. | Captioner, Think |
| Long Audio Captioning | Generate a detailed caption for the input audio. In the caption, transcribe all spoken content by all speakers in the audio precisely. | Captioner, Think |
| Music Captioning | Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys. | Captioner, Instruct, Think |
| Lyrics | Generate a lyrics transcription from the input song. | Instruct, Captioner, Think |
| QA | What precise description did the commentator use for the punch that ended the fight? | Instruct, Think |
| Timestamped Multi-Talker ASR | Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels. [Speaker 1] ... [Speaker 2] ... | Instruct, Think |
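The prompt guide can be kept as a small lookup so that each task always uses the exact recommended wording. The following is a minimal sketch, not part of the Audio Flamingo Next codebase: the `PROMPT_GUIDE` dictionary, its task keys, and the `build_prompt` helper are hypothetical names introduced for illustration, and only the prompt strings and checkpoint names come from the table. QA is omitted because its prompt is a free-form question.

```python
# Hypothetical lookup built from the prompt guide; keys are illustrative.
# Each entry maps a task to (prompt template, recommended checkpoints).
PROMPT_GUIDE = {
    "asr": ("Transcribe the input speech.", ("Instruct", "Think")),
    "ast": (
        "Translate any speech you hear from <src_lang> into <tgt_lang>.",
        ("Instruct", "Think"),
    ),
    "short_caption": (
        "Generate a caption for the input audio.",
        ("Captioner", "Think"),
    ),
    "long_caption": (
        "Generate a detailed caption for the input audio. In the caption, "
        "transcribe all spoken content by all speakers in the audio precisely.",
        ("Captioner", "Think"),
    ),
    "music_caption": (
        "Summarize the track with precision: mention its musical style, BPM, "
        "key, arrangement, production choices, and the emotions or story it "
        "conveys.",
        ("Captioner", "Instruct", "Think"),
    ),
    "lyrics": (
        "Generate a lyrics transcription from the input song.",
        ("Instruct", "Captioner", "Think"),
    ),
    "multi_talker_asr": (
        "Transcribe the input audio. If multiple speakers are present, "
        "provide diarized transcripts with speaker labels.",
        ("Instruct", "Think"),
    ),
}


def build_prompt(task, src_lang=None, tgt_lang=None):
    """Return (prompt, recommended_checkpoints) for a task key.

    For speech translation, both language names must be supplied so the
    <src_lang> and <tgt_lang> placeholders can be filled in.
    """
    if task not in PROMPT_GUIDE:
        raise KeyError(f"Unknown task: {task!r}")
    prompt, checkpoints = PROMPT_GUIDE[task]
    if "<src_lang>" in prompt:
        if not src_lang or not tgt_lang:
            raise ValueError("AST requires both src_lang and tgt_lang")
        prompt = prompt.replace("<src_lang>", src_lang)
        prompt = prompt.replace("<tgt_lang>", tgt_lang)
    return prompt, checkpoints
```

For example, `build_prompt("ast", "German", "English")` fills both placeholders and returns the translation prompt together with the `Instruct` and `Think` checkpoint names. Keeping the templates in one place avoids drifting toward generic prompts, which the prompting note advises against.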