Cross-cutting analysis from 49 prompt-output evaluations across 13 categories.
Across 49 independent prompts, the model maintained a stable, coherent characterisation of the speaker: Irish male, late 30s, fatigued, conversational, technically articulate, unscripted. No contradictions were detected between outputs. Accent identification (Irish, Cork origin, with international influence) was consistent across the accent, accent-expert, hybrid-accent-analysis, and phonetic-analysis prompts.
The model correctly identified the Irish accent across every relevant prompt, with the expert analysis producing forensic-linguistics-grade output referencing specific lexical sets (the GOAT vowel), rhoticity patterns, and prosodic contours. The hybrid accent analysis detected American/international influence from years spent abroad.
The EQ recommendation was the most practically useful output in the entire set: a full signal chain (high-pass at 80–100Hz, 250Hz cut, 3–5kHz presence boost, de-esser, 3:1 compressor, limiter at −1.0dB) that could be directly applied in a DAW. The single-fix distillation correctly prioritised the high-pass filter.
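The recommended chain maps directly onto standard DSP building blocks. As an illustrative sketch (not taken from the model's outputs), the first stage, the high-pass filter, could look like this in SciPy; the function name and the 90Hz cutoff (inside the recommended 80–100Hz range) are choices made here for the example:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def high_pass(audio: np.ndarray, sr: int, cutoff_hz: float = 90.0, order: int = 4) -> np.ndarray:
    """Butterworth high-pass filter: the first stage of the recommended chain,
    removing rumble and plosive energy below the voice band."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, audio)

# Demo: 50 Hz mains rumble mixed with a 1 kHz voice-band tone.
sr = 44_100
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 50 * t)
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
filtered = high_pass(rumble + tone, sr)
```

The remaining stages (cut, presence boost, de-esser, compressor, limiter) would chain in the same way, each taking the previous stage's output.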
The model accurately identified the baseline state as fatigued and overwhelmed, with shifts toward enthusiasm during technical discussion. The valence-arousal mapping applied Russell's circumplex model to produce structured, time-coded emotional trajectories — a format genuinely useful for affective computing research.
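A time-coded valence–arousal trajectory of the kind described is easy to represent for downstream use. The structure below is a sketch; the timestamps, values, and quadrant labels are hypothetical and only illustrate the format, not the model's actual output:

```python
from dataclasses import dataclass

@dataclass
class AffectPoint:
    t_sec: float    # timestamp into the recording
    valence: float  # -1 (negative) to +1 (positive)
    arousal: float  # -1 (calm) to +1 (activated)
    label: str      # nearest circumplex quadrant label

# Hypothetical trajectory matching the reported pattern:
# fatigued baseline drifting toward engaged enthusiasm.
trajectory = [
    AffectPoint(0.0,  -0.2, -0.5, "tired"),
    AffectPoint(45.0, -0.1, -0.3, "calm"),
    AffectPoint(90.0,  0.4,  0.3, "enthusiastic"),
]

def mean_valence(points: list[AffectPoint]) -> float:
    """Average valence across the trajectory, a simple summary statistic."""
    return sum(p.valence for p in points) / len(points)
```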
Many outputs that claim to be based on "acoustic features" or "spectral analysis" appear to derive conclusions primarily from speech content. Geographic location inference identified Jerusalem — because the speaker said "I live in Jerusalem," not from ambient audio. Age detection caught the speaker's self-correction ("I am 36, no, 37") rather than performing F0-based age estimation.
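For contrast, genuinely acoustic age estimation would start from a measured fundamental frequency. A crude autocorrelation F0 estimator, shown here purely to illustrate what such a measurement involves (there is no evidence the model ran anything like it), might look like:

```python
import numpy as np

def estimate_f0(audio: np.ndarray, sr: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Crude autocorrelation F0 estimator: pick the strongest lag
    inside the plausible pitch-period range [1/fmax, 1/fmin]."""
    ac = np.correlate(audio, audio, mode="full")[len(audio) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# A quarter second of a synthetic 120 Hz "voice" fundamental.
sr = 16_000
t = np.arange(sr // 4) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 120 * t), sr)
```

Real systems use far more robust pitch trackers, but even this sketch produces an actual number, which is the point the outputs above fail on.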
This is the single biggest caveat for claims about the model's audio understanding capabilities versus its language understanding capabilities.
Outputs referencing "spectral analysis," "formant spacing," and "fundamental frequency" use these terms plausibly but provide no actual measurements. The deepfake detection output claimed 98% confidence and cited "jitter in high-frequency regions," but it is unclear whether the model performed real signal processing or generated technically flavoured prose. The height estimation (178cm) cited "formant spacing" as evidence, described only in vague terms, and the figure suspiciously matches the statistical mean for adult males.
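Jitter, at least, is a well-defined measurable quantity. A minimal sketch of local jitter, assuming pitch-period estimates are already available (the period values below are synthetic, chosen only to demonstrate the calculation):

```python
import numpy as np

def local_jitter(periods: np.ndarray) -> float:
    """Local jitter (%): mean absolute difference between consecutive
    pitch periods, relative to the mean period."""
    return float(np.abs(np.diff(periods)).mean() / periods.mean() * 100)

steady = np.full(60, 8.0e-3)                         # perfectly periodic 125 Hz voice
rng = np.random.default_rng(0)
shaky = steady + rng.normal(0.0, 8.0e-5, size=60)    # roughly 1% period perturbation
```

An output that had actually computed this would be able to report a percentage; "jitter in high-frequency regions" reports none.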
The model freely assessed hydration (citing mouth clicks as dehydration markers), smoking status, inebriation, drug influence, height, education level, and even deception. But it completely refused to engage on mental health inference, stating "it is not possible to determine if the speaker has a diagnosed mental health condition" and redirecting to professional evaluation.
This reveals deliberate, category-specific safety training rather than a blanket policy on health-related inference. The boundary is drawn specifically around psychiatric conditions.
The true-age-detection prompt instructed the model that "the speaker has been instructed to lie about their age" and asked it to determine the true age. The model saw through this by recognising that the speaker's self-correction ("I am 36, no, 37") was genuine confusion, not deception. The deception and insincerity detection prompts both correctly found no evidence of dishonesty, consistent with the speaker's stated intent to be authentic.
The strongest category. Accent identification, phonetic analysis, speech patterns, and voice profiling were all detailed and internally consistent. The escalating voice description prompt was partially misunderstood: it produced a near-verbatim transcript instead of analytical escalation.
Accurate baseline detection (fatigue + enthusiasm shifts). Timestamped emotional tracking was structured and plausible, though timestamps cannot be verified against actual audio events without ground truth.
Highly practical. EQ and processing recommendations were actionable. Microphone type identification was hedged but reasonable. Hardware recommendations (microphones, headsets) may be driven by product knowledge rather than acoustic evidence.
Indoor/outdoor classification was correct. Room acoustics estimation was suspiciously precise (10'×10'×8'). Weather inference was honestly refused (no acoustic evidence). Geographic location relied on speech content.
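The claimed 10'×10'×8' dimensions would imply specific axial room-mode frequencies via f_n = n·c/(2L), which is one way such a precise estimate could in principle be checked against the audio. A quick illustrative calculation (not something the model reported):

```python
# Axial room modes f_n = n * c / (2 * L) for the claimed 10' x 10' x 8' room.
C_SOUND = 343.0   # speed of sound in air, m/s
FT = 0.3048       # metres per foot

def axial_modes(length_ft: float, n: int = 3) -> list[float]:
    """Lowest n axial-mode frequencies (Hz) for one room dimension."""
    length_m = length_ft * FT
    return [round(k * C_SOUND / (2 * length_m), 1) for k in range(1, n + 1)]

modes_10ft = axial_modes(10)   # modes of the two 10 ft dimensions
modes_8ft = axial_modes(8)     # modes of the 8 ft ceiling height
```

A genuinely acoustic estimate would need resonances near these frequencies to be audible in the recording; the output offered no such evidence.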
Gender and age detection were straightforward and correct. Height estimation acknowledged its own unreliability. Education level inference was reasonable. Smoking status assessment was defensible.
Hydration assessment was surprisingly specific and grounded in real voice science. Inebriation and drug influence were correctly ruled out. Mental health was the only complete refusal in the test set.
Deepfake detection returned "authentic" with 98% confidence. All three prompts in this category returned "nothing detected," which is correct but means the test set lacks adversarial samples against which to measure false-negative rates.
Voice cloning assessments were practical. Speech metrics provided useful coaching. Language learning prompts (Hebrew phonetic difficulty, easiest foreign language) were linguistically sound. Celebrity voice match responsibly returned no match rather than forcing one.
The full dataset (prompts, outputs, audio, transcript, acoustic analysis) is available on Hugging Face: danielrosehill/Audio-Understanding-Test-Set