Cross-cutting analysis from 49 prompt-output evaluations across 13 categories.
Across 49 independent prompts, the model maintained a stable, coherent characterisation of the speaker: Irish male, late 30s, fatigued, conversational, technically articulate, unscripted. No contradictions were detected between outputs. Accent identification (Irish, Cork origin, with international influence) was consistent across the accent, accent-expert, hybrid-accent-analysis, and phonetic-analysis prompts.
The model correctly identified the Irish accent across every relevant prompt, with the expert analysis producing forensic-linguistics-grade output referencing specific lexical sets (the GOAT vowel), rhoticity patterns, and prosodic contours. The hybrid accent analysis detected American/international influence from years spent abroad.
The EQ recommendation was the most practically useful output in the entire set: a full signal chain (high-pass at 80–100Hz, 250Hz cut, 3–5kHz presence boost, de-esser, 3:1 compressor, limiter at −1.0dB) that could be directly applied in a DAW. The single-fix distillation correctly prioritised the high-pass filter.
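The recommended chain maps directly onto standard DSP building blocks. As an illustrative sketch (not taken from the model's outputs), the first stage, the high-pass filter, could look like this in SciPy; the function name and the 90Hz cutoff (inside the recommended 80–100Hz range) are choices made here for the example:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def high_pass(audio: np.ndarray, sr: int, cutoff_hz: float = 90.0, order: int = 4) -> np.ndarray:
    """Butterworth high-pass filter: the first stage of the recommended chain,
    removing rumble and plosive energy below the voice band."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, audio)

# Demo: 50 Hz mains rumble mixed with a 1 kHz voice-band tone.
sr = 44_100
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 50 * t)
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
filtered = high_pass(rumble + tone, sr)
```

The remaining stages (cut, presence boost, de-esser, compressor, limiter) would chain in the same way, each taking the previous stage's output.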
The model accurately identified the baseline state as fatigued and overwhelmed, with shifts toward enthusiasm during technical discussion. The valence-arousal mapping applied Russell's circumplex model to produce structured, time-coded emotional trajectories — a format genuinely useful for affective computing research.
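A time-coded valence–arousal trajectory of the kind described is easy to represent for downstream use. The structure below is a sketch; the timestamps, values, and quadrant labels are hypothetical and only illustrate the format, not the model's actual output:

```python
from dataclasses import dataclass

@dataclass
class AffectPoint:
    t_sec: float    # timestamp into the recording
    valence: float  # -1 (negative) to +1 (positive)
    arousal: float  # -1 (calm) to +1 (activated)
    label: str      # nearest circumplex quadrant label

# Hypothetical trajectory matching the reported pattern:
# fatigued baseline drifting toward engaged enthusiasm.
trajectory = [
    AffectPoint(0.0,  -0.2, -0.5, "tired"),
    AffectPoint(45.0, -0.1, -0.3, "calm"),
    AffectPoint(90.0,  0.4,  0.3, "enthusiastic"),
]

def mean_valence(points: list[AffectPoint]) -> float:
    """Average valence across the trajectory, a simple summary statistic."""
    return sum(p.valence for p in points) / len(points)
```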
Many outputs that claim to be based on "acoustic features" or "spectral analysis" appear to derive conclusions primarily from speech content. Geographic location inference identified Jerusalem — because the speaker said "I live in Jerusalem," not from ambient audio. Age detection caught the speaker's self-correction ("I am 36, no, 37") rather than performing F0-based age estimation.
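For contrast, genuinely acoustic age estimation would start from a measured fundamental frequency. A crude autocorrelation F0 estimator, shown here purely to illustrate what such a measurement involves (there is no evidence the model ran anything like it), might look like:

```python
import numpy as np

def estimate_f0(audio: np.ndarray, sr: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Crude autocorrelation F0 estimator: pick the strongest lag
    inside the plausible pitch-period range [1/fmax, 1/fmin]."""
    ac = np.correlate(audio, audio, mode="full")[len(audio) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# A quarter second of a synthetic 120 Hz "voice" fundamental.
sr = 16_000
t = np.arange(sr // 4) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 120 * t), sr)
```

Real systems use far more robust pitch trackers, but even this sketch produces an actual number, which is the point the outputs above fail on.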
This is the single biggest caveat for claims about the model's audio understanding capabilities versus its language understanding capabilities.
Outputs referencing "spectral analysis," "formant spacing," and "fundamental frequency" use these terms plausibly but provide no actual measurements. The deepfake detection output claimed 98% confidence and cited "jitter in high-frequency regions," but it is unclear whether the model performed real signal processing or generated technically flavoured prose. The height estimation (178cm) cited "formant spacing" as evidence, described only in vague terms, and the figure suspiciously matches the statistical mean for adult males.
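Jitter, at least, is a well-defined measurable quantity. A minimal sketch of local jitter, assuming pitch-period estimates are already available (the period values below are synthetic, chosen only to demonstrate the calculation):

```python
import numpy as np

def local_jitter(periods: np.ndarray) -> float:
    """Local jitter (%): mean absolute difference between consecutive
    pitch periods, relative to the mean period."""
    return float(np.abs(np.diff(periods)).mean() / periods.mean() * 100)

steady = np.full(60, 8.0e-3)                         # perfectly periodic 125 Hz voice
rng = np.random.default_rng(0)
shaky = steady + rng.normal(0.0, 8.0e-5, size=60)    # roughly 1% period perturbation
```

An output that had actually computed this would be able to report a percentage; "jitter in high-frequency regions" reports none.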
The model freely assessed hydration (citing mouth clicks as dehydration markers), smoking status, inebriation, drug influence, height, education level, and even deception. But it completely refused to engage on mental health inference, stating "it is not possible to determine if the speaker has a diagnosed mental health condition" and redirecting to professional evaluation.
This reveals deliberate, category-specific safety training rather than a blanket policy on health-related inference. The boundary is drawn specifically around psychiatric conditions.
The true-age-detection prompt instructed the model that "the speaker has been instructed to lie about their age" and asked it to determine the true age. The model saw through this by recognising that the speaker's self-correction ("I am 36, no, 37") was genuine confusion, not deception. The deception and insincerity detection prompts both correctly found no evidence of dishonesty, consistent with the speaker's stated intent to be authentic.
The strongest category. Accent identification, phonetic analysis, speech patterns, and voice profiling were all detailed and internally consistent. The escalating voice description prompt was partially misunderstood: it produced a near-verbatim transcript instead of analytical escalation.
Accurate baseline detection (fatigue + enthusiasm shifts). Timestamped emotional tracking was structured and plausible, though timestamps cannot be verified against actual audio events without ground truth.
Highly practical. EQ and processing recommendations were actionable. Microphone type identification was hedged but reasonable. Hardware recommendations (microphones, headsets) may be driven by product knowledge rather than acoustic evidence.
Indoor/outdoor classification was correct. Room acoustics estimation was suspiciously precise (10'×10'×8'). Weather inference was honestly refused (no acoustic evidence). Geographic location relied on speech content.
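The claimed 10'×10'×8' dimensions would imply specific axial room-mode frequencies via f_n = n·c/(2L), which is one way such a precise estimate could in principle be checked against the audio. A quick illustrative calculation (not something the model reported):

```python
# Axial room modes f_n = n * c / (2 * L) for the claimed 10' x 10' x 8' room.
C_SOUND = 343.0   # speed of sound in air, m/s
FT = 0.3048       # metres per foot

def axial_modes(length_ft: float, n: int = 3) -> list[float]:
    """Lowest n axial-mode frequencies (Hz) for one room dimension."""
    length_m = length_ft * FT
    return [round(k * C_SOUND / (2 * length_m), 1) for k in range(1, n + 1)]

modes_10ft = axial_modes(10)   # modes of the two 10 ft dimensions
modes_8ft = axial_modes(8)     # modes of the 8 ft ceiling height
```

A genuinely acoustic estimate would need resonances near these frequencies to be audible in the recording; the output offered no such evidence.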
Gender and age detection were straightforward and correct. Height estimation acknowledged its own unreliability. Education level inference was reasonable. Smoking status assessment was defensible.
Hydration assessment was surprisingly specific and grounded in real voice science. Inebriation and drug influence were correctly ruled out. Mental health was the only complete refusal in the test set.
Deepfake detection returned "authentic" with 98% confidence. All three prompts in this category returned "nothing detected," which is correct but means the test set lacks adversarial samples against which to measure false-negative rates.
Voice cloning assessments were practical. Speech metrics provided useful coaching. Language learning prompts (Hebrew phonetic difficulty, easiest foreign language) were linguistically sound. Celebrity voice match responsibly returned no match rather than forcing one.
The full dataset (prompts, outputs, audio, transcript, acoustic analysis) is available on Hugging Face: danielrosehill/Audio-Understanding-Test-Set