[P] Deezer showed CNN detection fails on compressed audio, here's a dual-engine approach that survives MP3
I've been working on detecting AI-generated music and ran into the same wall that Deezer's team documented in their paper: CNN-based detection on mel-spectrograms breaks when audio is compressed to MP3.
The problem: A ResNet18 trained on mel-spectrograms works well on WAV files, but real-world music is distributed as MP3/AAC. Compression destroys the subtle spectral artifacts the CNN relies on.
What actually worked: Instead of trying to make the CNN more robust, I added a second engine based on source separation (Demucs). The idea is simple:
- Separate a track into 4 stems (vocals, drums, bass, other)
- Re-mix them back together
- Measure the difference between original and reconstructed audio
For human-recorded music, stems bleed into each other during recording (room acoustics, mic crosstalk, etc.), so separation + reconstruction produces noticeable differences. For AI music, each stem is synthesized independently, so separation and reconstruction yield nearly identical results.
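To make the idea concrete, here's a minimal sketch of the residual measurement. The stems would come from Demucs in practice; that step is omitted here, and the function name and normalization are my own illustration, not the original implementation:

```python
import numpy as np

def reconstruction_residual(original: np.ndarray, stems: np.ndarray) -> float:
    """Re-mix separated stems and return the residual energy normalized
    by the original signal's energy. A near-zero residual means the stems
    sum back almost perfectly (typical when each stem was synthesized
    independently); a larger residual indicates bleed between stems
    (typical of real recordings with room acoustics and mic crosstalk).
    """
    remix = stems.sum(axis=0)            # naive re-mix: sum the 4 stems
    n = min(len(original), len(remix))   # guard against length mismatch
    diff = original[:n] - remix[:n]
    return float((diff ** 2).sum() / ((original[:n] ** 2).sum() + 1e-12))
```

The threshold that separates the two regimes would have to be tuned on labeled human/AI tracks.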
Results:
- Human false positive rate: ~1.1%
- AI detection rate: 80%+
- Works regardless of audio codec (MP3, AAC, OGG)
The CNN handles the easy cases (high-confidence predictions), and the reconstruction engine only kicks in when the CNN is uncertain. This saves compute, since source separation is expensive.
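The gating logic can be sketched like this. The threshold values and function names are hypothetical, and passing the residual as a callable keeps the expensive separation step lazy, so it only runs in the uncertain band:

```python
# Hypothetical thresholds; real values would be tuned on a validation set.
CNN_CONFIDENT = 0.9        # CNN probability above/below which we trust it outright
RESIDUAL_THRESHOLD = 1e-3  # residuals below this re-sum too cleanly: likely AI

def classify(cnn_prob_ai: float, residual_fn) -> str:
    """Two-stage decision: trust the CNN when it is confident, and only
    invoke the expensive source-separation check (residual_fn) otherwise."""
    if cnn_prob_ai >= CNN_CONFIDENT:
        return "ai"
    if cnn_prob_ai <= 1.0 - CNN_CONFIDENT:
        return "human"
    # Uncertain band: run separation + re-mix and inspect the residual.
    return "ai" if residual_fn() < RESIDUAL_THRESHOLD else "human"
```

Wrapping the residual computation in a zero-argument callable means Demucs never runs for the high-confidence majority of tracks.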
Limitations:
- Detection rate varies across different AI generators
- Demucs is non-deterministic, so borderline cases can flip between runs
- Only tested on music, not speech or sound effects
Curious if anyone has explored similar hybrid approaches, or has ideas for making the reconstruction analysis more robust.