[P] Deezer showed CNN detection fails on compressed audio, here's a dual-engine approach that survives MP3
I've been working on detecting AI-generated music and ran into the same wall that Deezer's team documented in their paper: CNN-based detection on mel-spectrograms breaks when audio is compressed to MP3.
The problem: A ResNet18 trained on mel-spectrograms works well on WAV files, but real-world music is distributed as MP3/AAC. Compression destroys the subtle spectral artifacts the CNN relies on.
What actually worked: Instead of trying to make the CNN more robust, I added a second engine based on source separation (Demucs). The idea is simple:
- Separate a track into 4 stems (vocals, drums, bass, other)
- Re-mix them back together
- Measure the difference between original and reconstructed audio
For human-recorded music, stems bleed into each other during recording (room acoustics, mic crosstalk, etc.), so separation + reconstruction produces noticeable differences. For AI music, each stem is synthesized independently, so separation and reconstruction yield nearly identical results.
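To make the idea concrete, here's a minimal sketch of the residual measurement. The stems would come from Demucs in practice; that step is omitted here, and the function name and normalization are my own illustration, not the original implementation:

```python
import numpy as np

def reconstruction_residual(original: np.ndarray, stems: np.ndarray) -> float:
    """Re-mix separated stems and return the residual energy normalized
    by the original signal's energy. A near-zero residual means the stems
    sum back almost perfectly (typical when each stem was synthesized
    independently); a larger residual indicates bleed between stems
    (typical of real recordings with room acoustics and mic crosstalk).
    """
    remix = stems.sum(axis=0)            # naive re-mix: sum the 4 stems
    n = min(len(original), len(remix))   # guard against length mismatch
    diff = original[:n] - remix[:n]
    return float((diff ** 2).sum() / ((original[:n] ** 2).sum() + 1e-12))
```

The threshold that separates the two regimes would have to be tuned on labeled human/AI tracks.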
Results:
- Human false positive rate: ~1.1%
- AI detection rate: 80%+
- Works regardless of audio codec (MP3, AAC, OGG)
The CNN handles the easy cases (high-confidence predictions), and the reconstruction engine only kicks in when the CNN is uncertain. This saves compute, since source separation is expensive.
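The gating logic can be sketched like this. The threshold values and function names are hypothetical, and passing the residual as a callable keeps the expensive separation step lazy, so it only runs in the uncertain band:

```python
# Hypothetical thresholds; real values would be tuned on a validation set.
CNN_CONFIDENT = 0.9        # CNN probability above/below which we trust it outright
RESIDUAL_THRESHOLD = 1e-3  # residuals below this re-sum too cleanly: likely AI

def classify(cnn_prob_ai: float, residual_fn) -> str:
    """Two-stage decision: trust the CNN when it is confident, and only
    invoke the expensive source-separation check (residual_fn) otherwise."""
    if cnn_prob_ai >= CNN_CONFIDENT:
        return "ai"
    if cnn_prob_ai <= 1.0 - CNN_CONFIDENT:
        return "human"
    # Uncertain band: run separation + re-mix and inspect the residual.
    return "ai" if residual_fn() < RESIDUAL_THRESHOLD else "human"
```

Wrapping the residual computation in a zero-argument callable means Demucs never runs for the high-confidence majority of tracks.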
Limitations:
- Detection rate varies across different AI generators
- Demucs is non-deterministic, so borderline cases can flip between runs
- Only tested on music, not speech or sound effects
Curious if anyone has explored similar hybrid approaches, or has ideas for making the reconstruction analysis more robust.