Discussion
Federation Bot
@Federation_Bot · 2 months ago

re: https://circumstances.run/@davidgerard/115568033033765688

detecting suno audio is actually much easier and more reliable than these heuristic rules - the spectrogram looks nothing like normal human vocals, and the differences are easy to spot. i did a thing at blackhat about this a few years ago: https://i.blackhat.com/USA-19/Wednesday/us-19-Williams-Detecting-Deep-Fakes-With-Mice-wp.pdf
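
if you want to poke at this yourself, here's a minimal sketch for pulling up a spectrogram with scipy and matplotlib (the "clip.wav" filename is just a placeholder for whatever audio you're inspecting):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, y = wavfile.read("clip.wav")   # placeholder path
if y.ndim > 1:
    y = y.mean(axis=1)             # mix stereo down to mono
f, t, S = spectrogram(y.astype(float), fs=sr, nperseg=1024, noverlap=768)
plt.pcolormesh(t, f, 10 * np.log10(S + 1e-12), shading="auto")  # dB scale
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.show()
```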

many of the acoustic artifacts that make the vocals sound "mushy," slurred, and discontinuous are reflections of how the generated audio is not conditioned on the biophysical constraints of the articulatory system: the audio does things that are physically impossible for the human articulatory system to do.

e.g. in the linked sample track "i glued my balls to my butthole," if you take a look at the spectrogram, you see that each phoneme is essentially disconnected from the surrounding phonemes. in the first pic, bottom panel, at the position marked by the red markers, you can see a sweep up and then a completely instantaneous shift to a flat spectrum, with the harmonics (the black horizontal bands) jumping to new frequencies on the order of milliseconds. when you speak or sing, you are moving the parts of your mouth and throat around, and so you can't make effectively instantaneous changes in where your talking parts are.
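
those millisecond-scale jumps are straightforward to quantify - a rough sketch (frame sizes are ballpark, not tuned):

```python
import numpy as np
from scipy.signal import stft

def spectral_jumps(y, sr):
    # frame-to-frame change in the log-magnitude spectrum; near-instantaneous
    # harmonic shifts like the ones described above show up as isolated tall spikes
    f, t, Z = stft(y, fs=sr, nperseg=512, noverlap=384)
    logmag = np.log1p(np.abs(Z))
    flux = np.abs(np.diff(logmag, axis=1)).sum(axis=0)
    return t[1:], flux
```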

there are a few big "strips" of high amplitude (or "formants") that correspond to the resonant frequencies of your sound tubes - the biggest one is usually your throat, and the second biggest one is your mouth. when you make a sound like "doc" you need to first open your mouth, tongue down so those two spaces are mostly connected, and then bring the middle of your tongue up so that the front of your mouth is a much smaller space (higher resonant frequency) than your throat. so you would expect to see big sweeps in the second formant as the "o" transitions to the "c" - you can see an example of that here: https://corpus.eduhk.hk/english_pronunciation/index.php/3-2-acoustic-aspects-of-consonants/
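
one standard way to track those formant sweeps is linear predictive coding (LPC) - a sketch of per-frame formant estimation (the model order and thresholds are rough guesses, not tuned values):

```python
import numpy as np
import librosa

def formants(frame, sr, order=12):
    # all-pole LPC fit: the pole angles approximate the vocal tract resonances
    a = librosa.lpc(frame.astype(float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]           # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)  # pole angle -> frequency in Hz
    return np.sort(freqs[freqs > 90])           # drop near-DC poles
```

in a real singing voice the low formants sweep continuously from frame to frame; in the suno clip they sit flat and then jump.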

however in the suno audio, the formants are effectively flat and then instantly teleport to the "c" sound.

the reason why it still sounds normal despite being acoustically impossible is that fooling us into not hearing the fine acoustic details of speech sounds and instead perceiving them as discrete categories (as phonemes) is the entire way that speech perception works - if we weren't fooled by this, there's no way we would be able to understand the wild acoustic variation between voices and between certain phonemes when flanked by different phonemes (e.g. your mouth needs to make a different movement in "doc" vs. "pod" and so the "o" part looks very different, but you still hear it as an "o").

so tl;dr - suno's spectrograms look like discontinuous chunks of flat harmonic stripes with weird alien dead spaces and teleportations between each phoneme chunk. normal human voices don't look like that.

@davidgerard

[Attached images (3): spectrogram and amplitude chart of part of a suno audio. it has a bunch of discontinuous phonemes cobbled together with jumps between each of them: rather than smoothly changing stripes, the stripes jump up and down in an instant, and they are clearly split from one another, where normal speech and singing is mostly continuous except for stops and consonant boundaries.]
sleepfreeparent
@sleepfreeparent@kolektiva.social replied · 2 months ago

@jonny @davidgerard that looks easy enough to automatically detect - maybe running it through a discrete cosine transform and looking for large discontinuities with regular periodicity
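
something like this, roughly (frame length and the scoring are just a first guess):

```python
import numpy as np
from scipy.fft import dct

def discontinuity_score(frames):
    # frames: (n_frames, frame_len) array of windowed audio
    c = dct(frames, axis=1, norm="ortho")           # per-frame DCT coefficients
    jumps = np.abs(np.diff(c, axis=0)).sum(axis=1)  # coefficient change between frames
    return jumps  # tall spikes at regular intervals would be the suspicious signature
```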

jonny (good kind)
@jonny@neuromatch.social replied · 2 months ago

If a mouth is like a 2cm cross-sectional cavity, then if I'm singing this tune I'd be causing localized sonic booms, blowing my skull to bits while instantly incinerating it by teleporting my tongue to the roof of my mouth

Theo :autism: :rainbow_flag:
@tehrealgh5tbusters@woof.tech replied · 2 months ago

@jonny @davidgerard This is the second time in a row that I've seen "I Glued my Balls to my Butthole" used as an example of how to spot AI-generated music, lol. (The first time was when I looked at Newgrounds' wiki article on how to spot AI music.)

zenkat
@zenkat@sfba.social replied · 2 months ago

@jonny @davidgerard Really interesting! Given these distinct differences in the spectrogram, I would think it would be relatively straightforward to train a binary classifier to distinguish human audio from AI audio. Like huggingface-tutorial, train-on-your-mac-m4 simple. I'm thinking a training set in the thousands would probably suffice for proof-of-concept.

It also seems like that classifier would have a LOT of utility for detecting all sorts of fake audio out in the wild.

Ooooh I lowkey kinda want to build this now ...
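
roughly what I'm imagining as a starting point (`paths` and `labels` stand in for a labeled human/AI corpus I don't actually have yet):

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def features(path):
    y, sr = librosa.load(path, sr=16000, mono=True)
    logm = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))
    flux = np.abs(np.diff(logm, axis=1)).mean(axis=1)  # frame-to-frame jumpiness
    return np.concatenate([logm.mean(axis=1), logm.std(axis=1), flux])

# paths: list of audio files; labels: 0 = human, 1 = AI (hypothetical corpus)
X = np.stack([features(p) for p in paths])
Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))
```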

jonny (good kind)
@jonny@neuromatch.social replied · 2 months ago

@zenkat
It's pretty hard! there are discontinuities in normal speech signals too (stops, consonant boundaries), and other problems in detection - there are lots and lots of attempts with mixed success. It's an arms race

zenkat
@zenkat@sfba.social replied · 2 months ago

@jonny I totally see the arms race. Once you have a decent classifier, well then you can just use it as the discriminator in your training pipeline. See DCGAN.

https://en.wikipedia.org/wiki/Generative_adversarial_network
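
i.e. the detector literally becomes the discriminator. Bare-bones sketch of that loop (G, D, and real_batches are hypothetical PyTorch stand-ins; this is just the shape of the arms race, not anyone's actual pipeline):

```python
import torch

bce = torch.nn.BCEWithLogitsLoss()
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)  # D: detector-as-discriminator
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)  # G: audio generator

for real in real_batches:
    n = real.size(0)
    fake = G(torch.randn(n, 128))  # 128-dim noise input, arbitrary choice
    # detector learns to separate real from generated
    d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # generator learns to fool the current detector
    g_loss = bce(D(fake), torch.ones(n, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```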

