re: https://circumstances.run/@davidgerard/115568033033765688
detecting suno audio is actually much easier and more reliable than these heuristic rules - the spectrogram looks nothing like normal human vocals, and the differences are easy to spot. i did a thing at blackhat about this a few years ago: https://i.blackhat.com/USA-19/Wednesday/us-19-Williams-Detecting-Deep-Fakes-With-Mice-wp.pdf
many of the acoustic artifacts that make the vocals sound "mushy," slurred, and discontinuous are reflections of how the generated audio is not conditioned on the biophysical constraints of the articulatory system: the audio does things that are physically impossible for the human articulatory system to do.
e.g. in the linked sample track "i glued my balls to my butthole," if you look at the spectrogram, you can see that each phoneme is essentially disconnected from the surrounding phonemes. in the first pic, bottom panel, between the red markers, there's a sweep up and then a completely instantaneous shift to a flat spectrum, with the harmonics (the black horizontal bands) jumping to new frequencies on a timescale of milliseconds. when you speak or sing, you are physically moving the parts of your mouth and throat around, so you can't make effectively instantaneous changes in where your talking parts are.
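to make the "teleporting harmonics" idea concrete, here's a toy sketch (not the method from the blackhat talk, and not run on the suno track). it synthesizes two made-up signals with the same start and end pitch, one gliding smoothly and one switching instantly, then measures the biggest frame-to-frame change in a crude autocorrelation pitch estimate. the sample rate, window sizes, test signals, and function names are all assumptions for illustration; real vocals would need a proper spectrogram or pitch tracker.

```python
import math

SR = 8000     # sample rate in Hz (toy value)
FRAME = 400   # 50 ms analysis window
HOP = 80      # 10 ms between frame starts

def synth(freq_at, seconds=1.0):
    """Three harmonics whose fundamental follows freq_at(t), phase-continuous."""
    out, phase = [], 0.0
    for n in range(int(SR * seconds)):
        phase += 2 * math.pi * freq_at(n / SR) / SR
        out.append(sum(math.sin(k * phase) / k for k in (1, 2, 3)))
    return out

def f0_autocorr(frame, fmin=100.0, fmax=400.0):
    """Crude pitch estimate: the lag with the largest autocorrelation."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(int(SR / fmax), int(SR / fmin) + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return SR / best_lag

def max_f0_step(signal):
    """Largest frame-to-frame change in estimated pitch, in Hz per hop."""
    starts = range(0, len(signal) - FRAME + 1, HOP)
    track = [f0_autocorr(signal[s:s + FRAME]) for s in starts]
    return max(abs(b - a) for a, b in zip(track, track[1:]))

glide = synth(lambda t: 150 + 100 * t)            # smooth sweep, 150 -> 250 Hz
jump = synth(lambda t: 150 if t < 0.5 else 250)   # same endpoints, instant switch
glide_step = max_f0_step(glide)
jump_step = max_f0_step(jump)
print(f"glide: {glide_step:.1f} Hz/hop, jump: {jump_step:.1f} Hz/hop")
```

the glide only ever moves a few Hz between adjacent 10 ms frames, while the instant switch shows up as one huge step. a real detector would have to pick a threshold for "too fast for a vocal tract," and this sketch doesn't try to guess one.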
there are a few big "strips" of high amplitude (called "formants") that correspond to the resonant frequencies of your sound tubes - the biggest one is usually your throat, and the second biggest one is your mouth. when you make a sound like "doc" you first need to open your mouth with your tongue down, so those two spaces are mostly connected, and then bring the middle of your tongue up so that the front of your mouth is a much smaller space (higher resonant frequency) than your throat. so you would expect to see big sweeps in the second formant as the "o" transitions to the "c" - you can see an example of that here: https://corpus.eduhk.hk/english_pronunciation/index.php/3-2-acoustic-aspects-of-consonants/
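the "smaller space, higher resonant frequency" part can be sketched with the textbook formula for a uniform tube closed at one end, f_n = (2n - 1) * c / (4 * L). this is a heavy simplification: a real vocal tract is not one uniform tube, and formants don't map cleanly onto a single cavity. the speed of sound and the lengths below are illustrative assumptions, not measurements.

```python
SPEED_OF_SOUND = 350.0  # m/s, rough value for warm humid air (assumed)

def tube_resonances(length_m, count=3):
    """Resonances of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * length_m)
            for n in range(1, count + 1)]

# a ~17 cm tube is the usual stand-in for a neutral adult vocal tract
print([round(f) for f in tube_resonances(0.17)])

# hypothetical cavity lengths: shrinking the cavity raises its lowest resonance
for length_cm in (9, 6, 4):
    lowest = tube_resonances(length_cm / 100, count=1)[0]
    print(f"{length_cm} cm -> {lowest:.0f} Hz")
```

because moving your tongue changes those lengths continuously, the resonances have to sweep through the values in between rather than hop from one to another.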
however in the suno audio, the formants are effectively flat and then instantly teleport to the "c" sound.
the reason it still sounds normal despite being acoustically impossible is that ignoring the fine acoustic details of speech sounds, and perceiving them as discrete categories (phonemes) instead, is the entire way that speech perception works. if we weren't "fooled" like this, there's no way we would be able to understand speech through the wild acoustic variation between voices, and between instances of the same phoneme flanked by different phonemes (e.g. your mouth needs to make a different movement in "doc" vs. "pod" and so the "o" part looks very different, but you still hear it as an "o").
so tl;dr - suno's spectrograms look like discontinuous chunks of flat harmonic stripes with weird alien dead spaces and teleportations between each phoneme chunk. normal human voices don't look like that.