R.A. Fisher wrote that the purpose of statisticians was "constructing a hypothetical infinite population of which the actual data are regarded as constituting a random sample" (p. 311). In "The Zeroth Problem," Colin Mallows wrote: "As Fisher pointed out, statisticians earn their living by using two basic tricks--they regard data as being realizations of random variables, and they assume that they know an appropriate specification for these random variables."
Some of the pathological beliefs we attribute to techbros were already present in this view of statistics that started forming over a century ago. Our writing is just data; the real, important object is the “hypothetical infinite population” reflected in a large language model, which at base is a random variable. Stable Diffusion, the image generator, is called that because it is based on latent diffusion models, which are a way of representing complicated distribution functions--the hypothetical infinite populations--of things like digital images. Your art is just data; it’s the latent diffusion model that’s the real deal. The entities that are able to identify the distribution functions (in this case tech companies) are the ones who should be rewarded, not the data generators (you and me).
So much of the dysfunction in today’s machine learning and AI points to how problematic it is to give statistical methods a privileged place that they don’t merit. We really ought to be calling out Fisher for his trickery and seeing it as such.
#AI #GenAI #GenerativeAI #LLM #StableDiffusion #statistics #StatisticalMethods #DiffusionModels #MachineLearning #ML
Excellent article on the dangers of dichotomisation of continuous variables
“Cake causes herpes?” - promiscuous dichotomisation induces false positives
https://link.springer.com/article/10.1186/s12874-025-02712-0
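A quick way to see why hunting for a cutpoint is risky: the simulation below is only a rough sketch, not the article's own analysis. It assumes two independent normal variables, splits one of them at several candidate percentiles, and keeps the smallest p-value. The function names, sample size, cutpoints, and the normal-approximation test are all illustrative choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def mean_var(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a, var_a = mean_var(a)
    mean_b, var_b = mean_var(b)
    z = (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def false_positive_rate(cut_percents, n=200, trials=2000, alpha=0.05, seed=1):
    """Share of null datasets where at least one cutpoint looks 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # x and y are independent, so any "significant" result is a false positive.
        pairs = sorted((rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n))
        ys = [y for _, y in pairs]  # y values ordered by x
        # Dichotomise x at each candidate percentile and keep the best p-value.
        p_min = min(
            two_sample_p(ys[: n * p // 100], ys[n * p // 100 :])
            for p in cut_percents
        )
        hits += p_min < alpha
    return hits / trials

single = false_positive_rate([50])                       # one prespecified median split
many = false_positive_rate([20, 30, 40, 50, 60, 70, 80]) # pick the "best" of seven splits
print(f"single cutpoint: {single:.3f}, best of seven: {many:.3f}")
```

Both calls use the same seed and therefore the same simulated datasets, so the second rate can never be lower than the first; the gap between them is the cost of trying several splits and reporting the best one.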
New series of social media posts on handling missing values:
- Variability Problems (432 Likes): https://www.linkedin.com/posts/joachim-schork_datasciencecourse-rstudio-visualanalytics-activity-7395018496281247745-bthL/
- Imputation Methods Compared (376 Likes): https://x.com/JoachimSchork/status/1989242911456674159
- Random Forest Imputation (523 Likes): https://www.facebook.com/groups/statisticsglobe/posts/1769123507100302/
Weighting an Average to Minimize Variance
https://www.johndcook.com/blog/2025/11/12/minimum-variance/
#HackerNews #Weighting #Average #Variance #Minimization #Statistics #DataScience
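For readers who have not met the idea in the title: the sketch below shows the standard inverse-variance result for combining independent, unbiased estimates. It is not taken from the linked post, which may treat a different or more general setup (for example, correlated variables); the function names and the example variances are made up for illustration.

```python
def min_variance_weights(variances):
    """Weights proportional to 1/variance, normalised to sum to 1.

    Assumes the estimates are independent and unbiased.
    """
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]

def combined_variance(weights, variances):
    """Variance of sum(w_i * X_i) for independent X_i."""
    return sum(w * w * v for w, v in zip(weights, variances))

variances = [1.0, 4.0]                   # hypothetical: second estimate is noisier
weights = min_variance_weights(variances)
equal = [0.5, 0.5]

print("weights:", weights)
print("variance with optimal weights:", combined_variance(weights, variances))
print("variance with equal weights:  ", combined_variance(equal, variances))
```

The noisier estimate still gets some weight, just less of it, and the combined variance ends up below the variance of the better estimate on its own.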
Soon to open (mid-November)
https://statml.peercommunityin.org/
To keep in mind.
#ML #statistics #OpenScience
Map of the ratio between the total number of farm animals and the number of inhabitants per municipality in the Netherlands based on data of Statistics Netherlands (CBS).
#30daymapchallenge day 3: polygons #maps #thenetherlands #statistics #cartography
Completely taken by the 2025 Local Deprivation maps on gov.uk:
https://deprivation.communities.gov.uk/maps
These maps are painting such a sad & harsh picture of the country (well, at least of England)! Been spending the past hour or so checking out various places I've been to, worked in and lived in. For some of them, the maps confirm that nothing seems to have changed (for the better) in 15+ years. E.g. Central Sunderland[1] is still 99% more deprived than the rest of England, and many neighborhoods in the larger cities are faring pretty badly too. Austerity visualized. Would be great if there were a time travel option to compare these maps before and after Brexit...
[1] Back in 2010, Sunderland was an eye-opening experience for me (of a VERY different England than I knew from the south), when I was there a few times developing & installing the generative and agent & sensor-driven animation/software for the 140 meter long LED wall on Platform 5 at the main train station. It seems the project might still be running after all these years (never had a single support call), and it's still the first image on Google Maps... 🙌
https://maps.app.goo.gl/NyQdf5ghkosLYadi7
[2] I always admired the UK datastore, the amount of available national data and the related mapping/infoviz culture. I built quite a few projects with their data myself, but have yet to find anything comparable for Germany. https://www.govdata.de just seems a lot more disparate and limited...
Happy Halloween to all those who celebrate, but especially to my fellow statisticians