I'm wondering: If a published paper mentions that a study was approved by the ethics committee of University X, number XXXX, is there a way to find (and read) this ethics approval?
How do we make sense of the body of scientific literature that is growing explosively to the point where no individual could read all the relevant papers, and is contaminated with fraudulent and LLM-generated papers? I think that science is not currently equipped to deal with this, and we need to be.
I think a critical part will be post-publication peer review. With such rapid growth and time pressure on scientists, pre-publication peer review cannot maintain sufficient standards, so we need a way to make the informal networked review (journal clubs, conference chats, etc.) more transparent and shared.
We also need ways to summarise and make connections between many related papers. I know that many people are hoping that LLMs will step up into this role, but right now that seems very risky and I don't see that changing any time soon.
LLMs are too distracted by surface level presentation, and can be manipulated at scale by iterating over multiple presentations until the LLM summarises it in the way you want it to. In addition, they're known to have problematic biases, and it's unclear if this can be fixed.
I think we need to be experimenting with ways to distribute the work of summarising and making connections between papers, and aggregating that into a collective understanding. An individual can't read all the papers, but collectively we can and already are. We just need ways to integrate that.
In principle we could do this with a post-publication peer review system that allows reviewers to annotate with connections, e.g. in reviewing paper X you create a formal link saying that part of this paper is similar to paper Y, or uses the same technique, etc.
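To make that concrete, here's a minimal sketch of what such a typed link between papers could look like (all names, identifiers and relation types are hypothetical, not an existing system):

```python
# Minimal sketch of a typed annotation linking two papers in a
# post-publication review system. All names are hypothetical.
from dataclasses import dataclass
from enum import Enum

class Relation(Enum):
    SIMILAR_RESULT = "similar result"
    SAME_TECHNIQUE = "same technique"
    CONTRADICTS = "contradicts"
    REPLICATES = "replicates"

@dataclass
class Annotation:
    source_id: str      # the paper being reviewed (paper X)
    target_id: str      # the paper it is linked to (paper Y)
    relation: Relation  # the kind of connection the reviewer asserts
    reviewer: str       # who made the annotation
    note: str = ""      # free-text justification

# Example: while reviewing paper X, record that it uses the same technique as paper Y.
link = Annotation("doi:10.xxxx/paperX", "doi:10.yyyy/paperY",
                  Relation.SAME_TECHNIQUE, "reviewer-42",
                  "Both rely on the same analysis pipeline.")
```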
One issue is that these annotations might become corrupted or manipulated in the same way that papers, journals and peer review have been. How do we fix that? It's not ideal, but one option might be some form of trust network: I trust X, they trust Y, and therefore I (to a lesser extent) trust Y.
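A minimal sketch of how that kind of transitive, decaying trust could be computed (the names, weights and decay rule are all made up, just to illustrate the idea):

```python
# Minimal sketch: trust in someone I don't know directly is my trust in an
# intermediary multiplied by their trust onward, taking the best path.
# All names and weights are made up.
direct_trust = {
    "me": {"X": 0.9},
    "X":  {"Y": 0.8},
    "Y":  {"Z": 0.5},
}

def trust(source: str, target: str, visited=frozenset()) -> float:
    """Best-path trust from source to target, decaying multiplicatively per hop."""
    if source == target:
        return 1.0
    edges = direct_trust.get(source, {})
    if target in edges:
        return edges[target]
    best = 0.0
    for mid, weight in edges.items():
        if mid not in visited:
            best = max(best, weight * trust(mid, target, visited | {source}))
    return best

print(trust("me", "Y"))  # 0.9 * 0.8 = 0.72: trusted, but less than X is
print(trust("me", "Z"))  # 0.9 * 0.8 * 0.5 = 0.36: trust keeps decaying with distance
```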
This would mean our summary or evaluation of the literature would depend on our individual trust network. But, this isn't a bad thing in principle. Diversity of opinion is good: there shouldn't be one definitive reference summary because that's a single point of failure and target for manipulation.
All these ideas require experimentation, and both technology development and a cultural shift towards collectively taking responsibility for doing this work. I think we need to do it and would love to hear others' ideas about how to do it and how to convince everyone that we need to.
Have you checked out our #OpenScience Bites #podcast yet? Stay tuned!
Very soon, we'll be talking with Felipe Romero, Assistant Professor at the Faculty of #Philosophy. From a #MetaScience perspective, we'll dive into the concepts of #reproducibility and #replicability, and why they matter for #research practice.
We'll let you know as soon as the new episode is online!
In the meantime, feel free to listen to our earlier episodes:
Lottery before peer review is associated with increased female representation and reduced estimated economic cost in a German funding line https://www.nature.com/articles/s41467-025-65660-9
New from @rmrahal and colleagues
Funder was the German Foundation for Innovation in Higher Education (Stiftung Innovation in der Hochschullehre). "Lottery-first" meant that short expressions of interest (EoIs) were submitted and randomisation was used to select those allowed to apply
@martonkovacs.bsky.social is beavering away implementing something new for https://tenzing.club. We hope to make it easier for you to manage your collaborators, by also providing for acknowledgees (people you acknowledge but who aren't co-authors)!
By providing a place to write down information about the people you should acknowledge, rather than just your co-authors, you will be less likely, come submission time, to forget to include them in the acknowledgments. #metascience
Tenzing will also gently encourage best practices for Acknowledgments sections, by modeling them in its outputs.
We are interested in talking to groups that are frequently not acknowledged appropriately in papers (like statistical consulting units or library professionals) about how this might help.
MetaROR, a platform for reviews of research on research, is a success! We have published 24 sets of reviews and have 16 submissions in process. MetaROR now has 9 partners - these are journals that agree to use our reviews when authors submit to them. https://metaror.org/ #metascience #openaccess
Chief editors Kathryn Zeiler and @LudoWaltman with André Brasil and JS van Honk have worked hard to overcome many teething issues, together with the Coko team! #PublishReviewCurate
OK, time for a Mastodon peer review of this preprint!
It's interesting to see this paper lay out its assumptions so clearly. For me, the paper starts from a big one that I think is fundamentally wrong: that proposals have an intrinsic merit that reviewers assess with some error. The accuracy measure in the paper is essentially the probability that the proposals ranked as fundable according to the noisy reviews match the ranking according to the intrinsic merit. This then feels to me like an argument for the current system based on the assumption that the current system is fundamentally correct.
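My reading of that accuracy measure, as a minimal simulation sketch (my own rendering, not the authors' code, and all the numbers are arbitrary):

```python
# Minimal sketch of the kind of accuracy measure described: score = intrinsic
# merit + reviewer noise, fund the top k, and measure how well the funded set
# matches the top k by intrinsic merit. My rendering, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
n_proposals, k, noise_sd, n_sims = 100, 20, 1.0, 2000

overlap = 0.0
for _ in range(n_sims):
    merit = rng.normal(size=n_proposals)                            # assumed 'intrinsic merit'
    scores = merit + rng.normal(scale=noise_sd, size=n_proposals)   # noisy review scores
    funded_by_score = set(np.argsort(scores)[-k:])
    funded_by_merit = set(np.argsort(merit)[-k:])
    overlap += len(funded_by_score & funded_by_merit) / k

print(f"Mean overlap between funded set and 'true' top {k}: {overlap / n_sims:.2f}")
```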
Let's say that the goal is a portfolio of funding decisions that maximises the chance of a set of important discoveries. For proposals to have an intrinsic, linearly ordered merit, it would have to be possible to predict, from a specification of a set of experiments but without actually carrying them out, the probability of making an important discovery. But this can't be true, or there would be no reason to actually do the experiments.
There could be some 'signal' in the sense that some proposals might be based on clear factual errors that the authors were not aware of. This would suggest a threshold funding model: as long as proposals meet a minimum threshold score from reviewers assessing methodological soundness, they are given an equal chance of funding. In this model, once you're over the threshold, the reviewer score is ONLY bias, because there is no signal above the threshold. This model would likely lead to very different conclusions than the model in this paper. The authors describe this in a footnote as being "too nihilistic" to consider.
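For contrast, the threshold-plus-lottery model is simple to state in code (the scores, threshold and award count here are hypothetical):

```python
# Minimal sketch of the threshold-plus-lottery model: any proposal whose
# soundness score clears the bar gets an equal chance; rank is ignored.
# Scores, threshold and award count are hypothetical.
import random

random.seed(1)
soundness = {"A": 4.5, "B": 3.9, "C": 2.1, "D": 4.8, "E": 3.2}  # mean reviewer scores
threshold, n_awards = 3.0, 2

eligible = [p for p, s in soundness.items() if s >= threshold]
funded = random.sample(eligible, n_awards)  # above the bar, selection is pure lottery
print(funded)
```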
Another alternative 'signal' would be the probability of producing results that the scientific community would find interesting. There is likely a valid case that review scores, to some extent, measure this. However, I'm not sure that a decision process based on ranking these probabilities would lead to good outcomes, even if there were no noise. Take an extreme case where you have enough funding for 5 proposals, and you get 5 identical proposals that, if successful, would make (the same) discovery that 50% of the field would be interested in, and 5 different proposals, each of which would make a discovery that 10% of the field would be interested in. Do you want to fund 5 copies of the same proposal, or would you instead maybe fund 2 copies of the same proposal (in case one fails for some reason) and 3 different proposals? If you were funding based on intrinsic merit rank, you'd fund 5 copies of the same proposal.
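Putting rough numbers on that (the per-project success probability is an assumption I'm adding, not from the paper):

```python
# Worked version of the portfolio example. 'Interest' = fraction of the field
# that cares about a discovery; duplicate proposals only count once if any
# copy succeeds. The per-project success probability s is an assumed number.
s = 0.6  # assumed chance that a funded project delivers its discovery

def expected_interest(copies_of_big: int, n_small: int) -> float:
    big = 0.5 * (1 - (1 - s) ** copies_of_big)  # the shared 50%-interest discovery
    small = n_small * 0.1 * s                   # independent 10%-interest discoveries
    return big + small

print(expected_interest(5, 0))  # rank-by-merit portfolio: 5 copies of the same proposal (~0.49)
print(expected_interest(2, 3))  # mixed portfolio: 2 copies + 3 different proposals (~0.60)
```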
I personally think it's likely that this (funding straight down the merit ranking, duplication and all) is actually what funding bodies are doing.
Going back to the question of bias, there's an interesting interaction here. If the intrinsic 'merit' that you want to reward is based on what the field would find interesting, then 'bias' is no longer statistically independent of 'merit'. In other words, by giving a high merit to things that more people think are interesting, you are inherently saying that if minority groups are interested in something else, it's because what they are interested in is objectively worse. It is building in a biased assumption.
A very interesting suggestion that has been made (and indeed I think tried in some places) is to rank proposals at least partly based on reviewer variance rather than mean. In other words, the assumption is that if a bunch of reviewers disagree violently about a proposal, it's likely to be more important than if they all agree. I suspect this approach is also too simplistic, but it seems likely to me that there's an element of this that is right. The assumptions and modelling framework of this paper rule out the possibility of this being a good procedure. This is another indication to me that the proposed model is not a good one.
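One simple way to operationalise that idea (the review scores and the weight on disagreement are made up for illustration):

```python
# Minimal sketch of ranking proposals partly by reviewer disagreement:
# rank score = mean + lam * standard deviation. Scores and lam are made up.
import statistics

reviews = {
    "safe bet":      [4.0, 4.1, 3.9],  # reviewers all agree it's solid
    "controversial": [5.0, 1.5, 4.8],  # reviewers disagree violently
}
lam = 0.5  # how much weight to put on disagreement

for name, scores in reviews.items():
    rank_score = statistics.mean(scores) + lam * statistics.stdev(scores)
    print(name, round(rank_score, 2))  # the controversial proposal now ranks higher
```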
So, I'm happy this paper is out there making these assumptions explicit, but I think those assumptions are very problematic, likely wrong, and I hope that UKRI do not base their policies on its conclusions.
Reliability, bias and randomisation in peer review: a simulation
https://osf.io/preprints/socarxiv/4gqce_v1
new preprint from UKRI #Metascience Unit
Defending science in public, we often talk about 'peer reviewed science'. But could this framing contribute to undermining trust in science and holding us back from improving the scientific process? How about instead we talk about the work that has received the most thorough and transparent scrutiny?
Peer review goes a step towards this in having a couple of people scrutinise the work, but there are limits on how thorough it can be and in most journals it's not transparent. Switching the framing to transparent scrutiny allows us to experiment with other models with a path to improvement.
For example, making review open to all, ongoing, and all reviews published improves this. When authors make their raw data and code open, it improves this.
It also gives us a way to criticise problematic organisations that formally do peer review but add little value (e.g. predatory journals). If their reviews are not open, or are open but observably of poor quality, then they score poorly on thorough and transparent scrutiny.
So with this framing the existence of 'peer reviewed' but clearly poor quality work doesn't undermine trust in science as a whole because we don't pin our meaning and value on an exploitable binary measure of 'peer reviewed'.
It also offers a hopeful way forward because it shows us how we can improve, and every step towards this becomes meaningful. If all we have is binary 'peer reviewed' or not, why spend more effort doing it better?
In summary, I think this new framing would be better for science, both in terms of the public perception of it, and for us as scientists.