The role of research infrastructures in shaping scientific activity is becoming increasingly significant. However, their broader impacts remain underexplored. While their scientific merit is widely acknowledged, secondary outputs, such as open-access datasets and computational resources, as well as their influence on downstream research, are still not well understood.
A recent study published in the CERN IdeaSquare Journal of Experimental Innovation (CIJ) explores this gap through the case of the AlphaFold Database (AFDB), an open-access resource resulting from a collaboration between DeepMind and the European Molecular Biology Laboratory (EMBL). This database enables widespread access to protein structure data predicted by AlphaFold, a machine learning model that has gained recognition for its capabilities in structural biology.
The paper, titled “The Impact of Research Data Infrastructures: The Case of the AlphaFold Database”, was published as part of the Compute Impact project. It uses bibliometric analysis to compare the influence of two related but distinct scientific outputs: the original AlphaFold paper, which presented the core algorithmic breakthrough, and the AlphaFold Database, which disseminates its predictions.
By analysing a dataset of publications indexed in the Web of Science Core Collection, the authors assess how each output has shaped research themes, collaboration patterns, and scientific impact. The analysis includes a matched comparison of 659 articles citing only the database with an equal number of articles citing only the original AlphaFold paper.
The findings show that AFDB is more frequently cited in research that applies structural biology to areas such as drug discovery and disease mechanisms, while the original paper is referenced more in studies focused on algorithmic and methodological aspects. This distinction underscores the database’s role in facilitating downstream applications of protein structure data, sparing researchers much of the computational cost of generating these predictions themselves.
Regarding collaboration patterns, the study finds that articles citing the AFDB tend to involve fewer institutions, suggesting that access to pre-computed data may reduce the need for broad institutional partnerships. This may indicate broader inclusion of resource-limited institutions in research traditionally constrained by technical barriers. However, the number of authors per paper remains similar between the two groups. In terms of scientific impact, as measured by citation counts, no significant differences were observed between the two types of articles.
The study concludes that research data infrastructures like AFDB not only complement primary scientific breakthroughs but also reshape research directions and broaden institutional involvement. It calls for more nuanced, long-term evaluations of such infrastructures to fully capture their value and inform science policy and funding strategies.
“It’s a truly exciting time in science right now. The incredible leaps we’re seeing with open-access data and AI models are transforming how we do research, giving us insights we couldn’t have even dreamed of before. For those of us in the social sciences and the business school, it’s the perfect time to jump in, conduct more holistic studies, and spark new collaborations to really maximise the impact of these advancements,” commented Angelo Romasanta, Assistant Professor at Esade and part of the Compute Impact research team.
For more information about the study or to access the full paper, please visit here.
Discover more about the Compute Impact project here.
Published paper:
Romasanta, A. K., Wareham, J., & Pujol Priego, L. (2025). The Impact of Research Data Infrastructures: The Case of the AlphaFold Database. CERN IdeaSquare Journal of Experimental Innovation, 9(1), 42–48.