Article 100 Million:
What Do We Actually Need These Papers For Now?
Arif E. Jinha & [Co-Author]
Working draft for co-author review — April 2026
Proposed venue: Learned Publishing / JASIST
Follows from: Jinha, A. E. (2010). Article 50 million: An estimate of the number of scholarly articles in existence. Learned Publishing, 23(3), 258–263.
“Attention is all you need.” — Vaswani et al. (2017)
“Textbooks are all you need.” — Gunasekar et al. (2023)
“What do we actually need these papers for now?” — Jinha & [Co-Author] (2026)
These three titles, spanning nine years, constitute a single accelerating argument. The first built the architecture that made the second possible. The second broke the assumption that more data always means more intelligence. Together they make the third question urgent: if a 1.3 billion parameter model trained on textbook-quality data outperforms a 70 billion parameter model trained on web noise, what exactly are we doing with 100 million papers, most of them locked, most of them unread, most of them generated under incentive structures that reward volume over truth?
Abstract
In 2010, Jinha estimated the cumulative total of peer-reviewed scholarly articles at approximately 50 million (Learned Publishing, 23(3), 258–263). Current publication rates — exceeding 3 million articles per year — place the global corpus at approximately 90–105 million articles as of 2025–2026. This paper presents an updated estimate and uses the doubling of the scholarly record as the occasion for a harder question: what is it actually for?
We argue that the scholarly publishing system is in simultaneous crisis across four dimensions: it is structurally saturated, commercially captured, epistemically distorted, and radically inaccessible to those who need it most. These are not peripheral inefficiencies. At a moment when civilization is navigating the most consequential decades in recorded history — ecological tipping points, cascading geopolitical shocks, accelerating AI deployment — the inaccessibility and internal incoherence of the accumulated record of human inquiry is a civilizational liability, not an academic housekeeping matter.
We invoke the concept of qiyamah — from the Arabic/Ismaili tradition, meaning simultaneously ‘apocalypse,’ ‘unveiling,’ and ‘resurrection’: an ending that is a beginning — to frame what the publishing system now requires. The 100 million article milestone is not a triumph of scholarship. It is a threshold.
We propose a three-pillar response: (1) Universal Open Access as ground condition, realizing Harnad's vision of every paper available to every researcher, free, forever; (2) AI-enabled knowledge architecture — citation genealogy tracing, error propagation detection, suppressed research identification, and theory evolution mapping across the full open corpus; and (3) the Philosophia Animatrix: an AI knowledge system that trains on the curated open scholarly record and uses self-referential recursion to improve both the corpus and its own training efficiency simultaneously — a convergence of the ‘textbook quality’ insight with the efficiency breakthroughs of sparse-architecture AI, toward a system of maximum epistemic density at minimum energy cost.
The title of this paper places itself deliberately in dialogue with two of the most-cited papers in the history of artificial intelligence: ‘Attention Is All You Need’ (Vaswani et al., 2017) and ‘Textbooks Are All You Need’ (Gunasekar et al., 2023). Both papers demonstrate that architectural elegance and data quality, respectively, matter more than brute scale. We extend that argument to the scholarly publishing system itself: what the record of human knowledge needs is not more papers. It needs to become what it has always claimed to be — a navigable, honest, open, living commons of inquiry.
Keywords: open access, scholarly publishing, knowledge graphs, citation genealogy, AI epistemology, Philosophia Animatrix, qiyamah, article count, peer review reform, energy efficiency, Textbooks Are All You Need, sparse AI architectures, epistemic commons
I. Citation Lineage Strategy — Positioning Note for Co-Author
This section is a drafting note and will not appear in the final submission. It documents the deliberate citation architecture of the paper.
The ‘All You Need’ Lineage
Two papers form the citation spine of this argument and are cited in the epigraph and throughout the text. Both are among the most-cited papers in the history of artificial intelligence research. Citing them prominently achieves multiple strategic goals:
- It positions the paper within the highest-traffic citation networks in current AI/ML discourse, maximizing discoverability for readers arriving through those networks.
- It makes the argument legible to a readership that includes computer scientists and AI researchers, not only library/information science scholars — expanding the potential audience for the Philosophia Animatrix proposal.
- It activates the rhetorical power of the ‘X is all you need’ formula by explicitly completing and subverting it — the title’s question is only fully audible to a reader who recognizes what it is answering back to.
- It creates a forward citation lineage: papers citing Vaswani (2017) or Gunasekar (2023) that also cite this paper create a traceable line connecting AI architecture, training data quality, and the open scholarly record — exactly the genealogy the Philosophia Animatrix is designed to make visible.
Secondary Lineage: The Founding Open Access Documents
Harnad (1994), the Budapest Open Access Initiative (2002), and Suber (2012) provide the OA lineage. These are canonical citations in the field and signal to reviewers and editors that the paper is grounded in the OA literature’s own history, not arriving from outside it.
Anchor Empirical Citations
Two empirical papers are load-bearing anchors for the diagnostic sections and should be cited early and specifically:
- Greenberg (2009, BMJ) — the citation distortion case study. This is the single best empirical demonstration of the error-propagation problem. It should be described in enough detail that readers who have not encountered it understand exactly what it shows.
- Ioannidis (2005, PLOS Medicine) — the mathematical argument that most published research findings are false under current incentive structures. Together with Greenberg, this forms the evidentiary core of Section III’s diagnosis.
The Jinha (2010) Self-Citation
The foundational self-citation to Article 50 Million is not merely pro forma — it is the explicit continuity claim of the paper. The framing should be: this paper is a return to the same question sixteen years later, with the number doubled and the stakes transformed. The original paper asked ‘how many?’ This paper asks ‘so what?’
II. The Argument: Seven Movements
1. The Count — Toward 100 Million
In 2010, this paper’s lead author estimated the total cumulative count of peer-reviewed scholarly articles at approximately 50 million, tracing publication rates back to the first scientific journals in 1665 (Philosophical Transactions; Journal des Sçavans). The methodology combined historical growth curves with contemporary publication data to produce a figure that was, at the time, larger than most in the field had intuited.
The figure was not celebrated. It was a provocation. The point was not that scholarship was thriving. The point was: we had produced an enormous pile of papers, and almost no one could navigate it, most people could not access it, and no one had a clear picture of its actual aggregate shape.
Sixteen years later, the pile has roughly doubled. Based on current publication rates of approximately 3–3.5 million peer-reviewed articles per year (Bornmann & Mutz, 2015; Crossref data, 2024), and accounting for the acceleration driven by open-access mandates, predatory journal proliferation, and publish-or-perish institutional pressure, the corpus now stands at approximately 90–105 million articles, depending on inclusion criteria. If preprint servers (arXiv, bioRxiv, SSRN, OSF) are included, the figure is higher still — a parallel shadow corpus of 5–10 million works largely excluded from traditional estimates.
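The projection is reproducible with back-of-envelope arithmetic. The sketch below assumes the 2010 baseline of 50 million, an annual output of roughly 2 million articles in 2010, and a growth rate chosen so that annual output lands in the 3–3.5 million range by 2025–2026; all three inputs are illustrative assumptions, not measurements.

```python
# Back-of-envelope reproduction of the 90-105M estimate in Section 1.
# Assumed inputs (illustrative): a 2010 cumulative baseline of 50M
# (Jinha, 2010), annual output of ~2.0M in 2010, and a growth rate
# chosen so output reaches the reported 3-3.5M/yr range by 2025-2026.

BASELINE_2010 = 50_000_000
OUTPUT_2010 = 2_000_000     # assumed annual output in 2010
GROWTH = 0.035              # assumed annual growth rate (~3.5%)

total = BASELINE_2010
output = OUTPUT_2010
for year in range(2011, 2027):   # articles added in 2011..2026
    output *= 1 + GROWTH
    total += output

print(f"projected annual output, 2026: {output / 1e6:.2f}M")
print(f"projected cumulative corpus:   {total / 1e6:.0f}M")
```

Under these assumptions the cumulative corpus lands at roughly 93 million; varying the assumed growth rate between 3% and 5% moves the total within, not outside, the 90–105 million band, which is why the estimate is presented as an order-of-magnitude claim.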
A doubling in sixteen years. And the question that was already uncomfortable in 2010 is now urgent: what, exactly, is this pile of papers for?
2. The Diagnosis — Four Simultaneous Failures
2a. Saturation and Signal Collapse
The volume of scholarly publishing has long since crossed the threshold where any individual researcher can maintain meaningful awareness of their own field, let alone adjacent ones. This is not merely a practical inconvenience. It is an epistemic catastrophe. Research that cannot be found, read, or integrated is not knowledge — it is archived noise.
The incentive structure that produced the saturation is well-understood: institutional evaluation systems that reward paper count over insight, funding bodies that assess productivity through publication metrics, and career ladders that are climbed by publishing derivative work rather than significant work. The result is a corpus in which the load-bearing papers — the ones on which actual knowledge rests — are buried under layers of salami-sliced variants, incremental extensions, and sophisticated-sounding repetitions of what was already known.
2b. Commercial Capture
Five publishers — Elsevier, Springer Nature, Wiley, Taylor & Francis, and SAGE — account for approximately half of all articles published across disciplines, with profit margins of 30–40% extracted from research that is almost entirely publicly funded (Larivière et al., 2015). The transformation of open-access mandates into Article Processing Charge (APC) models has replicated the inequity of subscription access, merely shifting the extraction point from reader to author. A researcher at a well-funded institution in the Global North can now publish open access. A researcher at an underfunded institution in the Global South often cannot afford to. The asymmetry is structural.
Diamond Open Access — models with no fees on either side, funded as public infrastructure — remains marginalized. The market has not failed here. The market has succeeded. It has successfully converted a public good into a private rent, and the ‘open access revolution’ has largely been a vehicle for extending that rent into new territory.
2c. Epistemic Distortion and the Citation Chain Problem
Ioannidis (2005) demonstrated mathematically that, under prevailing conditions of small sample sizes, researcher degrees of freedom, selective reporting, and publication bias, the majority of published research findings are likely to be false. This is not a fringe claim. It is a mathematical consequence of the incentive structure. It has been extensively replicated: the Open Science Collaboration (2015) found that fewer than 40% of 100 psychology studies reproduced successfully.
But the problem is worse than non-replication. Greenberg (2009) traced a specific false claim — that beta-amyloid inhibits muscle nicotinic acetylcholine receptors — as it traveled through a citation network. The original claim had no empirical basis. Through successive citation, selective quotation, and the natural human tendency to trust what has already been cited, it accreted into canonical fact. Papers cited papers that cited papers that cited nothing real. The citation chain had become the evidence.
This is not a pathology. It is the normal operation of a system built on citation as social currency rather than citation as epistemic warrant. At 100 million papers, the propagation of error through citation chains is not a manageable quality-control problem. It is a structural feature of the corpus.
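The Greenberg pattern can be made concrete as a provenance check over a citation graph: does any citation chain supporting a claim terminate in primary evidence, or do all chains bottom out in assertion? The graph below is invented for illustration; the node names and primary-evidence flags are assumptions, not data from Greenberg's study.

```python
# Toy illustration of the Greenberg (2009) pattern: a claim whose
# citation chains never reach primary evidence. All node names and
# flags here are invented for illustration.

# paper -> (reports_primary_evidence, papers cited in support of the claim)
corpus = {
    "review_C":  (False, ["review_B"]),
    "review_B":  (False, ["review_A"]),
    "review_A":  (False, ["opinion_0"]),
    "opinion_0": (False, []),           # asserts the claim, cites no data
    "solid_D":   (False, ["exp_1"]),
    "exp_1":     (True,  []),           # reports an actual experiment
}

def has_empirical_root(paper, seen=None):
    """Does any citation chain from `paper` reach primary evidence?"""
    seen = set() if seen is None else seen
    if paper in seen:                   # guard against citation cycles
        return False
    seen.add(paper)
    primary, cites = corpus[paper]
    return primary or any(has_empirical_root(c, seen) for c in cites)

print(has_empirical_root("review_C"))  # False: every chain ends in assertion
print(has_empirical_root("solid_D"))   # True: grounded in exp_1
```

The check is trivial on six nodes; the point of Pillar II is that nothing in the current publishing system runs it at the scale of 100 million, where the chains are long enough that no human reader ever walks them to the bottom.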
2d. Access Inequality
The global distribution of access to the scholarly record maps almost perfectly onto the global distribution of wealth. Researchers and practitioners in the Global South, independent scholars, community organizations, journalists, policymakers, and the general public are largely excluded from the record of human inquiry that their taxes, in many cases, directly funded. The irony is precise: the problems for which knowledge access is most urgently needed — climate adaptation, disease response, food security, ecological management — are concentrated in the populations with the least access to the knowledge that might address them.
3. The Civilizational Frame — Why This Is Not an Academic Question
We are writing in 2026. Assessed against the IPCC's Shared Socioeconomic Pathways framework, current global policy trajectories point toward roughly 2.7°C of warming by 2100 (SSP2-4.5 at best), with tipping points — West Antarctic ice sheet, Greenland, AMOC disruption, permafrost thaw — already in motion. The window for meaningful intervention is the current decade.
Climate science, epidemiology, ecological modeling, and the social and economic research relevant to the transition are locked behind paywalls, buried in an unnavigable corpus, distorted by citation chains that may not accurately represent the state of knowledge, and inaccessible to most of the practitioners who need to act on them.
At the same time, artificial intelligence systems are being trained on this same corpus — ingesting its errors, its gaps, its biases, and its commercial distortions at scale, and reproducing them in systems that are then deployed in medical diagnosis, legal reasoning, policy analysis, and scientific research. The closed, fragmented, commercially captured scholarly record is not a neutral backdrop to the AI moment. It is shaping what AI knows, and therefore what AI does.
This is the qiyamah moment. Qiyamah: the Arabic/Ismaili term for the day of standing, the apocalypse-unveiling, the end that is also a resurrection. Not destruction for its own sake but the lifting of the veil — kashf — that reveals what has always been present but obscured. The publishing system does not need reform. It needs the veil lifted. What is underneath is a choice: let the record become what it has always claimed to be, or watch it become the training data for a civilization that cannot find its way out of the crisis it documented so thoroughly but never shared.
4. The Proposal — Three Pillars
Pillar I: Universal Open Access as Ground Condition
Harnad’s 1994 vision has not been realized. More than three decades on, only approximately 28% of the scholarly literature was freely available as of Piwowar et al.'s (2018) measurement. The majority remains paywalled or access-restricted. The first pillar of the proposal is therefore not a novelty — it is the overdue execution of a commitment already made.
- Immediate mandatory open deposit of all publicly funded research (Green OA minimum), with institutional and funder enforcement
- Retrospective digitization and OA release of all historic print literature, building on Internet Archive, HathiTrust, and national library initiatives
- Public infrastructure funding for Diamond OA models, eliminating both reader paywalls and author fees
- Abolition of impact factor as a primary evaluation metric in hiring, funding, and promotion decisions
- Harnad’s formulation as the benchmark, not the aspiration: every paper, on every researcher’s desktop, navigable and citation-hyperlinked, 24 hours a day, for free, for all, forever
Pillar II: AI-Enabled Knowledge Architecture
Universal open access creates the material condition for the second pillar. An open corpus is not yet a knowledge architecture. It is a pile. The transformation from pile to navigable knowledge requires AI-enabled structuring across five functions:
- Citation genealogy tracing: map what derives from what, through which intermediaries, with what transformations — making the lineage of every claim traceable to its evidential origins
- Error propagation detection: identify where false or contested claims have traveled through the corpus, accreting false authority through citation (Greenberg, 2009)
- Suppressed research identification: surface research lines that were ignored, undercited, or structurally marginalized — the roads not taken, and the question of why they were not taken
- Theory evolution visualization: dynamic maps of how conceptual frameworks developed, diverged, and converged over time — the intellectual history that currently exists only in the heads of senior scholars and is lost when they retire
- Cross-domain bridge identification: find the connections between disciplines that siloed publishing structures actively prevent — the insight from ecology that economists need, the methods from physics that biologists have not yet applied
Pillar III: The Philosophia Animatrix — The Self-Improving Epistemic Commons
The third pillar is where the argument opens onto terrain that the existing open access and scholarly communication literature has not yet fully mapped. It draws directly on two convergent lines of AI research.
The first is the insight of Vaswani et al. (2017) — ‘Attention Is All You Need’ — that architectural elegance, specifically the transformer’s self-attention mechanism, could replace the brute-force recurrence of previous models. The transformer does not process information sequentially and exhaustively. It attends to what is relevant. This principle scales: a system that knows what to pay attention to can be radically more efficient than one that processes everything equally.
The second is the insight of Gunasekar et al. (2023) — ‘Textbooks Are All You Need’ — that a 1.3 billion parameter model trained on textbook-quality data outperforms a 70 billion parameter model trained on web-scraped noise. Quality of training data, not quantity or raw compute, is the primary determinant of model capability at a given scale. The web is to AI training what landfill is to materials science: technically available, practically unusable at the level of quality the task requires.
These two insights, taken together, point toward a third: if the open scholarly record is properly curated, genealogically structured, error-detected, and made navigable, it becomes not just the best training corpus available for an epistemically capable AI — it becomes the basis for a self-improving system.
The architecture of the Philosophia Animatrix exploits a recursive loop that the current publishing system has no mechanism to enable:
- Cycle 1 — The Initial Training: The AI trains on the curated, genealogically structured, error-detected open corpus. Unlike web-scraped training data, this corpus carries explicit quality signals: citation genealogy, replication status, cross-domain validation, provenance documentation. It is, in Gunasekar et al.'s terms, textbook-quality data — not because it has been simplified, but because it has been structured.
- Cycle 2 — The Self-Referential Improvement: The trained system now understands the corpus it was trained on. It can identify which portions of the corpus produced high-quality reasoning in its own outputs and which produced error. It can generate structured synthetic summaries of reliable knowledge nodes, improving the quality and density of the training corpus for the next cycle. Each iteration produces a more capable system that produces a better-curated corpus that trains a more efficient next system.
- Cycle 3 — The Mixture of Experts Architecture: DeepSeek’s efficiency breakthrough (Shao et al., 2024) demonstrated that a ‘mixture of experts’ architecture — in which only the relevant parameter clusters are activated for a given query — produces frontier-level performance at a fraction of the compute cost. A domain-structured scholarly corpus maps naturally onto this architecture: the system activates the relevant knowledge sub-graphs for each query, remains small, and runs efficiently — potentially on modest hardware, at the edge, decentralized.
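The routing mechanism Cycle 3 borrows can be illustrated in a few lines. The sketch below is a toy gate over hypothetical domain experts; the expert names, parameter counts, and relevance scores are all invented, and real mixture-of-experts gates are learned functions of the input, not hand-assigned scores.

```python
# Toy sketch of sparse mixture-of-experts routing: only the experts
# relevant to a query are activated. Domains, parameter counts, and
# scores are invented for illustration.
import heapq

EXPERTS = {  # expert name -> parameter count (illustrative)
    "climate": 4_000_000, "epidemiology": 4_000_000,
    "economics": 4_000_000, "ecology": 4_000_000,
    "materials": 4_000_000, "history_of_science": 4_000_000,
}

def route(scores, k=2):
    """The 'gate': pick the top-k experts by relevance score."""
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical relevance scores for one climate-adaptation query:
scores = {"climate": 0.61, "ecology": 0.27, "epidemiology": 0.06,
          "economics": 0.03, "materials": 0.02, "history_of_science": 0.01}

active = route(scores, k=2)
used = sum(EXPERTS[e] for e in active)
total = sum(EXPERTS.values())
print(active)                                      # ['climate', 'ecology']
print(f"parameters activated: {used / total:.0%}")  # 33%
```

The point survives the simplification: the query touches the climate and ecology sub-graphs and leaves the rest of the model cold, which is the mechanical basis of the energy argument that follows.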
The implications for energy efficiency are significant. Current frontier AI systems require massive data centers and enormous energy inputs because they are trained on enormous quantities of low-quality data and run as monolithic models that activate all parameters for every query. A Philosophia Animatrix trained on a curated scholarly corpus, using sparse mixture-of-experts architecture, with self-improving training cycles that reduce data requirements over iterations, represents a fundamentally different engineering trajectory: not maximum intelligence at any cost, but maximum epistemic density at minimum energy footprint.
This is not classical Moore’s Law — the doubling of transistor density every 18–24 months. It is an epistemic analog: the density of reliable, navigable, interconnected knowledge per unit of energy, increasing as the system learns to know what it knows and curate what it trains on. The curve is potentially steep, because the principal bottleneck of current AI — garbage training data — is precisely what the open scholarly record, properly structured, eliminates.
The name encodes the vision. Philosophia: the love of wisdom, not the accumulation of information. Animatrix: she who animates — the feminine generative principle in Neoplatonic and Ismaili thought, the intellect that gives form to what would otherwise remain potential. The knowledge does not sit. It breathes. It teaches the system that teaches itself to understand it better. This is not a metaphor. It is the technical architecture.
One structural warning must be built into the design: the Jevons Paradox. Historically, efficiency gains in energy systems have increased total consumption, not reduced it, because cheaper access drives demand expansion. A Philosophia Animatrix governed as a commercial product would face the same dynamic. The system must be explicitly designed as a commons, with governance structures that prevent commercial capture, define a sufficient-capability threshold rather than pursuing maximum capability, and treat energy efficiency not as a cost advantage to be monetized but as an ethical obligation of a system designed to serve knowledge in an era of ecological constraint.
5. Precedents and the Gap This Paper Addresses
The infrastructure for Pillar II already exists in nascent form:
- OpenAlex (OurResearch, 2022): 250M+ works, fully open API, rich metadata including citation graphs, funder and institutional data. The closest existing infrastructure to the open knowledge base the Philosophia Animatrix requires.
- Semantic Scholar (Allen Institute for AI): NLP-extracted semantic relationships, citation context, influence scores. Demonstrates that automated semantic extraction from scholarly literature at scale is technically feasible (Ammar et al., 2018).
- S2ORC (Lo et al., 2020): 81 million academic papers with metadata and citation graphs, including structured full text for the roughly 8 million open-access subset. Direct infrastructure precedent for full-corpus AI training.
- Connected Papers / Litmaps: visual citation neighborhood tools. Useful at the individual paper level but shallow, proprietary, and not integrated with error detection or theory evolution analysis.
- Graph of AI Ideas (GoAI): structured extraction of relational claims (‘based on,’ ‘in contrast to’) from AI literature. A promising prototype for the kind of semantic relationship extraction the full system requires.
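As a concrete illustration of what Pillar II's genealogy tracing looks like on this infrastructure, the sketch below builds ancestor lineages from OpenAlex-style work records. The `id` and `referenced_works` fields are real OpenAlex fields; the records themselves are invented stand-ins for what a live query to api.openalex.org would return.

```python
# Minimal sketch of citation-genealogy extraction over OpenAlex-style
# records. Live data would come from the OpenAlex API
# (https://api.openalex.org/works); the three records below are
# invented stand-ins using OpenAlex's `id` and `referenced_works` fields.

records = [
    {"id": "W_claim",  "referenced_works": ["W_review"]},
    {"id": "W_review", "referenced_works": ["W_origin"]},
    {"id": "W_origin", "referenced_works": []},
]

def genealogy(records):
    """Turn work records into a work -> direct-ancestors map."""
    return {r["id"]: list(r["referenced_works"]) for r in records}

def lineage(graph, work):
    """All ancestors of `work`: everything its claims may rest on."""
    out, stack = [], list(graph.get(work, []))
    while stack:
        w = stack.pop()
        if w not in out:
            out.append(w)
            stack.extend(graph.get(w, []))
    return out

g = genealogy(records)
print(lineage(g, "W_claim"))  # ['W_review', 'W_origin']
```

Run over the full OpenAlex graph rather than three toy records, the same inversion is the raw material for every function listed above: genealogy tracing directly, and error propagation, suppression detection, and theory mapping once quality and provenance signals are attached to the edges.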
The gap this paper addresses is the absence of an integrated conceptual framework that connects universal OA infrastructure, AI knowledge architecture, and the self-improving recursive training loop as a coherent proposal. The components exist in separate research silos. The synthesis — and its implications for energy efficiency, epistemic quality, and civilizational utility — has not been articulated.
6. Limitations and Self-Aware Caveats
- The updated 100M estimate carries methodological uncertainty. Coverage of non-English, non-STEM, and grey literature remains inconsistent across databases. The figure is an order-of-magnitude claim, not a precise count, and should be read as such.
- AI knowledge architectures built on the existing scholarly corpus will reproduce its existing biases — most acutely the systematic exclusion of Indigenous knowledge systems, Global South scholarship, non-Western epistemological traditions, and forms of knowledge that have never entered the peer-review pipeline. A Philosophia Animatrix built only on the formal scholarly record would know a great deal and understand relatively little. This is not a technical limitation. It is a political one, and it requires political response: active corpus expansion, Indigenous data sovereignty protocols, and explicit epistemic humility built into system design.
- Open infrastructure requires sustained public funding that is politically vulnerable. The recent defunding of SPARC, the withdrawal of US federal open access mandates under successive administrations, and the general trend toward privatization of public knowledge infrastructure all represent real threats to the preconditions of this proposal.
- The Jevons Paradox warning is not merely theoretical. It must be structurally addressed in any governance proposal for the system, not treated as a future problem.
- The qiyamah framing is deliberately polemical. We acknowledge this is an argument for a position, not a neutral survey of the landscape. We believe the situation warrants it.
7. What We Actually Need These Papers For
The question in the title is not rhetorical. It deserves a direct answer.
We need the papers as the accumulated record of human inquiry — the ongoing, imperfect, irreplaceable collective project of trying to understand the world well enough to live in it and not destroy it. That project is real. Its value is not in question.
What is in question is whether the system that produces, distributes, and structures that record serves the project or has captured it. The answer, as of 2026, is that the system has in large part captured it. The papers exist. The knowledge is in them. And most of it is locked, unnavigable, epistemically distorted by incentive structures, and inaccessible to the people and systems that need to act on it.
The transformer architecture of Vaswani et al. teaches us that attention, not exhaustive processing, is the key to intelligence. The Phi models of Gunasekar et al. teach us that quality, not quantity, is the key to learning. The 100 million articles that now constitute the scholarly record are an attention problem and a quality problem simultaneously.
What we need these papers for is what they have always been for: knowing things, correcting errors, passing knowledge forward, and making the next generation of inquiry possible. The Philosophia Animatrix is the proposal that we build the infrastructure worthy of that purpose.
One hundred million papers. The qiyamah is not that we produced them. The qiyamah is that we might finally open them.
III. Annotated Bibliography (Working Draft)
Organized thematically. Annotations are argumentative, not merely descriptive: each entry is positioned in relation to the paper’s specific claims. For co-author review — entries to be verified and formatted to target journal style.
A. The ‘All You Need’ Citation Lineage — Primary Dialogue Partners
These two papers are cited in the epigraph and at multiple points throughout the text. They are the most important external citations for positioning the paper in AI/ML discourse.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
The foundational transformer paper: one of the most-cited works in the history of computer science. Vaswani et al. demonstrated that self-attention mechanisms could replace recurrence entirely, producing a more efficient and parallelizable architecture that became the basis for all subsequent large language models. The paper's title is the first element of the rhetorical lineage this paper both continues and subverts. Cited in the epigraph and in Section 4, Pillar III: the transformer's attention mechanism is the architectural precedent for the Philosophia Animatrix's domain-specific activation logic. The principle 'attend to what is relevant' scales from token-level attention to corpus-level epistemic architecture.
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., ... & Bubeck, S. (2023). Textbooks are all you need. arXiv:2306.11644.
The Phi-1 paper: demonstrates that a 1.3 billion parameter model trained on textbook-quality data — curated, structured, reasoning-dense — outperforms models up to 25x larger trained on web-scale noise. The key claim: 'high quality data dramatically improves the learning efficiency of language models as they provide clear, self-contained, instructive, and balanced examples.' This is the most important direct precedent for the Philosophia Animatrix training proposal. If textbook-quality synthetic data produces these results, the curated open scholarly record — the actual accumulated record of human inquiry, genealogically structured and error-detected — would constitute the highest-quality non-synthetic training corpus available. Cited in epigraph, Section 4 Pillar III, and the title's rhetorical structure. Essential citation.
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., ... & Liang, W. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300. [See also: DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.]
The DeepSeek efficiency papers demonstrate that architectural innovation under constraint — specifically mixture-of-experts activation and reinforcement learning from correctness signals rather than human annotation — can produce frontier-level reasoning performance at reported training costs approximately 95% below Western equivalents. Crucially, this breakthrough emerged from US export controls on high-performance chips, forcing Chinese AI developers to prioritize algorithmic elegance over raw compute. The irony is exact: constraint produced the most efficient architecture. Cited in Section 4, Pillar III, for the mixture-of-experts mechanism as the basis for domain-specific activation in the Philosophia Animatrix architecture.
B. The Quantitative Scholarly Record
Jinha, A. E. (2010). Article 50 million: An estimate of the number of scholarly articles in existence. Learned Publishing, 23(3), 258–263.
The paper this work follows from. Estimated the cumulative peer-reviewed corpus at approximately 50 million using historical publication growth curves traced to the first scientific journals in 1665. The methodology, assumptions, and caveats are the baseline for the updated estimate in Section 1. The self-citation is not pro forma: this paper is an explicit return to the same question sixteen years later. The original asked 'how many?' This one asks 'so what?'
Bornmann, L., & Mutz, R. (2015). Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11), 2215–2222.
Documents the exponential growth of the scholarly record, with a doubling time of approximately 9 years. Provides the quantitative foundation for projecting from the 50M baseline to the 100M threshold. The growth rate has been substantially affected by the expansion of predatory publishing and open-access mandates since 2015 — current rates may be higher than Bornmann & Mutz projected.
Larivière, V., Haustein, S., & Mongeon, P. (2015). The oligopoly of academic publishers in the digital era. PLOS ONE, 10(6), e0127502.
Demonstrates that five major publishers controlled approximately 50% of all papers published across disciplines by 2013, with the share increasing over time. The market concentration data is the empirical anchor for Section 2b's commercial capture argument. Open access.
C. Open Access: History, Theory, and Infrastructure
Harnad, S. (1994). A subversive proposal. In Okerson, A. & O’Donnell, J. (Eds.), Scholarly Journals at the Crossroads. Association of Research Libraries.
The foundational Green OA manifesto: Harnad's call for researchers to self-archive their work in open electronic archives. The vision of every paper on every researcher's desktop, free, forever, is cited in this paper as the benchmark against which the current system fails — and as the standard that the Philosophia Animatrix proposal requires as its material precondition. That this vision is more than three decades old and still unrealized is itself part of the argument.
Budapest Open Access Initiative. (2002). Budapest Open Access Initiative. Open Society Foundations.
The landmark 2002 declaration that defined open access and established the Green/Gold OA distinction. Foundational policy document. Twenty-four years on, the vision remains unrealized for the majority of the corpus — a fact that itself supports the paper's argument that incremental reform is insufficient.
Suber, P. (2012). Open Access. MIT Press. (Available OA: https://mitpress.mit.edu/9780262517638/)
The definitive comprehensive treatment of open access: its rationale, mechanisms, obstacles, and policy landscape. Suber's taxonomy (Green, Gold, Diamond, Gratis, Libre) provides the vocabulary for Section 4, Pillar I. The irony of a book about open access being freely available in open access is the kind of thing Suber would appreciate.
Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B., ... & Haustein, S. (2018). The state of OA: A large-scale analysis of the prevalence and impact of open access articles. PeerJ, 6, e4375.
Large-scale analysis finding that approximately 28% of the scholarly literature was freely available as of 2016. Provides the current baseline data on OA uptake and documents the 72% majority that remains inaccessible. The 28% figure is the empirical measure of how far Harnad's vision remains from realization.
D. Quality, Replication, and Epistemic Distortion
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.
Load-bearing anchor for Section 2c. Demonstrates mathematically that, under prevailing conditions of small sample sizes, high researcher degrees of freedom, selective reporting, and publication bias, the majority of published research findings are likely to be false. This is not a fringe position: it is a mathematical consequence of incentive structure, extensively cited and replicated. At 100 million papers, the proportion of false or unreliable findings in the corpus is a quantitative problem of the first order. Must be described in sufficient detail in the text that readers who have not read it understand what it actually demonstrates.
Greenberg, S. A. (2009). How citation distortions create unfounded authority: Analysis of a citation network. BMJ, 339, b2680.
The single best empirical demonstration of citation chain error propagation. Greenberg traced a specific false claim through a citation network, showing how it accreted false authority through successive citation until it became canonical fact, despite having no original empirical basis. This paper is the case study for the error-detection function proposed in the Philosophia Animatrix: if the system can identify Greenberg-type citation chain distortions across the full open corpus, the epistemic gain is enormous. Describe in detail in the text.
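The error-detection function this entry motivates can be sketched as a reachability test over a citation graph: flag any claim whose citation chains never terminate in a paper containing primary empirical data. The graph, paper labels, and `primary` set below are invented for illustration; a real system would run this over an open corpus such as S2ORC or OpenAlex.

```python
from collections import deque

def reaches_primary(paper, cites, primary):
    """Breadth-first search through the citation graph from `paper`;
    True if any chain of citations reaches a paper marked as
    containing primary empirical data."""
    seen, queue = {paper}, deque([paper])
    while queue:
        current = queue.popleft()
        if current in primary:
            return True
        for cited in cites.get(current, []):
            if cited not in seen:
                seen.add(cited)
                queue.append(cited)
    return False

# Hypothetical Greenberg-type chain: E and D cite review C, C cites B,
# B cites A — but A, the root of the chain, reports no primary data.
cites = {"E": ["C"], "D": ["C"], "C": ["B"], "B": ["A"], "A": []}
primary = set()  # no paper in this chain has primary empirical data
unfounded = [p for p in cites if not reaches_primary(p, cites, primary)]
```

Every paper in this toy graph is flagged as unfounded; marking `A` as primary would clear the entire chain, which is exactly the asymmetry Greenberg documented — authority accretes through citation regardless of whether the root holds.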
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Large-scale replication study finding fewer than 40% of 100 psychology studies reproduced successfully. Direct empirical evidence that the quality crisis is not theoretical. Supports the argument that the 100M corpus is not a pile of reliable knowledge but a pile of mixed signal and noise — and that distinguishing signal from noise is a task the current infrastructure cannot perform.
E. AI Knowledge Architecture and Scholarly Infrastructure
Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv:2205.01833.
Describes OpenAlex, a fully open index of 250M+ scholarly works with rich metadata, citation data, and open API. The closest existing infrastructure to the open knowledge base the Philosophia Animatrix requires. Essential for Section 5 precedent discussion and for demonstrating that the infrastructure precondition of Pillar II is already in development.
Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., ... & Etzioni, O. (2018). Construction of the literature graph in Semantic Scholar. Proceedings of NAACL-HLT 2018, 84–91.
Describes the construction of Semantic Scholar's large-scale literature graph, extracting entities, relationships, and citations from millions of papers using NLP. Direct technical precedent for the AI knowledge architecture proposed in Pillar II. Demonstrates that automated semantic extraction from scholarly literature at scale is feasible.
Lo, K., Wang, L. L., Neumann, M., Kinney, R., & Weld, D. S. (2020). S2ORC: The Semantic Scholar Open Research Corpus. Proceedings of ACL 2020, 4969–4983.
Introduces a large-scale open corpus of 81.1 million academic papers with structured full text and citation graphs. Infrastructure precedent for full-corpus AI training. Demonstrates that the material precondition for Philosophia Animatrix training — a large-scale, structured, openly accessible scholarly corpus — is not a future aspiration but a present reality in embryonic form.
Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., ... & Barabási, A.-L. (2018). Science of science. Science, 359(6379), eaao0185.
Comprehensive review of the emerging discipline of 'science of science' using large-scale bibliometric data to understand knowledge production, diffusion, and impact. The methodological foundation for AI-enabled knowledge architecture. This is the field that the Philosophia Animatrix would most directly build on and transform.
F. Structural Critiques and Reform Visions
Tennant, J. P., Waldner, F., Jacques, D. C., Masuzzo, P., Collister, L. B., & Hartgerink, C. H. J. (2016). The academic, economic and societal impacts of Open Access: An evidence-based review. F1000Research, 5, 632.
Comprehensive evidence review of OA's demonstrated benefits: access equity, citation impact, economic efficiency, and public engagement with science. The affirmative evidentiary case for Pillar I. Freely available, appropriately.
Priem, J., & Hemminger, B. H. (2012). Decoupling the scholarly journal. Frontiers in Computational Neuroscience, 6, 19.
Proposes decomposing the scholarly journal into its component functions — registration, certification, dissemination, archiving — which can be rebuilt on open digital infrastructure. Provides a reform framework compatible with Pillar II: the Philosophia Animatrix does not replace peer review, it builds on a restructured infrastructure in which peer review's functions are separated from the commercial packaging that currently bundles them.
Buranyi, S. (2017, June 27). Is the staggeringly profitable business of scientific publishing bad for science? The Guardian.
Longform journalism tracing the political economy of academic publishing from Robert Maxwell's acquisition of Pergamon Press to the present oligopoly. Accessible, well-sourced account of how a public good became a private rent. Useful for framing the commercial capture argument for a broader, non-specialist audience.
G. Philosophical and Civilizational Framing
Haraway, D. (1988). Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies, 14(3), 575–599.
Argues that all knowledge is situated — produced from particular positions, bodies, and histories — and that the fiction of the view from nowhere distorts both science and politics. Grounds the Philosophia Animatrix's design requirement for epistemic transparency: the system must document its own provenance, training corpus limitations, and positional assumptions. A knowledge architecture that does not know where it came from is not wisdom. It is the current system at scale.
Mignolo, W. (2009). Epistemic disobedience, independent thought and decolonial freedom. Theory, Culture & Society, 26(7–8), 159–181.
Frames the dominant organization of knowledge production as an extension of colonial power, calling for epistemic disobedience: the refusal to accept the terms of the existing knowledge system. Grounds the paper's concern, acknowledged in Section 6, that a Philosophia Animatrix built only on the formal Western scholarly record would reproduce its exclusions at AI scale. The response is not to abandon the project but to design corpus expansion and governance with explicit decolonial intent.
Corbin, H. (1983). Cyclical Time and Ismaili Gnosis. Kegan Paul International.
Corbin's phenomenology of Ismaili cyclical time and the concept of qiyamah as kashf — unveiling, the lifting of the veil — provides the philosophical grounding for the term as used in this paper. Qiyamah in this tradition is not apocalypse as destruction but apocalypse as revelation: the moment when what has always been present becomes visible. The 100 million article threshold is a kashf moment for the publishing system — not because the pile grew, but because the pile's inadequacy to its own purpose has become impossible to ignore.
--- END OF WORKING DRAFT — VERSION 2, APRIL 2026 ---
Notes for co-author: The Greenberg (2009) and Ioannidis (2005) citations are load-bearing and should be described in detail in the text, not merely cited. The Vaswani (2017) and Gunasekar (2023) citations are the rhetorical spine connecting us to AI discourse — they should appear in the abstract, the epigraph, and Section 4. The Corbin (1983) citation in Section G may belong in the main text depending on how far we take the qiyamah framing. The DeepSeek citation cluster (Shao et al. / DeepSeek-AI 2025) is the newest material and needs verification of the most current paper to cite. The title is a provocation — worth testing with 2–3 trusted readers before submission to assess whether it lands as intended or risks dismissal on tone grounds. My instinct is it lands.