Bug in Springer Nature metadata may be causing ‘significant, systemic’ citation inflation

Millions of researchers could be affected by a “dramatic distortion of citation counts” likely caused by flaws in how the academic publishing giant Springer Nature handles article metadata, according to a new preprint.

The bug means a large number of citations are automatically attributed to the first paper in a given journal volume, instead of to whichever paper in that volume they were intended for. The issue appears to affect many of the publisher’s online-only titles, such as Nature Communications, Scientific Reports and several BMC journals.

“It seems that millions of scientists lost a few citations, while tens of thousands, the authors of Article 1s, gained all these, leading to insane citation counts,” Tamás Kriváchy of the Barcelona Institute of Science and Technology, in Spain, told us. His findings appeared earlier this month on arXiv.org. And those citation losses and gains are through no fault (or intention) of the authors themselves. In fact, one author we spoke with has tried, without success, to get mistaken citations removed from her paper. 

A spokesperson for Springer Nature questioned the new data and said the preprint’s conclusions “could be misleading.”

According to the analysis, the distorted statistics appear on the journals’ own websites and in free citation databases such as Crossref, OpenCitations and Google Scholar. The problem could make it harder for scientists to find out which studies cite which and could give some authors unfair advantages in winning grants, promotions and jobs, Kriváchy said.

Whether the effects carry over to the two major commercial citation databases, which the researcher could not access, is unclear. But one expert told us they might.

“My analyses confirm that the study’s main concern is valid: citation-linking errors appear to be significant, systemic, and spill over even into curated databases such as Scopus and Web of Science,” said Lokman Meho, a bibliometrician at the American University of Beirut in Lebanon.

“If such mistakes are high, the implications could be profound,” Meho added. “Inflated citation counts distort measures of scholarly influence, misrank universities, mislead funding decisions, and compromise evidence-based science policy. They also challenge one of the field’s core assumptions: that curated citation databases are insulated from the problems encountered in open systems such as Crossref or OpenCitations.”

The preprint highlights a paper designated as “Article number 1” in the 2018 volume of Nature Communications, “Structural absorption by barbule microstructures of super black bird of paradise feathers.” The work has garnered more than 7,000 citations, according to the journal’s website, and Crossref, OpenCitations and Semantic Scholar provide comparable numbers. Meanwhile, Google Scholar lists 584 citing papers as of this writing, Clarivate’s Web of Science 582, and Scopus 1,323.

According to emails we have seen, the corresponding author of that paper, Dakota McCoy of the University of Chicago, contacted Nature Communications in April of this year, stating her paper was “frequently cited spuriously.” An editorial assistant supervisor for the journal replied: “I am afraid that we are unable to determine any steps we are able to take to resolve the issue on our side as it looks as if no errors occurred on the original publication of the paper.” They speculated that the issue may have come from the citing journal, may have been a citation error that kept getting propagated, or was somehow being influenced by AI.

“This is a bizarre and annoying problem that we first noticed back in 2023 and haven’t been able to solve, despite emailing editors and Googling for hours. Even worse, the articles that are meant to receive those >400 citations aren’t receiving them!” McCoy told us by email. “It is unfortunate because it makes it difficult to track the true impact of our paper.”

“I’m so happy to see that this preprint may have identified the source issue,” she added.

McCoy’s coauthor Richard Prum, an ornithologist at Yale University, told us: “Many of the articles in Google Scholar that are counted as citations of us actually make no mention whatsoever of any research related to us or our paper! So, the problem is compounding!!”

Meho said he had confirmed the problems in an analysis of Scopus data for Prum and McCoy’s article, as well as three other papers in Nature Communications that also have more than 1,000 citations each, according to the database.

“When I extracted and examined the actual cited references in those citing papers, I found that fewer than 250 references in each case actually cited the target article. In other words, roughly three out of four citation links were erroneous, a discrepancy far too large to attribute to chance or isolated database glitches,” he told us. In at least one case, the errors did not seem to be explained by the bugs the preprint described, he said.

“The study also raises a larger question: Is this problem confined to Springer Nature, or is it an early warning of hidden structural vulnerabilities in how citation data are exchanged and standardized across all major publishers?” Meho said. “If millions of citation links can silently go astray in a system as central as Springer Nature’s, then research evaluation itself needs urgent scrutiny. The credibility of metrics, rankings, and even funding depends on the reliability of these invisible networks of data.”

According to Kriváchy, the problems appear to have originated with the advent of online-only journals several years ago. These publications typically reference articles using an article number instead of the page numbers traditional print journals use.

“Based on our analysis, the mis-citations happen primarily due to the above adaptation from a page-based numbering to an article number-based one; more specifically, from the improper technical handling of the change,” the preprint states. “The problem seems to stem from the absence of the Article Number in most formats of the article metadata obtained through the SpringerLink Application Programming Interface (API), or possibly from the handling of the fields in RIS file format provided by the publisher on Springer Nature Link websites.”
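The mechanism the preprint describes can be illustrated with a minimal sketch. It is not Springer Nature's or any citation database's actual code. It assumes a hypothetical matcher that identifies articles by journal, volume and first page, and it assumes that once the article number is dropped, every article's first page falls back to 1, the start of its own PDF. The `Article` record, the function names and the placeholder DOI for Article 1 are all invented for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified record; real metadata carries many more fields.
Article = namedtuple("Article", "doi journal volume article_number")


def page_key(article):
    """Key a page-based matcher might use: (journal, volume, first page).

    Assumption: with the article number missing, the first page defaults
    to 1 for every online-only article, because each PDF starts at page 1.
    """
    return (article.journal, article.volume, 1)


def build_index(articles):
    """Map each key to the first article registered under it."""
    index = {}
    for art in sorted(articles, key=lambda a: a.article_number):
        # setdefault keeps the earliest entry, so later articles that
        # share the same key never become reachable through the index.
        index.setdefault(page_key(art), art)
    return index


volume = [
    Article("10.1038/s41467-018-05758-5", "Nat. Commun.", 9, 3485),
    Article("10.0000/placeholder-article-1", "Nat. Commun.", 9, 1),  # placeholder DOI
]

index = build_index(volume)
cited = volume[0]                    # the paper the author meant to cite
resolved = index[page_key(cited)]    # what the page-based lookup returns
print(resolved.article_number)       # 1
```

Under these assumptions every article in the volume shares one key, so any citation resolved this way lands on Article 1 regardless of which paper was intended.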

Springer Nature emphasized that, as a preprint, the new work “has not yet undergone peer review or independent validation.”

“Looking at the conclusions we suspect they could be misleading due to incomplete data,” a spokesperson said. “In the meantime, we are looking at all of the data ourselves as we are always open to feedback and to ensure that we continue to do the best for our authors.”

Kriváchy told us that, in addition to fixing the technical issues, the publisher “should put together a thorough report” addressing the cause of the problem as well as which journals have been affected and for how long.

Some of the damage won’t be fixable, however, said Alberto Baccini of the University of Siena, who studies publication metrics.

“The well-known Matthew Effect in bibliometrics indicates that highly cited papers become even more cited simply because they are perceived as important. Therefore, the initial metadata error has probably influenced researchers’ citation behavior, leading them to cite these ‘false’ highly cited papers precisely because of their high citation count,” Baccini told us. “After the data is corrected, how many of the remaining citations were received solely thanks to this mechanism? This is an unfixable problem.”

“We are all aware of the pollution infecting contemporary science and the mechanisms that have corrupted citation counts such as citation mills,” he added. “I hope that this ‘genuine’ error, originating from one of the major players in scientific publishing, will serve as a turning point. It should compel us to abandon our blind faith in quantitative metrics – a faith that has contributed so significantly to the corruption of contemporary science.”



6 thoughts on “Bug in Springer Nature metadata may be causing ‘significant, systemic’ citation inflation”

  1. Why do y’all refer to CrossRef as an open system in this post? CrossRef is a membership based organization, and you have to pay fees in order to mint DOIs, add references, contribute or otherwise enhance metadata.

    1. Exactly correct. Crossref provides a public API, but to create a DOI one must be a fee-paying member. Corrections can only be done by those fee-paying members to their own data. Crossref doesn’t do that for them.

      1. The public Crossref API is the only reason I could think of for why they would be referring to Crossref as an open system. Even then, Crossref could choose to stop offering a public API, unless it’s in their bylaws. “Open” is doing a lot of vague work in this piece.

  2. Hi,

I want to leave some hints here on how publishers can tackle the issue. This is not meant to be an absolute truth – it reflects my current understanding of how the metadata situation affects MDPI in particular.

    In your workflow, when trying to match a reference with DOI numbers, you may use the CrossRef API. Some publishers (as in our case) use the CrossRef metadata to fill gaps in, or partially correct, the reference metadata of a paper. In such a scenario, you may tap into the “page” field from the CrossRef API. This field may show a value such as “1-8”, which can be the “legitimate” first few pages of a volume or an issue. However, it seems some of the vendor technology platforms push the PDF’s pagination there instead of the article number (where an article number is available; in such a case, “1-8” only indicates that the PDF is 8 pages long).

    This is the first part: how you might end up with a citation listing pages “1-8” instead of the article number, and how that wrong pagination might be reflected in the full-text XML markup. There seems to have been a second part, a compounding effect, so to speak. If the full-text XML was sent to PubMed Central (PMC), it seems they did reference matching using the bibliographic fields from each reference to get to the corresponding PMID – even if the publisher had already sent the correct PMID, they may have overridden it with a “wrong” PMID based on that information. The bug seems to have been fixed on the PMC side, judging by data currently fetched from PMC via Entrez / efetch. However, there is an older API on EuropePMC which still seems to surface the wrong PMID. An example:

    We are looking at reference 29 in this paper: https://www.mdpi.com/1420-3049/29/24/6025#References

    29. Qiu, W.; Xie, X.-Y.; Qiu, J.; Fang, W.-H.; Liang, R.; Ren, X.; Ji, X.; Cui, G.; Asiri, A.M.; Cui, G. High-performance artificial nitrogen fixation at ambient conditions using a metal-free electrocatalyst. Nat. Commun. 2018, 9, 1–8. 

    The citation is to this paper: https://www.nature.com/articles/s41467-018-05758-5 – note it has article number 3485 (not page 1-8). Its proper PMID is 30154483, i.e. https://pubmed.ncbi.nlm.nih.gov/30154483/

    The publisher full-text XML (https://www.mdpi.com/1420-3049/29/24/6025/xml) has the proper tagging for DOI and PMID, but the wrong pagination info:

    Journal: Nat. Commun.
    Year: 2018
    Volume: 9
    First page: 1
    Last page: 8
    DOI: 10.1038/s41467-018-05758-5
    PMID: 30154483

    PMC Entrez also surfaces the right DOI and PMID: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC11677930&rettype=full&retmode=xml

    Journal: Nat. Commun.
    Year: 2018
    Volume: 9
    First page: 1
    Last page: 8
    DOI: 10.1038/s41467-018-05758-5
    PMID: 30154483
    PMCID: PMC6113289

    If you use the EuropePMC API to fetch the XML for the same PMCID, the result is different – the publisher supplied PMID is commented out and replaced by another one: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC11677930/fullTextXML

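    Journal: Nat. Commun.
    Year: 2018
    Volume: 9
    First page: 1
    Last page: 8
    DOI: 10.1038/s41467-018-05758-5
    PMID (replacing the commented-out publisher-supplied value): 29317637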

    PMID 29317637 happens to be the PMID of Article 1 in this journal volume, i.e. Nat. Commun. 2018, Vol. 9, Article No. 1: https://pubmed.ncbi.nlm.nih.gov/29317637/

    Feel free to get in touch.

    Best regards,
    Dietrich Rordorf
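The override described in the comment above comes down to which evidence a reference matcher trusts first. The following is a minimal sketch of one possible ordering, in which a publisher-supplied DOI takes precedence and the journal/year/volume/first-page lookup is only a fallback. It is an illustration under stated assumptions, not the code of PMC, EuropePMC or any publisher. The `resolve` function and both lookup tables are hypothetical; the identifiers are the ones quoted in the comment.

```python
def resolve(reference, by_doi, by_pages):
    """Return (matched record, method used) for one parsed reference.

    Assumption: identifiers supplied by the publisher are more reliable
    than pagination, so the DOI is tried before the bibliographic key.
    """
    doi = reference.get("doi")
    if doi and doi.lower() in by_doi:
        return by_doi[doi.lower()], "doi"
    key = (
        reference.get("journal"),
        reference.get("year"),
        reference.get("volume"),
        reference.get("first_page"),
    )
    if key in by_pages:
        return by_pages[key], "pages"
    return None, "unmatched"


# Hypothetical lookup tables holding only the two records from the example.
by_doi = {
    "10.1038/s41467-018-05758-5": {"pmid": "30154483", "article_number": "3485"},
}
by_pages = {
    ("Nat. Commun.", "2018", "9", "1"): {"pmid": "29317637", "article_number": "1"},
}

# Reference 29 as tagged in the publisher XML: right DOI, wrong pagination.
reference = {
    "journal": "Nat. Commun.",
    "year": "2018",
    "volume": "9",
    "first_page": "1",
    "last_page": "8",
    "doi": "10.1038/s41467-018-05758-5",
}

record, method = resolve(reference, by_doi, by_pages)
print(record["pmid"], method)  # 30154483 doi

# Without the DOI, the same reference falls through to the page lookup.
no_doi = {k: v for k, v in reference.items() if k != "doi"}
record2, method2 = resolve(no_doi, by_doi, by_pages)
print(record2["pmid"], method2)  # 29317637 pages
```

With the DOI consulted first, the reference resolves to PMID 30154483. Relying on the page fields alone reproduces the misattribution to Article 1.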


      1. Hi Tamas,

        No, I do not think this is strictly related to SpringerNature.

        It is a mix of a few things:

        (1) using metadata fields without clear definition. Is it okay to write an article number into the page field? It is common practice, but is it the best practice? Should CrossRef and other databases enforce a strict separation of first page, last page and article number as three distinct fields?

        (2) implementation of search algorithms to look-up metadata when trying to match a DOI, PMID, etc. for a given reference string. If the look-up implementation is not very well thought-through and not tested on edge-cases, it may lead to misattribution.

        (3) the second point is especially true when the reference string contains errors in the first place, such as page numbers when an article number should be used. The error may stem from wrong metadata in some databases, or also from human error.
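Point (1) asks whether first page, last page and article number should be three distinct fields. A minimal sketch of what that separation could look like follows. It is a hypothetical data structure, not a schema used by Crossref or any other database. The `Locator` class and the `looks_like_pdf_pagination` heuristic are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locator:
    """Where an article sits in a volume, with no field doing double duty."""
    first_page: Optional[str] = None
    last_page: Optional[str] = None
    article_number: Optional[str] = None

    def citation_locator(self) -> str:
        """Prefer the article number; fall back to the page range."""
        if self.article_number:
            return self.article_number
        if self.first_page and self.last_page:
            return f"{self.first_page}-{self.last_page}"
        return self.first_page or ""


def looks_like_pdf_pagination(loc: Locator) -> bool:
    """Heuristic (an assumption, not a rule): a page range starting at 1
    next to an article number probably describes the PDF's length."""
    return loc.article_number is not None and loc.first_page == "1"


online_only = Locator(first_page="1", last_page="8", article_number="3485")
print(online_only.citation_locator())          # 3485
print(looks_like_pdf_pagination(online_only))  # True

print_journal = Locator(first_page="101", last_page="110")
print(print_journal.citation_locator())          # 101-110
print(looks_like_pdf_pagination(print_journal))  # False
```

Keeping the fields apart lets a consumer decide explicitly which one to cite, instead of guessing whether a “page” value is really an article number.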
