Surveys show spreadsheets are the most widely used analytical tool in academic research. But they are prone to errors. During the height of the COVID-19 pandemic, Public Health England lost nearly 16,000 test results after handling case data in a legacy Excel file format whose row limit silently truncated the files. Whether this mix-up hampered local infection control is anybody’s guess, but it certainly could have.
The software is causing trouble for other lines of research, too, as our team and others have shown: Genomics studies that rely on Excel spreadsheets to manage data turn out to be riddled with gene-name errors, and the issue is affecting more and more journals.
A persistent problem
Gene-name errors were first documented by Barry R. Zeeberg and his team at the National Institutes of Health in 2004. Their study showed that when genomic data are handled in Excel, the program automatically converts certain gene names to dates. What’s more, RIKEN clone identifiers – unique name tags given to pieces of DNA – can be misinterpreted as numbers with decimal points.
In 2016, we conducted a more extensive search, showing that of more than 3,500 articles published between 2005 and 2015 with Excel gene lists in their supplementary files, 20% contained errors.
There are no documented cases in which gene-name errors have affected the conclusions of a study. But if a researcher unknowingly saved a spreadsheet containing such errors, someone importing those data for further analysis would face a reproducibility problem.
In response to these issues, the HUGO Gene Nomenclature Committee (HGNC) has renamed some genes prone to these errors, converting SEPT1 to SEPTIN1 and MARCH1 to MARCHF1, for example.
Widespread adoption of these new names is a lengthy process: Our latest study, of more than 11,000 articles published between 2014 and 2020, found that 31% of supplementary Excel files that included gene lists contained errors – a higher percentage than in our 2016 study.
We also found some novel errors. Some were connected to the local language setting. For example, the AGO2 gene was converted to Aug-02, perhaps because “ago” is short for “agosto,” Spanish for “August.” Similarly, MEI1 was switched to May-01 (“mei” is Dutch for “May”), and TAMM41 to Jan-41 (“tammikuu” is Finnish for “January”).
One of the significant findings in our study was the five-digit conversion: when a gene name that has already been converted to a date is saved and then imported into a new Excel spreadsheet, the date changes into a five-digit number, Excel’s internal serial representation of that date. Software packages like HGNChelper and Gene Updater can help fix gene-name errors, but not once the names have been reduced to bare numbers.
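To make the five-digit problem concrete, here is a minimal Python sketch of what recovery would involve. The serial-to-date arithmetic is standard Excel behaviour; the final step, from date back to gene symbol, is the part no tool can do reliably, because several old symbols collapse onto the same date. The mapping below is illustrative only.

    from datetime import date, timedelta

    # Modern Excel date serials count days from 1899-12-30 (the offset
    # absorbs Excel's fictitious 1900 leap day for dates after Feb 1900).
    EXCEL_EPOCH = date(1899, 12, 30)

    # Illustrative guesses only: a real repair would consult the HGNC
    # history of renamed symbols, and even then the answer is ambiguous.
    DATE_TO_SYMBOL = {(9, 1): "SEPT1", (3, 1): "MARCH1", (12, 1): "DEC1"}

    def recover_gene_name(cell: str) -> str:
        """Flag five-digit cells as date serials and suggest a repair."""
        if cell.isdigit() and len(cell) == 5:
            d = EXCEL_EPOCH + timedelta(days=int(cell))
            symbol = DATE_TO_SYMBOL.get((d.month, d.day))
            guess = f"possibly {symbol}" if symbol else "unrecoverable"
            return f"{cell} -> {d.isoformat()} -> {guess}"
        return cell

    for cell in ["TP53", "42614", "43191"]:
        print(recover_gene_name(cell))  # 42614 comes back as 2016-09-01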
More and more journals affected
Despite increased recognition of gene-name errors, scientific journals have taken no significant initiatives to mitigate the problem. Our continuous tracking of these errors, the data for which are publicly accessible at https://ziemann-lab.net/public/gene_name_errors/, reveals a steady rate of gene-name errors and, until recently, a rising number of affected journals.
This spread may be due to the expansion of omics into different branches of biology and medicine, coupled with the growth of second-tier journals.
But the problem isn’t limited to lesser-known journals. We found the journals with the highest numbers of affected articles were Nature Communications, PLOS ONE, Scientific Reports, BMC Genomics, PLOS Genetics and Oncotarget, with at least 100 affected publications each. Clearly, journals need to produce guidelines about how to filter for and address gene-name errors that can affect data reproducibility.
A substantial number of genes may be misnamed when the computer is set to a non-English language. Our studies only scratch the surface of this problem, but they plainly show that Excel and other non-specialised spreadsheet programs are not designed to handle large genomic datasets.
Although gene-name errors may seem like small fry in the grand scheme of scientific data analysis, they are a symptom of broader issues like over-reliance on spreadsheets and lack of consensus on how large genomic datasets should be reviewed and disseminated. These issues need to be addressed to improve the reliability of genomics research.
Mandhri Abeysooriya is a second-year PhD student at Deakin University in Melbourne, Australia; Mark Ziemann is the head of the bioinformatics team at the Burnet Institute in Melbourne.
Abeysooriya would like to thank Severine Lamon, an associate professor at Deakin University, and Danielle Hiam, a postdoctoral fellow at Deakin University, for their invaluable insights, feedback and endless support in writing this article.
This research by Harvard Medical School also has an Excel-related gene name error issue.
https://pubpeer.com/publications/7D0F773579AB34E718D7843147F0A0
Loscalzo already has several retractions.
http://retractiondatabase.org/RetractionSearch.aspx#?auth%3dLoscalzo%252c%2bJoseph
https://pubpeer.com/search?q=Loscalzo
Any bioinformatician worth their salt would not use Excel for any stage during the analysis process. A lot of the tools used in this field are command line and use flat text files which have none of these conversion issues (unless there are quotes or commas, but this can be worked around).
I would suggest that the issue you refer to happens at the edges, where bioinformatics meets the real world of journals insisting that tables and data files be uploaded not as plain text but as formatted spreadsheets. That is where Microsoft’s default design choice comes into play: when importing plain text, anything that looks like a number becomes a number, anything that looks like a date becomes a date, and so on.
The analyses are unlikely to be incorrect, just the presentation of them.
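To illustrate the flat-text point, here is a minimal Python sketch (the file name is invented) in which gene symbols survive a CSV round trip verbatim; the quoting workaround for embedded commas is exactly what the standard csv module handles:

    import csv

    rows = [
        {"gene": "SEPT1", "note": "septin 1, formerly DIFF6, PNUTL3"},
        {"gene": "MARCH1", "note": "membrane associated ring-CH-type finger 1"},
    ]

    # Writing: fields containing the delimiter are quoted automatically.
    with open("genes.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["gene", "note"])
        writer.writeheader()
        writer.writerows(rows)

    # Reading: every field comes back as a plain string; nothing is ever
    # reinterpreted as a date or number, unlike a spreadsheet import.
    with open("genes.csv", newline="") as fh:
        for row in csv.DictReader(fh):
            print(row["gene"], "|", row["note"])  # SEPT1 stays SEPT1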
The “helpful” way Excel automatically changes imported text into numerals or dates can screw up the mortality tables I import, where “1-4” (intended to mean ages 1 to 4 years) gets transformed into January 4 of the current year.
You can see me talking about that here:
https://youtu.be/CthevUMiWK4?si=oVNLecFJh6lXSfkl&t=93
Then I step through the best practice for preventing this. One can absolutely prevent the conversion by forcing the imported columns to be formatted as text.
This is also a case of researchers failing to understand and learn to make proper use of their tools. Problems are easily prevented by setting the formatting of such spreadsheet columns appropriately (e.g., as text) before data entry. AGO2 and MEI1 will be left untouched.
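For those importing into analysis code rather than Excel, the analogous precaution is to force text types at read time. A minimal sketch with pandas (file and column names invented):

    import pandas as pd

    # Read every column as text, mirroring the "format the column as
    # Text before import" fix: "1-4" and "MARCH1" stay literal strings.
    table = pd.read_csv("mortality.csv", dtype=str)

    # Then opt individual columns back into numbers, explicitly, so any
    # surprise value fails loudly instead of being silently converted.
    table["deaths"] = pd.to_numeric(table["deaths"], errors="raise")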
“Problems are easily prevented by” could be expressed more clearly: as “Problems are easily created by not”
Also, “proper use of their tools” should be “use of the proper tools.”
The 737 MAX crashed due to “human error.” Repeatedly. Then they decided that they should think about fixing it (well, making its failure modes less obvious).
Fundamental design principle: “A system can’t just work well, it has to fail well.” (Doctorow)
Valuable work.
Another common error can occur in Excel when using LOOKUP functions to extract data from imported sets (e.g., look for this gene name in this imported list and return the value three columns to the right). If you don’t include a “FALSE” argument in the formula to force an exact match, it will return the next closest value from the table, even if the exact string you’re searching for is not there – literally generating data where none existed.
I won’t go into details about how I found out about this, other than to say I’m thankful it happened prior to publication!
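For anyone doing the same join outside Excel, here is a minimal Python sketch (data and column names invented) of an exact-match lookup, where a missing key surfaces as a blank rather than as the nearest neighbour:

    import pandas as pd

    genes = pd.DataFrame({"gene": ["TP53", "BRCA1"], "fold_change": [2.1, 0.4]})
    lookup = pd.DataFrame({"gene": ["TP53", "EGFR"],
                           "pathway": ["p53 signalling", "ErbB"]})

    # An exact left merge: genes absent from the lookup table come back
    # as NaN rather than silently matching the "next closest" entry,
    # which is what a LOOKUP without the FALSE argument would give you.
    merged = genes.merge(lookup, on="gene", how="left", validate="many_to_one")
    print(merged)  # BRCA1's pathway is NaN: visibly missing, not invented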
This is scary but understandable. My university pays for Microsoft licenses but not for most other software. So labs with small budgets that do not specialize in genomics but do the occasional experiment of that sort will of course use Excel.
Would love it if someone could recommend a replacement for Excel that is free (for academics at least) and less error-prone, but still at least somewhat intuitive.
You can solve spreadsheet problems by migrating the data to a database application. You can use MS Access, which is already included in the MS Office suite.
The key here is hiring a talented Access developer who is able to transform your Excel process into a database.
Best comment
LibreOffice is a “free” version of MS Office. It isn’t an exact copy of course, but it will import and export Excel files.
If you are going to use it for the long term, the moral thing to do is to throw them some cash once a year. (Don’t be a parasite.) They may have a university/organization license. I use the personal version, and do send them some money.
What is wrong with using R or Python and CSV files? Using a spreadsheet to do data cleaning or (worse still) detailed analysis of large datasets suggests you don’t know about reproducible research and command-line tools (like those above).
I gave a talk last month about how data analysis in R and Python is not very reproducible.
It can certainly be made so, but it takes effort.
I am curious about your talk — what prevents these analyses from being reproducible?
I believe the primary issue lies in data validation. Here’s a proven data management approach that has yielded success for clients in the banking and securities industry:
Rather than managing raw data directly within an Excel spreadsheet, create a custom database application to handle the data effectively:
1. Data validation and exception reporting: The application starts by uploading raw genomics data into a database. Validation rules are applied before the upload, ensuring data integrity and promptly generating exception reports if any issues arise.
2. Excel template integration: Afterward, the application queries the data and publishes it into a genomics Excel template, so researchers work with clean data, significantly improving accuracy and efficiency.
This method not only addresses the data validation concern but also streamlines the entire genomics data management process.
For a visual explanation of this process, please watch this brief 2-minute video:
Please pay particular attention to:
1. 0:40: Validation rules are applied and an Excel exceptions report is generated.
2. 1:10: The Excel template is populated with clean, processed data.
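A toy version of this two-step pipeline in Python, in case the shape of it is unclear (every file name and validation rule below is invented, and it assumes pandas with an Excel writer installed):

    import sqlite3
    import pandas as pd

    # Step 1: validate raw data before it goes anywhere near a report.
    # Here the rule is simply that symbols must be on an approved list.
    raw = pd.read_csv("raw_genes.csv", dtype=str)
    approved = set(pd.read_csv("approved_symbols.csv", dtype=str)["symbol"])

    bad = raw[~raw["gene"].isin(approved)]
    if not bad.empty:
        bad.to_excel("exceptions.xlsx", index=False)  # exception report

    # Load only the rows that passed validation into the database.
    db = sqlite3.connect("genomics.db")
    raw[raw["gene"].isin(approved)].to_sql(
        "genes", db, if_exists="replace", index=False)

    # Step 2: query the clean data and publish it into the template.
    report = pd.read_sql("SELECT * FROM genes ORDER BY gene", db)
    report.to_excel("genomics_template.xlsx", index=False)
    db.close()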
Ed & I are part of the European Spreadsheet Risks Interest Group, by the way (and I brought this post to the EuSprIG’s attention).
I’m going to elevate the issue to a higher level — I am part of the actuarial profession, and in the U.S., we have Actuarial Standards of Practice, which impose responsibility on practicing actuaries on a variety of issues. One is ASOP 23, on data quality:
http://www.actuarialstandardsboard.org/asops/data-quality/
Because of this ASOP, actuaries have set up their own checklists for data validation, with checks like: did the number of policies in the file match the number of policies the actuarial modeling system imported? Am I getting face amounts summing up properly? Those sorts of checks.
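In code, a checklist like that can be as blunt as a couple of assertions. A toy Python sketch (file and column names invented):

    import pandas as pd

    source = pd.read_csv("policies_source.csv")
    imported = pd.read_csv("policies_imported.csv")

    # Reconciliation checks: counts and control totals must match
    # before any modelling starts.
    assert len(source) == len(imported), "policy counts differ"
    assert abs(source["face_amount"].sum()
               - imported["face_amount"].sum()) < 0.01, "face amounts differ"
    print("reconciliation passed:", len(imported), "policies")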
There are professional disciplinary repercussions if one does not follow these ASOPs. Here is a public notice that mentions ASOP 23 specifically, where the subject actuary was suspended for a year as a result:
https://www.actuary.org/node/13660
Ed mentioned banking and securities. I’m in insurance. It’s not just professional standards, but also regulators and auditors looking over our shoulders — if we screw up, there are consequences. Because it matters if we get something wrong.
By not even checking your own work, you’re indicating it doesn’t matter. Think about that for a moment. Yes, this is an extremely nasty way of putting it, but the sloppiness in academic modeling and data practices is embarrassing in many of these cases. (and the Retraction Watch readers should be very familiar with this.)
Perhaps some genomics-specific journals could get out in front of the others by including, as part of their data-submission requirements, data reviews that search for these data-transformation errors.
Researchers knowing that these errors will be searched for may then have sufficient incentive to actually review the data files they submit.
First cited link is dead (http://10.0.49.144/f1000research.8414.1). Can anyone update? I couldn’t find anything that looked like the intended citation.
Fixed.
This thread reminds me of a scene from Star Trek 4:
We’re dealing with medievalism here. CSV files. Excel templates.
https://youtube.com/clip/UgkxwRAFafnTY4F3LcrDx1SUZ9BK3s8BhL66?si=wsOEYn_T23MbMymz
These problems have been solved by software engineering and relational databases for decades. For some reason, Ph.D. scientists aren’t required to take any courses in such fields and instead play with medieval tools or maybe hire “talented MS Access developers”. lol. Either learn the correct tools and technologies or spend some of those millions on actual talent instead of on more VC profit taking.
The last time I hired a first-rate programmer (as an academic science PI) I paid her the maximum I was legally allowed. It was somewhat more than I was getting paid. I lost her to an IT firm which offered 1.5x my maximum. Just saying, not all scientists are rolling in money, and we have to compete with the tech industry for employees. It’s not easy.
(Tip for others in the same situation: I was able to keep her for several years by offering absolutely flexible work hours. Get a call from your kids’ school, need to run off? Just go. No questions asked. Worked until the kids needed a college fund, and then I couldn’t blame her for taking the industry offer.)
Some rather arrogant young folk posting here need a little perspective. Some of these comments are great, but the snarky ones are not at all helpful to mid- to late-career researchers. Many of them are reeling from the “red queen” problem of keeping up with new tools while also being swamped with things like running the training grants that pay your students, reviewing duties, departmental admin duties, etc. Students ARE offered classes in modern data handling, but faculty find it very hard to get off their ever-faster-spinning hamster wheels and update themselves.
I was once really good at Fortran and VMS scripting. I knew what order to read in the variables of large 3D data sets to prevent excess page faulting on a Vax, etc. Then my primary type of experiment at the time (crystallography) became so advanced that professionally written software was far better than what someone who was also trying to think about the biochemistry behind the structures could write. My computer skills got out of date.
Nobody wants to, or should, do the same type of experiment for an entire career. Many of us now do the occasional, say, RNA-seq or other ’omics experiment. We do try to consult with the available experts (sometimes hard to find if that’s not your department’s focus), but we don’t have the time to become experts ourselves in everything we need to handle, nor do we have the money to hire “talented MS Access developers.”
Yes there is a problem. Please be constructive about it.