Note: This is a special blog post coauthored by The Mad Virologist and The Blood-Brain Barrier Scientist (this article will be co-published on both our blogs). Another post has already been published on this paper, but we wanted to take a deeper look at everything that is wrong with this paper.
[UPDATE2] The study in question got retracted according to RetractionWatch:
[UPDATE] I would strongly recommend that readers look at the comments on PubPeer about this paper. It is terrifying to think how it made it through peer review.
A recent paper by ophthalmologist Chris Shaw was published and immediately touted as being proof positive that the aluminum adjuvants found in some vaccines are responsible for causing autism. Before we get into the paper, I have a few choice things to say about Chris Shaw. Despite not being an immunologist, Shaw has ventured into studying how vaccines and vaccine adjuvants supposedly cause neurological disorders such as autism. Shaw made headlines in 2016 when a paper he co-authored, which claimed to show a link between the HPV vaccine and neurological disorders, was retracted after being accepted by the journal Vaccine. It turns out that the statistics used in the paper were completely inappropriate and there were undisclosed conflicts of interest for some of the authors, including Shaw. These issues should have prevented the paper from being accepted in the first place, but mistakes do happen and science tends to be self-correcting. More surprising is that Shaw claimed that he didn’t know why the paper was retracted and that the science was of the highest quality. Shaw’s previous work has also been described by the WHO as deeply flawed and rejected by that body. This isn’t being brought up to dismiss the paper out of hand but to help illustrate why Shaw’s work deserves additional scrutiny. Hopefully by the end of this post, the logic behind the need for additional scrutiny of anything Shaw publishes will be abundantly clear. We’ll begin by examining the methods used by Shaw’s research group and point out some of the issues.
Background for experimental design flaws: PK and species issues
One recurring problem with Shaw's work is that his “vaccination schedule” treats rodents, such as mice and rats, as humans in miniature. It is wrong to assume that rodents and humans are alike: there are notable physiological differences between rodents and non-rodents. For example, a couple of studies by Terasaki and colleagues (http://onlinelibrary.wiley.com/doi/10.1111/j.1471-4159.2011.07208.x/abstract) have shown differences in the expression of solute carriers and drug transporters at the blood-brain barrier. We cannot exclude that such differences may bias the outcomes observed in his studies, but this bias applies intrinsically to any in vivo study based on a rodent model.
There is also the issue of mapping the vaccination schedule onto brain maturation. In this study (as well as in the previous ones), Shaw and colleagues consider that applying vaccines from post-natal day (PND) 3 to 12 is representative of a human infant vaccine schedule. The literature is not unanimous on this mapping: earlier work from Clancy and colleagues mapped PND12 to the 7th gestational month in humans (https://blogs.cornell.edu/bfinlay/files/2015/06/ClancyNeurosci01-17kkli7.pdf), while some more recent publications map PND21 to the 6th postnatal month in humans, which places PND12 around the 3rd month of infancy following full-term birth (http://www.sciencedirect.com/science/article/pii/S2352154615001096). You can easily appreciate that by following Shaw's flawed experimental design, the total amount of Al administered over a 2-year period was in effect administered within 90 days of birth, whereas the vaccination schedule according to the CDC does not start before the 2nd month of infancy, if we exclude the two injections of Hepatitis B vaccine at birth and after the first month respectively (https://www.cdc.gov/vaccines/schedules/hcp/imz/child-adolescent.html).
In addition to the flaw in the experimental design, we cannot exclude differences in the pharmacokinetic profile of Al adjuvants between mice and humans. The available data are fairly limited, but a recent study from Kim and colleagues (https://www.ncbi.nlm.nih.gov/pubmed/26437923) failed to show significant brain uptake of Al compared to controls following a single oral administration of different Al oxide nanoparticles at a dose of 10 mg/kg. Furthermore, Shaw's approximation of the total burden of Al from vaccines (550 microg/kg) is not an accurate metric, because absorption, distribution and elimination occur simultaneously as a dynamic process. A daily burden of Al from vaccines is a much more reliable parameter to consider. Yokel and McNamara (https://www.ncbi.nlm.nih.gov/pubmed/11322172) estimated it at about 1.4-8 microg/day, based on 20 injections spanning a 6-year period in a 20 kg individual.
If we follow Shaw's calculation, then the total burden at age 6 would be 1650 microg/kg, or 33,000 microg for a 20 kg 6-year-old child. That’s about 15 microg/day of daily Al burden from vaccines, a value that is 2- to 10-fold higher than what is applied to humans. This makes the comparison one of apples to oranges: Shaw's experimental paradigm is flawed and not representative of a clinical scenario.
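For readers who want to check the arithmetic above, here is a minimal sketch in Python. The numbers are the ones quoted in the text; spreading the dose evenly over 365-day years is our own simplifying assumption and ignores the absorption and elimination kinetics discussed earlier.

```python
# Figures below are the ones quoted in the text; the 365-day year is our simplification.
TOTAL_BURDEN_UG_PER_KG = 1650.0   # cumulative Al burden at age 6 under Shaw's calculation
BODY_WEIGHT_KG = 20.0             # body weight of the 6-year-old child
YEARS = 6
DAYS_PER_YEAR = 365

# Daily burden range reported by Yokel and McNamara, in microg/day
YOKEL_LOW_UG_PER_DAY = 1.4
YOKEL_HIGH_UG_PER_DAY = 8.0


def daily_burden_ug(total_ug_per_kg, weight_kg, years):
    """Spread a cumulative dose evenly over the period (ignores kinetics)."""
    total_ug = total_ug_per_kg * weight_kg
    return total_ug / (years * DAYS_PER_YEAR)


shaw_daily = daily_burden_ug(TOTAL_BURDEN_UG_PER_KG, BODY_WEIGHT_KG, YEARS)
fold_vs_high = shaw_daily / YOKEL_HIGH_UG_PER_DAY   # lower bound of the fold difference
fold_vs_low = shaw_daily / YOKEL_LOW_UG_PER_DAY     # upper bound of the fold difference

print(f"Total burden: {TOTAL_BURDEN_UG_PER_KG * BODY_WEIGHT_KG:.0f} microg")
print(f"Daily burden: {shaw_daily:.1f} microg/day")
print(f"Fold over Yokel and McNamara: {fold_vs_high:.1f}x to {fold_vs_low:.1f}x")
```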
Selection of genes to measure:
Selecting which genes to measure is a crucial step in a study like this. If care is not given to ensure that the correct genes are selected, then the study will be a wasted effort. Shaw stated in the paper that they selected genes from a previously published paper. However, not all of the genes that they measured came from that paper. Only 14 of the genes did (KLK1, NFKBIB, NFKBIE, SFTPB, C2, CCL2, CEBPB, IFNG, LTB, MMP9, TNFα, SELE, SERPINE1, and STAT4). This leaves 17 genes that were measured but not found in that paper. Two of these can be explained. One gene, ACHE, was mentioned as having been selected because of other work, so it is sourced. The second gene is the internal control gene beta-actin. This is a housekeeping gene that is often used as an internal control against which relative expression is calculated. This leaves 15 genes unaccounted for. We suspect that these genes were selected because they are involved in the innate immune response, but no reason is stated in the paper.
The way these genes were selected is problematic. Because half of the genes seemed to be selected for uncited reasons, this study is what is known in science as a “fishing expedition.” There’s nothing inherently wrong with this type of research and indeed it can lead to new discoveries that expand our understanding of the natural world (this study that increased the number of sequenced viral genomes by nearly tenfold is a good example of this). But what fishing expeditions can show is limited. These types of studies can lead to other studies but they do not show causality. Shaw is claiming causality with his fishing expedition here.
There is also the problem that they used old literature to select their gene targets when much more recent research has been done. By happenstance, they did measure some of the same genes as these more recent studies. However, their results do not match what has been measured in children who have been diagnosed with autism. For example, RANTES was shown to be decreased in children with autism. In Shaw’s work there was no statistical difference in RANTES expression between mice given the aluminum treatment and those receiving saline. Likewise, MIP1alpha was shown to be decreased in developmentally delayed children but was shown to be increased in the aluminum treated mice. This was also the case for IL-1β, which was found to be elevated in children with moderate autism, yet there was no statistical difference between the mice receiving the aluminum treatment and those receiving saline. In fact, IL-4 was the only gene to follow an expression pattern similar to what was found in children with severe autism (elevated in both cases). However, there is something odd with the gel in this case. This was the image for figure 4 that was included in the online version of the paper (we have not altered the image in any way). Look closely at the top right panel at the IL-4 samples and the IL-6 samples. You’ll notice that the bands for the control and the aluminum treated mice have different color backgrounds (we enlarged the image to help highlight this but did not adjust the contrast). If these came from the same gel, there would not be a shift in color like this where the treated bands have a lighter color encircling them. The only way this could happen is if the gel was assembled in Photoshop. The differences could be real; however, since this image was modified we do not know for sure, and this is scientific misconduct. Papers get retracted for this all the time and people have lost their degrees for doing this in their dissertations.
These gel results cannot be trusted and the paper hinges on them. The Western blots and issues with them will be discussed below.
The unaltered figure 4.
A close up of the panel with the regions in question highlighted.
In order to quantify the gene expression levels of the genes that Shaw’s group selected, they used an older technique called semi-quantitative RT-PCR. This technique uses the exponential increase in PCR products in order to show differences between expression of a gene under different conditions. There’s nothing wrong with the technique provided one understands what the limitations are. Let’s say you have a large number of genes that you want to measure expression of, but you aren’t sure which genes are going to be responsive and you have limited funds. Semi-quantitative RT-PCR is a good method to screen for specific genes to be examined further by more precise techniques, such as Real-Time RT-PCR, but it’s not appropriate to use this technique and then make statements about precise quantification. Where semi-quantitative RT-PCR excels is with genes that are normally not expressed but can be expressed after some sort of stimulus, such as terpene biosynthesis genes that are induced by insect feeding.
To put it bluntly, semi-quantitative RT-PCR was not used properly in the paper by Shaw. The way that it was used implied that it would be quantitative when the technique is not that precise. Without verification by another method, ideally Real-Time PCR which can determine what the exact abundance of a given target is, these results should be taken with a grain of salt. This would still be the case if there weren’t irregularities in the gel images. With those irregularities, this is absolutely essential and should have prevented this paper from being accepted.
Western-blots and data manipulation
PCR and Western-blots data: the owl is not what it seems
As The Mad Virologist mentioned, semi-quantitative PCR is an old-fashioned RNA quantitation method; Real-Time quantitative PCR (which quantifies the amplification product at each cycle, using a fluorescent dye as an indicator) is a much more accepted method nowadays (see his section for more details). For Western-blots, the semi-quantitative approach is more accepted, but it is important that what you show (qualitative) is consistent with what you count (quantitative). In Western-blot analysis, we measure the relative darkness of a protein band (the black lines that you see in papers) between treatments and controls. Because you cannot exclude errors due to the amount of protein loaded, we also measure the band intensity of proteins that are very abundant, usually referred to as housekeeping proteins (because they play essential functions in cells). In this case, beta-actin (named ACT in the paper) was used.
Once you normalize to beta-actin, you can assess the effect of a treatment by comparing the relative band intensity ratios. In both cases (semi-quantitative PCR and Western-blots), “what you see is what you measure”: you have to show a “representative Western-blot” alongside the quantitative data to demonstrate that your quantification matches the band densities. The common practice is to use image analysis software like ImageJ to determine band density. Showing a Western-blot is nice, but not foolproof. Indeed, Western-blot data (along with fluorescence images) are among the most common means by which some researchers manipulate or even falsify data, and also the most common type of data to spark a paper retraction. Someone notices something odd on a Western-blot, raises questions with the editors, and asks for access to the full dataset (usually the X-ray film or the original full scan of the blot). Often, the authors will use the excuse “the dog ate the flash drive” or “the hard drive containing the data crashed” if they cannot provide such data.
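To make the normalization step described above concrete, here is a small sketch in Python. All band densities are made-up numbers for illustration (they are not taken from the paper), and the function name `actin_normalized` is ours. The point is only to show how dividing each target band by its own actin band can erase an apparent difference that was due to uneven loading.

```python
from statistics import mean


def actin_normalized(target_densities, actin_densities):
    """Divide each target band density by the actin band from the same lane."""
    if len(target_densities) != len(actin_densities):
        raise ValueError("each target band needs a matching actin band")
    return [t / a for t, a in zip(target_densities, actin_densities)]


# Hypothetical densitometry readings (arbitrary units), one value per lane.
control_target = [1200.0, 1100.0, 1300.0]
control_actin = [4000.0, 4000.0, 4000.0]
treated_target = [1500.0, 1800.0, 1200.0]
treated_actin = [5000.0, 6000.0, 4000.0]   # more protein was loaded in some lanes

# Comparing raw target bands suggests an increase in the treated group...
raw_fold = mean(treated_target) / mean(control_target)

# ...but the actin-normalized ratios are the same in both groups.
control_ratios = actin_normalized(control_target, control_actin)
treated_ratios = actin_normalized(treated_target, treated_actin)
normalized_fold = mean(treated_ratios) / mean(control_ratios)

print(f"Raw fold change:        {raw_fold:.2f}")
print(f"Normalized fold change: {normalized_fold:.2f}")
```

This is also why the representative blot and the quantification have to agree: the ratios are only as trustworthy as the bands they were measured from.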
There are some ways to spot and guard against image manipulation in Western-blots: playing with the brightness/contrast, requesting quantitative data in addition to a representative blot, and requiring that all samples come from the same gel (you cannot use a cookie-cutter and build your own perfect gel). There is an excellent article that describes the pitfalls and cases of bad Western-blot data representation, if not image manipulation (https://www.elsevier.com/editors-update/story/publishing-ethics/the-art-of-detecting-data-and-image-manipulation). At this time, several issues have been raised about both the Western-blot pictures and their subsequent analysis, calling into question the reliability of the data presented in this study.
In this post, we have used the full-resolution pictures provided by the journal website (http://www.sciencedirect.com/science/article/pii/S0162013417300417). We simply opened these pictures in ImageJ, converted them to 8-bit format, inverted the lookup tables (LUT) and adjusted the brightness and contrast. We then exported the pictures to PowerPoint to ease annotation and comments. We encourage readers to download the full-resolution images and judge for themselves.
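For readers without ImageJ, the idea behind these adjustments can be sketched in a few lines of plain Python. This is not ImageJ's implementation, only an illustration of the principle on a tiny made-up pixel patch: inverting and then stretching a narrow intensity window makes a small difference between two backgrounds much easier to see, without touching the relationship between pixels.

```python
def invert(pixels):
    """Invert an 8-bit grayscale image (mimics the visual effect of inverting the LUT)."""
    return [[255 - v for v in row] for row in pixels]


def stretch_contrast(pixels, low, high):
    """Linearly map the window [low, high] onto [0, 255], clamping values outside it."""
    def scale(v):
        scaled = (v - low) * 255 // (high - low)
        return max(0, min(255, scaled))
    return [[scale(v) for v in row] for row in pixels]


# Hypothetical 8-bit patch: two slightly different backgrounds (230 and 236)
# and one dark band (60). These values are invented for illustration.
patch = [
    [230, 230, 236, 236],
    [230, 60, 60, 236],
    [230, 230, 236, 236],
]

inverted = invert(patch)
enhanced = stretch_contrast(inverted, low=10, high=40)

# Difference between the left and right backgrounds, before and after.
gap_before = abs(patch[0][0] - patch[0][3])
gap_after = abs(enhanced[0][0] - enhanced[0][3])
print(f"Background gap before: {gap_before}, after: {gap_after}")
```

A background mismatch that is barely visible in the published figure becomes obvious once the window is narrowed, which is what we did for the close-ups in this post.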
The first concern arises from Figure 1C. First, this is the original Fig.1.
Then, this is the close-up analysis for Fig.1C
There are several issues. First, some bands appear to be spliced, meaning that the authors created a custom blot by assembling bands from different gels. This is a no-no in Western-blots: all bands shown in a blot should come from the same gel. This is why Western-blots are torture for graduate students and postdocs: you need to show your best blot, with all bands showing the same behavior as in your quantitative analysis.
Second, a rectangular grey patch has been added on top of the control 3 TNF band. This is possible data manipulation and fraud, as you are deliberately masking a band and hiding it. That's a big red flag for the paper. The third issue with Fig.1C is the consistent impression that bands were either cropped onto a grey rectangle or subjected to what I call “Photoshop brushing”, in which you use the brush tool to paint over areas of the gel that you consider not good-looking enough. You can clearly see it with actin, where there is a sharp line between the blurred blot and a crisp, uniform grey in the bottom half of the blot, compared to the wavy top of the blot. This is a grey area that I have not encountered with Western-blots, but it is a no-no for any immunofluorescence picture. Any image manipulation that goes beyond brightness/contrast adjustment and involves alteration of the acquired picture is considered data manipulation. If you re-analyze the data after correcting for the inconsistencies of Figure 1C, the graph looks quite different and fails to show any difference between Al-treated and control animals, provided you refrain from over-normalizing and simply plot the protein/actin band density ratios.
What is also concerning and surprising is the authors' conclusion that males, not females, show an inflammatory response. Of course, the authors did not show the corresponding outcomes from female animals and expect us to trust them on this. The problem is that such a conclusion is in direct contradiction with the literature. There is a solid body of literature supporting the presence of sexual dimorphism in the inflammatory response, in particular in neuroinflammation and autoimmune disorders such as multiple sclerosis (https://www.ncbi.nlm.nih.gov/pubmed/28647490; https://www.ncbi.nlm.nih.gov/pubmed/27870415). There is also a growing call for the scientific community to provide results for both sexes (males and females alike). Although Shaw reports the study was performed in both males and females, he gives us this explanation at the end of section 3.1: “Taken together, a number of changes indicative of the activation of the immune-mediated NF-κB pathway were observed in both male and female mice brains as a result of Al-injection, although females seemed to be less susceptible than males as fewer genes were found altered in female brains.“
Yet the interesting part comes when Shaw tries to compare IκB phosphorylation between males and females following Al injection (Fig.3C). When you analyze the data, concerns arise very rapidly. First, we have a possible case of a cookie-cutter band, in which a band that looks nice enough is simply pasted into a blank space. This is very suspicious, as you can make up data as easily as this. Second, there is again this “Photoshop brushing/erasing” taking place in that figure, in which I suspect a case of fraudulent activity. As you can see in the female samples, it is as if someone tried to mask some bands that should not have been there. Remember when he said that males but not females showed an inflammatory response? Is this an attempt to conceal data that contradict his claims?
Again, let's bring up Figure 3 at its full resolution.
Finally, the same issues are persistent and even more obvious in Fig.5A. Again, we have a mixture of different Western-blot image manipulations, including band splicing, Photoshop brushing, cookie-cutter bands...
First, the unedited picture:
And below the close up of Fig.5A
These are serious concerns that call into question the credibility of this study and can only be addressed by providing full-resolution (300dpi) versions of the original blots (X-ray films or the original picture files generated by the gel acquisition camera). There has been a lot of chatter on PubPeer discussing this paper, and many duplicated bands and other irregularities have been identified by the users there. If anyone is unsure of how accurate the results are, we strongly suggest looking at what has been identified on PubPeer, as it suggests that the results are not entirely accurate. Until the original gels and Western blots have been provided, it looks like the results were manufactured in Photoshop.
Long-time followers know that I tend to go right to the statistics that are used in papers to see if what they are claiming is reasonable or not. Poor use of statistics has been the downfall of many scientists, even if they are making honest mistakes. It’s a common problem that scientists have to be wary of. One easy solution is to consult with a statistician before submitting a paper for publication. These experts can help point out whether the statistical tests that were run are the correct ones or not. The Shaw paper could have benefited from this expertise. They used a Student’s T Test for all of their statistics comparing the control to the aluminum treated. This is problematic for a couple of reasons. These aren’t independent tests and the data likely do not have a normal distribution, so a T Test isn’t appropriate. Better statistical tests would have been either Hotelling’s T-squared test or Tukey’s HSD. Another issue is that the authors used the standard error (SE) instead of the standard deviation (SD). To understand why this matters, it helps to understand what the SE and the SD each measure. The SD measures the variation in samples and how far the measurements are from the mean of the measurements. A smaller SD means that there is low variability in the measurements. The SE estimates how far the sample mean is likely to be from the true population mean; it is the SD divided by the square root of the sample size, so it shrinks as more samples are added even if the underlying variability stays the same. Both the SE and SD can be used; however, using the SE is not always appropriate, especially if you are trying to use it as a descriptive statistic (in other words, if you are trying to summarize data). Simply put, the SE is an estimate and only describes the uncertainty in the sample mean relative to the population mean. If you are trying to show descriptive statistics, then you need to use the SD. The misuse of the SE when the SD needs to be shown is a common mistake in many research publications.
In fact, this is what the GraphPad manual has to say about when to use the SD and when to use the SE:
“If you want to create persuasive propaganda:
“If your goal is to emphasize small and unimportant differences in your data, show your error bars as SEM, and hope that your readers think they are SD. If our goal is to cover-up large differences, show the error bars as the standard deviations for the groups, and hope that your readers think they are a standard errors.” This approach was advocated by Steve Simon in his excellent weblog. Of course he meant it as a joke. If you don’t understand the joke, review the differences between SD and SEM.” The bottom line is that there is an appropriate time to use the SE but not when you are trying to summarize data.
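To see why error bars drawn as SE look so much tighter than error bars drawn as SD, here is a minimal sketch using only Python's standard library. The measurements are invented for illustration and have nothing to do with the paper's data; the second dataset is simply the first one repeated, to show the effect of sample size.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements from n=5 samples (arbitrary units).
measurements = [10.0, 12.0, 14.0, 16.0, 18.0]

n = len(measurements)
sd = stdev(measurements)   # sample standard deviation: spread of the data
sem = sd / sqrt(n)         # standard error of the mean: uncertainty of the mean

# Same spread of values, four times as many samples.
larger = measurements * 4
sd_large = stdev(larger)
sem_large = sd_large / sqrt(len(larger))

print(f"n={n}:  mean={mean(measurements):.1f}, SD={sd:.2f}, SE={sem:.2f}")
print(f"n={len(larger)}: mean={mean(larger):.1f}, SD={sd_large:.2f}, SE={sem_large:.2f}")
```

The SD stays in the same range as the sample grows because it describes the data, while the SE keeps shrinking because it describes how well the mean is pinned down. Showing the SE as if it summarized the spread of the data makes noisy results look cleaner than they are.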
Another issue is the number of animals used in the study. The consensus in published studies is to use the minimal number of animals (usually n=8) needed to achieve statistical significance while keeping that number low enough to ensure proper welfare and humane consideration for lab animals. In this study, the number is smaller (n=5). The authors also create confusion by blurring the lines between biological replicates (n=5) and technical replicates (n=3). By definition, biological replicates are measurements from different organisms and are essential for statistical analysis, as these replicates are independent from each other. Technical replicates are dependent on each other, as they are repeated measurements of the same biological sample. By treating the latter as statistically relevant, you bias yourself toward mistaking a fluke for a biological phenomenon.
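The difference between the two kinds of replicates can be illustrated with a short sketch. The values below are invented (5 hypothetical animals, each measured 3 times) and we do not know exactly how the authors handled their replicates. The sketch only shows that pooling technical replicates as if they were 15 independent animals gives a misleadingly small standard error compared with averaging them per animal first.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: 5 animals (biological replicates),
# each measured 3 times (technical replicates).
animals = [
    [1.0, 1.1, 0.9],
    [1.4, 1.5, 1.3],
    [0.8, 0.9, 0.7],
    [1.2, 1.3, 1.1],
    [1.0, 1.1, 0.9],
]

# Appropriate: collapse technical replicates to one value per animal, so n=5.
per_animal = [mean(replicates) for replicates in animals]
sem_proper = stdev(per_animal) / sqrt(len(per_animal))

# Inappropriate: treat all 15 measurements as independent, so n=15.
pooled = [value for replicates in animals for value in replicates]
sem_pooled = stdev(pooled) / sqrt(len(pooled))

print(f"n={len(per_animal)} (per animal): SE={sem_proper:.3f}")
print(f"n={len(pooled)} (pooled):     SE={sem_pooled:.3f}")
```

The pooled calculation shrinks the error bars without adding any new animals, which makes it easier to reach "significance" by accident.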
Based on the methods that were used in this paper, Shaw et al. went too far in declaring that aluminum adjuvants cause autism. But there are six other key points that limit what conclusions can be drawn from this paper:
1) They selected genes based on old literature and ignored newer publications.
2) The method for PCR quantification is imprecise and cannot be used as an absolute quantification of expression of the selected genes.
3) They used inappropriate statistical tests that are more prone to giving significant results which is possibly why they were selected.
4) Their dosing regime for the mice makes assumptions on the development of mice that are not correct.
5) They gave the mice far more aluminum sooner than the vaccine schedule exposes children to.
6) There are irregularities in both the semi-quantitative RT-PCR and Western blot data that strongly suggest that these images were fabricated. This is probably the most damning thing about the paper. If the data were manipulated and images fabricated, then the paper needs to be retracted and UBC needs to do an investigation into research misconduct by the Shaw lab.
Maybe there’s a benign explanation for the irregularities that we’ve observed, but until these concerns are addressed this paper cannot be trusted.