Tuesday, June 28, 2016

Reproducibility, reputation and playing the long game in science

Every so often these days, something will come up again about how most research findings are false. Lots of ink has already been spilled on the topic, so I won’t dwell on the reproducibility issue too long, but the whole thing has gotten me thinking more and more about the meaning and consequences of scientific reputation.

Why reputation? Reputation and reproducibility are somewhat related but clearly distinct concepts. In my field (I guess?) of molecular biology, I think that reputation and reproducibility are particularly strongly correlated, because the nature of the field is such that perceived reproducibility is heavily tied to the large number of judgement calls you have to make in the course of your research. As such, perhaps reputation has evolved as the best way to measure reproducibility in this area.

I think that this stands in stark contrast with the more common diagnosis one sees these days for the problem of irreproducibility, which is that it's all down to statistical innumeracy. Every so often, I’ll see tweets like this (names removed unless claimed by owner):

The implication here is that the problem with all this “cell” biology is that the Ns are so low as to render the results statistically meaningless. The implicit solution to the problem is then “Isn’t data cheap now? Just get more data! It’s all in the analysis, all we need to do is make that reproducible!” Well, if you think that GitHub accounts, pre-registered studies and IPython notebooks will magically solve the reproducibility problem, think again. Better statistical and analysis-management practices are of course good, but to me, the excessive focus on these solutions ignores the bigger point, which is that, especially in molecular and cellular biology, good judgement about your data and experiments trumps all. (I do find it worrying that statistics has somehow evolved to the point of absolving us of responsibility for the scientific inferences we make (“But look at the p-value!”). I think this statistical primacy is perhaps part of a bigger—and in my opinion, ill-considered—attempt to systematize and industrialize scientific reasoning, but that’s another discussion.)

Here’s a good example from the (infamous?) study claiming to show that aspartame induces cancer. (I looked this over a while ago given my recently acquired Coke Zero habit. Don’t judge.) Here’s a table summarizing their results:

The authors claim that this shows an effect of increased lymphomas and leukemias in the female rats across the entire dose range of aspartame. And while I haven’t done the stats myself, looking at the numbers, the claim seems statistically valid. But the whole thing really hinges on the one control datapoint for the female rats, which is (seemingly strangely) low compared to virtually everything else. If that number were, say, 17% instead of 8%, I’m guessing essentially all the statistical significance would go away. Is this junk science? Well, I think so, and the FDA agrees. But I would fully agree that this is a judgement call, and in a vacuum it would require further study—in particular, to me, it looks like there is some overall increase in cancers in these rats at very high doses. While that increase is not statistically significant in their particular statistical treatment, my feeling is that there is something there, although probably just a non-specific effect arising from the crazy high doses they used.
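Just to make that judgement call concrete, here’s a quick sensitivity check in Python with entirely hypothetical counts (I’m assuming 100 rats per group purely for illustration; the real study’s group sizes differ, and only the 8% and 17% figures come from the discussion above):

```python
# Toy sensitivity check: how much does significance hinge on the one
# low control value? All counts here are hypothetical.
from scipy.stats import fisher_exact

n = 100             # assumed rats per group (illustration only)
treated_cases = 20  # hypothetical treated-group incidence of 20%

for control_rate in (0.08, 0.17):
    control_cases = round(n * control_rate)
    table = [[treated_cases, n - treated_cases],
             [control_cases, n - control_cases]]
    odds_ratio, p = fisher_exact(table)
    print(f"control at {control_rate:.0%}: p = {p:.3f}")
# With the control at 8%, the difference looks significant;
# bump the control to 17% and the significance evaporates.
```

The particular numbers don’t matter; the point is that a single anomalous control value can carry all of the statistical weight.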

Hey, you might say, that’s not science! Discarding data points because they “seem off” and pulling out statistically weak “trends” for further analysis? Well, whatever, in my experience, that’s how a lot of real (and reproducible) science gets done.

Now, it would be perfectly reasonable of you to disagree with me. After all, in the absence of further data, my inklings are nothing more than an opinion. And in this case, at least we can argue about the data as it is presented. In most papers in molecular biology, you don’t even get to see the data from all the experiments that, for whatever reason, went unreported. The selective reporting of experiments sounds terrible, and is probably responsible for at least some amount of junky science, but here’s the thing: I think molecular biology would be uninterpretable without it. So many experiments fail or give weird results for so many different reasons, and reporting them all would leave an endless maze that would be impossible to navigate sensibly. (I think this is a consequence of studying complex systems with relatively imprecise—and largely uncalibrated—experimental tools.) Of course, such a system is ripe for abuse, because anyone can easily leave out a key control that doesn’t go their way under the guise of “the cells looked funny that day”, but then again, there are days when the cells really do look funny. So basically, in the end, you are stuck with trust: you have to trust that the person you’re listening to made the right decisions, that they checked all the boxes that you didn’t even know existed, and that they exhibited sound judgement. How do you know what work to follow up on? In a vacuum, it’s hard to say, but that’s where reputation comes in. And when it comes to reputation, I think there’s value in playing the long game.

Reputation comes in a couple different forms. One is public reputation. This is the one you get from the talks you give and the papers you publish, and it can suffer from hype and sloppiness. People do still read papers and listen to talks (well, at least sometimes), and eventually they will notice if you cut corners and oversell your claims. Not much to say about this except that one way to get a good public reputation is to, well, do good science! Another important thing is to just be honest. Own up to the limitations of your work; it’s pretty easy to sniff out someone who’s being disingenuous (as the lawyerly answers from Elizabeth Holmes have shown), and I’ve found that people will actually respect you more if you just straight up say what you really think. Plus, it makes people think you’re smart if you show you’ve already thought about all the various problems.

Far more murky is the large gray zone of private reputation, which encompasses all the trust in the work that you don’t see publicly. This is going out to dinner with a colleague and hearing “Oh yeah, so-and-so is really solid”… or “That person did the same experiment 40 times in grad school to get that one result” or “Oh yeah, well, I don’t believe a single word out of that person’s mouth.” All of which I have heard, and don’t let me forget my personal favorite “Mr. Artifact bogus BS guy”. Are these just meaningless rumors? Sometimes, but mostly not. What has been surprising to me is how much signal there is in this reputational gossip relative to noise—when I hear about someone with a shady reputation, I will often hear very similar things independently from multiple sources.

I think this is (rightly) because most scientists know that spreading science gossip about people is generally something to be done with great care (if at all). Nevertheless, I think it serves a very important purpose, because there’s a lot of reputational information that is just hard to share publicly. There are many reasons for this, among them that the burden of proof for calling someone out publicly is very high, the potential for negative fallout is large, and you can easily develop your own now-very-public reputation for being a bitter, combative pain in the ass. A world in which all scientists called each other out publicly on everything would probably be non-functional.

Of course, this must all be balanced against the very significant negatives to scientific gossip. It is entirely possible that someone could be unfairly smeared in this way, although honestly, I’m not sure how many instances of this I’ve really seen. (I do know of one case in which one scientist supposedly started a whisper campaign against another scientist about their normalization method or something suitably petty, although I have to say the concerns seemed valid to me.)

So how much gossip should we spread? For me, that completely depends on the context. With close friends, well, that’s part of the fun! :) With other folks, I’m of course far more restrained, and I try to stick to what I know firsthand, although it’s impossible to give a straight up rule given the number of factors to weigh. Are they asking for an evaluation of a potential collaborator? Are we discussing a result that they are planning to follow up on in the lab, thus potentially harming a trainee? Will they even care what I say either way? An interesting special case is trainees in the lab. I think they actually stand to benefit greatly from this informal reputational chatter. Not only do they learn who to avoid, but even just knowing the fact that not everyone in science can be trusted is a valuable lesson.

Which leads to another important problem with private reputations: if they are private, what about all the other people who could benefit from that knowledge but don’t have access to it? This failure can manifest in a variety of ways. People with less access to the scientific establishment (in smaller or poorer countries, for example) basically just have to take the literature at face value. The same can be true even within the scientific establishment; for example, in interdisciplinary work, you’ll often have one community that doesn’t know the gossip of another (there are lots of examples where I’ll meet someone who talks about a whole bogus subfield without realizing it’s bogus). And sometimes you just don’t get wind in time. The damage in terms of time wasted is real. I remember a time when our group was following up a cool-seeming result that ended up being bogus as far as we could tell, and I met a colleague at a conference, told her about it, and she said they saw the same thing. Now two people know, plus perhaps the handful of other people that I’ve mentioned this to. That doesn’t seem right.

At this point, I often wonder about a related issue: do these private reputations even matter? I know plenty of scientists with widely-acknowledged bad reputations who are very successful. Why doesn’t it stick? Part of it is that our review systems for papers and grants just don’t accommodate this sort of information. How do you give a rational-sounding review that says “I just don’t believe this”? Some people do give those sorts of reviews, but they come across as, again, bitter and combative, so most don’t. Not sure what to do about this problem. In the specific case of publishing papers, I often wonder why journal editors don’t get wind of these issues. Perhaps they are just in the wrong circles? Or maybe there are unspoken union rules about ratting people out to editors? Or maybe it’s just really hard not to send a paper to review if it looks strong on the face of it, and at that point, it’s really hard for reviewers to do anything about it. Perhaps preprints and more public discussion could help with this? Of course, then people would actually have to read each other’s papers…

That said, while the downsides of a bad private reputation may not materialize as often as we feel they should, the good news is that I think the benefits of a good private reputation can be great. If people think you do good, solid work, I think that people will support you even if you’re not always publishing flashy papers and so forth. It’s a legitimate path to success in science, and don’t let the doom and gloomers and quit-lit types tell you otherwise. How to develop and maintain a good private reputation? Well, I think it’s largely the same as maintaining a good public one: do good science and don’t be a jerk. The main difference is that you have to do these things ALL THE TIME. There is no break. Your trainees and mentors will talk. Your colleagues will talk. It’s what you do on a daily basis that will ensure that they all have good things to say about you.

(Side point… I often hear that “Well, in industry, we are held to a different standard, we need things to actually work, unlike in academia.” Maybe. Another blog post on this soon, but I’m not convinced industry is any better than academia in this regard.)

Anyway, in the end, I think that molecular biology is the sort of field in which scientific reputation will remain an integral part of how we assess our science, for better or for worse. Perhaps we should develop a more public culture of calling people out like in physics, but I’m not sure that would necessarily work very well, and I think the hostile nature of discourse in that field contributes to a lack of diversity. Perhaps the ultimate analysis of whether to spread gossip or do something gossip-worthy is just based on what it takes for you to get a good night’s sleep.

Saturday, June 11, 2016

Some thoughts on lab communication

I recently came across this nice post about tough love in science, and this passage at the start really stuck out:
My very first task in the lab as an undergrad was to pull layers of fungus off dozens of cups of tomato juice. My second task was PCR, at which I initially excelled. Cock-sure after a week of smaller samples, I remember confidently attempting an 80-reaction PCR, with no positive control. Every single reaction failed… 
I vividly recall a flash of disappointment across the face of one of my PIs, probably mourning all that wasted Taq. That combination—“this happens to all of us, but it really would be best if it didn’t happen again”—was exactly what I needed to keep going and to be more careful.
Now, communication is easy when it's all like "Hey, I've got this awesome idea, what do you think?" "Oh yeah, that's the best idea ever!" "Boo-yah!" [secret handshake followed by football head-butt]. What I love about this quote is how it perfectly highlights how good communication can inspire and reassure, even in a tough situation—and how bad communication can lead to humiliation and disengagement.

I'm sure there are lots of theories and data out there about communication (or not :)), but when it comes down to putting things into practice, I've found that simple rules or principles are often a lot easier to follow and to quantify. One that has been particularly effective for me is to avoid "you" language, which is the ultimate simple rule: just avoid saying "you"! Now that I've been following that rule for some time and thinking about why it's so effective at improving communication, I think there's a relatively simple principle beneath it that is helpful as well: if you're saying something for someone else's benefit, then good. If you're saying something for your own benefit, then bad. Do more of the former, less of the latter.

How does this work out in practice? Let's take the example from the quote above. As a (disappointed) human being, your instinct is going to be to think "Oh man, how could you have done that!?" A simple application of no-you-language will help you avoid saying this obviously bad thing. But there are counterproductive no-you-language ways to respond as well: "Well, that was disappointing!" "That was a big waste" "I would really double check things before doing that again". Perhaps the first two of these are obviously bad things to say, but I think the last one is counterproductive as well. Let's dissect the real reasons you would say "I would really double check before doing that again". Of course the trainee is going to be feeling pretty awful—people generally know when they've screwed up, especially if they screwed up bad. Anyone with a brain knows that if you screw up big, you should probably double check and be more careful. So what's the real reasoning behind telling someone to double check? It's basically to say "I noticed you screwed up and you should be more careful." Ah, the hidden you-language revealed! What this sentence is really about is giving yourself the opportunity to vent your frustration with the situation.

So what to say? I think the answer is to take a step back, think about the science and the person, and come up with something that is beneficial to the trainee. If they're new, maybe "Running a positive control every time is really a good idea." (unless they've already realized that themselves). Or "Whenever I scale up the reaction, I always check…" These bits of advice often work well when coupled with a personal story, like "I remember when I screwed up one of these big ones early on, and what I found helped me was…". I will sometimes invoke a mythic figure from the lab's recent past instead, since I'm old enough now that personal lab stories sound a little too "crazy old grandpa" to be very effective…

It is also possible that there is nothing to learn from this mistake and that it was just, well, a mistake. In that case, there is nothing you can say that is for anyone's benefit other than your own, and in those situations, it really is just better to say nothing. This can take a lot of discipline, because it's hard not to express those sorts of feelings right when they're hitting you the hardest. But it's worth it. If it's a repeated issue that's really affecting things, there are two options: 1. address it later during a performance review, or 2. don't. Often, with those sorts of issues, there's honestly not much difference in outcome between these options, so maybe it's just better to go with 2.

Another common category of negative communication is all the sundry versions of "I told you so". This is obviously something you say for your own benefit rather than the other person's, and indeed it is so clearly accusatory that most folks know not to say this specific phrase. But I think it is just one of a class of what I call "scorekeeping" statements, which serve only to remind people of who was right or wrong. Like "But I thought we agreed to…" or "Last time I was supposed to…" They're very tempting, because as scientists we are in the business of telling each other that we're right and wrong, but when you're working with someone in the lab, scoring these types of points is corrosive in the long term. Just remember that the next time your PI asks you to change the figure back the other way around for the 4th time… :)

Along those lines, I think it's really important for trainees (not just PIs) to think about how to improve their communication skills as well. One thing I hear often is "Before I was a PI, I got all this training in science, and now I'm suddenly supposed to do all this stuff I wasn't trained for, like managing people". I actually disagree with this. To me, the concept of "managing people" is sort of a misnomer, because in the ideal case, you're not really "managing" anyone at all, but rather working with them as equals. That implies an equal stake in and commitment to productive communications on both ends, which also means that there are opportunities to learn and improve for all parties. I urge trainees to take advantage of those opportunities. Few of us are born with perfect interpersonal skills, especially in work situations, and extra especially in science, where things change and go wrong all the time, practically begging people to assign blame to each other. It's a lot of work, but a little practice and discipline in this area can go a long way.

Wednesday, June 8, 2016

What’s so bad about teeny tiny p-values?

Every so often, I’ll see someone make fun of a really small p-value, usually along with some line like “If your p-value is smaller than 1/(number of molecules in the universe), you must be doing something wrong”. At first, this sounds like a good burn, but thinking about it a bit more, I just don’t get this criticism.

First, the number itself. Is it somehow because the number of molecules in the universe is so large? Perhaps this conjures up some image of “well, this result is saying this effect could never happen anywhere ever in the whole universe by chance—that seems crazy!”, and makes it seem like there’s some flaw in the computation or deduction. Pretty easy to spot the flaw in that logic, of course: configurational space can be much larger than the raw number of constituent parts. For example, let’s say I mix some red dye into a cup of water and then pour half of the dyed water into another cup. Now there is some probability that, randomly, all the red dye stays in one cup and no dye goes in the other. That probability is 1/(2^numberOfDyeMolecules), which is clearly going to be a pretty teeny-tiny number.

Here’s another example that may hit a bit closer to home: during cell division, the nuclear envelope breaks down, and so many nuclear molecules (say, lincRNA) must get scattered throughout the cell (and yes, we have observed this to be the case for e.g. Xist and a few others). Then, once the nucleus reforms, those lincRNA seem to be right back in the nucleus. What is the chance that the lincRNA just happened to be back in the nucleus by chance? Well, again, 1/(2^numberOfRNAMolecules) (assuming a 50/50 nucleus/cytoplasm split), which for many lincRNA, present at maybe 10 copies per cell, is probably like 1/1024 or so, but for something like MALAT1, with on the order of 2000 copies, would be 1/(2^2000) or so. I think we can pretty safely reject the hypothesis that there is no active trafficking of MALAT1 back into the nucleus… :)
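Just for fun, here’s the back-of-the-envelope version in Python (the copy numbers are the rough ones from above; the probabilities underflow ordinary floats, so it works in log10 space):

```python
# Chance that all n molecules land on one given side of a 50/50 split:
# (1/2)^n. For n in the thousands this underflows a float, so compute
# log10 of the probability instead.
import math

def log10_prob_all_one_side(n_molecules):
    """log10 of (1/2)^n: all n molecules on a given side by chance."""
    return -n_molecules * math.log10(2)

for n in (10, 2000):  # ~10 copies for a low-abundance lincRNA, ~2000 for MALAT1
    print(f"n = {n}: log10(p) = {log10_prob_all_one_side(n):.1f}")
# n = 10:   log10(p) ~ -3.0   (the 1/1024 above)
# n = 2000: log10(p) ~ -602.1 (vastly smaller than 1/(number of molecules
#                              in the observable universe), ~1e-80 or so)
```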

I think the more substantial concern people raise with these p-values is that when you get something so small, it probably means that you’re not taking into account some sort of systematic error; in other words, the null model isn’t right. For instance, let’s say I measured a slight difference in the concentration of dye molecules in the second cup above. Even a pretty small change will have an infinitesimal p-value, but the most likely scenario is that some systematic error is responsible (like dye getting absorbed by the material on the second glass or the glasses having slightly different transparencies or whatever). In genomics—or basically any study where you are doing a lot of comparisons—the same sort of thing can happen if the null/background model is slightly off for each of a large number of comparisons.
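Here’s a little simulation of that failure mode, with made-up numbers throughout: a small systematic offset that the null model ignores, spread over many measurements, gives an absurdly small p-value even though nothing biologically interesting is going on.

```python
# A slightly-off null: the data carry a tiny systematic offset
# (e.g., a batch effect), but we test against a null of zero offset.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
offset = 0.1  # small systematic error, 0.1 sd (hypothetical)
data = rng.normal(loc=offset, scale=1.0, size=10_000)

t, p = ttest_1samp(data, popmean=0.0)  # null: no offset at all
print(f"t = {t:.1f}, p = {p:.1e}")     # absurdly small for such a boring effect
# The tiny p-value is correctly telling you the null is wrong; it just
# isn't wrong for the reason you care about.
```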

All that said, I still don’t see why people make fun of small p-values. If you have a really strong effect, then it’s entirely possible that you can get such a tiny p-value. In which case, the response is typically “well, if it’s that obvious, then why do any statistics?” Okay, fine, I’m totally down with that! But then we’re basically saying that there are no really strong effects out there: if you’re doing enough comparisons that you might get one of these tiny p-values, then any strong, real effect must generate one of these p-values, no? In fact, if you don’t get a tiny p-value for one of these multi-factorial comparisons, then you must be looking at something that is only a minor effect at best, like something that only explains a small amount of the variance. Whether that matters or not is a scientific question, not a statistical one, but one thing I can say is that I don’t know many examples (at least in our neck of the molecular biology woods) in which something that was statistically significant but explained only a small amount of the variance was really scientifically meaningful. Perhaps GWAS is a counterexample to my point? Dunno. Regardless, I just don’t see much justification in mocking the teeny tiny p-value.
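And here’s the flip side in code (again, made-up numbers): a genuinely strong effect gives a “teeny tiny” p-value at perfectly ordinary sample sizes, so the smallness of the number can’t by itself be a red flag.

```python
# A strong effect at an ordinary sample size: two groups of 100,
# separated by 1.5 sd (hypothetical).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, size=100)
b = rng.normal(loc=1.5, size=100)  # a strong, obvious difference

t, p = ttest_ind(a, b)
print(f"t = {t:.1f}, p = {p:.1e}")  # easily "teeny tiny"
```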

Oh, and here’s a teeny tiny p-value from our work. Came from comparing some bars in a graph that were so obviously different that only a reviewer would have the good sense to ask for the p-value… ;)

Update, 6/9/2016:
I think there are a lot of examples that illustrate some of these points better. First, not all tiny p-values are necessarily the result of obvious differences or faulty null models in large numbers of comparisons. Take Uri Alon's network motifs work. Following his book, he showed that in the transcriptional network of E. coli (424 nodes, 519 edges), there were 40 examples of autoregulation. Is this higher than, lower than, or equal to what you would expect? Well, maybe you have a good intuitive handle on random network theory, but for me, the fact that this is very far from the null expectation of around 1.2±1.1 autoregulatory motifs (p-value something like 10^-30) is not immediately obvious. One can (and people do) quibble about the particular type of random network model, but in the end, the p-values were always teeny tiny, and I don't think that is either obvious or unimportant.
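For what it's worth, here's a crude version of that calculation, assuming a simple Poisson null in which each of the 519 edges is a self-loop with probability 1/424 (the published analyses use more careful randomization schemes, so the exact p-value depends on which null you pick):

```python
# Crude null model for autoregulation counts: drop 519 edges uniformly
# among 424 nodes; each edge hits a self-loop with probability 1/424,
# so the count is roughly Poisson with mean 519/424 ~ 1.22 (sd ~ 1.1).
from scipy.stats import poisson

nodes, edges, observed = 424, 519, 40
mean = edges / nodes
p = poisson.sf(observed - 1, mean)  # P(X >= 40) under this null
print(f"expected ~{mean:.2f} self-loops; P(X >= {observed}) = {p:.1e}")
```

Whatever the precise null model, 40 observed against an expectation of ~1.2 is going to come out teeny tiny.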

Second, the fact that a large number of small effects can give a tiny p-value doesn't automatically discount their significance. My impression from genetics is that many phenotypes are composed of large numbers of small effects. Moreover, a perturbation such as a gene knockout can itself lead to a large number of small effects. Whether those are meaningful is a scientific question (and an open one, to my mind), but whether the p-value is small or not is not really relevant.

This is not to say that all tiny p-values mean there's some science worth looking into. Some of the worst offenders are the examples of "binning", where, e.g., the half-life of individual genes correlates with some DNA sequence element, R^2=0.21, p=10^-17 (totally made up example, no offense to anyone in this field!). No strong rule comes from this, so who knows if we actually learned something. I suppose an argument can be made either way, but the bottom line is that those are scientific questions, and the size of the p-value is irrelevant. If the p-value were bigger, would that really change anything?
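And to put numbers on that made-up binning example, here's a simulation (as hypothetical as the example itself): a weak dependence over a few hundred genes is plenty to produce an R^2 around 0.2 with a p-value in the 10^-17 neighborhood.

```python
# Simulated "binning"-style result: modest variance explained, tiny p.
# All numbers hypothetical, mirroring the made-up example above.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 300                            # hypothetical number of genes
x = rng.normal(size=n)             # some sequence-element score
y = 0.5 * x + rng.normal(size=n)   # half-life with only a weak dependence

r, p = pearsonr(x, y)
print(f"R^2 = {r**2:.2f}, p = {p:.1e}")  # ~0.2 and teeny tiny, respectively
```

If n were 3000 and the p-value 10^-170 instead, the scientific question of whether that weak rule means anything would be exactly the same.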