Tuesday, March 20, 2012

The Joys of MCMC

Much of my PhD work has been about making various MCMC methods (Markov Chain Monte Carlo for the uninitiated) work. Usually the process looks something like this:

  • Come up with a model.
  • Write MCMC code to sample parameters of the model.
  • Run MCMC code with available data.
  • Notice that it doesn't work.
  • Spend several months getting the MCMC to work.

Sometimes it is not the sampling that is at fault; there have been problems with the model specification, which required going back and fixing the model. But around 75% of the time, the problem is lack of convergence of the MCMC chains. They are supposed to gradually converge to the posterior distribution of parameters in the model, given the data, which then allows you to draw parameter samples that are representative of this distribution. In practice, except for very simple models, this only happens reliably and speedily after you have tuned a lot of settings for your MCMC.
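
To give a flavour of what those settings are, here is a minimal sketch of a random-walk Metropolis-Hastings sampler in Python. This is not the code from my thesis; the proposal step size, burn-in length and toy posterior are just illustrative placeholders, but they are exactly the kind of knobs that need tuning before the chains behave.

    # Minimal random-walk Metropolis-Hastings sketch (illustrative only).
    import numpy as np

    def metropolis_hastings(log_posterior, init, n_samples=10000,
                            step_size=0.5, burn_in=1000, rng=None):
        """Sample from log_posterior using a Gaussian random-walk proposal."""
        rng = np.random.default_rng() if rng is None else rng
        current = np.asarray(init, dtype=float)
        current_lp = log_posterior(current)
        samples, accepted = [], 0
        for i in range(n_samples + burn_in):
            proposal = current + step_size * rng.standard_normal(current.shape)
            proposal_lp = log_posterior(proposal)
            # Accept with probability min(1, posterior ratio).
            if np.log(rng.uniform()) < proposal_lp - current_lp:
                current, current_lp = proposal, proposal_lp
                accepted += 1
            if i >= burn_in:
                samples.append(current.copy())
        return np.array(samples), accepted / (n_samples + burn_in)

    # Toy example: a standard 2-D Gaussian "posterior".
    log_post = lambda x: -0.5 * np.sum(x ** 2)
    samples, acceptance_rate = metropolis_hastings(log_post, init=np.zeros(2))
    print("acceptance rate:", acceptance_rate)   # tune step_size until ~0.2-0.5
    print("posterior mean estimate:", samples.mean(axis=0))

In practice you also run several chains from different starting points and compare them (with the Gelman-Rubin diagnostic, for instance) before trusting the samples.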

I recently came across an old discussion in The American Statistician (found here; it might be behind a paywall depending on where you are), where three experienced practitioners discuss best MCMC practices. I wish I had read this discussion a few years ago, for two reasons: 1) It contains a lot of useful tips that I had to learn the hard way. 2) It makes it clear that MCMC practices vary a lot, depending on your personal preference, domain of application, the complexity of your model and so on.

What I take away from this discussion is that in order to be successful at sampling with MCMC, you need to be tenacious, resourceful, and rigorous. Whenever I have failed at an attempt, it was usually because I took a "choose two out of three" approach to these qualities. Here's hoping success will come easier in the future.

Thursday, September 16, 2010

Well Hello There

In a possibly ill-advised move, I have decided to re-activate this blog in the middle of what may well be the busiest period of my life (also known as the third year of my PhD). We shall see how long this good intention lasts. In the meantime, look forward to more dispatches from the life of a PhD student.

Warning: May include occasional desperation-fuelled late-night ramblings.

Friday, November 7, 2008

On Learning When To Shut Up

You know what phrase I'm beginning to dread? It's "Oh, so you're saying that...". Nine times out of ten, I'm not actually saying that. Or if I was saying that, I was merely speculating.

Example:

Me: This is not giving the result I expected. I observed effect Y, so maybe it's due to Z.
Person X: Oh, so you're saying that [slight rephrasing of Z] is causing the problem.

No, I'm not stating an absolute, as indicated by my use of the word "maybe". I'm putting forward a hypothesis, which is what you do in science. But I don't appreciate you nailing me down on that hypothesis before I have even investigated it.

Maybe (just a hypothesis!) the problem is with me. These are (for the most part) busy people I'm talking to, and maybe they can't afford to spend that much time speculating anymore. So when I'm coming up with a hypothesis, they just assume I wouldn't be telling them if I hadn't already thought about it for some time.

In other words, do I need to learn when to shut up?

Monday, October 13, 2008

A Plea

Please, please, please, please... document your data.

It's not fun to get a matrix without any column or row labels. As my old physics teacher used to say when somebody proudly proclaimed that the answer to an exercise was 5.029, "5.029 what? Elephants?". In physics, a quantity is not meaningful without a unit, and in data management, a matrix is not useful without labels.

It's not fun to know nothing about any preprocessing of the data. Has it been normalised? I guess I can check the mean and standard deviation, but what if they're only close to 0 and 1, not exactly so? Was that some special normalisation method? Hell if I know.
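
Just to illustrate the kind of guesswork this forces on me, here is roughly what that check looks like (a sketch only; the file name is a placeholder, and I'm assuming column-wise standardisation was the intent):

    # Guessing whether an undocumented matrix has been standardised per column.
    import numpy as np

    data = np.loadtxt("data.txt")   # placeholder file name
    col_means = data.mean(axis=0)
    col_stds = data.std(axis=0)

    # If every column mean is ~0 and every std is ~1, it was *probably*
    # standardised -- but "close to, not exactly" still leaves you guessing.
    print("column means in [%.3f, %.3f]" % (col_means.min(), col_means.max()))
    print("column stds in  [%.3f, %.3f]" % (col_stds.min(), col_stds.max()))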

It's not fun to know nothing about the experimental setup. Maybe you told me every second data point is a wildtype. Does that mean that these were results from two-colour arrays? Or two single-colour arrays? Are the wildtypes from the same time point as the mutants? Come to think of it, are these even time-course data?

It's not fun to know nothing about what the biologists* want you to find. Are they looking for similarly expressed genes or for regulators? Would they rather have a network of the knocked-out genes or of all genes? Is it worse if I give them false positives or false negatives?

So please, please, please, document your data. Maybe some time in the future I will tell you about fun things called wikis and databases, but for now even a text file would do.

*Or insert other applied science here.

Monday, September 29, 2008

The View from the Foothills

I'm finally there. Five years of hard work, first as an undergraduate and then as a Masters student, have paid off. Last Monday, I was granted my rightful place in academia... on the lowest rung of the ladder.

Yes, PhD students are a dime a dozen, even in my small institute, and despite the excitement of starting research in earnest, I can't help but feel slightly apprehensive. This may just be the result of reading too many PhD comics, but a tinge of anxiety is setting in. What if my advisor turns out to be a workaholic? What if I can't finish in the required 3 1/2 years before my funding runs out? What if my office mates are insane (they're not, I think) or my experiments all fail?

Then I remember that a million students have survived their PhD just fine before me and a million will again. I may not have the prettiest office (in fact, drab is not an inaccurate description), but at least I'm not sharing with 14 other people like my flatmate. My supervisor has only been nice to me, despite the bollocking that he gave his other PhD student last week. And my project, even though it looks daunting from here, will rest on the foundations of my Masters project, meaning that I have a reasonable idea of where to start.

So, the base camp has been established in the foothills of Mt. PhD. Only the future will show if I scale the summit triumphantly, or freeze to death in a crevasse somewhere. Boy, that metaphor took a bleak turn, didn't it?

Saturday, August 30, 2008

Research is Easier if You Make It Up

Yes, I know I haven't written in a good few months. In my defence, I have been kept quite busy by the research for my MSc project. Now that it is done, however, I'd like to share a few thoughts on my first real experience with research.

For this first post, I want to talk about what was perhaps the most humbling experience, and that was how tempting it was to cheat.

Like many research projects, my research was beset with problems. There were contradictory results, vague results, and results that were the opposite of what we expected, with no indication of why. And often, when I got these results, I would think: "Gee, wouldn't it be nice if I could make up the results I wanted instead."

Now before you cast the first stone, let me be very clear: I did not fake any results, and I hope I never will. But it got me thinking. How easy would it really be to fake results? For my MSc project, it would have been very easy. We do not have to hand in the code (the markers might ask for it if they smell a rat, but let's assume for the sake of argument that the faked results are completely convincing), so I would not even have to write the programs. I knew how the different experiments were supposed to work, so generating some convincing results would have been easy. The only people to see the results are my supervisor and a second marker, and of those two, only my supervisor could possibly spot fake results, because the second marker is not an expert in the field. If I got fake results past my supervisor, I would basically be home free.

You might be thinking that that's all very well for a Masters project, but surely in real research faked data would be spotted. But would it really? I agree that you would probably have difficulty faking a whole project: You'd be hard-pressed to answer questions from reviewers of your paper, and anyone trying to repeat the experiments would obviously get very different results. But what about just tweaking that one experiment that's poking a hole in your theory? That would again be very easy and would probably not be spotted unless somebody decides to repeat that exact experiment. If somebody later disproves your theory, well, you got a paper out of it, and nobody can really blame you for not spotting the flaw when all of your experiments were confirming the hypothesis.

Cheating can get even more subtle (choosing your experiments, skimping on controls, omitting results) and harder to spot. So the question is, given how easy it would be to cheat, what, other than personal integrity, is keeping scientists honest?

I believe curiosity and ambition are big factors. If you get results that contradict your hypothesis, you don't just say "Aw, crud", you get excited, because there's another problem to solve. Maybe this new problem will lead to an even bigger discovery than the one you were hoping to make. If you just fake the result, you'll probably never do really ground-breaking science. Worse yet, you might set back other scientists who will not pursue their theories because your "results" seem to have disproved them.

There's also training and your research environment. Never underestimate social conformity, which in this case is a good thing. If everybody around you is excited about research, as most scientists will be, you'll find it very hard to be the cheater, even if you're the only one who knows that your results weren't real. You'll want to be just as good as the rest, and if they can deal with contradictory results, then so can you.

Of course, this only applies if the people in your research environment let you know about the problems they are having. They may be competitive people who feel that talking about struggling with research is equivalent to showing weakness. If that is the case, I recommend reading some of the many excellent blogs from scientists who are not afraid to talk about their research issues.

One thing that is clear is that you cannot just assume that every published result is set in stone. If you think you have a better theory, test it, and if necessary repeat an experiment that has already been done. If enough people do that, there might actually be a chance of unmasking the cheaters. And that would be another great incentive not to cheat in the first place.

Saturday, July 5, 2008

Conference Noises

Would you say that a scientist's first conference is like his first kiss*, a unique experience, never forgotten despite the fumbling and nervousness? Or is it more like the first time you went to a McDonald's: Sure, it's exciting and colourful, but after you've been a few dozen times you notice that they're all the same.

I couldn't say yet which of these is a better description, since I've only just experienced my first conference. Conference might be saying a bit much: It was a one-day symposium, and I didn't even have to leave the city.

Still, there were some memorable experiences. Some were of the mundane variety: It seems that even in Britain, coffee break means coffee break, and not tea break. And don't even dare ask for water. Also, pinning your badge to your shirt is a fashion faux-pas; the correct place is discreetly on your belt.

The poster session was different from what I expected, because there were really only posters. Somehow, I always expected the poster creators to be standing next to them with proud smiles, eager to explain their science to anyone passing by. Not so here: There were posters, there were people reading the posters, and that was it.

The talks ranged from the fascinating to the mystifying. I've always been better at learning things from papers than at picking them up in lectures, so it's no surprise that I couldn't follow some of the more complicated topics. Listening to those lectures was not a waste of time, though, since at least now I know those topics exist and I can find out more about them (by reading papers!) if I want to.

The quality of the speakers varied (doesn't it always?) but some of them were very good, even inspirational. There are so many unsolved problems in bioinformatics, but these speakers were pointing the way to solving many of them.

Now for the more disappointing part of the symposium. No, not the food, that was alright. This is something that I'm willing to bet not many attendees even noticed, but it would be a huge statistical fluke if it happened by chance: Out of 15 speakers, not a single one was female. I'm used to gender bias in my field, especially on the informatics side, but 0 out of 15? Seriously? You're telling me that there's not a single female professor that you could have invited to talk about her research?
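
To put a rough number on that: suppose (purely for illustration, I don't have the real figure) that a quarter of the people who could plausibly have been invited are women. Then the chance of ending up with an all-male line-up of 15 by luck alone looks like this:

    # Back-of-the-envelope check; the 25% share is an assumption, not data.
    p_female = 0.25
    n_speakers = 15
    print((1 - p_female) ** n_speakers)   # ~0.013, i.e. about a 1.3% chance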

At least many of the people in the audience were female, but jeez!

*With the first conference occasionally preceding the first kiss by a while.