Molecular Docking on OpenMacGrid - Part I
As many of you probably know, MacResearch started the OpenMacGrid project a few months ago. It gives scientists free access to a computer cluster built with Apple's Xgrid technology. The cluster relies on the generous support of Mac users from around the world, who add their machines as "agents" to the OpenMacGrid controller. Since its inception, the cluster has hosted 4 different projects, some still ongoing, with more projects pending approval. In this article, we are very happy to have Rob Yang, from the Marshall lab at Washington University in St. Louis, introduce the project he has been running on OpenMacGrid over the last few weeks. Rob uses computers for virtual screening of compounds that could become useful drugs for the treatment of human diseases.
Drug development and drug screening
The interactions between biomolecules (proteins, nucleic acids, small organic molecules) are the driving force for every biological function. Virtually all diseases can be traced back to misregulation of these interactions. For example, breast cancer, estimated to affect 1 in 8 American women, is linked to the over-activation of the EGF receptors (a membrane protein) in ~30% of the patients. The EGF receptor is then a clear therapeutic "target" for the development of a drug that will interact with it and "restore" a proper non-pathological state. More generally, many of the FDA-approved drugs are small molecules that can bind and alter (inhibit or enhance) the functional effectiveness of the corresponding therapeutic protein targets.
The path to a drug is thus quite clear: (1) identify a target, (2) design a drug that will bind that target, (3) show that the drug works as intended. This text is focused on the second step: the design of an effective drug on an already validated target. Traditionally, the discovery of drug candidates requires the screening of libraries of millions of randomized chemical compounds, hoping that a few will show the desired effect. To test these ligands, specific high-throughput assays have to be designed to measure the therapeutic effectiveness on the target of interest. These assays are costly and technically challenging, and can only be performed by private pharmaceutical companies. While academic labs may have a very good understanding and knowledge of a particular disease and may have some innovative ideas on how to treat it, they cannot afford the investments needed to identify a potential drug. Drug development is thus inherently limited to a small number of therapeutic targets and diseases.
Virtual drug screening
In silico molecular docking could be an efficient alternative for cheaper and faster drug design. Molecular docking relies on biophysical rules to computationally screen large compound libraries and predict putative candidates. These candidates, ideally a small subset of the original library, can then be tested experimentally in a much more manageable fashion. Since the computations are fast, low-cost and have been shown to be accurate in many systems, combining docking with experimental follow-up presents a realistic opportunity for virtual screening of thousands or even millions of compounds. With the help of molecular docking, academic research labs that specialize in their own biological systems but cannot afford to experimentally screen millions of compounds can conduct their own drug discovery. In the long run, this will strengthen the academic-industry pipeline, where inhibitors discovered in academic labs can serve as precursors to marketable drugs.
Here is a brief outline of how molecular docking works. The structure of the target protein is of critical importance. The preferred starting point is a high-resolution crystal structure, although there have been successful cases where the target protein structure was predicted by homology modeling. For every compound, the docking program (my favorite is autodock4 <http://autodock.scripps.edu/>) "docks" it to the target protein by exploring many biophysically possible ways in which the compound can bind to the target. Each of these possibilities, known as a binding pose, is assigned a score by a predetermined scoring function. The same process is carried out for all the compounds in a given library. Once all compounds have been docked to the target protein, they are ranked by their scores and the best ones are selected for subsequent experimental validation.
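The rank-and-select step described above can be sketched in a few lines of Python. This is only an illustration, not the actual autodock4 workflow: the compound names and scores are made up, `rank_compounds` is a hypothetical helper, and the sketch assumes that a lower (more negative) score means a better predicted binding pose.

```python
# Hypothetical docking results: compound name -> scores of the poses explored.
# The numbers are made up; lower (more negative) is assumed to be better.
pose_scores = {
    "compound_A": [-6.1, -7.4, -5.0],
    "compound_B": [-9.2, -8.8],
    "compound_C": [-4.3, -4.9, -4.1],
}


def rank_compounds(scores_by_compound):
    """Return (compound, best_score) pairs sorted from best to worst."""
    # Keep only the best-scoring pose of each compound.
    best = {
        name: min(scores)
        for name, scores in scores_by_compound.items()
        if scores
    }
    # Sort ascending so the most favorable score comes first.
    return sorted(best.items(), key=lambda item: item[1])


ranking = rank_compounds(pose_scores)
for name, score in ranking:
    print(f"{name}: best pose score {score:.1f}")
```

In a real screen, only the top of this list would go on to experimental testing.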
The following example illustrates the procedure. Shown here is the X-ray crystal structure of a target protein with an inhibitor bound to it (PDB ID: 1LEE). It serves as the "true answer" in this test case. The goal is to pull away the inhibitor, scramble its 3D coordinates and dock it back to the protein to reproduce the same binding pose. For simplicity, only this compound is shown in this example. In virtual screening applications, the compound will be mixed into a large compound library, and thousands of independent docking jobs will be performed.
The picture below shows how the docking program explores many different binding poses. This is typically the most computationally expensive step, and its cost grows exponentially with the size, flexibility, and complexity of the compound as well as of the protein-binding site.
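To get a feeling for this exponential growth, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each rotatable bond of the compound is sampled at a fixed number of angles (12 steps, i.e. 30-degree increments), and it ignores the position and orientation of the compound and any flexibility of the protein. The function name and the step count are hypothetical, and docking programs generally rely on search heuristics rather than enumerating every combination.

```python
def count_conformations(rotatable_bonds: int, steps_per_bond: int = 12) -> int:
    """Number of torsion-angle combinations on a regular grid.

    Each rotatable bond contributes steps_per_bond choices, so the total
    is steps_per_bond raised to the number of rotatable bonds.
    """
    if rotatable_bonds < 0 or steps_per_bond < 1:
        raise ValueError("need rotatable_bonds >= 0 and steps_per_bond >= 1")
    return steps_per_bond ** rotatable_bonds


for bonds in (2, 5, 10):
    print(f"{bonds} rotatable bonds: {count_conformations(bonds):,} combinations")
```

Going from 2 to 10 rotatable bonds takes the count from 144 to more than 60 billion, which is why flexible compounds are so much more expensive to dock.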
The next pictures show that the predicted pose (pink) is very similar to the experimentally determined binding pose (cyan). The accuracy depends on the size, flexibility and complexity of the compound and the binding site.
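A common way to put a number on "very similar" is the root-mean-square deviation (RMSD) between matching atoms of the two poses. The article does not say which measure was used, so the sketch below is an assumption, and the coordinates are made up. Since both poses sit in the coordinate frame of the same protein, the RMSD is computed directly, without superimposing the two poses first.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered atom lists.

    Each argument is a list of (x, y, z) tuples in the same units.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the predicted pose is the crystal pose shifted by 1.0 along z.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted_pose = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]

print(f"RMSD = {rmsd(crystal_pose, predicted_pose):.2f}")
```

The smaller the RMSD, the closer the predicted pose is to the experimentally determined one.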
Using autodock4 at my specified level of computational requirements, it typically takes 7 CPU hours to dock 1 compound on a 2.4 GHz Intel Pentium 4 machine running Linux. This translates to roughly 70,000 hours, or 8 years, to dock a 100,000-member library on a single machine. Fortunately, because docking jobs are independent of each other, they can be run in parallel, which makes them a perfect fit for Xgrid. Xgrid comes with every Mac OS X machine, and is designed to perform highly parallel and independent tasks. With Xgrid, docking jobs can be dispatched in parallel to many client machines whenever their CPUs are idle. OpenMacGrid (OMG) is a central grid server that was created by the MacResearch community to join many machines together. Assuming comparable speed between the Pentiums and the Macs, if 30 client CPUs are idle in OMG, the same 100,000 docking jobs would take roughly 3 months; if 60 clients are idle, 1.6 months; if 120 are idle, 24 days.
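The scaling behind these estimates is simple division, sketched below. The sketch takes the single-machine estimate of roughly 70,000 CPU hours quoted above as its starting point and assumes ideal conditions: every idle client is as fast as the reference machine, stays available the whole time, and no time is lost dispatching jobs. The function name is hypothetical.

```python
HOURS_PER_DAY = 24


def wall_clock_days(total_cpu_hours: float, idle_clients: int) -> float:
    """Ideal wall-clock time, in days, when independent jobs are spread evenly."""
    if idle_clients < 1:
        raise ValueError("need at least one client")
    return total_cpu_hours / idle_clients / HOURS_PER_DAY


TOTAL_CPU_HOURS = 70_000  # single-machine estimate quoted in the text

for clients in (1, 30, 60, 120):
    days = wall_clock_days(TOTAL_CPU_HOURS, clients)
    print(f"{clients:>3} idle clients: about {days:,.0f} days")
```

In practice the number of idle clients changes from hour to hour, so the real completion time will drift around these figures.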
With the help of Mr. Gridstuffer himself, Charles Parnot, and the OMG committee, 2000 docking jobs are running as of this moment, utilizing 300 GHz thanks to the idle CPU time generously donated by many members of the OMG community. I will report the total time these jobs took once they have finished. I am very excited about the power of a joint Mac network for docking, not just for my own work but also for future dockers (some of the docking jobs that have already finished took only about 3 hours on fast Macs).
In the next article (to be written hopefully in the near future), I will describe the general procedure to set up a docking application for virtual screening from scratch. This procedure should be fully generalizable and can be used with different docking programs, as well as larger libraries of compounds.



Comments
autodock
And now I know where that pesky autodock4 process comes from :-) It sometimes takes a lot of my CPU cycles, even when my iMac is not idle.
autodock4 on your mac
Hi Koen!
Could you give us the specs of your machine? We tried to get this project to run only on relatively robust hardware, as we realized it could take an old machine down. The main issue is usually the RAM and the paging that can ensue when your Mac runs out of RAM.
Maybe we need to be more restrictive...
Mac specs
Hi Charles,
I have a G5 iMac, 1.8 GHz with 1 GB of RAM. Not *that* old ;-)
BTW, I also see the same thing sometimes happening with biock, which I believe is also spawned by XGrid.
Using CPU is all right
Reading your initial comment again, I realized my answer was not spot on...
What you are describing, I now realize, is that the Xgrid processes keep running even after you start using your computer again (right?). This is normal. Xgrid will not stop processes that are already running, but it will stop accepting new tasks once the current one is finished. It is not trivial to interrupt a process anyway, and this would not necessarily do that much good. The Xgrid processes run with low priority, and the kernel is quite good at letting other processes use the CPU when they need it, taking it away from Xgrid. There will always be some residual activity from Xgrid, even when you are running computationally intensive apps. Also remember that your app may appear to be working hard, but it is not just about the CPU: the app might also be spending quite some time in IO, in which case it will not necessarily claim all the CPU and will instead wait for IO.
One issue that can arise with Xgrid is that the process uses a large fraction of the RAM, in which case OS X becomes slow, and switching to other apps is limited by the paging and IO on disk. This can be a real issue for certain programs on old machines, e.g. with autodock.
Hope that clarifies things a bit :-)
charles
Docking example
I would be very interested to see some example jobs that perform the small molecule docking analysis described in this article. Has this been described in more detail elsewhere?
Thanks,
Ryan