In the Pipeline

March 19, 2012

Dealing with the Data

Posted by Derek

So how do we deal with the piles of data? A reader sent along this question, and it's worth thinking about. Drug research - even the preclinical kind - generates an awful lot of information. The other day, it was pointed out that one of our projects, if you expanded everything out, would be displayed on a spreadsheet with compounds running down the left, and over two hundred columns stretching across the page. Not all of those are populated for every compound, by any means, especially the newer ones. But compounds that stay in the screening collection tend to accumulate a lot of data with time, and there are hundreds of thousands (or millions) of compounds in a good-sized screening collection. How do we keep track of it all?

Most larger companies have some sort of proprietary software for the job (or jobs). The idea is that you can enter a structure (or substructure) of a compound and find out the project it was made for, every assay that's been run on it, all its spectral data and physical properties (experimental and calculated), every batch that's been made or bought (and from whom and from where, with notebook and catalog references), and the bar code of every vial or bottle of it that's running around the labs. You obviously don't want all of those every time, so you need to be able to define your queries over a wide range, setting a few common ones as defaults and customizing them for individual projects while they're running.
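
To make the structure-query idea concrete, here is a minimal sketch of a substructure search in Python with RDKit. The database file, the compounds table, and its columns are invented for illustration; a real corporate system would use an indexed chemistry cartridge rather than scanning SMILES strings one at a time.

```python
# Hypothetical sketch: substructure search over a registry table.
# Table and column names are invented; production systems use fingerprint
# screens and chemistry cartridges instead of this linear scan.
import sqlite3
from rdkit import Chem

query = Chem.MolFromSmarts("c1ccncn1")  # a pyrimidine core

conn = sqlite3.connect("registry.db")   # invented registry database
hits = []
for compound_id, smiles in conn.execute("SELECT compound_id, smiles FROM compounds"):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None and mol.HasSubstructMatch(query):
        hits.append(compound_id)

print(f"{len(hits)} compounds contain the pyrimidine core")
```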

Displaying all this data isn't trivial, either. The good old-fashioned spreadsheet is perfectly useful, but you're going to need the ability to plot and chart in all sorts of ways to actually see what's going on in a big project. How does human microsomal stability relate to the logP of the right-hand side chain in the pyrimidinyl-series compounds with molecular weight under 425? And how do those numbers compare to the dog microsomes? And how do either of those compare to the blood levels in the whole animal, keeping in mind that you've been using two different dosing vehicles along the way? To visualize these kinds of questions - perfectly reasonable ones, let me tell you - you'll need all the help you can get.
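
As an illustration of the kind of plot that first question implies, here is a rough pandas/matplotlib sketch. Every column name (series, molecular weight, side-chain logP, microsomal half-lives) is hypothetical, and in practice the data would come out of the corporate database rather than a flat file.

```python
# Hypothetical sketch: human vs. dog microsomal stability against
# side-chain logP for one series, MW under 425. Column names invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("project_data.csv")  # invented flat export of project data
sub = df[(df["series"] == "pyrimidinyl") & (df["mol_weight"] < 425)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, col, title in [(axes[0], "human_mic_t_half", "Human microsomes"),
                       (axes[1], "dog_mic_t_half", "Dog microsomes")]:
    ax.scatter(sub["side_chain_logp"], sub[col], alpha=0.7)
    ax.set_xlabel("Side-chain logP")
    ax.set_title(title)
axes[0].set_ylabel("Microsomal t1/2 (min)")
plt.tight_layout()
plt.show()
```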

You run into the problem of any large, multifunctional program, though: if it can do everything, it may not do any one thing very well. Or there may be a way to do whatever you want, if only you can memorize the magic spell that will make it happen. If it's one of those programs that you have to use constantly or run the risk of totally forgetting how it goes, there will be trouble.

So what's been the experience out there? In-house home-built software? Adaptations of commercial packages? How does a smaller company afford to do what it needs to do? Comments welcome. . .

Comments (66) + TrackBacks (0) | Category: Drug Assays | Drug Development | Life in the Drug Labs


COMMENTS

1. PPedroso on March 19, 2012 7:35 AM writes...

I work in a small company, and so far it's Excel spreadsheets all the way, but we do not have that many projects (so far!) and things are still manageable...
From my perspective, more difficult than managing and processing the data is managing and processing all the different documents from different sources (like reports and preliminary results, etc.) regarding that data...

2. MattF on March 19, 2012 8:08 AM writes...

And I'd think you also need what computer people call 'version control'-- when does new data become reliable enough to supersede old data, and to what degree, and who should have access to the new data before it is deemed reliable?
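
One common pattern for this, sketched below with invented table and status names: never overwrite a result, just append rows carrying a status, and have the default query see only released data.

```python
# Hypothetical sketch: append-only results with an explicit status,
# so new numbers wait for QC instead of silently replacing old ones.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE results (
        compound_id TEXT,
        assay       TEXT,
        value       REAL,
        status      TEXT CHECK (status IN ('preliminary', 'released', 'superseded')),
        loaded_at   TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("INSERT INTO results (compound_id, assay, value, status) "
             "VALUES ('CMPD-001', 'enzyme_inhibition', 0.25, 'released')")

# A retest arrives: retire the old row, stage the new one for review.
conn.execute("UPDATE results SET status = 'superseded' "
             "WHERE compound_id = 'CMPD-001' AND status = 'released'")
conn.execute("INSERT INTO results (compound_id, assay, value, status) "
             "VALUES ('CMPD-001', 'enzyme_inhibition', 0.31, 'preliminary')")

# The default view shows only released data; nothing yet for CMPD-001.
print(conn.execute("SELECT * FROM results WHERE status = 'released'").fetchall())
```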

3. MarkE on March 19, 2012 8:29 AM writes...

We are able to use an integrated 'report-puller' at the small company where I work: a custom-built program which can pull user-defined data from our central data capture system. Creates large but (more) manageable Excel files.

4. Anonymous on March 19, 2012 8:30 AM writes...

There are excellent open source and free relational database management systems (RDBMS) out there; my favorite is PostgreSQL, but there are also MySQL and SQLite. For visualizing and analyzing data, there is the trusty free and open source R statistics package, which can plug in to the aforementioned RDBMSs. R is very widely used in the biomedical setting and has many excellent extensions. For general data handling, there are the widely used scripting languages like Perl and Python. Even though all the packages mentioned here are free, it takes time to develop proficiency, but the payoff is well worth it, even for small companies. Spreadsheets are fine for data analysis up to a point, but are no substitute for databases when it comes to data storage. My experience is based on bioinformatics, but databases have an obvious general utility, and I believe both Python and R have extensions and libraries for chemical data too.
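
A minimal sketch of that combination, with invented table and column names: a real database for storage, pulled into a DataFrame for analysis (pandas standing in here for the R side).

```python
# Hypothetical sketch: assay results in SQLite, analyzed via pandas.
import sqlite3
import pandas as pd

conn = sqlite3.connect("assays.db")  # invented database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        compound_id TEXT NOT NULL,
        assay       TEXT NOT NULL,
        result_type TEXT NOT NULL,   -- e.g. 'IC50'
        value       REAL,
        units       TEXT             -- e.g. 'uM'
    )
""")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [("CMPD-001", "enzyme_inhibition", "IC50", 0.25, "uM"),
     ("CMPD-002", "enzyme_inhibition", "IC50", 1.80, "uM")],
)
conn.commit()

# Pull straight into a DataFrame for analysis.
df = pd.read_sql_query(
    "SELECT compound_id, value FROM results WHERE assay = 'enzyme_inhibition'", conn)
print(df.describe())
```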

5. ChrisS on March 19, 2012 8:38 AM writes...

I'd say that for both small and large pharma there are more similarities than differences in needs when the chemist asks "What should I make next?" and the biologist asks "How did those results turn out?". While small pharma may have smaller datasets, in either case once you get beyond a certain number of columns of data and rows of structures in a spreadsheet, it becomes difficult to make comparisons, even with conditional formatting such as color/stoplighting to help out. In the beginning we wrote simple web-based tools that queried an Oracle database and returned grids of data, but it was quickly apparent that this was not enough.

There are now several platforms that are available for small to medium pharma which are cost-effective and help enormously with the task of selecting and arranging data in a form that eases decision making. I would say that something like this is a sine qua non for effective drug discovery so be prepared to make the pitch and spend the money.

We chose Dotmatics Browser, a tabbed forms and tabular data query/viewing platform which is very flexible and easy to maintain. We've configured each project with its own collection of forms, with a primary summary form that shows data for all the assays on the project's progression scheme. This form also shows the structure and any other data the project deems critical for comparison purposes. Typical of forms-based applications, one queries the data by entering terms in any of the fields. Along with the main project summary form, I've configured a set of standard template forms for the "usual suspects" - chem properties, in-vitro DMPK, in-vivo DMPK and so on. These appear in every project as well. To handle documents, I have an indexer on the fileshares containing the reports, and this is linked to a tab in Dotmatics so it returns documents containing the current compound ID as you move among the compounds in the query result. Query results can be pushed out to preconfigured tabular views easily.

What I don't want is my chemists and biologists spending time pushing data around to get it into a form they can use; better to present the data in a usable form and let them think about what to make next.

6. dmc on March 19, 2012 8:40 AM writes...

"...To visualize these kinds of questions - perfectly reasonable ones, let me tell you - you'll need all the help you can get."

This is the key point of the discussion. Regardless of the software solutions available, you need good scientists who will actively manage and analyze the loads of data coming in. In my experience this responsibility has always fallen on the med chemists, because we need to understand all of the data relationships and dependencies in order to design the next set of analogs. With the recent gutting of med-chem talent in the industry, this is an 'art' that is being lost. You can outsource all the compound synthesis you want, but if no one is left around who understands how to properly interpret all the data being generated, you might as well go back to combichem!

7. Rajarshi on March 19, 2012 8:41 AM writes...

While RDBMSs + R are a potent combination, they definitely require informatics support and/or expertise. Certainly, from the informatics side of things, this is a great combination, and coupled with chemistry toolkits and modern vis frameworks, pretty much anything is possible.

But from the bench scientist's point of view, I'd argue that without appropriate (i.e. usable) interfaces, it's all useless.


9. HelicalZz on March 19, 2012 9:02 AM writes...

I too am in a small company, with a modest number of data and compound sets. Even so, Excel is just barely manageable for this. The job really requires a database, and people who know how to manage and use it. Document and data retention, organization, and management are things small companies don't pay enough attention to, in my opinion (until forced to).

So the question becomes: how many chemists are trained in database management and use? Why not? If you are at a large company, there is really little excuse not to get yourself at least a basic level of this training (speaking from the experience of not taking advantage of opportunities like that myself when I should have).

Zz

10. LionelC on March 19, 2012 9:04 AM writes...

I am new at a small chemical start-up. I have to implement/select a new system for the management of data, from the chemical structure (with exact stereochemistry) to the certificate of analysis and the biological data.
It is not easy to see what the solutions are, so any comments are welcome.

But I think it clearly depends on what your needs are. Is it to preserve your know-how, or is it to do med chem and select the next compound to make?

11. John Wayne on March 19, 2012 9:10 AM writes...

If you can afford it, Spotfire is an excellent program. You can import a spreadsheet and pretty easily visualize your data in many dimensions. You can easily save your work as a picture file, and the plots make handy additions to presentations.

12. Miramon on March 19, 2012 9:11 AM writes...

I'm curious who the industry leader is in this software category. Is the quality -- usability and power -- of the software generally satisfactory for most users? Do "semantic" features add value here?

13. Moses on March 19, 2012 9:11 AM writes...

# 10. LionelC:
You have a choice of several established companies.
The big fish is Accelrys/MDL/Symyx; there are also CambridgeSoft, ChemAxon and Dotmatics. I've worked for a forerunner of Accelrys and like their Accord software, but Dotmatics is probably a neater option these days.

14. SP on March 19, 2012 9:20 AM writes...

Schrodinger Seurat works pretty well and is cheaper than the "big fish." I've worked with Cambridgesoft in the past, was not very happy with it.

15. LionelC on March 19, 2012 9:27 AM writes...

Thanks #13 and #14 for the software suggestions.

I just have to add that R or Spotfire and so on are fine, but I think the first and most important thing is clearly to manage the chemical structures correctly. If your structures are wrong, the rest will be too...

16. AJ on March 19, 2012 9:40 AM writes...

Spotfire is great if you have "clean/organized" data, and one can be productive within a VERY short time - one of the best tools I have EVER worked with. To actually "work" with data, one shouldn't forget KNIME in combination with R... seen from a compound and HCS/imaging screening perspective. Matlab is a classic and very powerful anyhow, though one really needs to know the details...

andi

17. CMCguy on March 19, 2012 9:49 AM writes...

My experience ranges from big to small to medium companies, and there was a commonality: individuals and groups/departments tended to have collections of their own Excel spreadsheets for info that was important or wanted readily accessible. Most also had, or at some point went to, Oracle databases customized in-house; while these may have been more comprehensive overall, they were typically not very user-friendly and needed certain expertise to enter or extract anything.

I have heard it proposed that the inability of bioinformatics to keep up was the true reason combichem did not achieve many successful applications, because it generated more data than could be reasonably handled at the time. I wonder if better database tools exist today that would allow greater exploitation of the technique, although it would still never be the panacea it was once proclaimed to be.

18. JTM on March 19, 2012 10:05 AM writes...

I once worked at a company that used Spotfire as the primary data visualisation tool for everyone - it was truly wonderful but insanely expensive.

I'm currently at a much smaller company where we use the CambridgeSoft suite for collection of data (ELN, biology etc.) and Excel / Dotmatics for visualisation. This works fine (it's difficult to beat Excel), although it's a little clumsy and unstable. We're also evaluating Dotmatics Vortex, which they have in development as an alternative to Spotfire. At the moment it looks pretty decent (structure visualisation, and capable of handling plots of upwards of 100k datapoints).

19. DrSnowboard on March 19, 2012 10:06 AM writes...

Used to use Spotfire for HTS data - as one commenter says, the data needs to be quite clean.
Dotmatics usurped MDL / ISIS / whatever Symyx call it now, in my view. MDL got complacent and just threw in the towel as far as I was concerned. Dotmatics is web-based and quick for medium-sized datasets. Reporting is getting better too, and it allows you to interface with legacy ActivityBase, which biologists love and chemists hate, because it's useless at chemistry.

20. JB on March 19, 2012 10:11 AM writes...

Re #15 - for chemical intelligence we use ChemAxon, which is really good; a supportive company, and very generous with academic collaborations. I'll also hint at some NIH-funded public tools that will be developed over the next year for assay management and data mining.
I love Spotfire (especially the newer version that replaced DecisionSite) and Pipeline Pilot, but as people have mentioned, both are very expensive - I've heard of KNIME as a cheap alternative to PP but haven't personally tried it.

21. HTSguy on March 19, 2012 10:16 AM writes...

+1 to 18: Spotfire is very, very useful and insanely expensive.

I've used both Pipeline Pilot (while employed) and KNIME (while unemployed - it's free). Pipeline Pilot is both far faster (where I currently work we have it running on a small, several-year-old server) and much better integrated (KNIME unfortunately acts like the Frankenstein monster it is - a jumble of disparate parts). I guess you get what you pay for.

22. anon2 on March 19, 2012 10:29 AM writes...

This is a topic analogous to a circular piece of string with no end. It comes down, all too often, to different strokes for different folks. Everyone wants an easy-to-use database, but the uses, and hence the objectives, are not always consistent----biologists, synthetic chemists, modeling scientists, clinical folks interfacing with preclinical data, geneticists. None are always "right". None are always "wrong". But no one system has proven (to me) to work for everyone.

23. Thomas on March 19, 2012 10:34 AM writes...

Pfizer has a great tool called RGate - it does everything, including list logic and registration of virtual compounds, and it exports nicely to Spotfire. There was some talk of making it available to the public, and/or replacing it with a commercial product ("cheaper", as not supported by in-house specialists).

24. AJ on March 19, 2012 10:39 AM writes...

@22 - that's what I liked about Spotfire - it didn't really care about the origin of the data - it could just handle EVERYTHING as long as it was in a structured table etc. ... though we never paid a lot for it (academia then) ...

25. cdsouthan on March 19, 2012 10:43 AM writes...

These comments are pertinent to dealing with your own data, but it's difficult to go it alone, because there is an increasing imperative to intersect structures and linked data (preferably quickly) with public sources such as PubChem and ChEMBL. Having been involved in a project that engaged with this (PMID: 22024215), believe me, it gets even tougher than grappling with the in-house stuff.

26. exGlaxoid on March 19, 2012 10:44 AM writes...

Used to use ISISBase, then evaluated MDLBase for several years before it was stable enough to use.

Also tried a few other programs, and a few were OK, but nothing was as powerful as ISISBase. Spotfire is neat, but I didn't like it as much as ISIS once you had a nice project table definition. But of course that took IT support, so it was only good at larger companies.

Currently I use a mix of Excel, some ChemAxon products, some CambridgeSoft software, and some other packages. I'm not thrilled with the cobbled-together collection, but it works for now. Excel is only doable when each spreadsheet holds a single project's data; otherwise it gets too big. Plus a researcher now has to add the data manually, with the attendant updating and data-integrity issues.

27. passionlessDrone on March 19, 2012 10:59 AM writes...

Hello friends -

I don't know squat from chemistry, but do know a little bit about technology. I've been playing around with a tool called Tableau that is priced very reasonably and can visualize data sets pretty easily.

- pD

28. RD on March 19, 2012 11:13 AM writes...

Some suggestions:
1.) Spotfire. You can plot anything, using any criteria you want, in many dimensions.
2.) Have your rogue programmer write an app that will let you see your compounds as clusters on a grid instead of entries in a spreadsheet (a rough sketch of the idea follows below). I have no idea why a chemist would want to scroll through a spreadsheet when all the information could be staring you in the face in a grid. Have the rogue programmer add filtering and color coding to the grid to make it easier to spot patterns and activity trends.
3.) Hire a really good rogue programmer. There are a few out there. The IT and informatics departments tend to try to handcuff them.
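
A toy version of that grid idea, assuming a flat export with invented column names: pivot compounds against assays and color-code the values, so patterns jump out that a scrolling spreadsheet would hide.

```python
# Hypothetical sketch: compounds x assays heatmap instead of a spreadsheet.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("project_data.csv")  # invented flat export
grid = df.pivot_table(index="compound_id", columns="assay",
                      values="pic50", aggfunc="mean")

fig, ax = plt.subplots(figsize=(8, 10))
im = ax.imshow(grid.to_numpy(), aspect="auto", cmap="viridis")
ax.set_xticks(range(len(grid.columns)))
ax.set_xticklabels(grid.columns, rotation=90)
ax.set_yticks(range(len(grid.index)))
ax.set_yticklabels(grid.index)
fig.colorbar(im, label="pIC50")
plt.tight_layout()
plt.show()
```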

29. chemit on March 19, 2012 11:21 AM writes...


No magic here: the best tools, especially for chemistry data management, are often the most expensive ones. But they are usually worth it (for medium-to-large companies) if you need productivity, security, performance, reproducibility... Lucky people who have the money can get Pipeline Pilot / Spotfire or Dotmatics to build a robust and powerful system that can be used by everyone.

Free tools like KNIME are nice alternatives for simple tasks, but are really not (yet?) in the same category. I don't know of any decent free alternative to Spotfire / Dotmatics (anyone?). ChemAxon is probably the best compromise in terms of money, features and quality.

Finally (but you'll need money here too), a skilled IT team who knows chemists' and biologists' needs will do the job as well. Look at J&J's ABCD, which looks quite wonderful...

http://pubs.acs.org/doi/abs/10.1021/ci700267w

@25 indeed, so much to do in this area!!

30. anon2 on March 19, 2012 11:41 AM writes...

The lack of consensus from this one biased slice of data users (tending toward chemistry) simply emphasizes my previous comment... no simple stroke for all folks. If it had been resolved across the scientific, data-driven community, then this discussion would not be taking place.

Sometimes, such "obvious" overly-simplistic questions have obvious, but maybe not so satisfying, answers. Realism hurts.

31. Publius The Lesser on March 19, 2012 12:04 PM writes...

The problem you're describing (finding meaningful patterns in large data sets) is common across all technical domains and drives the "big data" fad -- Google it if you want to see the hype cycle in full swing. Behind the hype are some very useful tools and techniques for finding and visualizing patterns in large data sets. Because each domain is different, there really isn't a good canned solution, and these kinds of problems are solved by people called "data scientists" these days. A good data scientist is one part domain expert, one part statistician, one part machine-learning expert, and one part coder. Although I work primarily on text documents of various sorts now, about 5-6 years ago I applied some of these techniques to mass spectrometry data, back when I was a postdoc at Carnegie Mellon, with mixed results.

32. kissthechemist on March 19, 2012 12:09 PM writes...

As a small-pharma drone: we started out with Excel, and it soon got very unwieldy. We were guided into Dotmatics Browser to handle all of our data by the talented computational chemists we had. We had some teething troubles, but the made-to-measure database we have now is very user-friendly and has sped things up enormously, both for the biologists (especially data entry) and the chemists (SAR is a joy, not a chore). The folks at Dotmatics have been pretty spot-on too; the only drawbacks are the need for our own IT people (which we fortunately already have) and, of course, the price.

Overall, I'd say it's an investment which makes sense for companies of a certain size. I'd certainly hate to be without it.

33. JB on March 19, 2012 12:27 PM writes...

Cheap Spotfire: when I was at a smaller company we used a program called Miner3D (I think the original company is Hungarian) that had the basic graphing functions of Spotfire, with some funny shapes included in the icon set (we always thought they were various Hungarian peasants).

34. cbrs on March 19, 2012 12:29 PM writes...

A relatively new and cost-effective platform is Chembiography. It uses a web-based front end coupled to a Linux server with MySQL, so there's no cost overhead of Oracle and the like. It can be run locally or as a cloud model, and is designed as a solution for medium to small companies. It provides full registration and integrated biological data uploading, plus flexible querying with output as PDF or Excel reports.

35. C-me on March 19, 2012 12:52 PM writes...

CDD (Collaborative Drug Discovery) is a choice that is web-based and does not require internal IT people. It's chemical-structure based, calculates all properties, and will not let you register the same compound under two ID numbers. It's economical for a small player, and the price includes training and complete support.
The drawback is (still) that the rendering of structures is not great in the output files (Excel). I hope they work on this, because it is powerful and a workhorse, especially if you have people working at different sites, with consultants, etc.

36. Assay Developer on March 19, 2012 1:09 PM writes...

@19: Biologists hate ActivityBase too. And from what I hear, so do the developers. I have a lot of experience with the MDL/Symyx package (ISIS, Assay Explorer), and there is no commercial solution out there that is priced reasonably and powerful enough to do the job. We finally went with an in-house system running off of Pipeline Pilot for data entry/analysis and Seurat for querying the Accord db. So far so good, and it requires a lower level of IT commitment/support.

37. Lab Monkey on March 19, 2012 1:13 PM writes...

Another vote for Spotfire - great for visualising lots of data, and you can build all sorts of widgets to increase its functionality.

+1 for #27's Tableau suggestion too. I was looking for a free/cheap Spotfire alternative for use outside work, and this fitted the bill. Tableau Public is free (although the data you upload isn't confidential), but there are desktop/commercial versions that don't seem unreasonably priced.

38. DataHogWild on March 19, 2012 1:22 PM writes...

Has anyone used a tool called Ensemble?

39. Cellbio on March 19, 2012 2:10 PM writes...

Spotfire for me as well.

#6: Though perhaps the responsibility fell on the med chemists in times past, I think it a requirement for biologists to become facile with, and hold responsibility for, everything from assay qualification and database interfacing through data interpretation. By this I mean: validate the assays to support robust throughput, and assure rapid and robust methodologies for data QC and release into the database. This works best if supported by a policy that no data can be held in private worksheets, and by requiring teams to work only from published data. My favorite outcome after requiring this was having a biologist present results, with a confused chemist asking, "But who did your SAR?" when it was a simple extrapolation of biological data. This was a very productive change in the discourse that followed between chemist and biologist.

Final point: have meetings present data live from Spotfire (or equivalent) rather than PowerPoint. This reinforces the need for data to be imported into the database, and will show the team how sparse most data sets are in reality. I have found the number one problem is not how to handle the thousands of rows by hundreds of columns, but how to make the sparse data sets fuller, so that such analysis is useful.

OK, final final point: this effort will also yield insight into the number of redundant biological assays supporting different programs, which results in non-overlapping data sets and an expanded number of columns. That makes coherent analysis tough, and reveals an opportunity for improving efficiency.

40. MIMD on March 19, 2012 3:44 PM writes...

I would like to point readers to a series on how software for looking at large amounts of data should NOT be designed - that is, software presenting a markedly mission-hostile user experience.

41. Stonebits on March 19, 2012 4:10 PM writes...

I'd vote for what I guess is a high-end solution (I'm a developer): a solid chemistry db with good storage of the assay data, from which the data is pulled into a warehouse and then formatted by a custom program for display in a browser. It's not trivial, but everyone can then point to the same data, which has not only been vetted on input but calculated in the same way, and is comparable between chemists, projects, etc.

42. Anonymous on March 19, 2012 4:43 PM writes...

We (a lab at U. Michigan) use Collaborative Drug Discovery and love it. It's mostly used for data storage and searching (HTS, biological). Great for that, and affordable. Still trails Spotfire in terms of visualization.

43. Martin on March 19, 2012 4:58 PM writes...

From an academic perspective, it's really a matter of choosing a computing platform first, sticking with it, and then choosing the software to fit. We, like most university departments, are platform-agnostic: we use Macs, Windows running everything from NT! to 7, Linuxes of various flavours, Irix, and the list goes on. As far as I have been able to ascertain, there is no truly cross-platform product out there that a smallish university department with one and a half IT support people can deploy at prices that the aforementioned department can afford on academic budgets. The same goes for ELNs.

Whilst I assume that the choice of desktop platform in pharma is a bit more "structured", making rollouts of such systems easier to maintain, that only comes at a higher cost. Fans of open-source solutions perversely find little favour in constrained IT departments, where the time costs ultimately outweigh the material costs, not to mention the typical churn in university IT departments, where such expertise is rapidly lost when that one crucial employee moves on to a better-paid job.

44. DubbleDonkey on March 19, 2012 6:08 PM writes...

As many others have said, Spotfire is a great tool but too expensive for many. Dotmatics Vortex looks to be a much cheaper but worthy alternative. However, these tools are only as good as the underlying data, and getting good quality data is far more challenging. Tools such as ActivityBase are powerful, flexible and great at getting data into a database. Getting them set up in a way which allows for a minimum amount of developer maintenance can be a challenge, though. It's tempting to build new templates for every assay and let the biologists choose different analysis routines. Instead, it's worth investing some time up front standardizing data analysis and assay naming conventions. As well as minimizing maintenance, this makes it easier for chemists to navigate results and compare results from different assays. You can also reduce the number of columns if the data analysis is standardized.

Be wary of Excel spreadsheets. Things can quickly get out of hand with them, and you can end up in a real mess. Most informatics people I know would be happier if it were uninstalled from biologists' and chemists' desktops! Its use within ActivityBase can tame it somewhat, but if you can afford it, ActivityBase XE is the way to go for analyzing assay data. Biologists love the flexibility and extensive visualisations. Getting the data out of ActivityBase into chemist-friendly views is more challenging; Dotmatics Browser can help with this.

Pipeline Pilot is a must have for the Informatics people. You can very quickly put something together to process some data and the fact that it’s chemically aware makes it invaluable. It's an unusual day if I don't use it.

45. dvrvm on March 19, 2012 7:04 PM writes...

I've seen ChemFinder and a proprietary Access database in action - both in academic settings, however. ChemFinder can act as a solid foundation for a variety of different systems, which end up similar but not identical.

46. TJMC on March 19, 2012 7:29 PM writes...

Two or three issues drive the above comments and stories. Their main focus is on how to gain an understanding of the relationships and patterns among the diverse information types that Discovery spans. Excel and typical relational databases are everywhere, but they hit limits in scale, utility and ease of use for the typical scientist. Most chemists have excellent visualization skills in 3-D (and more dimensions); hence Spotfire is readily embraced. The problem is, the range of data TYPES is exploding, and the technology struggles to keep up, let alone make things easy for non-data scientists. The third issue, besides utility and the creep of diverse kinds of data, is that the "usual" tools prefer well-behaved (structured) data and relationships.

All of the above became apparent a while back from a large pharma R&D survey. A MAJORITY of respondents noted that they "abandoned avenues of research" because "it seemed too difficult or impossible" to connect and relate data they KNEW was in-house. To me as a scientist, that approaches a "mortal sin" for R&D. Lots of tools and solutions have since tried to address the issue, but the key first step is understanding the problems above and where you need to go, NOT what DB, interface or toolset to use.

47. LeeH on March 19, 2012 8:29 PM writes...

This issue is really a series of issues, all strung together. They are:

1. Accurately and faithfully capturing the real essence of the assay data.
2. Storing the data in a database that maintains data integrity (e.g. only allowing the correct result types and units for that assay, such as IC50 and uM but not EC50 and ug for an enzyme inhibition assay, or enforcing valid corporate IDs for compounds); a sketch of this kind of check follows below.
3. Allowing sensible visualization of the data in the database.
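
A toy version of point 2, with an invented assay registry and ID convention: rows carrying the wrong result type, units, or corporate ID format get rejected before they ever reach the database.

```python
# Hypothetical sketch: per-assay integrity checks at load time.
ASSAY_REGISTRY = {
    "enzyme_inhibition": {"result_type": "IC50", "units": "uM"},
    "cell_viability":    {"result_type": "EC50", "units": "uM"},
}

def validate_row(row: dict) -> list:
    """Return a list of integrity problems; an empty list means the row loads."""
    spec = ASSAY_REGISTRY.get(row["assay"])
    if spec is None:
        return [f"unknown assay {row['assay']!r}"]
    errors = []
    if row["result_type"] != spec["result_type"]:
        errors.append(f"expected {spec['result_type']}, got {row['result_type']}")
    if row["units"] != spec["units"]:
        errors.append(f"expected units {spec['units']}, got {row['units']}")
    if not str(row["compound_id"]).startswith("CMPD-"):  # invented ID scheme
        errors.append(f"bad corporate ID {row['compound_id']!r}")
    return errors

print(validate_row({"assay": "enzyme_inhibition", "result_type": "EC50",
                    "units": "ug", "compound_id": "CMPD-001"}))
```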

I also favor Pipeline Pilot / Accord / Seurat. Pipeline Pilot is easily the most flexible and powerful cheminformatics tool available, and is well suited to processing assay data. Accord has the most sensible (off-the-shelf) schema for storing chemical and assay data, but unfortunately Accelrys in their wisdom has decided to drop support as of 2014, and it suffers from ancient desktop clients. Seurat is a nice, cost-effective query tool which can support just about any schema, although the visualization side is a bit crude (albeit with some nice Spotfire-like features). Vortex is a nice-looking tool, although supporting it takes a fair bit of expert-level expertise - not for the faint of heart.

You can't make a silk purse from a sow's ear: you have to properly capture and store the data before you can look at it. I fear that as companies cut back, they will be distracted from treating their data as the crown jewels, focusing instead just on the visualization aspect. Which may itself be a distraction, since designing a drug is a multi-dimensional problem, and we can really only see 4 or 5 dimensions at most.

48. XChemistTurnedCompSci on March 19, 2012 9:39 PM writes...

#41, I agree wholeheartedly. If a company or academic group could afford it, they would be better off hiring 1-2 developers to create an intranet web application to handle all the queries and data input. The app would then be highly customizable, instead of trying to make some out-of-the-box software package fit your needs. And while not trivial to do, it wouldn't be that hard either, since it would just be a specialized web application.

XChemist

52. gippgig on March 20, 2012 1:22 AM writes...

An important and invariably neglected issue is data accuracy. Murphy's Law of Databases: all databases are riddled with errors. This seems to be universally applicable (nucleotide sequences, satellite launches, you name it). Despite being aware of this and taking countermeasures, errors still managed to creep into my yeast gene maps. Beware!

53. Moses on March 20, 2012 6:13 AM writes...

If you must work with spreadsheets, then look at The Edge's Morphit software: it allows you to flip between data view and design view, and is purpose-built for scientific work. It's pricier than Excel, but you're much less likely to screw up your very, very expensively acquired data.
http://www.edge-ka.com/

54. SteveM on March 20, 2012 6:50 AM writes...

All IT that is not directly in the value stream of a business (as it is at, say, a bank) is going to the cloud.

The number of comments here indicates a market opportunity that someone, someplace has undoubtedly recognized. Expect a SaaS web-based product priced on a subscription basis in the near future. It will be affordable to small biotech companies.

55. anon the II on March 20, 2012 6:59 AM writes...

@ SteveM

There is a product like that called "Loom" from a company called Innectus. It looks intriguing if you're a start-up with 10 or so people. When I looked, it wasn't very expensive (1/10 of an FTE/yr). I don't know how many people would be willing to put all their goods in the cloud.

56. Kling on March 20, 2012 7:03 AM writes...

As a small-biotech person, I have always envied you chemists: at least you understand the need for databases. In my experience at three small biotechs, the execs were biologists who were computationally illiterate, in their 60s, and unaware of the need to even HAVE a database. Data silos abounded, and zero dollars were budgeted toward the effort. I had to learn and develop Access databases, and quickly evolved into being the DBA for the company (which was not my job), constantly trolling people to upload data - which invariably met with resistance from pockets of biologists who just didn't feel the need to share data, leaving everyone else in limbo. I am at my third biotech and am encountering the same problem again. It is funny that the people who complain the most about the lack of centralized data are the same ones who won't upload, since they are computationally challenged to start with. And the execs don't even understand the issue.

One prominent molecular-evolution biotech that went bankrupt, and whose assets we bought, was proud to say they had an Oracle back end handling the data; but when examined, it had no entries. The execs had just ordered their IT people to build one, then laid them off. They only wanted to say they had a DB so they could sell the company. Meanwhile each research group used their own spreadsheets on their laptops, which of course disappeared when the massive layoffs came.

57. LeeH on March 20, 2012 8:20 AM writes...

#56 Kling

I sympathize. One trick, for future reference, is to have a process for the biologists that analyzes their data (i.e. reads the reader output directly) and loads it to the database in one fell swoop. If the data loading is not a separate operation, and it saves them time, the biologists tend to adopt the process.
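
In outline, with an invented file layout and table, that one-step process might look like this: parse the reader export, normalize against the on-plate controls, and write to the database in the same pass.

```python
# Hypothetical sketch: plate-reader CSV -> % inhibition -> database,
# all in one step so data loading is never a separate chore.
import csv
import sqlite3
import statistics

def load_plate(path, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS plate_results "
                 "(compound_id TEXT, result_type TEXT, value REAL)")
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))  # assumed columns: well, compound_id, signal, role
    # neg_ctrl = uninhibited (high signal); pos_ctrl = fully inhibited (low signal)
    high = statistics.mean(float(r["signal"]) for r in rows if r["role"] == "neg_ctrl")
    low = statistics.mean(float(r["signal"]) for r in rows if r["role"] == "pos_ctrl")
    for r in rows:
        if r["role"] != "sample":
            continue
        pct = 100.0 * (high - float(r["signal"])) / (high - low)
        conn.execute("INSERT INTO plate_results VALUES (?, 'pct_inhibition', ?)",
                     (r["compound_id"], pct))
    conn.commit()
```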

58. Cellbio on March 20, 2012 9:13 AM writes...

Yes, the ease of data export from readers and the like is key, along with a means of easy but accurate approval before writing to the database.

Also agree with the comments above re the types of data. Even simple things like storing 5- or 10-point curve data instead of just an IC50 are a big help. IC50 transit or inflection can give you orders-of-magnitude differences in apparent potency in some cases, which is readily understandable with a curve to look at, or at least min and max values.
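
For instance, keeping the raw points means anyone can refit or simply eyeball the curve later. Below is a minimal four-parameter logistic fit with scipy, on made-up data:

```python
# Hypothetical sketch: fit a 4-parameter logistic to stored curve points.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, ic50, hill):
    # % activity falls from 'top' toward 'bottom' as concentration rises
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0])    # uM, made up
resp = np.array([97, 95, 88, 62, 30, 12, 5], dtype=float)  # % activity, made up

params, _ = curve_fit(four_pl, conc, resp, p0=[0.0, 100.0, 0.3, 1.0])
bottom, top, ic50, hill = params
print(f"IC50 = {ic50:.2f} uM, Hill slope = {hill:.2f}")
```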

59. ExperiencedBiologist on March 20, 2012 6:53 PM writes...

I faced this dilemma about three years ago when setting up informatics at a small biotech of about 20 people. Initial investments (software pkg, as well as a server, server operating system) and especially recurring costs (server maintenance, package customization) are significant issues at all small discovery organizations in this financial environment: do you buy Spotfire or a plate reader? These are real-world issues.

I went with a remote server solution -- now, of course, called the cloud. I selected Collaborative Drug Discovery (CDD), cited above, rather than the heavyweight solutions, due to these issues. People brought up security, but the fact is, professional cloud-based companies have better security than the servers set up at most biotechs.

Solutions like Pipeline Pilot (PP) and Spotfire are great, but are not realistic for biotechs on agonizingly tight budgets. You’ve got to balance nice-to-haves with what you can afford and especially, what you can implement. I have been in organizations that spent many dollars on PP, only to have it sit unused because the programmer was never given the “quiet time” required to set it up.

Key issues to consider: make sure chemical and batch registration are as bulletproof as possible, especially when dealing with subtle stereochemistry issues. And instead of spending time building your own tools, spend time developing rules for ensuring your data integrity. As stated above, define your units (is it ug or uM for IC50? My preference is pIC50. How will you describe your ADME data?). Think about how to implement controlled vocabulary (is it "human" or "Human"? "pH6" or "pH 6"? Protease or proteinase? Is that assayist Jon or John?). It sounds silly, but these are real-world issues that can clog up a poorly thought-out data system. It is also valuable to work out a route for "proofreading" (second-eye review prior to posting data). Ensure assay data is accompanied by batch ID (not just molecule ID), along with assay version information. Why assay version? When you change that primary assay from substrate A to substrate B, or from endpoint to kinetic, are you sure it's legitimate to average the results together? And you're not averaging the results of the mouse enzyme assay with the human enzyme assay, are you?
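
Two of those housekeeping rules in miniature, with an illustrative synonym map: store potency as pIC50 so units cannot be mixed, and force free-text fields through a controlled vocabulary on the way in.

```python
# Hypothetical sketch: unit-proof potency plus vocabulary normalization.
import math

def pic50_from_ic50_um(ic50_um):
    """pIC50 = -log10(IC50 in molar); an IC50 of 1 uM gives pIC50 = 6.0."""
    return 6.0 - math.log10(ic50_um)

SPECIES_VOCAB = {"human": "Human", "hu": "Human",   # invented synonym map
                 "mouse": "Mouse", "ms": "Mouse"}

def normalize_species(raw):
    key = raw.strip().lower()
    if key not in SPECIES_VOCAB:
        raise ValueError(f"unrecognized species {raw!r}; extend the vocabulary")
    return SPECIES_VOCAB[key]

print(pic50_from_ic50_um(0.25))      # 6.60
print(normalize_species(" Human "))  # 'Human'
```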

I like CDD because we spent time on those issues, instead of figuring out if we should buy Windows Server 2003 vs. 2008, or set up SQL Server vs. Oracle vs. MySQL. It was a solid package from the get-go: rigorous in terms of chemical registration, flexible in terms of assay definitions, and very flexible with queries. The ability to download the results of queries in xls, csv or sdf formats was used extensively. Since then, I've found their support to be thorough, fast and personal. It does not have the bells and whistles of the other packages. The two biggest issues are 1) as noted above, rendering of structures is not great. Our workaround was to copy and paste SMILES strings into ChemDraw (you're going to have to buy seats of this for all of your chemists anyway; ChemDraw has the best structure rendering hands down, but avoid the ChemOffice add-ons) and 2) no easy mechanism for tracking batch samples. That was one case where we did "roll our own" using MS Access, but we did feed that system with data from CDD. There are other emerging cloud solutions I didn't look into back in 2008 (e.g. Dotmatics noted above, also ChemInnovation) that may also be worth a look. At the end of the day, the cloud solutions are very low maintenance from the IT point of view, and thus one less headache. The other add-on we found useful was ChemAxon's add-in for Excel (the ChemDraw add-in for Excel is not worth the trouble). It is well behaved, project team members liked its tools for data exploration, and it can be fed with .sdf files from CDD (or your preferred system) containing structure and data.

Overall, while PP and Spotfire are great, powerful and fun, the real, day to day data matters of a small discovery company on a tight budget can be met by these more cost-effective tools.

60. Mike Pollastri on March 21, 2012 6:14 AM writes...

Ditto what #35 and #59 have to say about Collaborative Drug Discovery (CDD), from an academic chemist's perspective (especially one who lives his life in collaboration with many remote biological collaborators). CDD allows non-redundant registration of structures (including multiple batches/salt forms), and lets biology collaborators upload screening data from their end. I have several different projects, each of which has access managed on a user-by-user basis (i.e. not all collaborators see all our data/structures). The cost is quite reasonable, especially when you consider that it's hosted (therefore no server upkeep for the end user), and they are very patient trainers. For data analysis, I will most frequently perform the search(es) I want within CDD, then use Pipeline Pilot and Spotfire to visualize. However, you can get almost every imaginable computed chemical property from within CDD.

61. Pharmer on March 21, 2012 6:45 AM writes...

After outgrowing Excel spreadsheets, we looked at a number of options and ended up choosing DeltaSoft's ChemCart, which is working great for us. It handles not only chemical and biological searching, but also compound registration and inventory. In our case, we also need to work with overseas collaborators. It seemed to be one of the few systems where we could easily have custom views of data in both a web (cloud) environment and internally. It also integrates nicely with Spotfire, PipelinePilot, and a number of other tools.

62. pharmchem on March 21, 2012 8:24 AM writes...

I used to use MS Excel and ChemDraw for data analysis. But when the data exceed 200-300 compounds, the list of chemical structures and assay data becomes just a computer file: it was exceedingly difficult to look through the table, and my eyeballs got too tired to scan it. I really wanted structure search in Excel, but it was not possible. I wanted to cluster according to IC50 and CC50, but I couldn't sort on two restraints. Now I have CDD, and it is just magic! I don't need to spend time sorting in Excel. I don't need to spend money on a ChemDraw purchase, since ChemAxon is included in CDD for free. More important are security and accessibility, and CDD's vault system and web-based database meet these two goals simultaneously. If the assay is done outside the US, still no problem: the collaborators can log in to CDD and load their data as .csv or sd files. You get the data there on your desk, and you don't need to add the new data to the existing database - it is done automatically. There are other programs on the market, but secure AND easily accessible CDD is the choice for me and my collaborators. Within limited budgets, it was one of the best choices I have made in the last decade.

63. researchfella on March 23, 2012 3:16 AM writes...

I'm surprised at all the praise from others for Spotfire. Yes, it is great in many ways, but it does not understand chemistry, so you can't do structure/substructure searches. Not medchem-friendly.

I've liked ChemDraw-for-Excel for limited project databases in the past, but in recent years I had so many problems with glitches and/or version-compatibility issues between ChemDraw and Excel/MS Office that it became useless for me and I had to give it up. Accord-for-Excel has done the job adequately since then, but not without some glitch pains.

To establish a larger multi-project database system, we recently evaluated many of the options mentioned above and settled on Dotmatics. A few little glitches and start-up pains, but no regrets so far.

64. Pharma vetern on March 23, 2012 6:57 PM writes...

As a medicinal chemist, I spent quite a number of years at Wyeth grappling with many of the same problems discussed here: 1) getting high-quality data into the database in a timely fashion (we finally switched all the biologists to ActivityBase and strictly "encouraged" adoption); 2) getting access to a high-quality data mining tool with integrated analysis tools (we went through an extensive process with Tripos that led to D360); 3) making sure that everyone had access to the application and that the program had features that encouraged use and collaboration (a great app does little good if only the gear heads can use and modify it); and 4) finally, and very important in this era of shrinking IT support, making sure it had a small IT footprint. I agree that Spotfire is a great analysis tool, but it is very expensive, and most users only use a small fraction of the tools in it, if at all. So we built in two-way integration with Spotfire: expert users who needed more than the built-in analysis tools (which include a bunch of chemistry-specific tools not in SF) could just pull up SF. When Pfizer took over Wyeth (don't get me started on that), they decided to retire RGate and adopt D360. Too expensive for the small biotechs? Nah, it's in use already at companies from big to small and in between.

As an aside, before we went with an integrated solution we had in-house developed tools, ISIS, Pipeline Pilot, Excel with all flavors of plugins, MS Access, ChemFinder and FileMaker Pro all accessing the corporate data. What a mess: all with transportability issues, support issues and usability issues, and if the person maintaining any of them left, you were left with... not much.

65. Nile on March 24, 2012 7:13 AM writes...

TJMC ( #46 ) sounds like an experienced IT professional: you might consider contacting him or her directly and inviting further comment on this topic.

I now work on a trading floor, as a tactical developer for the derivatives traders: I can assure you that we face similar problems in the volume and complexity of our data, and the complexity of the searches and analyses that we (or rather, I) attempt to perform.

I will be looking into Spotfire.

The fact that we can probably purchase it out of the odd-coins box by the biscuit tin in the coffee lounge is a distraction: ALL software costs more than the sticker price, and if it takes up a day a week of a Production Support technician's time just to assist the lead traders, it had better be worth it. Bad software is, of course, always prohibitively expensive when you add up wasted time and losses incurred due to errors.

Meanwhile, it isn't just Big Pharma where promising and profitable lines of enquiry are abandoned because data we *know* exists onsite is just too difficult to get. Financial institutions have the problem too, and big engineering companies are bedevilled by it.

As I say: TJMC would be an interesting correspondent.

66. TJMC on March 29, 2012 9:40 AM writes...

Nile - Thanks for the kind words. But just to be clear, I am more of an "R&D Business Architect" than a pure IT person.

My observations above reflect insights from decades of helping my firm (BMS) or clients to improve how well R&D performs. Sometimes through use of IT, but always by first looking at how all the parts interact and co-depend. - Terry
