Batch-generating chemical information and structure images with Google Spreadsheet

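I have a list of CAS numbers or chemical names on hand (nine times out of ten it lives in Excel) and want to fill in the other basic chemical information, including structure images. Some Excel plug-ins can already do this, but doing it in Google Spreadsheet feels especially exhilarating.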

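All of the principles, code, and instructions were learned from http://metamolecular.com/gchem/.

The whole experiment was carried out through a proxy to get past the Great Firewall.

The basic principles are: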

  • A web service does the chemistry. The one used here is the well-known http://cactus.nci.nih.gov/chemical/structure
  • Google Spreadsheet can run custom scripts (JavaScript), and those scripts can even call UrlFetchApp.fetch(url) to access web pages or web services!
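
The two pieces above can be combined in a few lines of script. The sketch below is not the gChem code; it is a minimal illustration that assumes the resolver's URL pattern is `<base>/<identifier>/<representation>` (for example `smiles` or `image`). The function names `cactusUrl` and `CHEM_LOOKUP` are made up for this example, and `UrlFetchApp` only exists inside Google's script environment.

```javascript
// Base URL of the web service named above.
var CACTUS_BASE = "http://cactus.nci.nih.gov/chemical/structure";

// Build the lookup URL for a chemical name or CAS number.
// Assumed URL pattern: <base>/<identifier>/<representation>.
function cactusUrl(identifier, representation) {
  return CACTUS_BASE + "/" +
    encodeURIComponent(String(identifier).trim()) + "/" + representation;
}

// Hypothetical custom function for a sheet cell, e.g. =CHEM_LOOKUP(A2, "smiles").
// It only works inside Google Spreadsheet, where UrlFetchApp is defined.
function CHEM_LOOKUP(identifier, representation) {
  var url = cactusUrl(identifier, representation || "smiles");
  return UrlFetchApp.fetch(url).getContentText().trim();
}
```

For a structure picture, the URL returned by `cactusUrl(name, "image")` could be handed to the spreadsheet's built-in IMAGE function instead of being fetched by the script.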

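Installing and using the script: http://metamolecular.com/blog/2011/02/22/gchem-easily-convert-names-and-cas-numbers-to-chemical-structures-in-google-spreadsheets/

Result

Other notes

  • When the sheet is exported to an Excel file, the custom scripts stop working; when it is saved as a PDF, the images disappear.
  • On top of that, the firewall in China keeps getting more formidable, so this probably cannot serve as an everyday tool.
  • It is fairly easy to swap in a web service of your own, which could return more accurate and useful results than the current lookups.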

 

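MIOSS 2011 materials download

Thanks to chemhack for the link; I was able to download all of the presentation materials.

https://registration.hinxton.wellcome.ac.uk/display_info.asp?id=246

I have to say the materials are packed with substance; I also have to say I did not understand most of them.

What I did get a feel for is that open source means openness, and openness means integration, so the way the tools are combined is very interesting: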

Speaker                      Cheminformatics software   Data processing / data mining / workflow   Relational database
Gregory Landrum              RDKit                      Knime                                      PostgreSQL
Rajarshi Guha                CDK                        R
Kevin Lawson, LICSS System   CDK                        MS Excel
Dr Katy Wolstencroft         CDK                        Taverna
Thorsten Meinl               RDKit, Indigo              Knime

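In fact, the tool combinations that exist today go far beyond these, for example the long-established mychem = openbabel + mysql. Open-source people consider it beneath them to tinker with Microsoft platforms (is tinkering with Apple any more refined?), so I still need to do my own homework and try to contribute a little.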

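The 60 millionth CAS-registered compound

Original article:

Main points

  1. The 60 millionth compound registered by CAS comes from a Chinese patent; its CAS number is 1298016-92-8.
  2. Since 2009, China has been ahead of every other country in patent filings for new compounds, and this lead is expected to continue.
  3. The 50 millionth compound was registered less than two years earlier, and the pace keeps accelerating. Ten million compounds in under two years works out to well over ten thousand new CAS registrations per day on average (10,000,000 / 730 days ≈ 13,700 per day as a lower bound). That is staggering, and it makes me somewhat doubt the quality of the work at Chemical Abstracts Service (CAS).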

 

Fingerprint similarity and substructure filtering using Lucene

Lucene is a great tool for retrieving information that has been broken into fragments, and it even covers the fragmenting step itself (the Analyzer). That makes it an intuitive way to retrieve molecules from a cheminformatics database. ChemSink's about page (http://www.chemsink.com/about/) also mentions that "The chemical search uses Open Babel, Lucene, and MySQL". I recently got this working and have deployed it on production services.

How it works

The concepts map onto each other as listed below, and the implementation is not hard. My cheminformatics platforms are based on .NET, so I use Lucene.NET.

  1. A molecule is a Document in Lucene.
  2. A fingerprint is a Field. That means other information, or even another set of fingerprints, can be stored and queried as well.
  3. A bit in the fingerprint is a Term.
  4. A substructure search queries Lucene with every bit set in the query molecule's fingerprint as a required term in the result molecules.
  5. A similarity search finds the most relevant molecules.
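
The mapping in the list above can be sketched without Lucene at all. The following is a toy in-memory inverted index in JavaScript, not Lucene.NET code: `bitsToTerms`, `MoleculeIndex`, and the term format `b<position>` are made up for illustration, and fingerprints are given simply as arrays of set-bit positions. Keep in mind that a fingerprint match is only a screen; hits still need to be confirmed by a real substructure match.

```javascript
// Turn the set bits of a fingerprint into term strings ("b5", "b9", ...).
function bitsToTerms(setBits) {
  return setBits.map(function (bit) { return "b" + bit; });
}

// Toy inverted index: term -> set of molecule ids (stand-in for one field).
function MoleculeIndex() {
  this.postings = new Map();
}

// Index one molecule (a "document") by the terms of its fingerprint.
MoleculeIndex.prototype.add = function (id, setBits) {
  var postings = this.postings;
  bitsToTerms(setBits).forEach(function (term) {
    if (!postings.has(term)) postings.set(term, new Set());
    postings.get(term).add(id);
  });
};

// Substructure screen: every term of the query is required (boolean AND).
// An empty query returns no hits in this sketch.
MoleculeIndex.prototype.screen = function (queryBits) {
  var postings = this.postings;
  var result = null;
  bitsToTerms(queryBits).forEach(function (term) {
    var ids = postings.get(term) || new Set();
    result = result === null
      ? new Set(ids)
      : new Set(Array.from(result).filter(function (id) { return ids.has(id); }));
  });
  return result === null ? [] : Array.from(result).sort();
};

var index = new MoleculeIndex();
index.add("mol-A", [1, 5, 9, 12]);
index.add("mol-B", [1, 5, 7]);
index.add("mol-C", [5, 9]);
console.log(index.screen([5, 9])); // ids of molecules that have bits 5 and 9 set
```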

Similarity and Lucene

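The most popular similarity measure in use is the Tanimoto coefficient, T(A, B) = c / (a + b - c), where a and b are the numbers of bits set in fingerprints A and B and c is the number of bits set in both.

As I understand it, this coefficient is so widely used mostly because it is simple and fast to compute, which matters in time-critical cases such as online searching.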
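
As a small illustration, the coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below assumes fingerprints are given as arrays of set-bit positions; the function name `tanimoto` is chosen for this example only.

```javascript
// Tanimoto coefficient of two fingerprints given as arrays of set-bit positions.
function tanimoto(bitsA, bitsB) {
  var a = new Set(bitsA);
  var b = new Set(bitsB);
  var common = 0;
  a.forEach(function (bit) { if (b.has(bit)) common += 1; });
  var union = a.size + b.size - common;
  // Two empty fingerprints are treated as having similarity 0 here.
  return union === 0 ? 0 : common / union;
}

// 2 bits in common (1 and 5) out of 5 distinct bits overall.
console.log(tanimoto([1, 5, 9, 12], [1, 5, 7]));
```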

Lucene employs cosine similarity with the Vector Space Model (VSM) of information retrieval.

http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/search/Similarity.html

In the scoring formula documented there, the norm(t,d) factor is the product of three parts; the lengthNorm part accounts for the effect of the document length (the total number of fingerprint bits set for a molecule).
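
For reference, the scoring function described on the Similarity page linked above has, as far as I can reproduce it here, the following shape. The transcription and the shortened symbol names (boost, docBoost, fieldBoost) are mine, so check it against the linked documentation.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot
  \mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Big)

\mathrm{norm}(t,d) = \mathrm{docBoost}(d)\cdot\mathrm{lengthNorm}(\mathrm{field})\cdot
  \prod_{f \in d,\ f \text{ named as } t}\mathrm{fieldBoost}(f)
```

With no boosts set, norm(t,d) reduces to the lengthNorm term alone.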

The default Similarity implementation in Lucene is ready to use for retrieving similar molecules.

More about the lengthNorm

During my research phase I found that several different target molecules got exactly the same score. It turned out that the lengthNorms were identical for these molecules even though they have different numbers of fingerprint bits set. The lengthNorm is not calculated at search time but pre-calculated and stored at indexing time. Eventually I found this sentence in the Lucene documentation.

"However the resulted norm value is encoded as a single byte before being stored… comes with the price of precision loss"

So my molecules are treated as documents of the same length when they are queried.

Molecule ID   # of bits set   lengthNorm calculated in indexing phase   lengthNorm stored
11096         23              0.2085                                    0.1875
11578         27              0.19245                                   0.1875
201736        28              0.18898                                   0.1875

The formula to calculate lengthNorm is 1.0 / Math.Sqrt(bits_set)
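
The rounding in the table can be reproduced with a few lines of JavaScript. This is a sketch modeled on my understanding of the single-byte float format Lucene uses for norms; the bit layout and the function names `encodeNorm` and `decodeNorm` are assumptions of this example, not a copy of Lucene's code.

```javascript
var view = new DataView(new ArrayBuffer(4));

// Squeeze a non-negative 32-bit float into one byte by dropping low-order bits.
function encodeNorm(f) {
  view.setFloat32(0, f);
  var bits = view.getInt32(0);
  var small = bits >> 21;       // keep the exponent and the top mantissa bits
  var zero = (63 - 15) << 3;    // offset of the smallest representable exponent
  if (small <= zero) return bits <= 0 ? 0 : 1;
  if (small >= zero + 0x100) return 255;
  return small - zero;
}

// Expand the byte back into a float; the dropped bits are gone for good.
function decodeNorm(b) {
  if (b === 0) return 0;
  view.setInt32(0, ((b & 0xff) << 21) + ((63 - 15) << 24));
  return view.getFloat32(0);
}

// The lengthNorm formula from the text.
function lengthNorm(bitsSet) {
  return 1.0 / Math.sqrt(bitsSet);
}

// Print bit count, exact lengthNorm, and the value that survives the byte encoding.
[23, 27, 28].forEach(function (bitsSet) {
  var exact = lengthNorm(bitsSet);
  console.log(bitsSet, exact.toFixed(5), decodeNorm(encodeNorm(exact)));
});
```

Under these assumptions all three bit counts decode to 0.1875: the representable values near 0.2 are 0.1875 and 0.21875, and the encoding truncates rather than rounds.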

Fortunately, the DefaultSimilarity class can be overridden, including its lengthNorm function. Judging from Duan Lian's diagram of the distribution of the number of fingerprint bits set, a function could map this distribution to a flatter one; using that function to calculate lengthNorm would take full advantage of the precision-limited stored value.

Here is the distribution of fingerprint darkness for my database of 80000 commercial compounds.
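
One way to build such a flattening function, offered here as an assumption rather than the method actually used, is to replace the bit count by its rank in the observed distribution (an empirical cumulative distribution), so that frequently occurring bit counts are spread further apart. `makeFlatLengthNorm` and the sample counts are hypothetical, and how well the result uses the stored byte values still depends on the byte encoding.

```javascript
// Build a lengthNorm replacement from the observed numbers of bits set.
// The returned function maps a bit count to a value in (0, 1]: roughly the
// fraction of observed molecules with at least that many bits set. The +1
// terms keep the result above zero for counts larger than any observed one.
function makeFlatLengthNorm(observedBitCounts) {
  var sorted = observedBitCounts.slice().sort(function (x, y) { return x - y; });
  return function (bitsSet) {
    // Binary search: how many observed molecules have fewer bits set?
    var lo = 0, hi = sorted.length;
    while (lo < hi) {
      var mid = (lo + hi) >> 1;
      if (sorted[mid] < bitsSet) lo = mid + 1; else hi = mid;
    }
    return (sorted.length - lo + 1) / (sorted.length + 1);
  };
}

// Hypothetical sample of bit counts; a real one would come from the database.
var flatNorm = makeFlatLengthNorm([20, 23, 23, 27, 28, 28, 28, 35]);
[23, 27, 28].forEach(function (bitsSet) {
  console.log(bitsSet, flatNorm(bitsSet));
});
```

Like 1 / sqrt(bits_set), the remapped value decreases as more bits are set, so longer "documents" are still penalized; only the spacing between values changes.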

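Chemene JSDraw, another excellent JavaScript-based chemical structure editor

JSDraw is an online chemical structure editor written in JavaScript by Tony Yuan; it can also serve as a display tool for chemical structures. The results look great.

The similar software I knew of before was Chemhack jsMolEditor and Chemdoodle.

JSDraw's notable features: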

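  • High-quality rendering of structure images.
  • Convenient input. The most convenient online structure input tool I have used is Chemwriter, and JSDraw's input method is almost identical to it; typing N, S, O, Cl, and other non-carbon atoms directly from the keyboard is especially useful.
  • It has a "partial selection" feature and supports Ctrl-drag to copy, just as in Visio.
  • It supports drawing reaction schemes and exporting RXN, and it can automatically tidy up the layout of a drawn reaction scheme.
  • It is distinctive in how it is invoked. The web page author does not need to write separate JS code; it is enough to assign specific attributes to the region (a div element) that should display or edit a structure.
  • It supports IE very well, which jsMolEditor cannot do yet.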

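Tony currently works at Novartis. Mentioning this company naturally brings to mind Peter Ertl, who developed the famous JME many years ago and saw it widely adopted. It shows how open the company's culture is and how it contributes to the field from many angles.

However, JSDraw is not open source, nor is it free for commercial use. In this respect Duan Lian's jsMolEditor is much more open (LGPL).

JSDraw does not yet export SMILES, but that is in Tony's plans. If the JavaScript SMILES writer I wrote can be of some help, I would be greatly honored.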

Quickly building a SMILES structure viewer with Dingo and WPF

Indigo was recently introduced as an open-source chemoinformatics toolkit <1>, <2>.

Dingo is a molecule and reaction rendering library included in Indigo. It comes with a .NET C# wrapper, which makes it possible for me to build a WPF app with it. Dingo and WPF make things so simple.

It is an amazing feeling to stand on the shoulders of giants. Not much code needs to be pasted here.

Compound class to transform SMILES to Bitmap objects with Dingo

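    using System;
    using System.Runtime.InteropServices;
    using System.Windows;
    using System.Windows.Media.Imaging;

    // Wraps a SMILES string and exposes a rendered image for WPF data binding.
    public class Compound
    {
        public String smi { get; set; }
        public Compound(String smiles)
        {
            smi = smiles;
        }
        // Rendered on every access; bound to the Image element in the item template.
        public BitmapSource bmp
        {
            get { return Smi2Bitmap(smi, 160, 160); }
        }

        [DllImport("gdi32.dll")]
        private static extern bool DeleteObject(IntPtr hObject);

        // Converts a GDI+ bitmap into a WPF BitmapSource.
        public static BitmapSource loadBitmap(System.Drawing.Bitmap source)
        {
            IntPtr hBitmap = source.GetHbitmap();
            try
            {
                return System.Windows.Interop.Imaging.CreateBitmapSourceFromHBitmap(
                    hBitmap, IntPtr.Zero, Int32Rect.Empty,
                    BitmapSizeOptions.FromEmptyOptions());
            }
            finally
            {
                // GetHbitmap allocates a GDI handle that must be released explicitly.
                DeleteObject(hBitmap);
            }
        }
        // Renders a SMILES string to a bitmap of the given size with Dingo.
        public static BitmapSource Smi2Bitmap(String smi, int Width, int Height)
        {
            Dingo dg = new Dingo();
            dg.loadMolecule(smi);
            dg.setBackgroundColor(System.Drawing.Color.White);
            System.Drawing.Bitmap bmp = dg.renderToBitmap(Width, Height);
            return loadBitmap(bmp);
        }
    }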

SMILES file reader

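    using System;
    using System.Collections;
    using System.IO;

    // Enumerates the compounds in a .smi file, one SMILES per line.
    public class SMIReader : IEnumerable
    {
        private readonly String path;
        public SMIReader(String path)
        {
            this.path = path;
        }
        public IEnumerator GetEnumerator()
        {
            // Open the file per enumeration so the reader is always disposed
            // and the sequence can be enumerated more than once.
            using (StreamReader sr = new StreamReader(path))
            {
                String line;
                while ((line = sr.ReadLine()) != null)
                {
                    if (line.Trim().Length == 0) continue; // skip blank lines
                    // The SMILES is the first tab- or space-separated token.
                    yield return new Compound(line.Split('\t')[0].Split(' ')[0]);
                }
            }
        }
    }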

XAML code of the listbox

Note: To show as many structure image blocks on screen as possible, a WrapPanel is used as the ItemsPanel. This hurts performance for large files, because all SMILES are read in at once, whereas they are lazily loaded in the default StackPanel mode.

            <ListBox Name="list1" Height="539" Width="790"
                     ItemTemplate="{StaticResource MolListboxTemplate}"
                     ScrollViewer.HorizontalScrollBarVisibility="Disabled"
                     IsSynchronizedWithCurrentItem ="True">
                <ListBox.ItemsPanel>
                    <ItemsPanelTemplate>
                        <WrapPanel/>
                    </ItemsPanelTemplate>
                </ListBox.ItemsPanel>
            </ListBox>

Item template of the listbox, for displaying structure images

    <Window.Resources>
        <DataTemplate x:Key="MolListboxTemplate">
            <WrapPanel Margin="3">
                <StackPanel>
                    <Image Source="{Binding bmp}" Height="160" Width="160"></Image>
                    <TextBox Text="{Binding smi}" Width="160"></TextBox>
                </StackPanel>
            </WrapPanel>
        </DataTemplate>
    </Window.Resources>

Button event handler to trigger data binding

        private void button1_Click(object sender, RoutedEventArgs e)
        {
            System.Windows.Forms.OpenFileDialog ofd = new System.Windows.Forms.OpenFileDialog();
            ofd.Filter = "SMILES file(*.smi)|*.smi";
            if (ofd.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                String path = ofd.FileName;
                textBlock1.Text = path;
                list1.ItemsSource = new nchem.SMIReader(path);
            }
        }

Done.

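The most important features of online chemical structure input software

Rich is holding a contest with a ChemWriter license as the prize, asking people to write about the important features of structure input software. I figured I might as well write mine in Chinese. This is limited to online editors only.

File importing

Nobody's research is carried out inside an online editor; structures mostly exist as files, such as Chemdraw CDX files. Redrawing a structure is downright painful, and no R&D assistant or purchasing assistant is happy to do it. Hence the need for an import function. Import can also make up for missing features, because the structures have already been prepared in other commercial software.

For importing structure files online, the ActiveX-based Chemdraw has the most complete functionality; the hefty JChemPaint supports input such as InChI/SMILES; Chemwriter can import mol files; and unless the web developer writes some extra code, JME can only be drawn in by hand.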

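Plugin size

Since it is used online, size matters. For example, JChemPaint is 382 kB, Chemwriter 111 kB, and JME 40 kB.

Easy to input

Chemwriter comes out on top here. It highlights atoms and bonds when the mouse hovers over them, changes the bond type when a bond is clicked repeatedly, lets you type element names directly on the keyboard, and has other practical features like these.

It also looks rather good.

So, all in all, I would very much like to use Chemwriter, which is why I wrote the above, hoping for a bit of luck.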

 

MOL Reader and SMILES Writer in Javascript

Yes, I have named it jsBabel, even though it has very little functionality so far.

It is

  • Written in JavaScript.
  • An implementation of spanning-tree generation and serialization.
  • A model of chemical compounds in JS.
  • Meant to help enhance the JS molecule editors Chemhack jsmoleditor and Chemdoodle. They are excellent ideas but have no SMILES output function.
  • In beta, with many known and unknown errors.
  • Going to be open source if one or more people are interested in it.
  • Free to be downloaded, modified, and republished now.
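
To give an idea of what reading a MOL file and serializing a spanning tree to SMILES involves, here is a compact sketch. It is not the jsBabel source: the function names `readMol` and `writeSmiles` and the molecule object layout are invented for this illustration, and it assumes a V2000 molfile with organic-subset atoms and bond orders 1 to 3 only (no charges, aromatic bonds, stereo, or canonical ordering).

```javascript
// Parse the atom and bond blocks of a V2000 molfile (fixed-width columns).
function readMol(molText) {
  var lines = molText.split(/\r?\n/);
  var counts = lines[3];                      // 4th line: atom and bond counts
  var nAtoms = parseInt(counts.slice(0, 3), 10);
  var nBonds = parseInt(counts.slice(3, 6), 10);
  var atoms = [], bonds = [];
  for (var i = 0; i < nAtoms; i++) {
    atoms.push({ symbol: lines[4 + i].slice(31, 34).trim() });
  }
  for (var j = 0; j < nBonds; j++) {
    var line = lines[4 + nAtoms + j];
    bonds.push({
      a: parseInt(line.slice(0, 3), 10) - 1,  // molfile atom numbers start at 1
      b: parseInt(line.slice(3, 6), 10) - 1,
      order: parseInt(line.slice(6, 9), 10)
    });
  }
  return { atoms: atoms, bonds: bonds };
}

// Write a non-canonical SMILES by walking a spanning tree of each fragment.
function writeSmiles(mol) {
  var symbols = { 1: "", 2: "=", 3: "#" };
  function bondSymbol(order) {
    if (!(order in symbols)) throw new Error("unsupported bond order " + order);
    return symbols[order];
  }
  var adjacency = mol.atoms.map(function () { return []; });
  mol.bonds.forEach(function (bond, id) {
    adjacency[bond.a].push({ to: bond.b, order: bond.order, id: id });
    adjacency[bond.b].push({ to: bond.a, order: bond.order, id: id });
  });
  var visited = [], usedBond = [], ringNumber = 0;
  var ringLabels = mol.atoms.map(function () { return ""; });
  var children = mol.atoms.map(function () { return []; });

  // Pass 1: depth-first search; a bond leading back to a visited atom closes a ring.
  function buildTree(atom) {
    visited[atom] = true;
    adjacency[atom].forEach(function (edge) {
      if (usedBond[edge.id]) return;
      usedBond[edge.id] = true;
      if (visited[edge.to]) {
        ringNumber += 1;
        var label = ringNumber < 10 ? String(ringNumber) : "%" + ringNumber;
        ringLabels[atom] += bondSymbol(edge.order) + label;
        ringLabels[edge.to] += label;
      } else {
        children[atom].push(edge);
        buildTree(edge.to);
      }
    });
  }

  // Pass 2: serialize the tree; all but the last branch go in parentheses.
  function serialize(atom) {
    var text = mol.atoms[atom].symbol + ringLabels[atom];
    children[atom].forEach(function (edge, k) {
      var branch = bondSymbol(edge.order) + serialize(edge.to);
      text += k < children[atom].length - 1 ? "(" + branch + ")" : branch;
    });
    return text;
  }

  var fragments = [];
  mol.atoms.forEach(function (unused, atom) {
    if (!visited[atom]) { buildTree(atom); fragments.push(serialize(atom)); }
  });
  return fragments.join(".");
}

// Hand-written ethanol molfile used as sample input.
var ethanol = [
  "ethanol", "  hand-written example", "",
  "  3  2  0  0  0  0  0  0  0  0999 V2000",
  "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
  "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
  "  1  2  1  0",
  "  2  3  1  0",
  "M  END"
].join("\n");
console.log(writeSmiles(readMol(ethanol))); // CCO
```

The two passes are kept separate so that every ring-closure label is known before any atom is written out.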

Have a nice weekend.

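ChemDeposit version 3 released

ChemDeposit is a service running on Google App Engine (Java) for generating, storing, and searching compound information.

A working version was already deployed on June 15, but it could not support substructure search; it could only do an "approximate substructure search" using fingerprints.

Adjusting the CDK version and adding substructure search (SMARTS searching) took a whole week, with lots of small problems along the way. The hardest kind was code that ran fine on the local SDK but failed once uploaded. Harder still was the case where, after the data structure was changed, the existing data became invalid and could not be deleted.

The related problems and how they were solved are described here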

Jun 20, 2009

Version 3.

  • Using CDK jar version 1.2.3
  • Substructure Searching supported. That means the CDK SMARTS searching function works.
  • Similar Structure Searching (Tanimoto) supported.
  • Has a domain name www.chemdeposit.com

Hosting CDK on Google App Engine

Google App Engine has supported Java for a while. I don't know whether others have tried hosting the CDK jar on App Engine. Last weekend I tried to build an App Engine application based on CDK, and fortunately most parts seem to work.

 
I have built an app named ChemDeposit; it can be visited at http://chemdeposit.appspot.com/
 
It accepts SMILES input to add or view information generated by CDK. There is also an *almost* substructure search, implemented using only the default fingerprints.
 
The InChI generator and SMARTS search functions do not work on App Engine. I have pasted the error messages at http://chemdeposit.appspot.com/cdk-on-appengine.jsp and hope someone can help.
