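Online services for generating chemical structure images

Several websites offer resources that can generate structure images online. The most notable is the service from Daylight, covered in detail in a separate post, "A structure image generation service: Daylight SMI2GIF". Daylight's service takes the structure as a SMILES string and exposes a rich set of parameters for tuning the output.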

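NIST is the U.S. National Institute of Standards and Technology. The NIST WebBook is a long-established free compound database with extensive physical and chemical property data. Its compound identifiers are in fact CAS numbers: strip the hyphens from the CAS number and prefix the digits with "C", and you have the address of the structure image.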

http://webbook.nist.gov/cgi/cbook.cgi?Struct=C490119

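The NIST WebBook is not very large, with only a few tens of thousands of records. Perhaps because it is so old (it has not been updated since 2005), it also contains erroneous data. At the time of writing, the example above still shows an incorrect structure. I reported the problem by email and do not know when it will be fixed.

NLM is the National Library of Medicine. Its ChemIDplus database is also keyed by CAS number. It holds far more data than NIST, and the quality of its structure images is better.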

http://chem.sis.nlm.nih.gov/chemidplus/RenderImage?maxscale=30&width=200&height=200&superlistid=000490119
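Both URL patterns above can be derived mechanically from a CAS number. The following is a minimal sketch, assuming (from the single example shown for each service) that NIST wants the digits prefixed with "C" and that ChemIDplus wants them zero-padded to nine digits. The function names are made up for illustration, and neither service is guaranteed to still answer at these addresses.

```python
def cas_digits(cas):
    """Strip the hyphens from a CAS number such as '490-11-9'."""
    digits = cas.replace("-", "")
    if not digits.isdigit():
        raise ValueError("not a CAS number: %r" % cas)
    return digits


def nist_structure_url(cas):
    # Assumption: the WebBook ID is simply 'C' followed by the CAS digits.
    return "http://webbook.nist.gov/cgi/cbook.cgi?Struct=C" + cas_digits(cas)


def chemidplus_image_url(cas, size=200):
    # Assumption: superlistid is the CAS digits left-padded with zeros to 9 places.
    return ("http://chem.sis.nlm.nih.gov/chemidplus/RenderImage"
            "?maxscale=30&width=%d&height=%d&superlistid=%s"
            % (size, size, cas_digits(cas).zfill(9)))


print(nist_structure_url("490-11-9"))
print(chemidplus_image_url("490-11-9"))
```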

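As compound identifiers go, SMILES is an open standard that directly encodes the structure, so it is well worth using. CAS numbers are neither open nor free, yet they have become the de facto industry standard. The only thing I think now stands alongside CAS is NCBI's PubChem database. NCBI is the National Center for Biotechnology Information. Among online databases, the PubChem Compound ID (cid) is cited almost everywhere, so I am including, somewhat reluctantly, its image generation interface, which takes a cid as the structure parameter. This interface also has many parameters for adjusting the output.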

http://pubchem.ncbi.nlm.nih.gov/image/imagefly.cgi?cid=10273&width=400&height=400

There is a very good article, Thirty-Two Free Chemistry Databases; a careful read may turn up even more.

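A structure image generation service: Daylight SMI2GIF

Daylight, the company behind SMILES, provides a very practical web service that generates a structure image online from a compound's SMILES string. The tool is SMI2GIF, built on Daylight's "HTTP Toolkit" product. You can buy the product and host the web service yourself, or use the service Daylight hosts online.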

The image below is the simplest possible online use.

The HTML for the image is

        <img src="http://www.daylight.com/dayhttp/smi2gif?smiles=Oc1ccccc1"></img>
    

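Daylight provides a reference document that describes the interface's parameters and features in great detail. Below are a few examples of the simpler parameters that I understand myself.

Image height and width, line thickness, and output format (PNG/GIF)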

        http://www.daylight.com/dayhttp/smi2gif?width=100&height=100&smiles=O%3DC1CCCCC1&Linewidth=thick&output=PNG
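Request URLs like the one above are easier to build programmatically than by hand, since the SMILES string has to be percent-encoded. Here is a small sketch using only the standard library; `smi2gif_url` is a hypothetical helper name, and the option names are just the ones quoted in this post.

```python
from urllib.parse import urlencode

SMI2GIF = "http://www.daylight.com/dayhttp/smi2gif"


def smi2gif_url(smiles, **options):
    """Build a SMI2GIF request URL; urlencode percent-encodes the SMILES."""
    params = {"smiles": smiles}
    params.update(options)
    return SMI2GIF + "?" + urlencode(params)


print(smi2gif_url("O=C1CCCCC1", width=100, height=100, output="PNG"))
print(smi2gif_url("C[C@@H](N)C(=O)O", hide_chi_h="false"))
```

Note that `urlencode` escapes more characters than the hand-written URLs in this post do (brackets and parentheses, for example); both forms decode to the same SMILES.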

Color schemes

Several basic color modes are provided:

COB - color on black
COW - color on white
COP - color on paper
BOW - black on white
BOP - black on paper
WOB - white on black
WOP - white on paper

        http://www.daylight.com/dayhttp/smi2gif?smiles=O%3DC1CCCCC1&colormode=COW
    

Specifying colors for individual atoms

        http://www.daylight.com/dayhttp/smi2gif?numcolors=10&tdt=%24SMI%3COCCCCCCCCCC%3EALAB%3C0.0%2C.1%2C.2%2C.3%2C.4%2C.5%2C.6%2C.7%2C.8%2C.9%2C1.0%3E%7C
    

Showing or hiding stereochemistry

        http://www.daylight.com/dayhttp/smi2gif?hide_chi_h=false&smiles=C[C%40%40H](N)C(%3DO)O
    

Highlighting a substructure

This feature is typically used when displaying the results of a substructure search.

        http://www.daylight.com/dayhttp/smi2gif?smiles=O%3DC1CCCCC1&highlight=O%3DC
    

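URL encoding

The SMILES string in the query URL should be URL-encoded (RFC 1738). SMI2GIF also supports a compact encoding that omits the percent signs: for example, O=C1CCCCC1 can be written directly as 4f3d4331434343434331, which is the hexadecimal ASCII code of each character in turn.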
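The compact form is just the ASCII bytes of the SMILES string written as hexadecimal. A short sketch reproduces the example above; the helper names are made up, and how SMI2GIF itself tells the two encodings apart is not covered here.

```python
def hex_encode(smiles):
    """Write each ASCII character of the SMILES as two hex digits."""
    return smiles.encode("ascii").hex()


def hex_decode(encoded):
    """Recover the SMILES from its hex form."""
    return bytes.fromhex(encoded).decode("ascii")


print(hex_encode("O=C1CCCCC1"))              # 4f3d4331434343434331
print(hex_decode("4f3d4331434343434331"))    # O=C1CCCCC1
```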

Parameter abbreviations

Each SMI2GIF parameter can be abbreviated according to the following mapping.

Option       Abbreviation
colormode    c
fromto       f
height       he
hide_chi_h   hi
hlen_pct     hl
hydrogens    hy
linewidth    linew
numcolors    n
old_style    ol
orient       or
output       ou
reaction     r
scale        sca
schematic    sch
smiles       smil
smirks       smir
tdt          t
width        w
xsmiles      x

A compound database with structure searching, built on MS Access




Why MS Access

MS Access is widely used, easy to obtain, and among the most user-friendly database products. I know many people manage compound information in MS Excel, and commercial solutions exist for that, for much the same reasons.

The solution described here is for

  • Small product catalogs.
  • Projects that need fast development.
  • Cases where the user interface and workflow are crucial, or where chemical information is only a minor part of the database.

Demonstration of functions


Platform and tools used

Solution step by step

  • Sample data

The SDF files are downloaded directly from the PubChem FTP site; they are free and public.

  • Import SDF file into Access

Because each compound record embeds a structure description (a mol file), it is not easy to parse an SDF file into CSV or another table-based format. I wrote a simple Python script to transform the SDF file into an XML file that Access accepts. To get the XML template, create the desired table in the Access database and export it to XML.

When the data is ready, just run the import command from Access.

You can also try RDKit to manipulate SDF files and structure information in Python; it is a professional-grade toolkit.
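The original conversion script is not shown, so the following is only a sketch of what the SDF-to-XML step might look like. It assumes the usual SDF layout (a mol block ending in `M  END`, then `> <TAG>` data items, then `$$$$`), and it assumes a bare `<dataroot>` wrapper with one element per row; the exact wrapper and any attributes should be copied from the template that Access exports. The table name `sample_data` is borrowed from the query later in this post.

```python
import xml.etree.ElementTree as ET


def parse_sdf(text):
    """Split SDF text into records: the mol block plus each '> <TAG>' data item."""
    records = []
    for chunk in text.replace("\r\n", "\n").split("$$$$"):
        if "M  END" not in chunk:
            continue  # whitespace after the last record
        if chunk.startswith("\n"):
            chunk = chunk[1:]  # newline that followed the previous '$$$$'
        molblock, _, data = chunk.partition("M  END")
        record = {"mol": molblock + "M  END"}
        tag = None
        for line in data.splitlines():
            if line.startswith(">"):
                # Data header, e.g. '> <PUBCHEM_COMPOUND_CID>'
                tag = line[line.index("<") + 1:line.rindex(">")]
                record[tag] = ""
            elif tag is not None and line.strip():
                record[tag] = (record[tag] + "\n" + line).strip()
        records.append(record)
    return records


def to_access_xml(records, table="sample_data"):
    """Serialize records as <dataroot><table><field>value</field>...</table></dataroot>."""
    root = ET.Element("dataroot")  # assumed wrapper; check the exported template
    for record in records:
        row = ET.SubElement(root, table)
        for field, value in record.items():
            ET.SubElement(row, field).text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `to_access_xml(parse_sdf(open(path).read()))` on a downloaded file would give the XML text to import; field names that are not valid XML element names would need renaming first.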

  • Generation of structure pictures

While transforming the SDF file with Python, a few further steps generate BMP structure pictures.

Use Molconvert to convert the mol file into PNG format

import os
os.system("molconvert png %s/%s.mol -o %s/%s.png" % (folder, molid, folder, molid))

Use PIL to convert PNG to BMP

from PIL import Image
Image.open("%s/%s.png" % (folder, molid)).save("%s/%s.bmp" % (folder, molid))

BMP pictures are needed because native Access only accepts BMP pictures as OLE objects for display in a form control.

  • Embed OLE objects with VBA

Public Sub load_img()
    ' Link each record's BMP file into the OLE control "pic".
    Me.Recordset.MoveFirst
    While Not Me.Recordset.EOF
        Me.pic.OLETypeAllowed = acOLELinked
        ' The path separators appear to have been lost in the original; "\img\" is assumed.
        Me.pic.SourceDoc = CurrentProject.Path & "\img\" & Me.cid & ".bmp"
        Me.pic.Action = acOLECreateLink
        Me.pic.SizeMode = acOLESizeStretch
        Me.Recordset.MoveNext
    Wend
End Sub

  • JME as structure query input

JME is a Java applet, and I do not know of any way to combine it with Access directly. So a web browser control is added to the Access form, and JME is embedded in the page the control loads. Accessing the DOM object turned out to be simpler than I had imagined.

mol = Me.WebBrowser2.Document.applets.Item(0).MolFile()

  • Implementation of the structure search function

This is truly the key technique of the solution. It is powered by the open-source checkmol/matchmol. In fact, it was right after reading the usage section in the source code of its DLL version that I began to think about doing something with Access. Access uses the Jet SQL engine, which is not as powerful as T-SQL but does support VBA functions. VBA code can easily call external DLL functions, so Jet SQL can be extended considerably.

So the structure search code is really simple. MatchMol returns a VBA Boolean, and True is -1 in Jet SQL, so the test is written as <> 0:


select * from sample_data where MatchMol(query_mol, mol) <> 0


The MatchMol function is defined on top of the DLL API of checkmol/matchmol.

Code copied from MATCHMOLDLL.pas

--------------------------------------------------------------------------------
Private Declare Sub mm_SetMol Lib "matchmolDLL.dll" (ByVal st As String)
Private Declare Sub mm_SetCurrentMolAsQuery Lib "matchmolDLL.dll" ()
Private Declare Function mm_Match Lib "matchmolDLL.dll" (ByVal Exact As Boolean) As Long

Private Declare Function mm_GetRings Lib "matchmolDLL.dll" () As Long
Private Declare Function mm_GetAtomRing Lib "matchmolDLL.dll" (ByVal AtomNumber As Long) As Long
Private Declare Sub mm_Version Lib "matchmolDLL.dll" (ByVal st As String)

Public Function MatchMol(Needle As String, Haystack As String, Optional ExactMatch As Boolean = False) As Boolean
    Static oldNeedle As String
    If oldNeedle <> Needle Then
        oldNeedle = Needle
        mm_SetMol Needle
        mm_SetCurrentMolAsQuery
    End If

    mm_SetMol Haystack
    If mm_Match(ExactMatch) <> 0 Then MatchMol = True

End Function
--------------------------------------------------------------------------------

Performance issues

With about 2,700 compounds imported, this is a small database.

  • OLE pictures in BMP format consume a lot of space. The database file grew to 600 MB.
  • Substructure searching is fast enough that no delay is noticeable.
  • If functional groups or fingerprints were introduced as a database index, substructure searching could achieve good performance on large datasets.
  • Importing the data is slow, and generating the structure pictures costs far more time and CPU. It took about half an hour to convert the 2,700 BMP files on my ThinkPad.
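The fingerprint idea mentioned in the list works as a cheap pre-filter: a molecule can only contain the query as a substructure if every feature bit set for the query is also set for the molecule. The sketch below illustrates just that screening step with made-up 4-bit fingerprints stored as Python integers; how the bits are computed (functional groups from checkmol, hashed paths, and so on) is left open, and only the survivors would go on to the expensive MatchMol comparison.

```python
def passes_screen(query_fp, candidate_fp):
    """True if the candidate has every bit that the query has."""
    return query_fp & candidate_fp == query_fp


def screen(query_fp, fingerprints):
    """Return the IDs worth passing on to the full substructure match."""
    return [cid for cid, fp in fingerprints.items()
            if passes_screen(query_fp, fp)]


# Hypothetical fingerprints keyed by compound ID.
fingerprints = {1: 0b1011, 2: 0b0110, 3: 0b1111}
print(screen(0b0011, fingerprints))  # [1, 3]
```

The screen can produce false positives but never false negatives, provided each bit really denotes a feature that a substructure passes on to its superstructures.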

Other References


Appendix 1

Such an Access database can NOT serve ASP.NET or other applications that connect through ODBC/DAO. As MSDN KB Q166113, "You cannot use user-defined modules through ODBC or DAO", puts it: "ODBC and DAO do not use or know anything about the code modules inserted into an .mdb file by Access. Only Access recognizes the modules."

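A chemistry dictionary for the Google Pinyin input method

Introduction

A dictionary file for Google Pinyin containing a large number of Chinese chemistry terms, mostly compound names. The vocabulary is large: 15,976 Chinese compound names, and more than 60,000 pinyin entries once the various readings of polyphonic characters (including misspelled ones) are counted. That is far more than the thousand or so entries in Sogou Pinyin's officially recommended "Complete Chemistry Vocabulary" dictionary.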

google pinyin dict for chemist

Author: zh.charlie@gmail.com

Usage

Import it from the "Properties" settings of Google Pinyin.

googlepinyinimport.png

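Data and method

The Chinese compound names were sampled from the Chemblink.com website.

The term extraction program is written in Python. The regular expression used to pull Chinese characters out of a unicode string is: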

ur'([\u4e00-\u9fa5]+)'
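The `ur'...'` prefix above is Python 2 syntax. In Python 3 every `str` is already unicode, so the same pattern can be written as a plain raw string. A brief sketch follows; the sample text is invented for illustration.

```python
import re

# Same character range as above: the basic CJK Unified Ideographs block.
HANZI = re.compile(r"[\u4e00-\u9fa5]+")


def extract_hanzi(text):
    """Return every run of consecutive Chinese characters in the text."""
    return HANZI.findall(text)


print(extract_hanzi("苯酚 (phenol), 环己酮 cyclohexanone"))  # ['苯酚', '环己酮']
```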

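The Chinese-to-pinyin conversion uses the Python code and database that roy posted on the SMTH (水木) BBS.

The Google Pinyin dictionary format, and how to analyze it, were covered in the previous post.

License

Use it however you like. Feel free to repost, modify, and use it; no attribution is required. The author cannot guarantee or take responsibility for the correctness or completeness of the dictionary.


Download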

google.pinyin.dict.for.chemists.zip

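InChIKey

IUPAC (International Union of Pure and Applied Chemistry) has released a new (beta) version of InChI (the IUPAC International Chemical Identifier) that adds the InChIKey. In short, it attaches a 25-character digest to an InChI.

InChI is already an identifier, so why give it a key? Compared with CAS numbers, MDL numbers and the like, the advantage of InChI as a chemical identifier is that the information it carries is equivalent to the structure itself. But it is too long and too complex, and it contains special symbols besides letters and digits, which makes it hard to use in two important areas. One is as the compound ID in chemical information systems: imagine how inconvenient an identifier of unbounded length is in an Excel sheet. The other is as a searchable keyword on the Internet.

Does that make the InChIKey the same kind of thing as a CAS or MDL number? Certainly not. First, it is free and open (under the terms of the GNU Lesser General Public License). Second, InChI stands behind it. Given an InChIKey, it is easy to find the corresponding InChI, which uniquely determines the structure and tells you what the chemical actually is. A CAS or MDL number, without a commercial database or even software behind it, tells you nothing. An InChIKey does still need a database to look up the InChI, but that database only has to hold this one simple mapping table.

I have not studied how the InChIKey is generated in detail. It should be simple; the principle is presumably nothing more than condensing the data with a digest algorithm (similar to MD5). The ChemWeb mailing also notes, "There is a finite, but very small probability of finding two structures with the same InChIKey." Condensing data with a digest is a common technique, but it has many possible uses here. For example, storing InChI in a database column is relatively expensive, and searching it is inefficient. If each value is instead MD5-hashed and stored as a fixed-width digest column, then when a user searches by InChI the input can be hashed the same way and the search becomes much faster. A digest always loses information, though: searching for string fragments, for instance, is no longer possible.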
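The fixed-width digest column described above can be sketched in a few lines. This illustrates the lookup idea only, not the InChIKey algorithm itself; an in-memory dictionary stands in for the indexed database column, and the class and function names are invented. The full InChI is kept next to the digest so that a (very unlikely) collision can still be resolved by an exact comparison.

```python
import hashlib


def inchi_digest(inchi):
    """Fixed-width (32 hex characters) MD5 digest of an InChI string."""
    return hashlib.md5(inchi.encode("utf-8")).hexdigest()


class InchiIndex:
    """Look records up by InChI through a digest-keyed table."""

    def __init__(self):
        self._by_digest = {}

    def add(self, inchi, record):
        self._by_digest.setdefault(inchi_digest(inchi), []).append((inchi, record))

    def find(self, inchi):
        # Hash the query the same way, then confirm on the full string.
        candidates = self._by_digest.get(inchi_digest(inchi), [])
        return [record for stored, record in candidates if stored == inchi]


index = InchiIndex()
index.add("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3", "ethanol")
index.add("InChI=1S/CH4/h1H4", "methane")
print(index.find("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"))  # ['ethanol']
```

As the text says, only whole-string lookups survive this transformation; a fragment such as "C2H6O" cannot be searched through the digest.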

Notes on chemical structure similarity

SMILES on Wikipedia

The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc.

It also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.

A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database.

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree.
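The depth-first idea can be shown on a toy case. The sketch below handles only acyclic, hydrogen-suppressed molecules with single bonds, so there is no ring-closure numbering, no bond symbols, and no canonical ordering; it is a teaching toy rather than a SMILES writer. Atoms are given as a list of element symbols and bonds as index pairs, a representation chosen here for brevity.

```python
def dfs_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for an acyclic molecule by depth-first traversal."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    visited = set()

    def visit(i):
        visited.add(i)
        out = atoms[i]
        children = [j for j in neighbors[i] if j not in visited]
        for k, j in enumerate(children):
            branch = visit(j)
            # Every branch except the last is wrapped in parentheses.
            out += branch if k == len(children) - 1 else "(" + branch + ")"
        return out

    return visit(start)


# Isopropanol skeleton: C0-C1, C1-C2, C1-O3
print(dfs_smiles(["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)]))  # CC(C)O
```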

SMARTS is a modification of SMILES that allows, in addition to the SMILES elements, the specification of wildcard atoms and bonds. This is used in specifying search structures and is widely used in chemical database search applications.

Improved SMILES Substructure Searching , by Daylight

Daylight Theory Manual - Covering general information on representing molecules and an in-depth discussion of SMILES™, SMARTS®, SMIRKS®, fingerprints, THOR database concepts, and Merlin analysis <html, pdf>

SMARTS - A Language for Describing Molecular Patterns

Fingerprints - Screening and Similarity

OpenBabel, including Implementation of Daylight SMARTS molecular matching syntax

makefp is a command line program to compute hashed path fingerprints from input smiles, or other file formats such as sdf or mol files.

Checkmol is a command-line utility program which reads molecular structure files in different formats (see below) and analyzes the input molecule for the presence of various functional groups and structural elements.

Search by Functional groups

PubChem Similar Searches search allows you to find similar chemical structures to the provided query. Similarity is measured using the Tanimoto equation and a binary fingerprint computed for every structure in the PubChem Compound database. This fingerprint consists of a series of chemical substructure “keys”. Each key denotes the presence or absence of a particular substructure in a molecule. The fingerprint does not consider variation in stereochemical or isotopic information. Collectively, these binary keys provide a “fingerprint” of a particular chemical structure valence-bond form.
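For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal sketch follows, representing each fingerprint as the set of positions of its "on" bits; the example keys are made up and have nothing to do with PubChem's actual key definitions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5: two shared bits out of four distinct bits
print(tanimoto({1, 2, 3}, {1, 2, 3}))  # 1.0
```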

PubChem Substructure search allows you to locate chemical structures that contain the particular connectivity and valence bond pattern that you provide in your query. For example, a substructure search of ethanol (SMILES: OCC) would return, among others, acetic acid (SMILES: OC(=O)C), since ethanol is a substructure of acetic acid.

OpenEye software

Roll Your Own Chemical Database With Free Components

Creating a Web-based, Searchable Molecular Structure Database Using Free Software

How to create a web-based molecular structure database with free software, a fine presentation to read.

[E-book] Chemoinformatics: Theory, Practice, & Products

Download: rapidshare
Archive password: gigapedia

Source: http://rapidsharebooks.blogspot.com/2007/03/chemoinformatics-theory-practice.html

Chemoinformatics: Theory, Practice, & Products
Pages:295

Chemoinformatics: Theory, Practice & Products covers theory, commercially available packages and applications of Chemoinformatics. Chemoinformatics is broadly defined as the use of information technology to assist in the acquisition, analysis and management of data and information relating to chemical compounds and their properties. This ranges from molecular modelling, to reactions, to spectra, to structure-activity relationships associated with chemicals. Computational scientists, chemists, and biologists all rely on the rapidly evolving field of Chemoinformatics. Chemoinformatics: Theory, Practice & Products is an essential handbook for determining the right Chemoinformatics method or technology to use. There has been an explosion of new Chemoinformatics tools and techniques. Each technique has its own utility, scope, and limitations, as well as meeting resistance to use by experimentalists. The purpose of Chemoinformatics: Theory, Practice & Products is to provide computational scientists, medicinal chemists and biologists with unique practical information and the underlying theories relating to modern Chemoinformatics and related drug discovery informatics technologies.

The book also provides a summary of currently available, state-of-the-art, commercial Chemoinformatics products, with a specific focus on databases, toolkits, and modelling technologies designed for drug discovery. It will be broadly useful as a reference text for experimentalists wishing to rapidly navigate the expanding field, as well as the more expert computational scientists wishing to stay up to date.

It is primarily intended for applied researchers from the chemical and pharmaceutical industry, academic investigators, and graduate students.

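Eleven qualities of the ideal chemical line notation for web applications

These are notes on the article Eleven Qualities of The Perfect Line Notation for the Web, with excerpts translated from it.

A line notation generally means a way of representing something as ASCII text, and the term is commonly used in chemical nomenclature. Judging from the web, it gets translated into Chinese in all sorts of ways. "Line" in particular is mostly rendered as "linear" (线性的) or even "line segment" (线段的). I think it is enough to understand it as "expressed within a single line" (as opposed to a lengthy description), though it is admittedly hard to translate.

Concretely, it is a system for expressing a chemical structure in one line of text. In the broad sense, I see it as an encoding and decoding scheme. Chemists began studying line notations more than 140 years ago, before the concept of a computer existed.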

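The most important and commonly used line notations today are:

IUPAC nomenclature has the longest history and the widest use.

SMILES appeared after IUPAC nomenclature. It turns the chemical structure into a spanning tree and then writes it out as text using a depth-first traversal of the tree. SMILES is convenient for computer indexing and searching: it allows searching with wildcards and subgraph searching, not just string comparison. It is generally considered more readable than InChI.

InChI seems to be the most widely used on the web. InChIMatic, for example, converts a molecular structure to InChI and then searches for it through Google; there is also a W3C report, Googling for INChIs; A remarkable method of chemical searching.

So which qualities should the ideal chemical notation for web applications have?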

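  1. Human-readable: people can understand and write the notation.
  2. Likewise, computers can encode, decode, and interpret it.
  3. Uses only URI-safe characters.
  4. Can encode every molecular structure.
  5. Short.
  6. Unique for every molecular structure.
  7. Encodes hydrogen atoms explicitly.
  8. Has a hierarchical structure.
  9. Has a flat structure.
  10. Has an open-source software implementation.
  11. Free of patent restrictions.

I do not fully understand all eleven. Some of them certainly conflict and call for trade-offs (5 and 7, or 8 and 9, for example). It was with 3 and 5 in mind that my ChemPedia uses the CAS number as the molecule identifier and embeds it in the URL to get search-engine-friendly URLs.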
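Quality 3 is easy to check mechanically: a notation is URI-safe if percent-encoding leaves it unchanged. The sketch below compares a CAS number, a SMILES string and an InChI under that test; the helper name is made up, and "safe" is taken here to mean the unreserved characters that `urllib.parse.quote` never escapes (letters, digits, and `_.-~`).

```python
from urllib.parse import quote


def is_uri_safe(notation):
    """True if the string can be placed in a URL without any escaping."""
    return quote(notation, safe="") == notation


for label, value in [
    ("CAS", "490-11-9"),
    ("SMILES", "O=C1CCCCC1"),
    ("InChI", "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"),
]:
    print(label, is_uri_safe(value), quote(value, safe=""))
```

The CAS number passes unchanged, while the "=" in the SMILES string and the "=", "/" and "," in the InChI all need escaping.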

Lucene for Information Retrieval kicked off

Why Lucene

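I have a project on hand that involves some text analysis. I consulted A Fei, my go-to AI expert, who told me to start by recognizing the keywords of the domain the texts belong to, and who also mentioned Lucene, the full-text search system. Excellent.

Lucene is itself a framework, and what I need to do is structurally much the same as full-text search: given a query, pick the relevant material out of a set of documents, and perhaps analyze the information in it further. Along the way, a special index structure has to be built around the characteristics of the domain to support the analysis. All in all, the framework is a very useful model to borrow from.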

Lucene application structure and implementation structure

Application structure diagram

Lucene implementation structure diagram

 

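What I need to do is use the Lucene framework to index and query the material I care about.

Combine it with an existing domain dictionary and modify the analyzer so that it produces a specialized index structure.

Modify the QueryParser and search modules to generate the target information from the index of the material, hopefully with valuable results. Querying as such is not the goal; whether the index is turned into what the user needs passively or proactively, the principle is the same.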

Analyzer

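Before a document is indexed, its content first has to be tokenized; this is the job of the Analyzer. Analyzer is an abstract class with several implementations, and a suitable one has to be chosen for the language and application at hand. The Analyzer hands the tokenized content to the IndexWriter, which builds the index.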
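The division of labor described above (an analyzer produces tokens, an index writer stores them) can be illustrated with a toy inverted index. This is a sketch of the concept in Python, not of Lucene's Java API; the function names and sample documents are invented, and the "analyzer" only lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric tokens."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Toy index writer: map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term of the query."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "Phenol is an aromatic alcohol", 2: "Ethanol is an alcohol"}
index = build_index(docs)
print(search(index, "aromatic alcohol"))  # {1}
```

Swapping in a domain dictionary, as planned above, would mean replacing `analyze` with a tokenizer that recognizes multi-word domain terms as single tokens.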

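Getting started

The first difficulty: I do not know Java. Fortunately there are articles to consult, so I now have a basic picture.
A .jar file is a compiled Java program. Add the path of the .jar file to the CLASSPATH environment variable and the program can be run from the console.
C:\>java org.apache.lucene.demo.SearchFiles
This assumes, of course, that a Java runtime is installed.

Other Java programs can likewise use the Lucene API directly by importing the corresponding packages.

I ran it and collected and read a fair amount of introductory material, so call that a start. As A Fei advised, theory and method are not the most important thing; what matters is to start trying right away.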

References

Introduction to Lucene and installation

Models and theory

Programming with Lucene
