September 20, 2008
化学信息软件开发工具时间线
http://www.dalkescientific.com/writings/diary/archive/2008/09/20/euroqsar.html
Filed by
charlie
at 4:20 pm under chemoinformatics,Coding
No Comments
http://www.dalkescientific.com/writings/diary/archive/2008/09/20/euroqsar.html
Filed by
charlie
at 4:20 pm under chemoinformatics,Coding
No Comments
Cross tabulation , is defined on WikiPedia as A cross tabulation (often abbreviated as cross tab) displays the joint distribution of two or more variables.
As below figures, the column values of the first table is transformed to column names in the second one, so the production of
Cross tabs are frequently used because:
I think the cross-tab is really what the Relational Database means. That’s a meaningful “Relationships” between entities but not entity and it’s attributes.
Code:
SELECT
SalesPerson,
[Oranges] AS Oranges,
[Pickles] AS Pickles
FROM
( SELECT SalesPerson, Product, SalesAmount FROM tblSales ) ps
PIVOT
(
SUM (SalesAmount)
FOR Product IN ( [Oranges], [Pickles])
) AS pvt
This most weak part of feature is that you must list all pivot column name manually. It’s pain.
SELECT < non-pivoted column>,
[first pivoted column] AS ,
[second pivoted column] AS ,
...
[last pivoted column] AS
FROM
(< SELECT query that produces the data>)
AS
PIVOT
(
< aggregation function>()
FOR
[< column that contains the values that will become column headers>]
IN ( [first pivoted column], [second pivoted column],
... [last pivoted column])
) AS
< optional ORDER BY clause>;
Microsoft Access provides a very nice cross table design wizard.
With the wizard, we get the cross-tab script based on JET-SQL.
TRANSFORM Sum(tblSales.SalesAmount) AS [Sum of SalesAmount] SELECT tblSales.SalesPerson, Sum(tblSales.SalesAmount) AS [Total SalesAmount] FROM tblSales GROUP BY tblSales.SalesPerson PIVOT tblSales.Product;
Pivot columns is automaticly generated and the aggregation over all pivot column can be produced. The JET-SQL is so smart and smoothly for small applications those don’t have to be scaled. You can use almost all VBA programming utilities in it. The TRANSFORM/PIVOT feature is powerful and more then 5 years before the SQL Server’s implementation.
It is based on JET-SQL, so not availible in an ADP project.
The classic solution of cross-tab in SQL Server 2000 or before is produced by CASE and GROUP.
select
SalesPerson,
SUM(CASE Product WHEN 'Pickles' THEN SalesAmount ELSE 0 END ) as [Pickles],
SUM(CASE Product WHEN 'Oranges' THEN SalesAmount ELSE 0 END ) as [Oranges],
SUM(SalesAmount) as [Total Sales]
from
tblSales
group by
SalesPerson
We can find it’s footprint clearly in the SQL Server 2005′s pivot feature. And this is a base for other ideas.
Dynamic Cross-Tabs/Pivot Tables By Rob Volk
The basic idea is to generate dynamic CASE-and-GROUP based SQL statement and execute.
The following comments providing many improvement on this solution. i.e.
There is also an similar solution I have not dived in, Dynamic Crosstab Queries, by Itzik Ben-Gan .
But the most easy and rich tool I have used is MS Excel’s pivot table.
The basic idea of OLAP Cube is a multi-dimession pivot table. The common visualization tool for OLAP is also a pivot table or pivot chart viewer.
So, draw a figure to show basic concepts in OLAP by example metioned above as the END.
Filed by
charlie
at 10:15 pm under BI,datamining,DBA,English,MS_ACCESS
1 Comment
Slope-One, the simplest form of non-trivial item-based collaborative filtering based on ratings. (Original Paper)
Referencing to Bryan O’Sullivan’s tutorial of implementing Slope One in Python, I write a the implementation in T-SQL. Believe it useful to many people and projects.
The UserData is fact table of business transactions. I use an view to wrap it for switching between testing data and working data.
The Freq&Diff matrix is square and sparse. Only non-zero values is meaningful and stored. And a half-matrix triangle holds full information about the matrix.
There created two indices to avoid heavily bookmark-lookup.
create table UserData ( -- fact table
userid varchar(50) not null,
itemid varchar(50) not null,
rating float not null default 0,
updtime datetime default getdate(),
primary key (userid, itemid)
)
GO
create table FreqDiff ( -- Freqs and Diffs
itemid1 varchar(50),
itemid2 varchar(50),
freq float not null default 0,
diff float not null default 0,
updtime datetime default getdate(),
primary key (itemid1, itemid2)
)
GO
create index idx_freqdiff_itemid1 on FreqDiff(itemid1, freq, diff, itemid2)
create index idx_freqdiff_itemid2 on FreqDiff(itemid2, freq, diff, itemid1)
/*
* The matrix FreqDiff is *almost* symmetric,
* so only half of the data need to be stored.
* There would be huge of space (50%) saved for large dataset.
*/
alter view vw_freqdiff as
select itemid1 as itemid1, itemid2 as itemid2, freq, diff from FreqDiff fd
union all
select itemid2 as itemid1, itemid1 as itemid2, freq, -1* diff from FreqDiff fd
GO
/*
* Wrap for userdata,
* switch from one model to another easily.
*/
alter view vw_userdata as
select * from userdata
GO
Same as Bryan’s but names changed for easily debugging print.
-- init userdata, Bryan O'Sullivan's sample data is used insert into UserData values ( 'u1', 'i1', 1, getdate() ) insert into UserData values ( 'u1', 'i2', .5, getdate() ) insert into UserData values ( 'u1', 'i3', .2, getdate() ) insert into UserData values ( 'u2', 'i1', 1, getdate() ) insert into UserData values ( 'u2', 'i3', .5, getdate() ) insert into UserData values ( 'u2', 'i4', .2, getdate() ) insert into UserData values ( 'u3', 'i1', .2, getdate() ) insert into UserData values ( 'u3', 'i2', .4, getdate() ) insert into UserData values ( 'u3', 'i3', 1, getdate() ) insert into UserData values ( 'u3', 'i4', .4, getdate() ) insert into UserData values ( 'u4', 'i2', .9, getdate() ) insert into UserData values ( 'u4', 'i3', .4, getdate() ) insert into UserData values ( 'u4', 'i4', .5, getdate() ) GO
-- update process
delete FreqDiff
insert into FreqDiff
select
ud1.itemid, ud2.itemid, count(*), (sum(ud1.rating - ud2.rating))/count(*), getdate()
from
vw_userdata ud1
join vw_userdata ud2 on
ud1.userid = ud2.userid
and ud1.itemid > ud2.itemid
group by ud1.itemid, ud2.itemid
-- predict process
declare @pref table(itemid varchar(50), rating float)
insert into @pref values('i1', 0.4)
select -- distinct top 10
itemid1,
sum(freq) as freq,
sum(freq*(diff + rating)) as pref,
sum(freq*(diff + rating)) /sum(freq) as rating
from
vw_freqdiff fd
join @pref p on fd.itemid2 = p.itemid
where itemid1 not in( select itemid from @pref )
group by itemid1
Further works as intermediate data updating seems easy.
So, writing here, listening for suggestions.
Filed by
charlie
at 5:35 am under BI,datamining,English
6 Comments
排序
分页
也就是说,分页完全是在ASP.NET中进行的,对于大量数据集就会有性能问题。
解:
Filed by
charlie
at 12:08 pm under Coding
No Comments
GAE到目前为止并没有提供从站点上下载或备份代码的功能,本地的开发代码一旦丢失或损坏,就会有无法恢复的麻烦。所以本地代码用SVN之类的管理工具管理起来是很必要的。
Manatlan编写了一个工具,可以将整个GAE站点的代码打成zip包下载。是一个很简单的过程
这个程序会通过google的身份认证来判断访问者是不是管理员。而且对于各个版本的代码,也可以分别下载了。
不过不能直接访问代码的确是GAE的明显缺陷。
所以相信这个问题很快会解决掉,至少能和Google Code结合在一起,代码管理和发布管理的功能集成起来。
Filed by
charlie
at 12:48 am under Coding
No Comments
MS Access is wildly used, easily available and almost the most user friendly database product. I know many stuffs are working on MS Excel for compound information management and there exists such commercial solutions, the reasons coincide.
The solution described here is for
SDF files are downloaded directly from Pubchem ftp, it’s free and public.
As the structure description part (mol file) exists inside each compound, it’s not easy to parse the SDF file to csv and other table based format. I write a simple Python script to transform SDF file to Access accepted XML file. You can create a desired table in Access database and export it to XML format to get the XML template.
When data are ready, just run import command from Access.
You can also try RDkit to manipuate SDF and structure information in Python, it’s professional.
When transforming SDF file with Python, some further more actions be taken to generate BMP structure pictures.
Use Molconvert to convert the mol file into PNG format
os.system( “molconvert png %s/%s.mol -o %s/%s.png” %(folder, molid, folder, molid) )
Use PIL to convert PNG to BMP
Image.open( “%s/%s.png”%(folder, molid)).save(“%s/%s.bmp”%(folder,molid))
BMP pictures are needed because native Access only accepts BMP picture as OLE object and displayed on form control.
Public Sub load_img()
Me.Recordset.MoveFirst
While Not Me.Recordset.EOF
Me.pic.OLETypeAllowed = acOLELinked
Me.pic.SourceDoc = CurrentProject.Path & “img” & Me.cid & “.bmp”
Me.pic.Action = acOLECreateLink
Me.pic.SizeMode = acOLESizeStretch
Me.Recordset.MoveNext
Wend
End Sub
JME is a java package and I don’t know any way to combine it with Access directly. So an webcontrol is introduced to the Access form and JME embedded on the source page of the webcontrol. The dom object visit is more simple than I had imaged.
mol = Me.WebBrowser2.Document.applets.Item(0).MolFile()
This is truely the key technic of the solution. It is powered by opensourced checkmol/machmol. Acturally, just after read the usage part of source code of its dll version, I begun to think about to do something on Access. Access database uses JETSQL engine, it’s not as powerful as T-SQL, but it supports VBA functions. VBA code can easily visit exernal dll function, so JETSQL could be extended greatly.
So, the sturcture search code is really simple
select * from sample_data where MatchMol(query_mol, mol) > 0
MatchMol function is pre-defined by dll apis of checkmol/matchmol.
Code copied from MATCHMOLDLL.pas
——————————————————————————–
Private Declare Sub mm_SetMol Lib “matchmolDLL.dll” (ByVal st As String)
Private Declare Sub mm_SetCurrentMolAsQuery Lib “matchmolDLL.dll” ()
Private Declare Function mm_Match Lib “matchmolDLL.dll” (ByVal Exact As Boolean) As Long
Private Declare Function mm_GetRings Lib “matchmolDLL.dll” () As Long
Private Declare Function mm_GetAtomRing Lib “matchmolDLL.dll” (ByVal AtomNumber As Long) As Long
Private Declare Sub mm_Version Lib “matchmolDLL.dll” (ByVal st As String)
Public Function MatchMol(Needle As String, Haystack As String, Optional ExactMatch As Boolean = False) As Boolean
Static oldNeedle As String
If oldNeedle <> Needle Then
oldNeedle = Needle
mm_SetMol Needle
mm_SetCurrentMolAsQuery
End If
mm_SetMol Haystack
If mm_Match(ExactMatch) <> 0 Then MatchMol = True
End Function
——————————————————————————–
As a small database, there are about 2700 compounds are imported.
Such Access DB can NOT serves for Asp.net or other applications connect through ODBC/DAO, for “ODBC and DAO do not use or know anything about the code modules inserted into an .mdb file by Access. Only Access recognizes the modules. ” announced by MSDN KB [Q166113] You cannot use user-defined modules through ODBC or DAO.
Filed by
charlie
at 11:29 pm under chemoinformatics,English,MS_ACCESS
3 Comments
实际上对.Net中的“委托(delegate)”的概念并不很懂。如果仅理解成自定义事件的话,在Access中也可以部分实现。
窗体2中的自定义事件FireFromF2被窗体1捕获处理,参数被传递。用法很简单,注意WithEvents关键字的使用。
上面的例子要求先f2开启状态下再打开f1才能成功注册事件(原因见最后的总结)。如果是子窗体的事件,就简单一些,应用更常见。
总结
Filed by
charlie
at 11:16 pm under Coding,MS_ACCESS
1 Comment
花了两天时间,搭建起了这个del.icio.us自定义搜索引擎,可以在自己del.icio.us书签的某个tag的站点中进行搜索。比如直接在job的tag下去搜已经收集好的51job/chinahr等站点。
需要输入del.icio.us的用户名和密码。我只能口头保证不窥探保存你的密码,使用前还请谨慎。因为不保存用户名和密码信息,所以cse的站点定义不会随着del.icio.us的书签更新而更新。需要更新的话,不妨再次访问这个程序。
调用了del.icio.us的api,但是不能频繁访问,否则就会摔过来503。
zooie早就做过相同的工作,提供的功能也更复杂。我的这个程序使用了google cse比较新的Linked CSEs,一来不必操作annotations xml文件;二来可以生成一段代码嵌到你需要的页面中。
相关的一些文档和链接
Filed by
charlie
at 10:46 pm under Engineer,webdev
No Comments
可以用于谷歌输入法的词典文件,包含大量中文化学词汇,多为化合物名称。词库容量很大,有15976条化合物名称中文词汇;包括各种多音字拼写(也包括拼错的)共有拼音条目6万余条。比较搜狗拼音化学词汇大全【官方推荐】的一千多条的量要大多了。
作者 zh.charlie@gmail.com
在谷歌拼音输入法的“属性设置”中导入
化合物中文名称,从Chemblink.com网站上采样获得。
词汇提取程序使用Python编写。其中,从unicode字符串中提取汉字的正则表达式:
ur'([\u4e00-\u9fa5]+)'
汉字到拼音的转换程序,使用了roy在水木上贴的python代码和数据库。
谷歌拼音输入法的词典格式和分析方法,在前一篇中有所介绍。
随便用。随意转载、修改、使用,不必注明原作者。对词典的正确性、全面性作者无法保证和负责。
google.pinyin.dict.for.chemists.zip
Filed by
charlie
at 7:33 am under chemoinformatics,Python,Tools
1 Comment