Cross tabulation, pivot tables and OLAP cubes, summarized by tool

Cross tab

Wikipedia defines cross tabulation this way: a cross tabulation (often abbreviated as cross tab) displays the joint distribution of two or more variables.

As the figures below show, the column values of the first table are transformed into column names in the second one, so the production of

 

 

Cross tabs are frequently used because:

  1. They are easy to understand. They appeal to people who do not want to use more sophisticated measures.
  2. They can be used with any level of data: nominal, ordinal, interval, or ratio - cross tabs treat all data as if it is nominal.
  3. A table can provide greater insight than single statistics.
  4. It solves the problem of empty or sparse cells.
  5. They are simple to conduct.

 

I think the cross tab is really what "relational" in relational database means: a meaningful relationship between entities, not just between an entity and its attributes.

Solution by SQL Server 2005

Code:

SELECT
    SalesPerson,
    [Oranges] AS Oranges,
    [Pickles] AS Pickles
FROM
    ( SELECT SalesPerson, Product, SalesAmount FROM tblSales ) ps
PIVOT
    (
      SUM (SalesAmount)
      FOR Product IN ( [Oranges], [Pickles])
    ) AS pvt

 

The weakest part of this feature is that you must list every pivot column name by hand, which is a pain.


BOL: Using PIVOT and UNPIVOT

SELECT <non-pivoted column>,
    [first pivoted column] AS <column name>,
    [second pivoted column] AS <column name>,
    ...
    [last pivoted column] AS <column name>
FROM
    (<SELECT query that produces the data>)
    AS <alias for the source query>
PIVOT
(
    <aggregation function>(<column being aggregated>)
FOR
[<column that contains the values that will become column headers>]
    IN ( [first pivoted column], [second pivoted column],
    ... [last pivoted column])
) AS <alias for the pivot table>
<optional ORDER BY clause>;

Solution by MS Access

Microsoft Access provides a very nice crosstab query wizard.

The wizard produces the following cross-tab query in JET SQL.

TRANSFORM Sum(tblSales.SalesAmount) AS [Sum of SalesAmount]
SELECT tblSales.SalesPerson, Sum(tblSales.SalesAmount) AS [Total SalesAmount]
FROM tblSales
GROUP BY tblSales.SalesPerson
PIVOT tblSales.Product;

 

 

Pivot columns are generated automatically, and an aggregate over all pivot columns can be produced as well. JET SQL is smart and smooth for small applications that do not have to scale, and you can use almost all VBA programming utilities in it. The TRANSFORM/PIVOT feature is powerful and appeared more than five years before SQL Server's implementation.

It is based on JET SQL, so it is not available in an ADP project.

Solution by SQL Server 2000, CASE and GROUP

The classic cross-tab solution in SQL Server 2000 and earlier uses CASE and GROUP BY.

select
    SalesPerson,
    SUM(CASE Product WHEN 'Pickles' THEN SalesAmount ELSE 0 END ) as [Pickles],
    SUM(CASE Product WHEN 'Oranges' THEN SalesAmount ELSE 0 END ) as [Oranges],
    SUM(SalesAmount)    as [Total Sales]
from
    tblSales
group by
    SalesPerson

Its footprint is clearly visible in SQL Server 2005's PIVOT feature, and it is the basis for the other ideas below.
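To see the CASE-and-GROUP BY pattern run end to end, here is a minimal sketch that executes the same query shape against an in-memory SQLite database from Python. The sample rows are hypothetical (the post does not list the contents of tblSales), and SQLite stands in for SQL Server only because the CASE/SUM/GROUP BY syntax used here is common to both.

```python
import sqlite3

# Hypothetical sample rows; the post does not show the real contents of tblSales.
rows = [
    ("Alice", "Oranges", 10.0),
    ("Alice", "Pickles", 5.0),
    ("Bob", "Oranges", 7.0),
    ("Alice", "Oranges", 2.5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblSales (SalesPerson TEXT, Product TEXT, SalesAmount REAL)"
)
conn.executemany("INSERT INTO tblSales VALUES (?, ?, ?)", rows)

# One SUM(CASE ...) column per product turns row values into column names.
crosstab = conn.execute(
    """
    SELECT SalesPerson,
           SUM(CASE Product WHEN 'Pickles' THEN SalesAmount ELSE 0 END) AS Pickles,
           SUM(CASE Product WHEN 'Oranges' THEN SalesAmount ELSE 0 END) AS Oranges,
           SUM(SalesAmount) AS TotalSales
    FROM tblSales
    GROUP BY SalesPerson
    ORDER BY SalesPerson
    """
).fetchall()

for person, pickles, oranges, total in crosstab:
    print(person, pickles, oranges, total)
```

Each salesperson ends up on a single row, with one column per product plus the row total.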

Solution by SQL Server 2000, Dynamic SQL

Dynamic Cross-Tabs/Pivot Tables By Rob Volk

The basic idea is to generate a CASE-and-GROUP BY SQL statement dynamically and execute it.

The comments following that article provide many improvements to this solution, for example:

  • Introduce a WHERE clause as a parameter for the @PivotColTable.
  • Use a string rather than a global temp table to generate the pivot columns, to better support concurrent calls.
  • Use a user-defined table with a user session column instead of a global temp table.
  • Summarize multiple values.

There is also a similar solution that I have not dug into: Dynamic Crosstab Queries, by Itzik Ben-Gan.
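The dynamic approach can be sketched outside the database too. The following is a minimal illustration in Python, not Rob Volk's stored procedure: it builds the CASE-and-GROUP BY statement as a string from a list of pivot values. The function names and the quoting helpers are assumptions made for this sketch; in practice the pivot values would come from a SELECT DISTINCT on the pivot column.

```python
def quote_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def quote_name(name):
    """Quote an identifier T-SQL style, doubling embedded closing brackets."""
    return "[" + name.replace("]", "]]") + "]"


def build_crosstab_sql(table, row_col, pivot_col, value_col, pivot_values):
    """Build a CASE-and-GROUP BY cross-tab statement, one column per pivot value."""
    columns = [quote_name(row_col)]
    for value in pivot_values:
        columns.append(
            "SUM(CASE {pivot} WHEN {lit} THEN {val} ELSE 0 END) AS {alias}".format(
                pivot=quote_name(pivot_col),
                lit=quote_literal(value),
                val=quote_name(value_col),
                alias=quote_name(value),
            )
        )
    return "SELECT\n    {cols}\nFROM {table}\nGROUP BY {row}".format(
        cols=",\n    ".join(columns),
        table=quote_name(table),
        row=quote_name(row_col),
    )


sql = build_crosstab_sql(
    "tblSales", "SalesPerson", "Product", "SalesAmount", ["Oranges", "Pickles"]
)
print(sql)
```

Quoting the values and names matters here because the pivot values are data, and data concatenated into SQL text is the usual route to injection bugs.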

Solution by Excel

But the easiest and richest tool I have used is the MS Excel pivot table.

Pivot table and OLAP Cube

The basic idea of an OLAP cube is a multi-dimensional pivot table. The common visualization tool for OLAP is also a pivot table or pivot chart viewer.

So, to finish, a figure shows the basic OLAP concepts using the example mentioned above.
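The "multi-dimensional pivot table" idea can also be sketched in a few lines of Python. This is only an illustration under assumptions: the fact rows and the extra Year dimension are hypothetical, and a real OLAP engine precomputes and indexes these aggregates rather than scanning the facts each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (SalesPerson, Product, Year, SalesAmount).
DIMENSIONS = ("SalesPerson", "Product", "Year")
facts = [
    ("Alice", "Oranges", 2007, 10.0),
    ("Alice", "Pickles", 2007, 5.0),
    ("Alice", "Oranges", 2008, 4.0),
    ("Bob", "Oranges", 2008, 7.0),
]


def roll_up(facts, keep):
    """Sum the measure over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the measure is the last element of each row
    return dict(totals)


# Keeping two dimensions gives an ordinary pivot table ...
by_person_product = roll_up(facts, ("SalesPerson", "Product"))
# ... keeping one gives a marginal total, and keeping none gives the grand total.
by_year = roll_up(facts, ("Year",))
grand_total = roll_up(facts, ())
```

Choosing which dimensions to keep is what a pivot table viewer does when fields are dragged onto rows and columns.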

Implementing Slope-One in T-SQL

Slope One is the simplest form of non-trivial item-based collaborative filtering based on ratings. (Original Paper)
Following Bryan O'Sullivan's tutorial on implementing Slope One in Python, I wrote an implementation in T-SQL. I believe it will be useful to many people and projects.

Brief process summary

  • Define the fact table that holds the user data.
  • Calculate the intermediate matrix (FreqDiff). Information about individual users is eliminated, and the frequency and score difference between each pair of items is produced.
  • Predict from the scores a user supplies, using the intermediate data.

Data schema

UserData is the fact table of business transactions. I wrap it in a view so that I can switch between test data and working data.
The Freq&Diff matrix is square and sparse. Only non-zero values are meaningful and stored, and a triangular half of the matrix holds all the information about it.
Two indexes are created to avoid heavy bookmark lookups.

create table UserData (                  -- fact table
    userid   varchar(50) not null,
    itemid   varchar(50) not null,
    rating   float not null default 0,
    updtime  datetime default getdate(),
    primary key (userid, itemid)
)
GO

create table FreqDiff (                  -- Freqs and Diffs
    itemid1  varchar(50),
    itemid2  varchar(50),
    freq     float not null default 0,
    diff     float not null default 0,
    updtime  datetime default getdate(),
    primary key (itemid1, itemid2)
)
GO
create index idx_freqdiff_itemid1 on FreqDiff(itemid1, freq, diff, itemid2)
create index idx_freqdiff_itemid2 on FreqDiff(itemid2, freq, diff, itemid1)

/*
 * The matrix FreqDiff is *almost* symmetric
 * (freq is the same both ways, diff only changes sign),
 * so only half of the data needs to be stored.
 * This saves a lot of space (50%) for a large dataset.
 */
alter view vw_freqdiff as
select itemid1 as itemid1, itemid2 as itemid2, freq,     diff from FreqDiff fd
union all
select itemid2 as itemid1, itemid1 as itemid2, freq, -1* diff from FreqDiff fd
GO

/*
 * Wrap for userdata,
 * switch from one model to another easily.
 */
alter view vw_userdata as
select * from userdata
GO
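What the vw_freqdiff view does with the stored half of the matrix can be shown with a small sketch. This is an illustration in Python rather than part of the T-SQL solution; the single stored entry is an example value and the function name is made up for the sketch.

```python
# Stored half: (itemid1, itemid2) -> (freq, diff), kept only for itemid1 > itemid2.
half = {("i2", "i1"): (2, -0.15)}


def expand(half):
    """Rebuild the full matrix the way the UNION ALL view does."""
    full = {}
    for (item1, item2), (freq, diff) in half.items():
        full[(item1, item2)] = (freq, diff)
        # The mirrored cell has the same frequency and the negated difference.
        full[(item2, item1)] = (freq, -diff)
    return full


full = expand(half)
```

The mirrored rows are derived on every read, which trades a little query-time work for half the storage.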

Testing data

Same as Bryan's, but with the names changed to make debug output easier to read.

-- init userdata, Bryan O'Sullivan's sample data is used
insert into UserData values ( 'u1', 'i1',  1, getdate() )
insert into UserData values ( 'u1', 'i2', .5, getdate() )
insert into UserData values ( 'u1', 'i3', .2, getdate() )
insert into UserData values ( 'u2', 'i1',  1, getdate() )
insert into UserData values ( 'u2', 'i3', .5, getdate() )
insert into UserData values ( 'u2', 'i4', .2, getdate() )
insert into UserData values ( 'u3', 'i1', .2, getdate() )
insert into UserData values ( 'u3', 'i2', .4, getdate() )
insert into UserData values ( 'u3', 'i3',  1, getdate() )
insert into UserData values ( 'u3', 'i4', .4, getdate() )
insert into UserData values ( 'u4', 'i2', .9, getdate() )
insert into UserData values ( 'u4', 'i3', .4, getdate() )
insert into UserData values ( 'u4', 'i4', .5, getdate() )
GO

Processing the intermediate table

-- update process
delete FreqDiff
insert into FreqDiff
select
    ud1.itemid, ud2.itemid, count(*), (sum(ud1.rating - ud2.rating))/count(*), getdate()
from
    vw_userdata ud1
    join vw_userdata ud2 on
            ud1.userid = ud2.userid
        and ud1.itemid > ud2.itemid
group by ud1.itemid, ud2.itemid

Predicting

-- predict process
declare @pref table(itemid varchar(50), rating float)
insert into @pref values('i1', 0.4)

select -- distinct top 10
    itemid1,
    sum(freq)                               as freq,
    sum(freq*(diff + rating))            as pref,
    sum(freq*(diff + rating)) /sum(freq) as rating
from
    vw_freqdiff fd
    join @pref p on fd.itemid2 = p.itemid
where itemid1 not in( select itemid from @pref )
group by itemid1
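For cross-checking the two T-SQL steps, here is a minimal sketch of the same computation in plain Python, using the sample ratings listed above. It is neither Bryan O'Sullivan's code nor a translation of the stored schema: for brevity it keeps the full matrix in a dict instead of the stored half, and the function names are chosen for this sketch.

```python
from collections import defaultdict

# The same sample ratings as the INSERT statements above.
user_data = {
    "u1": {"i1": 1.0, "i2": 0.5, "i3": 0.2},
    "u2": {"i1": 1.0, "i3": 0.5, "i4": 0.2},
    "u3": {"i1": 0.2, "i2": 0.4, "i3": 1.0, "i4": 0.4},
    "u4": {"i2": 0.9, "i3": 0.4, "i4": 0.5},
}


def build_freq_diff(user_data):
    """Return {(item1, item2): (freq, average of rating1 - rating2)}."""
    freq = defaultdict(int)
    total = defaultdict(float)
    for ratings in user_data.values():
        for item1, rating1 in ratings.items():
            for item2, rating2 in ratings.items():
                if item1 != item2:
                    freq[(item1, item2)] += 1
                    total[(item1, item2)] += rating1 - rating2
    return {pair: (freq[pair], total[pair] / freq[pair]) for pair in freq}


def predict(freq_diff, prefs):
    """Frequency-weighted Slope One prediction for items not yet rated in prefs."""
    weight = defaultdict(float)
    score = defaultdict(float)
    for (target, known), (freq, diff) in freq_diff.items():
        if known in prefs and target not in prefs:
            weight[target] += freq
            score[target] += freq * (diff + prefs[known])
    return {item: score[item] / weight[item] for item in score}


predictions = predict(build_freq_diff(user_data), {"i1": 0.4})
for item in sorted(predictions):
    print(item, round(predictions[item], 4))
```

With a rating of 0.4 for i1, the predicted ratings come out at about 0.25 for i2, 0.2333 for i3 and 0.1 for i4, which is what the predict query computes from the same data.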

Further work, such as updating the intermediate data incrementally, seems easy.
So I am writing it up here and listening for suggestions.

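Basic mechanics of GridView sorting and paging

Sorting

  • The client posts back.
  • In the Sorting event, get the sort column and the sort direction.
  • Call the DataSource's sort method.
  • Call DataBind again.

Paging

  • The GridView reads all the data from the data source.
  • In the PageIndexChanging event, get the number of the page to go to (e.NewPageIndex).
  • Before data binding, set the PageIndex property to perform the paging.

In other words, paging is done entirely inside ASP.NET, so large data sets will cause performance problems.
Solution: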
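One common way around this (an assumption here, since the post does not record its own solution) is to page at the data source, so that only the rows of the requested page are read. A minimal sketch using Python's sqlite3 module, with a hypothetical items table standing in for the real data source:

```python
import sqlite3


def fetch_page(conn, page_index, page_size):
    """Read only one page of rows; page_index is zero-based."""
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page_index * page_size),
    ).fetchall()


# Hypothetical data: 25 rows, so the third page of ten holds the last five.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, "item%d" % i) for i in range(1, 26)],
)

page = fetch_page(conn, 2, 10)
```

The database then does the skipping and limiting, and the web tier never holds more than one page in memory.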

Configuring VirtualBox with Host Interface Networking (HIF)

The diagram serves as the note.


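Downloading the code of a Google App Engine site

So far GAE provides no way to download or back up code from the site, so if the local development code is lost or damaged there is no way to recover it. It is therefore necessary to keep the local code under a version control tool such as SVN.
Manatlan has written a tool that packs the code of an entire GAE site into a zip file for download. The process is very simple:

      Create zipme.py in the root directory from manatlan's code.
      Add handlers: - url: /zipme script: zipme.py to app.yaml.
      Visit youapp.appspot.com/zipme.

Manatlan's program uses Google authentication to check whether the visitor is an administrator. It also makes it possible to download the code of each version separately.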
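The core of such a tool, packing a directory tree into a zip archive held in memory, can be sketched with the Python standard library. This is not manatlan's zipme.py and it leaves out the request handling and the administrator check; the function name and the demo file are assumptions made for the sketch.

```python
import io
import os
import tempfile
import zipfile


def zip_tree(root):
    """Return the bytes of a zip archive holding every file under `root`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(folder, name)
                # Store paths relative to the root so the archive is relocatable.
                archive.write(path, os.path.relpath(path, root))
    return buffer.getvalue()


# Demo on a throwaway directory containing one hypothetical file.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "app.yaml"), "w") as handle:
        handle.write("application: demo\n")
    data = zip_tree(root)

names = zipfile.ZipFile(io.BytesIO(data)).namelist()
```

Building the archive in memory suits a web handler, because the bytes can be written straight to the response without touching the file system.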
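Still, not being able to access the code directly is a clear shortcoming of GAE.

  • Code may be damaged or lost with no way to recover it.
  • Collaborative development does not work well either; developers need some other channel to exchange and maintain code.
  • It is hard to match code versions to deployed versions.

So I believe this problem will be solved soon. At the very least GAE could be combined with Google Code, so that code management and release management are integrated.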

A compound database with structure search, based on MS Access




Why based on MS Access

MS Access is widely used, easily available and close to the most user-friendly database product. I know many staff manage compound information in MS Excel, and commercial solutions built on it exist, for the same reasons.

The solution described here is for

  • A small product catalog.
  • Cases where fast development is needed.
  • Cases where the user interface and operation are crucial, or where chemical information is only a minor part of the database.

Demonstration of functions


Platform and tools used

Solution step by step

  • Sample data

The SDF files are downloaded directly from the PubChem FTP site, which is free and public.

  • Import SDF file into Access

Because the structure description (the mol file) is embedded in each compound record, it is not easy to parse an SDF file into CSV or another table-based format. I wrote a simple Python script to transform the SDF file into an XML file that Access accepts. To get the XML template, create the desired table in an Access database and export it to XML.

When the data is ready, just run the import command from Access.
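The transformation can be sketched as follows. This is not the author's script: it is a minimal illustration that splits SDF text into records at the `$$$$` delimiter, separates the mol block (up to `M  END`) from the `> <FIELD>` data items, and writes one XML element per record. The `dataroot` and `compound` element names, the `mol` column and the sample record are assumptions; the real names should be taken from the template exported by Access.

```python
import re
import xml.etree.ElementTree as ET

FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def parse_sdf(text):
    """Yield (molfile, fields) for each '$$$$'-terminated record in SDF text."""
    lines = []
    for line in text.splitlines():
        if line.strip() != "$$$$":
            lines.append(line)
            continue
        # Everything up to and including "M  END" is the mol block.
        end = next(
            (i for i, item in enumerate(lines) if item.strip() == "M  END"),
            len(lines) - 1,
        )
        fields, name = {}, None
        for item in lines[end + 1:]:
            header = FIELD_HEADER.match(item)
            if header:
                name = header.group(1)
                fields[name] = []
            elif not item.strip():
                name = None  # a blank line ends the current data item
            elif name is not None:
                fields[name].append(item)
        yield "\n".join(lines[:end + 1]), {k: "\n".join(v) for k, v in fields.items()}
        lines = []


def to_xml(records, table="compound"):
    """Serialize records as <dataroot><compound>...</compound></dataroot>."""
    root = ET.Element("dataroot")
    for molfile, fields in records:
        row = ET.SubElement(root, table)
        ET.SubElement(row, "mol").text = molfile
        for name, value in fields.items():
            ET.SubElement(row, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-record SDF with an empty placeholder mol block.
sample = "\n".join([
    "12345", "  hypothetical header", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <PUBCHEM_COMPOUND_CID>", "12345", "",
    "$$$$",
])
records = list(parse_sdf(sample))
xml_text = to_xml(records)
```

Field names are used directly as element names here, so a field whose name is not a valid XML name would need to be renamed first.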

You can also try RDKit to manipulate SDF and structure information in Python; it is a professional tool.

  • Generation of structure pictures

While transforming the SDF file with Python, a few further steps generate BMP pictures of the structures.

Use Molconvert to convert the mol file into PNG format

import os
os.system("molconvert png %s/%s.mol -o %s/%s.png" % (folder, molid, folder, molid))

Use PIL to convert PNG to BMP

from PIL import Image
Image.open("%s/%s.png" % (folder, molid)).save("%s/%s.bmp" % (folder, molid))

BMP pictures are needed because native Access only accepts BMP pictures as OLE objects for display on a form control.

  • Embed Ole object with VBA

Public Sub load_img()
    Me.Recordset.MoveFirst
    While Not Me.Recordset.EOF
        Me.pic.OLETypeAllowed = acOLELinked
        Me.pic.SourceDoc = CurrentProject.Path & "img" & Me.cid & ".bmp"
        Me.pic.Action = acOLECreateLink
        Me.pic.SizeMode = acOLESizeStretch
        Me.Recordset.MoveNext
    Wend
End Sub

  • JME as structure query input

JME is a Java package, and I do not know any way to combine it with Access directly. So a web browser control is placed on the Access form, and JME is embedded in the page that the control loads. Accessing the DOM object is simpler than I had imagined.

mol = Me.WebBrowser2.Document.applets.Item(0).MolFile()

  • Implementation of the structure search function

This is truly the key technique of the solution. It is powered by the open-source checkmol/matchmol. Actually, it was right after reading the usage part of the source code of its DLL version that I began to think about doing something in Access. An Access database uses the JET SQL engine, which is not as powerful as T-SQL but does support VBA functions. VBA code can easily call external DLL functions, so JET SQL can be extended greatly.

So the structure search code is really simple:


select * from sample_data where MatchMol(query_mol, mol) <> 0


The MatchMol function is defined on top of the DLL API of checkmol/matchmol.

Code copied from MATCHMOLDLL.pas

--------------------------------------------------------------------------------
Private Declare Sub mm_SetMol Lib "matchmolDLL.dll" (ByVal st As String)
Private Declare Sub mm_SetCurrentMolAsQuery Lib "matchmolDLL.dll" ()
Private Declare Function mm_Match Lib "matchmolDLL.dll" (ByVal Exact As Boolean) As Long

Private Declare Function mm_GetRings Lib "matchmolDLL.dll" () As Long
Private Declare Function mm_GetAtomRing Lib "matchmolDLL.dll" (ByVal AtomNumber As Long) As Long
Private Declare Sub mm_Version Lib "matchmolDLL.dll" (ByVal st As String)

Public Function MatchMol(Needle As String, Haystack As String, Optional ExactMatch As Boolean = False) As Boolean
    ' Remember the last query so it is only sent to the DLL when it changes
    Static oldNeedle As String
    If oldNeedle <> Needle Then
        oldNeedle = Needle
        mm_SetMol Needle
        mm_SetCurrentMolAsQuery
    End If

    mm_SetMol Haystack
    If mm_Match(ExactMatch) <> 0 Then MatchMol = True

End Function
--------------------------------------------------------------------------------

Performance issues

It is a small database: about 2700 compounds were imported.

  • OLE pictures in BMP format consume a lot of space. The database file grows to 600MB.
  • Substructure searching is fast enough that no delay is noticeable.
  • If functional groups or fingerprints were introduced as a database index, substructure searching could achieve great performance on large-scale datasets.
  • The data import is slow, and generating the structure pictures costs much more time and CPU. It took about half an hour to convert the 2700 BMP files on my ThinkPad.
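The fingerprint idea mentioned in the list can be sketched as a pre-filter. This is a generic illustration, not something the database described here implements: each compound gets a bitmask of the features it contains, and a compound can only contain the query as a substructure if its mask has every bit the query's mask has. The feature vocabulary and compound ids are hypothetical.

```python
# Hypothetical feature vocabulary; a real index would use checkmol's
# functional group codes or a hashed structural fingerprint.
VOCABULARY = ("hydroxyl", "carbonyl", "amine", "aromatic ring")


def fingerprint(groups):
    """Encode a set of feature names as an integer bitmask."""
    bits = 0
    for position, name in enumerate(VOCABULARY):
        if name in groups:
            bits |= 1 << position
    return bits


def screen(query_fp, candidates):
    """Keep only the ids whose fingerprint contains every bit set in the query."""
    return [cid for cid, fp in candidates.items() if (fp & query_fp) == query_fp]


candidates = {
    "c1": fingerprint({"hydroxyl", "aromatic ring"}),
    "c2": fingerprint({"carbonyl"}),
    "c3": fingerprint({"hydroxyl", "carbonyl", "aromatic ring"}),
}
survivors = screen(fingerprint({"hydroxyl", "aromatic ring"}), candidates)
```

The screen can only rule compounds out, so the survivors would still have to be confirmed with the full MatchMol comparison.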

Other References


Appendix 1

Such an Access DB can NOT serve ASP.NET or other applications that connect through ODBC/DAO, because "ODBC and DAO do not use or know anything about the code modules inserted into an .mdb file by Access. Only Access recognizes the modules." as stated in MSDN KB [Q166113], You cannot use user-defined modules through ODBC or DAO.

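Events and delegates in Access

To be honest, I do not understand the concept of a delegate in .NET very well. If it is understood simply as a custom event, though, it can be partly implemented in Access too.

The custom event FireFromF2 in form 2 is caught and handled by form 1, and its argument is passed along. Usage is simple; note the use of the WithEvents keyword.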

accessevent1.png
accesseventf1.png

accesseventf21.png

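The example above only registers the event successfully if f2 is already open when f1 is opened (see the summary at the end for the reason). For the events of a subform it is simpler, and that use is more common.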

accesseventf31.png

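Summary

  • WithEvents sets up an event-listening hook. The hook targets an Object, that is, an instance, not a Class or type.
  • So you can listen to a global object such as Application, or to one specific Form, but it cannot apply to all Forms (the Access.Form type).
  • VBA does not support automatic up-casting, and WithEvents does not support object arrays, so late-binding approaches basically do not work either.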

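Searching del.icio.us with a Google Custom Search Engine (CSE)

I spent two days building this del.icio.us custom search engine, which searches within the sites bookmarked under one tag of your own del.icio.us bookmarks. For example, you can search directly under the tag job across sites you have already collected, such as 51job and chinahr.

You need to enter your del.icio.us user name and password. I can only give my word that I do not peek at or store your password, so please be careful before using it. Because no user name or password is stored, the CSE site definition is not updated when your del.icio.us bookmarks change. To update it, just visit the program again.

It calls the del.icio.us API, which must not be called too often, otherwise it throws back a 503.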
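A usual way to live with this kind of throttling is to retry with increasing delays. The sketch below is a generic illustration and not the code of the program described here: `call_with_backoff`, its parameters and the fake `flaky_fetch` function are assumptions, and a real caller would pass a function that performs the actual API request.

```python
import time
import urllib.error


def call_with_backoff(fetch, retries=4, first_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTP 503 wait and retry, doubling the delay each time."""
    delay = first_delay
    for attempt in range(retries):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Give up on other errors, or when the last attempt has failed.
            if err.code != 503 or attempt == retries - 1:
                raise
            sleep(delay)
            delay *= 2


# Demo with a fake fetch that is throttled twice and then succeeds.
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise urllib.error.HTTPError(
            "http://example.invalid/", 503, "Service Unavailable", None, None
        )
    return "ok"


delays = []
result = call_with_backoff(flaky_fetch, sleep=delays.append)
```

Passing the sleep function in as a parameter keeps the demo instant, since the recorded delays stand in for real waiting.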

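zooie did the same work long ago, and offers more sophisticated features. My program uses Linked CSEs, a fairly new Google CSE feature: first, there is no need to handle annotations XML files; second, it can generate a snippet of code to embed in any page you need.

Related documents and links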

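A chemistry dictionary for the Google Pinyin input method

Introduction

A dictionary file for the Google Pinyin input method that contains a large number of Chinese chemistry terms, mostly compound names. The dictionary is large: it has 15976 Chinese compound names, and counting the various spellings of polyphonic characters (including misspelled ones) there are more than 60,000 pinyin entries. That is far more than the thousand-odd entries of the Sogou Pinyin chemistry vocabulary collection (搜狗拼音化学词汇大全【官方推荐】).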

google pinyin dict for chemist

Author: zh.charlie@gmail.com

Usage

Import it from the "属性设置" (Properties) dialog of the Google Pinyin input method.

googlepinyinimport.png

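Data and how it was made

The Chinese compound names were sampled from the Chemblink.com website.

The term extraction program is written in Python. The regular expression that extracts Chinese characters from a unicode string is: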

ur'([\u4e00-\u9fa5]+)'
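The `ur'...'` prefix above is Python 2 syntax. For readers on Python 3, the same character range can be used as in the following minimal sketch; the function name and the sample text are made up for illustration.

```python
import re

# Same range as above: runs of characters between U+4E00 and U+9FA5.
HANZI_RUN = re.compile(r"[\u4e00-\u9fa5]+")


def extract_hanzi(text):
    """Return every run of consecutive Chinese characters found in text."""
    return HANZI_RUN.findall(text)


terms = extract_hanzi("乙醇 (ethanol); 苯酚 CAS 108-95-2")
print(terms)
```

Latin letters, digits and punctuation fall outside the range, so they act as separators between the extracted terms.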

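The conversion from Chinese characters to pinyin uses the Python code and database that roy posted on 水木.

The Google Pinyin dictionary format and how to analyse it were covered in the previous post.

License

Use it freely. Repost, modify and use it as you like; there is no need to credit the original author. The author cannot guarantee or take responsibility for the correctness or completeness of the dictionary.


Download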

google.pinyin.dict.for.chemists.zip

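What a damned wall

Even a web page like this one hits its sensitive spot. After that, every page of this site became inaccessible, and this job over the Spring Festival was ruined. Anyone who was not around back then will probably find it hard to understand why.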
8-9-6-4.png
