SQL Server CLR的进程和线程

结论

  • SQL Server CLR是一个多线程运行的环境。
  • 静态变量等线程安全问题,在CLR环境中存在。

摘录《Professional SQL Server™ 2005 CLR Programming》

SQL Server 2005 hosts the CLR in a “sandbox”-like environment in-process to itself, as you can see Figure 1-2. When a user requests a SQL CLR object for the first time, SQL Server will load the .NET execution engine mscoree.dll (which is the CLR) into memory. 

实验

CLR代码

public partial class ProcessThread
{
    public static int static_i=0;

    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlString PID()
    {
        return  Process.GetCurrentProcess().Id.ToString();
    }

    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlString TID()
    {
        return AppDomain.GetCurrentThreadId().ToString()+ ":"+
            System.Threading.Thread.CurrentThread.ManagedThreadId.ToString();
    }

    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlString StaticI()
    {
        return (static_i++).ToString();
    }

};

安装CLR的SQL代码

CREATE ASSEMBLY sqlclr_test
FROM 'C:\SQLCLR\bin\Release\sqlclr.dll'
WITH permission_set = UNSAFE;
GO

create function fn_processid () returns nvarchar(max)
as external name sqlclr_test.ProcessThread.PID
GO
create function fn_threadid () returns nvarchar(max)
as external name sqlclr_test.ProcessThread.TID
GO
create function fn_statici () returns nvarchar(max)
as external name sqlclr_test.ProcessThread.StaticI
GO

测试用查询命令

select
    dbo.fn_processid() as [ProcessID],
    dbo.fn_threadid()  as [ThreadID],
    dbo.fn_statici()   as [StaticI]

返回结果

ProcessID ThreadID StaticI
1608 5820:4 11

系统进程截图 (Process Explorer) 

线程模型

现象 

  1. 每次执行查询命令,得到的ThreadID会发生变化,在SQL Server的进程包含的线程池中轮转。
  2. 每次执行查询命令,不论是否同一个线程,StaticI的值会增加。
  3. SQL Server的用户进程(登录),与所执行的线程编号几乎没有关系。

 

SQLMOL, 化合物结构存储检索关系数据库平台

SQLMOL是我的第一个开源软件项目,发布在Google code上。是今年春节假期的空闲时间的成果。

作为一个基于关系数据库的化合物结构存储检索关系数据库平台,不同于之前介绍的各类方案,这个方案不需要Cartridge/CLR/UDF等数据库插件,只需要标准的关系数据库环境,用SQL程序即可实现功能。

目前的测试是在SQL Server 2008 express这个免费版本上进行的,以后的计划是在MySQL, PostgreSQL等主流数据库,尤其是免费的关系数据库平台上进行实现。

这个方案主要的原理是在“Chemical Substructure Search in SQL (Adel Golovin and Kim Henrick)”这篇paper中介绍的。将化合物分子中的原子、键的结构,以生成树(spanning tree)的形式存储在关系数据库中,用普通的SQL进行结构检索。为了实现这个方案,在SQLMOL中实现了以下的关键部件

  • 用SQL实现的SMILES parser。
  • 利用SMILES parser的结果,实现的Data builder和Query builder。

这样,用SMILES表达的化合物结构式信息,就可以用上面的方案导入到关系数据库中;用SMILES表达的结构检索条件,也可以以相同的规则生成,并执行检索。

到现在为止,这个项目的核心程序,SMILES解析器,还没有实现对SMILES协议100%的支持,对某些情况存在错误

在关系数据库中存储和使用分子结构信息的方案

分子结构式的表达和存储

化学分子结构式的表达方法有很多。最基本的方法当然就是图片文件,图片文件的确仅适用于展示,不适用于分析和检索。常见的以文本形式存储的化学结构包括InChI,SMILES,Molfiles,CML等。对于这些格式,在inchi.info网站上有一个不错的比较。原文还带有一些注释。

  InChI InChIKey SMILES Molfile CML
线性(不用换行) Yes Yes Yes No No
唯一性 Yes No Possibly No No
可读性 Hardly Impossible Easily Hardly Hardly
定义原子几何位置 No No No Yes Yes
长度(每原子字符数) ~2 ~1 1-2 ~50 ~50
软件支持 (0-1) 0.3 0.1 0.2 1 0.5

在数据库中作为结构式信息进行存储,比较重要的考虑因素就是唯一性、存储空间。所以使用InChI, SMILES作为结构式的描述信息都可行。

这个表里需要注意的是,InChIKey与Molfile/CML在唯一性上都是No,但有很大不同。不同的分子结构有可能用同一个InChIKey表达;同一个分子式,可以表示成不同的Molfile/CML。这都对数据库中的 一项重要工作,检索带来大麻烦。

找到合适的分子结构式表达方式,用文本的方式存储起来,并不复杂。执行一些简单的操作,比如通过分子结构(精确)查找分子,也和普通的数据库操作没什么两样。

但是要通过结构式信息进行分子参数计算、子结构检索、描述符(fingerprint)生成及索引、分子相似性计算等工作时,往往需要在关系数据库上增加插件来实现。

关系数据库插件 (cartridge)

下面的插件,没有一个是应用于MS SQL Server的。SQL Server也提供了扩展存储过程(SQL Server 2000)及CLR(SQL Server 2005)等插件接口,是能够实现相同的功能的。这也正是我现在正在进行的工作。

pgchem::tigress

开源。用在PostgreSQL上。在checkmol/matchmol 和OpenBabel的基础上开发的。文档很nice,但是似乎很久没有更新了。典型的案例是http://www.chemcollect.de/

mychem

开源。用在MySQL上,UDF形式的插件。基于OpenBabel。SVN代码更新比较勤。

OrChem

This project aims at creating an open source chemistry plugin / cartridge for the relational database system Oracle.

其他商业插件

  1. CambridgeSoft Oracle Cartridge
  2. Symyx Direct, Cartridge for Oracle
  3. CHORD, a commercial chemical cartridge for PostgreSQL, is sold by gNova. 基于OEChem。OEChem也不是免费的。
  4. JChem Cartridge, adds chemical intelligence to the Oracle platform.

一本新书

Design and Use of Relational Database in Chemistry,《在化学领域设计和使用关系数据库》。作者是gNova公司的TJ O’Donnell, Ph. D.,那这本书里用到的方法,也自然是这个公司的CHORD插件了。不过这本书对关系数据库中,化学信息的应用,写得比较全面和系统了(从目录上看)。就是有点灌水,连关系数据理论都要占一章。写一本书,讲解理论的同时推销自己的产品,还能挣稿费,这绝对是个好办法。Amazon上的定价是$93.56。

不需要插件的解决方案

Chemical Substructure Search in SQL
Adel Golovin and Kim Henrick,
EMBL-EBI Hinston Hall Genome Campus, Cambridge, U.K.
J. Chem. Inf. Model., Article ASAP
DOI:
10.1021/ci8003013

这是一篇发表不久的文章。我很感叹之前为什么没有过这么直接的思路。分子结构用SMILES表达的原理中,实际上就已经将分子结构抽象为原子(Atom)和键(Bond)之间的组合,并且用生成树(spanning tree)构造成字串。

分子、原子、键、分子结构(spanning tree),都可以,而且很适合在关系数据库中表达出来。子结构检索等功能操作,也可以用基本的SQL来实现。

相对于用插件的方案,优势在于

  1. 不限制于数据库平台。
  2. 更稳定。插件的质量高低,运行于何种进程空间,是否会出现异常,都将直接影响服务器。在重要的运营服务器上,更是隐患。
  3. 性能(performance)更优。对原子和键的查询都可直接利用数据库服务器的索引。
  4. 存储空间上(可能)有改善。当设计得当时,原子、键建立字典表,事实表就都是数字组成的关联表。在基于插件的系统中,以子结构检索为例,为了尽量减少性能消耗最大的插件函数计算,往往会生成碎片(fingerprint)索引,所使用的存储空间,加上SMILES的字符串存储空间,会高于这个方案中的存储空间。尤其是在小分子数据库中。

其他参考资料

Fast Substructure Search series, by Rich Apodaca

How to create a web-based molecular structure database with free software[PDF], by Norbert Haider (Checkmol/matchmol)

Cross tabulation, Pivot table and OLAP Cube summarized by tools

Cross tab

Cross tabulation , is defined on WikiPedia as A cross tabulation (often abbreviated as cross tab) displays the joint distribution of two or more variables.

As below figures, the column values of the first table is transformed to column names in the second one, so the production of

 

 

Cross tabs are frequently used because:

  1. They are easy to understand. They appeal to people who do not want to use more sophisticated measures.
  2. They can be used with any level of data: nominal, ordinal, interval, or ratio - cross tabs treat all data as if it is nominal.
  3. A table can provide greater insight than single statistics.
  4. It solves the problem of empty or sparse cells.
  5. They are simple to conduct.

 

I think the cross-tab is really what the Relational Database means. That’s a meaningful “Relationships” between entities but not entity and it’s attributes.

Solution by SQL Server 2005

Code:

SELECT
    SalesPerson,
    [Oranges] AS Oranges,
    [Pickles] AS Pickles
FROM
    ( SELECT SalesPerson, Product, SalesAmount FROM tblSales ) ps
PIVOT
    (
      SUM (SalesAmount)
      FOR Product IN ( [Oranges], [Pickles])
    ) AS pvt

 

This most weak part of feature is that you must list all pivot column name manually. It’s pain.


BOL: Using PIVOT and UNPIVOT

SELECT < non-pivoted column>,
    [first pivoted column] AS ,
    [second pivoted column] AS ,
    ...
    [last pivoted column] AS
FROM
    (< SELECT query that produces the data>)
    AS
PIVOT
(
    < aggregation function>()
FOR
[< column that contains the values that will become column headers>]
    IN ( [first pivoted column], [second pivoted column],
    ... [last pivoted column])
) AS
< optional ORDER BY clause>;

Solution by MS Access

Microsoft Access provides a very nice cross table design wizard.

With the wizard, we get the cross-tab script based on JET-SQL.

TRANSFORM Sum(tblSales.SalesAmount) AS [Sum of SalesAmount]
SELECT tblSales.SalesPerson, Sum(tblSales.SalesAmount) AS [Total SalesAmount]
FROM tblSales
GROUP BY tblSales.SalesPerson
PIVOT tblSales.Product;

 

 

Pivot columns is automaticly generated and the aggregation over all pivot column can be produced. The JET-SQL is so smart and smoothly for small applications those don’t have to be scaled. You can use almost all VBA programming utilities in it. The TRANSFORM/PIVOT feature is powerful and more then 5 years before the SQL Server’s implementation.

It is based on JET-SQL, so not availible in an ADP project.

Solution by SQL Server 2000, CASE and GROUP

The classic solution of cross-tab in SQL Server 2000 or before is produced by CASE and GROUP.

select
    SalesPerson,
    SUM(CASE Product WHEN 'Pickles' THEN SalesAmount ELSE 0 END ) as [Pickles],
    SUM(CASE Product WHEN 'Oranges' THEN SalesAmount ELSE 0 END ) as [Oranges],
    SUM(SalesAmount)    as [Total Sales]
from
    tblSales
group by
    SalesPerson

We can find it’s footprint clearly in the SQL Server 2005’s pivot feature. And this is a base for other ideas.

Solution by SQL Server 2000, Dynamic SQL

Dynamic Cross-Tabs/Pivot Tables By Rob Volk

The basic idea is to generate dynamic CASE-and-GROUP based SQL statement and execute.

The following comments providing many improvement on this solution. i.e.

  • Introduce an WHERE statement as parameter for the @PivotColTable.
  • Using string to generate pivot columns but global temp table for better supporting of concurency calling.
  • Using user defined table with an user session column instead of global temp table.
  • To summarize multiple values.

There is also an similar solution I have not dived in, Dynamic Crosstab Queries, by Itzik Ben-Gan .

Solution by Excel

But the most easy and rich tool I have used is MS Excel’s pivot table.

Pivot table and OLAP Cube

The basic idea of OLAP Cube is a multi-dimession pivot table. The common visualization tool for OLAP is also a pivot table or pivot chart viewer. 

So, draw a figure to show basic concepts in OLAP by example metioned above as the END.

SQL Server开源免费工具

Buck Woody 在MSDN Blog上列出了他最中意的开源 SQL Server 工具软件,这些开源软件都是在微软的开源软件平台,CodePlex 上进行维护的。他介绍了五个软件。

  1. Web版维护工具,SQLWebTools,http://www.codeplex.com/SQLWebTools。基于Web的SQL Server维护工具,可以用在远程主机上,代替企业管理器(企业管理器不是免费的)。还有一个Php版的工具phpMSAdmin,与之类似。
  2. DBA经常要做的一件事,就是用脚本备份整个数据库结构。ScriptDB就是这样一个工具,http://www.codeplex.com/ScriptDB。这是一个用C#写成的命令行工具,可以将数据库结构以企业管理器中的层次结构备份到文件夹中。可以定期对数据库结构及其中的程序(存储过程、视图、函数、触发器等)做这样的快照,之后进行文本比较或者添加到版本管理工具中。
  3. 数据库比较工具, DbDiff, http://www.codeplex.com/DbDiff。这种工具是非常有用的工具,可是在这之前我还没有找到过真正免费的。DbDiff只能比较数据库结构,不能比较数据。RedGate提供一系列的工具很强大,价格也公道。

  4. SQL Server Express 管理工具,ExpressMaint, http://www.codeplex.com/ExpressMaint。SQL Server Express是免费的数据库服务器,有了各类辅助工具,就很值得一用。

  5. Simple SQL Server Dependencies, http://www.codeplex.com/SSSDependencies, 顾名思义,就是对SQL Server中的Object之间的关系进行分析的工具。

另外,在Del.icio.us中用sqlserver opensource tool这三个Tag ,可以发现下面的这些工具。Enjoy yourself.

  1. SqlDump - MS SQL database backup program

    SqlDump is a program to backup a Microsoft SQL Server database as a text file. SqlDump generates SQL statements for tables, indexes, user-defined types, views, procedures, etc and also the table data.

  2. SourceForge.net: DB Designer Fork

    DB Designer Fork is a fork of the fabFORCE DBDesigner 4. DBDesigner is a visual database design system that integrates entity relationship design and database creation. DB Designer Fork generates SQL scripts for Oracle, SQL Server, MySQL and FireBird.

  3. SQuirreL SQL Client Home Page

    SQuirreL SQL Client is a graphical Java program that will allow you to view the structure of a JDBC compliant database, browse the data in tables, issue SQL commands etc, see Introduction. The minimum version of Java supported is 1.5.x as of SQuirreL vers

  4. Query Express

    “Simple Query Analyzer look-alike, especially useful for MSDE and SQL Express. Also connects to Oracle and other OLE-DB compliant databases. Packaged as a single 100KB executable, i.e. more than 300 times smaller than SQL Server Management Studio.”

  5. Data Dictionary Creator

    Data Dictionary Creator (DDC) is a simple application which helps you document SQL Server databases. It stores all the information in Extended Properties, so it’s easier to keep the documentation in sync with the database as it changes.

  6. Data Modelling Tools

  7. SQL Manager: Database Management Tools for MySQL, SQL Server, PostgreSQL, InterBase, Firebird

    Database maintenance tools for popular relational database servers.

  8. |QueryCommander the sql editor|

    QueryCommander is a free sql editor wrapped in a Visual Studio type of environment. QueryCommander supports:

  9. SqlBI.eu

  10. SqlBulkTool 1.0

    SqlBulkTool is a command line utility that is used to quickly create a mirror of a database. It reads its configuration from an XML file containing source and destination command strings and a list of all the tables to mirror and then handles the work of

  11. TSQLUnit testing framework

  12. SchemaSpy

    Graphical Database Schema Metadata Browser

关系数据库中存储操作树形结构

关系数据库中树形结构的存储,决定因素是对于设计的系统,在此树形结构上的操作是什么。尤其对于节点数较大的树形关系,操作性能肯定是第一考虑因素。当然也和树的具体特典有关系,比如是“扁平型”的还是“纵深型”的。最近的工作涉及到一棵近百万个节点的树,一点总结稍作笔记。

树形结构常见操作

下面的操作大部分从MySQL文档中摘录。

  1. 遍历
  2. Finding all Child nodes / 取得所有的下级节点
  3. Find the Immediate Subordinates of a Node
  4. Finding all the Leaf Nodes / 取得所有下级叶子节点
  5. Depth of a Sub-Tree / 取得子树的深度
  6. Aggregate Functions in a Nested Set
  7. Retrieving a Single Path (all ancestors nodes) / 取得当前节点到根节点的路径
  8. Finding the Depth of the Nodes. 相当于7,找到所有父节点。
  9. Adding New Nodes
  10. Deleting Nodes

可简单分类为

  • 向下搜索(1,2,3,4,5,6)
  • 向上搜索(7,8)
  • 修改(9,10)

其中3和2,4等在不同数据结构基础上的实现中,非常不同。以下分别对几种常见的表示方法,作个简单总结。

线性表示

         A
       / | \
      B  C  D
        /|   \
       E  F   G

A(B, C(E, F), D(G,),)

相当于宽度优先的遍历。
“线性表示”的意思是说,在一行文本中,表示整个复杂结构。也许可以用在数据压缩上吧,基本上与这里讲述的关系数据库关系不大。

node-id & parent-node-id

最常见的表示方法,represents a hierarchy is by using parent child relationship
在关系数据库中,一行数据表示树中的一个节点(后面叙述的结构均是如此)。这一行中存储当前节点的ID(NodeId)和它的父节点的ID(ParentNodeId),因为每个节点的父节点是唯一的。在这两个ID上做索引。

此种结构最利于临近层次的节点搜索(常见操作之3)。对于节点的增删修改(常见操作之9,10),也很容易,只要修改单个节点的属性即可。

操作3(Find the Immediate Subordinates of a Node)在展示树型数据时很有用。比如动态展开节点。实际上很多树形控件,其数据结构也正是这样的。比如YUI的TreeView ;MSComctlLib.TreeCtrl;Graphviz;ASP.NET中的Tree view等等。

对于操作2,4,7,8,在SQL Server中可以使用Inline Table-valued FunctionsMultistatement Table-valued Functions 递归调用实现。当然性能上会有损失。

Deep-first Tree

 

Depth-First Traversal


  A B C D E F G
node-id 1 2 3 4 5 6 7
most-right-child-id 7 2 5 4 5 7 7

Deep-first结构中,每条Node记录需要包含node-id和most-right-child-id。

常见操作2,4,7,8可以高效实现。但是对于操作3(immediately children),实现起来比较难,不如parent/child的容易、高效。可以结合使用。

–Get the path from the root to the given node:
SELECT t1.*
FROM employee t1, employee t2
WHERE t1.nodeindex <= i
AND t1.rightchildindex >= t2.rightchildindex
AND t2.nodeindex = i;

增删修改操作性能消耗比较大。

-- Insert a node as a child of i: UPDATE employee
SET employee.nodeindex = employee.nodeindex + 1
WHERE employee.nodeindex > parentnode;

UPDATE employee
SET employee.rightchildindex = employee.rightchildindex + 1
WHERE employee.rightchildindex >= parentnode;

INSERT INTO employee
(nodeindex, rightchildindex, NAME, comments)
VALUES (parentnode + 1, parentnode + 1, nodename, nodecomments);

-- Delete the node with NodeIndex i: DELETE employee
WHERE employee.nodeindex = i;

UPDATE employee
SET employee.nodeindex = employee.nodeindex - 1
WHERE employee.nodeindex > i;

UPDATE employee
SET employee.rightchildindex = employee.rightchildindex - 1
WHERE employee.rightchildindex > i;

Nested set, 节点左右边缘

左边缘作为node-id,同时记录right-extent。与Deep-first Tree等价。
 

-- All Leaf nodes (an item with any children) can be
-- identified by Left = Right - 1
SELECT * FROM Employee WHERE [LeftExtent] = [RightExtent] - 1

Build a deep-first tree from parent child relationship

Python code

#{
def process_node( catid ):
    n = retrieve_node( catid )
    dfid = global_increased_id()                # with a increased id
    most_right_child_dfid = dfid
    for child in retrieve_children( n ):
        most_right_child_dfid = process_node( child )
    upd_to_db( catid, dfid, most_right_child_dfid )
    print dfid
    return most_right_child_dfid

The WITH keyword in SQL Server 2005/2008

在SQL Server 2005及以上的版本的中,加入了WITH这个关键字。WITH关键字定义了匿名函数,通过表达式形成表定义common table expression (CTE),而且可以递归。而 node-id & parent-node-id 模型中的“取得当前节点到根节点的路径”操作,就需要对表值函数进行递归调用。用WITH关键字进行递归操作的例子如下

 

with HierarchyCTE (NID, PID, Lvl) as
 (select NodeID, ParentID, 0
 from dbo.Hierarchy
 where NodeID = @NodeID_in
 union all
 select NodeID, ParentID, Lvl + 1
 from dbo.Hierarchy
 inner join HierarchyCTE
 on PID = NodeID)
select *
from HierarchyCTE

References

 

 

Random posts

  • 读一段论语
  • Cross tabulation, Pivot table and OLAP Cube summarized by tools
  • 超市用防盗磁扣的原理
  • 推荐8box的音乐专题:在忧伤的旋律中的情人节
  • 04年小白赠诗