Cross tabulation, Pivot table and OLAP Cube summarized by tools

Cross tab

Cross tabulation , is defined on WikiPedia as A cross tabulation (often abbreviated as cross tab) displays the joint distribution of two or more variables.

As below figures, the column values of the first table is transformed to column names in the second one, so the production of

 

 

Cross tabs are frequently used because:

  1. They are easy to understand. They appeal to people who do not want to use more sophisticated measures.
  2. They can be used with any level of data: nominal, ordinal, interval, or ratio - cross tabs treat all data as if it is nominal.
  3. A table can provide greater insight than single statistics.
  4. It solves the problem of empty or sparse cells.
  5. They are simple to conduct.

 

I think the cross-tab is really what the Relational Database means. That’s a meaningful “Relationships” between entities but not entity and it’s attributes.

Solution by SQL Server 2005

Code:

SELECT
    SalesPerson,
    [Oranges] AS Oranges,
    [Pickles] AS Pickles
FROM
    ( SELECT SalesPerson, Product, SalesAmount FROM tblSales ) ps
PIVOT
    (
      SUM (SalesAmount)
      FOR Product IN ( [Oranges], [Pickles])
    ) AS pvt

 

This most weak part of feature is that you must list all pivot column name manually. It’s pain.


BOL: Using PIVOT and UNPIVOT

SELECT < non-pivoted column>,
    [first pivoted column] AS ,
    [second pivoted column] AS ,
    ...
    [last pivoted column] AS
FROM
    (< SELECT query that produces the data>)
    AS
PIVOT
(
    < aggregation function>()
FOR
[< column that contains the values that will become column headers>]
    IN ( [first pivoted column], [second pivoted column],
    ... [last pivoted column])
) AS
< optional ORDER BY clause>;

Solution by MS Access

Microsoft Access provides a very nice cross table design wizard.

With the wizard, we get the cross-tab script based on JET-SQL.

TRANSFORM Sum(tblSales.SalesAmount) AS [Sum of SalesAmount]
SELECT tblSales.SalesPerson, Sum(tblSales.SalesAmount) AS [Total SalesAmount]
FROM tblSales
GROUP BY tblSales.SalesPerson
PIVOT tblSales.Product;

 

 

Pivot columns is automaticly generated and the aggregation over all pivot column can be produced. The JET-SQL is so smart and smoothly for small applications those don’t have to be scaled. You can use almost all VBA programming utilities in it. The TRANSFORM/PIVOT feature is powerful and more then 5 years before the SQL Server’s implementation.

It is based on JET-SQL, so not availible in an ADP project.

Solution by SQL Server 2000, CASE and GROUP

The classic solution of cross-tab in SQL Server 2000 or before is produced by CASE and GROUP.

select
    SalesPerson,
    SUM(CASE Product WHEN 'Pickles' THEN SalesAmount ELSE 0 END ) as [Pickles],
    SUM(CASE Product WHEN 'Oranges' THEN SalesAmount ELSE 0 END ) as [Oranges],
    SUM(SalesAmount)    as [Total Sales]
from
    tblSales
group by
    SalesPerson

We can find it’s footprint clearly in the SQL Server 2005’s pivot feature. And this is a base for other ideas.

Solution by SQL Server 2000, Dynamic SQL

Dynamic Cross-Tabs/Pivot Tables By Rob Volk

The basic idea is to generate dynamic CASE-and-GROUP based SQL statement and execute.

The following comments providing many improvement on this solution. i.e.

  • Introduce an WHERE statement as parameter for the @PivotColTable.
  • Using string to generate pivot columns but global temp table for better supporting of concurency calling.
  • Using user defined table with an user session column instead of global temp table.
  • To summarize multiple values.

There is also an similar solution I have not dived in, Dynamic Crosstab Queries, by Itzik Ben-Gan .

Solution by Excel

But the most easy and rich tool I have used is MS Excel’s pivot table.

Pivot table and OLAP Cube

The basic idea of OLAP Cube is a multi-dimession pivot table. The common visualization tool for OLAP is also a pivot table or pivot chart viewer. 

So, draw a figure to show basic concepts in OLAP by example metioned above as the END.

Implementing Slope-One in T-SQL

Slope-One, the simplest form of non-trivial item-based collaborative filtering based on ratings. (Original Paper)
Referencing to Bryan O’Sullivan’s tutorial of implementing Slope One in Python, I write a the implementation in T-SQL. Believe it useful to many people and projects.

Brief process summary

  • Define the fact table as user data.
  • Calculating intermediate matrix(FreqDiff). The information about users is eliminated and frequency/ score differences data between items is produced.
  • Predicting from user input score with the intermediate data.

Data schema

The UserData is fact table of business transactions. I use an view to wrap it for switching between testing data and working data.
The Freq&Diff matrix is square and sparse. Only non-zero values is meaningful and stored. And a half-matrix triangle holds full information about the matrix.
There created two indices to avoid heavily bookmark-lookup.

create table UserData (                  -- fact table
    userid   varchar(50) not null,
    itemid   varchar(50) not null,
    rating   float not null default 0,
    updtime  datetime default getdate(),
    primary key (userid, itemid)
)
GO

create table FreqDiff (                  -- Freqs and Diffs
    itemid1  varchar(50),
    itemid2  varchar(50),
    freq     float not null default 0,
    diff     float not null default 0,
    updtime  datetime default getdate(),
    primary key (itemid1, itemid2)
)
GO
create index idx_freqdiff_itemid1 on FreqDiff(itemid1, freq, diff, itemid2)
create index idx_freqdiff_itemid2 on FreqDiff(itemid2, freq, diff, itemid1)

/*
 * The matrix FreqDiff is *almost* symmetric,
 * so only half of the data need to be stored.
 * There would be huge of space (50%) saved for large dataset.
 */
alter view vw_freqdiff as
select itemid1 as itemid1, itemid2 as itemid2, freq,     diff from FreqDiff fd
union all
select itemid2 as itemid1, itemid1 as itemid2, freq, -1* diff from FreqDiff fd
GO

/*
 * Wrap for userdata,
 * switch from one model to another easily.
 */
alter view vw_userdata as
select * from userdata
GO

Testing data

Same as Bryan’s but names changed for easily debugging print.

-- init userdata, Bryan O'Sullivan's sample data is used
insert into UserData values ( 'u1', 'i1',  1, getdate() )
insert into UserData values ( 'u1', 'i2', .5, getdate() )
insert into UserData values ( 'u1', 'i3', .2, getdate() )
insert into UserData values ( 'u2', 'i1',  1, getdate() )
insert into UserData values ( 'u2', 'i3', .5, getdate() )
insert into UserData values ( 'u2', 'i4', .2, getdate() )
insert into UserData values ( 'u3', 'i1', .2, getdate() )
insert into UserData values ( 'u3', 'i2', .4, getdate() )
insert into UserData values ( 'u3', 'i3',  1, getdate() )
insert into UserData values ( 'u3', 'i4', .4, getdate() )
insert into UserData values ( 'u4', 'i2', .9, getdate() )
insert into UserData values ( 'u4', 'i3', .4, getdate() )
insert into UserData values ( 'u4', 'i4', .5, getdate() )
GO

Processing the intermediate table

-- update process
delete FreqDiff
insert into FreqDiff
select
    ud1.itemid, ud2.itemid, count(*), (sum(ud1.rating - ud2.rating))/count(*), getdate()
from
    vw_userdata ud1
    join vw_userdata ud2 on
            ud1.userid = ud2.userid
        and ud1.itemid > ud2.itemid
group by ud1.itemid, ud2.itemid

Predicting

-- predict process
declare @pref table(itemid varchar(50), rating float)
insert into @pref values('i1', 0.4)

select -- distinct top 10
    itemid1,
    sum(freq)                               as freq,
    sum(freq*(diff + rating))            as pref,
    sum(freq*(diff + rating)) /sum(freq) as rating
from
    vw_freqdiff fd
    join @pref p on fd.itemid2 = p.itemid
where itemid1 not in( select itemid from @pref )
group by itemid1

Further works as intermediate data updating seems easy.
So, writing here, listening for suggestions.

一些厂商提供的数据仓库工具

200811

19:48

做一个最浅度的分析。孰优孰劣从软件的外观上可知一二。图都来自于厂商在线手册。

公 司 名 称

ETL工具

数据仓库管理工具

OLAP工具

数据挖掘工具

报 表 工 具

IBM

Warehouse Manager

Visual Warehouse

OLAP Server

Intelligent

Miner

Insight&Qucik view

Oracle

Oracle ETL Server

Enterprise Manager

Express Server

Darwin

Express Analyser

Sybase

Replication Server PowerStage

Warehouse Studio

Warehouse Analyzer

SAS

SPSS

InfoMaker

CA

InfoPump

PLATINUM ERWin

PLATINUM InfoPump

DecisionBase InfoBeacon

Neugent

Aion

Forest&Trees

InfoReports

SAS

第三方

Warehouse Administrator

SAS MDDB

Enterprise Miner

EIS ER

Microsoft

SSIS

SQL Server

SSAS

SSAS

SSRS

Pasted from <http://book.csdn.net/bookfiles/537/10053718137.shtml>

IBM DB2 Business Intelligence

Pasted from <http://publib.boulder.ibm.com/infocenter/db2v7luw/topic/com.ibm.db2v7.doc/db2tu/db2tuo10.htm>

Oracle Warehouse Builder

Pasted from <http://download.oracle.com/docs/cd/B28359_01/owb.111/b31278/concept_data_modeling.htm>

Sybase WarehouseArchitect

Pasted from <http://manuals.sybase.com/onlinebooks/group-wa/wag0320e/waug/@Generic__BookTextView/28735;pt=30045>

Dimension Structure tab of Dimension Designer.

Pasted from <http://msdn2.microsoft.com/en-us/library/ms169952.aspx>

用于数据统计的R语言及RPy

类似于Matlab的统计模块,R语言是一种用于数据统计的开源免费语言(软件)。它是从贝尔实验室在70年代开发的S语言演变而来的,是成熟、通用的统计计算用工具,同时具有很强的绘图功能。IBM DW上的这篇文章介绍得很好。

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …). 所以,R语言在数据挖掘中也可以得到应用。

r demo
r demo

针对R语言定期发布的一系列通讯,编辑得很好,对理解统计技术及应是很有帮助的,值得去读。

R语言可以通过RPy嵌入到Python环境中。这使得整合复杂的开发环境和工具成为可能。

摘录一段例子,应用RPy进行最小二乘线性拟合以及绘图的过程。

from rpy import r
my_x = [5.05, 6.75, 3.21, 2.66]
my_y = [1.65, 26.5, -5.93, 7.96]
linear_model = r.lm(r("y ~ x"), data = r.data_frame(x=my_x, y=my_y))
gradient = linear_model['coefficients']['x']
yintercept= linear_model['coefficients']['(Intercept)']

r.png("scatter_regression.png", width=400, height=350)
r.plot(x=my_x, y=my_y, xlab="x", ylab="y", xlim=(0,7), ylim=(-16,27),
           main="Example Scatter with regression")
r.abline(a=yintercept, b=gradient, col="red")
r.dev_off()

一个PNG图片输出出来

Random posts

  • Moved from Blogger to Wordpress
  • 弃bloglines.com投reader.google.com
  • 要个什么样的小孩?风间、正男、阿呆还是小新?
  • Structure searching supported compound database based on MS Access
  • 同是底层的人