活的数据,中国30年

非常值得看的一段视频,数据可视化演示得精彩。

讲演的大爷以前是研究社会发展的,研究的数据也与此有关。其中一共有三个片段分别提到了中国。对中国的发展历程,展示的很生动。

家庭人口和人均寿命

  • 纵坐标是平均寿命,横坐标是家庭人口(准确讲是“每个母亲的孩子数”)。
  • 总体来讲,是发展中国家(红点)家庭人口多,平均寿命低,发达国家相反。
  • 62年时,最左边的大红点是中国,下面的红点是印度。那个时候中国就比印度平均寿命高。
  • 随着时间变化,国家都在发展,大多数国家都移动到了“家庭人口少,平均寿命高”的位置。
  • 中国的发展速度,始终是领先于印度。
  • 中国在60年代有一次很诡异的平均寿命下降的动作,是那场天灾人祸。
  • 中国跑得快,肯定和计划生育相关。

婴儿的成活率

  • 世界各国的婴儿成活率(纵轴),与经济水品(横轴GDP)有明显的线性关系。
  • 中国(图中的大黄点),在时间上的变化很有趣。
  • 1978年之前的“毛时代”,中国的经济水品一直处于世界的末位,但是婴儿的成活率却是大大高于这条规律直线。这就是成就。很多人怀念、赞赏毛时代的体制,不全无道理。
  • 1978年之后的“邓时代”,中国的经济水平迅速攀升,但是婴儿成活率却没有提高。加迅速地回归到规律直线。
  • 简单的讲,改革开放后中国人生活水准的提高速度,与中国经济发展的速度不匹配。

收入分布

  • 红色是中国的收入分布;绿色是美国。
  • 在1972年,“根本没有交集”,美国最穷的人也比中国最富的人有钱。
  • 曲线的宽度是贫富差距。1972年中国贫富差距很小,美国很大。
  • 1999年,中国“像一个幽灵一样”,扑向美国。这段动画非常形象。
  • 但是1999年,中国的收入分布形状,很明显地展示了中国的贫富差距增大的趋势。
 

 

Cross tabulation, Pivot table and OLAP Cube summarized by tools

Cross tab

Cross tabulation , is defined on WikiPedia as A cross tabulation (often abbreviated as cross tab) displays the joint distribution of two or more variables.

As below figures, the column values of the first table is transformed to column names in the second one, so the production of

 

 

Cross tabs are frequently used because:

  1. They are easy to understand. They appeal to people who do not want to use more sophisticated measures.
  2. They can be used with any level of data: nominal, ordinal, interval, or ratio - cross tabs treat all data as if it is nominal.
  3. A table can provide greater insight than single statistics.
  4. It solves the problem of empty or sparse cells.
  5. They are simple to conduct.

 

I think the cross-tab is really what the Relational Database means. That’s a meaningful “Relationships” between entities but not entity and it’s attributes.

Solution by SQL Server 2005

Code:

SELECT
    SalesPerson,
    [Oranges] AS Oranges,
    [Pickles] AS Pickles
FROM
    ( SELECT SalesPerson, Product, SalesAmount FROM tblSales ) ps
PIVOT
    (
      SUM (SalesAmount)
      FOR Product IN ( [Oranges], [Pickles])
    ) AS pvt

 

This most weak part of feature is that you must list all pivot column name manually. It’s pain.


BOL: Using PIVOT and UNPIVOT

SELECT < non-pivoted column>,
    [first pivoted column] AS ,
    [second pivoted column] AS ,
    ...
    [last pivoted column] AS
FROM
    (< SELECT query that produces the data>)
    AS
PIVOT
(
    < aggregation function>()
FOR
[< column that contains the values that will become column headers>]
    IN ( [first pivoted column], [second pivoted column],
    ... [last pivoted column])
) AS
< optional ORDER BY clause>;

Solution by MS Access

Microsoft Access provides a very nice cross table design wizard.

With the wizard, we get the cross-tab script based on JET-SQL.

TRANSFORM Sum(tblSales.SalesAmount) AS [Sum of SalesAmount]
SELECT tblSales.SalesPerson, Sum(tblSales.SalesAmount) AS [Total SalesAmount]
FROM tblSales
GROUP BY tblSales.SalesPerson
PIVOT tblSales.Product;

 

 

Pivot columns is automaticly generated and the aggregation over all pivot column can be produced. The JET-SQL is so smart and smoothly for small applications those don’t have to be scaled. You can use almost all VBA programming utilities in it. The TRANSFORM/PIVOT feature is powerful and more then 5 years before the SQL Server’s implementation.

It is based on JET-SQL, so not availible in an ADP project.

Solution by SQL Server 2000, CASE and GROUP

The classic solution of cross-tab in SQL Server 2000 or before is produced by CASE and GROUP.

select
    SalesPerson,
    SUM(CASE Product WHEN 'Pickles' THEN SalesAmount ELSE 0 END ) as [Pickles],
    SUM(CASE Product WHEN 'Oranges' THEN SalesAmount ELSE 0 END ) as [Oranges],
    SUM(SalesAmount)    as [Total Sales]
from
    tblSales
group by
    SalesPerson

We can find it’s footprint clearly in the SQL Server 2005’s pivot feature. And this is a base for other ideas.

Solution by SQL Server 2000, Dynamic SQL

Dynamic Cross-Tabs/Pivot Tables By Rob Volk

The basic idea is to generate dynamic CASE-and-GROUP based SQL statement and execute.

The following comments providing many improvement on this solution. i.e.

  • Introduce an WHERE statement as parameter for the @PivotColTable.
  • Using string to generate pivot columns but global temp table for better supporting of concurency calling.
  • Using user defined table with an user session column instead of global temp table.
  • To summarize multiple values.

There is also an similar solution I have not dived in, Dynamic Crosstab Queries, by Itzik Ben-Gan .

Solution by Excel

But the most easy and rich tool I have used is MS Excel’s pivot table.

Pivot table and OLAP Cube

The basic idea of OLAP Cube is a multi-dimession pivot table. The common visualization tool for OLAP is also a pivot table or pivot chart viewer. 

So, draw a figure to show basic concepts in OLAP by example metioned above as the END.

Implementing Slope-One in T-SQL

Slope-One, the simplest form of non-trivial item-based collaborative filtering based on ratings. (Original Paper)
Referencing to Bryan O’Sullivan’s tutorial of implementing Slope One in Python, I write a the implementation in T-SQL. Believe it useful to many people and projects.

Brief process summary

  • Define the fact table as user data.
  • Calculating intermediate matrix(FreqDiff). The information about users is eliminated and frequency/ score differences data between items is produced.
  • Predicting from user input score with the intermediate data.

Data schema

The UserData is fact table of business transactions. I use an view to wrap it for switching between testing data and working data.
The Freq&Diff matrix is square and sparse. Only non-zero values is meaningful and stored. And a half-matrix triangle holds full information about the matrix.
There created two indices to avoid heavily bookmark-lookup.

create table UserData (                  -- fact table
    userid   varchar(50) not null,
    itemid   varchar(50) not null,
    rating   float not null default 0,
    updtime  datetime default getdate(),
    primary key (userid, itemid)
)
GO

create table FreqDiff (                  -- Freqs and Diffs
    itemid1  varchar(50),
    itemid2  varchar(50),
    freq     float not null default 0,
    diff     float not null default 0,
    updtime  datetime default getdate(),
    primary key (itemid1, itemid2)
)
GO
create index idx_freqdiff_itemid1 on FreqDiff(itemid1, freq, diff, itemid2)
create index idx_freqdiff_itemid2 on FreqDiff(itemid2, freq, diff, itemid1)

/*
 * The matrix FreqDiff is *almost* symmetric,
 * so only half of the data need to be stored.
 * There would be huge of space (50%) saved for large dataset.
 */
alter view vw_freqdiff as
select itemid1 as itemid1, itemid2 as itemid2, freq,     diff from FreqDiff fd
union all
select itemid2 as itemid1, itemid1 as itemid2, freq, -1* diff from FreqDiff fd
GO

/*
 * Wrap for userdata,
 * switch from one model to another easily.
 */
alter view vw_userdata as
select * from userdata
GO

Testing data

Same as Bryan’s but names changed for easily debugging print.

-- init userdata, Bryan O'Sullivan's sample data is used
insert into UserData values ( 'u1', 'i1',  1, getdate() )
insert into UserData values ( 'u1', 'i2', .5, getdate() )
insert into UserData values ( 'u1', 'i3', .2, getdate() )
insert into UserData values ( 'u2', 'i1',  1, getdate() )
insert into UserData values ( 'u2', 'i3', .5, getdate() )
insert into UserData values ( 'u2', 'i4', .2, getdate() )
insert into UserData values ( 'u3', 'i1', .2, getdate() )
insert into UserData values ( 'u3', 'i2', .4, getdate() )
insert into UserData values ( 'u3', 'i3',  1, getdate() )
insert into UserData values ( 'u3', 'i4', .4, getdate() )
insert into UserData values ( 'u4', 'i2', .9, getdate() )
insert into UserData values ( 'u4', 'i3', .4, getdate() )
insert into UserData values ( 'u4', 'i4', .5, getdate() )
GO

Processing the intermediate table

-- update process
delete FreqDiff
insert into FreqDiff
select
    ud1.itemid, ud2.itemid, count(*), (sum(ud1.rating - ud2.rating))/count(*), getdate()
from
    vw_userdata ud1
    join vw_userdata ud2 on
            ud1.userid = ud2.userid
        and ud1.itemid > ud2.itemid
group by ud1.itemid, ud2.itemid

Predicting

-- predict process
declare @pref table(itemid varchar(50), rating float)
insert into @pref values('i1', 0.4)

select -- distinct top 10
    itemid1,
    sum(freq)                               as freq,
    sum(freq*(diff + rating))            as pref,
    sum(freq*(diff + rating)) /sum(freq) as rating
from
    vw_freqdiff fd
    join @pref p on fd.itemid2 = p.itemid
where itemid1 not in( select itemid from @pref )
group by itemid1

Further works as intermediate data updating seems easy.
So, writing here, listening for suggestions.

Random posts

  • 积水潭医院回龙观院区
  • Hello world!
  • 被树熊点到,回答问题
  • 从google earth上看龙泽站的黑车
  • 春运铁路的确应该涨价,吴敬琏说得很对