1、2022年11月11日星期五1 Data Warehousing2022年11月11日星期五2 1.传统数据库以及OLTP(On-Line Transaction Processing 联机事务处理)在日常的管理事务处理中获得了巨大的成功,但是对管理人员的决策分析要求却无法满足。2.因为,管理人员常常希望能够通过对组织中的大量数据进行分析,了解业务的的发展趋势。而传统数据库只保留了当前的业务处理信息,缺乏决策分析所需要的大量的历史信息。3.为满足管理人员的决策分析需要,就需要在数据库的基础上产生适应决策分析的数据环境数据仓库(Data Warehose)。数据仓库技术出现的背景数据仓库技术出现
2、的背景2022年11月11日星期五3n1-What is a data warehouse?2022年11月11日星期五4What is Data Warehouse?nDefined in many different ways,but not rigorously.nA decision support database that is maintained separately from the organizations operational databasenSupport information processing by providing a solid platform o
3、f consolidated,historical data for analysis.n“A data warehouse is a subject-oriented,integrated,time-variant,and nonvolatile collection of data in support of managements decision-making process.”W.H.Inmon2022年11月11日星期五5数据仓库的定义与基本特性数据仓库的定义与基本特性n William H.Inmon在1993年所写的论著Building the DataWarehouse首先系
4、统地阐述了关于数据仓库的思想、理论,为数据仓库的发展奠定了历史基石。文中他将数据仓库定义为:a data warehouse is a subject-oriented,integrated,non-volatile,time-variant collection of data in support of management decisions.一个面向主题的、集成的、非易失性的、随时间变化的数据的集合,以用于支持管理层决策过程。2022年11月11日星期五6Data WarehouseSubject-OrientednOrganized around major subjects,suc
5、h as customer,product,sales.nFocusing on the modeling and analysis of data for decision makers,not on daily operations or transaction processing.nProvide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.2022年11月11日星期五7Da
6、ta WarehouseIntegratednConstructed by integrating multiple,heterogeneous data sourcesnrelational databases,flat files,on-line transaction recordsnData cleaning and data integration techniques are applied.nEnsure consistency in naming conventions,encoding structures,attribute measures,etc.among diffe
7、rent data sourcesnE.g.,Hotel price:currency,tax,breakfast covered,etc.nWhen data is moved to the warehouse,it is converted.2022年11月11日星期五8Data WarehouseTime VariantnThe time horizon for the data warehouse is significantly longer than that of operational systems.nOperational database:current value da
8、ta.nData warehouse data:provide information from a historical perspective(e.g.,past 5-10 years)nEvery key structure in the data warehousenContains an element of time,explicitly or implicitlynBut the key of operational data may or may not contain“time element”.2022年11月11日星期五9Data WarehouseNon-Volatil
9、enA physically separate store of data transformed from the operational environment.nOperational update of data does not occur in the data warehouse environment.nDoes not require transaction processing,recovery,and concurrency control mechanismsnRequires only two operations in data accessing:ninitial
10、 loading of data and access of data.2022年11月11日星期五10Data Warehouse vs.Heterogeneous DBMSnTraditional heterogeneous DB integration:nBuild wrappers/mediators on top of heterogeneous databases nQuery driven approachnWhen a query is posed to a client site,a meta-dictionary is used to translate the query
11、 into queries appropriate for individual heterogeneous sites involved,and the results are integrated into a global answer setnComplex information filtering,compete for resourcesnData warehouse:update-driven,high performancenInformation from heterogeneous sources is integrated in advance and stored i
12、n warehouses for direct query and analysis2022年11月11日星期五11Data Warehouse vs.Operational DBMSnOLTP(on-line transaction processing)nMajor task of traditional relational DBMSnDay-to-day operations:purchasing,inventory,banking,manufacturing,payroll,registration,accounting,etc.nOLAP(on-line analytical pr
13、ocessing)nMajor task of data warehouse systemnData analysis and decision makingnDistinct features(OLTP vs.OLAP):nUser and system orientation:customer vs.marketnData contents:current,detailed vs.historical,consolidatednDatabase design:ER+application vs.star+subjectnView:current,local vs.evolutionary,
14、integratednAccess patterns:update vs.read-only but complex queries2022年11月11日星期五12OLTP vs.OLAP OLTP OLAP users clerk,IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current,up-to-date detailed,flat relational isola
15、ted historical,summarized,multidimensional integrated,consolidated usage repetitive ad-hoc access read/write index/hash on prim.key lots of scans unit of work short,simple transaction complex query#records accessed tens millions#users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction t
16、hroughput query throughput,response 2022年11月11日星期五13Why Separate Data Warehouse?nHigh performance for both systemsnDBMS tuned for OLTP:access methods,indexing,concurrency control,recoverynWarehousetuned for OLAP:complex OLAP queries,multidimensional view,consolidation.nDifferent functions and differ
17、ent data:nmissing data:Decision support requires historical data which operational DBs do not typically maintainndata consolidation:DS requires consolidation(aggregation,summarization)of data from heterogeneous sourcesndata quality:different sources typically use inconsistent data representations,co
18、des and formats which have to be reconciled2022年11月11日星期五14n2-A multi-dimensional data model2022年11月11日星期五15From Tables and Spreadsheets to Data CubesnA data warehouse is based on a multidimensional data model which views data in the form of a data cubenA data cube,such as sales,allows data to be mo
19、deled and viewed in multiple dimensionsnDimension tables,such as item(item_name,brand,type),or time(day,week,month,quarter,year)nFact table contains measures(such as dollars_sold)and keys to each of the related dimension tablesnIn data warehousing literature,an n-D base cube is called a base cuboid.
20、The top most 0-D cuboid,which holds the highest-level of summarization,is called the apex cuboid.The lattice of cuboids forms a data cube.2022年11月11日星期五16Cube:A Lattice of Cuboidsalltimeitemlocationsuppliertime,itemtime,locationtime,supplieritem,locationitem,supplierlocation,suppliertime,item,locati
21、ontime,item,suppliertime,location,supplieritem,location,suppliertime,item,location,supplier0-D(apex)cuboid1-D cuboids2-D cuboids3-D cuboids4-D(base)cuboid2022年11月11日星期五17A Concept Hierarchy:Dimension(location)allEuropeNorth_AmericaMexicoCanadaSpainGermanyVancouverM.WindL.Chan.allregionofficecountryT
22、orontoFrankfurtcity2022年11月11日星期五18Multidimensional DatanSales volume as a function of product,month,and regionProductRegionMonthDimensions:Product,Location,TimeHierarchical summarization pathsIndustry Region YearCategory Country QuarterProduct City Month Week Office Day2022年11月11日星期五19A Sample Data
23、 CubeTotal annual salesof TV in U.S.A.DateProductCountrysumsum TVVCRPC1Qtr2Qtr3Qtr4QtrU.S.ACanadaMexicosum2022年11月11日星期五20Cuboids Corresponding to the Cubeallproductdatecountryproduct,dateproduct,countrydate,countryproduct,date,country0-D(apex)cuboid1-D cuboids2-D cuboids3-D(base)cuboid2022年11月11日星期
24、五21n3-Conceptual Modeling of Data Warehouses2022年11月11日星期五22Modeling data warehouses:dimensions&measuresnStar schema:A fact table in the middle connected to a set of dimension tables nSnowflake schema:A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller di
25、mension tables,forming a shape similar to snowflakengalaxy schema:Multiple fact tables share dimension tables,viewed as a collection of stars,therefore called galaxy schema2022年11月11日星期五23Example of Star Schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovince_or_streetcou
26、ntrylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranch2022年11月11日星期五24Example of Snowflake Schematime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreet
27、city_keylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_keyitembranch_keybranch_namebranch_typebranchsupplier_keysupplier_typesuppliercity_keycityprovince_or_streetcountrycity2022年11月11日星期五25Example of Gal
28、axy schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovince_or_streetcountrylocationSales Fact Tabletime_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranchShipping
29、 Fact Tabletime_key item_key shipper_key from_location to_location dollars_cost units_shippedshipper_keyshipper_namelocation_keyshipper_typeshipper2022年11月11日星期五26n4-Typical OLAP Operations2022年11月11日星期五27nRoll up(drill-up):summarize datanby climbing up hierarchy or by dimension reductionnDrill down
30、(roll down):reverse of roll-upnfrom higher level summary to lower level summary or detailed data,or introducing new dimensionsnSlice and dice:nproject and select nPivot(rotate):nreorient the cube,visualization,3D to series of 2D planes.nOther operationsndrill across:involving(across)more than one fact tablendrill through:through the bottom level of the cube to its back-end relational tables(using SQL)