1、XML 代价估计方法研究综述代价估计方法研究综述 XML Group Outlinen研究代价估计的必要性nxml中代价估计研究的难点n背景知识介绍n现有方法分类研究代价估计的必要性n查询优化的基础n不同分支对查询目标的选择率不同,如果选择性低的分支先于选择性高的分支执行,就会有效的减少中间结果从而提供查询效率n结构连接在XML中是一个很重要的操作,连接的顺序的选择需要计算不同连接顺序的代价n及早给用户提供反馈信息xml中代价估计研究的难点nXML数据的不规则性是对传统统计信息方法的重要挑战,具体表现在以下几个方面:n路径依赖性,n如同为name结点,在person下和在city下出现意义就完
2、全不同n结构相关性n嵌套在不同祖先下面的同类结点的个数差别可能很大,如book结点下的author个数是不确定的n值相关性n/purchase/Itemtype=bookprice100nXML的有序性制约了转换规则的灵活性n所有这些问题,都使得在xml中采用传统的代价估计方法不切实际,会带来很大的误差。针对xml数据的特点,我们应该寻求一种新的代价估计方法。Background Knowledge nXML Data ModelnA large,node-labeled tree T(V,E)d0a1a2a3p4p5n6t14k15y131999y16t17k18k19n7b9p8y20t2
3、1k22y24t25k26n10b12p9t23t26200220012000Example XML Document TreeBackground Knowledge nXML Data ModelnA large,node-labeled tree T(V,E)d0a1a2a3p4p5n6t14k15y131999y16t17k18k19n7b9p8y20t21k22y24t25k26n10b12p9t23t26200220012000Example XML DocumentBackground KnowledgenXML Query ModelnA node-labeled query
4、tree TQnEach node of TQ is labeled with a variable name qi in Q nEach edge(qi,qj)of TQ is annotated with an XPath expression path(qi,qj)that describes the specific structural constraints specified in QQuery Fragment:for$b in doc(“bib.xml”)/bib/book$a in$b/author$t in$b/title bibbookauthortitleroot$b
5、$a$tTwig Query Model现有估计方法的分类n路径选择性计算nTwig匹配个数统计n值谓词选择率估计n结构连接代价估计 Overviewdata treedifferent summarization methods based on:pathall m-length pathF/B-stabilityLoreMcHugh et al,VLDB99pruningPathTree,Markov TableAboulnaga et al,VLDB01XSketchPolyzotis et al,VLDB02typeXSketchesPolyzotis et al,SIGMOD02+v
6、alue+twigXSketchesPolyzotis et al,SIGMOD04Region Code2D Model1D ModelStatiXFreire et al,SIGMOD02PH HistogramWu Y EDBT 2002Interva&Position ModelH.Lu SIGMOD 2003路径选择性计算n问题描述 /a/bc/d/eg/e/f/*/*/e/f/a/b/*/*/e/f/c/d/e/g/e/f/a/b/*/*/e/f/c/d/e/g/e/fPath Tree&Markov Table nAshraf Aboulnaga,Alaa R.Alameldee
7、n,and Jeffry F.Naughton.Estimating the selectivity of XML path expressions for Internet scale applications.VLDB 2001 Path TreenConstructionnStart from an original path treenCount of nodes in absolute pathsnAggregate the tree to fit in the limited spacenSibling-*nEstimationnMatch the query against th
8、e path tree,matching*as the last resort.nNeed to decide whether to use total count or average count.An XML document and its path tree A1B2C1D1D1E3Estimation:count(A/B/D)=1 count(A/C/D)=1 Aggregate Path TreeA1C9B13G 10F15H6K12E5D7K11J4I2Aggregate Path TreeA1C9B13G 10F15H6K12E5D7K11J4I2红色结点表示那些不频繁出现的结
9、点,需要删除,从而保证path树能够放入内存Sibling-*SummarizationA1C9B13*F15*K*f=23n=2683把查询和path 树匹配需要决定用总数还是平均数估计方法Estimation:count(/C/H/K)=11 count(/C/*/K)=23 Markov Tablen构造一个表,存储长度=2。n当查询路径长度=m时,可以直接从表中读出结果,否则,用公式计算。n例如,m3,查询为/A/B/C/D,则n当表不能装入内存的时候,进行剪枝操作。f(B/C/D)f(B/C)f(A/B/C/D)f(A/B/C)Markov Table:Exampleabbcddee
10、efda1a/b2b2a/c1c1b/d2d3c/d1e3c/e3f1c/f1abdde12213cf11m=2b2c/e3d3*3/3e3c/*2/2a/b2*/*1/1b/d2suffix-*Q1:a/c/e(/)(/)(/)(/)(*/*)3()(*)f c ef c ef a c ef a cff cfTwig 匹配个数统计n问题描述nTwig Query(basic building block of declarative query languages for XML)FOR$f IN document(“person.xml”)/department/faculty FOR$r
11、 in$f/RA,FOR$t in$f/TADepartmentFacultyRATA$f$r$tXSketch方法n N.Polyzotis,M.Garofalakis.Statistical Synopses for Graph-Structured XML Databases,In Proc.of ACM SIGMOD Conf.2002 nN.Polyzotis and M.Garofalakis.Structure and value synopses for xml data graphs.In Proceedings of 28th VLDB Conf.,2002.nN.Poly
12、zotis,M.Garofalakis,and Yannis Iosnnidis.Selectivity Estimation for XML Twigs.In Proc.of the 20th Intl.Conf.on Data Engineering,2004nN.Polyzotis and M.Garofalakis and Y.Ioannidis.Approximate XML Query Answers.In Proc.ACM SIGMOD 2004d0a1a2a3p4p5n6t14k15y131999 y16t17k18k19n7b9p8y20t21k22y24t25k26n10b
13、12p9t23t26200220012000D(1)A(3)N(3)P(4)B(2)Y(4)K(6)T(6)H(Y)B/FB/FB/FB/FBB/FFFXSketch方法B/F stabilityn Definition:n Let U,V be sets of elements in an XML data Tree T.We say that V is backward-stable(B-stale)with respect to U,iff for each vV there exists a u U such that edge(u,v)is in T.nSimilarly,U is
14、said to be forward-stable(F-stable)with respect to V,iff for each u U there exists a v V such that the edge(u,v)is in G.D(1)A(3)N(3)P(4)B(2)Y(4)K(6)T(6)H(Y)B/FB/FB/FB/FBB/FFFXSketch方法How to estimate the cardinality of xpath query such as A/P/K or A/p?Utilizing edge stability,We can give an accurate
15、match number:Count(A/P/K)=6Count(A/p)=3But what about A/P/T which doesnt conform to backward stability?2 assumptions Path independence Backward-edge uniformityCount(A/P/T)=f(A/P/T)*count(T)f(A/P/T)=f(A/P)*f(P/T|A/P)=f(P/T|A/P)=f(P/T)count(P)/(count(B)+count(P)XSketch方法(处理值谓词)v1v3v5v2v4BBFFSTN(v5)Dep
16、(v5)=v1,v4v1v4Histogram at v5H(1,4)=count(v11v2/v3v44/v5)v3 v5 count(v11v2/v33v44/v5)=H(1,4)*count(v33)/count(v3)A1Edge-Value IndependenceAcross Non-Satble Edgesf(u/v|v)f(u/v)A2Value-Independence Outside Correlation ScopeXSkecth方法(处理twig)nProblem with the former modelnLet us see a simple examplefor
17、t0 in A,t1 in t0/B,t2 in t0/Cra1a2b1c1b2c2ra1a2b1c1b2c21001010010100100 1010RA(2)B(110)C(110)B/FB/FB/F(a)(b)(c)Structural XSketch2000 binding tuples on the first document10100 tuples on the second document XSkecth方法(处理twig)nSinge-path XSketch model dose not store detailed enough information to captu
18、re the correlation patterns between multiple path expressions.nIf node A records a two-dimensional distribution fA(b,c)for the fraction of elements in A that have b children in B and c children in C.CBCCElementsfp10010a10.510100a20.52(0.5100100.510100)CBCCElementsfp100100a10.51010a20.52(0.51001000.5
19、1010)XSkecth方法(处理twig)CKCYElementsfp21p40.25P54(0.2521100.25110.51)5CPCN21210.2521110.511p8,p9Another ExampleEdge-distribution:fP(CY,CK,CP,CN)PYPKAPANStatix方法n Freire J,Jayant R,Ramanath M,Roy P and Simeon J.StatiX:Making XML Count.In:Proc.of ACM SIGMOD 2002,USA.Statix方法nConstructionnevery instance
20、node is assigned a unique id(type id+sequential node id)during parsing and validation,meanwhile gathering statisticsnConstructing the histogramsnHistogramnBuild histogram for each typenMight contain histogram describing the distribution of the parents of a typenValue histogram can be built for types
21、 that are defined in terms of base types(eg.integer)nEstimationnConvert the query into SQL(i.e.,relational join on IDs)nUse standard histogram multiplicationStatiX:ExampleFOR$s in document(“imdb.xml”)/show,$r in$s/reviewWHERE$s/year 1992RETURN$s/title,$rSELECT title,reviewFROM Show,ReviewWHERE Show.
22、year 1992 AND Show.ID=Review.ParIDSTAT Show CARD:5 ID:16 STAT Show_Year VALUE:19902001 BUCKET_NUM:2 BUCKETS:1990,1994):3;1994,2001):2STAT Review CARD:8 ID:3038 PARENT HISTOGRAM Show BUCKET_NUM:3 BUCKETS:1,2):4;2,3):1;3,5):3 STAT Aka CARD:6 ID:6-12 PARENT HISTOGRAM Show_Episode BUCKET_NUM:3 BUCKETS:1
23、,2):1;2,6):0;6,12):7 31/28/5=2.4SummaryQuery/*BranchValueCorrelationSchemaStatiXSimple Twig+ValueNYYYNYPath TreeSimple PathNYNNNNMarkov TableSimple PathNYNNNNXSketchSimple TwigNNYNNNXSketchesSimple Twig+ValueNNYYY(mD methods)N结构连接代价估计n问题描述n给定任意的祖先节点集合A和后代节点集合D,计算A/D结果集合的大小,估计匹配的祖先后代的结果个数 n当同一祖先有多个后代
24、,或者同一后代有多个祖先时,节点对个数可能远远大于祖先或后代的个数。n应用n连接顺序的选择结构连接代价估计nExisted workn2D model:pH histogram Wu Y,Patel J and Jagadish H.V.Estimating Answer Sizes for XML Queries.In:Proc.of the EDBT 2002.nInterval and Positional model W.Wang,H.Jiang,H.Lu,and J.F.Xu.Containment join Size Estimation:Models and Methods.In
25、:Proc.Of ACM SIGMOD 2003 Absolute Region Code(start,end)blah 1234 5678 0000 contactnamephoneblahofficehomemobile123456780000(1,16)(2,4)(3,3)(5,15)(6,8)(7,7)(9,11)(10,10)(12,14)(13,13)234a.start d.start and d.end a.endpH HistogramForbidden Regions For a nodeR1和R2是结点A的ForbiddenRegion 两个结点的(start,end)满
26、足:(1)no overlap (2)nested 不可能有部分重叠的(partiallyoverlap)情况 pH HistogramEstPA=HistP1A1/4HistP2A+HistP2B+HistP2C+HistP2E+1/2(HistP2D+HistP2F)Ancestor-based Join EstimationEstPA=HistP2A1/4HistP1A+HistP1F+HistP2G+HistP1HDescendant-based Join EstimationpH Histogram (1 4)ADDAAAAEAHAHGHHHIHA35 51 3 71021203 0
27、0 0 00000est=3*(12+1+2+0.25*5)=48.75AGIHstartendstartstartendendFor off-diagonal cells:pH for ApH for DInterval&Position ModelsnMap into new problems under 2 newly proposed models.nInterval model and Position modelnHistogram based method that is adaptive to the data:PL histogramnAssumption:1D unifor
28、m of the descendant setnEstimation is robust when correlation is lownSampling based method that captures the correlation:IM Sampling and PM SamplingnAssumption:Height of the XML data tree is a small constantnBoth have theoretical bounds on the accuracy of the estimationInterval ModelnInterval Modeln
29、Map elements to 1D line segments(points)nIMA(A):interval setnIMD(D):point setn|A SJ D|=|interval-point pairs|a1a2a3IMA(A)022d1d2d3d4IMD(D)022a(start,end)a1(2,7)a2(18,21)a3(1,22)I IMMA A(A A)d(start,end)d1(3,4)d2(9,10)d3(11,12)d3(19,20)I IMMD D(D D)interval setpoint set(2,7)a3a1a2d1d2d3d4(1,22)(3,4)(
30、9,10)(11,12)(19,20)(18,21)Position ModelnPosition ModelnUse frequency tables to record coverage and starting point informationnPMA(S):coverage tablenPMD(S):start tablen|A SJ D|=nContainment joinEquijoina(start,end)a1(2,7)a2(18,21)a3(1,22)d(start,end)d1(3,4)d2(9,10)d3(11,12)d4(19,20)VF112232425262728
31、191101111121131141151161171182192202212221P PMMA A(A A)VF102031405060708091100111120130140150160170180191200210220P PMMD D(D D)coverage tablestart table(.)(.)ia i F d i FPoint-Line(PL)-Histograma1a2a3IMA(A)d1d2d3d4IMD(D)nwsswsel1()jalldadwswsj nXnnn nPartition the pace into partitionsnEstimate the o
32、verlapping(interval,point)pairs within each bucket.nAssumptionsnData distribution of A and D are independentnD conforms to uniform distributionSampling in Interval ModelnAdapted from bifocal samplingnIM Algorithm:nChoose m samples from IMD(D)nSum up the m subjoin sizes.nX=Scale up the result by|IMD(D)|/mnStatistical GuaranteenE(X)=X(join size)nwith high probability(by Hoeffding bound)a1a2a3IMA(A)d1d2d3d4IMD(D)2136m=2ProblemnHow to handle parent-child relationship?nHow to count|A/B/C|nHow to use estimation for Query OptimizationQuestion?Then We done!Thank you!