1、存储系统容错编码简介内容mRAID、容错编码mReed-Solomon编码m二进制线性码m阵列码m利用组合数学工具构造容错编码内容mRAID、容错编码mReed-Solomon编码m二进制线性码m阵列码m利用组合数学工具构造容错编码RAIDmRedundant Arrays of Inexpensive DisksRedundant Arrays of Independent Disksq容量q性能q可靠性mChen,P.M.,Lee,E.K.,Gibson,G.A.,Katz,R.H.,and Patterson,D.A.“RAID:high-performance,reliable sec
2、ondary storage.”ACM Computing Surveys 26(2),pp.143-185,June 1994.RAID结构mData Stripingstripe unitstripe04KB-14KB8KB-18KB12KB-112KB16KB-116KB20KB-1RAID结构mRedundancy04KB-1RAID结构m编码:d1 XOR d2 XOR XOR dn=pm解码:di=p XOR d1 XOR XOR di-1 XOR di-1 m解码:pnew=p XOR diold XOR dinewRAID结构mdata unit、parity unitmRAI
3、D5:更新负载均匀分布RAID结构RAID5的读写RAID5的读写RAID5布局mEdward K.Lee,Randy H.Katz,“The Performance of Parity Placements in Disk Arrays”,IEEE Transactions on Computers,vol.42 no.6,pp.651-664,1993.RAID5布局RAID5布局RAID0mHui-I Hsiao and David J.DeWitt,“Chained declustering:A new availability strategy for multiprocessor
4、database machines”,Technical Report CS TR 854,University of Wisconsin,Madison,June 1989.RAID0mGang Wang,Xiaoguang Liu,Sheng Lin,Guangjun Xie,Jing Liu,“Constructing Double-and Triple-erasure-correcting Codes with High Availability Using Mirroring and Parity Approaches”,ICPADS2007.What is an Erasure C
5、ode?mJ.S.Plank,“Erasure Codes for Storage Applications”,Tutorial of the 4th Usenix Conference on File and Storage Technologies,San Francisco,CA,Dec,2005.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnyti
6、me you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.When are they useful?mAnytime you need to tolerate failures.Terms&DefinitionsmNumber of
7、 data disks:nmNumber of coding disks:mmRate of a code:R=n/(n+m)mIdentifiable Failure:“Erasure”The problem,once againIssues with Erasure CodingmPerformanceqEncodingTypically O(mn),but not always.qUpdateTypically O(m),but not always.qDecodingTypically O(mn),but not always.Issues with Erasure CodingmSp
8、ace UsageqQuantified by two of four:Data Devices:nCoding Devices:mSum of Devices:(n+m)Rate:R=n/(n+m)qHigher rates are more space efficient,but less fault-tolerant.Issues with Erasure CodingmFlexibilityqCan you arbitrarily add data/coding nodes?q(Can you change the rate)?qHow does this impact failure
9、 coverage?Trivial Example:ReplicationmMDSmExtremely fast encoding/decoding/update.mRate:R=1/(m+1)-Very space inefficientmThere are many replication/based systemsm(P2P especially).Less Trivial Example:Simple ParitymPatterson D A,Gibson G A,Katz R H,“A case for redundant arrays of inexpensive disks(RA
10、ID)”,ACM International Conference on Management of Data,Chicago,ACM Press,1988,pp.109-116.mP.M.Chen,E.K.Lee,G.A.Gibson,R.H.Katz,and D.A.Patterson.RAID:High-performance,reliable secondary storage.ACM Computing Surveys,26(2):145185,June 1994.Evaluating ParitymMDSmRate:R=n/(n+1)-Very space efficientmOp
11、timal encoding/decoding/update:qn-1 XORs to encode&decodeq2 XORs to updatemExtremely popular(RAID Level 5).mDownside:m=1 is limited.UnfortunatelymThose are the last easy things youll see.mFor(n 1,m 1),there is no consensus on the best coding technique.mThey all have tradeoffs.Why is this such a pain
12、?mCoding theory historically has been the purview of coding theorists.mTheir goals have had their roots elsewhere(noisy communication lines,byzantine memory systems,etc).mThey are not systems programmers.m(They dont care)内容mRAID、容错编码mReed-Solomon编码m二进制线性码内容mRAID、容错编码mReed-Solomon编码m二进制线性码m阵列码m利用组合数学
13、工具构造容错编码Reed-Solomon CodesmThe only MDS coding technique for arbitrary n&m.mThis means that m erasures are always tolerated.mHave been around for decades.mExpensive.J.S.Plank.A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems.Software Practice&Experience,27(9):9951012,Septemb
14、er 1997.Reed-Solomon CodesmOperate on binary words of data,composed of w bits,where 2w n+m.Reed-Solomon CodesmOperate on binary words of data,composed of w bits,where 2w n+m.Reed-Solomon CodesmThis means we only have to focus on words,rather than whole devices.mWord size is an issue:qIf n+m 256,we c
15、an use bytes as words.qIf n+m 65,536,we can use shorts as words.Reed-Solomon CodesmCodes are based on linear algebra.qFirst,consider the data words as a column vector D:Reed-Solomon CodesmCodes are based on linear algebra.qNext,define an(n+m)*n“Distribution Matrix”B,whose first n rows are the identi
16、ty matrix:Reed-Solomon CodesmCodes are based on linear algebra.qB*D equals an(n+m)*1 column vector composed of D and C(the coding words):Reed-Solomon CodesmThis means that each data and coding word has a corresponding row in the distribution matrix.Reed-Solomon CodesmSuppose m nodes fail.mTo decode,
17、we create B by deleting the rows of B that correspond to the failed nodes.Reed-Solomon CodesmSuppose m nodes fail.mTo decode,we create B by deleting the rows of B that correspond to the failed nodes.mYoull note that B*D equals the survivors.Reed-Solomon CodesmNow,invert B:Reed-Solomon CodesmNow,inve
18、rt B:mAnd multiply both sides of the equation by B-1Reed-Solomon CodesmNow,invert B:mAnd multiply both sides of the equation by B-1mSince B-1*B=I,You have just decoded D!Reed-Solomon CodesmNow,invert B:mAnd multiply both sides of the equation by B-1mSince B-1*B=I,You have just decoded D!Reed-Solomon
19、 CodesmTo Summarize:EncodingqCreate distribution matrix B.qMultiply B by the data to create coding words.mTo Summarize:DecodingqCreate B by deleting rows of B.qInvert B.qMultiply B-1 by the surviving words to reconstruct the data.Reed-Solomon CodesmTwo Final Issues:q1:How to create B?All square subm
20、atrices must be invertible.Derive from a Vandermonde MatrixJ.S.Plank and Y.Ding.Note:Correction to the 1997 tutorial on Reed-Solomon coding.Software Practice&Experience,35(2):189194,2005.q#2:Will modular arithmetic work?NO!(no multiplicative inverses)Instead,you must use Galois Field arithmetic.Reed
21、-Solomon Codesm(n+m)n的范德蒙矩阵m基本变换q任意两列可交换 q任何一列可以乘以一个非0数 q任意两列可做如下变换:Ci=Ci+c*Cj,c非0 Reed-Solomon PerformancemEncoding:O(mn)qMore specifically:mS (n-1)/BXOR+n/BGFMult qS=Size of a deviceqBXOR=Bandwith of XOR(3 GB/s)qBGFMult=Bandwidth of Multiplication over GF(2w)GF(28):800 MB/sGF(216):150 MB/sReed-Sol
22、omon PerformancemUpdate:O(m)qMore specifically:m+1 XORs and m multiplications.Reed-Solomon PerformancemDecoding:O(mn)or O(n3)qLarge devices:dS (n-1)/BXOR+n/BGFMult qWhere d=number of data devices to reconstruct.qYes,theres a matrix to invert,but usually thats in the noise because dSn n3.Reed-Solomon
23、 Bottom LinemSpace Efficient:MDSmFlexible:qWorks for any value of n and m.qEasy to add/subtract coding devices.qPublic-domain implementations.mSlow:qn-way dot product for each coding device.qGF multiplication slows things down.Cauchy Reed-Solomon CodesJ.Blomer,M.Kalfane,M.Karpinski,R.Karp,M.Luby,and
24、 D.Zuckerman.An XOR-based erasure-resilient coding scheme.Technical Report TR-95-048,International Computer Science Institute,August 1995.m#1:Use a Cauchy matrix instead of a Vandermonde matrix:Invert in O(n2).m#2:Use neat projection to convert Galois Field multiplications into XORs.mKind of subtle,
25、so well go over it.Cauchy Reed-Solomon Codesm取GF(2w)中m+n个不同元素,构成X=x1,xm,Y=y1,ym mCauchy矩阵:元素(i,j)为1/(xi+yj)GF(2w)上运算Cauchy Reed-Solomon Codesm例:X=1,2,Y=0,3,4,5,6 Cauchy Reed-Solomon CodesCauchy Reed-Solomon Codesmn、m固定,w增长,GC的优势明显参考文献m参考资料qJ.S.Plank.Enumeration of optimal and good Cauchy matrices fo
26、r Reed-Solomon coding.Technical Report CS-05-570,University of Tennessee,December 2005.m通信协议避免重新发送qL.Rizzo.Effective erasure codes for reliable computer communication protocols.ACM SIGCOMM Computer Communication Review,27(2):2436,1997.qL.Rizzo and L.Vicisano.RMDP:an FEC-based reliable multicast prot
27、ocol for wireless environments.Mobile Computer and Communication Review,2(2),April 1998.参考文献m已经正在成为IETF标准qM.Luby,L.Vicisano,J.Gemmell,L.Rizo,M.Handley,and J.Crowcroft.Forward error correction(FEC)building block.IETF RFC 3452(http:/www.ietf.org/rfc/rfc3452.txt),December 2002.qM.Luby,L.Vicisano,J.Gemm
28、ell,L.Rizo,M.Handley,and J.Crowcroft.The use of forward error correction(FEC)in reliable multicast.IETF RFC 3453(http:/www.ietf.org/rfc/rfc3453.txt),December 2002.m加密qC.S.Jutla.Encryption modes with almost free message integrity.Lecture Notes in Computer Science,2045,2001.参考文献m分布式数据结构qW.Litwin and T
29、.Schwarz.Lh*rs:a high-availability scalable distributed data structure using Reed Solomon codes.In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data,pages 237248.ACM Press,2000.m降低无线通信能耗qP.J.M.Havinga.Energy efficiency of error correction on wireless systems,1999.m广域网
30、、对等网存储系统qJ.Kubiatowicz,D.Bindel,Y.Chen,P.Eaton,D.Geels,R.Gummadi,S.Rhea,H.Weatherspoon,W.Weimer,C.Wells,and B.Zhao.Oceanstore:An architecture for global-scale persistent storage.In Proceedings of ACM ASPLOS.ACM,November 2000.参考文献m用于cache而不是冗余qJ.Byers,M.Luby,M.Mitzenmacher,and A.Rege.A digital founta
31、in approach to reliable distribution of bulk data.In ACM SIGCOMM 98,pages 5667,Vancouver,August 1998.qJ.W.Byers,M.Luby,and M.Mitzenmacher.Accessing multiple mirror sites in parallel:Using tornado codes to speed up downloads.In IEEE INFOCOM,pages 275283,New York,NY,March 1999.qI.T.Rowstron and P.Drus
32、chel.Storage management and caching in PAST,a large-scale,persistent peer-topeer storage utility.In Symposium on Operating Systems Principles,pages 188201,2001.参考文献mTornado码qM.Luby,M.Mitzenmacher,and A.Shokrollahi.Analysis of random processes via and-or tree evaluation.In 9th Annual ACM-SIAM Symposi
33、um on Discrete Algorithms,January 1998.qM.Luby,M.Mitzenmacher,A.Shokrollahi,D.Spielman,and V.Stemann.Practical loss-resilient codes.In 29th Annual ACM Symposium on Theory of Computing,pages 150159,1997.qM.Luby.Benchmark comparisons of erasure codes.http:/www.icsi.berkeley.edu/luby/erasure.html,2002.
34、(性能比RS码好)参考文献m传统单节点用于容错和提高性能qS.Frolund,A.Merchant,Y.Saito,S.Spence,and A.Veitch.A decentralized algorithm for erasure-coded virtual disks.In DSN-04:International Conference on Dependable Systems and Networks,Florence,Italy,2004.IEEE.qG.R.Goodson,J.J.Wylie,G.R.Ganger,and M.K.Reiter.Efficient byzantin
35、e-tolerant erasure-coded storage.In DSN-04:International Conference on Dependable Systems and Networks,Florence,Italy,2004.IEEE.qW.Wilcke et al.The IBM intelligent brick project petabytes and beyond.IBM Journal of Research and Development,to appear,April 2006.m归档系统qS.Rhea,C.Wells,P.Eaton,D.Geels,B.Z
36、hao,H.Weatherspoon,and J.Kubiatowicz.Maintenance-free global data storage.IEEE Internet Computing,5(5):4049,2001.参考文献m广域网存储:qS.Atchley,S.Soltesz,J.S.Plank,M.Beck,and T.Moore.Fault tolerance in the network storage stack.In IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems,Ft.Lauderdale
37、,FL,April 2002.qR.L.Collins and J.S.Plank.Assessing the performance of erasure codes in the wide-area.In DSN-05:International Conference on Dependable Systems and Networks,Yokohama,Japan,2005.IEEE.qH.Xia and A.A.Chien.RobuSTore:Robust performance for distributed storage systems.Technical Report CS20
38、05-0838,University of California at San Diego,October 2005.参考文献m对等网:qZ.Zhang and Q.Lian.Reperasure:Replication protocol using erasure-code in peer-to-peer storage network.In 21st IEEE Symposium on Reliable Distributed Systems(SRDS02),pages 330339,October 2002.qJ.Li.PeerStreaming:A practical receiver
39、-driven peer-to-peer media streaming system.Technical Report MSR-TR-2004-101,Microsoft Research,September 2004.qW.K.Lin,D.M.Chiu,and Y.B.Lee.Erasure code replication revisited.In PTP04:4th International Conference on Peer-to-Peer Computing.IEEE,2004.qL.Dairaine,J.Lacan,L.Lancerica,and J.Fimes.Conten
40、t-access QoS in peer-to-peer networks using a fast MDS erasure code.Computer Communications,28(15):17781790,September 2005.参考文献m内容分发系统:qJ.Byers,M.Luby,M.Mitzenmacher,and A.Rege.A digital fountain approach to reliable distribution of bulk data.In ACM SIGCOMM 98,pages 5667,Vancouver,August 1998.qM.Mit
41、zenmacher.Digital fountains:A survey and look forward,.In 2004 IEEE Information Theory Workshop,San Antonio,October 2004.内容mRAID、容错编码mReed-Solomon编码m二进制线性码m阵列码m利用组合数学工具构造容错编码内容mRAID、容错编码mReed-Solomon编码m二进制线性码m阵列码m利用组合数学工具构造容错编码Binary Linear CodesmLisa Hellerstein,Garth A Gibson,Richard M Karp,Randy
42、H Katz,David A Patterson,“Coding techniques for handling failures in large disk arrays”,Algorithmica,vol.12,no.2/3,pp.182-208,1994.m数据单元划分为重叠的校验组,每个校验组相当于一个RAID4/5Binary Linear Codesm容错编码评价标准qMTTDLqcheck disk overheadqupdate penaltyqgroup sizeq扩展性t维编码m磁盘排列为t维方阵,每维最后一行为校验盘m1维:RAID4、RAID5,校验磁盘开销,1/Gm2
43、维:2/G,但注意,G是校验组大小,不是磁盘数,G2=N,因此是2/sqrt(N)mt维:t/G,t/N1/t,t大于3时,冗余率太差 t维编码校验矩阵表示法m数据位和校验位组成一个向量码字mparity check matrixH=P|Iqc(k+c),k数据位数,c校验位数qI为cc的单位矩阵,P为ck的0/1矩阵q表示校验计算公式,对码字X,应有HX=0qP的列数据盘,I的列校验盘q行校验组q元素:1参与校验组,0未参与校验矩阵表示法容错能力的判定m以下命题是等价的qH能恢复任何的t擦除故障qH能检测任意的t错误q任意两个码字的距离distance不相同的位的数目,至少为t+1qH的任意
44、t列在GF2上线性无关因此,故障盘是否可恢复对应列是否线性无关!线性无关所有列向量的和或任意非空子集的和不为0 校验矩阵还可以表示其他指标m校验开销:c/km校验组大小:行的权重1的个数m更新代价:列的权重m一种常见的扩展方式q新增加的磁盘全部清0无需重新计算校验,而H相应的增加一列mMTTDL用校验矩阵表示比较困难重构m假定m个盘故障,重排校验矩阵H=A|B,X=d|y,其中B和y对应故障磁盘m重构求解ymHX=0Ad+By=0求解线性方程组Ad=Bym此方程组有唯一解(可正确重构)的充要条件是B的各列线性无关m无需解整个方程组,抽取出故障校验组对应的行Ad=By,计算y=(B)-1Ad即可
45、 双容错和三容错编码 m最优冗余full-2和full-3:所有可能的权重为2(3)的列组成的校验矩阵mbad t+1故障:一个数据单元及其t个校验单元m高可靠性编码:一个t-erasure-correcting code可恢复除bad t+1故障之外的所有t+1故障m图表示法qfull-2、full-3:完全图q二维:完全二部图2擦除码对比线性码比较线性码比较t3mfull-t不具备t容错能力,t维码冗余率太差m定理:H=P|I 为校验矩阵,若P的所有列重量为t,且对于P的任意两行,P至多有1列在两行上均为1,则编码能容t错qS为P的j列的集合(j2,数据方阵中选取斜率为0、1、t-1这t个
46、方向组织校验组M.Blaum,J.Bruck,and A.Vardy,“MDS array codes with independent parity symbols”,IEEE Trans.on Information Theory,Vol.42,No.2,Mar,1996,pp.529-542.q并非对所有的n、m=t可保证t容错能力,上文中给出了适用的参数RDPm双容错水平码,(p-1)*(p-1)的数据阵列,p为素数m校验相关水平校验单元作为“数据”参与对角线校验阵列码和线性码的关系00010203041011121314202122232430313233344041424344Q0
47、Q1Q2Q3Q4P0P1P2P3P4Liberation码J.S.Plank,“The RAID-6 Liberation Codes”,6th USENIX Conference on File and Storage Technologies,San Francisco,2008,pp.97110.m编码矩阵1的个数最少,但不意味着校验计算性能最优其他水平码Cheng Huang,Lihao Xu,“STAR:An Efficient Coding Scheme for Correcting Triple Storage Node Failures”,4th USENIX Conferen
48、ce on File and Storage Technologies San Francisco,2005,pp.197-210.Chong-Won Park and Jin-Won Park,“A multiple disk failure recovery scheme in RAID systems,”Journal of Systems Architecture,vol.50,pp.169175,2004.B-CODEL.Xu,V.Bohossian,J.Bruck,and D.G.Wagner,Low-Density MDS Codes and Factors of Complet
49、e Graphs,IEEE Trans.Information Theory,pages 1817-1826,IEEE,1999.m双容错MDS垂直码m基于完全图的完全1-因子分解(P1F)m无素数限制图论领域的一个猜测:对所有偶数n,Kn存在P1F,每个P1F又能构造两个规模的B-CODE,所有规模的阵列都能构造B-CODEm基于full-2码B-CODEX-CodeL.Xu and J.Bruck,“X-Code:MDS Array Codes with Optimal Encoding”,IEEE Trans.on Information Theory,Vol.45,No.1,Jan,1
50、999,pp.272-276.m双容错“垂直码”每个磁盘都是既放置数据单元,又放置校验单元EVENODD是“水平码”数据单元和校验单元放置在不同磁盘上m校验方向:1和-1X-Codem结构示意其他垂直码RM2:C.Park,“Efficient placement of parity and data to tolerate two disk failures in disk array systems”,IEEE Trans.Parallel Distribut.Syst.,vol.6,no.11,pp.1177-1184,1995.q不保证MDSq构造方法不是确定的,要进行搜索WEAVER