1、Cisco Confidential 2010 Cisco and/or its affiliates.All rights reserved.1分布式计算平台分布式计算平台Hadoop环境下的组网方案环境下的组网方案 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential2uHadoop起源uMapReduce和HDFS介绍uHadoop的流量模型u组网设计Cisco Confidential 2010 Cisco and/or its affiliates.All rights reserved.3Ha
2、doop介绍介绍 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential4nDoug Cutting说:这个名字是我的孩子给一头吃饱了的棕黄色大象取的。我的命名标准是简短、容易发音和拼写,没有太多的含义,并且不会被用于别处。小孩是这方面的高手。Google就是小孩子起的名字。n2019年,Hadoop起源于Apache Nutch,一个开源的网络搜索引擎。后来,开发者认为该引擎的架构可扩展度不够,不能解决数十亿网页的搜索问题。怎么办呢?n2019-04年,Google发表了举世闻名的三大论文:n BigTa
3、ble一个分布式的结构化数据存储系统n GFSThe Google File Systemn MapReduce个处理和生成超大数据集的算法模型的相关实现Hadoop起源hadoop.apache.org/2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential55Hadoop 核心分布式文件系统HDFSMapReduce框架并行数据分析语言Pig 列存储NoSQL数据库 Hbase分布式协调器Zookeeper数据仓库Hive(使用SQL)Hadoop日志分析工具Chukwa以Google的论文为基础,Ha
4、doop也有了自己的生态系统 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential6MapReduce和HDFS的工作流nLaod data into the cluster(HDFS Write)。nAnalyze the data(Map Reduce)nStore the results in the cluster(HDFS Write)nRead the results from the cluster(HDFS Read)Cisco Confidential 2010 Cisco and/or
5、 its affiliates.All rights reserved.7MapReduce介绍介绍 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential8MapReduce的逻辑数据流n从一堆数据中找出每年的最高温度值 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential9MapReduce运行原理nMap阶段:n Input Splitn Map运算n 缓存(内存中)n Spill to Disk/Partiti
6、onn 排序 Sort/Merge on Disk nShuffle阶段(In many ways,the shuffle is the heart of MapReduce and is where the“magic”happens)nReduce阶段n 排序 Sort/Merge(内存到磁盘)n Reduce运算n Output(输出到HDFS)2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential10MapReduce图解nBuffer默认为100MBn超出Buffer的部分为,被Spill到磁盘。
7、可以设置Buffer阀值为80%n默认可将10个Spill文件并行写入Merge文件nSpill、Merge都可以压缩。用CPU换IOn默认情况,Reduce最多只能同时下载5个Map的数据,mapred.reduce.parallel.copies 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential11JobTracker和TaskTrackernJobTracker:协调作业(job)的运行。n客户端:提交MapReduce作业。nTaskTracker:运行作业划分后的任务(task)。n一个Jo
8、b可以被划分成多个Task,每个Maper负责运行一个Task。2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential12MapReduce运行流程Cisco Confidential 2010 Cisco and/or its affiliates.All rights reserved.13Hadoop Distributed File System介绍介绍 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential14
9、HDFSHadoop分布式文件系统n以集群的方式存储海量数据:PB级n对HDFS来说,一次写入,多次读取是最高效的访问模式。n商用硬件:使用普通的PC Server构建集群。HDFS被设计成,如果某些Server遇到故障,集群应不受到影响,继续运行且不让用户察觉到明显的中断。n低时间延迟的访问:要求时延低的的应用,例如几十毫秒,HDFS不适合。HDFS是为高数据吞吐量应用优化的,这可能会以高时延为代价。目前,对于低延迟的应用,Hbase是更好的选择。2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential15
10、Namenode和DatanodenNamenode:管理者,管理文件系统。记录着每个文件中各个块所在的Datanode信息。n客户端:代表用户与Namenode与Datanode交互来访问整个文件系统。nDatanode:工作节点,根据需要存储并检索数据块,并且定期向Namenode发送它们所存储的块的列表。2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential16HDFS数据写入剖析 2010 Cisco and/or its affiliates.All rights reserved.Cisco C
11、onfidential17HDFS副本的布局n相同节点中的进程。n同一Rack上的不同Node。n同一DC中的不同Rack上的Node。n不同DC中的Node。2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential18HDFS数据读取剖析Cisco Confidential 2010 Cisco and/or its affiliates.All rights reserved.19某电商某电商Hadoop集群案例集群案例Cisco Confidential 2010 Cisco and/or its af
12、filiates.All rights reserved.20某电商Hadoop集群规模总容量50PB数据每天增长超过100T总共2800多台机器约150000道作业/天每日扫描数据总量约5PB,产生数据总量约500TBSalve:6 Cores CPU*2、48G Mem、2T12 HDSlave:8 Map、8 Reduce从0:10-24:00都有任务在运行,但其中80%的任务在0:10-9:00之间完成,这段时间是最重要的生产时段Cisco Confidential 2010 Cisco and/or its affiliates.All rights reserved.21Hadoo
13、p流量特征流量特征 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential22MapReduce图解 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential23流量特征nMapReduce的Shuffle阶段,会造成流量多打一。产生MicroBurst、Incast等现象。n使用TCP作为通讯协议。n整网尽量做到低收敛比。2010 Cisco and/or its affiliates.All rights reserv
14、ed.Cisco Confidential24From The Viewpoint Of Network 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential25uCompanies like Google,Microsoft,Yahoo,and Amazon use datacenters for web search,storage,e-commerce,and large-scale general computations.In particular,the vast majority of da
15、tacenters use TCP for communication between nodes.TCP is a mature technology that has survived the test of time and meets the communication needs of most applicationsuOne communication pattern,termed“Incast”by other researchers,elicits a pathological response from popular implementations of TCP.In t
16、he Incast communication pattern,a receiver issues data requests to multiple senders.The senders,upon receiving the request,concurrently transmit a large amount of data to the receiver.The data from allsenders traverses a bottleneck link in a many-to-one fashion.As the number of concurrent senders in
17、creases,the perceived application-level throughput at the receiver collapses.The receiver application sees goodput that is orders of magnitude lower than the link capacityuThe incast pattern potentially arises in many typical datacenter applications.For example,in cluster storage,when storage nodes
18、respond to requests for data,in websearch,when many workers respond near simultaneously to search queries,and in batch processing jobs like MapReduce,in which intermediate key-value pairs from many Mappers are transferred to appropriate Reducers during the“shuffle”stage.Incast现象、现象、GoodputCisco Conf
19、idential 2010 Cisco and/or its affiliates.All rights reserved.26组网设计组网设计 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential27网络架构网络架构CSW-1CSW-2CSW-3CSW-4N3548-1N3548-2N3548-3N3548-NServersServersServersServersServersServersServersServersServersServersServersServersServersServers
20、ServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServers 2010 Cisco and/or its affiliates.All rights reserved.Cisco Co
21、nfidential28网络架构网络架构n四台交换机组成4个CSW平面,CSW平面之间不互联nN3548分别与4个平面的CSW交换机互联nN3548与CSW之间通过动态路由协议实现自路由收敛n通过BFD和IP FRR提升网络的可用性nN3548与CSW之间可以是10GE或40GE互联nN3548与Server之间可以是1GE或10GE互联nCSW交换机可以是N3548、N5K、N7Kn所支持服务器数量取决于CSW交换机的端口密度 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential29设计思路设计思路n灵
22、感来自于Multi-Chassis Router CRS:n Hadoop集群内部主要是巨大的东西向流量n 加速比/Speedup(为什么要有多个平面?),相关术语:HOLB、VoQn ECMP,相关术语:Round Robin、Per Flown Buffering Fabric、Backpressuren Self Routing,相关术语:CrossBar Fabricn使用最新的Nexus3548,并利用其最新的特性:n Buffer Allocation、Managementn DCTCPn理论基础出自于上世纪60-70年代的论文CLOS Fat Tree,但是网络结构绝对不是翻新,
23、至少在2019年以前,整个工业界大部分还是使用传统的3层汇聚架构 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential30CRS架构架构 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential31Good HoLB solutions Virtual Output Queues and Backpressure TXTXRXRX40G40G112G112GVOQ(Virtual Output Queues)Cisco
24、12000 or ASR9000 per-destination slot queues 4-16 destination slots hundreds VOQs per card!Fabric QoS+backpressure Cisco CRS-1(1296 slots!)2.8x egress overspeed 4 queues at each point vital bit packet packingIngress LinecardsTXTXRXRXEgress Linecards10G10G10G10Garbiter grantgrantVirtual Output Queues
25、-Voice:strict scheduling-Multicast:extra queues Destination Queues-Voice:strict scheduling-Multicast:extra queues Overspeed Queues-Voice:strict scheduling-Multicast:extra queues Fabric Queues-Voice:strict scheduling-Multicast:extra queues backpressure 2010 Cisco and/or its affiliates.All rights rese
26、rved.Cisco Confidential32Bene Self-Routing,Buffering Fabric no arbiterIngress LinecardsTXTXTXTXRXRXRXRXEgress LinecardsCRS-1 Switch Fabric dual-stage Bene Fabric QoS(4 queues)per port backpressure Replicates Multicast scales up to 1176 slots QoS QoS QoS QoS QoS112 Gbps45 GbpsBACKPRESSURES1S2S3CRS-1
27、Switch Fabric 为什么不使用Crossbar,而要使用Benes?Benes网络最大的优点是:相对一个没有中间交换过程的Crossbar结构,对于要实现一个nn的全交换,Benes网络所需要的连接节点 数目要小的多。所以这是一个成本问题。2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential33Self-Routing 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential34MicroBurst in Ma
28、pReduce Shuffle Stagen发生MircroBurst之后:在Node上可以发现大量的TCP Retransmission,Incast 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential353847681152153619202304268830723456384061449/15/2019 3:11:01 PM5051090140000009/15/2019 3:11:02 PM00000109010050009/15/2019 3:11:03 PM00000010801105009
29、/15/2019 3:11:04 PM00000100120300009/15/2019 3:11:05 PM0510851500000009/15/2019 3:11:06 PM20050000000000Nexus 3548 Buffer吸收溢出的流量Active Buffer Monitoring#Of SamplesAlgoBoost Buffer Histogram Software PollingHardware Polling 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential36仅靠仅靠
30、Buffer行吗?行吗?n一个有意思的现象,使用了大Buffer的交换机之后,JOB的时间会缩短,吞吐量会上去,但是仍然会看到有TCP Retransmissionn 这是因为心跳和TCP ACK等信令报文被积压在了Buffer中,没有及时到达,导致TCP重传TCP数据报文TCP ACK报文Job Tracker与Task Tracker之间的心跳报文NameNode与DataNode之间的心跳报文 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential37高吞吐高吞吐 与与 低延迟低延迟n为了减缓TCP
31、Incast,高吞吐量需要Switch具备一定的Buffer,来缓存溢出的流量。但是低延迟则相反,留在Buffer中的时间越短越好。n心跳报文/TCP ACK需要低延迟,需要被快速的送达目的地。如何让这类报文避过Buffer的延迟?n使用DCTCP,减少TCP Incast带来的流量溢出。在保持高吞吐量的同时,将Buffer队列维持在一个较小的占用比例,以此让心跳报文/TCP ACK在Buffer中停留的时间大大缩短。nN3548支持DCTCP,同时具备ULL,所以会让心跳报文/TCP ACK传递的更快。2010 Cisco and/or its affiliates.All rights r
32、eserved.Cisco Confidential38ECN首先由传输层进行能力协商协商完毕后控制IP头的ECT、CE标致位接收端接收到CE包,向发送端发送拥塞通知目前TCP通过使用两个预留标志位来实现能力协商和拥塞通知TCP新建标志位为CWR(Congestion Window Reduce)和ECE(ECN-Echo)UDP等其余传输层协议需要应用层通知ECN:Congestion Notification 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential39SYN=1,ECE=1,CWR=1支
33、持拥塞通告,也支持拥塞窗口调整SYN=1,ACK=1,ECE=1,CWR=0支持拥塞通告,不支持拥塞窗口调整ACK=1,ECE=0,CWR=0 能力协商结束TCP 握手阶段拥塞发生IP ECT=1,CE=0IP ECT=1,CE=0IP ECT=1,CE=1ACK=N,ECE=1,CWR=0通知发生拥塞Data,CWR=1接收到拥塞通知,发送窗口减半ACK=M,ECE=0,CWR=0接收到CWR=1,ECE清除,否则持续发送传统的ECN模式 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential40Data
34、 Center TCP AlgorithmSwitch side:Mark packets when Queue Length K.Queue is not fullSender side:Maintain running average of fraction of packets marked().In each RTT:Adaptive window decreases:Note:decrease factor between 1 and 2.BKMarkDont MarkSource:Data Center TCP(DCTCP),SIGCOMM 2019,New Dehli,India
35、,August 31,2019.2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential41Incast Results with DCTCP2MB4MB8MB 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential42每台服务器使用翻倍的以太网链路接入交换矩阵,消除每台服务器使用翻倍的以太网链路接入交换矩阵,消除TCP Incast。以空间换时间。以空间换时间。简单简单粗暴:增加Server与网络之间的带宽 2010 Cisco and/or its affiliates.All rights reserved.Cisco Confidential43总结总结nMapReduce的运行方式会造成TCP Incast,降低Hadoop的运算效率。n网络架构设计灵感来自于CRS,设计要点包括:n 加速比/Speedup、ECMP、Self Routing、Buffern解决TCP Incast的方式:n 大Buffern 适量的Buffer+DCTCP(N3548)n 使用翻倍的以太网链路接入交换矩阵,消除TCP Incast