Advanced Database Course Slides: -A-Dyanmo.pptx (163 Wenku)

Resource description

10/22/2012

Distributed key/value storage system
Scalable
Highly available
CAP

It sacrifices strong consistency for availability: always writable
Incremental scalability
Symmetry: every node should have the same set of responsibilities as its peers
Decentralization
Heterogeneity

Data partitioning
Replication
Data versioning
Dynamo API
Membership management

An application can deliver its functionality in a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds.
Example: a service guaranteeing that it will provide a response within 300 ms for 99.9% of its requests for a peak client load of 500 requests per second.
Service-oriented architecture of Amazon's platform

If the size of the data exceeds the capacity of a single machine: partitioning
Sharding data (horizontal partitioning).
Consistent hashing is one form of automatic sharding.

hash(Fatemeh) = 12
hash(Ahmad) = 2
hash(Seif) = 9
hash(Jim) = 14
hash(Sverker) = 4

Hash both data and nodes using the same hash function in the same id space.
partition = hash(d) mod n, d: data, n: number of nodes

NODE IDENTIFIERS MAY NOT BE BALANCED.
DATA IDENTIFIERS MAY NOT BE BALANCED.
HOT SPOTS.
HETEROGENEOUS NODES.
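The two placement rules named above (partition = hash(d) mod n, and hashing data and nodes into the same id space) can be compared with a minimal sketch. The node names A to E, the key names, the SHA-1 hash and the 2^32 id space are all assumptions made for illustration; the slides do not specify them.

```python
import bisect
import hashlib

ID_SPACE = 2 ** 32  # hypothetical size of the shared identifier space


def h(name):
    """Hash data and nodes with the same function into the same id space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ID_SPACE


def mod_owner(key, nodes):
    """Naive placement: partition = hash(d) mod n."""
    return nodes[h(key) % len(nodes)]


def ring_owner(key, nodes):
    """Consistent hashing: the first node clockwise from hash(key) owns the key."""
    points = sorted((h(n), n) for n in nodes)
    ids = [p for p, _ in points]
    # bisect_left finds the first node id >= hash(key); wrap around at the end.
    i = bisect.bisect_left(ids, h(key)) % len(points)
    return points[i][1]


keys = ["key-%d" % i for i in range(1000)]
before = ["A", "B", "C", "D"]
after = before + ["E"]  # one node joins

moved_mod = [k for k in keys if mod_owner(k, before) != mod_owner(k, after)]
moved_ring = [k for k in keys if ring_owner(k, before) != ring_owner(k, after)]

print("keys moved with mod n:  ", len(moved_mod))
print("keys moved on the ring: ", len(moved_ring))
```

Going from n = 4 to n = 5 under mod n, roughly four out of five keys are expected to change owner. On the ring, only the keys that fall into the arc taken over by E move, and they all move to E. With a single identifier per node the arcs can still be very uneven, which is the balance problem listed above.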

Each physical node picks multiple random identifiers.
Each identifier represents a virtual node.
Each node runs multiple virtual nodes.

If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes.
When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

To achieve high availability and durability, data should be replicated on multiple nodes.

Updates are propagated asynchronously.
Each update/modification of an item results in a new and immutable version of the data.
Multiple versions of an object may exist.
Replicas eventually become consistent.

Version branching can happen due to node/network failures.
Use vector clocks for capturing causality, in the form of (node, counter) pairs.
If causal: the older version can be forgotten
If concurrent: a conflict exists, requiring reconciliation

Client C1 writes a new object via Sx.
C1 updates the object via Sx.
C1 updates the object via Sy.
C2 reads D2 and updates the object via Sz.
C3 reads D3 and D4 via Sx.
The read context is a summary of the clocks of D3 and D4: (Sx, 2), (Sy, 1), (Sz, 1).
Reconciliation

get(key)
Returns a single object, or a list of objects with conflicting versions, and a context.
put(key, context, object)
Stores object and context under key.
Context encodes system metadata, e.g., version number.

A client can send the request:
To the node responsible for the data (coordinator): saves on latency, needs code on the client
To a generic load balancer: extra hop

Coordinator generates a new vector clock and writes the new version locally.
Send to N nodes.
Wait for response from W nodes.
Using W=1
High availability for writes
Low durability

Coordinator requests existing versions from N.
Wait for response from R nodes.
If multiple versions, return all versions that are causally unrelated.
Divergent versions are then reconciled.
Reconciled version written back.
Using R=1
High performance read engine.

R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively.
Setting R + W > N yields a quorum-like system.
In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.

Assume N = 3. When A is temporarily down or unreachable during a write, send the replica to D.
D is hinted that the replica belongs to A, and it will deliver it to A when A has recovered.
Again: "always writable"

Administrator explicitly adds and removes nodes.
Gossiping to propagate membership changes.
Eventually consistent view.
O(1) hop overlay.

A new node X is added to the system.
X is assigned key ranges w.r.t. its virtual servers.
For each key range, it transfers the data items.
When a node is removed, reallocation of keys is the reverse process of adding a node.

Passive failure detection.
Use pings only for detecting the transition from failed to alive.
In the absence of client requests, node A doesn't need to know if node B is alive.

Anti-entropy for replica synchronization.
Use Merkle trees for fast inconsistency detection and minimum transfer of data.
Nodes maintain a Merkle tree for each key range.
Exchange the root of the Merkle tree to check if the key ranges are up to date.

Due to partitions, quorums might not exist. Sloppy quorum.
Create transient replicas: N healthy nodes from the preference list.
Reconcile after the partition heals.

Say A is unreachable.
put will use D.
Later, D detects that A is alive.
Sends the replica to A.
Removes the replica.
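The vector clock example from the data versioning slides (C1, C2 and C3 writing via Sx, Sy and Sz) can be replayed with a minimal sketch. Clocks are modelled here as plain dicts from node name to counter; the helper names descends, concurrent, merge and increment are hypothetical, and the final write of D5 via Sx is an assumption about how the reconciled version is stored.

```python
def increment(clock, node):
    """Return a new clock in which `node` has handled one more write."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen every event in clock b (a is equal to or newer than b)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def concurrent(a, b):
    """Neither clock descends from the other: a conflict that needs reconciliation."""
    return not descends(a, b) and not descends(b, a)


def merge(a, b):
    """Element-wise maximum: the read context that summarises both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


d1 = increment({}, "Sx")       # C1 writes a new object via Sx
d2 = increment(d1, "Sx")       # C1 updates the object via Sx
d3 = increment(d2, "Sy")       # C1 updates the object via Sy
d4 = increment(d2, "Sz")       # C2 reads D2 and updates the object via Sz
context = merge(d3, d4)        # C3 reads D3 and D4 via Sx
d5 = increment(context, "Sx")  # assumed: reconciled version written via Sx

print(descends(d2, d1))        # causal: D1 can be forgotten
print(concurrent(d3, d4))      # concurrent: conflict
print(sorted(context.items()))
```

D2 descends from D1, so D1 can be dropped. D3 and D4 are concurrent, so both are returned to the reader. The merged context matches the one on the slide: (Sx, 2), (Sy, 1), (Sz, 1).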
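The Merkle tree exchange used for anti-entropy can be sketched as follows. This is an illustration under stated assumptions, not Dynamo's actual layout: each key range is split into a fixed number of hash buckets (LEAVES, a hypothetical parameter) so that both replicas build trees of the same shape, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib

LEAVES = 8  # hypothetical number of buckets per key range (a power of two)


def digest(data):
    return hashlib.sha256(data).digest()


def bucket_of(key):
    """Assign a key to one of the LEAVES buckets."""
    return digest(key.encode("utf-8"))[0] % LEAVES


def build_tree(items):
    """Heap-style list: index 1 is the root, leaves sit at LEAVES .. 2*LEAVES-1."""
    buckets = [[] for _ in range(LEAVES)]
    for key, value in items.items():
        buckets[bucket_of(key)].append("%s=%s" % (key, value))
    tree = [b""] * (2 * LEAVES)
    for i, bucket in enumerate(buckets):
        tree[LEAVES + i] = digest("\n".join(sorted(bucket)).encode("utf-8"))
    for i in range(LEAVES - 1, 0, -1):
        # A parent is the hash of its two children.
        tree[i] = digest(tree[2 * i] + tree[2 * i + 1])
    return tree


def differing_buckets(a, b, node=1):
    """Descend only into subtrees whose hashes differ."""
    if a[node] == b[node]:
        return []  # identical subtree: nothing to transfer
    if node >= LEAVES:
        return [node - LEAVES]
    return differing_buckets(a, b, 2 * node) + differing_buckets(a, b, 2 * node + 1)


replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = dict(replica_a, k2="v2-stale")  # one item is out of date

tree_a, tree_b = build_tree(replica_a), build_tree(replica_b)
print("roots equal:", tree_a[1] == tree_b[1])
print("buckets to sync:", differing_buckets(tree_a, tree_b))
```

If the roots match, the key range is in sync and no data is transferred. Otherwise only the buckets whose leaf hashes differ need to be exchanged, here the single bucket holding k2.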
