Rook-Deployed Scalable NFS Clusters Exporting CephFS

What is Ceph(FS)?
- CephFS is a POSIX distributed file system.
- Clients and MDS cooperatively maintain a distributed cache of metadata, including inodes and directories.
- The MDS hands out capabilities (aka caps) to clients, to allow them delegated access to parts of inode metadata.
- Clients directly perform file I/O on RADOS.
[Figure: a client sends metadata operations (open, mkdir, listdir) to the active MDS, which journals metadata mutations to the RADOS metadata pool, with a standby MDS ready to take over; the client's reads and writes go directly to the RADOS data pool.]
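For orientation, here is a minimal sketch of what native CephFS access looks like from a client that can speak Ceph; the monitor address, client name, and mount point are placeholders rather than values from any particular deployment.

# Kernel CephFS client: metadata operations go to the MDS,
# file data flows directly to the OSDs.
$ sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
    -o name=myclient,secretfile=/etc/ceph/myclient.secret
$ df -h /mnt/cephfs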
Ceph, an integral component of hybrid clouds
- Manila: a file share service for VMs. Uses CephFS to provision shared volumes.
- Cinder: a block device provisioner for VMs. Uses Ceph's RADOS Block Device (RBD).

CephFS usage in community OpenStack
- Most OpenStack users are also running a Ceph cluster already.
- Open-source storage solution.
- CephFS metadata scalability is ideally suited to cloud environments.
- https://www.openstack.org/user-survey/survey-2017

So why Kubernetes?
- Lightweight containers! Trivial to spin up services in response to changing application needs.
- Extensible service infrastructure!
- Parallelism: containers are lightweight enough for lazy and optimal parallelism!
- Fast/cheap failover: service failover only requires a new pod.
- Fast IP failover/management.
Why Would You Need a Ceph/NFS Gateway?
- Clients that can't speak Ceph properly: old, questionable, or unknown Ceph drivers (old kernels); 3rd-party OSes.
- Security: partition the Ceph cluster from less trusted clients; GSSAPI (Kerberos).
- OpenStack Manila: a filesystem shared between multiple nodes, but also tenant-aware, and self-managed by tenant admins.
[Figure: kernel and ceph-fuse clients talking to the Ceph cluster (MDS and OSDs on RADOS) via metadata RPCs and direct file I/O.]
Active/Passive Deployments
- One active server at a time that "floats" between physical hosts.
- Traditional "failover" NFS server, running under pacemaker/corosync.
- Scales poorly and requires idle resources.
- Available since Ceph Luminous (Aug 2017).

Goal: Active/Active Deployment
[Figure: many NFS clients spread across a cluster of NFS servers.]

Goals and Requirements
- Scale out: an active/active cluster of mostly independent servers. Keep coordination between them to a bare minimum.
- Containerizable: leverage container orchestration technologies to simplify deployment and handle networking. No failover of resources; just rebuild containers from scratch when they fail.
- NFSv4.1+: avoid legacy NFS protocol versions, allowing us to rely on new protocol features for better performance, and the possibility of pNFS later.
- Ceph/RADOS for communication: avoid the need for any 3rd-party clustering or communication between cluster nodes. Use Ceph and RADOS to coordinate (see the sketch below).
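As a rough illustration of the "use Ceph and RADOS to coordinate" requirement: all of the shared state described on the following slides fits in a small RADOS pool that every ganesha node can reach. The pool name and PG count below are placeholders.

$ ceph osd pool create nfs-ganesha 64
$ ceph osd pool application enable nfs-ganesha nfs   # tag the pool for NFS use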
Ganesha NFS Server
- Open-source NFS server that runs in userspace (LGPLv3).
- Plugin interfaces for exports and client recovery databases; well suited for exporting userland filesystems.
- FSAL_CEPH uses libcephfs to interact with the Ceph cluster.
- Can use librados to store client recovery records and configuration files in RADOS objects.
- Amenable to containerization:
  - Store configuration and recovery info in RADOS.
  - No need for writeable local fs storage.
  - Can run in an unprivileged container.
  - Rebuild the server from a r/o image if it fails.
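As a sketch of what "configuration files in RADOS objects" can look like in practice, the commands below store a minimal FSAL_CEPH export block as a RADOS object that a ganesha instance could pull in through a rados:// URL. The pool, namespace, object, and cephx user names are invented for the example, not taken from any real deployment.

$ cat > export-100.conf <<'EOF'
EXPORT {
    Export_ID = 100;
    Path = /volumes/group_a/subvol_1;
    Pseudo = /cephfs;
    Protocols = 4;
    Access_Type = RW;
    Squash = None;
    FSAL {
        Name = CEPH;
        User_Id = "nfs.mycluster.a";
    }
}
EOF
$ rados -p nfs-ganesha -N mycluster put export-100 export-100.conf
$ rados -p nfs-ganesha -N mycluster ls    # confirm the object landed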
NFS Protocol
- Based on ONC-RPC (aka SunRPC).
- Early versions (NFSv2 and v3) were stateless, with sidecar protocols to handle file locking (NLM and NSM).
- NFSv4 was designed to be stateful, and state is leased to the client: the client must contact the server at least once every lease period to renew (45-60s is typical).
- NFSv4.1 revamped the protocol:
  - uses a sessions layer to provide exactly-once semantics
  - added the RECLAIM_COMPLETE operation (allows lifting the grace period early)
  - more clustering and migration support
- NFSv4.2: mostly new features on top of v4.1.
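Since the design deliberately relies on v4.1 and later only, a client mount for such a cluster pins the protocol version explicitly; the server address and export path below are placeholders.

$ sudo mount -t nfs -o vers=4.1 10.0.0.20:/cephfs /mnt/share
$ nfsstat -m    # shows the negotiated NFS version and mount options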
NFS Client Recovery
- After a restart, NFS servers come up with no ephemeral state: opens, locks, delegations and layouts are lost.
- Allow clients to reclaim state they held earlier, for 2 lease periods. (Detailed state tracking on stable storage is quite expensive.)
- Ganesha had support for storing this in RADOS for single-server configurations.
- During the grace period: no new state can be established; clients may only reclaim old state.
- Allow reclaim only from clients present at the time of the crash. Necessary to handle certain cases involving network partitions.
- Must keep stable-storage records of which clients are allowed to reclaim after a reboot:
  - Prior to a client doing its first OPEN, set a stable-storage record for the client if there isn't one.
  - Remove it after the last file is closed, or when the client's state expires.
  - Atomically replace the old client db with the new one just prior to ending the grace period.
NFS Server Reboot Epochs
[Figure: a timeline alternating between Grace Period and Normal Ops, labelled Epoch 1, Epoch 2, Epoch 3.]
- Consider each reboot the start of a new epoch.
- As clients perform their first open (reclaim or regular), set a record for them in a database associated with the current epoch (see the sketch below).
- During the grace period, allow clients that are present in the previous epoch's database to reclaim state.
- After the transition from Grace to Normal operations, any previous epoch's database can be deleted.
- The same applies in a cluster of servers!
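To make these per-epoch client databases concrete: with ganesha's RADOS recovery backend, each epoch's database is a RADOS object whose omap keys identify the clients allowed to reclaim. The pool, namespace, and object names below are illustrative only; the actual naming scheme is an internal detail of the backend.

$ rados -p nfs-ganesha -N mycluster ls | grep '^rec-'
    # per-epoch recovery databases, one per ganesha node (illustrative prefix)
$ rados -p nfs-ganesha -N mycluster listomapkeys rec-0000000000000002:nfs.a
    # one omap key per client with reclaimable state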
New in Ganesha: Coordinating Grace Periods
- Done via a central database stored in a RADOS object.
- Two 64-bit integers:
  - Current Epoch (C): where should new records be stored?
  - Recovery Epoch (R): from what epoch should clients be allowed to reclaim? R=0 means the grace period is not in force (normal operation).
- Flags for each node indicating current need for a grace period and enforcement status:
  - NEED (N): does this server currently require a grace period? The first node to set its NEED flag declares a new grace period; the last node to clear its NEED flag can lift the grace period.
  - ENFORCING (E): is this node enforcing the grace period? If all servers are enforcing, then we know no conflicting state can be handed out (the grace period is fully in force).
- Simple logic allows individual hosts to make decisions about grace enforcement based on the state of all servers in the cluster. No running centralized entity.
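The shared grace database can be inspected (and prodded) with the ganesha-rados-grace utility that ships alongside ganesha's RADOS recovery support. The pool, namespace, and node names below are placeholders, and the exact options and subcommands may differ between ganesha versions.

$ ganesha-rados-grace --pool nfs-ganesha --ns mycluster dump
    # prints the current (C) and recovery (R) epochs plus each
    # node's NEED (N) and ENFORCING (E) flags
$ ganesha-rados-grace --pool nfs-ganesha --ns mycluster lift nfs.a
    # clear nfs.a's NEED flag; once the last node does this,
    # the grace period can be lifted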
Challenge: Layering NFS over a Cluster FS
- Both protocols are stateful and use lease-based mechanisms. Ganesha acts as a proxy.
- Example: a ganesha server may crash and a new one be started:
  - The client issues an NFS open reclaim request.
  - Ganesha performs an open request via libcephfs.
  - It blocks, waiting for caps held by the previous ganesha instance. The default MDS session timeout is 60s.
  - The session could also die before the other servers are enforcing the grace period.

New in Nautilus: Client Reclaim of Sessions
- When a CephFS client dies abruptly, the MDS keeps its session around for a little while (several minutes). Stateful info (locks, opens, caps, etc.) is tied to those sessions.
- Added new interfaces to libcephfs to allow ganesha to request that its old session be killed off immediately.
- May eventually add the ability for a session to take over old stateful objects, obviating the need for a grace period on surviving server heads.
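The problem being solved here is visible from the MDS side: a dead ganesha instance lingers as a client session that still holds caps until it times out, is reclaimed, or is evicted. A rough way to observe and, as a blunt manual workaround, clear such a session (the MDS rank and session id are placeholders):

$ ceph tell mds.0 session ls              # the stale ganesha instance is still listed, holding caps
$ ceph tell mds.0 session evict id=4305   # manual eviction; the new libcephfs reclaim
                                          # interface lets ganesha do the equivalent itself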
Deploy and Manage NFS Clusters
- Dynamic NFS cluster creation/deletion.
- Scale in two directions:
  - active-active configuration of NFS clusters for a hot subvolume
  - per-subvolume independent NFS clusters
- Handling failures + IP migration: spawn a replacement NFS-Ganesha daemon and migrate the IP address. Note: NFS clients cannot modify the NFS server's IP address post-mount.
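Because mounted clients cannot follow a changed server address, the practical pattern in Kubernetes is to give each ganesha server a stable Service IP and replace only the pods behind it. A hypothetical check of that arrangement (the namespace and label are assumptions about a Rook-style deployment):

$ kubectl -n rook-ceph get svc -l app=rook-ceph-nfs    # stable, client-facing addresses
$ kubectl -n rook-ceph get pods -l app=rook-ceph-nfs   # replaceable ganesha pods behind them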
Tying it all together with Rook and Kubernetes
- rook.io is a cloud-native storage orchestrator that can deploy Ceph clusters dynamically. Ceph is aware of the deployment technology!
- The Rook operator handles cookie-cutter configuration of daemons.
- CephNFS resource type in k8s.
- Integration with the orchestrator ceph-mgr module: deploy new clusters; scale node count up or down.
- Dashboard can add and remove exports.
- Can scale NFS cluster size up or down (MDS too in the future!).
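A sketch of the CephNFS resource mentioned above, roughly as it looked in Nautilus-era Rook; the names and field values are illustrative and the exact schema may differ between Rook releases.

$ kubectl apply -f - <<'EOF'
apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: mynfs
  namespace: rook-ceph
spec:
  rados:
    pool: nfs-ganesha     # RADOS pool holding ganesha config and recovery state
    namespace: mynfs
  server:
    active: 2             # number of active ganesha servers; edit to scale
EOF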
Ceph Nautilus: Mechanisms in Place
- Volume management integrated into Ceph to provide a single source of truth (SSOT) for CephFS exports:
  - Volumes provide an abstraction for CephFS file systems.
  - Subvolumes provide an abstraction for independent CephFS directory trees.
  - Manila, Rook, and everything else use the same interface (see the example below).
- Ceph/Rook integration in place to launch NFS-Ganesha clusters to export subvolumes.
- Ceph management of exports and configs for NFS-Ganesha clusters.
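A minimal sketch of driving that volume/subvolume abstraction from the ceph CLI (the volume and subvolume names are arbitrary):

$ ceph fs volume create vol_a                 # creates a CephFS file system (MDS via the orchestrator)
$ ceph fs subvolume create vol_a subvol_1
$ ceph fs subvolume getpath vol_a subvol_1    # path to hand to an NFS export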
Vision for the Future
- Trivial creation/config of a managed volume with a persistent NFS gateway.
- NFS-Ganesha: add support for NFSv4 migration to allow redistribution of clients among servers.
- Optimizing grace periods for subvolumes.
- SMB integration (Samba).

$ ceph fs subvolume create vol_a subvol_1               # 14.2.2
$ ceph fs subvolume set vol_a subvol_1 sharenfs true    # 14.2.3+

Appendix

CephFS and Client Access
[Figure: "Journal & Inodes" — kernel and ceph-fuse clients issue metadata RPCs to the MDS and perform file I/O directly against RADOS/OSDs.]
- CephFS is a POSIX distributed file system.
- Clients and MDS cooperatively maintain a distributed cache of metadata, including inodes and directories.
- The MDS hands out capabilities (aka caps) to clients, to allow them delegated access to parts of inode metadata.
- Clients directly perform file I/O on RADOS.

[Figure: end-to-end architecture. Manila publishes shares (CephFS name, export paths, network share such as a Neutron ID + CIDR, share server count) to the ceph-mgr over a REST API; the mgr pushes ganesha configuration, starts grace periods, and advertises the gateways in the ServiceMap; ganesha NFS gateway pods (HA managed by Kubernetes, spawned by Rook/Kubernetes with Kuryr as the network driver into the tenant network share) perform metadata I/O against the MDS, data I/O against the OSDs, and get/put client recovery state in RADOS. Scale-out and shares are managed by the mgr.]
CephFS
- Metadata is handled via a Metadata Server (MDS); clients establish a session with the MDS.
- The CephFS MDS hands out capabilities (aka caps) to clients: recallable, delegated, granular parts of inode metadata:
  - AUTH: uid, gid, mode
  - LINK: link count
  - FILE: read/write ability, layout, file size, mtime
  - XATTR: xattrs
- Most caps come in shared/exclusive flavors.
- Caps grant the clients the ability to do certain operations on an inode and are tied to the client's MDS session.
- libcephfs provides a (somewhat POSIX-ish) userland client interface.

EOS

Thank you