1. Hard Disk Info   Debian hard disk speed   ssh speed tests  
  2. Storage With GlusterFS 3   Debian Kvm virtual storage server
  3. PCIE USB 3.0 Card   GA-USB3.0   Review USB3.0
  4. Storage Terminologies   Ceph Storage  

Storage: objects, blocks, and files (Source Origin)

Many cloud computing use cases require persistent remote storage. Storage solutions are often divided into three categories: object storage, block storage, and file storage.

Note that some storage solutions support multiple categories. For example, NexentaStor supports both block storage and file storage (with announcements for future support for object storage), GlusterFS supports file storage and object storage, and Ceph Storage supports object storage, block storage, and file storage.

Object storage

In OpenStack: Object Storage service (Swift)
     Related concepts: Amazon S3, Rackspace Cloud Files, Ceph Storage

With object storage, files are exposed through an HTTP interface, typically with a REST API. All client data access is done at the user level: the operating system is unaware of the presence of the remote storage system. In OpenStack, the Object Storage service provides this type of functionality. Users access and modify files by making HTTP requests. Because the data access interface provided by an object storage system is at a low level of abstraction, developers often build file-based applications on top of object storage to provide a higher level of abstraction. For example, the OpenStack Image service can be configured to use the Object Storage service as a backend. Another use for object storage solutions is as a content delivery network (CDN) for hosting static web content (e.g., images and media files), since object storage already provides an HTTP interface.
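As a minimal sketch of what "access through an HTTP interface" means in practice, the snippet below builds object URLs in the Swift-style /v1/{account}/{container}/{object} layout; the host, account, and container names are invented for illustration:

```python
# Sketch: how a client addresses objects in a Swift-style object store.
# The /v1/{account}/{container}/{object} path layout follows OpenStack
# Swift's REST API; the host and names below are placeholders.

def object_url(host, account, container, obj):
    """Build the URL for a single object in a Swift-style store."""
    return f"https://{host}/v1/{account}/{container}/{obj}"

url = object_url("storage.example.com", "AUTH_demo", "photos", "cat.jpg")
# An upload is then just an HTTP PUT to that URL (with an auth token in
# the headers); a download is an HTTP GET. No mount, no device, no
# filesystem -- the OS never sees the remote storage.
```
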

Block storage (SAN)

In OpenStack: Volumes (nova-volume service)
     Related concepts: Amazon Elastic Block Store (EBS), Ceph RADOS Block Device (RBD), iSCSI

With block storage, files are exposed through a low-level computer bus interface, such as SCSI or ATA, that is accessible over the network. Block storage is synonymous with SAN (storage area network). Clients access data through the operating system at the device level: users access the data by mounting the remote device in a similar manner to how they would mount a local, physical disk (e.g., using the "mount" command in Linux). In OpenStack, the nova-volume service that forms part of the Compute service provides this type of functionality, and it uses iSCSI to expose remote data as a SCSI disk that is attached to the network.

Because the data is exposed as a physical device, the end user is responsible for creating partitions and formatting the exposed disk device. In addition, in OpenStack Compute a device can only be attached to one server at a time, so block storage cannot be used to share data across virtual machine instances concurrently.
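The contrast with object storage can be sketched in a few lines: a block device is just a flat array of fixed-size blocks, and any higher-level structure (partitions, filesystems) is the client's responsibility. A plain temporary file stands in for the attached volume here; the block size is illustrative.

```python
# Sketch: block-level access. The "device" exposes nothing but numbered
# fixed-size blocks; structure on top of them is up to the client OS.
import tempfile

BLOCK_SIZE = 512  # illustrative; real devices commonly use 512 or 4096

def write_block(dev, n, data):
    assert len(data) == BLOCK_SIZE
    dev.seek(n * BLOCK_SIZE)   # raw writes at block offsets
    dev.write(data)

def read_block(dev, n):
    dev.seek(n * BLOCK_SIZE)
    return dev.read(BLOCK_SIZE)

with tempfile.TemporaryFile() as dev:   # a temp file plays the volume
    write_block(dev, 3, b"x" * BLOCK_SIZE)
    assert read_block(dev, 3) == b"x" * BLOCK_SIZE
```
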

File storage (NAS)

In OpenStack: none
     Related concepts: NFS, Samba/CIFS, GlusterFS, Dropbox, Google Drive

With file storage, files are exposed through a distributed file system protocol. Filesystem storage is synonymous with NAS (network attached storage). Clients access data through the operating system at the file system level: users access the data by mounting a remote file system. Examples of file storage include NFS and GlusterFS. The operating system needs to have the appropriate client software installed to be able to access the remote file system.

Currently, OpenStack Compute does not have any native support for this type of file storage inside of an instance. However, there is a Gluster storage connector for OpenStack that enables the use of the GlusterFS file system as a back-end for the Image service.

CDN-as-a-Platform for Cloud Services (Source Origin)

Posted on July 24, 2012 by Larry Peterson

I've recently posted several articles about the advantages of running network functions on virtualized commodity servers, and while it's important to make the general case for replacing purpose-built hardware appliances with commodity servers, there's a more focused story centered around CDNs that's worth telling.

It involves a slight change in perspective. Rather than view a CDN as one of many services that can be deployed on a virtualized platform (in an earlier article I talked about a spectrum of services ranging from BRAS to CDN), think of the CDN as the platform on which an important subset of network functions are deployed: those related to content delivery. Virtual machines are an enabling technology, but the important point is that a CDN serves as the cornerstone for a rich collection of functions that enable network operators to sell web acceleration solutions to their business and enterprise customers.

In other words, it's fair to view a CDN as a platform that hosts functions to accelerate B2C and B2B transactions over the Internet, especially transactions that involve cloud-based services. In this scenario, the CDN runs at the edge of the operator's network, where in addition to caching static objects, it also hosts client-facing and cloud-facing optimizations. The client-facing optimizations, often collectively called front-end optimization, include an assortment of techniques aimed at reducing transaction response time, as well as SSL termination, TCP enhancements for mobile applications, and business logic offload. The cloud-facing optimizations, sometimes called WAN acceleration, include symmetric strategies for compression and de-duplication. (It is notable that these latter techniques are typically symmetric because the CDN provides a point-of-presence both in the data center and at the edge of the network.)

An architecture for CDN-as-a-Platform has two major elements. The first is edge service nodes that not only cache static objects, but also run a policy engine that governs how the CDN interacts with the clients that make requests and the public/private clouds that host business services. This policy engine receives and processes client HTTP requests, dispatches each request to the appropriate module for processing, and, when communication with the data center is required (e.g., to satisfy cache misses or to access dynamic state), selects the appropriate module to optimize communication with those servers. In Verivue's case, some of these modules are part of the OneVantage product portfolio (some run their own VMs and others are cache plug-ins), while others are provided by third-party service vendors (these are typically isolated in their own VMs).

The second element of the CDN-as-a-Platform architecture is the lynchpin: a unified approach to service management. The management suite is responsible for provisioning virtual machines, configuring the services deployed in those virtual machines, proactively monitoring the deployment, and collecting traffic logs for billing and analytics. The core of the management suite is a data model that presents operators with a comprehensive and coherent picture of the available network functions that they must manage.

A clear understanding of this data model is starting to emerge. It includes objects that model the central stakeholders, including CDN Operators, Service Providers, and Content Providers (i.e., business customers); objects that model the deployed infrastructure, including virtual/physical nodes and network interfaces; objects that model the set of services and modules instantiated on that infrastructure; and objects that model the set of policy directives that govern how those services and modules behave.

These rules and policies, in turn, include: (1) routing directives that govern how end-users are routed to the best service node to process a given request, (2) delivery directives that govern how the selected service node delivers the resources named in the request, (3) anchor directives that govern how the service node interacts with cloud-hosted business services that anchor the request, and (4) analytic directives that govern how the service node gathers and processes traffic statistics. In other words, these rules collectively control what service node is selected to serve a given end-user, what module(s) are invoked at that node to customize delivery for that particular user, how those modules are parameterized to serve a particular end-user, and how data is collected.

Coordinating these policy directives across a set of services and modules requires a unifying abstraction that defines the scope for each directive. We call the scope a delivery domain, and it corresponds to the subset of URIs to which a given set of rules and policies is to be applied. A delivery domain is represented (identified) in one of two ways. The first is a CDN-Prefix, which corresponds to the FQDN at the beginning of a URI; it effectively carves out a region of the URI name space to which a set of rules and policies are to be applied. The second is a URI-Filter, which is given by a regular expression; it effectively identifies a subset of URIs belonging to a CDN-Prefix that is to be treated in a uniform way.

In summary, a CDN-powered platform allows business and enterprise customers to accelerate their cloud-hosted web services by effectively extending the cloud from the data center out to the edge of the operator's network. Service nodes deployed at the network edge provide the optimal vantage point to co-locate caching, front-end optimization, and WAN acceleration technologies, where the management suite plays a central role in coordinating the resulting data center-to-edge cloud on behalf of B2C and B2B customers.

Trends in Cloud Storage: CDN (Source Origin)

Posted on August 20, 2012 by Larry Peterson

Here's a modest insight. When designing a cloud storage system, there is value in decoupling the system's archival capacity (its ability to persistently store large volumes of data) from the system's delivery capacity (its ability to deliver popular objects to a scalable number of users). The archival half need not support scalable performance, and likewise, the delivery half need not guarantee persistence.

In practical terms, this translates into an end-to-end storage solution that includes a high-capacity and highly resilient object store in the data center, augmented with caches throughout the network to take advantage of aggregated delivery bandwidth from edge sites. This is similar to what Amazon offers today: S3 implements a resilient object store in the data center, augmented with CloudFront to scale delivery through a distributed set of edge caches.
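The decoupling above can be sketched with toy structures: a dict stands in for the resilient data-center object store, and a small LRU cache stands in for an edge site. Names and sizes are illustrative only.

```python
# Sketch: archival capacity (the archive dict) decoupled from delivery
# capacity (the edge cache). The edge need not be persistent; the archive
# need not be fast.
from collections import OrderedDict

archive = {"video/clip1": b"...bytes...", "video/clip2": b"...more..."}

class EdgeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)       # popular objects stay cached
            return self.cache[key]
        self.misses += 1                      # miss: pull from the archive
        data = archive[key]
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least-recent object
        return data

edge = EdgeCache(capacity=1)
edge.get("video/clip1")
edge.get("video/clip1")        # second request is served from the edge
assert edge.misses == 1
```
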

The object store runs in the data center, ingests data from some upstream source (e.g., video prepared using a Content Management System), and delivers it to users via edge caches. The ingest interface is push-based and likely includes one or more popular APIs (e.g., FTP, WebDAV, S3), while the delivery interface is pull-based and corresponds to HTTP GET requests from the CDN.

In past articles I have written extensively about how to architect a CDN that can be deployed throughout an operator network, claiming that a well-designed CDN should be agnostic as to the source of content. But it is increasingly the case that content delivered over a CDN is sourced from a data center as part of a cloud-based storage solution. This begs the question: is there anything we can learn by looking at storage from such an end-to-end perspective?

I see three points worth making, although by way of a disclaimer, I'm starting from the perspective of the CDN and looking back at what I'd like to see from a data-center-based object store. The way I see it, though, there's more value in storing data if you have a good approach to distributing it to the users that want to access it.

First, it makes little sense to build an object store using traditional SAN or NAS technology. This is for two reasons. One has to do with providing the right level of abstraction. In this case, the CDN running at the network edge is perfectly capable of dealing with a large set of objects, meaning there is no value in managing those objects with full file system semantics (i.e., NAS is a bad fit). Similarly, the storage system needs to understand complete objects and not just blocks (i.e., SAN is not a good fit). The second reason is related to cost. It is simply more cost effective to build a scalable object store from commodity components. This argument is well understood, and leverages the ability to achieve scalable performance and resiliency in software.

Second, a general-purpose CDN that is able to deliver a wide range of content (from software updates to video, from large files to small objects, from live linear streams to on-demand video, from over-the-top to managed video) should not be handicapped by an object store that isn't equally flexible. In particular, it is important that the ingest function be low-latency and redundant, so it is possible to deliver both on-demand and live video. (Even live video needs to be staged through an object store to support time shifting.)

Third, it is not practical to achieve scalable delivery from a data center. Data centers typically provide massive internal bandwidth, making it possible to build scalable storage from commodity servers, but Internet-facing bandwidth is generally limited. This just repeats the argument in favor of delivering content via a CDN: scalable delivery is best achieved from the edge.

About Larry Peterson

As Chief Scientist, Larry Peterson provides technical leadership and expertise for research and development projects. He is also the Robert E. Kahn Professor of Computer Science at Princeton University, where he served as Chairman of the Computer Science Department from 2003-2009. He also serves as Director of the PlanetLab Consortium, a collection of academic, industrial, and government institutions cooperating to design and evaluate next-generation network services and architectures. Larry has served as Editor-in-Chief of the ACM Transactions on Computer Systems, has been on the Editorial Board for the IEEE/ACM Transactions on Networking and the IEEE Journal on Selected Areas in Communications, and is the co-author of the best-selling networking textbook Computer Networks: A Systems Approach. He is a member of the National Academy of Engineering, a Fellow of the ACM and the IEEE, and the 2010 recipient of the IEEE Kobayashi Computers and Communications Award. He received his Ph.D. degree from Purdue University in 1985.

Ceph: Open Source Storage

As the size and performance requirements of storage systems have increased, file system designers have looked to new architectures to facilitate system scalability. Ceph's architecture consists of object storage, block storage, and a POSIX-compliant file system. It is the most significant storage system to have been accepted into the Linux kernel, and it has both kernel and userland implementations. The CRUSH algorithm provides controlled, scalable, decentralized placement of replicated data. In addition, Ceph has a highly scalable metadata layer. Ceph offers compatibility with S3, Swift, and Google Storage, and it is a drop-in replacement for HDFS (and other file systems).

Ceph is unique because it's massively scalable to the exabyte level. The storage system is self-managing and self-healing, which means limited system administrator involvement. It runs on commodity hardware, has no single point of failure, leverages an intelligent storage node system, and is open source.

(Source Origin) A Ceph system is built on industry-standard servers and consists of nodes that handle file-based (NAS), block-based (SAN), or object-based storage. A Ceph cluster consists of a portable operating system interface (POSIX)-compliant file system, storage nodes, a metadata server daemon, and monitor daemons that track the state of the cluster and of the nodes in the cluster. Ceph uses an algorithm called CRUSH (Controlled Replication Under Scalable Hashing) to determine where object data is stored in the cluster and to track modified content for placement on the appropriate media.

Ceph can also be deployed as a block-based storage system. In this configuration, Ceph is mounted as a thin-provisioned block device. When data is written to Ceph, it is automatically striped and replicated across the cluster. The Ceph RADOS Block Device (RBD) works with KVM, supports the import and export of virtual machine images, and provides snapshot capability.
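The striping just described can be sketched as splitting a block-device image into fixed-size objects. The object size and naming scheme below are illustrative only (RBD actually stripes across 4 MB objects by default, with its own naming convention).

```python
# Sketch: striping a block-device image across fixed-size objects so that
# reads and writes spread over many storage nodes. Sizes/names are toys.
OBJ_SIZE = 4   # illustrative; RBD's default object size is 4 MB

def stripe(image_name, data, obj_size=OBJ_SIZE):
    """Split an image into (object_name, chunk) pairs."""
    return [
        (f"{image_name}.{i:08d}", data[off:off + obj_size])
        for i, off in enumerate(range(0, len(data), obj_size))
    ]

chunks = stripe("vm-disk", b"abcdefghij")
# Each chunk can now be stored (and replicated) by a different node.
assert [name for name, _ in chunks] == [
    "vm-disk.00000000", "vm-disk.00000001", "vm-disk.00000002"]
assert chunks[2][1] == b"ij"   # the last object may be partial
```
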

The system can also serve as a file system where it maps the directories and file names of the file system to objects stored in RADOS clusters. The size of these clusters can be expanded or contracted and the workload is automatically rebalanced.

Like file system clusters such as Gluster and Lustre, Ceph is scalable to multiple exabytes of data. Ceph is included in the Linux kernel and integrated into the OpenStack project.

Because, like other open source projects, Ceph can be difficult to install, configure, and maintain, Inktank became the official sponsor of Ceph and will provide not only installation and configuration, performance testing, and infrastructure assessment services, but support for Ceph itself. The company has developed a community for Ceph users where they can chat about Ceph implementation and other issues.

Ceph was designed by Sage Weil, CEO and founder of Inktank, as part of a PhD thesis at the University of California at Santa Cruz. Weil released Ceph into the open source community in 2006. Weil is also a co-founder of DreamHost, the hosting company that developed Ceph and spun it off to Inktank.

Ceph is named after the UC Santa Cruz mascot, Sammy, a banana slug (a mollusk). Ceph is short for cephalopods, a class of mollusks. Since cephalopods release ink, it's likely that Inktank's name also derives from UC Santa Cruz's mascot.

  1. Ceph Wiki   Ceph Intro  Refs:   ceph: distributed storage   ceph = cephalopods
  2. Tutorials:   RadosAndCephFS   Ceph Storage on Kvm-virt   Ceph cluster and RBD volume  
  3. ceph fs   Block Storage   Object Storage   publications   eu ceph   ceph docs   ceph  
  4. Ceph In OpenStack   NFS Over RBD   Use RBD  
  5. CephDoc RBD   Ceph and RBD benchmarks  
  6. Ceph Filesystem Tutorial In Google   Ceph rbd Tutorial  
  7. Ceph Admin. Commands   Ceph Shell Command   Ceph Control utility  
  8. Ceph Cluster Config   Ceph Cluster Config Docs   Ceph Operations   Ceph Internals  
  9. Ceph Rbd   QEMU-RBD   Ceph Cluster  
  10. iSCSI Wiki   Ceph ISCSI Wiki   Rbd via ISCSI   Debian iSCSI Wiki   Debian iSCSI   MultiPath  
  11. Ceph And Rados from Hastexo.com        

Ceph: Scalable distributed file system (Source Origin)

Exploring the Ceph file system and ecosystem

M. Tim Jones, Independent author

Summary:  Linux continues to invade the scalable computing space and, in particular, the scalable storage space. A recent addition to Linux's impressive selection of file systems is Ceph, a distributed file system that incorporates replication and fault tolerance while maintaining POSIX compatibility. Explore the architecture of Ceph and learn how it provides fault tolerance and simplifies the management of massive amounts of data.


As an architect in the storage industry, I have an affinity to file systems. These systems are the user interfaces to storage systems, and although they all tend to offer a similar set of features, they also can provide notably different features. Ceph is no different, and it offers some of the most interesting features you'll find in a file system.

Ceph began as a PhD research project in storage systems by Sage Weil at the University of California, Santa Cruz (UCSC). But as of late March 2010, you can now find Ceph in the mainline Linux kernel (since 2.6.34). Although Ceph may not be ready for production environments, it's still useful for evaluation purposes. This article explores the Ceph file system and the unique features that make it an attractive alternative for scalable distributed storage.

Why "Ceph"?

"Ceph" is an odd name for a file system and breaks the typical acronym trend that most follow. The name is a reference to the mascot at UCSC (Ceph's origin), which happens to be "Sammy," the banana slug, a shell-less mollusk in the cephalopods class. Cephalopods, with their multiple tentacles, provide a great metaphor for a distributed file system.

Ceph goals

Developing a distributed file system is a complex endeavor, but it's immensely valuable if the right problems are solved. Ceph's goals can be simply defined as:

- Easy scalability to multi-petabyte capacity
- High performance over varying workloads
- Strong reliability

Unfortunately, these goals can compete with one another (for example, scalability can reduce or inhibit performance or impact reliability). Ceph has developed some very interesting concepts (such as dynamic metadata partitioning and data distribution and replication), which this article explores shortly. Ceph's design also incorporates fault-tolerance features to protect against single points of failure, with the assumption that storage failures on a large scale (petabytes of storage) will be the norm rather than the exception. Finally, its design does not assume particular workloads but includes the ability to adapt to changing distributed workloads to provide the best performance. It does all of this with the goal of POSIX compatibility, allowing it to be transparently deployed for existing applications that rely on POSIX semantics (through Ceph-proposed enhancements). Finally, Ceph is open source distributed storage and part of the mainline Linux kernel (2.6.34).

Ceph architecture

Now, let's explore the Ceph architecture and its core elements at a high level. I then dig down another level to identify some of the key aspects of Ceph to provide a more detailed exploration.

The Ceph ecosystem can be broadly divided into four segments (see Figure 1): clients (users of the data), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and implements other key responsibilities), and finally the cluster monitors (which implement the monitoring functions).

Figure 1. Conceptual architecture of the Ceph ecosystem
Conceptual flowchart showing the architecture of the Ceph ecosystem: 
   clients, metadata server cluster, object storage cluster, and cluster monitors

As Figure 1 shows, clients perform metadata operations (to identify the location of data) using the metadata servers. The metadata servers manage the location of data and also where to store new data. Note that metadata is stored in the storage cluster (as indicated by "Metadata I/O"). Actual file I/O occurs between the client and object storage cluster. In this way, higher-level POSIX functions (such as open, close, and rename) are managed through the metadata servers, whereas POSIX functions (such as read and write) are managed directly through the object storage cluster.
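The split data path described above can be sketched with toy stand-ins: the client asks the metadata server only where a file lives, then performs I/O directly against the object storage cluster. All class names, paths, and OSD names below are invented for illustration.

```python
# Sketch: metadata operations go through the MDS; file I/O bypasses it
# and talks directly to the object storage cluster.

class MetadataServer:
    def __init__(self):
        # path -> list of OSDs holding the file's objects (toy layout)
        self.locations = {"/home/a.txt": ["osd.1", "osd.2"]}

    def open(self, path):                 # POSIX open/close/rename go here
        return self.locations[path]

class ObjectStorageCluster:
    def __init__(self):
        self.osds = {"osd.1": b"hello", "osd.2": b"hello"}  # replicas

    def read(self, osd):                  # read/write never touch the MDS
        return self.osds[osd]

mds, cluster = MetadataServer(), ObjectStorageCluster()
osds = mds.open("/home/a.txt")            # 1. metadata operation
data = cluster.read(osds[0])              # 2. direct file I/O
assert data == b"hello"
```
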

Another perspective of the architecture is provided in Figure 2. A set of servers access the Ceph ecosystem through a client interface, which understands the relationship between metadata servers and object-level storage. The distributed storage system can be viewed in a few layers, including a format for the storage devices (the Extent and B-tree-based Object File System [EBOFS] or an alternative) and an overriding management layer designed to manage data replication, failure detection, and recovery and subsequent data migration called Reliable Autonomic Distributed Object Storage (RADOS). Finally, monitors are used to identify component failures, including subsequent notification.

Figure 2. Simplified layered view of the Ceph ecosystem
Block diagram showing a simplified layered view of the Ceph ecosystem, 
      including the server, metadata servers, and object storage daemon

Ceph components

With the conceptual architecture of Ceph under your belt, you can dig down another level to see the major components implemented within the Ceph ecosystem. One of the key differences between Ceph and traditional file systems is that, rather than focusing the intelligence in the file system itself, the intelligence is distributed around the ecosystem.

Figure 3 shows a simple Ceph ecosystem. The Ceph Client is the user of the Ceph file system. The Ceph Metadata Daemon provides the metadata services, while the Ceph Object Storage Daemon provides the actual storage (for both data and metadata). Finally, the Ceph Monitor provides cluster management. Note that there can be many Ceph clients, many object storage endpoints, numerous metadata servers (depending on the capacity of the file system), and at least a redundant pair of monitors. So, how is this file system distributed?

Figure 3. Simple Ceph ecosystem
Block diagram of a simple Ceph ecosystem

Ceph client

Kernel or user space

Early versions of Ceph utilized Filesystem in Userspace (FUSE), which pushes the file system into user space and can greatly simplify development. But today, Ceph has been integrated into the mainline kernel, making it faster, because user-space context switches are no longer necessary for file system I/O.

As Linux presents a common interface to the file systems (through the virtual file system switch [VFS]), the user's perspective of Ceph is transparent. The administrator's perspective will certainly differ, given the potential for many servers encompassing the storage system (see the Resources section for information on creating a Ceph cluster). From the users' point of view, they have access to a large storage system and are not aware of the underlying metadata servers, monitors, and individual object storage devices that aggregate into a massive storage pool. Users simply see a mount point, from which standard file I/O can be performed.

The Ceph file system - or at least the client interface - is implemented in the Linux kernel. Note that in the vast majority of file systems, all of the control and intelligence is implemented within the kernel's file system source itself. But with Ceph, the file system's intelligence is distributed across the nodes, which simplifies the client interface but also provides Ceph with the ability to massively scale (even dynamically).

Rather than rely on allocation lists (metadata to map blocks on a disk to a given file), Ceph uses an interesting alternative. A file from the Linux perspective is assigned an inode number (INO) from the metadata server, which is a unique identifier for the file. The file is then carved into some number of objects (based on the size of the file). Using the INO and the object number (ONO), each object is assigned an object ID (OID). Using a simple hash over the OID, each object is assigned to a placement group. The placement group (identified as a PGID) is a conceptual container for objects. Finally, the mapping of the placement group to object storage devices is a pseudo-random mapping using an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). In this way, mapping of placement groups (and replicas) to storage devices does not rely on any metadata but instead on a pseudo-random mapping function. This behavior is ideal, because it minimizes the overhead of storage and simplifies the distribution and lookup of data.
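The mapping chain just described ((INO, ONO) -> OID -> placement group -> OSDs) can be sketched as below. A stable hash stands in for Ceph's hash function, and a seeded shuffle stands in for CRUSH; both are simplifications of the real algorithms, and the PG count and OSD names are illustrative.

```python
# Sketch: locating an object with no per-object metadata lookup. Every
# client computes the same placement from the same inputs.
import hashlib
import random

NUM_PGS = 64
OSDS = [f"osd.{i}" for i in range(8)]

def object_id(ino, ono):
    return f"{ino:x}.{ono:08x}"           # file id + object number -> OID

def placement_group(oid):
    h = int(hashlib.sha1(oid.encode()).hexdigest(), 16)
    return h % NUM_PGS                    # simple hash over the OID -> PGID

def crush(pgid, replicas=3):
    rng = random.Random(pgid)             # pseudo-random but deterministic,
    return rng.sample(OSDS, replicas)     # so no lookup table is needed

oid = object_id(ino=0x1234, ono=0)
pgid = placement_group(oid)
assert crush(pgid) == crush(pgid)         # any client derives the same OSDs
```
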

The final component for allocation is the cluster map. The cluster map is an efficient representation of the devices representing the storage cluster. With a PGID and the cluster map, you can locate any object.

The Ceph metadata server

The job of the metadata server (cmds) is to manage the file system's namespace. Although both metadata and data are stored in the object storage cluster, they are managed separately to support scalability. In fact, metadata is further split among a cluster of metadata servers that can adaptively replicate and distribute the namespace to avoid hot spots. As shown in Figure 4, the metadata servers manage portions of the namespace and can overlap (for redundancy and also for performance). The mapping of metadata servers to namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to adapt to changing workloads (migrating namespaces between metadata servers) while preserving locality for performance.

Figure 4. Partitioning of the Ceph namespace for metadata servers
Diagram showing the partitions of the Ceph namespace for metadata 

But because each metadata server simply manages the namespace for the population of clients, its primary application is an intelligent metadata cache (because actual metadata is eventually stored within the object storage cluster). Metadata to write is cached in a short-term journal, which eventually is pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients (which is common in metadata operations). The journal is also useful for failure recovery: if the metadata server fails, its journal can be replayed to ensure that metadata is safely stored on disk.
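The short-term journal described above can be sketched as an append-only log of metadata updates that is re-applied after a failure. The operation names and structures here are toy-level, not Ceph's actual journal format.

```python
# Sketch: a metadata journal. Updates are appended before being applied;
# after a metadata-server crash, replaying the journal in order
# reconstructs the metadata state.

class MetadataJournal:
    def __init__(self):
        self.entries = []

    def log(self, op, path, value):
        self.entries.append((op, path, value))   # durable append-only log

    def replay(self):
        state = {}
        for op, path, value in self.entries:     # re-apply in order
            if op == "set":
                state[path] = value
            elif op == "delete":
                state.pop(path, None)
        return state

journal = MetadataJournal()
journal.log("set", "/a", {"size": 10})
journal.log("set", "/b", {"size": 20})
journal.log("delete", "/a", None)
# After a failure, replay restores exactly the surviving metadata:
assert journal.replay() == {"/b": {"size": 20}}
```
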

Metadata servers manage the inode space, converting file names to metadata. The metadata server transforms the file name into an inode, file size, and striping data (layout) that the Ceph client uses for file I/O.

Ceph monitors

Ceph includes monitors that implement management of the cluster map, but some elements of fault management are implemented in the object store itself. When object storage devices fail or new devices are added, monitors detect and maintain a valid cluster map. This function is performed in a distributed fashion where map updates are communicated with existing traffic. Ceph uses Paxos, which is a family of algorithms for distributed consensus.

Ceph object storage

Similar to traditional object storage, Ceph storage nodes include not only storage but also intelligence. Traditional drives are simple targets that only respond to commands from initiators. But object storage devices are intelligent devices that act as both targets and initiators to support communication and collaboration with other object storage devices.

From a storage perspective, Ceph object storage devices perform the mapping of objects to blocks (a task traditionally done at the file system layer in the client). This behavior allows the local entity to best decide how to store an object. Early versions of Ceph implemented a custom low-level file system on the local storage called EBOFS. This system implemented a nonstandard interface to the underlying storage tuned for object semantics and other features (such as asynchronous notification of commits to disk). Today, the B-tree file system (BTRFS) can be used at the storage nodes, which already implements some of the necessary features (such as embedded integrity).

Because the Ceph clients implement CRUSH and do not have knowledge of the block mapping of files on the disks, the underlying storage devices can safely manage the mapping of objects to blocks. This allows the storage nodes to replicate data (when a device is found to have failed). Distributing the failure recovery also allows the storage system to scale, because failure detection and recovery are distributed across the ecosystem. Ceph calls this RADOS (see Figure 3).

Other features of interest

As if the dynamic and adaptive nature of the file system weren't enough, Ceph also implements some interesting features visible to the user. Users can create snapshots, for example, in Ceph on any subdirectory (including all of the contents). It's also possible to perform file and capacity accounting at the subdirectory level, which reports the storage size and number of files for a given subdirectory (and all of its nested contents).

Ceph status and future

Although Ceph is now integrated into the mainline Linux kernel, it's properly noted there as experimental. File systems in this state are useful to evaluate but are not yet ready for production environments. But given Ceph's adoption into the Linux kernel and the motivation by its originators to continue its development, it should be available soon to solve your massive storage needs.

Other distributed file systems

Ceph isn't unique in the distributed file system space, but it is unique in the way that it manages a large storage ecosystem. Other examples of distributed file systems include the Google File System (GFS), the General Parallel File System (GPFS), and Lustre, to name just a few. The ideas behind Ceph appear to offer an interesting future for distributed file systems, as massive scale introduces unique challenges to the storage problem.

Going further

Ceph is not only a file system but an object storage ecosystem with enterprise-class features. In the Resources section, you'll find information on how to set up a simple Ceph cluster (including metadata servers, object servers, and monitors). Ceph fills a gap in distributed storage, and it will be interesting to see how the open source offering evolves in the future.



About the author

M. Tim Jones

M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.

The RADOS Object Store and Ceph Filesystem (Source Origin)

By Martin Loschwitz

Scalable storage is a key component in cloud environments. RADOS and Ceph enter the field, promising to support seamlessly scalable storage.

Cloud computing is, without a doubt, the IT topic of our time. No issue pushes infrastructure managers at large-scale enterprises as hard as the question of how best to implement the cloud. The IaaS principle is quite simple: The goal is to provide capacity to users in the form of computational power and storage in a way that means as little work as possible for the user and that keeps the entire platform as flexible as possible.

Put more tangibly, this means customers can request CPU cycles and disk space as they see fit and continue to use both as long as the corresponding services are needed. For IT service providers, this means defining your own setup to be as scalable as possible: It should be possible to accommodate peak loads without difficulty, and if the platform grows - which will be the objective of practically any enterprise - a permanent extension should also be accomplished easily.

In practical terms, implementing this kind of solution tends to be more complex. Scalable virtualization environments are easy enough to achieve: Xen and KVM, in combination with the current crop of management tools, make it easy to manage virtualization hosts. Scaling out is no longer an issue either: If the platform needs more computational performance, you can add more machines that integrate seamlessly with the existing infrastructure.

Things start to become more interesting when you look at storage. The way IT environments store data has remained virtually unchanged in the past few years. In the early 1990s, data centers comprised many servers with local storage, all of which suffered from legacy single points of failure. As of the mid-1990s, Fibre Channel HBAs and matching SAN storage entered the scene, offering far more redundancy than their predecessors, but at a far higher price. People who preferred a lower budget approach turned to DRBD with standard hardware a few years ago, thus avoiding what can be hugely expensive SANs. However, all of these approaches share a problem: They do not scale out seamlessly.

Scale Out and Scale Up

Admins and IT managers distinguish between two basic types of scalability. Vertical scalability (scale up) is based on the idea of extending the resources of existing devices, whereas horizontal scalability (scale out) relies on adding more resources to the existing set (Figure 1). Databases are a classic example of a scale-out solution: Typically, slave servers are added to support load distribution.


Figure 1: Two kinds of scalability.

Scale out is completely new ground when it comes to storage. Local storage in servers, SAN storage, or servers with DRBD will typically only scale vertically (more disks!), not horizontally. When the case is full, you need a second storage device, and this will not typically support integration with the existing storage to provide a single unit, thus making maintenance far more difficult. In terms of SAN storage, two SANs just cost twice as much as one.

Object Stores

If you are planning a cloud and thinking about seamlessly scalable storage, don't become despondent at this point: Authors of the popular cloud applications are fully aware of this problem and now offer workable solutions known as object stores.

Object stores follow a simple principle: All servers that become part of an object store run software that manages and exports the server's local disk space. All instances of this software collaborate on the cluster, thus providing the illusion of a single, large data store. To support internal storage management, the object store software does not save data in its original format on the individual storage nodes, but as binary objects. Most exciting is that the number of individual nodes joining forces to create the large object store is arbitrary. You can even add storage nodes on the fly.
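The principle just described can be sketched in a few lines of Python. This is a toy illustration only: the class and method names are invented, and a real object store adds persistence, replication, and failure handling.

```python
import hashlib
import uuid

class ToyObjectStore:
    """Toy model: several nodes pool their local storage into one
    logical store; data is kept as binary objects under flat,
    UUID-style names (no directory hierarchy on the nodes)."""

    def __init__(self, node_names):
        # each "node" is just a dict standing in for a local disk
        self.nodes = {name: {} for name in node_names}
        self.node_list = sorted(node_names)

    def _pick_node(self, object_id):
        # deterministic placement: hash the object name onto a node
        h = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
        return self.node_list[h % len(self.node_list)]

    def put(self, data: bytes) -> str:
        object_id = str(uuid.uuid4())       # flat, UUID-style name
        node = self._pick_node(object_id)
        self.nodes[node][object_id] = data  # stored as a binary object
        return object_id

    def get(self, object_id: str) -> bytes:
        node = self._pick_node(object_id)   # recompute, no lookup table
        return self.nodes[node][object_id]

store = ToyObjectStore(["alice", "bob", "charlie"])
oid = store.put(b"hello object store")
assert store.get(oid) == b"hello object store"
```

The point of the sketch is the deterministic placement: any client that knows the node list can compute where an object lives without consulting a central lookup table, which is what makes adding further storage nodes on the fly cheap.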

Because the object storage software also has internal mechanisms to handle redundancy, and the whole thing works with standard hardware, a solution of this kind combines the benefits of SANs or DRBD storage and seamless horizontal scalability. RADOS has set out to be the king of the hill in this sector, in combination with the matching Ceph filesystem.

How RADOS Works

RADOS (reliable autonomic distributed object store, although many people mistakenly say "autonomous") has been under development at DreamHost, led by Sage A. Weil, for a number of years and is basically the result of a doctoral thesis at the University of California, Santa Cruz. RADOS implements precisely the functionality of an object store as described earlier, distinguishing between three different layers to do so:

  1. Object Storage Devices (OSDs). An OSD in RADOS is always a folder within an existing filesystem. All OSDs together form the object store proper, and the binary objects that RADOS generates from the files to be stored reside in the store. The hierarchy within the OSDs is flat: files with UUID-style names but no subfolders.
  2. Monitoring servers (MONs): MONs form the interface to the RADOS store and support access to the objects within the store. They handle communication with all external applications and work in a decentralized way: There are no restrictions in terms of numbers, and any client can talk to any MON. MONs manage the MONmap (a list of all MONs) and the OSDmap (a list of all OSDs). The information from these two lists lets clients compute which OSD they need to contact to access a specific file. In the style of a Paxos cluster, MONs also ensure RADOS's functionality in terms of respecting quorum rules.
  3. Metadata servers (MDS): MDSs provide POSIX metadata for objects in the RADOS object store for Ceph clients.

What About Ceph?

Most articles about RADOS just refer to Ceph in the title, causing some confusion. Weil described the relationship between RADOS and Ceph as two parts of the same solution: RADOS is the "lower" part and Ceph the "upper" part. One thing is for sure: The best looking object store in the world is useless if it doesn't give you any options for accessing the data you store in it.

However, it is precisely these options that Ceph offers for RADOS: It is a filesystem that accesses the object store in the background and thus makes its data directly usable in the application. The metadata servers help accomplish this task by providing the metadata required for each file that Ceph accesses in line with the POSIX standard when a user requests a file via Ceph.

Because DreamHost didn't consider until some later stage of development that RADOS could be used as a back end for tools other than filesystems, they generated confusion regarding the names of RADOS and Ceph. For example, the official DreamHost guides refer simply to "Ceph" when they actually mean "RADOS and Ceph."

First RADOS Setup

Theory is one thing, but to gain some understanding of RADOS and Ceph, it makes much more sense to experiment on a "live object." You don't need much for a complete RADOS-Ceph setup: Three servers with local storage will do fine. Why three? Remember that RADOS autonomically provides a high-availability option. The MONs use the Paxos implementation referred to earlier to guarantee that there will always be more than one copy of an object in a RADOS cluster. Although you could turn a single node into a RADOS store, this wouldn't give you much in the line of high availability. A RADOS cluster comprising two nodes is even more critical: In the normal case, the cluster would have a quorum, but after the failure of one node, the surviving node would be useless, because RADOS needs a quorum and a single node out of two can't, by definition, form a majority. In other words, you need three nodes to be on the safe side, so the failure of a single node won't be an issue.
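The quorum arithmetic behind the three-node recommendation can be made explicit. A minimal sketch (the helper name is invented):

```python
def has_quorum(total_nodes: int, alive_nodes: int) -> bool:
    # a Paxos-style quorum is a strict majority of all configured nodes
    return alive_nodes > total_nodes // 2

# two-node cluster: losing one node also loses the quorum
assert has_quorum(2, 2)
assert not has_quorum(2, 1)   # one of two is not a majority

# three-node cluster: a single node may fail without losing the quorum
assert has_quorum(3, 2)
assert not has_quorum(3, 1)
```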

Incidentally, nothing can stop you from using virtual machines with RADOS for your experiments - RADOS doesn't have any functions that require specific hardware features.

Finding the Software

Before experimenting, you need to install RADOS and Ceph. Ceph, which on Linux systems is a plain vanilla filesystem driver (just like ext3 or ext4), made its way into the Linux kernel in version 2.6.34 and is thus available for any distribution with this or a later kernel version (Figure 2). The situation isn't quite as easy with RADOS; however, the documentation points to prebuilt packages, or at least gives you an installation guide, for all of the popular distributions. Note that although the documentation refers to "ceph," the packages contain all of the components you need for RADOS. After installing the packages, it's time to prepare RADOS.

Figure 2: After loading the ceph kernel module, the filesystem is available on Linux. Ceph was first introduced in kernel 2.6.34.

Preparing the OSDs

RADOS needs OSDs. As I mentioned earlier, any folder on a filesystem can act as an OSD; however, the filesystem must support extended attributes. The RADOS authors recommend Btrfs but also mention XFS as an alternative for anyone who is still a bit wary of using Btrfs. For simplicity's sake, I will assume in the following examples that you have a directory named osd.ID in /srv on three servers, where ID stands for the server's hostname in each case. If your three servers are named alice, bob, and charlie, you would have a folder named /srv/osd.alice on server alice, and so on.

If you will be using a local filesystem set up specially for this purpose, be sure to mount it in /srv/osd.ID. Finally, each of the hosts in /srv also needs a mon.ID folder, where ID again stands for the hostname.

In this sample setup, the central RADOS configuration in /etc/ceph/ceph.conf might look like Listing 1.

Listing 1: Sample /etc/ceph/ceph.conf

   [global]
       auth supported = cephx
       keyring = /etc/ceph/$name.keyring
   [mon]
       mon data = /srv/mon.$id
   [osd]
       osd data = /srv/osd.$id
       osd journal = /srv/osd.$id.journal
       osd journal size = 1000
   [mon.alice]
       host = alice
       mon addr =
   [mon.bob]
       host = bob
       mon addr =
   [mon.charlie]
       host = charlie
       mon addr =
   [osd.0]
       host = alice
   [osd.1]
       host = bob
   [osd.2]
       host = charlie
   [mds.alice]
       host = alice

The configuration file defines the following details: each of the three hosts provides an OSD and a MON server; host alice is also running an MDS to ensure that any Ceph clients will find POSIX-compatible metadata on access. Authentication between the nodes is encrypted: The keyring for this is stored in the /etc/ceph folder and goes by the name of $name.keyring, where RADOS will automatically replace name with the actual value later.

Most importantly, the nodes in the RADOS cluster must reach one another directly using the hostnames from your ceph.conf file. This could mean adding these names to your /etc/hosts file. Additionally, you need to be able to log in to all of the RADOS nodes as root later on for the call to mkcephfs, and root needs to be able to call sudo without an additional password prompt on all of the nodes. After fulfilling these conditions, the next step is to create the keyring to support mutual authentication between the RADOS nodes:

 mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/admin.keyring

Now you need to ensure that ceph.conf exists on all the hosts belonging to the cluster (Figure 3). If this is the case, you just need to start RADOS on all of the nodes: Typing

 /etc/init.d/ceph start

will do the trick. After a couple of seconds, the three nodes should have joined the cluster; you can check this by typing

 ceph -k /etc/ceph/admin.keyring -c /etc/ceph/ceph.conf health

which should give you the output shown in Figure 4.


Figure 3: To discover which Ceph services are running on a host, type ps. In this example, the host is an OSD, MON, and MDS.


Figure 4: Ceph has its own health options that tell you whether the RADOS Paxos cluster is working properly.

Using Ceph to Mount the Filesystem

To mount the newly created filesystem on another host, you can use the normal mount command on that host - the target is one of the MON servers (i.e., alice in this example), whose MON address is set in ceph.conf. Because Cephx authentication is being used, I need to identify the login credentials automatically generated by Ceph before I can mount the filesystem. The following command on one of the RADOS nodes outputs the credentials:

ceph-authtool -l /etc/ceph/admin.keyring

The mount process then follows:

 mount -t ceph <MON address>:/ /mnt/osd -vv -o name=admin,secret=mykey 

where mykey needs to be replaced by the value for key that you determined with the last command. The mountpoint in this example is /mnt/osd, which can be used as a normal directory from now on.

The Crush Map

RADOS and Ceph use a fair amount of magic in the background to safeguard the setup against any kind of failure, starting with the mount. Any of the existing MON servers can act as a mountpoint; however, this doesn't mean communications are only channeled between the client and this one MON. Instead, Ceph on the client receives the MONmap and OSDmap from the MON server it contacts and then references them to compute which OSD is best to use for a specific file before going on to handle the communication with this OSD.

The Crush map is another step for improving redundancy. It defines which hosts belong to a RADOS cluster, how many OSDs exist in the cluster, and how to distribute the files over these hosts for best effect. The Crush map makes RADOS rack-aware, allowing admins to manipulate the internal replication of RADOS in terms of individual servers, racks, or security zones in the data center. The setup shown for the example here also has a rudimentary default Crush map. If you want to experiment with your own Crush map, the Ceph wiki gives you the most important information for getting started.

Extending the Existing Setup

How do you go about extending an existing RADOS cluster, by adding more nodes to increase the amount of available storage? If you want to add a node named daisy to the existing setup, the first step would be to define an ID for this node. In this example, IDs 0 through 3 are already assigned, and the new node would have an ID of 4, so you need to type:

 ceph osd create 4

Then you need to extend the ceph.conf files on the existing cluster nodes, adding an entry for daisy. On daisy, you also need to create the folder structure needed for daisy to act as an OSD (i.e., the directories in /srv, as in the previous examples). Next, copy the new ceph.conf to the /etc/ceph folder on daisy.

Daisy also needs to know the current MON structure - after all, she will need to register with an existing MON later on. This means daisy needs the current MONmap for the RADOS cluster. You can read the MONmap on one of the existing RADOS nodes by typing

 ceph mon getmap -o /tmp/monmap

(Figure 5), and then use scp to copy it to daisy (this example assumes you are storing the MONmap in /tmp/monmap on daisy). Now, you need to initialize the OSD directory on daisy:

 ceph-osd -c /etc/ceph/ceph.conf -i 4 --mkfs --monmap /tmp/monmap --mkkey

If the additional cluster node uses an ID other than 4, you need to modify the numeric value that follows -i.


Figure 5: The MONmap contains information about MONs in the RADOS cluster. New OSDs rely on this information.

Finally, you need to introduce the existing cluster to daisy. In this example, the last command created a /etc/ceph/osd.4.keyring file on daisy, which you can copy to one of the existing MONs with scp. Following this,

 ceph auth add osd.4 osd 'allow *' mon 'allow rwx' -i /etc/ceph/osd.4.keyring 

on the same node adds the new OSD to the existing authentication structure (Figure 6). Typing /etc/init.d/ceph start on daisy launches RADOS, and the new OSD registers with the existing RADOS cluster. The final step is to modify the existing Crush map so the new OSD is used. In this example, you would type

 ceph osd crush add 4 osd.4 1.0 pool=default host=daisy

to do this. The new OSD is now part of the existing RADOS/Ceph cluster.


Figure 6: Typing "ceph auth list" tells Ceph to reveal the keys that a MON instance already knows and what the credentials allow the node to do.


It isn't difficult to set up the combination of RADOS and Ceph. But this simple configuration doesn't leverage many of the exciting features that RADOS offers. For example, the Crush map functionality gives you the option of deploying huge RADOS setups over multiple racks in the data center while offering intelligent failsafes. Because RADOS also offers you the option of dividing its storage into individual pools of variable sizes, you can achieve more granularity in terms of different tasks and target groups.

Also, I haven't looked at the RADOS front ends beyond Ceph. After all, Ceph is just one front end of many; in this case, it supports access to files in the object store via the Linux filesystem. However, more options for accessing the data on RADOS exist. The RADOS block device, or rbd for short, is a good choice when you need to support access to files in the object store at the block device level. For example, this would be the case for virtual machines that will typically accept block devices as a back end for virtual disks, thus avoiding slower solutions with disk images. In this way, you can exploit RADOS's potential as an all-encompassing storage system for large virtualization solutions while solving another problem in the context of the cloud.

Speaking of the cloud, besides rbd, librados provides various interfaces for HTTP access - for example, a variant compatible with Amazon's S3 and a Swift-compatible variant. A generic REST interface is also available. As you can see, RADOS has a good selection of interfaces to the world outside.

At the time of writing, the RADOS and Ceph components were still pre-release, but the developers were preparing version 1.0, which may already be available as an officially stable, "enterprise-ready" release by the time you read this.

The RADOS object store and Ceph filesystem: Part 2 (Source Origin)

By Martin Loschwitz

Two issues ago, ADMIN magazine introduced RADOS and Ceph[1] and explained what these tools are all about. In this second article, I will take a closer look and explain the basic concepts that play a role in their development. How does the cluster take care of internal redundancy of the stored objects, for example, and what possibilities exist besides Ceph for accessing the data in the object store?

The first part of this workshop demonstrated how an additional node can be added to an existing cluster using

 ceph osd crush add 4 osd.4 1.0 pool=default host=daisy

assuming that daisy is the hostname of the new server. This command integrates the host daisy in the cluster and gives it the same weight (1.0) as all the other nodes. Removing a node from the cluster configuration is as easy as adding one. The command for that is:

 ceph osd crush remove osd.4

The pool=default parameter for adding the node already refers to an important feature: pools.

Working with Pools

RADOS offers the option of dividing the storage of the entire object store into individual fragments called pools. One pool, however, does not correspond to a contiguous storage area, as with a partition; rather, it is a logical layer consisting of binary data tagged as belonging to the corresponding pool. Pools allow configurations in which individual users can only access specific pools, for example. The pools metadata, data, and rbd are available in the default configuration. A list of existing pools can be called up with rados lspools (Figure 1).


Figure 1: Listing all current pools in RADOS.
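Conceptually, a pool is just a tag carried by each binary object rather than a contiguous region of storage. A hypothetical sketch of that idea:

```python
# pools as a logical layer: each binary object simply carries a pool tag
objects = [
    {"name": "img-1",  "pool": "rbd"},
    {"name": "meta-7", "pool": "metadata"},
    {"name": "file-3", "pool": "data"},
    {"name": "img-2",  "pool": "rbd"},
]

def pools_in_use(objs):
    # roughly what a pool listing boils down to: the set of distinct tags
    return sorted({o["pool"] for o in objs})

assert pools_in_use(objects) == ["data", "metadata", "rbd"]
```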

If you want to add another pool to the configuration, you can use the

 rados mkpool <Name>

command, replacing <Name> with a unique name. If an existing pool is no longer needed,

rados rmpool <Name>

removes it.

Inside RADOS

One of the explicit design goals for RADOS is to provide seamlessly scalable storage. Admins should be able to add any number of storage nodes to a RADOS object store at any time. Redundancy is important, and RADOS takes this into account by managing replication of the data automatically, without the admin of the RADOS cluster having to intervene manually. Combined with scalability, however, this process results in a problem for RADOS that other replication solutions don't have: How to distribute data optimally in a large cluster.

Conventional storage solutions generally "only" make sure data is copied from one server to another, so in the worst case, a failover can be executed. Usually, such solutions only run within one cluster with two, or at most three, nodes. The possibility of adding more nodes is ruled out from the beginning.

With RADOS, theoretically, any number of nodes could be added to the cluster, and for each one, the object store must ensure that its contents are available redundantly within the whole cluster. Not the least of the developers' problems is dealing with "rack awareness." If you have a 20-node RADOS cluster in your data center, you will ideally have it distributed over multiple rooms or buildings for additional security. For that setup to work properly, the storage solution must know where each node is located and where and how each piece of data can be accessed. The solution the RADOS developers - above all RADOS guru Sage A. Weil - came up with consists of two parts: placement groups and the Crush map.

Placement Groups

Three different maps exist within a RADOS cluster: the MONmap, which is a list of all monitoring servers; the OSDmap, in which all physical Object Storage Devices (OSDs) are found; and the Crush map. OSDs themselves contain the binary objects - that is, the data actually saved in the object store.

Here is where placement groups (PGs) come into play: From the outside, it seems as if the allocation of storage objects to specific OSDs occurs randomly. In reality, however, the allocation is done by means of the placement groups. Each object belongs to exactly one such group. Simply speaking, a placement group is a list of different objects that are placed in the RADOS store. RADOS computes which placement group an object belongs to from the name of the object, the desired replication level, and a bitmask that encodes the total number of PGs in the RADOS cluster.
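A simplified model of that computation (not Ceph's actual code: Ceph uses its own hash functions, and the real mapping also folds in the pool; here the PG count is assumed to be a power of two, so the bitmask is simply pg_count - 1):

```python
import hashlib

def placement_group(object_name: str, pg_count: int) -> int:
    # hash the object name, then mask it down to one of pg_count groups
    assert pg_count & (pg_count - 1) == 0, "pg_count must be a power of two"
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h & (pg_count - 1)

# every object deterministically lands in exactly one placement group
pg = placement_group("myfile.iso", 64)
assert 0 <= pg < 64
# the same name always yields the same PG - no lookup table is needed
assert placement_group("myfile.iso", 64) == pg
```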

The Crush Map

The Crush map, which is the second part of this system, contains information about where each placement group is found in the cluster - that is, on which OSD (Figure 2). Replication is always executed at the PG level: All objects of a placement group are replicated between different OSDs in the RADOS cluster.


Figure 2: The elements of the RADOS universe, whose interaction is controlled by the Crush map.

The Crush map got its name from the algorithm it uses: Controlled Replication Under Scalable Hashing. The algorithm was developed by Weil specifically for such tasks in RADOS. Weil highlights one feature of Crush in particular: In contrast to classic hash algorithms, Crush remains stable when many storage devices leave or join the cluster simultaneously. The rebalancing that other storage solutions require creates a lot of traffic with correspondingly long waiting periods. Crush-based clusters, on the other hand, transfer just enough data between storage nodes to achieve a balance.
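The stability claim is easy to demonstrate with a small experiment; rendezvous (highest-random-weight) hashing serves here only as a stand-in for the stable behavior described, not as the Crush algorithm itself:

```python
import hashlib

def node_mod(key: str, n: int) -> int:
    # naive placement: hash modulo the number of nodes
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

def node_hrw(key: str, nodes: list) -> str:
    # rendezvous hashing: each node scores the key; the top scorer wins
    def score(node):
        return int(hashlib.sha256((key + node).encode()).hexdigest(), 16)
    return max(nodes, key=score)

keys = [f"object-{i}" for i in range(2000)]
nodes = [f"osd{i}" for i in range(9)]

# naive modulo hashing: adding a tenth node remaps most of the objects
moved_mod = sum(node_mod(k, 9) != node_mod(k, 10) for k in keys)

# rendezvous hashing: only about a tenth of the objects move
moved_hrw = sum(node_hrw(k, nodes) != node_hrw(k, nodes + ["osd9"])
                for k in keys)

assert moved_hrw < moved_mod
```

With 2,000 objects, the modulo scheme remaps roughly 90 percent of them, while the rendezvous scheme moves only roughly the tenth that belongs on the new node - the kind of minimal rebalancing the article attributes to Crush.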

Manipulating Data Placement

Of course, the admin also has a say in what data lands where. Practically all parameters that pertain to replication in RADOS can be configured by the admin, including, for example, how often an object should exist within a RADOS cluster (i.e., how many replicas of it should be made). Of course, the admin is also free to manipulate the allocation of the replicas to the OSDs. Thus, RADOS can take into account where specific racks are. Rules for replication specified by the admin control distribution of the replicas from placement groups to different OSD groups. A group can, for example, include all servers in the same data center room, another group those in another room, and so on.

Administrators can define how the replication of data in RADOS is done by manipulating the corresponding Crush rules. The basic principle is that neither the allocation of the placement groups nor the results of the Crush calculations can be influenced directly. Instead, replication rules are set for each pool; subsequently, distribution of the placement groups and their positioning by means of the Crush map are done by RADOS.

By default, RADOS includes a rule that two replicas of each object must exist per pool. The number of replicas is always set for each pool; a typical example would be to raise this number to three, as is done here:

ceph osd pool set data size 3

To make the same change for the test pool, data would be replaced by test. Whether the cluster subsequently actually does what the admin expects can be checked with ceph osd dump, which lists each pool together with its current replica count.

Using ceph for this kind of operation is certainly convenient, but it does not provide the full range of functions. In the example, the pool continues to use the default Crush map, which does not consider properties such as the rack where a server is located. If you want to distribute replicas according to such properties, you have to create your own Crush map.

A Crush Map of Your Own

With your own Crush map, you as administrator can have the objects in your pools distributed any way you want. crushtool is a valuable utility for this purpose because it creates corresponding templates. The following example creates a Crush map for a setup consisting of six OSDs (i.e., individual storage devices in servers) distributed over three racks:

 crushtool --num_osds 6 -o crush.example.map --build host straw 1 rack straw 2 root straw 0

The --num_osds 6 parameter specifies that the cluster has six individual storage devices at its disposal. The --build option introduces a statement with three three-part parameters that follow the <Name> <Internal Crush Algorithm> <Number> syntax.

<Name> can be chosen freely; however, it is wise to choose something meaningful. host straw 1 specifies that one replica is allowed per host, and rack straw 2 tells RADOS that two servers exist per rack. root straw 0 refers to the number of racks and determines that the replicas should be distributed equally on all available racks.

Subsequently, if you want to see the resulting Crush map in plain text, you can enter

 crushtool -d crush.example.map -o crush.example

The file with the map in plain text will then be called crush.example (Listing 1).

Listing 1: Crush Map for Six Servers in Two Racks

001 # begin crush map
003 # devices
004 device 0 device0
005 device 1 device1
006 device 2 device2
007 device 3 device3
008 device 4 device4
009 device 5 device5
011 # types
012 type 0 device
013 type 1 host
014 type 2 rack
015 type 3 root
017 # buckets
018 host host0 {
019         id -1            # do not change unnecessarily
020         # weight 1.000
021         alg straw
022         hash 0  # rjenkins1
023         item device0 weight 1.000
024 }
025 host host1 {
026         id -2            # do not change unnecessarily
027         # weight 1.000
028         alg straw
029         hash 0  # rjenkins1
030         item device1 weight 1.000
031 }
032 host host2 {
033         id -3            # do not change unnecessarily
034         # weight 1.000
035         alg straw
036         hash 0  # rjenkins1
037         item device2 weight 1.000
038 }
039 host host3 {
040         id -4            # do not change unnecessarily
041         # weight 1.000
042         alg straw
043         hash 0  # rjenkins1
044         item device3 weight 1.000
045 }
046 host host4 {
047         id -5            # do not change unnecessarily
048         # weight 1.000
049         alg straw
050         hash 0  # rjenkins1
051         item device4 weight 1.000
052 }
053 host host5 {
054         id -6            # do not change unnecessarily
055         # weight 1.000
056         alg straw
057         hash 0  # rjenkins1
058         item device5 weight 1.000
059 }
060 rack rack0 {
061         id -7            # do not change unnecessarily
062         # weight 2.000
063         alg straw
064         hash 0  # rjenkins1
065         item host0 weight 1.000
066         item host1 weight 1.000
067 }
068 rack rack1 {
069         id -8            # do not change unnecessarily
070         # weight 2.000
071         alg straw
072         hash 0  # rjenkins1
073         item host2 weight 1.000
074         item host3 weight 1.000
075 }
076 rack rack2 {
077         id -9            # do not change unnecessarily
078         # weight 2.000
079         alg straw
080         hash 0  # rjenkins1
081         item host4 weight 1.000
082         item host5 weight 1.000
083 }
084 root root {
085         id -10           # do not change unnecessarily
086         # weight 6.000
087         alg straw
088         hash 0  # rjenkins1
089         item rack0 weight 2.000
090         item rack1 weight 2.000
091         item rack2 weight 2.000
092 }
094 # rules
095 rule data {
096         ruleset 1
097         type replicated
098         min_size 2
099         max_size 2
100         step take root
101         step chooseleaf firstn 0 type rack
102         step emit
103 }
105 # end crush map

Of course, the names of the devices and hosts (device0, device1, ... and host0, host1, ...) must be adapted to the local conditions. Thus, the name of the device should correspond to the device name in ceph.conf (in this example: osd.0), and the hostname should agree with the hostname of the server.

Distributing Replicas on the Racks

The replication is defined in line 101 of Listing 1 by step chooseleaf firstn 0 type rack, which determines that replicas are to be distributed on the racks. To achieve a distribution per host, rack would be replaced by host. The min_size and max_size parameters (lines 98 and 99) seem inconspicuous; however, especially in combination with ruleset 1 (line 96), they are very important for RADOS to use the rule created. Which ruleset from the Crush map RADOS will use is specified for each pool; for this, RADOS not only matches names, but also the min_size and max_size parameters, which refer to the number of replicas.

In concrete terms, this means: If RADOS is supposed to process a pool according to ruleset 1, which uses two replicas, then the rule in the example would apply. However, if the admin has used the command explained above to specify that three replicas should exist for the objects in the pool, then RADOS would not apply the rule. To make the rule more generally applicable, it is recommended to set min_size to 1 and max_size to 10; the rule would then apply to all pools that use ruleset 1 and require between 1 and 10 replicas.
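The matching just described can be written down directly. A simplified sketch (the field names follow the Crush map listing):

```python
def rule_applies(rule: dict, pool_ruleset: int, replicas: int) -> bool:
    # a rule matches a pool if the ruleset numbers agree AND the pool's
    # replica count lies between the rule's min_size and max_size
    return (rule["ruleset"] == pool_ruleset
            and rule["min_size"] <= replicas <= rule["max_size"])

strict_rule  = {"ruleset": 1, "min_size": 2, "max_size": 2}
relaxed_rule = {"ruleset": 1, "min_size": 1, "max_size": 10}

# a pool on ruleset 1 with two replicas matches both rules ...
assert rule_applies(strict_rule, 1, 2)
# ... but raising the pool to three replicas breaks the strict rule
assert not rule_applies(strict_rule, 1, 3)
assert rule_applies(relaxed_rule, 1, 3)
```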

On the basis of this example, admins will be able to create their own Crush maps, which must subsequently find their way back into RADOS.

Extending the Existing Crush Map

Practice has shown that it is useful to extend an existing Crush map with new entries instead of building a new one from scratch. The following command will give access to the Crush map currently in use:

 ceph osd getcrushmap -o crush.running.map

This command will save the Crush map in binary format in crush.running.map. This file can be decoded with

 crushtool -d crush.running.map -o crush.map

which transfers the plain text version to crush.map. After editing, the Crush map must be encoded again with

 crushtool -c crush.map -o crush.new.map

before it can be sent back to the RADOS cluster with

 ceph osd setcrushmap -i crush.new.map

Ideally, for newly created pools to use the new rule, it would be set accordingly in ceph.conf; the new rule then simply becomes the default. If the new ruleset from the example has the ID 4 in the Crush map (ruleset 4), then the line

 osd pool default crush rule = 4

in the [osd] configuration block takes care of this. Additionally,

 osd pool default size = 3

can be used to determine that new pools should always be created with three replicas.

The 4K Limit for Ext3 and Ext4

If you complete a RADOS installation and use ext3 or ext4 as the filesystem, you might run into a snag: In these filesystems, the extended file attributes (XATTRs) are limited to a maximum of 4KB, and the XATTR entries from RADOS regularly need just that much or more. If no preventive measures are taken, this setup could, in the worst case, cause ceph-osd to crash or spit out cryptic error messages like (Operation not supported).

This problem can be remedied using an external file to save the XATTR entries. To do so, the line

 filestore xattr use omap = true ; for ext3/4 filesystem

must be inserted into the [osd] entry in ceph.conf (Figure 3). In this way, you can protect the cluster against possible problems.
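The idea behind the omap setting can be pictured with a toy model: attribute values that exceed the filesystem's 4KB limit are diverted to a separate key/value store instead of the inode (purely an illustration of the concept, not Ceph's implementation; the attribute names are made up):

```python
# Toy model of the "filestore xattr use omap" idea: attributes that
# do not fit the filesystem's 4KB xattr limit are spilled into a
# separate key/value store. (Illustration only.)

XATTR_LIMIT = 4096  # ext3/ext4 cap on an extended attribute's size

inline_xattrs = {}  # what would live with the inode in the filesystem
omap_store = {}     # the external object map used as overflow

def set_xattr(name, value):
    if len(value) <= XATTR_LIMIT:
        inline_xattrs[name] = value
    else:
        omap_store[name] = value  # too big for ext3/4, goes to omap

set_xattr("ceph.small", b"x" * 100)
set_xattr("ceph.large", b"x" * 8192)

print(sorted(inline_xattrs))  # ['ceph.small']
print(sorted(omap_store))     # ['ceph.large']
```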


Figure 3: Special care is needed when operating RADOS with ext3 or ext4.

Alternative: The RADOS Block Device

In the first part of this workshop [1], I took an in-depth look at the Ceph filesystem, which is a front end for RADOS. Ceph is not, however, the only way to access the data deposited in a RADOS store; the RADOS Block Device (RBD) is an alternative. With rbd, objects in RADOS can be addressed as if they were on a hard disk.

This functionality is especially helpful in the context of virtualization, because presenting a block device to KVM as a hard disk spares the virtualizer a detour through container formats like qcow2. Additionally, using rbd is very easy; an rbd pool is already included that admins can take advantage of. For example, to create an rbd drive with a size of 1GB, you only need the command

 rbd create test --size 1024

Subsequently, the block device can be used on any host with an rbd kernel module, which is now part of the mainline kernel and is found on practically every system. Because ceph.conf from the first part of this workshop specified that users must authenticate themselves to gain access to the RADOS services, the same is true for rbd. The login credentials can be retrieved with:

 ceph-authtool -l /etc/ceph/admin.keyring

Then, on a machine that has loaded rbd, you can type

 echo "<MON IP addresses, comma-separated> name=admin,secret=<authkey> rbd test" > /sys/bus/rbd/add

to activate the RBD drive.
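The string written to /sys/bus/rbd/add follows the pattern "<monitor addresses> name=<user>,secret=<key> <pool> <image>". A small helper that assembles it (the addresses and key below are placeholders, not real credentials):

```python
# Build the control string for the early rbd kernel interface:
# "<mon1,mon2,...> name=<user>,secret=<key> <pool> <image>"

def rbd_add_string(mon_ips, user, secret, pool, image):
    return "%s name=%s,secret=%s %s %s" % (
        ",".join(mon_ips), user, secret, pool, image)

line = rbd_add_string(["10.0.0.1:6789", "10.0.0.2:6789"],
                      "admin", "AQDLk5VO...", "rbd", "test")
print(line)

# Activating the device would then mean writing the line (as root,
# with the rbd module loaded):
#   open("/sys/bus/rbd/add", "w").write(line)
```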


RADOS lets administrators replace two-node storage with seamlessly scalable RADOS-based storage. Currently, the project developers plan to turn version 0.48 into version 1.0, which will then receive the official "Ready for Enterprise" stamp. Because the previous version 0.47 has already been released, the enterprise version may be expected soon.


[1]"RADOS and Ceph" by Martin Loschwitz, ADMIN, Issue 09, pg. 28

The Author

Martin Gerhard Loschwitz is Principal Consultant at hastexo, where he is intensively involved with high-availability solutions. In his spare time, he maintains the Linux cluster stack for Debian GNU/Linux.

A Basic Ceph Storage & KVM Virtualisation Tutorial (Source Origin)

So I had been meaning to give CEPH & KVM Virtualisation a whirl in the lab for quite some time now. Here I have provided for you all a set of command-by-command instructions I used for setting it up on a single host. The goal here is really just to get it to the 'working' stage for some basic functional experimentation.

Ideally you would want to set this up on multiple hosts with proper network separation so you can see how it performs with real-world network latency.


For those who don't know it, CEPH is a distributed storage solution that allows you to scale horizontally with multiple machines/heads, instead of the more traditional approach of centralised heads with large amounts of storage attached to them.

The principle here is that you should be able to buy lots of inexpensive computers with a bunch of direct attached storage and just cluster them to achieve scalability. Also, without a central point of failure or performance bottleneck you should be able to scale beyond the limitations of our past storage architectures.

So CEPH, like most distributed storage solutions, really has 3 main components: the object storage daemons (OSD), which hold the data; the monitors (MON), which track cluster state; and the metadata servers (MDS), which serve the POSIX filesystem layer.

Here is a basic diagram provided from the official site (so yes I stole it - I hope that's okay):

As you can see, ideally these components are meant to be run on different sets of systems, with the OSD component being the most numerous. I'm just going to run them all on the same host for this demo, which is useful for a functional test, but not for a destructive or performance test.

By the way, the OSD part can use different types of backends and filesystems for storage, but in this example I've chosen BTRFS.

So CEPH itself supports multiple different ways of mounting its storage, which makes it quite a flexible solution.

In this demo I'm going to concentrate only on the RBD and Ceph DFS mechanisms.

This installation was tested with:

I'm using the bleeding edge versions of these components because CEPH is in heavy development; it's better to track the mainline of development to get a clearer picture.

OS Preparation

It was tested with real hardware hosted by Hetzner in Germany. The box specs were roughly something like:

To begin, I personally built a Debian 6.0 system (because that's all Hetzner offers you within its Robot tool) with a spare partition that I later used for the OSD/BTRFS volume. The layout was something like:

And in the LVM partition I defined the following logical volumes:

I reserved the device /dev/md2 for BTRFS. I believe a more optimal configuration is to not use an MD device, but to use /dev/sda2 & /dev/sda3 directly and let BTRFS do the mirroring. I have no data or performance statistics to prove this at the moment, however.

Upgrading the system from Debian 6 to 7 is fairly straightforward. First, update the APT sources list:

 deb http://ftp.de.debian.org/debian/ wheezy main contrib non-free
 deb http://ftp.de.debian.org/debian/ wheezy-proposed-updates main contrib non-free

Then run the following to get the latest updates:

 $ apt-get update
 $ apt-get -y dist-upgrade

The kernel will have been upgraded, so you should reboot at this point.

CEPH Installation

Install CEPH using the ceph package. This should pull in all the dependencies you need.

 $ apt-get install ceph

Create some directories for the various CEPH components:

 $ mkdir -p /srv/ceph/{osd,mon,mds}

I used a configuration file like this below. Obviously you will need to change the various parts to suit your environment. I've left out authentication in this demo for simplicity, although if you want to do real destructive and load-testing you should always include this.

 [global]
     log file = /var/log/ceph/$name.log
     pid file = /var/run/ceph/$name.pid
 [mon]
     mon data = /srv/ceph/mon/$name
 [mon.0]
     host = <your_hostname>
     mon addr = <your_ip_address>:6789
 [mds.0]
     host = <your_hostname>
 [osd]
     osd data = /srv/ceph/osd/$name
     osd journal = /srv/ceph/osd/$name/journal
     osd journal size = 1000 ; journal size, in megabytes
 [osd.0]
     host = <your_hostname>
     btrfs devs = /dev/md2
     btrfs options = rw,noatime
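As a quick sanity check before deploying, note that ceph.conf is INI-like (with ';' starting comments), so it can be inspected with Python's configparser. This is a convenience sketch with placeholder values, not a Ceph tool:

```python
# Parse a ceph.conf-style snippet and list its sections and values.
import configparser

sample = """
[global]
log file = /var/log/ceph/$name.log
[mon.0]
host = node1
mon addr = 192.168.0.10:6789
[osd.0]
host = node1
osd journal size = 1000 ; journal size, in megabytes
"""

# ceph.conf uses ';' for comments, including inline ones.
cfg = configparser.ConfigParser(inline_comment_prefixes=(";",))
cfg.read_string(sample)

print(cfg.sections())                    # ['global', 'mon.0', 'osd.0']
print(cfg["osd.0"]["osd journal size"])  # '1000' (comment stripped)
```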

Now for configuration, CEPH chooses to try and SSH into remote boxes and configure things. I believe this is nice for people who are just getting started, but I'm not sure if this is correct going forward if you already have your own Configuration Management tool like Puppet, Chef or CFEngine.

So to begin with, there is a command that will initialise your CEPH filesystems based on the configuration you have provided:

 $ /sbin/mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs --no-copy-conf

Starting the daemon was a little strange (see the -a switch?):

 $ /etc/init.d/ceph -a start

So just to test it's all working, let's mount the CEPH DFS volume onto the local system:

 $ mount -t ceph <your_hostname>:/ /mnt

What you are looking at here is the CEPH object store mounted in /mnt. This is a shared object store, and you should be able to have multiple hosts mount it just like NFS. As mentioned beforehand, however, this is not the only way of getting access to the CEPH storage cluster.

CEPH DFS & Directory Snapshots

So I just wanted to segue a little and talk about this neat feature. CEPH and the ceph-based mount point above have the capability to do per-directory snapshots, which could come in useful. The interface is quite simple as well.

making a snapshot:
 $ mkdir /mnt/test
 $ cd /mnt/test
 $ touch a b c
 $ mkdir .snap/my_snapshot
deleting a snapshot:
 $ rmdir .snap/my_snapshot
finding a snapshot:

The .snap directory won't show up when you do an ls -la in the directory.

Simply assume it's there and do something like:

 $ ls -la .snap

... in the directory and the snapshots should show up under the names you created them with.

RADOS Block Device

So an alternative way of using your CEPH storage is by using RBD. The RBD interface gives you the capability to expose an object onto a remote system as a block device. Obviously this has the same caveats as any block device, so multiple hosts that mount the same device must ensure they use some sort of clustered file system such as OCFS2.

So first if its not already, load the rbd kernel module:

 $ modprobe rbd

Using the 'rbd' command line tool, create an image (size is in megs):

 $ rbd create mydisk --size 10000

You can list the current images if you want:

 $ rbd list

Now to mount the actual device, you just have to tell the kernel first:

 $ echo "<your_ip_address> name=admin rbd mydisk" > /sys/bus/rbd/add

It should create a device like /dev/rbd/rbd/mydisk. Let's now format it with a real filesystem and mount it:

  $ mkfs -t ext4 /dev/rbd/rbd/mydisk
  $ mkdir /srv/mydisk
  $ mount -t ext4 /dev/rbd/rbd/mydisk /srv/mydisk

KVM/Qemu Support

QEMU (and libvirt, for that matter) at some point merged in patches that allow you to specify an 'rbd' store as a backend for a QEMU virtual instance. I'm going to focus on using an Intel/KVM image for this tutorial.

So lets start by installing KVM & Qemu and the various other pieces we'll need:

 $ apt-get install kvm libvirt-bin virtinst iptables-persistent

We probably want to create a pool for VM disks separate from the pre-existing ones. You can create as many of these as you need:

 $ rados mkpool vm_disks

Now create a qemu image inside the pool. Notice we are just using 'qemu-img' to do this?

 $ qemu-img create -f rbd rbd:vm_disks/box1_disk1 10G
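qemu-img addresses RBD images with an "rbd:<pool>/<image>" spec string. A small helper for building and splitting such specs (a convenience sketch, not part of QEMU or Ceph):

```python
# Build and pick apart "rbd:<pool>/<image>" specs as used by qemu-img.

def rbd_spec(pool, image):
    return "rbd:%s/%s" % (pool, image)

def parse_rbd_spec(spec):
    assert spec.startswith("rbd:"), "not an rbd spec"
    pool, image = spec[len("rbd:"):].split("/", 1)
    return pool, image

spec = rbd_spec("vm_disks", "box1_disk1")
print(spec)                  # rbd:vm_disks/box1_disk1
print(parse_rbd_spec(spec))  # ('vm_disks', 'box1_disk1')
```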

Create yourself a bridge network by modifying the correct Debian configuration file.

auto virbr0
iface virbr0 inet static
  bridge_ports none

And now bring up the interface:

 $ ifup --verbose virbr0

We'll need some firewall rules so that NAT works in this case. Obviously your network needs may vary here.

-A FORWARD -s -m comment --comment "100 allow forwarding from internal" -j ACCEPT
-A FORWARD -d -m comment --comment "100 allow forwarding to internal" -j ACCEPT
-A POSTROUTING -s -o eth0 -m comment --comment "500 outbound nat for internal" -j MASQUERADE

And restart iptables-persistent to load the rules:

 $ service iptables-persistent restart

Turn on forwarding for IPv4:

 $ echo 1 > /proc/sys/net/ipv4/ip_forward

Now that the network is done, we want to create a script to help us launch our VM instance.

First of all create a device definition file called disk.xml with the following contents. This allows us to work-around limitations in virt-install, as it doesn't yet support these extra options as command-line arguments.

<disk type='network' device='disk'>
  <source protocol='rbd' name='vm_disks/box1_disk1'/>
  <target dev='vda' bus='virtio'/>
</disk>

Now let's create our script.

#!/bin/sh
set -x
virt-install \
  --name=box1 \
  --ram=512 \
  --vcpus=1 \
  --location=http://ftp.de.debian.org/debian/dists/wheezy/main/installer-amd64/ \
  --extra-args="console=ttyS0" \
  --serial=pty \
  --console=pty,target_type=serial \
  --os-type=linux \
  --os-variant=debiansqueeze \
  --network=bridge=virbr0,model=virtio \
  --graphics=none \
  --virt-type=kvm \
  --noautoconsole

# This is because virt-install doesn't support passing rbd
# style disk settings yet.
# Attaching it quickly before system boot however seems to work
virsh attach-device box1 disk.xml --persistent

And finally we should be able to run it:


Now attach to the console and go through the standard installation steps for the OS.

 virsh console box1

Note: There is no DHCP or DNS server setup - for this test I just provided a static IP and used my own DNS servers.

As you go through the setup, the RBD disk we defined and created should be available like a normal disk as you would expect. After installation you shouldn't really notice any major functional difference.

Once installation is complete, you should be able to boot the system:

 virsh start box1

And then access the console:

# virsh console box1
Connected to domain box1
Escape character is ^]
Debian GNU/Linux 6.0 box1 ttyS0
box1 login:
And then you're done.


So this is quite an interesting exercise and one worth doing, but the software is still very much early-release. They even admit this themselves.

I'm wary of performance and stability more than anything, something I can't test with just a single host; so if I ever get the time I'd really like to run this thing properly.

I had a brief look at the operations guide, and the instructions for adding and removing a host to the OSD cluster look less automatic than I would like. Ideally, you really want the kind of behaviour that ElasticSearch offers at this level, so that adding and removing nodes is almost a brain-dead task. Having said that, adding a node seems easier than some of the storage systems/solutions I've seen about the place :-).

So regardless of my concerns - I think this kind of storage is definitely the future and I'm certainly cheering the CEPH team on for this one. The functionality was fun (and yes kind of exciting) to play with, and I can see that real-world deployments of such a solution in the open-source arena are quite probable now.

Other things to try from here


  1. Formal (yet incomplete) Documentation:   http://ceph.com/docs/next/
  2. Wiki:   http://ceph.com/w/index.php?title=Main_Page
  3. Installation on Debian:   http://ceph.com/w/index.php?title=Installing_on_Debian
  4. RBD:   http://ceph.com/w/index.php?title=Rbd
  5. QEMU-RBD:   http://ceph.com/w/index.php?title=QEMU-RBD
  6. Snapshots:   http://ceph.com/w/index.php?title=Snapshots

Setting up a Ceph cluster and exporting a RBD volume to a KVM guest (Source Origin)

Posted: October 12th, 2011

Host Cluster Setup, the easy way

Fedora has included Ceph for a couple of releases, but since my hosts are on Fedora 14/15, I grabbed the latest ceph 0.3.1 SRPMs from Fedora 16 and rebuilt those to get something reasonably up to date. In the end I have the following packages installed, though to be honest I don't really need anything except the base 'ceph' RPM:

# rpm -qa | grep ceph | sort

Installing the software is the easy bit, configuring the cluster is where the fun begins. I had three hosts available for testing all of which are virtualization hosts. Ceph has at least 3 daemons it needs to run, which should all be replicated across several hosts for redundancy. There's no requirement to use the same hosts for each daemon, but for simplicity I decided to run every Ceph daemon on every virtualization host.

My hosts are called lettuce, avocado and mustard. Following the Ceph wiki instructions, I settled on a configuration file that looks like this:

    [global]
        auth supported = cephx
        keyring = /etc/ceph/keyring.admin

    [mds]
        keyring = /etc/ceph/keyring.$name
    [mds.lettuce]
        host = lettuce
    [mds.avocado]
        host = avocado
    [mds.mustard]
        host = mustard

    [osd]
        osd data = /srv/ceph/osd$id
        osd journal = /srv/ceph/osd$id/journal
        osd journal size = 512
        osd class dir = /usr/lib64/rados-classes
        keyring = /etc/ceph/keyring.$name
    [osd.0]
        host = lettuce
    [osd.1]
        host = avocado
    [osd.2]
        host = mustard

    [mon]
        mon data = /srv/ceph/mon$id
    [mon.0]
        host = lettuce
        mon addr =
    [mon.1]
        host = avocado
        mon addr =
    [mon.2]
        host = mustard
        mon addr =

The osd class dir bit should not actually be required, but the OSD code looks in the wrong place (/usr/lib instead of /usr/lib64) on x86_64 arches.

With the configuration file written, it is time to actually initialize the cluster filesystem / object store. This is the really fun bit. The Ceph wiki has a very basic page which talks about the mkcephfs tool, along with a scary warning about how it'll 'rm -rf' all the data on the filesystem it is initializing. It turns out that it doesn't mean your entire host filesystem. As far as I can tell, it only blows away the contents of the directories configured for 'osd data' and 'mon data', in my case both under /srv/ceph.

The recommended way is to let mkcephfs ssh into each of your hosts and run all the configuration tasks automatically. Having tried the non-recommended way and failed several times before finally getting it right, I can recommend following the recommended way :-P There are some caveats not mentioned in the wiki page though:

With that in mind, I ran the following commands from my laptop, as root

 # n=0
 # for host in lettuce avocado mustard ; \
   do \
       ssh root@$host mkdir -p /etc/ceph /srv/ceph/mon$n; \
       n=$(expr $n + 1); \
       scp /etc/ceph/ceph.conf root@$host:/etc/ceph/ceph.conf; \
   done
 # mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin

On the host where you ran mkcephfs there should now be a file /etc/ceph/keyring.admin. This will be needed for mounting filesystems. I copied it across to all my virtualization hosts

 # for host in lettuce avocado mustard ; \
   do \
       scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin; \
   done

Host Cluster Usage

Assuming the setup phase all went to plan, the cluster can now be started. A word of warning though, Ceph really wants your clocks VERY well synchronized. If your NTP server is a long way away, the synchronization might not be good enough to stop Ceph complaining. You really want a NTP server on your local LAN for hosts to sync against. Sort this out before trying to start the cluster.

 # for host in lettuce avocado mustard ; \
   do \
       ssh root@$host service ceph start; \
   done

The ceph tool can show the status of everything. The 'mon', 'osd' and 'mds' lines in the status ought to show all 3 hosts present & correct

# ceph -s
2011-10-12 14:49:39.085764    pg v235: 594 pgs: 594 active+clean; 24 KB data, 94212 MB used, 92036 MB / 191 GB avail
2011-10-12 14:49:39.086585   mds e6: 1/1/1 up {0=lettuce=up:active}, 2 up:standby
2011-10-12 14:49:39.086622   osd e5: 3 osds: 3 up, 3 in
2011-10-12 14:49:39.086908   log 2011-10-12 14:38:50.263058 osd1 197 : [INF] 2.1p1 scrub ok
2011-10-12 14:49:39.086977   mon e1: 3 mons at {0=,1=,2=}

The cluster configuration I chose has authentication enabled, so to actually mount the ceph filesystem requires a secret key. This key is stored in the /etc/ceph/keyring.admin file that was created earlier. To view the keyring contents, the cauthtool program must be used

# cauthtool -l /etc/ceph/keyring.admin 
	key = AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
	auid = 18446744073709551615

The base64 key there will be passed to the mount command, repeating on every host needing a filesystem present:

 # mount -t ceph /mnt/ -o name=admin,secret=AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
error adding secret to kernel, key name client.admin: No such device

For some reason, that error message is always printed on my Fedora hosts, and despite that, the mount has actually succeeded

# grep /mnt /proc/mounts
 /mnt ceph rw,relatime,name=admin,secret= 0 0

Congratulations, /mnt is now a distributed filesystem. If you create a file on one host, it should appear on the other hosts & vice versa.
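The base64 secret used with mount is just the "key =" value from the keyring file, so extracting it in a script is straightforward. A convenience sketch of a parser for that file format (the [client.admin] section header is an assumption about the keyring layout; the key is the one shown above):

```python
# Pull the base64 secret out of a Ceph keyring file, for use with
# "mount -t ceph ... -o name=admin,secret=<key>".

keyring_text = """[client.admin]
        key = AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
        auid = 18446744073709551615
"""

def keyring_secret(text):
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("key ="):
            # split on the first '=' only; the key itself ends in '=='
            return line.split("=", 1)[1].strip()
    return None

print(keyring_secret(keyring_text))
```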

RBD Volume setup

A shared filesystem is very nice, and can be used to hold regular virtual disk images in a variety of formats (raw, qcow2, etc). What I really wanted to try was the RBD virtual block device functionality in QEMU. Ceph includes a tool called rbd for manipulating those. The syntax of this tool is pretty self-explanatory

# rbd create --size 100 demo
# rbd ls
# rbd info demo
rbd image 'demo':
	size 102400 KB in 25 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.0
	parent:  (pool -1)

Alternatively RBD volume creation can be done using qemu-img ..., at least once the Fedora QEMU package is fixed to enable RBD support.

# qemu-img create -f rbd rbd:rbd/demo  100M
Formatting 'rbd:rbd/foo', fmt=rbd size=104857600 cluster_size=0 
# qemu-img info rbd:rbd/demo
image: rbd:rbd/foo
file format: raw
virtual size: 100M (104857600 bytes)
disk size: unavailable

KVM guest setup

The syntax for configuring an RBD block device in libvirt is very similar to that used for Sheepdog. In Sheepdog, every single virtualization node is also a storage node, so there is no hostname required. Not so for RBD. Here it is necessary to specify one or more host names, for the RBD servers.

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='demo/wibble'>
    <host name='lettuce.example.org' port='6798'/>
    <host name='mustard.example.org' port='6798'/>
    <host name='avocado.example.org' port='6798'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>

More observant people might be wondering how QEMU gets permission to connect to the RBD server, given that the configuration earlier enabled authentication. This is thanks to the magic of the /etc/ceph/keyring.admin file which must exist on any virtualization server. Patches are currently being discussed which will allow authentication credentials to be set via libvirt, avoiding the need to store the credentials on the virtualization hosts permanently.

Introducing Ceph to OpenStack (Source Origin)


I. Ceph introduction

Ceph is a massively scalable, open source, distributed storage system. It comprises an object store, a block store, and a POSIX-compliant distributed file system. The platform is capable of auto-scaling to the exabyte level and beyond; it runs on commodity hardware, it is self-healing and self-managing, and it has no single point of failure. Ceph is in the Linux kernel and is integrated with the OpenStack cloud operating system. As a result of its open source nature, this portable storage platform may be installed and used in public or private clouds.


You can easily get confused by the naming: Ceph? RADOS?

RADOS: Reliable Autonomic Distributed Object Store is an object store. RADOS takes care of distributing the objects across the whole storage cluster and replicating them for fault tolerance. It is built with 3 major components:

Ceph developers recommend using btrfs as the filesystem for storage. Using XFS is also possible and might be a better alternative for production environments. Neither Ceph nor btrfs is fully production-ready, so it could be risky to put them together; this is why XFS is an excellent alternative to btrfs. The ext4 filesystem is also compatible but doesn't take advantage of all the power of Ceph.

We recommend configuring Ceph to use the XFS file system in the near term, and btrfs in the long term once it is stable enough for production.

For more information about usable file systems

I.2. Ways to store, use and expose data

Several ways to store and access your data :)

Ceph exposes its distributed object store (RADOS) and it can be accessed via multiple interfaces:


The definition of "production quality" varies depending on who you ask. Because it can mean a lot of different things depending on how you want to use Ceph, we prefer not to think of it as a binary term. At this point we support the RADOS object store, radosgw, and RBD because we think they are sufficiently stable that we can handle the support workload. There are several organizations running those parts of the system in production. Others wouldn't dream of doing so at this stage. The CephFS POSIX-compliant filesystem is functionally-complete and has been evaluated by a large community of users, but has not yet been subjected to extensive, methodical testing.

Reference ceph FAQ

II. Ceph installation

Since there is no stable version, I decided to work with the upstream version of Ceph. Thus, I used the Ceph repository and worked with the latest version available, namely 0.47.2. Add the Ceph packages:

 $ wget -q -O- https://raw.github.com/ceph/ceph/master/keys/release.asc | sudo apt-key add -
 $ sudo echo deb http://ceph.com/debian/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
 $ sudo apt-get update && sudo apt-get install ceph

Since I don't have a thousand nodes, I decided to put every service on each node. Here is my really basic Ceph configuration file:

; Ceph conf file!
; use semi-colon to put a comment!
[global]
    auth supported = cephx
    keyring = /etc/ceph/keyring.admin

[mds]
    keyring = /etc/ceph/keyring.$name
[mds.server-03]
    host = server-03
[mds.server-04]
    host = server-04
[mds.server-06]
    host = server-06

[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id/journal
    osd journal size = 512
    osd class dir = /usr/lib/rados-classes
    keyring = /etc/ceph/keyring.$name
    ; working with ext4
    filestore xattr use omap = true
    ; solve rbd data corruption
    filestore fiemap = false
[osd.0]
    host = server-03
    devs = /dev/mapper/nova--volumes-lvol0
[osd.1]
    host = server-04
    devs = /dev/mapper/server-04-lvol0
[osd.2]
    host = server-06
    devs = /dev/sdb

[mon]
    mon data = /srv/ceph/mon$id
[mon.0]
    host = server-03
    mon addr =
[mon.1]
    host = server-04
    mon addr =
[mon.2]
    host = server-06
    mon addr =

Generate the keyring authentication, deploy the configuration, and configure the nodes. I highly recommend setting up SSH key-based authentication beforehand, because mkcephfs will attempt to connect via SSH to each server (hostname) you provided in the Ceph configuration file. It can be a pain in the arse to enter the SSH password for every command run by mkcephfs!

Directory creation is not managed by the script, so you have to create the directories manually on each server:

server-03:~$ sudo mkdir -p /srv/ceph/{osd0,mon0}
server-04:~$ sudo mkdir -p /srv/ceph/{osd1,mon1}
server-06:~$ sudo mkdir -p /srv/ceph/{osd2,mon2}

Don't forget to mount your OSD directory according to your disk map, otherwise Ceph will by default use the root filesystem. It's up to you whether to use ext4 or XFS. For those of you who want to set up an ext4 cluster, I strongly recommend the following mount options for your hard disks:


Now run mkcephfs to deploy your cluster:

 $ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.admin

Ceph doesn't need root permission to execute its commands, it simply needs access to the keyring. Each Ceph command you execute on the command line assumes that you are the client.admin default user. The client.admin key was generated during the mkcephfs process. The interesting thing to know about cephx is that it's based on a Kerberos-like ticket trust mechanism. If you want to go further with cephx authentication, check the Ceph documentation about it. Just make sure that your keyring is readable by everyone:

 $ sudo chmod +r /etc/ceph/keyring.admin

And launch all the daemons:

 $ sudo service ceph start

This will run all the Ceph daemons, namely OSD, MON and MDS (-a flag), but you can specify a particular daemon with an extra parameter such as osd, mon or mds. Now check the status of your cluster by running the following command:

 $ ceph -k /etc/ceph/keyring.admin -c /etc/ceph/ceph.conf health

As you can see, I'm using the -k option; indeed, Ceph supports cephx secure authentication between the nodes within the cluster, and each connection and communication is initiated with this authentication mechanism. Depending on your setup, it can be overkill to use this system...

All the daemons are running (extract from server-04):

 $ ps aux | grep ceph
root     22403  0.0  0.1 126204  7748 ?        Ssl  May23   0:35 /usr/bin/ceph-mon -i 1 --pid-file /var/run/ceph/mon.1.pid -c /etc/ceph/ceph.conf
root     22596  0.0  0.3 148680 13876 ?        Ssl  May23   0:08 /usr/bin/ceph-mds -i server-04 --pid-file /var/run/ceph/mds.server-04.pid \
                                                 -c /etc/ceph/ceph.conf
root     22861  0.0 59.8 2783680 2421900 ?     Ssl  May23   2:03 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

Summary of your Ceph cluster status:

 $ ceph -s
pg v623: 576 pgs: 497 active+clean, 79 active+clean+replay; 11709 bytes data, 10984 MB used, 249 GB / 274 GB avail
mds e13: 1/1/1 up {0=server-06=up:active}, 4 up:standby
osd e15: 3 osds: 3 up, 3 in
log 2012-05-23 22:54:00.018319 mon.0 10 : [INF] mds.0 up:active
mon e1: 3 mons at {0=,1=,2=}

You can also use the -w option to get continuous live output.
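Output like this is easy to watch from a monitoring script. For instance, the osd line can be parsed with a few lines of Python (a toy parser matching the line format shown above, not an official Ceph interface):

```python
# Parse the "osd" line of `ceph -s` and flag an unhealthy cluster.
import re

def parse_osd_line(line):
    m = re.search(r"(\d+) osds: (\d+) up, (\d+) in", line)
    total, up, inn = map(int, m.groups())
    return {"total": total, "up": up, "in": inn,
            "healthy": up == inn == total}

status = parse_osd_line("osd e15: 3 osds: 3 up, 3 in")
print(status)  # {'total': 3, 'up': 3, 'in': 3, 'healthy': True}
```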

II.2. Make it grow!

It's really easy to expand your Ceph cluster. Here I will add a logical volume.

 $ ceph osd create

Copy this into your ceph.conf file:

[osd.3]
    host = server-03
    devs = /dev/mapper/nova--volumes-lvol0

Format, create the OSD directory, mount it:

 $ sudo mkfs.ext4 /dev/mapper/nova--volumes-lvol0
 $ sudo mkdir /srv/ceph/osd3
 $ sudo mount /dev/mapper/nova--volumes-lvol0 /srv/ceph/osd3

Configure the authentication and permissions, then start the new daemon:

 $ ceph-osd -i 3 --mkfs --mkkey
 $ ceph auth add osd.3 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.3
 $ sudo service ceph start osd.3

At the moment, the OSD is part of the cluster but doesn't store any data; you need to add it to the crush map:

 $ ceph osd crush set 3 osd.3 1.0 pool=default host=server-03

The migration starts; wait a few seconds and verify the available space with the ceph -s command. You should notice that your cluster is growing.

You can also perform this check and see that your storage tree has grown as well:

 $ ceph osd tree
dumped osdmap tree epoch 43
# id  weight  type name   up/down reweight
-1    4   pool default
-3    4       rack unknownrack
-2    1           host server-03
0 1               osd.0   up  1   
-4    2           host server-04
1 1               osd.1   up  1   
3 1               osd.3   up  1   
-5    1           host server-06
2 1               osd.2   up  1

I have 2 'resources' on the server-04 because I added a logical volume.

II.3. Shrink your cluster

It's remarkably simple to shrink your Ceph cluster. First, you need to stop the OSD daemon and wait until the OSD is marked as down.

 $ ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
 $ ceph osd rm 1
marked dne osd.1
 $ sudo rm -r /srv/ceph/osd1/

Remove the corresponding OSD section from the ceph.conf file; in this example, osd.1 lives on server-04:

    [osd.1]
        host = server-04

When you work with OSDs you will often see the term crushmap. But what is the crushmap?

CRUSH is a pseudo-random placement algorithm that determines where data (objects) should be placed. The crush map contains this information.
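The idea behind CRUSH's weighted, pseudo-random placement can be sketched as follows: each OSD draws a deterministic pseudo-random "straw" scaled by its weight, and the object lands on the highest draw, so no central lookup table is needed. This is a much-simplified toy, not the real CRUSH algorithm; the OSD names and weights are hypothetical:

```python
# Toy "straw"-style weighted placement: deterministic per object,
# weight-proportional on average. (Illustration only.)
import hashlib

def straw_choose(object_name, osd_weights):
    best, best_draw = None, -1.0
    for osd, weight in osd_weights.items():
        h = hashlib.md5(("%s/%s" % (object_name, osd)).encode()).hexdigest()
        # hash -> value in [0, 1), then scale by the OSD's weight
        draw = (int(h, 16) % 10**6 / 10**6) * weight
        if draw > best_draw:
            best, best_draw = osd, draw
    return best

weights = {"osd.0": 1.0, "osd.1": 1.0, "osd.3": 1.0}
placement = {name: straw_choose(name, weights)
             for name in ("obj-a", "obj-b", "obj-c")}
print(placement)  # same input always yields the same placement
```

Raising an OSD's weight (as `ceph osd crush set ... 1.0` does) makes its straws longer, so proportionally more objects migrate to it.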

II.4. Rebuild an OSD from scratch

Here I rebuild OSD number 1:

 $ sudo service ceph stop osd
 $ sudo umount /srv/ceph/osd1/
 $ sudo mkfs.ext4 /dev/mapper/nova--volumes-lvol0
 $ sudo tune2fs -o journal_data_writeback /dev/mapper/nova--volumes-lvol0

Add this line to your fstab:

/dev/mapper/nova--volumes-lvol0 /srv/ceph/osd1 ext4 rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0 0 0

 $ sudo mount -a
 $ ceph mon getmap -o /tmp/monmap
 $ ceph-osd -c /etc/ceph/ceph.conf --monmap /tmp/monmap -i 1 --mkfs

Finally run the OSD daemon:

 $ sudo service ceph start osd

II.5. Resize an OSD

On an LVM-based setup, stop the OSD server:

 $ mount | grep osd
/dev/mapper/server4-lvol0 on /srv/ceph/osd1 type ext4 (rw,noexec,nodev,noatime,nodiratime,user_xattr,data=writeback,barrier=0)
 $ sudo service ceph stop osd1
 $ sudo umount /srv/ceph/osd1

Check your LVM status; here I resized my logical volume from 90G to 50G:

 $ sudo lvs
  LV     VG      Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lvol0  server4 -wi-ao 50.00g
  root   server4 -wi-ao 40.00g
  swap_1 server4 -wi-ao  4.00g
 $ sudo vgs
  VG      #PV #LV #SN Attr   VSize   VFree 
  server4   1   3   0 wz--n- 135.73g 1.73 g
 $ sudo e2fsck -f /dev/server4/lvol0
 $ sudo lvresize /dev/server4/lvol0 -L 50G --resizefs
fsck from util-linux 2.20.1
e2fsck 1.42 (29-Nov-2011)
/dev/mapper/server4-lvol0: clean, 3754/2621440 files, 3140894/10485760 blocks
resize2fs 1.42 (29-Nov-2011)
Resizing the filesystem on /dev/dm-2 to 13107200 (4k) blocks.
The filesystem on /dev/dm-2 is now 13107200 blocks long.
  Reducing logical volume lvol0 to 50.00 GiB
  Logical volume lvol0 successfully resized

Re-mount your device in the OSD directory and launch the OSD daemon:

 $ sudo mount -a
 $ sudo service ceph start osd1

Check the status with ceph -w; you should notice that the size has changed and that everything is back to normal.

II.6. Adjust the replication level

The replication level is set to 2 by default; you can easily verify this via the rep size 2 value:

 $ ceph osd dump | grep ^pool
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 192 pgp_num 192 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 192 pgp_num 192 last_change 1 owner 0
pool 3 'nova' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 22 owner 0
pool 4 'images' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 10 owner 0

Of course, some pools store more critical data than others. For instance, my pool called nova stores the RBD volume of each virtual machine, so I increased its replication level like this:

 $ ceph osd pool set nova size 3
set pool 3 size to 3
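Keep the capacity cost in mind: with size 3, every byte is stored three times. A rough sketch of the math (ignoring metadata overhead and the near-full safety margins):

```python
def usable_gb(raw_gb, rep_size):
    # With replication `rep_size`, each byte is stored rep_size times,
    # so usable capacity is roughly raw capacity divided by rep_size.
    return raw_gb / rep_size

print(usable_gb(300, 2))  # -> 150.0
print(usable_gb(300, 3))  # -> 100.0
```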

II.7. Connect your client

Clients can access the RADOS cluster either directly via librados (with the rados command) or via librbd (with the rbd command), which provides an image / volume abstraction on top of the object store. To achieve monitor high availability, simply list all of the monitors in the mount options:

 $ ceph-authtool --print-key /etc/ceph/keyring.admin
client:~$ sudo mount -t ceph,, /mnt/ -vv -o \
parsing options: rw,name=admin,secret=AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==

Monitor reliability?

I tried to simulate a MON failure while CephFS was mounted. I stopped one of my MON servers, and precisely the one used for mounting CephFS. Oh yes... I forgot to tell you that I used only one monitor to mount Ceph... And the result was really unexpected: after I stopped the monitor, CephFS didn't crash and stayed alive :). There is some magic performed under the hood. I don't know the exact mechanism, but Ceph clients and monitors are clever enough to detect a MON failure and re-initiate a connection to another monitor, keeping the mounted filesystem alive.

Check this:

client:~$ mount | grep ceph
client:~$ sudo mount -t ceph /mnt -vv -o \
client:~$ mount grep ceph on /mnt type ceph (rw,name=admin,secret=AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==)
client:~$ ls /mnt/
client:~$ touch /mnt/mon-ok
client:~$ ls /mnt/
client:~$ sudo netstat -plantu | grep EST | grep 6789
tcp        0      0         ESTABLISHED -
server6:~$ sudo service ceph stop mon
=== mon.2 ===
Stopping Ceph mon.2 on server6...kill 532...done
client:~$ touch /mnt/mon-3-down
client:~$ sudo netstat -plantu | grep EST | grep 6789
tcp        0      0         ESTABLISHED -
server6:~$ sudo service ceph start mon
=== mon.2 ===
Starting Ceph mon.2 on server6...
starting mon.2 rank 2 at mon_data /srv/ceph/mon2 fsid caf6e927-e87e-4295-ab01-3799d6e24be1
server4:~$ sudo service ceph stop mon
=== mon.1 ===
Stopping Ceph mon.1 on server4...kill 4049...done
client:~$ touch /mnt/mon-2-down
client:~$ sudo netstat -plantu | grep EST | grep 6789
tcp        0      0         ESTABLISHED -
client:~$ touch /mnt/mon-2-down
client:~$ ls /mnt/
mon-ok mon-3-down mon-2-down
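The failover behaviour above can be sketched like this (a toy model, not the actual Ceph client code): because the client holds the full monitor map, it can simply try another monitor when the one it is talking to dies.

```python
def pick_monitor(monitors, is_up):
    # Try each monitor from the map in turn and talk to the first live one.
    for mon in monitors:
        if is_up(mon):
            return mon
    raise RuntimeError("no monitor reachable")

up = {"server6:6789": False, "server4:6789": True, "server5:6789": True}
# server6 (the one we mounted through) is down: the client falls back
print(pick_monitor(["server6:6789", "server4:6789", "server5:6789"], up.get))
# -> server4:6789
```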


III. Openstack integration

III.1. RDB and nova-volume

Before starting, here is my setup; I deliberately installed nova-volume on a node of my Ceph cluster:

   --- - ceph-node-01
       - nova-volume
   --- - ceph-node-02
   --- - ceph-node-03

According to the OpenStack documentation on RBD I just added those lines in nova.conf:


By default, OpenStack will use the RBD pool named rbd if nothing is specified. I preferred to use a pool named nova, so I created it:

 $ rados lspools
 $ rados mkpool nova
 $ rados lspools
 $ rbd --pool nova ls
 $ rbd --pool nova info volume-0000000c
rbd image 'volume-0000000c':
  size 1024 MB in 256 objects
  order 22 (4096 KB objects)
  block_name_prefix: rb.0.0
  parent:  (pool -1)
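The numbers in this output are consistent with each other: order 22 means 2^22-byte (4096 KB) objects, so a 1024 MB image spans exactly 256 of them. A quick sanity check:

```python
def rbd_object_count(size_bytes, order):
    # An RBD image is striped over objects of 2**order bytes each.
    obj_size = 2 ** order
    return (size_bytes + obj_size - 1) // obj_size  # ceiling division

print(rbd_object_count(1024 * 2**20, 22))  # -> 256, as reported by 'rbd info'
```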

Restart your nova-volume:

 $ sudo service nova-volume restart

Try to create a volume, you shouldn't have any problem :)

 $ nova volume-create --display_name=rbd-vol 1

Check this via:

 $ nova volume-list
| ID |   Status  | Display Name | Size | Volume Type | Attached to |
| 51 | available | rbd-vol      | 1    | None        |             |

Check in RBD:

 $ rbd --pool nova ls
 $ rbd --pool nova info volume-00000033
rbd image 'volume-00000033':
  size 1024 MB in 256 objects
  order 22 (4096 KB objects)
  block_name_prefix: rb.0.3
  parent:  (pool -1)

Everything looks great, but wait.. can I attach it to an instance?

Since we are using cephx authentication, nova and libvirt require a couple more steps.

For security and clarity purposes, you may want to create a new user and give it access to your Ceph cluster with fine-grained permissions. Let's say you want to use a user called nova: each connection to your MON server will then be initiated as client.nova instead of client.admin. This behavior is defined by the rados_create function, which creates a handle for communicating with your RADOS cluster. Ceph environment variables are read when this function is called, so if $CEPH_ARGS specifies everything you need to connect, no further configuration is necessary. The trick is to add the following lines at the beginning of the /usr/lib/python2.7/dist-packages/nova/volume/driver.py file:

# use client.nova instead of nova.admin
import os
os.environ["CEPH_ARGS"] = "--id nova"

Adding the variable via the init script of nova-volume should also work; it's up to you. Either way, the nova user needs this environment variable.
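For instance (a sketch, assuming a stock Ubuntu Upstart job at /etc/init/nova-volume.conf; adapt the file name to your setup), a single env stanza is enough:

```
# /etc/init/nova-volume.conf
# Export CEPH_ARGS to the nova-volume daemon so librados connects as
# client.nova instead of client.admin (alternative to patching driver.py)
env CEPH_ARGS="--id nova"
```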

Here I assume that you use client.admin; if you use client.nova, change every value from admin to nova. Now we can configure the secret in libvirt: create a file secret.xml with this content:

<secret ephemeral='no' private='no'>
   <usage type='ceph'>
     <name>client.admin secret</name>
   </usage>
</secret>

Import it into virsh:

 $ sudo virsh secret-define --file secret.xml
Secret 83a0e970-a18b-5490-6fce-642f9052f976 created

Virsh tells you the UUID of the secret, which is how you reference it for other libvirt commands. Now set this value with the client.admin key:

 $ sudo virsh secret-set-value --secret 83a0e970-a18b-5490-6fce-642f9052f976 --base64 AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==
Secret value set
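If virsh complains about the value, make sure the key really is the base64 string printed by ceph-authtool; a quick sanity check (using the key above):

```python
import base64

# A well-formed cephx key round-trips through base64 unchanged.
key = "AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ=="
raw = base64.b64decode(key)
assert base64.b64encode(raw).decode() == key
print("%d bytes, ok" % len(raw))
```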

At this point you should be able to attach a disk manually with virsh using this rbd.xml file. I used the RBD volume previously created:

<disk type='network'>
  <driver name="qemu" type="raw"/>
  <source protocol="rbd" name="nova/volume-00000033">
    <host name='' port='6789'/>
    <host name='' port='6789'/>
    <host name='' port='6789'/>
  </source>
  <target dev="vdb" bus="virtio"/>
  <auth username='admin'>
    <secret type='ceph' uuid='83a0e970-a18b-5490-6fce-642f9052f976'/>
  </auth>
</disk>

The XML syntax for network disks and Ceph secrets is documented on the libvirt website.

Log in to the compute node where the instance is running and check the name of the running instance. If you don't know where the instance is running, launch the following commands:

 $ nova list
|                  ID                  | Name              | Status |       Networks      |
| e1457eea-ef67-4df3-8ba4-245d104d2b11 | instance-over-rbd | ACTIVE | vlan1= |
 $ nova show e1457eea-ef67-4df3-8ba4-245d104d2b11
|               Property              |                          Value                           |
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-SRV-ATTR:host                | server-02                                                |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                     |
| OS-EXT-SRV-ATTR:instance_name       | instance-000000d6                                        |
| OS-EXT-STS:power_state              | 1                                                        |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| config_drive                        |                                                          |
| created                             | 2012-06-07T12:25:48Z                                     |
| flavor                              | m1.tiny                                                  |
| hostId                              | 30dec431592ca96c90bb4990d0df235f4face63907a7fc2ecdcb36d3 |
| id                                  | e1457eea-ef67-4df3-8ba4-245d104d2b11                     |
| image                               | precise-ceph                                             |
| key_name                            | seb                                                      |
| metadata                            | {}                                                       |
| name                                | instance-over-rbd                                        |
| progress                            | 0                                                        |
| status                              | ACTIVE                                                   |
| tenant_id                           | d1f5d27ccf594cdbb034c8a4123494e9                         |
| updated                             | 2012-06-07T13:06:43Z                                     |
| user_id                             | 557273155f8243bca38f77dcdca82ff6                         |
| vlan1 network                       |                                            |

As you can see, my instance is running on server-02. Pick up the instance name, here instance-000000d6, and attach the device manually with virsh:

server-02:~$ sudo virsh attach-device instance-000000d6 rbd.xml
Device attached successfully

Now check inside your instance: use your credentials and log in via SSH. You will see a new device called vdb:

server-02:~$ ssh -i seb.pem ubuntu@
ubuntu@instance-over-rbd:~$ sudo fdisk -l
Disk /dev/vda: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders, total 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
   Device Boot      Start         End      Blocks   Id  System
/dev/vda1   *       16065     4192964     2088450   83  Linux
Disk /dev/vdb: 1073 MB, 1073741824 bytes
16 heads, 63 sectors/track, 2080 cylinders, total 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/vdb doesn't contain a valid partition table

Now you are ready to use it:

ubuntu@instance-over-rbd:~$ sudo mkfs.ext4 /dev/vdb
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
  32768, 98304, 163840, 229376
Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
ubuntu@instance-over-rbd:~$ sudo mount /dev/vdb /mnt
ubuntu@instance-over-rbd:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       2.0G  668M  1.3G  35% /
udev            242M   12K  242M   1% /dev
tmpfs            99M  212K   99M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            246M     0  246M   0% /run/shm
/dev/vdb       1022M   47M  924M   5% /mnt
ubuntu@instance-over-rbd:~$ sudo touch /mnt/test
ubuntu@instance-over-rbd:~$ ls /mnt/
lost+found  test

Last but not least, edit nova.conf on each nova-compute server with the authentication values. It works without these options, since we added them manually to libvirt, but I think it's good to declare them to nova as well. You will then be able to attach a volume to an instance from the nova CLI with the nova volume-attach command, and from the dashboard as well :D


Here we go!

 $ nova volume-create --display_name=nova-rbd-vol 1
 $ nova volume-list
| ID |   Status  | Display Name | Size | Volume Type |             Attached to              |
| 51 | available | rbd-vol      | 1    | None        |                                      |
| 57 | available | nova-rbd-vol | 1    | None        |                                      |
 $ nova volume-attach e1457eea-ef67-4df3-8ba4-245d104d2b11 57 /dev/vdd
 $ nova volume-list
| ID |   Status  | Display Name | Size | Volume Type |             Attached to              |
| 51 | available | rbd-vol      | 1    | None        |                                      |
| 57 | in-use    | nova-rbd-vol | 1    | None        | e1457eea-ef67-4df3-8ba4-245d104d2b11 |

The first disk is marked as available simply because it was attached manually with virsh and not with nova. Have a look inside your virtual machine :)

Detach the manually attached disk:

 $ sudo virsh detach-device instance-000000d6 rbd.xml
Device detached successfully

/!\ Important note: the secret.xml needs to be added on each nova-compute node, more precisely to its libvirt. Keep the first secret (uuid) and put it into your secret.xml; the file below becomes your new secret.xml reference file.

<secret ephemeral='no' private='no'>
   <uuid>83a0e970-a18b-5490-6fce-642f9052f976</uuid>
   <usage type='ceph'>
     <name>client.admin secret</name>
   </usage>
</secret>

Errors found while attaching:

error : qemuMonitorJSONCheckError:318 : internal error unable to execute QEMU command 'device_add': Device 'virtio-blk-pci' could not be initialized 
error : qemuMonitorJSONCheckError:318 : internal error unable to execute QEMU command 'device_add': Duplicate ID 'virtio-disk2' for device 

The first one occurred when I tried to attach a volume with /dev/vdb as the device name, and the second with /dev/vdc. It was solved by using a device name other than /dev/vdc; I think libvirt remembers 'somewhere' and 'somehow' that a device was previously attached (the manual one). I didn't really investigate, since it can be simply worked around.

EDIT: 11/07/2012

Some people reported a common issue to me: they were unable to attach an RBD device with nova, although it worked fine with libvirt directly. If you have difficulty making it work, you will probably need to update the libvirt AppArmor profile. If you check /var/log/libvirt/qemu/your_instance_id.log, you should see:

unable to find any monitors in conf. please specify monitors via -m monaddr or -c ceph.conf

And if you dive into the debug mode:

debug : virJSONValueFromString:914 : string={"return": "error connecting\r\ncould not \
  open disk image rbd:nova/volume-00000050: No such file or directory\r\n", "id": "libvirt-12"}

And of course it's logged by AppArmor, and the message is pretty explicit:

 $ sudo grep -i denied /var/log/kern.log
server-01 kernel: [28874.202700] type=1400 audit(1341957073.795:51): apparmor="DENIED" 
  operation="open" parent=1 profile="libvirt-bd261aa7-728b-4edb-bd18-2ae2370b6549" 
  name="/etc/ceph/ceph.conf" pid=5833 comm="kvm" requested_mask="r" denied_mask="r" 
  fsuid=108 ouid=0

Now edit the libvirt AppArmor profile, you need to adjust access controls for all VMs, new or existing:

 $ echo "/etc/ceph/** r," | sudo tee -a /etc/apparmor.d/abstractions/libvirt-qemu
 $ sudo service libvirt-bin restart
 $ sudo service apparmor reload

That's all, after this libvirt/qemu will be able to read your ceph.conf and your keyring (if you use cephx) ;-).

III.2. RBD and Glance

III.2.1. RBD as Glance storage backend

I followed the official instructions from the OpenStack documentation. I recommend using the upstream packages from Ceph, since the Ubuntu repo doesn't provide a valid version. This issue was recently reported by Florian Haas on the OpenStack and Ceph mailing lists, and the bug is already being tracked. It has been uploaded to precise-proposed for SRU review and is waiting for approval; this shouldn't take too long. Be sure to add the Ceph repo (deb http://ceph.com/debian/ precise main) on your Glance server (as I did earlier).

 $ sudo apt-get install python-ceph

Modify your glance-api.conf like so:

# Set the rbd storage
default_store = rbd
# ============ RBD Store Options =============================
# Ceph configuration file path
# If using cephx authentication, this file should
# include a reference to the right keyring
# in a client.<USER> section
rbd_store_ceph_conf = /etc/ceph/ceph.conf
# RADOS user to authenticate as (only applicable if using cephx)
rbd_store_user = glance
# RADOS pool in which images are stored
rbd_store_pool = images
# Images will be chunked into objects of this size (in megabytes).
# For best performance, this should be a power of two
rbd_store_chunk_size = 8
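With an 8 MB chunk size, the number of data objects Glance creates for an image is a simple ceiling division (the pool also holds a few extra RBD metadata objects, so rados df can report slightly more):

```python
def glance_rbd_chunks(image_bytes, chunk_mb=8):
    # Glance splits the image into chunk_mb-megabyte RADOS objects
    chunk = chunk_mb * 2**20
    return (image_bytes + chunk - 1) // chunk

print(glance_rbd_chunks(227213312))  # -> 28 data objects for the precise image
```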

To match the Glance configuration, I created a new pool and a new user in RADOS:

 $ rados mkpool images
successfully created pool images
 $ ceph-authtool --create-keyring /etc/glance/rbd.keyring
creating rbd.keyring
 $ ceph-authtool --gen-key --name client.glance --cap mon 'allow r' --cap osd 'allow rwx pool=images' /etc/glance/rbd.keyring
 $ ceph auth add client.glance -i /etc/glance/rbd.keyring
2012-05-24 10:45:58.101925 7f7097c31780 -1 read 122 bytes from /etc/glance/rbd.keyring
added key for client.glance
 $ sudo chown glance:glance /etc/glance/rbd.keyring

After this you should see a new key in ceph:

 $ ceph auth list
installed auth entries:
  key: AQDVGc5PaLVfKBAAqWFONvImdw7WSu4Sf/e4qg==
  key: AQDPGc5PGGXXNxAAoMr9ebDaCwhWo+xbv7cm7A==
  caps: [mds] allow
  caps: [mon] allow rwx
  caps: [osd] allow *
  key: AQC6Gc5PGK4cJxAAxRnNC0rRNGPqpJd3lNYWNA==
  caps: [mds] allow
  caps: [mon] allow rwx
  caps: [osd] allow *
  key: AQDUGc5PWBRiHRAAUMp2s78p1C31Q0D8MjZS+Q==
  caps: [mds] allow
  caps: [mon] allow rwx
  caps: [osd] allow *
  key: AQDJGc5PiGvTCxAAlV4WvTTeGgI2SpR7Vl2V2g==
  caps: [mon] allow rwx
  caps: [osd] allow *
  key: AQC0Gc5PoDLwGRAAjVvMaLhklPfzSfN1K91xOA==
  caps: [mon] allow rwx
  caps: [osd] allow *
  key: AQDOGc5PgDhwLBAAxuwS9w5d3nlVsm6ACMZJ2g==
  caps: [mon] allow rwx
  caps: [osd] allow *
  key: AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==
  caps: [mds] allow
  caps: [mon] allow *
  caps: [osd] allow *
  key: AQDeJc5PwDqpCxAAdggTbAVxTDxGLqjTV5pJdg==
  caps: [mon] allow r
  caps: [osd] allow rwx pool=images

Now restart your glance server:

 $ sudo service glance-api restart && sudo service glance-registry restart

Before uploading, check your images pool:

 $ rados --pool=images ls

Try to upload a new image.

 $ wget http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-amd64-disk1.img
 $ glance add name="precise-ceph" is_public=True disk_format=qcow2 container_format=ovf architecture=x86_64 < precise-server-cloudimg-amd64-disk1.img
Uploading image 'precise-ceph'
======================================================================================================[100%] 26.2M/s, ETA  0h  0m  0s
Added new image with ID: 70685ad4-b970-49b7-8bde-83e58b255d95

Check in glance:

 $ glance index
ID                                   Name                           Disk Format          Container Format     Size
------------------------------------ ------------------------------ -------------------- -------------------- --------------
60beab84-81a7-46d1-bb4a-19947937dfe3 precise-ceph                   qcow2                ovf                       227213312

Recheck your images pool, oh! objects :D

 $ rados --pool=images ls

Size of the pool:

 $ du precise-server-cloudimg-amd64.img
221888    precise-server-cloudimg-amd64.img
 $ rados --pool=images df
pool name       category                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
images          -                    221889           31            0            0           0           11            9          326      1333327
  total used        24569260         2267
  total avail      259342756
  total space      298878748

Check the image in the glance database:

mysql> use glance;
mysql> select * from images where status='active' \G;
*************************** 1. row ***************************
              id: cc7167d6-6dbe-4a2b-8609-b599a48ebbb6
            name: precise-cephAAA
            size: 227213312
          status: active
       is_public: 1
        location: rbd://cc7167d6-6dbe-4a2b-8609-b599a48ebbb6
      created_at: 2012-06-04 15:29:22
      updated_at: 2012-06-04 15:29:31
      deleted_at: NULL
         deleted: 0
     disk_format: qcow2
container_format: ovf
        checksum: fa7325f35ab884c6598154dcd4548063
           owner: d1f5d27ccf594cdbb034c8a4123494e9
        min_disk: 0
         min_ram: 0
       protected: 0

As you can see, it's stored in RBD: rbd://cc7167d6-6dbe-4a2b-8609-b599a48ebbb6

From now on, you should be able to launch new instances; Glance will retrieve the images from the RBD pool.

III.2.2. Instance snapshot to RBD

Testing the snapshots:

instance ad0e6a24-9648-406f-b86d-6312ea905888: snapshotting 
sudo nova-rootwrap qemu-img snapshot -c 56642cf3d09b49a7aa400b6bc07494b9 \
qemu-img convert -f qcow2 -O qcow2 -s 56642cf3d09b49a7aa400b6bc07494b9 \
  /var/lib/nova/instances/instance-00000097/disk /tmp/tmpt7EriB/56642cf3d09b49a7aa400b6bc07494b9
sudo nova-rootwrap qemu-img snapshot -d 56642cf3d09b49a7aa400b6bc07494b9 \

Let's describe the process:

  1. The first command initiates and creates the snapshot, named 56642cf3d09b49a7aa400b6bc07494b9, from the instance's disk image located at /var/lib/nova/instances/instance-00000097/disk.
  2. The second command converts the snapshot into a standalone qcow2 file and stores this backup in Glance, thus in RBD. The image is stored in qcow2 format, which is not really what we want! I want an RBD (format) image.
  3. The third command deletes the local snapshot file, no longer needed since the image has been stored in the Glance backend.

When you perform a snapshot of an instance from the dashboard or via the nova image-create command, nova makes a local copy of the changes in a qcow2 file; this file is then stored in Glance.

If you want to take an RBD snapshot through OpenStack, you need to take a volume snapshot. This functionality is not exposed in the dashboard yet.

Snapshot a RBD volume:

snapshot snapshot-00000004: creating 
snapshot snapshot-00000004: creating from (pid=18829) create_snapshot
rbd --pool nova snap create --snap snapshot-00000004 volume-00000042
snapshot snapshot-00000004: created successfully from (pid=18829) create_snapshot 


 $ rbd --pool=nova snap ls volume-00000042
2 snapshot-00000004   1073741824

Full RBD management?

 $ qemu-img info precise-server-cloudimg-amd64.img
image: precise-server-cloudimg-amd64.img
file format: qcow2
virtual size: 2.0G (2147483648 bytes)
disk size: 217M
cluster_size: 65536
 $ sudo qemu-img convert -f qcow2 -O rbd precise-server-cloudimg-amd64.img rbd:images/glance
 $ qemu-img info rbd:nova/ceph-img-cli
image: rbd:nova/ceph-img-cli
file format: raw
virtual size: 2.0G (2147483648 bytes)
disk size: unavailable

There are surprising values here: why does the image appear as raw format, and why is the disk size unavailable? This makes sense once you remember that RBD exposes an image as a plain block device with no container format, so qemu-img sees it as raw, and the allocated size can't be queried through the RBD driver. For those of you who want to go further with QEMU-RBD snapshots, see the Ceph documentation.

III.3. Does the dream come true?

Boot from an RBD image? I uploaded a new image to the Glance RBD backend, tried to boot from it, and it works. Glance is able to retrieve images from the configured RBD backend. You will usually see this log message:

INFO nova.virt.libvirt.connection [-] [instance: ce230d11-ddf8-4298-a7d9-40ae8690ff11] Instance spawned successfully. 

III.4. Boot from a volume

Booting from a volume will require specifying a dummy image id, as shown in these scripts:

start-on-rbd on Github

set -e
DIR=`dirname $0`
if [ ! -f $DIR/debian.img ]; then
        echo "Downloading debian image..."
        wget http://ceph.com/qa/debian.img -O $DIR/debian.img
fi
touch $DIR/dummy_img
glance add name="dummy_raw_img" is_public=True disk_format=raw container_format=ovf architecture=x86_64 < $DIR/dummy_img
echo "Waiting for image to become available..."
while true; do
        if ( timeout 5 nova image-list | egrep -q 'dummy_raw_img.*ACTIVE' ); then
                break
        fi
        sleep 2
done
echo "Creating volume..."
nova volume-create --display_name=dummy 1
echo "Waiting for volume to be available..."
while true; do
        if ( nova volume-list | egrep -q 'dummy.*available' ); then
                break
        fi
        sleep 2
done
echo "Replacing blank image with real one..."
# last created volume id, assuming pool nova
DUMMY_VOLUME_ID=$(rbd --pool=nova ls | sed -n '$p')
rbd -p nova rm $DUMMY_VOLUME_ID
rbd -p nova import $DIR/debian.img $DUMMY_VOLUME_ID
echo "Requesting an instance..."
# the nova boot request (with block_device_mapping) is issued here
echo "Waiting for instance to start..."
while true; do
        if ( nova list | egrep -q 'boot-from-rbd.*ACTIVE' ); then
                break
        fi
        sleep 2
done

boot-from-volume on Github

#!/usr/bin/env python
import argparse
import httplib2
import json
import os

def main():
    http = httplib2.Http()
    parser = argparse.ArgumentParser(description='Boot an OpenStack instance from RBD')
    parser.add_argument(
        'endpoint',
        help='the Nova API endpoint (http://IP:port/vX.Y/)')
    parser.add_argument(
        'image_id',
        help="The image ID Nova will pretend to boot from (ie, 1 -- not ami-0000001)")
    parser.add_argument(
        'volume_id',
        help='The RBD volume ID (ie, 1 -- not volume-0000001)')
    parser.add_argument(
        '-v', '--verbose',
        action='store_true',
        help='be more verbose')
    args = parser.parse_args()
    headers = {
        'Content-Type': 'application/json',
        'x-auth-project-id': 'admin',
        'x-auth-token': 'admin:admin',
        'Accept': 'application/json',
    }
    req = {
        'server': {
            'min_count': 1,
            'flavorRef': 1,
            'name': 'test1',
            'imageRef': args.image_id,
            'max_count': 1,
            'block_device_mapping': [{
                'virtual': 'root',
                'device_name': '/dev/vda',
                'volume_id': args.volume_id,
                'delete_on_termination': False,
            }],
        },
    }
    resp, body = http.request(
        os.path.join(args.endpoint, 'servers'),
        'POST',
        headers=headers,
        body=json.dumps(req))
    if resp.status == 200:
        print "Instance scheduled successfully."
        if args.verbose:
            print json.dumps(json.loads(body), indent=4, sort_keys=True)
    else:
        print "Failed to create an instance: response status", resp.status
        print json.dumps(json.loads(body), indent=4, sort_keys=True)

if __name__ == '__main__':
    main()

Both are a little bit outdated, so I rewrote some parts; it's not that demanding, and I didn't spend much time on it, so there's still work to be done. For example, I don't use the euca API, so I simply rewrote those calls against the nova API.

Josh Durgin from Inktank said the following:

What's missing is that OpenStack doesn't yet have the ability to initialize a volume from an image. You have to put an image on one yourself before you can boot from it currently. This should be fixed in the next version of OpenStack. Booting off of RBD is nice because you can do live migration, although I haven't tested that with OpenStack, just with libvirt. For Folsom, we hope to have copy-on-write cloning of images as well, so you can store images in RBD with glance, and provision instances booting off cloned RBD volumes in very little time.

It's already on the Folsom roadmap.

I quickly tried this manipulation, but without success:

 $ nova volume-create --display_name=dummy 1
 $ nova volume-list
| ID |   Status  | Display Name | Size | Volume Type |             Attached to              |
| 69 | available | dummy        | 2    | None        |                                      |
 $ rbd -p nova ls
 $ rbd import debian.img volume-00000045
Importing image: 13% complete...2012-06-08 13:45:34.562112 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 27% complete...2012-06-08 13:45:39.563358 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 41% complete...2012-06-08 13:45:44.563607 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 55% complete...2012-06-08 13:45:49.564244 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 69% complete...2012-06-08 13:45:54.565737 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 83% complete...2012-06-08 13:45:59.565893 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 97% complete...2012-06-08 13:46:04.567426 7fbb19835700  0 client.4355.objecter  pinging osd that serves lingering tid 1 (osd.1)
Importing image: 100% complete...done.
 $ nova boot --flavor m1.tiny --image precise-ceph --block_device_mapping vda=69:::0 --security_groups=default boot-from-rbd

III.5. Live migration with CephFS!

I was brave enough to also experiment with live migration on the Ceph filesystem. Some of the prerequisites are obvious, but just to be sure: live migration comes with mandatory requirements, the main one being shared storage for /var/lib/nova/instances (provided here by CephFS).

For the live-migration configuration I followed the official OpenStack documentation. The following actions need to be performed on each compute node:

Update the libvirt configurations. Modify /etc/libvirt/libvirtd.conf:

listen_tls = 0
listen_tcp = 1
auth_tcp = "none"

Modify /etc/init/libvirt-bin.conf and add the -l option:

 libvirtd_opts=" -d -l"

Restart libvirt. After executing the commands, ensure that libvirt has restarted successfully:

 $ sudo stop libvirt-bin && sudo start libvirt-bin
 $ ps -ef | grep libvirt

Make sure that you see the -l flag in the ps output. You should be able to query one hypervisor from another without a password; to test it, simply run:

server-02:/$ sudo virsh --connect qemu+tcp://server-01/system list
Id Name                 State
   1 instance-000000af    running
   3 instance-000000b5    running

My nova.conf options:


Mount the nova instance directory with CephFS and assign nova as the owner of the directory:

 $ sudo mount -t ceph /var/lib/nova/instances -vv -o name=admin,secret=AQARB71PUCuuAxAAPhlUGzkRdDdjNDJy1w8MQQ==
 $ sudo chown nova:nova /var/lib/nova/instances

Check your nova services:

server-01:~$ sudo nova-manage service list
Binary           Host                                 Zone             Status     State Updated_At
nova-consoleauth server-05                            nova             enabled    :-)   2012-05-29 15:34:15
nova-cert        server-05                            nova             enabled    :-)   2012-05-29 15:34:15
nova-scheduler   server-05                            nova             enabled    :-)   2012-05-29 15:34:14
nova-compute     server-02                            nova             enabled    :-)   2012-05-29 15:34:14
nova-network     server-02                            nova             enabled    :-)   2012-05-29 15:34:18
nova-volume      server-03                            nova             enabled    :-)   2012-05-29 15:34:23
nova-compute     server-01                            nova             enabled    :-)   2012-05-29 15:33:50
nova-network     server-01                            nova             enabled    :-)   2012-05-29 15:33:51
server-01:~$ nova list
|                  ID                  |      Name     | Status |             Networks             |
| 1ff0f8c4-bdc9-48d4-95ea-515f3a2ff6d4 | pouet         | ACTIVE | vlan1=, |
| 5e7618a1-15df-45e8-86b6-02698e143b92 | boot-from-rbd | ACTIVE | vlan1=              |
| ce230d11-ddf8-4298-a7d9-40ae8690ff11 | medium-rbd    | ACTIVE | vlan1=              |
| ea68ee9a-7b0b-48d7-a9ce-a9328077ca9d | test          | ACTIVE | vlan1=              |
server-01:~$ nova show ce230d11-ddf8-4298-a7d9-40ae8690ff11
|               Property              |                          Value                           |
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-SRV-ATTR:host                | server-01                                                |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                     |
| OS-EXT-SRV-ATTR:instance_name       | instance-000000b5                                        |
| OS-EXT-STS:power_state              | 1                                                        |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | active                                                   |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| config_drive                        |                                                          |
| created                             | 2012-05-29T13:50:45Z                                     |
| flavor                              | m1.medium                                                |
| hostId                              | ec2890ed9e2f998820c4f767b66822c60910a293d0a63723177fff74 |
| id                                  | ce230d11-ddf8-4298-a7d9-40ae8690ff11                     |
| image                               | precise-cephA                                            |
| key_name                            | seb                                                      |
| metadata                            | {}                                                       |
| name                                | medium-rbd                                               |
| progress                            | 0                                                        |
| status                              | ACTIVE                                                   |
| tenant_id                           | d1f5d27ccf594cdbb034c8a4123494e9                         |
| updated                             | 2012-05-29T15:31:27Z                                     |
| user_id                             | 557273155f8243bca38f77dcdca82ff6                         |
| vlan1 network                       |                                            |
server-01:~$ sudo virsh list
Id Name                 State
1 instance-000000af    running
3 instance-000000b5    running

Run the live-migration command in debug mode:

server-01:~$ nova --debug live-migration ce230d11-ddf8-4298-a7d9-40ae8690ff11 server-02
connect: (, 5000)
send: 'POST /v2.0/tokens HTTP/1.1\r\nHost:\r\nContent-Length: 100\r\ncontent-type: 
  application/json\r\naccept-encoding: gzip, deflate\r\naccept: 
  application/json\r\nuser-agent: python-novaclient\r\n\r\n{"auth": 
  {"tenantName": "admin", "passwordCredentials": {"username": "admin", 
  "password": "admin"}}}'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: application/json
header: Vary: X-Auth-Token
header: Date: Tue, 29 May 2012 15:31:39 GMT
header: Transfer-Encoding: chunked
connect: (, 8774)
send: u'GET /v2/d1f5d27ccf594cdbb034c8a4123494e9/servers/ce230d11-ddf8-4298-a7d9-40ae8690ff11 HTTP/1.1\r\nHost:\r\nx-auth-project-id: 
  admin\r\nx-auth-token: 8758eb02f8f24810a6c8f11c7434f0b1\r\naccept-encoding: 
  gzip, deflate\r\naccept: application/json\r\nuser-agent: 
reply: 'HTTP/1.1 200 OK\r\n'
header: X-Compute-Request-Id: req-4043a2da-4ed1-4c2e-a9c5-b73e81bbfe99
header: Content-Type: application/json
header: Content-Length: 1377
header: Date: Tue, 29 May 2012 15:31:39 GMT
send: u'GET /v2/d1f5d27ccf594cdbb034c8a4123494e9/servers/ce230d11-ddf8-4298-a7d9-40ae8690ff11 HTTP/1.1\r\nHost:\r\nx-auth-project-id: 
  admin\r\nx-auth-token: 8758eb02f8f24810a6c8f11c7434f0b1\r\naccept-encoding: 
  gzip, deflate\r\naccept: application/json\r\nuser-agent: python-novaclient\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: X-Compute-Request-Id: req-b86ccd91-a0ea-4c0c-9523-0f0f3a0a3a86
header: Content-Type: application/json
header: Content-Length: 1377
header: Date: Tue, 29 May 2012 15:31:39 GMT
send: u'POST /v2/d1f5d27ccf594cdbb034c8a4123494e9/servers/ce230d11-ddf8-4298-a7d9-40ae8690ff11/action HTTP/1.1\r\nHost:\r\nContent-Length: 
  92\r\nx-auth-project-id: admin\r\naccept-encoding: gzip, deflate\r\naccept: 
  application/json\r\nx-auth-token: 8758eb02f8f24810a6c8f11c7434f0b1\r\nuser-agent: 
  python-novaclient\r\ncontent-type: application/json\r\n\r\n{"os-migrateLive": 
  {"disk_over_commit": false, "block_migration": false, "host": "server-02"}}'
reply: 'HTTP/1.1 202 Accepted\r\n'
header: Content-Type: text/html; charset=UTF-8
header: Content-Length: 0
header: Date: Tue, 29 May 2012 15:31:52 GMT
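
The debug trace above boils down to a single POST of an os-migrateLive action. A minimal sketch of building the same request with the standard library (endpoint, tenant ID, server ID, and token are placeholders, not real values):

```python
import json
import urllib.request

def live_migrate(endpoint, tenant_id, server_id, token, host):
    """Build the os-migrateLive server action request (Compute API v2)."""
    body = json.dumps({"os-migrateLive": {
        "host": host,
        "block_migration": False,   # shared storage (CephFS): no block migration
        "disk_over_commit": False,
    }}).encode()
    # The caller would urlopen() this and expect a 202 Accepted reply.
    return urllib.request.Request(
        "%s/v2/%s/servers/%s/action" % (endpoint, tenant_id, server_id),
        data=body,
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    )
```
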

Sometimes you will see this message in the nova-scheduler logs:

Casted 'live_migration' to compute 'server-01' from (pid=10963) cast_to_compute_host /usr/lib/python2.7/dist-packages/nova/scheduler/driver.py:80 

Either way, the logs should tell you something, so check them:

The libvirt logs could show those errors:

error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
error : virNetClientProgramDispatchError:174 : Unable to read from monitor: Connection reset by peer

The first issue (pm) was solved by installing this package:

 $ sudo apt-get install pm-utils -y

The second one is a little trickier; the only clue I found was to disable the VNC console, according to this thread. Finally, check the log and see:

instance: 962c222f-2280-43e9-83be-c27a31f77946] Migrating instance to server-02 finished successfully. 

Sometimes this message doesn't appear even though the live migration completed successfully; the best check is to wait and watch on the remote server:

 $ watch sudo virsh list
Every 2.0s: sudo virsh list
Id Name                  State
Every 2.0s: sudo virsh list
Id Name                  State
6 instance-000000dc    shut off
Every 2.0s: sudo virsh list
Id Name                  State
6 instance-000000dc    paused
Every 2.0s: sudo virsh list
Id Name                  State
6 instance-000000dc    running

During the live migration, the instance goes through the states shown above in virsh: shut off, then paused, then running.

That's all! The downtime for the m1.tiny instance was approximately 3 seconds.

III.6. Virtual instance disk errors - Solved!

When I use Ceph to store the /var/lib/nova/instances directory of each nova-compute server, I get these I/O errors inside the virtual machines:

Buffer I/O error on device vda1, logical block 593914
Buffer I/O error on device vda1, logical block 593915
Buffer I/O error on device vda1, logical block 593916
EXT4-fs warning (device vda1): ext4_end_bio:251: I/O error writing to inode 31112 (offset 7852032 size 524288 starting block 595925)
JBD2: Detected IO errors while flushing file data on vda1-8

Logs from the kernel during the boot sequence of the instance:

server-01 kernel: [  400.354943]  nbd15: p1
server-01 kernel: [  405.710253] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
server-01 kernel: [  410.400054] block nbd15: NBD_DISCONNECT
server-01 kernel: [  410.400190] block nbd15: Receive control failed (result -32)
server-01 kernel: [  410.400656] block nbd15: queue cleared

This issue appears every time I launch a new instance. Sometimes waiting for the ext4 auto-recovery mechanism temporarily solves the problem, but the filesystem remains unstable. The error is probably related to the ext4 filesystem. It happens very often and I don't have any real clue at the moment; maybe a filesystem option, or switching from ext4 to XFS, will do the trick. So far I have tried several mount options inside the VM, like nobarrier and noatime, but nothing changed. This is what I got when I tried to perform a basic operation like installing a package:

Reading package lists... Error!
E: Unable to synchronize mmap - msync (5: Input/output error)
E: The package lists or status file could not be parsed or opened.

This can be worked around with the following commands, but it's neither useful nor sustainable since the error will occur again and again:

 $ sudo apt-get clean
 $ sudo apt-get update
 $ sudo apt-get install 'your_package'

Filesystem check on each Ceph node:

server6:~$ sudo service ceph stop osd
=== osd.2 ===
Stopping Ceph osd.2 on server6...kill 26140...done
server6:~$ sudo umount /srv/ceph/osd2/
server6:~$ sudo fsck.ext4 -fy /dev/server6/ceph-ext4
e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/server6/ceph-ext4: 4567/1310720 files (58.0% non-contiguous), 3370935/5242880 blocks

ext4 check on the second server:

server4:~$ sudo fsck.ext4 -fy /dev/server4/lvol0
e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/server4/lvol0: 3686/3276800 files (5.2% non-contiguous), 2935930/13107200 blocks

ext4 check on the third server:

server-003:~$ sudo fsck.ext4 -fy /dev/nova-volumes/lvol0
e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? yes
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nova-volumes/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nova-volumes/lvol0: 3435/6553600 files (6.7% non-contiguous), 2783459/26214400 blocks

Nothing relevant, everything is properly working.

This issue is unsolved; it's simply related to the fact that CephFS is not stable enough and can't handle this amount of I/O. Possible workarounds are here and here. I don't even think that using XFS instead of ext4 would change the outcome. It seems that this issue also occurs with RBD volumes; see the Ceph tracker.

According to this reported bug (and the mailing list discussion), this issue affects RBD volumes inside virtual machines. The workaround here is to activate RBD caching; an option should be added inside the XML file while attaching a device:

<source protocol='rbd' name='your-pool/your-volume:rbd_cache=true'>
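
For context, that source line lives inside a full network-disk definition. A hypothetical libvirt disk element (pool, volume, and monitor host names are placeholders) might look like:

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='your-pool/your-volume:rbd_cache=true'>
    <host name='your-mon-host' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
```
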

I haven't checked this workaround yet, but the issue seems to be solved by enabling the cache.


It seems that Ceph has a lot of difficulty with direct I/O support, see below:

 $ mount | grep ceph
 on /mnt type ceph (name=admin,key=client.admin)
 $ dd if=/dev/zero of=/mnt/directio bs=8M count=1 oflag=direct
1+0 records in
1+0 records out
8388608 bytes (8.4 MB) copied, 0.36262 s, 23.1 MB/s
 $ dd if=/dev/zero of=/mnt/directio bs=9M count=1 oflag=direct
dd: writing `/mnt/directio': Bad address
1+0 records in
0+0 records out
0 bytes (0 B) copied, 1.20184 s, 0.0 kB/s

This bug has been tracked on the Ceph tracker.

It seems that Ceph doesn't support direct I/O writes with a block size of 9M or larger. And? Well, if you check the libvirt XML of an instance, you will see this section:

<disk type='file' device='disk'>
   <driver type='qcow2' cache='none'/>
   <source file='/var/lib/nova/instances/instance-000000f9/disk'/>
   <target dev='vda' bus='virtio'/>
</disk>

Setting the cache to none means using direct I/O... Note from the libvirt documentation:

The optional cache attribute controls the cache mechanism, possible values are "default", "none", "writethrough", "writeback", "directsync" (like "writethrough", but it bypasses the host page cache) and "unsafe" (host may cache all disk io, and sync requests from guest are ignored). Since 0.6.0, "directsync" since 0.9.5, "unsafe" since 0.9.7

Cache parameters explained:

Actually, there is already a function that tests whether direct I/O is supported:

    def _supports_direct_io(dirpath):
        testfile = os.path.join(dirpath, ".directio.test")
        hasDirectIO = True
        try:
            f = os.open(testfile, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
            os.close(f)
            LOG.debug(_("Path '%(path)s' supports direct I/O") %
                      {'path': dirpath})
        except OSError, e:
            if e.errno == errno.EINVAL:
                LOG.debug(_("Path '%(path)s' does not support direct I/O: "
                            "'%(ex)s'") % {'path': dirpath, 'ex': str(e)})
                hasDirectIO = False
            else:
                LOG.error(_("Error on '%(path)s' while checking direct I/O: "
                            "'%(ex)s'") % {'path': dirpath, 'ex': str(e)})
                raise e
        except Exception, e:
            LOG.error(_("Error on '%(path)s' while checking direct I/O: "
                        "'%(ex)s'") % {'path': dirpath, 'ex': str(e)})
            raise e
        return hasDirectIO

Somehow it's not detected, mainly because the issue is related to the block size.
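
Nova's probe boils down to an open(2) with O_DIRECT. A standalone sketch of the same check (Linux-only; the function name and test file name are illustrative):

```python
import errno
import os

def supports_direct_io(dirpath):
    """Mirror of nova's check: an O_DIRECT open failing with EINVAL
    means the filesystem does not support direct I/O."""
    testfile = os.path.join(dirpath, ".directio.test")  # illustrative name
    try:
        fd = os.open(testfile, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
        os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EINVAL:
            return False
        raise
    finally:
        # Clean up the probe file whether or not the open succeeded.
        try:
            os.unlink(testfile)
        except OSError:
            pass
```

Note that this probe only catches EINVAL at open time; on CephFS the open succeeds, so the check passes even though larger direct writes (the 9M dd above) fail later.
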

If direct I/O is supported, it will be specified in the file /usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py, on line 1036:

def disk_cachemode(self):
    if self._disk_cachemode is None:
        # We prefer 'none' for consistent performance, host crash
        # safety & migration correctness by avoiding host page cache.
        # Some filesystems (eg GlusterFS via FUSE) don't support
        # O_DIRECT though. For those we fallback to 'writethrough'
        # which gives host crash safety, and is safe for migration
        # provided the filesystem is cache coherant (cluster filesystems
        # typically are, but things like NFS are not).
        self._disk_cachemode = "none"
        if not self._supports_direct_io(FLAGS.instances_path):
            self._disk_cachemode = "writethrough"
    return self._disk_cachemode

The first trick was to modify this line:

self._disk_cachemode = "none"

to:

self._disk_cachemode = "writethrough"

With this change, every instance will have the libvirt cache option set to writethrough, even if the filesystem supports direct I/O.

Fix a corrupted VM:


Reboot the VM :)

Note: writeback is also supported with Ceph; it offers better performance than writethrough, but writethrough remains the safest option for your data. It depends on your needs :)

IV. Benchmarks

These benchmarks have been performed on an ext4 filesystem and 15K RPM hard disk drives.

IV.1. Rados builtin benchmark

IV.1.1. Cluster benchmark
 $ uname -r
 $ ceph -v
ceph version 0.47.2 (commit:f5a9404445e2ed5ec2ee828aa53d73d4a002f7a5)
 $ rados -p nova bench 100 write
Maintaining 16 concurrent writes of 4194304 bytes for at least 100 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        31        15   59.8134        60  0.988616  0.774045
    2      16        46        30   59.8908        60   1.15953  0.835025
    3      16        63        47   62.5881        68  0.914239  0.836658
    4      16        76        60   59.9416        52   1.23871  0.906893
    5      16        94        78   62.3493        72   0.92557  0.912052
    6      16       113        97   64.6216        76   1.14571  0.914297
    7      16       123       107   61.1052        40   1.08826  0.922949
    8      16       138       122   60.9663        60   0.46168  0.969207
    9      16       145       129   57.3044        28    1.0469  0.989164
   10      16       166       150    59.972        84   1.50591   1.02505
   11      16       186       170   61.7913        80   1.06359   0.99008
   12      16       197       181   60.3086        44   1.45907  0.993509
   13      16       212       196   60.2843        60   1.67142   1.01419
   14      16       218       202   57.6929        24   1.57489   1.03316
   15      16       223       207   55.1804        20  0.259759   1.03948
   16      16       239       223   55.7307        64   1.81071   1.10588
   17      16       253       237   55.7461        56   1.17068   1.10739
   18      16       267       251   55.7598        56   1.15406   1.10697
   19      16       280       264   55.5616        52   1.26379   1.10818
min lat: 0.124888 max lat: 2.50869 avg lat: 1.11042
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20      16       293       277    55.383        52   1.19662   1.11042
   21      16       304       288   54.8409        44   1.21306   1.11133
   22      16       324       308   55.9839        80  0.633551   1.11404
   23      16       337       321   55.8104        52  0.155063   1.10398
   24      16       350       334   55.6514        52   1.54921    1.1165
   25      16       364       348   55.6651        56   1.26814   1.12392
   26      16       367       351   53.9858        12   1.89539   1.13046
   27      16       384       368   54.5045        68   1.13766   1.15098
   28      16       398       382   54.5576        56   1.46389   1.14698
   29      16       415       399   55.0208        68   1.03303   1.14274
   30      16       431       415   55.3198        64   1.24156   1.14126
   31      16       440       424   54.6965        36   1.19121   1.14321
   32      16       457       441   55.1119        68   1.23561   1.14136
   33      16       469       453   54.8963        48   1.21978   1.14207
   34      16       486       470   55.2814        68    1.2799   1.13989
   35      16       499       483   55.1874        52  0.233549      1.14
   36      16       504       488     54.21        20   1.61804   1.14024
   37      16       513       497   53.7178        36   2.10228   1.16011
   38      16       527       511   53.7776        56   1.37356   1.17257
   39      16       541       525   53.8344        56   1.40289   1.17057
min lat: 0.124888 max lat: 2.5194 avg lat: 1.17259
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   40      16       553       537   53.6883        48   1.24732   1.17259
   41      16       565       549   53.5494        48    1.5267   1.17512
   42      16       578       562   53.5124        52   1.68045   1.17721
   43      16       594       578   53.7561        64  0.279511    1.1751
   44      16       608       592   53.8069        56   1.23636    1.1711
   45      16       619       603   53.5888        44   1.56834   1.17327
   46      16       633       617   53.6411        56   1.24921    1.1744
   47      16       644       628   53.4359        44  0.228269   1.17318
   48      16       654       638   53.1558        40   1.85967   1.18184
   49      16       667       651   53.1321        52   1.11298   1.18894
   50      16       679       663   53.0293        48   1.24697   1.19045
   51      16       691       675   52.9306        48   1.41656   1.19212
   52      16       704       688   52.9125        52   1.24629   1.19305
   53      16       719       703   53.0461        60   1.23783    1.1931
   54      16       740       724   53.6191        84  0.825043   1.18465
   55      16       750       734   53.3714        40   1.12641   1.18158
   56      16       766       750    53.561        64      1.58   1.18356
   57      16       778       762   53.4634        48   1.33114    1.1805
   58      16       779       763   52.6106         4   1.74222   1.18124
   59      16       796       780   52.8713        68   2.13181   1.20095
min lat: 0.124888 max lat: 2.68683 avg lat: 1.20162
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   60      16       805       789     52.59        36   1.36423   1.20162
   61      16       817       801   52.5147        48   1.38829   1.20521
   62      16       830       814   52.5063        52   1.26657   1.20691
   63      16       845       829   52.6251        60   1.17306   1.20415
   64      16       853       837   52.3028        32   1.73082   1.20619
   65      16       864       848    52.175        44   1.99292   1.21222
   66      16       880       864    52.354        64   1.09513   1.21345
   67      16       892       876   52.2889        48   1.17609   1.21056
   68      16       908       892    52.461        64   1.21753   1.21081
   69      16       921       905   52.4542        52   1.07357   1.20978
   70      16       936       920   52.5619        60  0.160182   1.20659
   71      16       952       936   52.7229        64  0.251266    1.2015
   72      16       965       949   52.7128        52   1.48819   1.20271
   73      16       986       970   53.1412        84  0.940281   1.19764
   74      16       994       978   52.8554        32  0.873665   1.19506
   75      16      1000       984   52.4707        24   2.18796   1.20107
   76      16      1012       996   52.4117        48   2.58551   1.21175
   77      16      1029      1013    52.614        68   1.12385   1.20813
   78      16      1042      1026    52.606        52   1.22075   1.20693
   79      16      1056      1040   52.6489        56  0.285843   1.20635
min lat: 0.120974 max lat: 2.68683 avg lat: 1.20498
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   80      16      1067      1051   52.5407        44  0.182956   1.20498
   81      16      1076      1060   52.3365        36   1.74162   1.20995
   82      16      1090      1074   52.3811        56   1.18474   1.21345
   83      16      1103      1087   52.3764        52   1.45589   1.21301
   84      16      1119      1103   52.5146        64   1.20541   1.20995
   85      16      1134      1118   52.6026        60      1.27   1.20745
   86      16      1145      1129   52.5025        44  0.173344    1.2067
   87      16      1162      1146   52.6805        68   1.56221   1.20783
   88      16      1174      1158   52.6273        48   0.12839   1.20479
   89      16      1189      1173     52.71        60   1.27274   1.20651
   90      16      1201      1185   52.6576        48   1.11873   1.20648
   91      16      1211      1195   52.5185        40   1.32622   1.20716
   92      16      1224      1208   52.5128        52   1.49926   1.21086
   93      16      1234      1218   52.3782        40  0.163716   1.21123
   94      16      1251      1235   52.5443        68   1.32683    1.2104
   95      16      1264      1248   52.5385        52   1.01523   1.21017
   96      16      1279      1263   52.6161        60   1.31704   1.20815
   97      16      1294      1278   52.6921        60   1.45825   1.20717
   98      16      1314      1298   52.9707        80  0.281634    1.2014
   99      16      1325      1309     52.88        44   1.45331   1.20097
min lat: 0.120974 max lat: 2.68683 avg lat: 1.20099
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  100      16      1340      1324   52.9511        60   1.43721   1.20099
  101       2      1341      1339   53.0208        60   1.66956   1.20448
Total time run:        101.114344
Total writes made:     1341
Write size:            4194304
Bandwidth (MB/sec):    53.049
Average Latency:       1.20432
Max latency:           2.68683
Min latency:           0.120974
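
The summary numbers are internally consistent; a quick cross-check of the reported bandwidth from the totals:

```python
# Cross-check the rados bench summary: bandwidth should equal
# total bytes written divided by runtime, expressed in MiB/s.
total_writes = 1341
write_size = 4194304          # bytes per object (4 MiB)
runtime = 101.114344          # seconds

bandwidth = total_writes * write_size / runtime / (1024 * 1024)
print(round(bandwidth, 3))    # matches the reported 53.049 MB/sec
```
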

IV.1.2. OSD Benchmarks

From a console run:

 $ for i in 0 1 2; do ceph osd tell $i bench; done

Monitor the output from another terminal:

 $ ceph -w
osd.0 495 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 4.575725 sec at 223 MB/sec
osd.1 877 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 22.559266 sec at 46480 KB/sec
osd.2 1274 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 20.011638 sec at 52398 KB/sec

As you can see, I get pretty bad performance from 2 of the OSDs. Both of them drag down the performance of the whole cluster (this statement is verified below).
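
Converting each OSD bench line (1024 MB written) to MB/s makes the imbalance explicit, using the times from the log above:

```python
# Per-OSD write times for the same 1024 MB bench workload.
times = {"osd.0": 4.575725, "osd.1": 22.559266, "osd.2": 20.011638}
rates = {osd: round(1024 / t, 1) for osd, t in times.items()}
print(rates)  # one fast OSD (~224 MB/s), two slow ones (~45-51 MB/s)
```
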

IV.2. Servers benchmarks

IV.2.1. server-03
server-03:~$ for ((i=0 ; 10 -$i ; i++)) ; do dd if=/dev/zero of=pouet bs=1000M count=1; rm pouet; done
1048576000 bytes (1.0 GB) copied, 2.23271 s, 470 MB/s
1048576000 bytes (1.0 GB) copied, 2.12575 s, 493 MB/s
1048576000 bytes (1.0 GB) copied, 2.12901 s, 493 MB/s
1048576000 bytes (1.0 GB) copied, 2.13956 s, 490 MB/s
1048576000 bytes (1.0 GB) copied, 2.14999 s, 488 MB/s
1048576000 bytes (1.0 GB) copied, 2.12281 s, 494 MB/s
1048576000 bytes (1.0 GB) copied, 2.12963 s, 492 MB/s
1048576000 bytes (1.0 GB) copied, 2.13597 s, 491 MB/s
1048576000 bytes (1.0 GB) copied, 2.14659 s, 488 MB/s
1048576000 bytes (1.0 GB) copied, 2.15181 s, 487 MB/s

Average: 488.6 MB/s
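
A quick check that this average matches the ten runs above:

```python
# Per-run dd throughputs for server-03, in MB/s, taken from the output above.
runs = [470, 493, 493, 490, 488, 494, 492, 491, 488, 487]
print(sum(runs) / len(runs))  # 488.6
```
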

IV.2.2. server-04
server-04:~$ for ((i=0 ; 10 -$i ; i++)) ; do dd if=/dev/zero of=pouet bs=1000M count=1; rm pouet; done
1048576000 bytes (1.0 GB) copied, 4.676 s, 224 MB/s
1048576000 bytes (1.0 GB) copied, 4.62314 s, 227 MB/s
1048576000 bytes (1.0 GB) copied, 4.93966 s, 212 MB/s
1048576000 bytes (1.0 GB) copied, 10.5936 s, 99.0 MB/s
1048576000 bytes (1.0 GB) copied, 4.94419 s, 212 MB/s
1048576000 bytes (1.0 GB) copied, 4.70893 s, 223 MB/s
1048576000 bytes (1.0 GB) copied, 8.94163 s, 117 MB/s
1048576000 bytes (1.0 GB) copied, 4.79279 s, 219 MB/s
1048576000 bytes (1.0 GB) copied, 8.39481 s, 125 MB/s
1048576000 bytes (1.0 GB) copied, 8.97216 s, 117 MB/s

Average: 154.8 MB/s

IV.2.3. server-06
server-06:~$ for ((i=0 ; 10 -$i ; i++)) ; do dd if=/dev/zero of=pouet bs=1000M count=1; rm pouet; done
1048576000 bytes (1.0 GB) copied, 2.35758 s, 445 MB/s
1048576000 bytes (1.0 GB) copied, 2.37689 s, 441 MB/s
1048576000 bytes (1.0 GB) copied, 4.94374 s, 212 MB/s
1048576000 bytes (1.0 GB) copied, 2.55669 s, 410 MB/s
1048576000 bytes (1.0 GB) copied, 6.08993 s, 172 MB/s
1048576000 bytes (1.0 GB) copied, 2.2573 s, 465 MB/s
1048576000 bytes (1.0 GB) copied, 2.29013 s, 458 MB/s
1048576000 bytes (1.0 GB) copied, 5.67836 s, 185 MB/s
1048576000 bytes (1.0 GB) copied, 2.39934 s, 437 MB/s
1048576000 bytes (1.0 GB) copied, 5.87929 s, 178 MB/s

Average: 340.3 MB/s

IV.3. Bandwidth benchmarks

Quick bandwidth test between 2 servers:

server-03:~$ time dd if=/dev/zero of=test bs=2000M count=1; time scp test root@server-04:/dev/null;
2097152000 bytes (2.1 GB) copied, 4.46267 s, 470 MB/s
root@server-04's password:
test                                                         100% 2000MB  52.6MB/s   00:47
real  0m49.298s
user  0m43.915s
sys   0m5.172s

It's not really surprising, since Ceph showed an average of 53 MB/s. I clearly have a network bottleneck, because all my servers are connected with gigabit links. I also tested a copy from the root partition to the Ceph-mounted directory to see how long it takes to write data into Ceph:
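
Back-of-the-envelope numbers for that bottleneck (the ~94% usable-throughput factor is a rough assumption for TCP/IP framing overhead, not a measured value):

```python
# Rough gigabit sanity check against the observed 52.6 MB/s scp rate.
line_rate_mb = 1000 / 8          # 125 MB/s raw line rate for 1 Gbit/s
usable = line_rate_mb * 0.94     # assumed practical ceiling after overhead
transfer_s = 2000 / 52.6         # seconds to move 2000 MB at the scp rate
print(round(usable), round(transfer_s))
```

The observed scp throughput sits well below even the practical gigabit ceiling, so scp's own encryption overhead is likely a factor on top of the network.
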

 $ time dd if=/dev/zero of=pouet bs=2000M count=1; time sudo cp pouet /var/lib/nova/instances/;
1+0 records in
1+0 records out
2097152000 bytes (2.1 GB) copied, 4.27012 s, 491 MB/s
real  0m4.465s
user  0m0.000s
sys   0m4.456s
real  0m5.778s
user  0m0.000s
sys   0m3.580s

Monitor from ceph:

16:24:01.943710    pg v11430: 592 pgs: 592 active+clean; 30471 MB data, 71127 MB used, 271 GB / 359 GB avail
16:24:04.129263    pg v11431: 592 pgs: 592 active+clean; 30591 MB data, 71359 MB used, 271 GB / 359 GB avail
16:24:06.187816    pg v11432: 592 pgs: 592 active+clean; 30691 MB data, 71632 MB used, 271 GB / 359 GB avail
16:24:07.345031    pg v11433: 592 pgs: 592 active+clean; 30815 MB data, 71932 MB used, 270 GB / 359 GB avail
16:24:08.283969    pg v11434: 592 pgs: 592 active+clean; 30967 MB data, 72649 MB used, 270 GB / 359 GB avail
16:24:11.458523    pg v11435: 592 pgs: 592 active+clean; 31079 MB data, 72855 MB used, 270 GB / 359 GB avail
16:24:12.543626    pg v11436: 592 pgs: 592 active+clean; 31147 MB data, 73007 MB used, 269 GB / 359 GB avail
16:24:15.447718    pg v11437: 592 pgs: 592 active+clean; 31195 MB data, 73208 MB used, 269 GB / 359 GB avail
16:24:18.258197    pg v11438: 592 pgs: 592 active+clean; 31319 MB data, 73260 MB used, 269 GB / 359 GB avail
16:24:23.187243    pg v11439: 592 pgs: 592 active+clean; 31467 MB data, 73488 MB used, 269 GB / 359 GB avail
16:24:24.680864    pg v11440: 592 pgs: 592 active+clean; 31574 MB data, 73792 MB used, 269 GB / 359 GB avail
16:24:25.299714    pg v11441: 592 pgs: 592 active+clean; 31622 MB data, 74013 MB used, 268 GB / 359 GB avail
16:24:27.015503    pg v11442: 592 pgs: 592 active+clean; 31626 MB data, 74101 MB used, 268 GB / 359 GB avail
16:24:28.554417    pg v11443: 592 pgs: 592 active+clean; 31810 MB data, 74237 MB used, 268 GB / 359 GB avail
16:24:32.029909    pg v11444: 592 pgs: 592 active+clean; 31827 MB data, 74333 MB used, 268 GB / 359 GB avail
16:24:32.814380    pg v11445: 592 pgs: 592 active+clean; 32231 MB data, 74586 MB used, 268 GB / 359 GB avail
16:24:33.803356    pg v11446: 592 pgs: 592 active+clean; 32291 MB data, 74900 MB used, 268 GB / 359 GB avail
16:24:36.476405    pg v11447: 592 pgs: 592 active+clean; 32291 MB data, 74938 MB used, 267 GB / 359 GB avail
16:24:37.674590    pg v11448: 592 pgs: 592 active+clean; 32292 MB data, 75054 MB used, 267 GB / 359 GB avail
16:24:38.711816    pg v11449: 592 pgs: 592 active+clean; 32292 MB data, 75108 MB used, 267 GB / 359 GB avail

The information reported by the -w option is asynchronous and not really significant. For instance, we can't conclude that storing 2 GB in the Ceph DFS took 37 seconds.

IV.4. Instance benchmarks

Flavor details:

ubuntu@instance-over-rbd:~$ for ((i=0 ; 10 -$i ; i++)) ; do dd if=/dev/zero of=pouet bs=1000M count=1; rm pouet; done
1048576000 bytes (1.0 GB) copied, 23.1742 s, 45.2 MB/s
1048576000 bytes (1.0 GB) copied, 33.765 s, 31.1 MB/s
1048576000 bytes (1.0 GB) copied, 39.409 s, 26.6 MB/s
1048576000 bytes (1.0 GB) copied, 22.8567 s, 45.9 MB/s
1048576000 bytes (1.0 GB) copied, 37.5275 s, 27.9 MB/s
1048576000 bytes (1.0 GB) copied, 18.422 s, 56.9 MB/s
1048576000 bytes (1.0 GB) copied, 20.1792 s, 52.0 MB/s
1048576000 bytes (1.0 GB) copied, 19.4536 s, 53.9 MB/s
1048576000 bytes (1.0 GB) copied, 15.5978 s, 67.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.7292 s, 66.7 MB/s

Average: 47.34 MB/s

Benchmark your filesystem in order to detect I/O errors (ext4-oriented) with the I/O stress tool below:
/*
 * Copyright (C) 2010 Canonical
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 *
 * Author Colin Ian King,  colin.king@canonical.com
 */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include "fiemap.h"
#define FS_IOC_FIEMAP                      _IOWR('f', 11, struct fiemap)
void syntax(char **argv)
  fprintf(stderr, "%s [filename]...\n",argv[0]);
struct fiemap *read_fiemap(int fd)
  struct fiemap *fiemap;
  int extents_size;
  if ((fiemap = (struct fiemap*)malloc(sizeof(struct fiemap))) == NULL) {
      fprintf(stderr, "Out of memory allocating fiemap\n");    
      return NULL;
  memset(fiemap, 0, sizeof(struct fiemap));
  fiemap->fm_start = 0;
  fiemap->fm_length = 2*1024*1024;        /* Lazy */
  fiemap->fm_flags = 0;
  fiemap->fm_extent_count = 0;
  fiemap->fm_mapped_extents = 0;
  /* Find out how many extents there are */
  if (ioctl(fd, FS_IOC_FIEMAP, fiemap) < 0) {
      fprintf(stderr, "fiemap ioctl() failed\n");
      return NULL;
  /* Read in the extents */
  extents_size = sizeof(struct fiemap_extent) *
  /* Resize fiemap to allow us to read in the extents */
  if ((fiemap = (struct fiemap*)realloc(fiemap,sizeof(struct fiemap) +
                                         extents_size)) == NULL) {
      fprintf(stderr, "Out of memory allocating fiemap\n");    
      return NULL;
  memset(fiemap->fm_extents, 0, extents_size);
  fiemap->fm_extent_count = fiemap->fm_mapped_extents;
  fiemap->fm_mapped_extents = 0;
  if (ioctl(fd, FS_IOC_FIEMAP, fiemap) < 0) {
      fprintf(stderr, "fiemap ioctl() failed\n");
      return NULL;
  return fiemap;
void dump_fiemap(struct fiemap *fiemap, char *filename)
  int i;
  printf("File %s has %d extents:\n",filename, fiemap->fm_mapped_extents);
  printf("#\tLogical          Physical         Length           Flags\n");
  for (i=0;i<fiemap->fm_mapped_extents;i++) {
      printf("%d:\t%-16.16llx %-16.16llx %-16.16llx %-4.4x\n",
int main(int argc, char **argv)
  int i;
  if (argc < 2) {
  for (i=1;i<argc;i++) {
      int fd;
      if ((fd = open(argv[i], O_RDONLY)) < 0) {
          fprintf(stderr, "Cannot open file %s\n", argv[i]);
      else {
          struct fiemap *fiemap;
          if ((fiemap = read_fiemap(fd)) != NULL)
              dump_fiemap(fiemap, argv[i]);

Final results

Operations tested with OpenStack + Ceph:
Create RBD volume
Delete RBD volume
Snapshot RBD volume
Attaching RBD volume
Glance images storage backend (import)
Snapshot running instance to RBD
Booting from RBD
Booting from a snapshotted image
Boot VMs from shared /var/lib/nova/instances
Live migration with CephFS



I hope I will be able to go further and use Ceph in production. Ceph seems fairly stable at the moment for RBD and RADOS, but CephFS does not yet seem able to handle heavy I/O traffic. Also keep in mind that a company called Inktank offers commercial support for Ceph; I don't think that's a coincidence. Ceph has a bright future ahead. The recovery procedure is excellent, and of course there are plenty of components I would love to play with, such as fine crushmap tuning. This article may be updated at any time as I take my research further :).

This article wouldn't have been possible without the tremendous help of Josh Durgin from Inktank, many many thanks to him :)

NFS Over RBD (Source Origin)


Since CephFS is not the most mature component of Ceph, you probably won't want to use it on a production platform. In this article, I offer a possible solution for exposing RBD through a shared filesystem.

I. Architecture

My choice fell on NFS for a couple of reasons:

Overview of the infrastructure. For my own setup, I needed to map and export several pools. For example, you could have one pool for customer data and one pool for storing your VMs (/var/lib/nova/instances). It's up to you.

II. Prerequisites

Install the Ceph client packages and the NFS server; this needs to be done on every node:

 $ sudo apt-get install ceph-common nfs-server -y
 $ echo "manual" | sudo tee /etc/init/nfs-kernel-server.override

Nothing more, no modprobe rbd, nothing. Pacemaker will manage that for us :)

Create your RBD volumes:

 $ rbd create share1 --size 2048
 $ rbd create share2 --size 2048

You will need to map it somewhere in order to put a filesystem on it:

 $ sudo modprobe rbd
 $ echo ",, name=admin,secret=AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ== rbd share1" | sudo tee /sys/bus/rbd/add
 $ sudo mkfs.xfs /dev/rbd0
 $ rbd unmap /dev/rbd0

And so on for share2.
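The prepare step can also be scripted for any number of shares. A sketch, assuming the rbd pool and the admin user from above; newer ceph-common ships an rbd map subcommand equivalent to the sysfs echo. It defaults to printing the commands (DRYRUN=1); set DRYRUN=0 to actually run them:

```shell
# Map, format and unmap each share in turn.
# DRYRUN=1 (the default here) prints the commands instead of executing them.
run() { if [ "${DRYRUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

prepare_shares() {
    for share in "$@"; do
        run rbd map "$share" --pool rbd --id admin
        run mkfs.xfs "/dev/rbd/rbd/$share"
        run rbd unmap "/dev/rbd/rbd/$share"
    done
}

prepare_shares share1 share2
```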

In order to manage our RBD devices we are going to use the resource agent (RA) written for Ceph by Florian Haas, which maps RBD devices. You can have a look at it in the Ceph GitHub repository. Integrate the RA into Pacemaker:

 $ sudo mkdir /usr/lib/ocf/resource.d/ceph
 $ cd /usr/lib/ocf/resource.d/ceph/
 $ wget https://raw.github.com/ceph/ceph/master/src/ocf/rbd.in
 $ chmod +x rbd.in

A minor change to the resource agent, according to the official OCF documentation:

@@ -144,7 +144,7 @@ find_rbd_dev() {
 rbd_validate_all() {	
     # Test for configuration errors first
     if [ -z "$OCF_RESKEY_name" ]; then  	
-       ocf_log err 'Required parameter "name" is unset!'	
+       ocf_log err "Required parameter "name" is unset!"	
        exit $OCF_ERR_CONFIGURED

The pull request is waiting here.

III. Setup

III.1. Common

This initial setup only contains 2 nodes, so you need to configure Pacemaker accordingly:

 $ sudo crm configure property stonith-enabled=false
 $ sudo crm configure property no-quorum-policy=ignore

Of course, if you plan to expand your active/active setup with a third node, you must unset the no-quorum-policy.

III.2. Primitives

In order to make things really clear I will set up the primitives from the bottom layer to the top, like so:

  1. Map the RBD device
  2. Mount it!
  3. Export it!
  4. Reach it with the virtual IP address
  5. Setup the NFS server

Note: for clarity and ease of comprehension, I always name:

All the operations need to be performed within the crm shell; alternatively, prefix every command below with sudo crm configure. You can also run sudo crm configure edit and copy/paste.

First, map RBD:

primitive p_rbd_map_1 ocf:ceph:rbd.in \
        params  user="admin"  pool="rbd"  name="share1"  cephconf="/etc/ceph/ceph.conf" \
        op monitor  interval="10s"  timeout="20s"
primitive p_rbd_map_2 ocf:ceph:rbd.in \
        params  user="admin"  pool="rbd"  name="share2"  cephconf="/etc/ceph/ceph.conf" \
        op monitor  interval="10s"  timeout="20s"

Second, filesystem:

primitive p_fs_rbd_1 ocf:heartbeat:Filesystem \
        params  directory="/mnt/share1"  fstype="xfs"  device="/dev/rbd/rbd/share1"  fast_stop="no" \
        op monitor  interval="20s"  timeout="40s" \
        op start  interval="0"  timeout="60s" \
        op stop  interval="0"  timeout="60s"
primitive p_fs_rbd_2 ocf:heartbeat:Filesystem \
        params  directory="/mnt/share2"  fstype="xfs"  device="/dev/rbd/rbd/share2"  fast_stop="no" \
        op monitor  interval="20s"  timeout="40s" \
        op start  interval="0"  timeout="60s" \
        op stop  interval="0"  timeout="60s"

Third, export directories:

primitive p_export_rbd_1 ocf:heartbeat:exportfs \
  params  directory="/mnt/share1"  clientspec=""  options="rw,async,no_subtree_check,no_root_squash"  fsid="1" \
  op monitor  interval="10s"  timeout="20s" \
  op start  interval="0"  timeout="40s"
primitive p_export_rbd_2 ocf:heartbeat:exportfs \
  params  directory="/mnt/share2"  clientspec=""  options="rw,async,no_subtree_check,no_root_squash"  fsid="2" \
  op monitor  interval="10s"  timeout="20s" \
  op start  interval="0"  timeout="40s"

Fourth, virtual IP addresses:

primitive p_vip_1 ocf:heartbeat:IPaddr \
        params  ip=""  cidr_netmask="24" \
        op monitor  interval="5"
primitive p_vip_2 ocf:heartbeat:IPaddr \
        params  ip=""  cidr_netmask="24" \
        op monitor  interval="5"

Fifth, NFS server:

primitive p_nfs_server lsb:nfs-kernel-server \
  op monitor  interval="10s"  timeout="30s"
primitive p_portmap lsb:portmap \
  op monitor  interval="10s"  timeout="30s"
primitive p_statd lsb:statd \
        op monitor  interval="10s"  timeout="30s"

III.3. Resources group and clone

Groups contain a set of resources that need to be located together, started sequentially and stopped in the reverse order. You first need to create a resource group for each NFS share, plus one for the NFS server dependencies:

group g_rbd_share_1 p_rbd_map_1 p_fs_rbd_1 p_export_rbd_1 p_vip_1
group g_rbd_share_2 p_rbd_map_2 p_fs_rbd_2 p_export_rbd_2 p_vip_2
group g_nfs p_portmap p_statd p_nfs_server

Clones are resources that can be active on multiple hosts. We have to clone the NFS server so that it acts as active/active: the NFS daemon will be running on both nodes.

clone clo_nfs g_nfs \
  meta globally-unique="false" target-role="Started"

III.4. Location rules

In this setup, each export must always run on a specific server. The resource will remain in its current location unless forced off because the node is no longer eligible to run it. These 2 constraints define a score for the relationship between a resource and a node: positive values indicate the resource should prefer that node, and setting the score to INFINITY forces the resource to run on that node whenever it is available.

location l_g_rbd_share_1 g_rbd_share_1 inf: nfs1
location l_g_rbd_share_2 g_rbd_share_2 inf: nfs2

At the end, you should see something like this:

 $ sudo crm_mon -1
Last updated: Mon Jul  2 07:19:40 2012
Last change: Mon Jul  2 04:07:15 2012 via crm_attribute on nfs1
Stack: openais
Current DC: nfs2 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
14 Resources configured.
Online: [ nfs1 nfs2 ]
 Resource Group: g_rbd_share_1
     p_rbd_map_1        (ocf::heartbeat:rbd.in):        Started nfs1
     p_fs_rbd_1 (ocf::heartbeat:Filesystem):    Started nfs1
     p_export_rbd_1     (ocf::heartbeat:exportfs):      Started nfs1
     p_vip_1    (ocf::heartbeat:IPaddr):        Started nfs1
 Resource Group: g_rbd_share_2
     p_rbd_map_2        (ocf::heartbeat:rbd.in):        Started nfs2
     p_fs_rbd_2 (ocf::heartbeat:Filesystem):    Started nfs2
     p_export_rbd_2     (ocf::heartbeat:exportfs):      Started nfs2
     p_vip_2    (ocf::heartbeat:IPaddr):        Started nfs2
 Clone Set: clo_nfs [g_nfs]
     Started: [ nfs1 nfs2 ]

Conclusion: here we have a scalable architecture; we can add as many NFS servers (clones) as we need, expanding the active/active mode. That was only one use case. You don't necessarily need active/active mode: active/passive should be enough if you only need to map one RBD volume.
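For example, adding a third export on a hypothetical nfs3 node only requires mirroring the share1/share2 primitives and adding one more group and location rule (all names below are illustrative):

```
primitive p_rbd_map_3 ocf:ceph:rbd.in \
        params  user="admin"  pool="rbd"  name="share3"  cephconf="/etc/ceph/ceph.conf" \
        op monitor  interval="10s"  timeout="20s"
# ...define p_fs_rbd_3, p_export_rbd_3 and p_vip_3 as for share1...
group g_rbd_share_3 p_rbd_map_3 p_fs_rbd_3 p_export_rbd_3 p_vip_3
location l_g_rbd_share_3 g_rbd_share_3 inf: nfs3
```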