An Integrated Virtual Storage Cluster

As virtualization technology matures, we tend to build several templates tailored to the software environments we need. When a virtual machine is required, we pick a suitable root filesystem from the templates, copy (cp) it, configure it, and it is ready to go online. To keep VM templates easy to manage and transfer, their size is usually kept small. A Virtual Storage Cluster that can supply adequate storage space on demand therefore becomes an indispensable part of applying and developing virtualization technology.

Looking at the allocation of VM storage from the angle of Live Migration, storage should be attached externally whenever possible; the larger the storage space, the stronger the case for external storage. Before a migration we unmount the storage space, and after the migration completes we mount it back. This usage pattern improves the reliability and practicality of Live Migration.

For caveats on storage virtualization, see Storage Virtualization Essentials. Notice that the Redundancy, Data mirroring, and Snapshots topics mentioned in the articles are built-in features of the Ceph filesystem, and NAS and SAN technologies are implemented in the Ceph filesystem, too. Storage is also an important topic of the OpenStack Project:

  1. OpenStack Storage
  2. OpenStack Object Storage: Swift
  3. OpenStack Storage Server Part 1, Part 2
  4. Understanding Swift and Ceph

References
  1. Storage Virtualization
  2. Storage Virtualization Survey
  3. Need for Storage Virtualization
  4. Storage virtualization tames the data beast
  5. Virtualizing Storage Performance
  6. Storage Virtualization - Simplify Data Storage

About OS

We use a Debian Wheezy and Sid (combined) Linux system. I am sure Debian Squeeze (to be phased out pretty soon) has no built-in rbd.ko kernel module, while Wheezy ships one:

hsu@Amath-Client00:~$ find /lib/modules -name "rbd*"
/lib/modules/3.2.0-4-amd64/kernel/drivers/block/rbd.ko

Storage Types according to OpenStack

  1. Object Storage (REST API)
  2. Block Storage (SAN)
  3. File Storage (NAS)

Pros and Cons of Storage Types

Ceph Filesystem

  1. Ceph Introduction

    1. Ceph: Open Source Storage
    2. Ceph: Scalable distributed file system
    3. The RADOS Object Store and Ceph Filesystem and Part 2
    4. Basic Ceph Storage & KVM Virtualisation Tutorial
    5. Setting up Ceph cluster and exporting RBD
    6. Introducing Ceph to OpenStack
    7. Ceph and RBD benchmarks
    8. iSCSI From Ceph wiki
    9. iSCSI Multipath with RBD
  2. Designing a Ceph Cluster

    Perhaps each Ceph-OSD should be given its own spindle (its own set of read/write heads), and be placed on a private network segment to reduce network interference? Perhaps we should use USB disks for portability reasons? However, USB 3 throughput is only half that of a physical hard disk, or less. Performance benchmark: perhaps access speed should be measured in IOPS (input/output operations per second) rather than in raw read/write throughput, as in the sketch below.
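
    As a rough sketch of such an IOPS measurement (assuming fio is installed and /dev/rbd0 is a
    mapped test rbd device; both the tool and the device name are assumptions, not part of our
    original setup):

      # Random 4 KB reads for 60 seconds; fio reports IOPS directly.
      $ sudo fio --name=rbd-randread --filename=/dev/rbd0 --direct=1 \
                 --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
                 --runtime=60 --time_based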

  3. Creation of a Ceph Cluster

    Following steps 1 through 8, we can create our own Ceph Cluster in no time.

    1. Create Ceph-Temp.img and install these packages in it:

      All the VMs in the Ceph Storage Cluster and the clients of this storage cluster can share this common template. Hence we must test it thoroughly.
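
      The package list itself is not reproduced here. As a rough sketch for a Wheezy/Sid
      template (the exact set is an assumption), something like the following covers the
      Ceph daemons, the FUSE client, btrfs support, and ssh access:

      $ sudo apt-get install ceph ceph-common ceph-fuse btrfs-tools openssh-server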

    2. For each node in the Ceph Storage Cluster, prepare a storage space for it to carry with.

      This storage space may be empty space created with the dd command, a disk partition, say /dev/sdb3, or a partition on a USB disk. But we need to format it as a btrfs block device via a command similar to "$ sudo mkfs.btrfs osdBlk.ada".

      $ dd if=/dev/zero of=blank.img bs=1024k count=10000
      
    3. When booting each such node of the Ceph Storage Cluster with the kvm command, give it the extra argument "-hdb ../Vdisks/osdBlk.ada".

      Inside the node, the storage space then shows up as the /dev/sdb device.
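
      A minimal sketch of such a boot command (the image name Ceph00.img, the memory size,
      the MAC address, and the tap interface name are all assumptions):

      $ sudo kvm -m 1024 -hda Ceph00.img -hdb ../Vdisks/osdBlk.ada \
                 -net nic,macaddr=52:54:00:12:34:56 -net tap,ifname=tap0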

    4. On the physical host, install btrfs-tools (via synaptic) and make a btrfs filesystem on each newly prepared storage space.
    5. Prepare the needed Data or Configuration Files:
      1. ceph.conf

        Syntax and Usage Reference: Configuring Ceph   Configuring Ceph (Is this the right one?)  
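
        Our ceph.conf is not reproduced here. As a rough sketch in the mkcephfs-era layout
        (the hostnames and data paths are assumptions; the monitor addresses follow the
        192.168.0.6-8 nodes used later in the rbd import example):

        [global]
                auth supported = cephx
        [mon]
                mon data = /data/mon.$id
        [mon.a]
                host = Ceph00
                mon addr = 192.168.0.6:6789
        [mon.b]
                host = Ceph01
                mon addr = 192.168.0.7:6789
        [mon.c]
                host = Ceph02
                mon addr = 192.168.0.8:6789
        [mds.a]
                host = Ceph00
        [osd]
                osd data = /data/osd.$id
                osd journal = /data/osd.$id/journal
                osd journal size = 1000
                btrfs devs = /dev/sdb
        [osd.0]
                host = Ceph00
        [osd.1]
                host = Ceph01
        [osd.2]
                host = Ceph02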

      2. ceph.hosts

        Nodes in the Ceph Storage Cluster talk to each other via ssh, using the /etc/hosts file for IP resolution, so each node needs these entries in its /etc/hosts file. Also, if the physical hosts running these Ceph Storage Cluster nodes have the same entries in their /etc/hosts files, logging in to the nodes remotely is much more convenient. In particular, since the Ceph Storage Cluster seems very stable, I almost always run these nodes in the background.
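
        With the hostnames used in the ceph.conf sketch above (which are assumptions), the
        shared entries would look roughly like:

        192.168.0.6    Ceph00
        192.168.0.7    Ceph01
        192.168.0.8    Ceph02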

    6. Boot Ceph-Temp and adjust everything in this template.

      We then duplicate its image for all nodes in our Ceph Storage Cluster. Notice that we must set PermitRootLogin yes in the /etc/ssh/sshd_config file, since these nodes frequently talk to each other as root. Also, we need to generate an ssh key for root and copy it to all nodes so that authentication failures won't happen (a sketch follows at the end of this step). Using the next script, we may configure these storage nodes:

      $ ../bin/Config-Kvm-Storage
      ../bin/Config-Kvm-Storage OS.img hostname VM-IP Ether-card TAP-No btrfs-blk-dev
      

      Remember, when the nodes in our Ceph Storage Cluster are ready, to boot each of them for the first time in the foreground and execute the script $ sudo ./recover70rules. Otherwise we always get the wrong virtual ethercard, say eth1 or eth2, and there is no usable network!
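
      A rough sketch of the root ssh-key setup mentioned above (the Ceph00/Ceph01/Ceph02
      hostnames are assumptions):

      root@Ceph00:~# ssh-keygen -t rsa          # accept the default key location
      root@Ceph00:~# ssh-copy-id root@Ceph01    # repeat for every other node
      root@Ceph00:~# ssh-copy-id root@Ceph02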

    7. mkcephfs --- create a new Ceph cluster file system

      Copy the /etc/ceph/keyring.admin file to all the other nodes in our Ceph Storage Cluster, except the one on which we executed the mkcephfs command.

      I intend to change the original (10 GB) storage spaces to partitions of USB 3 disks, so I will need to re-run mkcephfs. Any after-effects? Hope not.
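
      The invocation itself is roughly as follows (assuming the ceph.conf sketched above is
      already on every node and password-less root ssh works between them):

      $ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.admin
      $ sudo service ceph -a start
      $ sudo ceph -s     # monitors, OSDs and the MDS should all report as up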

    8. After testing the cluster several times, we may boot all the nodes in the background. It seems rather stable.
  4. Ceph Client

    Following steps 9 through 11, we can create our own Ceph Client in no time. Once again, the main purposes of Ceph-Client are to produce a template for clients that use the Ceph Storage Cluster and to verify the functionality and practicality of the Ceph Storage Cluster.

    1. Copy Ceph-Client from Ceph-Temp, configure it with Config-Kvm, and bring it online.
    2. Copy the /etc/ceph/keyring.admin authentication file from any node of the ceph storage cluster.
    3. Mount the ceph filesystem with the ceph-fuse command and confirm that it works.
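      A minimal sketch (the monitor address and the mount point /mnt/ceph are assumptions;
      the keyring is the one copied in the previous step):
       $ sudo mkdir -p /mnt/ceph
       $ sudo ceph-fuse -m 192.168.0.6:6789 -k /etc/ceph/keyring.admin /mnt/ceph
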
    4. Inside the ceph storage cluster, carve out a block device with the rbd command.
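      A sketch matching the "rbd foo" device imported in the next step (the 4 GB size is
      an assumption):
       $ rbd create foo --size 4096      # size is given in MB
       $ rbd ls                          # should now list "foo"
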
    5. In Ceph-Client, load the rbd kernel module, then import the rbd block device with the following command:
       $ echo "192.168.0.6,192.168.0.7,192.168.0.8 name=admin,secret=`cat /tmp/secretfile` \
         rbd foo" | sudo tee /sys/bus/rbd/add
      

      If this succeeds, $ ls -l /dev/rbd* should show a block device with major number 254.
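
      To confirm the device is actually usable, one could, as a sketch (the device name
      /dev/rbd0 and the choice of ext4 are assumptions), format and mount it:

       $ sudo mkfs.ext4 /dev/rbd0
       $ sudo mount /dev/rbd0 /mnt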

  5. Duplicate Ceph-Client into Ceph-RBD2, give it two virtual network cards, and add iSCSI Target functionality; a sketch of exporting the rbd device over iSCSI follows.
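
    Which target implementation we end up with is discussed in the iSCSI section below. As one
    possible sketch using the tgt package (an assumption, not necessarily the target we chose;
    the IQN is made up), the mapped /dev/rbd0 could be exported like this:

     $ sudo apt-get install tgt
     $ sudo tgtadm --lld iscsi --op new --mode target --tid 1 \
                   -T iqn.2012-11.local.amath:rbd.foo
     $ sudo tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/rbd0
     $ sudo tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL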

iSCSI

  1. Target (example in item 3)

  2. Initiator (example in item 3)

    Copied from /src3/KVM/ResizeDebian-Mini.img, we produced Test-Eth1. In it, we installed tcpdump (and libpcap0.8, which tcpdump depends on) to verify that our eth1 packets really route within the 192.168.1.* LAN. From Test-Eth1, we created Deb2Nics, a VM with eth0 (IP: 192.168.0.253) and eth1 (IP: 192.168.1.253). We made sure the packets for 192.168.0.* and 192.168.1.* were routed within their own respective subnets. This is the template for creating our iSCSI-iNIT root filesystem.
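
    The check itself is just a matter of watching each interface with tcpdump, e.g. (a sketch):

     $ sudo tcpdump -n -i eth1                      # watch all traffic on eth1 ...
     $ sudo tcpdump -n -i eth1 net 192.168.0.0/24   # ... and confirm no 192.168.0.* packets leak onto it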

    Note: (11/10/2012) We need to load the rbd kernel module on iSCSI-iNIT. But the ceph-filesystem-related packages are installed in Ceph-RBD2 (inherited from Ceph-Client), not in iSCSI-iNIT, i.e. the ceph business is hidden from the iSCSI Initiator.

     $ cp Debian-2Nics.img iSCSI-iNIT.img
     # Boot iSCSI-iNIT, edit /etc/rc.local, and install open-iscsi and multipath-tools.
    hsu@iSCSI-iNIT:~$ diff /etc/rc.local /etc/rc.local.orig
    19,21c19,20
    < ifconfig eth0 192.168.0.100
    < ifconfig eth1 192.168.1.100
    < route add default gw 192.168.1.1
    ---
    > ifconfig eth0 192.168.1.100
    > route add default gw 192.168.1.33
    hsu@iSCSI-iNIT:~$ sudo apt-get install open-iscsi multipath-tools 
     # Next we open /etc/iscsi/iscsid.conf.  We first set node.startup to automatic and 
     # restarted the initiator, but we shouldn't start open-iscsi automatically, since 
     # our ceph rbd block device is usually not ready yet.  We prefer to import the rbd 
     # device in the iSCSI Target from the Ceph Cluster by hand, so manual is the better choice.
    hsu@iSCSI-iNIT:~$ sudo emacs /etc/iscsi/iscsid.conf # We prefer node.startup to be manual
         . 
         . 
    node.startup = automatic
         . 
         . 
    hsu@iSCSI-iNIT:~$ sudo modprobe rbd # Load the rbd kernel module so rbd devices are recognized
    # Probably we need to load the rbd module from /etc/rc.local, since the "open-iscsi restart" 
    # command is harmful to virtual disks that are already in use.
    hsu@iSCSI-iNIT:~$ sudo /etc/init.d/open-iscsi restart
    
  3. Our RbdAndSanTarget

  4. Wrong Choice of Target?

  5. Section 8: Advanced Configuration    Multipath iSCSI under Linux

Redundancy

  1. Data Redundancy
  2. San Design: IP Redundancy

Virtual Gateway

Network speed is determined by the slowest link in the domain. Raising the speed of an existing network requires not only new hardware but also a sizeable re-cabling effort. It is faster, and feasible, to build a new network layer on a new subnet with high-speed switches and cables. However, for the new subnet we must first deploy a Gateway and test it, verifying, for example, that packets for the 192.168.0.* and 192.168.1.* subnets really flow over their own wires.

iSCSI provides storage space over the network. For speed, reliability, and reduced interference, the Virtual Storage Cluster should be built on a high-speed private subnet, hence a Virtual Gateway as well.
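
A minimal sketch of turning a two-NIC VM, such as Deb2Nics above, into that gateway (which
interface faces which subnet is an assumption; adjust to the actual wiring):

$ sudo sysctl -w net.ipv4.ip_forward=1                       # forward packets between eth0 and eth1
$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE  # let the 192.168.1.* subnet reach the outside via eth0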