1. Create Ceph-Temp.img and install these packages in it:

    As far as I can recall: resize the image from deb-min (1G was added to /dev/sda2) and install a few more packages: btrfs-tools, sudo, synaptic, emacs, ceph.

    Since this VM template is for setting up a Ceph object storage cluster, it needs the btrfs and ceph filesystems; load the related modules automatically at boot time:

    hsu@Ceph-Temp:~$ diff /etc/modules /etc/modules.orig
    8,9d7
    < btrfs
    < ceph
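
    A minimal sketch of installing those packages and loading the modules right away (assumed commands, not a capture of the actual session):

    hsu@Ceph-Temp:~$ sudo apt-get update
    hsu@Ceph-Temp:~$ sudo apt-get install btrfs-tools sudo synaptic emacs ceph
    hsu@Ceph-Temp:~$ sudo modprobe btrfs && sudo modprobe ceph
    hsu@Ceph-Temp:~$ lsmod | grep -E '^(btrfs|ceph)'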
    

    Since Ceph is in an active development cycle and we want its newest features available, set up sources.list as follows:

    hsu@Ceph-Temp:~$ more /etc/apt/sources.list
    deb http://140.120.7.21/debian wheezy main contrib
    deb http://140.120.7.21/debian sid main contrib
    deb http://security.debian.org/ wheezy/updates main contrib
    # deb-src http://ftp.tw.debian.org/debian/ squeeze main
    
  2. Create 10G empty disk
    $ ls -l ../Vdi*/*
    -rw-r--r-- 1 hsu hsu 10485760000 Aug 27 14:48 ../Vdisks/blank.img
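
    The 10485760000-byte blank image can be produced with dd (or qemu-img); a sketch of one way to do it:

    $ dd if=/dev/zero of=../Vdisks/blank.img bs=1M count=10000
    $ # or: qemu-img create -f raw ../Vdisks/blank.img 10000M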
    
  3. Modify start-Ceph-Temp-20-efs so that it carries ../Vdisks/blank.img as /dev/sdb. Maybe we need to format blank.img as a btrfs filesystem?
    $ diff start-Ceph-Temp-20-efs start-Ceph-Temp-20-efs.orig
    27c27
    < kvm -net vde,vlan=0,sock=/src4/ceph/network-3412 -net nic,vlan=0,macaddr=1c:6f:65:85:64:b3 -m 512M -monitor unix:/src4/ceph/network-3412/MonSock,server,nowait -hda ../cluster/Ceph-Temp.img -hdb ../Vdisks/blank.img&
    ---
    > kvm -net vde,vlan=0,sock=/src4/ceph/network-3412 -net nic,vlan=0,macaddr=1c:6f:65:85:64:b3 -m 512M -monitor unix:/src4/ceph/network-3412/MonSock,server,nowait -hda ../cluster/Ceph-Temp.img -hdb /dev/sda&
    
  4. On the physical host, install btrfs-tools (via synaptic) so that we can make a btrfs filesystem on the virtual disk.
     # On Physical Host 
     $ sudo modprobe btrfs 
     $ cat /proc/filesystems | grep btrfs
            btrfs
     $ which mkfs.btrfs  
     $ cd /src4/ceph/Vdisks 
     $ ls -l
    total 10240004
    -rw-r--r-- 1 hsu hsu 10485760000 Aug 27 14:48 blank.img
     $ cp blank.img osdBlk.ada
     $ sudo mkfs.btrfs osdBlk.ada
    [sudo] password for hsu: 
    WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
    WARNING! - see http://btrfs.wiki.kernel.org before using
    fs created label (null) on osdBlk.ada
            nodesize 4096 leafsize 4096 sectorsize 4096 size 9.77GB
    Btrfs Btrfs v0.19
     $ sudo mount -t btrfs -o loop osdBlk.ada /mnt/tmp
    hsu@amd-6:/src4/ceph/Vdisks$ ls -l /mnt/tmp
    total 0
     $ ls -l /dev/disk/by-uuid | grep loop
    lrwxrwxrwx 1 root root 11 Aug 29 18:49 844d814e-2467-4072-8f10-98cb19d3273b -> ../../loop0 
     $ cat /etc/mtab | grep /mnt/tmp
    /dev/loop0 /mnt/tmp btrfs rw,relatime,space_cache 0 0
     # Similarly, we copy blank.img to osdBlk.bob and osdBlk.cay.  Then, we make btrfs on 
     # each of them and check their UUIDs as above.  Make sure the UUIDs of all of them are
     # different.
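     # A sketch of that step as a loop (assuming blkid is available on the host for
     # reading the UUID out of each image file):
     $ for n in bob cay; do cp blank.img osdBlk.$n; sudo mkfs.btrfs osdBlk.$n; done
     $ sudo blkid osdBlk.ada osdBlk.bob osdBlk.cay    # the three UUIDs must all differ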
     $ pwd
    /src4/ceph/Vdisks
     $ ls -l
    total 40960016
    -rw-r--r-- 1 hsu hsu 10485760000 Aug 27 14:48 blank.img
    -rw-r--r-- 1 hsu hsu 10485760000 Aug 29 18:56 osdBlk.ada
    -rw-r--r-- 1 hsu hsu 10485760000 Aug 29 22:33 osdBlk.bob
    -rw-r--r-- 1 hsu hsu 10485760000 Aug 29 22:34 osdBlk.cay
     $ file osdBlk.ada
    osdBlk.ada: BTRFS Filesystem sectorsize 4096, nodesize 4096, leafsize 4096)
     $ file osdBlk.bob
    osdBlk.bob: BTRFS Filesystem sectorsize 4096, nodesize 4096, leafsize 4096)
     $ file osdBlk.cay
    osdBlk.cay: BTRFS Filesystem sectorsize 4096, nodesize 4096, leafsize 4096)
    
  5. Prepare the needed Data or Configuration Files:
    hsu@amd-6:/src4/ceph/DataFiles$ pwd
    /src4/ceph/DataFiles
    hsu@amd-6:/src4/ceph/DataFiles$ ls -l
    total 8
    -rw-r--r-- 1 hsu hsu 893 Sep  1 09:41 ceph.conf
    -rw-r--r-- 1 hsu hsu 146 Sep  3 09:10 ceph.hosts
    lrwxrwxrwx 1 hsu hsu  10 Sep  3 09:09 hosts -> ceph.hosts
    hsu@amd-6:/src4/ceph/DataFiles$ cat ceph.hosts
    192.168.0.5      Ceph-Temp     ceph-temp
    192.168.0.6      Ada           ada
    192.168.0.7      Bob           bob
    192.168.0.8      Cay           cay
    hsu@amd-6:/src4/ceph/DataFiles$ cat ceph.conf
    # Omit lengthy output
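
    The actual ceph.conf is omitted above.  Purely for illustration, a configuration for a cluster of this shape (three monitors, three OSDs, two MDSs, data under /srv/ceph) might look roughly like the following; this is a hypothetical sketch reconstructed from the mkcephfs output in step 7, not the real file:

    [global]
            auth supported = cephx
    [mon]
            mon data = /srv/ceph/mon/mon.$id
    [mon.a]
            host = Ada
            mon addr = 192.168.0.6:6789
    [mon.b]
            host = Bob
            mon addr = 192.168.0.7:6789
    [mon.c]
            host = Cay
            mon addr = 192.168.0.8:6789
    [osd]
            osd data = /srv/ceph/osd/osd.$host
            osd journal = /srv/ceph/osd/osd.$id.journal
            keyring = /etc/ceph/keyring.$name
    [osd.0]
            host = Ada
    [osd.1]
            host = Bob
    [osd.2]
            host = Cay
    [mds]
            keyring = /etc/ceph/keyring.$name
    [mds.a]
            host = Ada
    [mds.c]
            host = Cay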
    
  6. Boot Ceph-Temp
    1. Permit Root login so that mkcephfs can be executed as root.
      hsu@Ceph-Temp:~$ diff /etc/ssh/sshd_config /etc/ssh/sshd_config.bkp
      26c26
      < PermitRootLogin yes
      ---
      > PermitRootLogin no
      

      Remember to turn off the RootLogin permission for ceph-temp after ada, bob, and cay are ready.

    2. scp DataFiles/ceph.conf to 192.168.0.5:/tmp and then move it into place:
      hsu@Ceph-Temp:~$ sudo mv /tmp/ceph.conf /etc/ceph
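
      The host-side copy in the first half of this step might look like this (a sketch, using the DataFiles directory listed in step 5):

      hsu@amd-6:/src4/ceph$ scp DataFiles/ceph.conf hsu@192.168.0.5:/tmp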
      
    3. The necessary directories /srv/ceph, /srv/ceph/mon, /srv/ceph/osd, and /srv/ceph/osd/osd.ada are created by the Config-Kvm-Storage script, and the following line for mounting the (btrfs) object storage is added to /etc/rc.local:
       mount -t btrfs /dev/sdb /srv/ceph/osd/osd.ada
      

      Also, DataFiles/ceph.hosts is symbolically linked to DataFiles/hosts; the Config-Kvm-Storage script appends the contents of DataFiles/hosts to the newly generated /etc/hosts.
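
      In shell terms, the directory preparation amounts to something like this (a sketch of what the script does, using the osd.ada path above):

       sudo mkdir -p /srv/ceph/mon /srv/ceph/osd/osd.ada
       # plus the mount line above, inserted into /etc/rc.local before its final "exit 0"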

    4. Produce ada, bob, cay VMs
       
       $ cd /src4/ceph/cluster 
       $ cp Ceph-Temp.img Ceph-Ada.img
       $ cd ../bin 
       $ Config-Kvm-Storage ../cluster/Ceph-Ada.img Ada 192.168.0.6 eth0 21  ../Vdisks/osdBlk.ada
      

      Do the same for bob and cay (see the sketch below). Then boot all of them in the foreground and execute $ sudo ./recover70rules on each of them to get rid of eth1, the wrongly recorded ethernet device.
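
      A sketch of the corresponding commands; the image names Ceph-Bob.img / Ceph-Cay.img and the numbers 22 / 23 are assumptions inferred from the start-Bob-22 and start-Cay-23 scripts used later:

       $ cp ../cluster/Ceph-Temp.img ../cluster/Ceph-Bob.img
       $ Config-Kvm-Storage ../cluster/Ceph-Bob.img Bob 192.168.0.7 eth0 22 ../Vdisks/osdBlk.bob
       $ cp ../cluster/Ceph-Temp.img ../cluster/Ceph-Cay.img
       $ Config-Kvm-Storage ../cluster/Ceph-Cay.img Cay 192.168.0.8 eth0 23 ../Vdisks/osdBlk.cay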

    5. Generate SSH key for root

      We need to set up ssh keys so that the machine we run mkcephfs on (the master) can ssh into the other nodes (the slaves) as root. We do this as follows.

      Note: ssh-keygen does not recognize the "-d" option, so we simply generate a default RSA key pair.

       $ ssh -X root@192.168.0.8
      Cay:~# ssh-keygen 
      Generating public/private rsa key pair.
      Enter file in which to save the key (/root/.ssh/id_rsa): 
      Created directory '/root/.ssh'.
      Enter passphrase (empty for no passphrase): 
      Enter same passphrase again: 
      Your identification has been saved in /root/.ssh/id_rsa.
      Your public key has been saved in /root/.ssh/id_rsa.pub.
      The key fingerprint is:
      54:e6:a7:36:60:5d:f4:8b:d3:0a:04:a9:3c:78:2e:7e root@Cay
      The key's randomart image is:
      +--[ RSA 2048]----+
      |        ..o.o    |
      |        .* . .   |
      |     o .+ + . .  |
      |    . =o o o o . |
      |     o .S = o o  |
      |    . .  . o o   |
      |   . .      .    |
      |    . E          |
      |     .           |
      +-----------------+
      Cay:~# ssh-copy-id root@bob
      Cay:~# ssh-copy-id root@ada
      
  7. mkcephfs --- create a new Ceph cluster file system

    The formal names of our VMs are Ada, Bob, and Cay; their nicknames are ada, bob, and cay. Unfortunately, in /etc/ceph/ceph.conf we had used the nicknames when assigning the host values, so the first execution of mkcephfs failed. We corrected the host values (e.g. host = ada became host = Ada) and then tried again.

    Cay:~# mkcephfs -c /etc/ceph/ceph.conf -a -k /etc/ceph/keyring.admin
    temp dir is /tmp/mkcephfs.AhJuMn5xlH
    preparing monmap in /tmp/mkcephfs.AhJuMn5xlH/monmap
    /usr/bin/monmaptool --create --clobber --add a 192.168.0.6:6789 --add b 192.168.0.7:6789 --add c 192.168.0.8:6789 --print /tmp/mkcephfs.AhJuMn5xlH/monmap
    /usr/bin/monmaptool: monmap file /tmp/mkcephfs.AhJuMn5xlH/monmap
    /usr/bin/monmaptool: generated fsid bcf49709-ae94-4244-b4a9-21b481288a8e
    epoch 0
    fsid bcf49709-ae94-4244-b4a9-21b481288a8e
    last_changed 2012-09-03 18:45:26.667257
    created 2012-09-03 18:45:26.667257
    0: 192.168.0.6:6789/0 mon.a
    1: 192.168.0.7:6789/0 mon.b
    2: 192.168.0.8:6789/0 mon.c
    /usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.AhJuMn5xlH/monmap (3 monitors)
    === osd.0 === 
    pushing conf and monmap to Ada:/tmp/mkfs.ceph.2144
    2012-09-03 18:45:29.545637 7fde12bbd780 -1 filestore(/srv/ceph/osd/osd.Ada) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
    2012-09-03 18:45:30.162704 7fde12bbd780 -1 created object store /srv/ceph/osd/osd.Ada journal /srv/ceph/osd/osd.0.journal for osd.0 fsid bcf49709-ae94-4244-b4a9-21b481288a8e
    creating private key for osd.0 keyring /etc/ceph/keyring.osd.0
    creating /etc/ceph/keyring.osd.0
    collecting osd.0 key
    === osd.1 === 
    pushing conf and monmap to Bob:/tmp/mkfs.ceph.2144
    2012-09-03 18:45:32.790829 7fde309ac780 -1 filestore(/srv/ceph/osd/osd.Bob) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
    2012-09-03 18:45:33.384811 7fde309ac780 -1 created object store /srv/ceph/osd/osd.Bob journal /srv/ceph/osd/osd.1.journal for osd.1 fsid bcf49709-ae94-4244-b4a9-21b481288a8e
    creating private key for osd.1 keyring /etc/ceph/keyring.osd.1
    creating /etc/ceph/keyring.osd.1
    collecting osd.1 key
    === osd.2 === 
    2012-09-03 18:45:33.983906 7f474305a780 -1 filestore(/srv/ceph/osd/osd.Cay) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
    2012-09-03 18:45:34.594248 7f474305a780 -1 created object store /srv/ceph/osd/osd.Cay journal /srv/ceph/osd/osd.2.journal for osd.2 fsid bcf49709-ae94-4244-b4a9-21b481288a8e
    creating private key for osd.2 keyring /etc/ceph/keyring.osd.2
    creating /etc/ceph/keyring.osd.2
    === mds.a === 
    pushing conf and monmap to Ada:/tmp/mkfs.ceph.2144
    creating private key for mds.a keyring /etc/ceph/keyring.mds.a
    creating /etc/ceph/keyring.mds.a
    collecting mds.a key
    === mds.c === 
    creating private key for mds.c keyring /etc/ceph/keyring.mds.c
    creating /etc/ceph/keyring.mds.c
    Building generic osdmap from /tmp/mkcephfs.AhJuMn5xlH/conf
    /usr/bin/osdmaptool: osdmap file '/tmp/mkcephfs.AhJuMn5xlH/osdmap'
    /usr/bin/osdmaptool: writing epoch 1 to /tmp/mkcephfs.AhJuMn5xlH/osdmap
    Generating admin key at /tmp/mkcephfs.AhJuMn5xlH/keyring.admin
    creating /tmp/mkcephfs.AhJuMn5xlH/keyring.admin
    Building initial monitor keyring
    added entity mds.a auth auth(auid = 18446744073709551615 key=AQBQikRQoE2WCBAA4Ok+m3gHjepnuITEx7rBrw== with 0 caps)
    added entity mds.c auth auth(auid = 18446744073709551615 key=AQBPikRQaOdRIhAAgQ2NBgtqeg9p0uwwUUegtQ== with 0 caps)
    added entity osd.0 auth auth(auid = 18446744073709551615 key=AQBKikRQsHT1DRAAgOIDa2aS/B+l6KC1qVlVYw== with 0 caps)
    added entity osd.1 auth auth(auid = 18446744073709551615 key=AQBNikRQQBAiGhAAYRaqXLi64pO4Ea5bXW6m6g== with 0 caps)
    added entity osd.2 auth auth(auid = 18446744073709551615 key=AQBOikRQiGsxJhAAMLlE76RrGZkXzyAu8XiNTA== with 0 caps)
    === mon.a === 
    pushing everything to Ada
    /usr/bin/ceph-mon: created monfs at /srv/ceph/mon/mon.a for mon.a
    === mon.b === 
    pushing everything to Bob
    /usr/bin/ceph-mon: created monfs at /srv/ceph/mon/mon.b for mon.b
    === mon.c === 
    /usr/bin/ceph-mon: created monfs at /srv/ceph/mon/mon.c for mon.c
    placing client.admin keyring in /etc/ceph/keyring.admin
    Cay:~# echo $?
    0
    

    Carefully examining ada, bob, and cay shows that only cay, the node on which we executed the mkcephfs command, has /etc/ceph/keyring.admin, so we copy it to ada and bob:

    Cay:~$ su root
    Password: 
    root@Cay:/home/hsu# for host in ada bob
    >                     do
    >                       scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin;
    >                     done
    keyring.admin                                 100%   63     0.1KB/s   00:00    
    keyring.admin                                 100%   63     0.1KB/s   00:00    
    
  8. Deploying with mkcephfs (Source Origin)
    Enabling Authentication

    In the [global] settings of your ceph.conf file, you can enable authentication for your cluster.

    [global]
            auth supported = cephx

    The valid values are cephx or none. If you specify cephx, Ceph will look for the keyring in the default search path, which includes /etc/ceph/keyring. You can override this location by adding a keyring option in the [global] section of your ceph.conf file, but this is not recommended.

    For authentication details, please check the official documentation: Cephx and Authentication. You may also log into a ceph-enabled VM and check the manual page via

    hsu@Ceph-Temp:~$ man ceph-authtool 
    CEPH-AUTHTOOL(8)                     Ceph                     CEPH-AUTHTOOL(8)
    NAME
           ceph-authtool - ceph keyring manipulation tool
    SYNOPSIS
           ceph-authtool keyringfile [ -l | --list ] [ -C | --create-keyring
           ] [ -p | --print ] [ -n | --name entityname ] [ --gen-key ] [ -a |
           --add-key base64_key ] [ --caps capfils ]
         .
         .
         .
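
    As a concrete example of the tool, a keyring for an additional client identity could be created and inspected like this (client.foo is a hypothetical name; the admin keyring in this write-up was generated by mkcephfs itself):

    hsu@Ceph-Temp:~$ sudo ceph-authtool --create-keyring /etc/ceph/keyring.client.foo --gen-key -n client.foo
    hsu@Ceph-Temp:~$ sudo ceph-authtool -l /etc/ceph/keyring.client.foo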
    
  9. Ceph Clients

    First, log into ada, bob, and cay as root and start ceph. Probably we should put this line in /etc/rc.local:

     # /etc/init.d/ceph start
    

    Now, are we ready to use our storage cluster yet? For reference: Mounting Ceph FS

    Warning: Don't mount ceph with the kernel driver on an OSD server; it may freeze the ceph client and your OSD server.

    hsu@amd-6:/src4/ceph$ cp cluster/Ceph-Temp.img clients/Ceph-Client.img
    hsu@amd-6:/src4/ceph$ cd bin
    $ ./Config-Kvm ../clients/Ceph-Client.img ceph-client1 192.168.0.130 eth0 30
    # Ada, Bob, Cay are already online
    hsu@amd-6:/src4/ceph/bin$ start-ceph-client1-30
    hsu@amd-6:~$ ssh -X hsu@192.168.0.130
    $ sudo ceph-authtool -l /etc/ceph/keyring.admin
    [sudo] password for hsu: 
    can't open /etc/ceph/keyring.admin: can't open /etc/ceph/keyring.admin: (2) No such file or directory
    ceph-client1:~$ which ceph-fuse
    /usr/bin/ceph-fuse
    ceph-client1:~$ ceph-fuse -m 192.168.0.6:6789 /mnt/tmp
    # Failed due to: auth: failed to open keyring from /etc/ceph/keyring.admin
    

    We need to get the /etc/ceph/keyring.admin from ada, bob, or cay.
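
    One way to do that (assuming root login on cay is still permitted, as set up in step 6):

    ceph-client1:~$ sudo scp root@192.168.0.8:/etc/ceph/keyring.admin /etc/ceph/keyring.admin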

    ceph-client1:~$ ceph-authtool -l /etc/ceph/keyring.admin
    [client.admin]
            key = AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==
    ceph-client1:~$  ceph -s
       health HEALTH_OK
       monmap e1: 3 mons at {a=192.168.0.6:6789/0,b=192.168.0.7:6789/0,c=192.168.0.8:6789/0}, election epoch 6, quorum 0,1,2 a,b,c
       osdmap e7: 3 osds: 3 up, 3 in
        pgmap v232: 576 pgs: 576 active+clean; 8730 bytes data, 8384 KB used, 26915 MB / 30000 MB avail
       mdsmap e5: 1/1/1 up {0=c=up:active}, 1 up:standby
    # The next command needs to be executed with sudo; otherwise it fails to open /dev/fuse.
    ceph-client1:~$ ls -l /dev/fuse
    ceph-client1:~$ sudo ceph-fuse -m 192.168.0.6:6789 /mnt/tmp
    ceph-fuse[2361]: starting ceph client
    ceph-fuse[2361]: starting fuse
    ceph-client1:~$ ls -l /mnt/tmp
    total 0
    ceph-client1:~$ sudo umount /mnt/tmp
    ceph-client1:~$ sudo mount -t ceph 192.168.0.6:6789:/ /mnt/tmp -vv -o name=admin,secret=AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==
    parsing options: rw,name=admin,secret=AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==
    mount: error writing /etc/mtab: Invalid argument
    # Although we got the error message on the last line, it seems the ceph filesystem is ready and
    # has been mounted on the /mnt/tmp directory.  It even shows up in /etc/mtab.
    ceph-client1:~$ ls -lia /mnt/tmp
    total 4
       1 drwxr-xr-x 1 root root    0 Sep  7 15:19 .
    7689 drwxr-xr-x 3 root root 4096 Sep  7 10:13 ..
    ceph-client1:~$ df
    Filesystem         1K-blocks    Used Available Use% Mounted on
    rootfs               2615208  996020   1488120  41% /
    udev                   10240       0     10240   0% /dev
    tmpfs                  50900     156     50744   1% /run
    /dev/sda2            2615208  996020   1488120  41% /
    tmpfs                   5120       0      5120   0% /run/lock
    tmpfs                 101780       0    101780   0% /run/shm
    /dev/sda1             467367   19636    422797   5% /boot
    192.168.0.6:6789:/  30720000 3159040  27560960  11% /mnt/tmp
    ceph-client1:~$ cat /etc/mtab
    rootfs / rootfs rw 0 0
    sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
    proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
    udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=62132,mode=755 0 0
    devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
    tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=50900k,mode=755 0 0
    /dev/sda2 / ext4 rw,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered 0 0
    tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
    tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=101780k 0 0
    fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
    /dev/sda1 /boot ext2 rw,relatime,errors=continue 0 0
    192.168.0.6:6789:/ /mnt/tmp ceph rw,relatime,name=admin,secret= 0 0
    
  10. Rbd: Rados block device

    Boot the ceph cluster and ceph-client1, then check the cluster's health:

    hsu@amd-6:/src4/ceph/bin$ start-Ada-21-AsDaemon; start-Bob-22-AsDaemon; start-Cay-23-AsDaemon; start-ceph-client1-30-AsDaemon
    hsu@amd-6:~$ ssh -X hsu@192.168.0.130
    hsu@192.168.0.130's password: 
    Linux ceph-client1 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64
       . 
       . 
       . 
    Last login: Sun Sep  9 15:26:12 2012 from 192.168.0.32
    ceph-client1:~$ ceph -s
       health HEALTH_OK
       monmap e1: 3 mons at {a=192.168.0.6:6789/0,b=192.168.0.7:6789/0,c=192.168.0.8:6789/0}, election epoch 2, quorum 0,1,2 a,b,c
       osdmap e29: 3 osds: 3 up, 3 in
        pgmap v607: 576 pgs: 576 active+clean; 9947 bytes data, 9084 KB used, 26915 MB / 30000 MB avail
       mdsmap e19: 1/1/1 up {0=c=up:active}, 1 up:standby
    ceph-client1:~$ 
    

    On the ceph server, create an rbd image and resize it:

    $ xs cay
    Cay:~$ sudo rbd create foo --size 256
    Cay:~$ sudo rbd list
    foo
    Cay:~$ sudo rbd resize --image foo --size 512
    Resizing image: 100% complete...done.
    Cay:~$ sudo rbd info foo
    rbd image 'foo':
            size 512 MB in 128 objects
            order 22 (4096 KB objects)
            block_name_prefix: rb.0.0
            parent:  (pool -1)
    Cay:~$ sudo rados ls -p rbd
    foo.rbd
    rbd_directory
    rbd_info
    #########################################################################################
    # Remember to issue the following command to delete foo after we are done with the client side.
    #########################################################################################
    Cay:~$ sudo  rbd rm foo
    Removing image: 100% complete...done.
    Cay:~$  sudo rbd list
    

    On the ceph client side,

    ceph-client1:~$ ceph-authtool -l /etc/ceph/keyring.admin
    [client.admin]
            key = AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==
    ceph-client1:~$ echo "AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==" >/tmp/secretfile 
    ceph-client1:~$ more /tmp/secretfile
    AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==
    ceph-client1:~$ lsmod | grep rbd
    ceph-client1:~$ sudo modprobe rbd
    [sudo] password for hsu: 
    ceph-client1:~$ lsmod | grep rbd
    rbd                    22311  0 
    libceph                90118  2 ceph,rbd
    ceph-client1:~$ ls -l /sys/bus/rbd
    total 0
    --w------- 1 root root 4096 Sep 10 09:14 add
    drwxr-xr-x 2 root root    0 Sep 10 09:14 devices
    drwxr-xr-x 2 root root    0 Sep 10 09:14 drivers
    -rw-r--r-- 1 root root 4096 Sep 10 09:14 drivers_autoprobe
    --w------- 1 root root 4096 Sep 10 09:14 drivers_probe
    --w------- 1 root root 4096 Sep 10 09:14 remove
    --w------- 1 root root 4096 Sep 10 09:14 uevent
    $ sudo echo "192.168.0.6,192.168.0.7,192.168.0.8 name=admin,secret=`cat /tmp/secretfile` rbd foo" >/sys/bus/rbd/add
    -bash: /sys/bus/rbd/add: Permission denied
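    # The "Permission denied" comes from the redirection: /sys/bus/rbd/add is opened by the
    # non-root shell, so sudo on echo alone does not help.  Besides switching to root (as
    # below), one could also write through tee, e.g.:
    #   echo "192.168.0.6,192.168.0.7,192.168.0.8 name=admin,secret=`cat /tmp/secretfile` rbd foo" | sudo tee /sys/bus/rbd/add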
    ceph-client1:~$ su
    Password: 
    root@ceph-client1:/home/hsu# echo "192.168.0.6,192.168.0.7,192.168.0.8 name=admin,secret=`cat /tmp/secretfile` rbd foo" >/sys/bus/rbd/add
    root@ceph-client1:/home/hsu# ls /sys/bus/rbd/devices
    0
    root@ceph-client1:/home/hsu# ls -l /sys/bus/rbd/devices
    total 0
    lrwxrwxrwx 1 root root 0 Sep 10 09:31 0 -> ../../../devices/rbd/0
    root@ceph-client1:/home/hsu# ls -l /sys/devices/rbd/0
    total 0
    -r--r--r-- 1 root root 4096 Sep 10 09:33 client_id
    --w------- 1 root root 4096 Sep 10 09:33 create_snap
    -r--r--r-- 1 root root 4096 Sep 10 09:20 current_snap
    -r--r--r-- 1 root root 4096 Sep 10 09:33 major
    -r--r--r-- 1 root root 4096 Sep 10 09:20 name
    -r--r--r-- 1 root root 4096 Sep 10 09:20 pool
    drwxr-xr-x 2 root root    0 Sep 10 09:33 power
    --w------- 1 root root 4096 Sep 10 09:33 refresh
    -r--r--r-- 1 root root 4096 Sep 10 09:33 size
    lrwxrwxrwx 1 root root    0 Sep 10 09:20 subsystem -> ../../../bus/rbd
    -rw-r--r-- 1 root root 4096 Sep 10 09:20 uevent
    root@ceph-client1:/home/hsu# cat /sys/devices/rbd/0/size
    536870912
    root@ceph-client1:/home/hsu# ls -l /dev/rbd0
    brw-rw---T 1 root disk 254, 0 Sep 10 09:20 /dev/rbd0
    root@ceph-client1:/home/hsu# cat /sys/devices/rbd/0/client_id
    client4705
    root@ceph-client1:/home/hsu# cat /sys/devices/rbd/0/name
    foo
    root@ceph-client1:/home/hsu# mkfs -t ext2 /dev/rbd0
    mke2fs 1.42.5 (29-Jul-2012)
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=1024 blocks, Stripe width=1024 blocks
    32768 inodes, 131072 blocks
    6553 blocks (5.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=134217728
    4 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks: 
            32768, 98304
    Allocating group tables: done                            
    Writing inode tables: done                            
    Writing superblocks and filesystem accounting information: done
    root@ceph-client1:/home/hsu# mount -t ext2 /dev/rbd0 /mnt
    root@ceph-client1:/home/hsu# ls -l /mnt
    total 16
    drwx------ 2 root root 16384 Sep 10 09:48 lost+found
    root@ceph-client1:/home/hsu# df /mnt
    Filesystem     1K-blocks  Used Available Use% Mounted on
    /dev/rbd0         516040   396    489432   1% /mnt
    root@ceph-client1:/home/hsu# more /etc/mtab
    rootfs / rootfs rw 0 0
    sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
    proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
    udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=62132,mode=755 0 0
    devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
    tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=50900k,mode=755 0 0
    /dev/sda2 / ext4 rw,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ord
    ered 0 0
    tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
    tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=101780k 0 0
    fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
    /dev/sda1 /boot ext2 rw,relatime,errors=continue 0 0
    /dev/rbd0 /mnt ext2 rw,relatime,errors=continue,user_xattr,acl 0 0
    root@ceph-client1:/home/hsu# umount /mnt
    root@ceph-client1:/home/hsu# echo 0 > /sys/bus/rbd/remove
    root@ceph-client1:/home/hsu# ls -l  /dev/rbd0
    ls: cannot access /dev/rbd0: No such file or directory
    root@ceph-client1:/home/hsu# ls -l /sys/bus/rbd/devices
    total 0
    root@ceph-client1:/home/hsu#  ls -l /sys/devices/rbd
    total 0
    drwxr-xr-x 2 root root    0 Sep 10 09:57 power
    -rw-r--r-- 1 root root 4096 Sep 10 09:20 uevent
    
  11. An rbd can not be shared by multiple VMs (this statement still needs to be verified). For it to be shared by multiple VMs, we need to export it via iSCSI. With iSCSI we may also provide redundancy via multipathing, but that requires different subnets (and a router between the subnets). Roughly speaking, iSCSI multipath provides two or more IP paths for accessing an iSCSI-connected block storage device. These IP paths must come from two different subnets, so that if one subnet fails there is still another IP path to reach the block storage. Also, in theory, the available bandwidth can be increased via multiple IP paths, especially if the second IP path is set up on gigabit ethernet devices.

    A more subtle problem: the rbd device disappears after a reboot.
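
    A possible workaround (not yet tested here) is to redo the mapping at boot time, e.g. by putting the same add line into /etc/rc.local on the client, with the secret kept somewhere persistent instead of /tmp (the path /etc/ceph/secretfile below is just an example):

     echo "192.168.0.6,192.168.0.7,192.168.0.8 name=admin,secret=`cat /etc/ceph/secretfile` rbd foo" >/sys/bus/rbd/add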

  12. Router/Gateway

    New Reference: My Router Setup   Old Reference: My Router (VC6.4) Installation

    We chose Vyatta to build our virtual router/gateway. So far so good, except that it cannot be upgraded in place from VC6.4 to VC6.5: download the new iso, vyatta-livecd_VC6.5R1_amd64.iso, and re-install it by following the instructions described in the New Reference above.

    Note: (11/24/2012) Vyatta is a Debian-based software virtual router, claimed to be similar to Juniper JUNOS or Cisco IOS. It has two editions: (1) a subscription edition and (2) an open-source edition. The subscription edition provides a web-based management interface, i.e. it is friendlier to use; we take the open-source one. Almost all the setups you would do based on your Debian experience are in vain, i.e. after rebooting they are gone; configuration can only be done via the configure command, which is documented almost nowhere. Worst of all, you won't be able to upgrade your software packages from a Debian mirror, and you can't install additional packages such as synaptic and emacs; nano is the only available text editor. There is no X GUI, everything is done through the command line interface (CLI). It does not offer any upgrade path; you can only reinstall a newer version from an ISO image. Its documentation web page: Vyatta Docdl, zip download: VC65.zip
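
    For the record, configuration changes on Vyatta are made from its own CLI and must be committed and saved to survive a reboot; a minimal sketch (the address shown is just an example for our eth1 side):

     vyatta@vyatta:~$ configure
     vyatta@vyatta# set interfaces ethernet eth1 address 192.168.1.1/24
     vyatta@vyatta# commit
     vyatta@vyatta# save
     vyatta@vyatta# exit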

    Note: (10/08/2012) MyRouter is OK now, I think. To test it, bring up Test-Eth1 (on ac00), a VM with only the IP 192.168.1.254, and edit its /etc/rc.local so that its default gateway is 192.168.1.1, not 192.168.1.33 (the second IP address of ac00). Reboot it. On amd-6, boot MyRouter and ceph-client1. Logging into Test-Eth1, from it we can successfully log into 192.168.0.33 (ac00), 192.168.0.32 (amd-6), and 192.168.0.130 (ceph-client1), but not machines on the 140.120 network. I think this is OK, since 192.168.1.0/24 is our own private LAN. (Originally, Test-Eth1 with 192.168.1.33 as default gateway could reach anywhere.)

    Note: (10/08/2012) Our router should route the 192.168.1.0/24 subnet to the other subnet. For consistency, we use eth1 to connect to our 192.168.1.0/24 subnet wherever possible (a virtual machine with only one 192.168.1.* IP address has only a single virtual eth0 card). For setting up KVM with 2 NICs and 2 taps, you may consult Kvm with 2 Nics. The correct way to set up MyRouter is as follows; notice that the MAC addresses of the two NICs must be the same as the MAC addresses recorded for ethernet eth0 and ethernet eth1 in the /opt/vyatta/etc/config/config.boot file.

     $ Config-Kvm ../Router/MyRouter.img MyRouter 192.168.1.1 eth1 13 
     # Edit start-MyRouter-13, start-MyRouter-13-AsDaemon, stop-MyRouter-restore-lan-13 
     # As follows:
     $ diff start-MyRouter-13  start-MyRouter-13.orig
    17,22d16
    < ################################################################################
    < sudo tunctl -u hsu -t tap103
    < sudo ifconfig tap103 192.168.0.32 netmask 255.255.255.255 up
    < sudo iptables --table nat -A POSTROUTING --out-interface eth0 -j MASQUERADE
    < sudo iptables -A FORWARD --in-interface tap103 -j ACCEPT
    < ################################################################################
    28,32d21
    < ################################################################################
    < sudo sysctl net.ipv4.conf.tap103.proxy_arp=1
    < sudo arp -Ds 192.168.0.2 eth0 pub
    < sudo route add -host 192.168.0.2 dev tap103
    < ################################################################################
    35,37d23
    < ################################################################################
    < vde_switch -tap tap103 -mod 644 -sock=/src4/ceph/network-3039 -mgmt /src4/ceph/network-3039/vde_switch.mgmt -daemon /dev/null
    < ################################################################################
    39,43c25
    < ################################################################################
    < # The MAC addresses for eth0 and eth1 are inscribed in config.boot file, can't be 
    < # changed arbitrarily.
    < ################################################################################
    < kvm -net vde,vlan=0,sock=/src4/ceph/network-3039 -net nic,vlan=0,macaddr=1c:6f:65:4f:cc:8f -net vde,vlan=1,sock=/src4/ceph/network-3049  -net nic,vlan=1,macaddr=1c:6f:65:e5:2f:3d -m 512M -monitor unix:/src4/ceph/network-3049/MonSock,server,nowait -hda ../Router/MyRouter.img &
    ---
    > kvm -net vde,vlan=0,sock=/src4/ceph/network-3049 -net nic,vlan=0,macaddr=1c:6f:65:e5:2f:3d -net nic,vlan=0,macaddr=1c:6f:65:4f:cc:8f -m 512M -monitor unix:/src4/ceph/network-3049/MonSock,server,nowait -hda ../Router/MyRouter.img &
     ############################################################################
     # The start-MyRouter-13-AsDaemon script is almost identical to start-MyRouter-13;
     # we only need to pay attention to the "-net" options of the kvm command.  The
     # eth0 and eth1 MAC addresses are hard-coded in its /opt/vyatta/etc/config/config.boot
     # file.  We also use vlan0 and vlan1 as two different (virtual) switches for the two
     # subnets.  It seems OK now, but surely we need more testing!  Only the difference in
     # the last line of start-MyRouter-13-AsDaemon vs. start-MyRouter-13-AsDaemon.orig is
     # shown below; the rest of the differences are the same as above.
     ############################################################################
    < screen -S MyRouter -d -m kvm  -net vde,vlan=0,sock=/src4/ceph/network-3039 -net nic,vlan=0,macaddr=1c:6f:65:4f:cc:8f -net vde,vlan=1,sock=/src4/ceph/network-3049  -net nic,vlan=1,macaddr=1c:6f:65:e5:2f:3d  -m 512M -monitor unix:/src4/ceph/network-3049/MonSock,server,nowait -curses -hda ../Router/MyRouter.img &
    ---
    > screen -S MyRouter -d -m kvm -net vde,vlan=0,sock=/src4/ceph/network-3049 -net nic,vlan=0,macaddr=1c:6f:65:e5:2f:3d -net nic,vlan=0,macaddr=1c:6f:65:4f:cc:8f -m 512M -monitor unix:/src4/ceph/network-3049/MonSock,server,nowait -curses -hda ../Router/MyRouter.img &
    $ diff stop-MyRouter-restore-lan-13 stop-MyRouter-restore-lan-13.orig
    45,47d44
    < ################################################################################
    < sudo pkill -f "vde_switch -tap tap103 -mod 644 -sock=/src4/ceph/network-3039 -mgmt /src4/ceph/network-3039/vde_switch.mgmt"
    < ################################################################################
    52,56d48
    < ################################################################################
    < if [ -S /src4/ceph/network-3039/ctl ]; then rm /src4/ceph/network-3039/ctl; fi
    < if [ -S /src4/ceph/network-3039/vde_switch.mgmt ]; then rm /src4/ceph/network-3039/vde_switch.mgmt; fi
    < if [ -d /src4/ceph/network-3039 ]; then rm -rf /src4/ceph/network-3039; fi
    < ################################################################################
    65,71d56
    < ################################################################################
    < sudo sysctl net.ipv4.conf.tap103.proxy_arp=0
    < sudo ifconfig tap103 192.168.0.32 down
    < # sudo iptables --table nat -D POSTROUTING --out-interface eth1 -j MASQUERADE
    < sudo iptables -D FORWARD --in-interface tap103 -j ACCEPT
    < sudo tunctl -d tap103
    < ################################################################################
    

    mkpartfs command is useless (kept for reference only)

    The mkpartfs command provided by qemu-kvm ends up with "/dev/sda unrecognized disk label". Instead, we can use start-Gparted-6-efs (in /src3/KVM/bin), specify /src4/ceph/Router/MyRouter.img as its argument, and use the gparted command to partition /dev/sdb: (1) first partition 488M, ext2; (2) second partition 3096M, ext4; (3) third partition 512M, swap (I always got 513M for the 3rd partition). Also turn on the boot flag for the first partition. Apparently the first 1MB is reserved for the MBR and not used; I asked for 488M but only got 487M, and the second partition (/) starts at sector 999424, the correct offset for the Config-Kvm shell script. The first partition is totally wasted, but we need it to get the right offset for the Config-Kvm shell script to be functional.

     $ mkdir /src4/ceph/Router
     $ mv *iso /src4/ceph/Router
     $ cd /src4/ceph/Router
     $ qemu-img create MyRouter.img 4G
    ##############################################################################
    # When seeing system prompt, type "install system" without the double quotes.
    # print ;; print info about hard disk.
    # mkpartfs primary ext2 1 512  ;; in the unit of MBs.
    # set 1 boot on ;; enable boot option on partition 1.
    # print
    # mkpartfs primary ext4 512 
    ##############################################################################
    
  13. Setup of Kvm with 2 Nics and 2 Taps

    For IP path redundancy, we need some of our VMs to be accessible via two different subnets. Of course, our physical host must have at least two ethernet cards and must be reachable via two different subnets, say 192.168.0.0/24 and 192.168.1.0/24. In the following scenario, 192.168.0.0/24 is the public subnet and 192.168.1.0/24 is the private subnet, i.e. the one we use to achieve higher bandwidth and less interference.

     $ cd /src4/ceph/clients
     $ cp Debian-Eth1.img Debian-2Nics.img
     $ cd ../bin  
     $ Config-Kvm ../clients/Debian-2Nics.img Deb2Nics 192.168.1.253 eth1 253 
     # Manually add a tun/tap for eth0: eth0 communicates with the host via tap243.  We
     # would like eth0 and eth1 to be on vlan0 and vlan1, two different switches.  Also,
     # we fake the MAC address of eth1 by subtracting 0x10 from each of the last three
     # bytes of eth0's address.  Recall that the MAC address of eth0 is also a faked one!
     $ diff start-Deb2Nics-253-AsDaemon start-Deb2Nics-253-AsDaemon.orig
    17,22d16
    < #################################################################################
    < sudo tunctl -u hsu -t tap243
    < sudo ifconfig tap243 192.168.0.33 netmask 255.255.255.255 up
    < sudo iptables --table nat -A POSTROUTING --out-interface eth0 -j MASQUERADE
    < sudo iptables -A FORWARD --in-interface tap243 -j ACCEPT
    < #################################################################################
    28,32d21
    < #################################################################################
    < sudo sysctl net.ipv4.conf.tap243.proxy_arp=1
    < sudo arp -Ds 192.168.0.253 eth0 pub
    < sudo route add -host 192.168.0.253 dev tap243
    < #################################################################################
    35,37d23
    < #################################################################################
    < vde_switch -tap tap243 -mod 644 -sock=/src4/ceph/network-3610 -mgmt /src4/ceph/network-3610/vde_switch.mgmt -daemon /dev/null
    < #################################################################################
    39c25
    < screen -S Deb2Nics -d -m kvm  -net vde,vlan=0,sock=/src4/ceph/network-3610 -net nic,vlan=0,macaddr=00:25:90:d5:c6:42 -net vde,vlan=1,sock=/src4/ceph/network-3620 -net nic,vlan=1,macaddr=00:25:90:c5:b6:32 -m 512M -monitor unix:/src4/ceph/network-3620/MonSock,server,nowait -curses -hda ../clients/Debian-2Nics.img &
    ---
    > screen -S Deb2Nics -d -m kvm -net vde,vlan=0,sock=/src4/ceph/network-3620 -net nic,vlan=0,macaddr=00:25:90:d5:c6:42 -m 512M -monitor unix:/src4/ceph/network-3620/MonSock,server,nowait -curses -hda ../clients/Debian-2Nics.img &
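
     Inside the guest, the second NIC still has to be configured.  A sketch of what /etc/network/interfaces for Deb2Nics might look like (the eth0 address 192.168.0.253 is taken from the proxy-arp lines above; the netmasks are assumptions):

     auto eth0
     iface eth0 inet static
         address 192.168.0.253
         netmask 255.255.255.0
     auto eth1
     iface eth1 inet static
         address 192.168.1.253
         netmask 255.255.255.0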
    
  14. Multipath RBD via iSCSI

    References:

    1. Using iSCSI on Debian   
    2. Sorry, probably I made a wrong choice

    Following the steps outlined in the first article, we can create a Ceph client and turn it into an iSCSITarget. This iSCSITarget may also carry other block devices, and essentially becomes our poor man's SAN storage. Through an iSCSIInitiator, we may then obtain remote storage space via our LAN.

  15. More References:
         iSCSI Introduction Wiki     Ceph ISCSI Wiki     Ceph Rbd As San Storage    
         Debian iSCSI Wiki     Debian iSCSI     MultiPath    
         5-minute Quick Start     CephFS Quick Start     Ceph FS
         A crash course in Ceph    

Turning Ceph RBD Images into SAN Storage Devices (Source Origin)

RADOS Block Device (RBD) is a block-layer interface to the Ceph distributed storage stack. Here's how you can enhance RBD with SAN storage device compatibility, like iSCSI and Fibre Channel, to connect systems with no native RBD support to your Ceph cluster.

Prerequisites

What you'll need in order to accomplish SAN compatibility for your Ceph cluster is this:

Getting Started

The first thing we'll need to do is create an RBD image. Suppose we would like to create one that is 10GB in size (recall, all RBD images are thin-provisioned, so we won't actually use 10GB in the Ceph cluster right from the start).

rbd -n client.rbd -k /etc/ceph/keyring.client.rbd create --size 10240 test

Question: does the option -n client.rbd mean the client name? In the rbd manpage there is no such option; is the manpage too old? Or should we use the --id username option instead? (Most likely -n is the common Ceph --name option, which takes the full name client.rbd, while --id takes just the id part, rbd.)

This means we are connecting to our Ceph mon servers (defined in the default configuration file, /etc/ceph/ceph.conf) using the client.rbd identity, whose authentication key is stored in /etc/ceph/keyring.client.rbd. The nominal image size is 10240MB, and its name is the hardly creative "test".

You can run this command from any node inside or outside your Ceph cluster, as long as the configuration file and authentication credentials are stored in the appropriate location. (This should be tested; it would be helpful, since we usually run the Ceph cluster in daemon mode.) The next step, however, is one that you must complete from your proxy (iSCSITarget) node (the one with the lio tools installed):

 $ modprobe rbd
 $ rbd --user rbd --secret /etc/ceph/secret.client.rbd map test

Note that this syntax applies to the current "stable" Ceph release, 0.48 "argonaut". Newer releases do away with the somewhat illogical --user and --secret syntax, and just allow --id and --keyring which is more in line with all other Ceph tools.

We used the following syntax instead, which is more direct but also more complicated:
ceph-client1:~$ sudo modprobe rbd
ceph-client1:~$ ceph-authtool -l /etc/ceph/keyring.admin
[client.admin]
        key = AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==
ceph-client1:~$ echo "AQBPikRQiDwkJhAAMjbxXa5fkivW09yCVbvZvw==" >/tmp/secretfile 
ceph-client1:~$ echo "192.168.0.6,192.168.0.7,192.168.0.8 name=admin,secret=`cat /tmp/secretfile` rbd foo" >/sys/bus/rbd/add

Once the map command has completed, you should see a new block device named /dev/rbd0 (provided this is the first device you mapped on this machine), and a handy symlink of the pattern /dev/rbd/<pool>/<image>, in our case /dev/rbd/rbd/test. This is a kernel-level block device like any other, and we can now proceed by exporting it to the Unified Target infrastructure.

Exporting the Target

Once we have our mapped RBD device in place, we can create a target, and export it via one of LIO's fabric modules. The targetcli subshell comes in very handy for this purpose:

# targetcli 
Welcome to the targetcli shell:
 Copyright (c) 2011 by RisingTide Systems LLC.
Visit us at http://www.risingtidesystems.com.
Loaded tcm_loop kernel module.
Created '/sys/kernel/config/target/loopback'.
Done loading loopback fabric module.
Loaded tcm_fc kernel module.
Created '/sys/kernel/config/target/fc'.
Done loading tcm_fc fabric module.
Can't load fabric module qla2xxx.
Loaded iscsi_target_mod kernel module.
Created '/sys/kernel/config/target/iscsi'.
Done loading iscsi fabric module.
Can't load fabric module ib_srpt.
/> cd backstores/iblock
/backstores/iblock> create test /dev/rbd/rbd/test
Generating a wwn serial.
Created iblock storage object test using /dev/rbd/rbd/test.
Entering new node /backstores/iblock/test
/backstores/iblock/test> status
Status for /backstores/iblock/test: /dev/rbd/rbd/liotest deactivated

Note: (10/26/2012) The above modules can be found in the Debian 3.2.0-4-amd64 kernel.

hsu@Amath-Client00:~$ find /lib/modules -name "*tcm*"
/lib/modules/3.2.0-4-amd64/kernel/drivers/target/tcm_fc
/lib/modules/3.2.0-4-amd64/kernel/drivers/target/tcm_fc/tcm_fc.ko
/lib/modules/3.2.0-4-amd64/kernel/drivers/target/loopback/tcm_loop.ko
hsu@Amath-Client00:~$ find /lib/modules -name "*iscsi_target*"
/lib/modules/3.2.0-4-amd64/kernel/drivers/target/iscsi/iscsi_target_mod.ko
hsu@Amath-Client00:~$ apt-cache search targetcli
targetcli - administration tool for managing LIO core target
hsu@Amath-Client00:~$ apt-cache search lio-utils
lio-utils - configuration tool for LIO core target
kamailio-utils-modules - Provides a set utility functions for Kamailio

Now we've created a backstore named test, corresponding to our mapped RBD image of the same name. At this point it is deactivated, as it hasn't been assigned to any iSCSI target. Up next, we'll create the target, add the backstore as LUN 0, and assign the target to a Target Portal Group (TPG):

/backstores/iblock> cd ..
/backstores> cd ..
/> cd iscsi 
/iscsi> create
Created target iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557.
Selected TPG Tag 1.
Successfully created TPG 1.
Entering new node /iscsi/iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557/tpgt1
/iscsi/iqn.20...8ca9557/tpgt1> cd luns
/iscsi/iqn.20...57/tpgt1/luns> status
Status for /iscsi/iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557/tpgt1/luns: 0 LUN
/iscsi/iqn.20...57/tpgt1/luns> create /backstores/iblock/test 
Selected LUN 0.
Successfully created LUN 0.
Entering new node /iscsi/iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557/tpgt1/luns/lun0
/iscsi/iqn.20...gt1/luns/lun0> cd ..
/iscsi/iqn.20...57/tpgt1/luns> cd ..
/iscsi/iqn.20...8ca9557/tpgt1> cd portals 
/iscsi/iqn.20...tpgt1/portals> create 192.168.122.117
Using default IP port 3260
Successfully created network portal 192.168.122.117:3260.
Entering new node /iscsi/iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557/tpgt1/portals/192.168.122.117:3260

So now we have a new target, with the IQN iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557, assigned to a TPG listening on 192.168.122.117.

For demonstration purposes, we can now disable authentication and initiator filters. You should obviously not do this on a production system.

/iscsi/iqn.20....122.117:3260> cd ..
/iscsi/iqn.20...tpgt1/portals> cd ..
/iscsi/iqn.20...8ca9557/tpgt1> set attribute authentication=0
Parameter authentication is now '0'.
/iscsi/iqn.20...8ca9557/tpgt1> set attribute demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
Parameter demo_mode_write_protect is now '0'.
Parameter generate_node_acls is now '1'.
Parameter cache_dynamic_acls is now '1'.
/iscsi/iqn.20...8ca9557/tpgt1> exit

There. Now we have an iSCSI target, a target portal group, and a single LUN assigned to it.

Using your new target

And now, you can just connect to this thin-provisioned, dynamically replicated, self-healing and self-rebalancing, snapshot capable, striped and distributed block device as you would to any other iSCSI target.

Here's an example for the Linux standard open-iscsi tools:

# iscsiadm -m discovery -p 192.168.122.117 -t sendtargets
192.168.122.117:3260,1 iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557
# iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557 -p 192.168.122.117 --login
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557, portal: 192.168.122.117,3260]
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557, portal: 192.168.122.117,3260]: successful

At this point, you'll have a shiny SCSI device showing up under lsscsi and in your /dev tree, and this device you can use for anything you please. Try partitioning it and making a filesystem on one of the partitions.
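
For example (assuming the new disk shows up as /dev/sdb; check lsscsi or dmesg first):

# parted -s /dev/sdb mklabel msdos mkpart primary ext4 1MiB 100%
# mkfs.ext4 /dev/sdb1
# mount /dev/sdb1 /mnt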

And when you're done, you just log out:

# iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557 -p 192.168.122.117 --logout
Logging out of session [sid: 1, target: iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557, portal: 192.168.122.117,3260]
Logout of [sid: 1, target: iqn.2003-01.org.linux-iscsi.gwen.i686:sn.7d9ed8ca9557, portal: 192.168.122.117,3260]: successful

That's it.

Where To Go From Here

Now you can start doing pretty nifty stuff.

Have a server that needs extra storage, but runs a legacy Linux distro with no native RBD support? Install open-iscsi and provide that box with the replicated, striped, self-healing, auto-mirroring capacity that Ceph and RBD come with.

Have a Windows box that should somehow start using your massive distributed storage cluster? As long as you have Microsoft iSCSI Initiator installed, you can do so in an all-software solution. Or you just get the iSCSI HBA of your choice, and use its Windows driver.

And if you have a server that can boot off iSCSI, you can even run a bare-metal install, or build a diskless system that stores all of its data in RBD.

Want High Availability for your target proxy? Can do. Pacemaker has resource agents both for LIO and for RBD. And highly-available iSCSI servers are a relatively old hat; we've been doing that with other, less powerful storage replication solutions forever.

5-minute Quick Start (Source Origin)

Thank you for trying Ceph! Petabyte-scale data clusters are quite an undertaking. Before delving deeper into Ceph, we recommend setting up a two-node demo cluster to explore some of the functionality. The Ceph 5-Minute Quick Start deploys a Ceph object store cluster on one server machine and a Ceph client on a separate machine, each with a recent Debian/Ubuntu operating system. The intent of this Quick Start is to help you exercise Ceph object store functionality without the configuration and deployment overhead associated with a production-ready object store cluster. Once you complete this quick start, you may exercise Ceph commands on the command line. You may also proceed to the quick start guides for block devices, CephFS filesystems, and the RESTful gateway.

Install Debian/Ubuntu

Install a recent release of Debian or Ubuntu (e.g., 12.04 precise) on your Ceph server machine and your client machine.

Add Ceph Packages

To get the latest Ceph packages, add a release key to APT, add a source location to the /etc/apt/sources.list on your Ceph server and client machines, update your systems and install Ceph.

wget -q -O- https://raw.github.com/ceph/ceph/master/keys/release.asc | sudo apt-key add -
echo deb http://ceph.com/debian/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install ceph

Check the Ceph version you are using and make a note of it so that you have the correct settings in your configuration file:

ceph -v

If ceph -v reflects an earlier version than the one you installed, your ceph-common library may be using the version distributed with the kernel. Once you've installed Ceph, you may also update and upgrade your packages to ensure you have the latest ceph-common library installed.

sudo apt-get update && sudo apt-get upgrade

If you want to use a version other than the current release, see Installing Debian/Ubuntu Packages for further details.

Add a Configuration File

The example configuration file will configure Ceph to operate a monitor, two OSD daemons and one metadata server on your Ceph server machine. To add a configuration file to Ceph, we suggest copying the contents of the example file below to an editor. Then, follow the steps below to modify it.

[global]
	# For version 0.55 and beyond, you must explicitly enable 
	# or disable authentication with "auth" entries in [global].
	
	auth cluster required = cephx
	auth service required = cephx
	auth client required = cephx
[osd]
	osd journal size = 1000
	
	#The following assumes ext4 filesystem.
	filestore xattr use omap = true
	# For Bobtail (v 0.56) and subsequent versions, you may 
	# add settings for mkcephfs so that it will create and mount
	# the file system on a particular OSD for you. Remove the comment `#` 
	# character for the following settings and replace the values 
	# in braces with appropriate values, or leave the following settings 
	# commented out to accept the default values. You must specify the 
	# --mkfs option with mkcephfs in order for the deployment script to 
	# utilize the following settings, and you must define the 'devs'
	# option for each osd instance; see below.
	#osd mkfs type = {fs-type}
	#osd mkfs options {fs-type} = {mkfs options}   # default for xfs is "-f"
	#osd mount options {fs-type} = {mount options} # default mount option is "rw, noatime"
	# Execute $ hostname to retrieve the name of your host,
	# and replace {hostname} with the name of your host.
	# For the monitor, replace {ip-address} with the IP
	# address of your host.
[mon.a]
	host = {hostname}
	mon addr = {ip-address}:6789
[osd.0]
	host = {hostname}
	
	# For Bobtail (v 0.56) and subsequent versions, you may 
	# add settings for mkcephfs so that it will create and mount
	# the file system on a particular OSD for you. Remove the comment `#` 
	# character for the following setting for each OSD and specify 
	# a path to the device if you use mkcephfs with the --mkfs option.
	
	#devs = {path-to-device}
[osd.1]
	host = {hostname}
	#devs = {path-to-device}
[mds.a]
	host = {hostname}
  1. Open a command line on your Ceph server machine and execute hostname -s to retrieve the name of your Ceph server machine.

  2. Replace {hostname} in the sample configuration file with your host name.

  3. Execute ifconfig on the command line of your Ceph server machine to retrieve the IP address of your Ceph server machine.

  4. Replace {ip-address} in the sample configuration file with the IP address of your Ceph server host.

  5. Save the contents to /etc/ceph/ceph.conf on Ceph server host.

  6. Copy the configuration file to /etc/ceph/ceph.conf on your client host.

    sudo scp {user}@{server-machine}:/etc/ceph/ceph.conf /etc/ceph/ceph.conf

Tip

Ensure the ceph.conf file has appropriate permissions set (e.g. chmod 644) on your client machine.

New in version 0.55.

Ceph v0.55 and above have authentication enabled by default. You should explicitly enable or disable authentication with version 0.55 and above. The example configuration provides auth entries for authentication. For details on Ceph authentication see Ceph Authentication.

Deploy the Configuration

You must perform the following steps to deploy the configuration.

  1. On your Ceph server host, create a directory for each daemon. For the example configuration, execute the following:

    sudo mkdir -p /var/lib/ceph/osd/ceph-0
    sudo mkdir -p /var/lib/ceph/osd/ceph-1
    sudo mkdir -p /var/lib/ceph/mon/ceph-a
    sudo mkdir -p /var/lib/ceph/mds/ceph-a
  2. Execute the following on the Ceph server host:

    cd /etc/ceph
    sudo mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

Among other things, mkcephfs will deploy Ceph and generate a client.admin user and key. For Bobtail and subsequent versions (v 0.56 and after), the mkcephfs script will create and mount the filesystem for you, provided you specify the osd mkfs, osd mount, and devs settings in your Ceph configuration file.

Start Ceph

Once you have deployed the configuration, start Ceph from the command line of your server machine.

sudo service ceph start

Check the health of your Ceph cluster to ensure it is ready.

sudo ceph health

When your cluster echoes back HEALTH_OK, you may begin using Ceph.

Copy The Keyring to The Client

The next step you must perform is to copy /etc/ceph/ceph.keyring, which contains the client.admin key, from the server machine to the client machine. If you don't perform this step, you will not be able to use the Ceph command line, as the example Ceph configuration requires authentication.

sudo scp {user}@{server-machine}:/etc/ceph/ceph.keyring /etc/ceph/ceph.keyring

Tip

Ensure the ceph.keyring file has appropriate permissions set (e.g., chmod 644) on your client machine.

Proceed to Other Quick Starts

Once you have Ceph running with both a client and a server, you may proceed to the other Quick Start guides.

  1. For Ceph block devices, proceed to Block Device Quick Start.
  2. For the CephFS filesystem, proceed to CephFS Quick Start.
  3. For the RESTful Gateway, proceed to Gateway Quick Start.

CephFS Quick Start (Source Origin)

To use this guide, you must have executed the procedures in the 5-minute Quick Start guide first. Execute this quick start on the client machine.

Important

Mount the CephFS filesystem on the client machine, not the cluster machine.

Kernel Driver

Mount Ceph FS as a kernel driver.

sudo mkdir /mnt/mycephfs
sudo mount -t ceph {ip-address-of-monitor}:6789:/ /mnt/mycephfs

Filesystem in User Space (FUSE)

Mount Ceph FS with FUSE. Replace {username} with your username.

sudo mkdir /home/{username}/cephfs
sudo ceph-fuse -m {ip-address-of-monitor}:6789 /home/{username}/cephfs

Additional Information

See CephFS for additional information. CephFS is not quite as stable as the block device and the object storage gateway. Contact Inktank for details on running CephFS in a production environment.

Ceph FS (Source Origin)

The Ceph FS file system is a POSIX-compliant file system that uses a RADOS cluster to store its data. Ceph FS uses the same RADOS object storage device system as RADOS block devices and RADOS object stores such as the RADOS gateway with its S3 and Swift APIs, or native bindings. Using Ceph FS requires at least one metadata server in your ceph.conf configuration file.

A crash course in Ceph, a distributed replicated clustered filesystem (Source Origin)

Published September 14th, 2012 by Barney Desmond

We've been looking at Ceph recently; it's basically a fault-tolerant distributed clustered filesystem. If it works, that's like nirvana for shared storage: you have many servers, each one pitches in a few disks, and there's a filesystem that sits on top that is visible to all servers in the cluster. If a disk fails, that's okay too.

Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion - it's got layers. The filesystem on top is nifty, but the coolest bits are below the surface.

If Ceph proves to be solid enough for use, we'll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.

Diagram

This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.

[Figure: Ceph's major components]

Ceph components

We'll start at the bottom of the stack and work our way up.

OSDs

OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory (eg. /var/lib/ceph/osd-1) that Ceph makes use of, residing on a regular filesystem, though it should be assumed to be opaque for the purposes of using it with Ceph.

Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, featureset (support for XATTRs larger than 4KiB) and data integrity.

We're using btrfs for our testing.

Using RAIDed OSDs

A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.

However, we've not yet determined whether this is awesome. At this stage we're not using RAID, and just letting Ceph take care of block replication.

Placement Groups

Also referred to as PGs, the official docs note that placement groups help ensure performance and scalability, as tracking metadata for each individual object would be too costly.

A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (number of copies) is asserted higher, up at the Pool level, and all PGs in a pool will replicate stored objects into multiple OSDs.

As an example, in a system with 3-way replication the mapping might look like this:

    PG-1 -> OSD 1, OSD 37, OSD 99
    PG-2 -> OSD 4, OSD 22, OSD 41

Any object that happens to be stored in PG-1 will be written to all three of its OSDs (1, 37, 99). Any object stored in PG-2 will be written to its three OSDs (4, 22, 41). And so on.

Pools

A pool is the layer at which most user-interaction takes place. This is the important stuff like GET, PUT, DELETE actions for objects in a pool.

Pools contain a number of PGs, not shared with other pools (if you have multiple pools). The number of PGs in a pool is defined when the pool is first created, and can't be changed later. You can think of PGs as providing a hash mapping for objects into OSDs, to ensure that the OSDs are filled evenly when adding objects to the pool.
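
For instance, a new pool with an explicit PG count is created like this (the pool name and the count of 128 are just examples):

    ceph osd pool create mypool 128
    rados lspools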

CRUSH maps

CRUSH mappings are specified on a per-pool basis, and serve to skew the distribution of objects into OSDs according to administrator-defined policy. This is important for ensuring that replicas don't end up on the same disk/host/rack/etc, which would break the entire point of having replicant copies.

A CRUSH map is written by hand, then compiled and passed to the cluster.

Still confused?

This may not make much sense at the moment, and that's completely understandable. Someone on the Ceph mailing list provided a brief summary of the components which we found helpful for clarifying things:

  • Many objects will map to one PG
  • Each object maps to exactly one PG
  • One PG maps to a list of OSDs. The first one in the list is the primary and the rest are replicas
  • Many PGs can map to one OSD

A PG represents nothing but a grouping of objects; you configure the number of PGs you want, number of OSDs * 100 is a good starting point, and all of your stored objects are evenly pseudo-randomly distributed to the PGs.

So a PG explicitly does NOT represent a fixed amount of storage; it represents 1/pg_num 'th of the storage you happen to have on your OSDs.

Ceph services

Now we're into the good stuff. Pools full of objects are well and good, but what do you do with it now?

RADOS

What the lower layers ultimately provide is a RADOS cluster: Reliable Autonomic Distributed Object Store. At a practical level this translates to storing opaque blobs of data (objects) in high performance shared storage.

Because RADOS is fairly generic, it's ideal for building more complex systems on top. One of these is RBD.

RBD

As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:

RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.

CephFS

CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant because the lower layer features of the stack provide really awesome filesystem features (such as snapshotting), while the CephFS layer just needs to translate that into a usable filesystem.

CephFS isn't considered ready for prime-time just yet, but RADOS and RBD are.

We're excited!

Anchor is mostly interested in the RBD service that Ceph provides. To date our VPS infrastructure has been very insular, with each hypervisor functioning independently. This works fantastically and avoids putting all our eggs in one basket, but the lure of shared storage is strong.

Our hypervisor of choice, KVM, already has support for direct integration with RBD, which makes it a very attractive option if we want to use shared storage. Shared storage for a VPS enables live migration between hypervisors (moving a VPS to another hypervisor without downtime), which is unbelievably cool.

CephFS is also something we'd like to be able to offer our customers when it matures. We've found sharing files between multiple servers in a highly-available fashion to be clunky at best. We've so far avoided solutions like GFS and Lustre due to the level of complexity, so we're hoping CephFS will be a good option at the right scale.

Further reading

We wouldn't dare to suggest that our notes here are complete or infallibly accurate. If you're interested in Ceph, the resources linked from the original post are worth a read.
