Discussion:
Restart from....? (DRP)
Albert Shih
2018-06-18 07:46:02 UTC
Permalink
Hi everyone

I have a question about DRP (Disaster Recovery Plan): what is the easiest (=
fastest) way to rebuild a server (with the data) after a server « disappears »
(fire, flood, etc.)?

I see three ways to « back up » the data:

Replication,

Backup service (inside cyrusimapd 3),

Filesystem backup (whatever the technique)

For replication my concern is its speed. The main server (I have only one)
has lots of RAM, SSDs, and SAS disks; the replica has SATA disks (and lots of
RAM too). When I check, I think everything is indeed replicated on the
« slave », but with some delay (1/2 days).

What do you think? What's your DRP?

Regards.

JAS



--
Albert SHIH
Observatoire de Paris
xmpp: ***@obspm.fr
Heure local/Local time:
Mon Jun 18 09:37:59 CEST 2018
Albert Shih
2018-06-18 08:48:16 UTC
Permalink
On 18/06/2018 at 10:22:03+0200, Niels Dettenbach via Info-cyrus wrote:
Post by Albert Shih
What do you think? What's your DRP?
I shoot snapshots from the underlying FS of the spool partition(s) and the
main DB files (skiplist) - incl. (incremental) filesystem dumps of them.
How do you do that?

Because at the beginning my plan was to do both (replication and snapshots).

The problem is that I'm currently encountering a big issue with the snapshots.
I don't know if this is the right place, because I don't know if it's related
to Cyrus; that's why I didn't mention it at first. But I have a server (Dell
PowerEdge, 192 GB, 28 mechanical disks, 2 SSDs, 2 SAS disks (for the OS)).

The system is FreeBSD 11, running on the 2 SAS disks on UFS.

Cyrus IMAP runs inside a jail on the 2 SSDs (on a ZFS pool).

The mailboxes and Xapian indexes are on two ZFS datasets on a zpool with the
28 mechanical disks.

Everything seems to work fine until I try to send the datasets to another
server. I just cannot send a ZFS snapshot from this server to another one. If
the datasets are small it's OK, but with the mailbox dataset (~4 TB) the zfs
command just hangs after 10-40 minutes, stays stuck for 1-10 minutes, comes
back and works for 1 or 2 hours, then hangs again, etc.
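For transfers of this size, one common mitigation is incremental, resumable
sends. A sketch, assuming OpenZFS resumable-send support (present in FreeBSD
11) and placeholder pool/host names:

```
# Initial full send of the mail dataset; -s on the receiver keeps
# resume state if the transfer is interrupted:
zfs snapshot tank/mail@base
zfs send tank/mail@base | ssh backup zfs receive -s backup/mail

# After an interruption, fetch the resume token on the receiver
# and restart from where the stream stopped:
TOKEN=$(ssh backup zfs get -H -o value receive_resume_token backup/mail)
zfs send -t "$TOKEN" | ssh backup zfs receive -s backup/mail

# Once the base snapshot is across, only small incrementals remain:
zfs snapshot tank/mail@daily1
zfs send -i tank/mail@base tank/mail@daily1 | ssh backup zfs receive backup/mail
```

This does not explain the hangs themselves, but it makes an interrupted 4 TB
transfer restartable instead of starting over.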
In a disaster scenario it usually works well to reinstantiate the last
snapshot and start the server(s) with a forced full reconstruct run. But this
only offers "low resolution" recovery (mails / modifications since the last
snapshot are gone then).
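The forced reconstruct mentioned above would typically look something like
this (a sketch; paths and flags vary between Cyrus versions, so check
reconstruct(8) for yours):

```
# After restoring the spool from the last snapshot, rebuild the
# cyrus.* index files for all mailboxes, run as the cyrus user:
su cyrus -c "/usr/cyrus/bin/reconstruct -r -f"
# If you use Xapian search, let squatter rebuild the search indexes
# afterwards.
```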
Beside this we run daily FS backups (incl. Cyrus DB dumps), which allow us to
reinstall from zero (i.e. automated by Ansible or similar) on the system and
FS level.
How do you do that? Because Cyrus has a lot of DBs...
Yes, we are using Puppet; reinstalling the system and configuration is easy.
The hard part is the data.
I'm a bit new to the backup mechanisms and repository features included in
Cyrus 3, and interested in experiences with setups that also allow an
efficient "lossless" recovery.
I'm a bit new to Cyrus, so... ;-) All I can say is that the replication seems
to work well. I have

master --> first slave (same room) --> second slave (distant datacenter).

I'll try today to see whether it's easy or not to restart from a slave by
cloning it.
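A chained rolling-replication setup like the one above is usually wired up
with the sync_* options in imapd.conf plus a rolling sync_client. A sketch,
with placeholder hostnames and credentials (each hop points at the next
server in the chain):

```
# imapd.conf on the master (the first slave has the same stanza,
# pointing at the second slave):
sync_log: 1
sync_host: replica1.example.org
sync_authname: repluser
sync_password: secret

# cyrus.conf, STARTUP section: run sync_client in rolling mode so it
# continuously replays the sync log to the replica:
syncclient cmd="sync_client -r"
```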

Best regards.

--
Albert SHIH
DIO bâtiment 15
Observatoire de Paris
xmpp: ***@obspm.fr
Heure local/Local time:
Mon Jun 18 10:36:19 CEST 2018
Michael Menge
2018-06-20 08:04:31 UTC
Permalink
Hi,
Post by Albert Shih
Hi everyone
I have a question about DRP (Disaster Recovery Plan): what is the easiest (=
fastest) way to rebuild a server (with the data) after a server « disappears »
(fire, flood, etc.)?
Replication,
Backup service (inside cyrusimapd 3),
Filesystem backup (whatever the technique)
For replication my concern is its speed. The main server (I have only one)
has lots of RAM, SSDs, and SAS disks; the replica has SATA disks (and lots of
RAM too). When I check, I think everything is indeed replicated on the
« slave », but with some delay (1/2 days).
We have distributed our users across 6 (virtual) servers in a Cyrus 2.4
murder setup. The servers are grouped in pairs, so that one runs on hardware
in one building and the other in the other building. On each server there are
3 Cyrus instances running: one frontend, one backend, and one replica.

In case of disaster, or planned maintenance, we start the replica as a normal
backend (we use service IP addresses for each backend and move this IP to the
other server, so we don't have to update the mupdate master mailbox.db).


The rolling replication is able to keep up, so normally there is only a small
delay (2-5 seconds). If there is a traffic peak (many newsletters) it may take
up to 1-2 hours. I have only seen longer delays in the case of a corrupt
mailbox, where the replication bailed out. We monitor the size of the
replication log.
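Monitoring the replication backlog as described above can be as simple as
watching the size of the sync log that sync_client has not yet replayed. A
minimal sketch; the path and threshold below are illustrative assumptions,
not the poster's actual setup:

```python
import os

# Hypothetical path to the Cyrus sync log; adjust to your
# configdirectory and sync channel layout.
SYNC_LOG = "/var/imap/sync/log"
MAX_BYTES = 10 * 1024 * 1024  # alert above ~10 MB of unreplayed log


def replication_backlog_ok(path=SYNC_LOG, limit=MAX_BYTES):
    """Return True if the replication backlog looks healthy."""
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        # No log file usually means sync_client has fully caught up.
        return True
    return size <= limit


if __name__ == "__main__":
    print("OK" if replication_backlog_ok() else "WARNING: replication lagging")
```

Hooked into Nagios or a cron mail, this catches the "replication bailed out
on a corrupt mailbox" case early, because the log only ever grows then.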

We have ~41,000 accounts and ~13.5 TB of mail. The VMs run in an RHEV
system. Each server has 20 GB RAM and 8 CPU cores; the mails are stored on
EUROstor iSCSI systems with SATA disks.
Recently we migrated the metadata onto a new EUROstor iSCSI system with SSDs.
At the moment we plan to migrate to Cyrus 3.0 in order to use archive
partitions, so that recent mails will be stored on an iSCSI system with SAS
disks and older mails will be moved to the old iSCSI system with SATA disks.

In addition to the disaster recovery plan, we use "expunge_mode: delayed" and
"delete_mode: delayed" plus a normal file-based backup for the "I deleted my
very important mail by accident" use case.
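The delayed-expunge setup described above roughly corresponds to the
following configuration fragment, with a periodic cyr_expire run to actually
purge old data (the retention periods here are illustrative examples, not the
poster's values):

```
# imapd.conf: keep expunged messages and deleted mailboxes around
# instead of removing them immediately
expunge_mode: delayed
delete_mode: delayed

# cyrus.conf, EVENTS section: nightly cleanup -- expire duplicate-
# delivery entries after 3 days, purge delayed-deleted mailboxes and
# delayed-expunged messages after 28 days
delprune cmd="cyr_expire -E 3 -D 28 -X 28" at=0400
```

Until cyr_expire purges them, accidentally deleted mails can be brought back
with unexpunge, without touching the backups at all.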


Regards

Michael Menge


--------------------------------------------------------------------------------
M.Menge Tel.: (49) 7071/29-70316
Universität Tübingen Fax.: (49) 7071/29-5912
Zentrum für Datenverarbeitung mail: ***@zdv.uni-tuebingen.de
Wächterstraße 76
72074 Tübingen

----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
To Unsubscribe:
https://lists.andrew.cmu.edu
