Discussion:
LARGE single-system Cyrus installs?
Vincent Fox
2007-10-04 18:04:31 UTC
Wondering if anyone out there is running a LARGE Cyrus
user-base on a single or a couple of systems?

Let me define large:

25K-30K (or more) users per system
High email activity, say 2+ million emails a day

We have talked to UCSB, which is running 30K users on a single
Sun V490 system. However, they seem to have fairly low activity
levels, with email volume in the hundred-thousands range, not millions.

So far that's the only other place we've talked to. Everyone else
seems to spread out 5K users on a large number of backends.

We have 60K users, and were trying to run them spread across
2 systems and ran into problems when Fall quarter started and
the load just skyrocketed. Email volume today 3.6 million.
John Madden
2007-10-04 18:21:01 UTC
> We have talked to UCSB, which is running 30K users on a single
> Sun V490 system. However they seem to have fairly low activity
> levels with emails in the hundred-thousands range not millions.

We've got around 250k users on a single system, but we're in that same
boat: only about 300k emails/day.

John




--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
***@ivytech.edu
Jim Howell
2007-10-04 18:33:52 UTC
Hi,
We have around 35k users spread out on 5 different systems. The
largest of which has 12K active users and 200K messages per day. We do
our anti-spam/anti-virus on other systems before delivering to the 5
mailbox systems. I'm guessing you don't have that type of setup?
Jim


Vincent Fox wrote:
> Wondering if anyone out there is running a LARGE Cyrus
> user-base on a single or a couple of systems?
>
> Let me define large:
>
> 25K-30K (or more) users per system
> High email activity, say 2+ million emails a day
>
> We have talked to UCSB, which is running 30K users on a single
> Sun V490 system. However they seem to have fairly low activity
> levels with emails in the hundred-thousands range not millions.
>
> So far that's the only other place we've talked to. Everyone else
> seems to spread out 5K users on a large number of backends.
>
> We have 60K users, and were trying to run them spread across
> 2 systems and ran into problems when Fall quarter started and
> the load just skyrocketed. Email volume today 3.6 million.
>
>
> ----
> Cyrus Home Page: http://cyrusimap.web.cmu.edu/
> Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
> List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
>

--
Jim Howell
Cornell University
CIT Messaging Systems Manager
email: ***@cornell.edu
phone: 607-255-9369
Vincent Fox
2007-10-04 18:41:46 UTC
I suppose I should have given a better description:

University mail setup with 60K-ish faculty, staff, and students,
all in one big pool: no separation into this server for faculty
and that one for students, etc.

Load-balanced pools of smallish v240 class servers for:
SMTP
MX
AV/spam scanning
LDAP
mail-proxy redirection (Perdition)

So yeah, we have multiple layers. All the backend Cyrus 2.3.8
systems do is mail-store. We have sendmail on them right now
handing off via LMTP, but are working on moving to delivering
from the MX pool using LMTP over TCP to eliminate that step.
We are not even doing Murder; with the Perdition proxy pool
already in place, we didn't really see a compelling need for it.

Anyhow, just wondering if we're the lone rangers on this particular
edge of the envelope. We alleviated the problem short-term by
recycling some V240-class systems with arrays into Cyrus boxes
with about 3,500 users each, and brought our 2 big Cyrus units
down to 13K-14K users each, which seems to work okay.

We spent some time talking to Ken & Co. at CMU on the phone
about what happens under very high loads, but haven't come to a
"fix" for what happened to us. There may not be one. I can and
will describe all the nitty-gritty of that post-mortem in a post in a
day or two.


Jim Howell wrote:
> Hi,
> We have around 35k users spread out on 5 different systems. The
> largest of which has 12K active users and 200K messages per day. We
> do our anti-spam/anti-virus on other systems before delivering to the
> 5 mailbox systems. I'm guessing you don't have that type of setup?
> Jim
>
>
> Vincent Fox wrote:
>> Wondering if anyone out there is running a LARGE Cyrus
>> user-base on a single or a couple of systems?
>>
>> Let me define large:
>>
>> 25K-30K (or more) users per system
>> High email activity, say 2+ million emails a day
>>
>> We have talked to UCSB, which is running 30K users on a single
>> Sun V490 system. However they seem to have fairly low activity
>> levels with emails in the hundred-thousands range not millions.
>>
>> So far that's the only other place we've talked to. Everyone else
>> seems to spread out 5K users on a large number of backends.
>>
>> We have 60K users, and were trying to run them spread across
>> 2 systems and ran into problems when Fall quarter started and
>> the load just skyrocketed. Email volume today 3.6 million.
>
Dale Ghent
2007-10-04 20:07:06 UTC
On Oct 4, 2007, at 2:41 PM, Vincent Fox wrote:

> We spent some time talking to Ken & Co. at CMU on the phone
> about what happens in very high loads but haven't come to a
> "fix" for what happened to us. There may not be one. I can and
> will describe all the nitty-gritty of that post-mortem in a post in a
> day or two.

One of the things Rob Banz recently did here was to move the data/
config/proc directory from a "real" fs to tmpfs. This reduces the
disk IO from Cyrus process creation/management.

So the way we do stuff here is that each Cyrus backend has its own
ZFS pool. That zpool is divided up into four file systems:

/ms1/data
/ms1/mail
/ms1/meta
/ms1/sieve

"ms1" in this case is the name of the zpool... named after the server
which owns it.
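As a sketch, creating that layout might look like the following (the device names and the mirror layout are my assumptions, not details from this post):

```
# illustrative only: device names and mirroring are assumptions
zpool create ms1 mirror c1t0d0 c2t0d0
zfs create ms1/data
zfs create ms1/mail
zfs create ms1/meta
zfs create ms1/sieve
```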

Additionally, we have /ms1/data/config/proc mounted as a tmpfs file
system. The info in that dir does not need to persist across reboots,
and every time a new Cyrus process is launched, a file (named by pid)
is written there and updated with state info. This produces
unnecessary disk IO, so it's better off being an in-memory filesystem,
which tmpfs is.
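On Solaris, that tmpfs mount could be made persistent with a vfstab entry along these lines (the size cap is an illustrative assumption, not a figure from this thread):

```
# /etc/vfstab: mount the proc dir on tmpfs at boot
swap  -  /ms1/data/config/proc  tmpfs  -  yes  size=64m
```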

The relevant areas of our imapd.conf look like so:

configdirectory: /ms1/data/config
defaultpartition: ms1
metapartition-ms1: /ms1/meta
partition-ms1: /ms1/mail
sievedir: /ms1/sieve


/dale

--
Dale Ghent
Specialist, Storage and UNIX Systems
UMBC - Office of Information Technology
ECS 201 - x51705
Robert Banz
2007-10-04 20:32:57 UTC
>
> One of the things Rob Banz recently did here was to move the data/
> config/proc directory from a "real" fs to tmpfs. This reduces the
> disk IO from Cyrus process creation/management.
>
> So the way we do stuff here is that each Cyrus backend has its own
> ZFS pool. That zpool is divided up into four file systems:
>
> /ms1/data
> /ms1/mail
> /ms1/meta
> /ms1/sieve
>

Just a side note on the partition layout - originally I was thinking
of having the 'meta' partition hold all of the databases/index files.
Figured it'd be nice to be able to tune that zfs fs with a different
recordsize if necessary. However, I changed my mind, and our "meta"
partition only contains the squat index files.

Why? We're doing our backups via ZFS snapshots -- and while we DO
want to snapshot the various meta-files, we're not too interested in
saving all the squat indexes.

It seems to be working pretty well, and having a 7-day backlog of the
data/mail/sieve directories has come in quite handy in recovering
from stupid user tricks, such as wiping out their sieve rules, or
folder deletes.*

* haven't added the delayed-folder-delete patch yet ;)

-rob
Rob Mueller
2007-10-04 23:32:52 UTC
> Anyhow, just wondering if we the lone rangers on this particular
> edge of the envelope. We alleviated the problem short-term by
> recycling some V240 class systems with arrays into Cyrus boxes
> with about 3,500 users each, and brought our 2 big Cyrus units
> down to 13K-14K users each which seems to work okay.

FastMail has many hundreds of thousands of users in a fully replicated
setup spread across 10 backend servers (+ separate MX/Spam/Web/Frontend
servers). We use IBM servers with some off-the-shelf SATA-to-SCSI RAID
DAS (e.g. http://www.areasys.com/area.aspx?m=PSS-6120). Hardware will
die at some stage; that's what replication is for.

Over the years we've tuned a number of things to get the best possible
performance. The biggest things we found:

1. Using the status cache was a big win for us

I did some analysis at one stage, and found that most IMAP clients issue
STATUS calls to every mailbox a user has on a regular basis (every 5 minutes
or so on average, but users can usually change it) so they can update the
unread count on every mailbox. The default status code has to iterate over
the entire cyrus.index file to get the unread count.

Although the cyrus.index file is the smallest file, with tens of
thousands of users connected and clients doing this regularly for every
folder, it basically means you either have to have enough memory to keep
every cyrus.index hot in memory, or every 5-15 minutes you'll be forcing
a re-read of gigabytes of data from disk. Or you need a better way.

The better way was to have a status cache.

http://cyrus.brong.fastmail.fm/#cyrus-statuscache-2.3.8.diff

This helped reduce meta data IO a lot for us.
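For what it's worth, this status cache was later merged into stock Cyrus (around 2.3.12, if memory serves), where it is switched on in imapd.conf roughly like so; treat the exact option name as an assumption if you are on a patched 2.3.8:

```
# imapd.conf: enable the STATUS cache (option name from later stock Cyrus)
statuscache: 1
```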

2. Split your email data + metadata IO

With the 12-drive SATA-to-SCSI arrays, we get 4 x 150G 10k RPM WD Raptor
drives + 8 x (largest you can get) drives. We then build 2 x 2-drive RAID1 +
2 x 4-drive RAID5 arrays. We use the RAID1 arrays for the metadata (cyrus.*
except squatter) and the RAID5 arrays for the email data. We find the
email-to-meta ratio is about 20-to-1, higher if you have squatter files, so
150G of meta will support up to 3T of email data fine.
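A quick back-of-the-envelope check of that sizing, as a shell sketch (the numbers come from the paragraph above; the script itself is just illustrative):

```shell
# 20-to-1 email-to-metadata ratio: how much meta does 3T of email need?
mail_gb=$((3 * 1024))       # 3T of email data, in GB
meta_gb=$((mail_gb / 20))   # divide by the observed 20:1 ratio
echo "${mail_gb}G email -> ~${meta_gb}G meta"   # ~153G, so 150G of Raptors is about right
```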

David Carter
2007-10-05 08:58:14 UTC
On Fri, 5 Oct 2007, Rob Mueller wrote:

> b) make sure you have the right filesystem (on linux, reiserfs is much
> better than ext3 even with ext3s dir hashing) and journaling modes

A data point regarding reiserfs/ext3:

We are in the process of moving from reiserfs to ext3 (with dir_index).

ext3 seems to do substantially better than reiserfs for us, especially for
read-heavy loads (squatter runs at least twice as fast as it used to).

I think that this is partly because ext3 does more aggressive read-ahead
(which would be a mixed blessing under heavy load), and partly because
reiserfs suffers from fragmentation. I imagine that there is probably a
tipping point under the sort of very heavy load that Fastmail sees.

data=ordered in both cases. data=journal didn't seem to make any
difference with ext3. data=journal with reiserfs caused amusing kernel
memory leaks, which it looks like Fastmail also hit recently. A dedicated
journal device would probably make a big difference with data=journal.

--
David Carter Email: ***@ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
John Madden
2007-10-05 14:01:35 UTC
> I think that this is partly because ext3 does more aggressive read ahead
> (which would be a mixed blessing under heavy load), partly because
> reiserfs suffers from fragmentation. I imagine that there is probably a
> tipping point under the sort of very heavy load that Fastmail see.

I second that - reiserfs seems to be truly horrible in write-heavy
situations. Worse, a backup of our remaining reiserfs partition takes
*days* to complete -- 165GB at ~500k/s. And this is a 32-disk stripe of
fibre channel.

Then you see things like this:

http://linux.wordpress.com/2006/09/27/suse-102-ditching-reiserfs-as-it-default-fs/

...And you suddenly have an explanation for its performance issues. For
us, mail delivery is the worst part. I haven't quite figured it out
yet, but moving the three db's (mailboxes, quotas, delivery) to another
disk (on ext3) has greatly improved performance.

reiserfs' recovery tools are awful -- I watched this filesystem "fsck"
over an entire weekend recently with all kinds of nasty warnings. It
seems reiserfs (v3 at least) is a dead product, too.

I still have concerns that moving this remaining reiserfs partition to
ext3 will make matters worse, but I have nothing else to go on.

John



--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
***@ivytech.edu
Robert Banz
2007-10-05 14:34:50 UTC
On Oct 5, 2007, at 10:01, John Madden wrote:

>> I think that this is partly because ext3 does more aggressive read
>> ahead
>> (which would be a mixed blessing under heavy load), partly because
>> reiserfs suffers from fragmentation. I imagine that there is
>> probably a
>> tipping point under the sort of very heavy load that Fastmail see.
>
> I second that - reiserfs seems to be truly horrible in write-heavy
> situations. Worse, a backup of our remaining reiserfs partition takes
> *days* to complete -- 165GB at ~500k/s. And this is a 32-disk
> stripe of
> fibre channel.

I think what truly scares me about reiser is those rather regular
posts to various mailing lists I'm on saying "my reiser fs went poof
and lost all my data, what should I do?"

-rob
Gerard
2007-10-05 17:07:50 UTC
We also came across a situation where we needed to move between file
systems. We ended up going with the Veritas file system; simply changing
the main file system gave us our greatest increase in performance. It was
a big win when we realized the file system actually didn't cost anything.

On 10/5/07, Robert Banz <***@umbc.edu> wrote:
>
> On Oct 5, 2007, at 10:01, John Madden wrote:
>
> >> I think that this is partly because ext3 does more aggressive read
> >> ahead
> >> (which would be a mixed blessing under heavy load), partly because
> >> reiserfs suffers from fragmentation. I imagine that there is
> >> probably a
> >> tipping point under the sort of very heavy load that Fastmail see.
> >
> > I second that - reiserfs seems to be truly horrible in write-heavy
> > situations. Worse, a backup of our remaining reiserfs partition takes
> > *days* to complete -- 165GB at ~500k/s. And this is a 32-disk
> > stripe of
> > fibre channel.
>
> I think what truly scares me about reiser is those rather regular
> posts to various mailing lists I'm on saying "my reiser fs went poof
> and lost all my data, what should I do?"
>
> -rob
>
Rob Mueller
2007-10-06 09:46:29 UTC
> I think what truly scares me about reiser is those rather regular
> posts to various mailing lists I'm on saying "my reiser fs went poof
> and lost all my data, what should I do?"

I've commented on this before. I believe it's absolutely hardware related
rather than reiserfs related.

http://www.mail-archive.com/info-***@lists.andrew.cmu.edu/msg30656.html

If you use hardware that's broken, you should expect to get burnt.

Rob
Rob Mueller
2007-10-06 09:38:49 UTC
> A data point regarding reiserfs/ext3:
>
> We are in the process of moving from reiserfs to ext3 (with dir_index).
>
> ext3 seems to do substantially better than reiserfs for us, especially for
> read heavy loads (squatter runs at least twice as fast as it used do).

Are you comparing an "old" reiserfs partition with a "new" ext3 one where
you've just copied the email over to? If so, that's not a fair comparison.

What we found was that when we first copied the data from reiserfs -> ext3,
everything seemed nice, but that's only because when you copy mailboxes
over, you're effectively defragmenting all the files and directories in your
filesystem. We've actually been thinking of setting up a regular process to
just randomly move users around, because the act of moving them effectively
defragments them regardless of the filesystem you're using!

Give it a month or two of active use though (delivering new emails,
deleting old ones, etc), and everything starts getting fragmented again.
Then ext3 really started going to crap on us. Machines that had been
absolutely fine under reiserfs saw their load blow out to unusable
levels under ext3.

> data=ordered in both cases. data=journal didn't seem to make any
> difference with ext3. data=journal with reiserfs caused amusing kernel
> memory leaks, which it looks like Fastmail also hit recently. An dedicated
> journal device would probably make a big difference with data=journal.

Talking with Chris Mason about this, data=journal is faster in certain
scenarios with lots of small files + fsyncs from different processes,
exactly the type of workload cyrus generates!

As it turns out, the memory leaks weren't critical, because the pages do
seem to be reclaimed when needed, though it was annoying not knowing exactly
how much memory was really free/used. The biggest problem was that with
cyrus you have millions of small files, and with a 32-bit linux kernel the
inode cache has to be in low memory, severely limiting how many files the OS
will cache.

See this blog post for the gory details, and why a 64-bit kernel was a nice
win for us.

http://blog.fastmail.fm/2007/09/21/reiserfs-bugs-32-bit-vs-64-bit-kernels-cache-vs-inode-memory/

Rob
David Carter
2007-10-06 10:03:29 UTC
On Sat, 6 Oct 2007, Rob Mueller wrote:

> Are you comparing an "old" reiserfs partition with a "new" ext3 one where
> you've just copied the email over to? If so, that's not a fair comparison.

No, newly created partitions in both cases. Fragmented partitions are
slower still, of course.

> Give it a month or two of active use though (delivering new emails,
> deleting old ones, etc), and everything starts getting fragmented again.
> Then ext3 really started going to crap on us. Machines that had been
> absolutely fine under reiserfs, the load just blew out to unuseable
> under ext3.

We've only been using ext3 for about 3 months now, so I may still have
this to look forward to :).

> Talking with Chris Mason about this, data=journal is faster in certain
> scenarios with lots of small files + fsyncs from different processes,
> exactly the type of workload cyrus generates!

I can't see much difference on our Cyrus systems, but battery-backed write
cache on our RAID controllers probably masks a lot of the change. I agree
that in theory it should make a very substantial difference.

> As it turns out, the memory leaks weren't critical, because the the
> pages do seem to be reclaimed when needed, though it was annoying not
> knowing exactly how much memory was really free/used.

Okay, I think that we had a different kernel memory bug.

We were running out of memory after 24 hours, and a 20 line test program
could exhaust memory in seconds. This bug was in SLES four years back, and
it was still there the last time that I looked (some months back now).

--
David Carter Email: ***@ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Rob Mueller
2007-10-06 10:08:42 UTC
>> Are you comparing an "old" reiserfs partition with a "new" ext3 one where
>> you've just copied the email over to? If so, that's not a fair
>> comparison.
>
> No, a newly created partitions in both cases. Fragmented partitions are
> slower still of course.

That's strange. What mount options are/were you using? We use/used:

reiserfs - rw,noatime,nodiratime,notail,data=journal
ext3 - noatime,nodiratime,data=journal

If you weren't using "notail" on reiserfs, that would definitely have a
performance impact.
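Spelled out as /etc/fstab entries, those option sets would look roughly like this (the device and mount point are hypothetical):

```
# hypothetical fstab lines showing the two option sets (pick one fs)
/dev/sdb1  /var/spool/cyrus  reiserfs  rw,noatime,nodiratime,notail,data=journal  0 0
/dev/sdb1  /var/spool/cyrus  ext3      noatime,nodiratime,data=journal            0 0
```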

> Okay, I think that we had a different kernel memory bug.
>
> We were running out of memory after 24 hours, and a 20 line test program
> could exhaust memory in seconds. This bug was in SLES four years back, and
> it was still there the last time that I looked (some months back now).

Wow, weird - must be something different. What kernel was it? Do you know
where the memory leak was occurring?

We never encountered anything that would exhaust memory like that. We did
encounter a bug which caused a deadlock inversion problem and would cause
processes to get stuck in D state. It appears the code path that causes
this was totally removed sometime between 2.6.20 and 2.6.22, so I think we
can finally drop that 1-line patch we've had to apply for the last couple
of years...

Rob
David Carter
2007-10-06 10:19:20 UTC
On Sat, 6 Oct 2007, Rob Mueller wrote:

> That's strange. What mount options are/were you using? We use/used:
>
> reiserfs - rw,noatime,nodiratime,notail,data=journal
> ext3 - noatime,nodiratime,data=journal

Same, but data=ordered in both cases

> If you weren't using "notail" on reiserfs, that would definitely have a
> performance impact.

Definitely using notail.

> Wow weird, must be something different. What kernel was it? Do you know
> where the memory leak was occuring?

Standard SLES kernels for SLES9.

The memory leak could be shown by mmap() on a single file (see attachment).
Kernel memory explodes, and nothing is released when the program exits.

--
David Carter Email: ***@ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Vincent Fox
2007-10-06 16:26:38 UTC
Rob Mueller wrote:
>> We are in the process of moving from reiserfs to ext3 (with dir_index).
>>
>>

ZFS with mirrors across 2 separate storage devices means never having
to say you're sorry.

I sleep very well at night.
Todd Lyons
2007-10-08 18:49:09 UTC
On Sat, Oct 06, 2007 at 09:26:38AM -0700, Vincent Fox wrote:

>ZFS with mirrors across 2 separate storage devices, means never having
>to say you're sorry.

Are you using it under Linux/Fuse or OpenSolaris or other?
--
Regards... Todd
Open Source: The concept of standing on others' shoulders instead of toes.
Linux kernel 2.6.17-6mdv 3 users, load average: 0.14, 0.22, 0.15
Andrew Morgan
2007-10-09 22:34:44 UTC
On Sat, 6 Oct 2007, Rob Mueller wrote:

> As it turns out, the memory leaks weren't critical, because the the pages do
> seem to be reclaimed when needed, though it was annoying not knowing exactly
> how much memory was really free/used. The biggest problem was that with
> cyrus you have millions of small files, and with a 32bit linux kernel the
> inode cache has to be in low memory, severely limiting how many files the OS
> will cache.
>
> See this blog post for the gory details, and why a 64-bit kernel was a nice
> win for us.
>
> http://blog.fastmail.fm/2007/09/21/reiserfs-bugs-32-bit-vs-64-bit-kernels-cache-vs-inode-memory/

Yesterday I checked my own Cyrus servers to see if I was running out of
lowmem, and it sure looked like it. Lowmem had only a couple MB free, and
I had 2GB of free memory that was not being used for cache.

I checked again today and everything seems to be fine - 150MB of lowmem
free and almost no free memory (3GB cached)! Grrr.

Anyways, I was looking into building a 64-bit kernel. I'm running Debian
Sarge (I know, old) on a Dell 2850 with Intel Xeon (Nocona) CPUs and 4GB
RAM. My kernel version is 2.6.14.5, built from kernel.org sources. It
has "High Memory Support (64GB)" selected.

When I run menuconfig, I'm not seeing any obvious place to switch from
32-bit to 64-bit. Could you elaborate a bit about how you switched to a
64-bit kernel? Also, are you running a full 64-bit distro, or just a
64-bit kernel?

Thanks,
Andy
David Lang
2007-10-09 22:32:35 UTC
On Tue, 9 Oct 2007, Andrew Morgan wrote:

> On Sat, 6 Oct 2007, Rob Mueller wrote:
>
>> As it turns out, the memory leaks weren't critical, because the the pages do
>> seem to be reclaimed when needed, though it was annoying not knowing exactly
>> how much memory was really free/used. The biggest problem was that with
>> cyrus you have millions of small files, and with a 32bit linux kernel the
>> inode cache has to be in low memory, severely limiting how many files the OS
>> will cache.
>>
>> See this blog post for the gory details, and why a 64-bit kernel was a nice
>> win for us.
>>
>> http://blog.fastmail.fm/2007/09/21/reiserfs-bugs-32-bit-vs-64-bit-kernels-cache-vs-inode-memory/
>
> Yesterday I checked my own Cyrus servers to see if I was running out of
> lowmem, and it sure looked like it. Lowmem had only a couple MB free, and
> I had 2GB of free memory that was not being used for cache.
>
> I checked again today and everything seems to be fine - 150MB of lowmem
> free and almost no free memory (3GB cached)! Grrr.
>
> Anyways, I was looking into building a 64-bit kernel. I'm running Debian
> Sarge (I know, old) on a Dell 2850 with Intel Xeon (Nocona) CPUs and 4GB
> RAM. My kernel version is 2.6.14.5, built from kernel.org sources. It
> has "High Memory Support (64GB)" selected.
>
> When I run menuconfig, I'm not seeing any obvious place to switch from
> 32-bit to 64-bit. Could you elaborate a bit about how you switched to a
> 64-bit kernel? Also, are you running a full 64-bit distro, or just a
> 64-bit kernel?

You need a full 64-bit toolchain to compile a 64-bit kernel; the easy way
to do this is to compile the kernel on a 64-bit distro.

If you have the toolchain, you can add ARCH=x86_64 to your make command.

If you are not converting everything over to 64-bit, remember to enable
32-bit userspace support (cyrus won't take advantage of all the RAM, but
the kernel will, so it's definitely still a win).

With some older versions of the iptables binaries you can run into trouble
with a 64-bit kernel and 32-bit userspace. Unless you have made sure that
you aren't running versions with this problem, don't execute any iptables
commands when running in mixed mode.
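An outline of the build, for the record (a sketch under the assumptions above, not a tested recipe):

```
# on a 64-bit distro, or with a 64-bit cross toolchain installed:
make ARCH=x86_64 menuconfig        # select "IA32 Emulation" for 32-bit userspace
make ARCH=x86_64
make ARCH=x86_64 modules_install install
```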

David Lang
Rob Mueller
2007-10-09 23:10:54 UTC
> Yesterday I checked my own Cyrus servers to see if I was running out of
> lowmem, and it sure looked like it. Lowmem had only a couple MB free, and
> I had 2GB of free memory that was not being used for cache.
>
> I checked again today and everything seems to be fine - 150MB of lowmem
> free and almost no free memory (3GB cached)! Grrr.

I'm guessing it can depend on recent operations.
Michael D. Sofka
2007-10-05 19:25:23 UTC
On Thursday 04 October 2007 07:32:52 pm Rob Mueller wrote:
> 4. Lots of other little things
>
> a) putting the proc dir on tmpfs is a good idea
> b) make sure you have the right filesystem (on linux, reiserfs is much
> better than ext3 even with ext3s dir hashing) and journaling modes

On a Murder front-end server, could the tls_sessions.db be put on a
tmpfs? What about mailboxes.db, since the murder master would have
the master copy anyway? (This would slow down startup in the case
of the system losing power. But ``the front end servers can be
considered 'dataless' '' according to: http://cyrusimap.web.cmu.edu/ag.html)

We have a Murder aggregate with two front-end and three back-end
servers, plus the master. I've noticed the front-end servers are a little
IO-bound. They each have a single disk, and I discovered that half
of the IO wait went away when I buffered the cyrus.log file in syslogd.
But they still show an average of 5-6% IO wait for processes.

Moving imap/proc to tmpfs, however, had a negligible effect.

I'll spec a two-disk system when new front-ends are ordered,
but that would only split system from cyrus. Would it make more
sense (and, more importantly, would it work and not foobar us) to order
a machine with more memory and put configdirectory: on tmpfs? (With
the possible exception of the db snapshots and mboxlist backups.)

Mike

--
Michael D. Sofka ***@rpi.edu
C&MT Sr. Systems Programmer, Email, TeX, Epistemology
Rensselaer Polytechnic Institute, Troy, NY. http://www.rpi.edu/~sofkam/
Dan White
2007-11-08 15:56:54 UTC
Michael D. Sofka wrote:
> On Thursday 04 October 2007 07:32:52 pm Rob Mueller wrote:
>> 4. Lots of other little things
>>
>> a) putting the proc dir on tmpfs is a good idea
>> b) make sure you have the right filesystem (on linux, reiserfs is much
>> better than ext3 even with ext3s dir hashing) and journaling modes
>
> On a Murder front-end server, could the tls_sessions.db be put on a
> tmpfs? What about mailboxes.db, since the murder master would have
> the master copy anyway. (This would slowdown startup in the case
> of the system loosing power. But, ``the front end servers can be
> considered 'dataless' '' according to: http://cyrusimap.web.cmu.edu/ag.html)
>
> We have a Murder Aggregate with two front-end and three back-end
> servers, and the master. I've noticed the front-end servers are a little
> IO bound. They each have a single disk, and I discovered that half
> of the IO wait went away when I buffered the cyrus.log file in syslogd.
> But, they still show an average of 5-6% IO Wait for processes.
>
> Moving imap/proc to tmpfs, however, had a negligible effect.
>
> I'll spec a two-disk system when new front-end's are ordered,
> but that would only split system from cyrus. Would it make more
> sense (and, more importantly, would it work and not foobar us) to order
> a machine with more memory, and put configdirectory: on tmpfs? (With
> the possible exception of the db snapshots and mboxlist backups.)
>
> Mike
>

Hi Mike,

While reviewing this thread for optimization ideas, I came across
your comment about buffering the cyrus.log in syslogd. Could you
explain what you did to configure that?

Thanks,
- Dan White
BTC Broadband
Alain Spineux
2007-11-08 18:07:27 UTC
On Nov 8, 2007 4:56 PM, Dan White <***@olp.net> wrote:
> Michael D. Sofka wrote:
> > On Thursday 04 October 2007 07:32:52 pm Rob Mueller wrote:
> >> 4. Lots of other little things
> >>
> >> a) putting the proc dir on tmpfs is a good idea
> >> b) make sure you have the right filesystem (on linux, reiserfs is much
> >> better than ext3 even with ext3s dir hashing) and journaling modes
> >
> > On a Murder front-end server, could the tls_sessions.db be put on a
> > tmpfs? What about mailboxes.db, since the murder master would have
> > the master copy anyway. (This would slowdown startup in the case
> > of the system loosing power. But, ``the front end servers can be
> > considered 'dataless' '' according to: http://cyrusimap.web.cmu.edu/ag.html)
> >
> > We have a Murder Aggregate with two front-end and three back-end
> > servers, and the master. I've noticed the front-end servers are a little
> > IO bound. They each have a single disk, and I discovered that half
> > of the IO wait went away when I buffered the cyrus.log file in syslogd.
> > But, they still show an average of 5-6% IO Wait for processes.
> >
> > Moving imap/proc to tmpfs, however, had a negligible effect.
> >
> > I'll spec a two-disk system when new front-end's are ordered,
> > but that would only split system from cyrus. Would it make more
> > sense (and, more importantly, would it work and not foobar us) to order
> > a machine with more memory, and put configdirectory: on tmpfs? (With
> > the possible exception of the db snapshots and mboxlist backups.)
> >
> > Mike
> >
>
> Hi Mike,
>
> While reviewing this thread for optimization ideas, I came across
> your comment about buffering the cyrus.log in syslogd. Could you
> explain what you did to configure that?

Syslogd flushes every write to disk. You can ask syslogd to omit
syncing a file by prefixing its name with "-" in syslog.conf. See the
man page for more details.
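For example, assuming Cyrus logs to the local6 facility (the facility is an assumption here; it varies by site, so check your existing syslog.conf):

```
# /etc/syslog.conf: the leading "-" disables fsync after each message
local6.*    -/var/log/cyrus.log
```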

Regards


>
> Thanks,
> - Dan White
> BTC Broadband
>



--
Alain Spineux
aspineux gmail com
May the sources be with you
Vincent Fox
2007-11-08 18:18:04 UTC
Permalink
To close the loop since I started this thread:

We still haven't finished up the contract to get Sun out here to
get to the REAL bottom of the problem.

However, observationally we find that under high email usage, things
get really bad above 10K users per Cyrus instance. Last week we had
a T2000 at about 10,500 users with loads of 5+ and it was bogging
down. We moved 1K users off, bringing it down to 9,500, and loads
dropped to around 1.0 and everything was fine.

We have investigated doing this with Solaris Zones, with multiple instances
of 5K users each on a single system, and it seems like a workable idea.
However the PITA of working out patch procedures for all the zones
and the system itself, and its cluster partner, seems too intensive.

Our latest line of investigation goes back to the Fastmail suggestion:
simply run multiple Cyrus binary instances on a system, each with its
own config and its own ZFS filesystems out of the pool.
Since we can bring up a virtual interface for each instance we won't even
have to bother with using separate port numbers, etc.
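
A rough sketch of what one such instance's cyrus.conf might look like,
assuming master's -C/-M options and the listen="address:port" syntax; the
addresses, paths and instance names here are purely illustrative:

```
# /etc/cyrus-a.conf -- one cyrus.conf per instance
START {
  recover     cmd="ctl_cyrusdb -C /etc/imapd-a.conf -r"
}
SERVICES {
  imap        cmd="imapd -C /etc/imapd-a.conf" listen="192.0.2.11:imap" prefork=5
  lmtp        cmd="lmtpd -C /etc/imapd-a.conf" listen="192.0.2.11:lmtp" prefork=1
}
EVENTS {
  checkpoint  cmd="ctl_cyrusdb -C /etc/imapd-a.conf -c" period=30
}
```

Each instance would then get its own master process, e.g.
/usr/cyrus/bin/master -C /etc/imapd-a.conf -M /etc/cyrus-a.conf, with its
imapd.conf pointing configdirectory and the partitions at that instance's
own ZFS filesystems.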
Rob Mueller
2007-11-08 22:50:07 UTC
Permalink
> However, observationally we find that under high email usage, things
> get really bad above 10K users per Cyrus instance. Last week we had
> a T2000 at about 10,500 users with loads of 5+ and it was bogging
> down. We moved 1K users off, bringing it down to 9,500, and loads
> dropped to around 1.0 and everything was fine.

How much disk space and how many peak simultaneous imap
connections do those 10k users represent on your system?

It's just that I imagine the number of actual users on a system isn't the
pure bottleneck per se; it's really some relationship between
number-of-users (more users means more email deliveries) and
number-of-users * percent-that-are-connected (more users that connect more
often means more imapd processes).

Rob
Bron Gondwana
2007-11-09 00:01:29 UTC
Permalink
On Thu, Nov 08, 2007 at 10:18:04AM -0800, Vincent Fox wrote:
> Our latest line of investigation goes back to the Fastmail suggestion:
> simply run multiple Cyrus binary instances on a system, each with its
> own config and its own ZFS filesystems out of the pool.
> Since we can bring up a virtual interface for each instance we won't even
> have to bother with using separate port numbers, etc.

Also, virtual interfaces mean you can move an instance without having
to tell anyone else about it (but it sounds like you're going with an
"all eggs in one basket" approach anyway).

Bron.
Blake Hudson
2007-11-09 01:26:35 UTC
Permalink
Bron Gondwana wrote:
> On Thu, Nov 08, 2007 at 10:18:04AM -0800, Vincent Fox wrote:
>
>> Our latest line of investigation goes back to the Fastmail suggestion:
>> simply run multiple Cyrus binary instances on a system, each with its
>> own config and its own ZFS filesystems out of the pool.
>> Since we can bring up a virtual interface for each instance we won't even
>> have to bother with using separate port numbers, etc.
>>
>
> Also virtual interfaces means you can move an instance without having
> to tell anyone else about it (but it sounds like you're going with an
> "all eggs in one basket" approach anyway)
>
> Bron.
>
Solaris zones have been mentioned a few times with regard to
large Cyrus installs. Has anybody tried running Cyrus under Xen
virtualization on Linux to achieve similar goals, whether in
production, on a single system, or to simulate a Murder or other
environment?

I'd be interested in a way to accommodate ~100,000 users under one
domain (with the scalability to at least double in size). Having little
experience with Murder, I am favoring a perdition/nginx setup using
multiple backends, each at ~10k users. This seems the most
straightforward to me, as I can easily use simple algorithms compatible
with postfix/perdition/cyrus/my brain to create mailboxes and route
mail/connections to an appropriate backend Cyrus server, and I can
easily see the design scaling past 200,000 users.

However, if I can use virtualization to increase capacity or reduce costs,
it would certainly be a worthwhile endeavor to test.

-B
Vincent Fox
2007-11-09 02:12:33 UTC
Permalink
Bron Gondwana wrote:
> Also virtual interfaces means you can move an instance without having
> to tell anyone else about it (but it sounds like you're going with an
> "all eggs in one basket" approach anyway)
>
>

No, not "all eggs in one basket", but better usage of resources.

It seems silly to spend all the money for a T2000 with redundancies
and SAN and so on, and then have it choke up when it hits (for us) about
10K users. Everyone we talk to scratches their head at why this
system, with all its cores, would choke. We have found even older Sun
V210s are handling 4K users with negligible load.

Our working hypothesis is that Cyrus itself is choking at a certain
activity level, due to bottlenecks from simultaneous access to some shared
resource within each instance.

By keeping instances down to around 5K users each, maybe we can fit 4
instances onto a T2000 and thus have 20K users per server, which would
be better cost efficiency than the current 10K per server.

And yes, if we need to move an instance to another backend because its
activity level has increased radically, we could do that: just use ZFS
snapshots to send and receive the filesystems, and move the virtual IP
for that instance over to the new server.
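
The move itself might look roughly like this (pool, interface and host
names are illustrative, and the exact zfs/ifconfig invocations would need
checking against your Solaris release):

```
# initial full copy while the instance is still running
zfs snapshot mpool/cyrus-a@migrate1
zfs send mpool/cyrus-a@migrate1 | ssh newhost zfs receive mpool/cyrus-a

# quiesce the instance, then send a small incremental to catch up
zfs snapshot mpool/cyrus-a@migrate2
zfs send -i migrate1 mpool/cyrus-a@migrate2 | ssh newhost zfs receive mpool/cyrus-a

# finally move the instance's virtual IP (Solaris logical interface syntax)
ifconfig bge0 removeif 192.0.2.11
ssh newhost 'ifconfig bge0 addif 192.0.2.11 up'
```

The incremental send keeps the downtime to roughly the time needed to copy
the most recent changes, rather than the whole spool.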

We have 2 T2000 setups and are about to add a 3rd one.
For now over half our user-base is spread across older V210 systems
that were scavenged up when Fall Quarter crushed our original vision.
We are hoping to migrate users off the ancient/unsupported
hardware onto the newer machines over the next few months.
Pascal Gienger
2007-11-09 07:24:34 UTC
Permalink
Vincent Fox <***@ucdavis.edu> wrote:

> Our working hypothesis is that CYRUS is what is choking up at a certain
> activity level due to bottlenecks with simultaneous access to some shared
> resource for each instance.

Did you do a

lockstat -Pk sleep 30

(with "-x destructive" when it complains about the system being
unresponsive)?


We had that result, among others:

Adaptive mutex block: 2339 events in 30.052 seconds (78 events/sec)

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
778 79% 79% 0.00 456354473 0xffffffffa4867730 zfs_zget
61 6% 85% 0.00 466021696 0xffffffffa4867130 zfs_zget
8 1% 87% 0.00 748812180 0xffffffffa4867780 zfs_zget
26 1% 88% 0.00 200187703 0xffffffff9cf97598 dmu_object_alloc
2 1% 89% 0.00 1453472066 0xffffffffa4867de0 zfs_zget
12 1% 89% 0.00 204437906 0xffffffffa4863ad8 dmu_object_alloc
4 1% 90% 0.00 575866919 0xffffffffa4867838 zfs_zinactive
5 1% 90% 0.00 458982547 0xffffffffa48677b8 zfs_zget
4 1% 91% 0.00 563367350 0xffffffffa4867868 zfs_zinactive
3 0% 91% 0.00 629688255 0xffffffffa48677b0 zfs_zinactive

Nearly all lock contention is caused by ZFS. The SAN disk system is NOT
the bottleneck, though: it has average service times of 5-8 ms and no
wait queue.

456354473 nsec is 0.456 sec; that is *LONG*.


What's also interesting is tracing open() calls via DTrace.
Just use this:

#!/usr/sbin/dtrace -s
#pragma D option destructive
#pragma D option quiet

/* record entry time and the user address of the filename */
syscall::open:entry
{
        self->ts = timestamp;
        self->filename = arg0;
}

/* on return, print the duration in nanoseconds plus the filename,
   and build a histogram of all open() durations */
syscall::open:return
/self->ts > 0/
{
        this->duration = timestamp - self->ts;
        printf("%10d %s\n", this->duration, copyinstr(self->filename));
        @["open duration"] = quantize(this->duration);
        self->ts = 0;
}

It will show you all files opened and the time needed (in nanosecs) to
accomplish that. After hitting CTRL-C, it will summarize:

open duration
value ------------- Distribution ------------- count
1024 | 0
2048 |@ 80
4096 |@@@@@@@@@@@@@@@@@@@@@ 1837
8192 |@@@@@@ 521
16384 |@@@@@@@ 602
32768 |@@@ 229
65536 |@ 92
131072 | 2
262144 | 0
524288 | 1
1048576 | 1
2097152 | 1
4194304 | 3
8388608 | 12
16777216 |@ 51
33554432 | 38
67108864 | 25
134217728 | 9
268435456 | 2
536870912 | 3
1073741824 | 0

You can see the ARC (in-memory cache) activity from 4-65 microseconds and
disk activity from 8-33 ms. And you see some "big hits" from 0.13 to
0.5 secs (!). This is far too much, and I have not figured out why it is
happening. As more users connect, these really long opens become more
and more frequent.

We have a Postfix spool running on the same machine, and we got some
relief by deactivating its directory hashing scheme. ZFS seems to get
very "angry" about deep directory structures. But still, these "long
opens" do occur.


Pascal
Michael D. Sofka
2007-11-08 21:48:06 UTC
Permalink
On Thursday 08 November 2007 10:56:54 am Dan White wrote:
> Michael D. Sofka wrote:
> > On Thursday 04 October 2007 07:32:52 pm Rob Mueller wrote:
> >> 4. Lots of other little things
> >>
> >> a) putting the proc dir on tmpfs is a good idea
> >> b) make sure you have the right filesystem (on linux, reiserfs is much
> >> better than ext3 even with ext3's dir hashing) and journaling modes
> >
> > On a Murder front-end server, could the tls_sessions.db be put on a
> > tmpfs? What about mailboxes.db, since the murder master would have
> > the master copy anyway. (This would slow down startup in the case
> > of the system losing power. But, ``the front end servers can be
> > considered 'dataless' '' according to:
> > http://cyrusimap.web.cmu.edu/ag.html)
> >
> > We have a Murder Aggregate with two front-end and three back-end
> > servers, and the master. I've noticed the front-end servers are a
> > little IO bound. They each have a single disk, and I discovered that
> > half of the IO wait went away when I buffered the cyrus.log file in
> > syslogd. But, they still show an average of 5-6% IO Wait for processes.
> >
> > Moving imap/proc to tmpfs, however, had a negligible effect.
> >
> > I'll spec a two-disk system when new front-ends are ordered,
> > but that would only split system from cyrus. Would it make more
> > sense (and, more importantly, would it work and not foobar us) to order
> > a machine with more memory, and put configdirectory: on tmpfs? (With
> > the possible exception of the db snapshots and mboxlist backups.)
> >
> > Mike
>
> Hi Mike,
>
> While reviewing this thread for optimization ideas, I came across
> your comment about buffering the cyrus.log in syslogd. Could you
> explain what you did to configure that?
>
> Thanks,
> - Dan White
> BTC Broadband

Put a '-' in front of the file name. E.g.:

local6.* -/var/log/cyrus.log


Mike

--
Michael D. Sofka ***@rpi.edu
C&MT Sr. Systems Programmer, Email, TeX, Epistemology
Rensselaer Polytechnic Institute, Troy, NY. http://www.rpi.edu/~sofkam/
Rafael Mahecha
2007-10-04 18:43:34 UTC
Permalink
We run a single Dell 2850 (2 dual-core CPUs @ 2.8GHz, 8GB RAM and 900GB
internal storage) with about 29k users... but our message transfer load
is much smaller than what you describe... maybe on the order of 10k...
the system is at 80%+ idle most of the time.

We have had this setup for about 1 yr now... no major problems.
--
Rafael Mahecha
E-mail Administrator
Office of Information Management
Jackson State University

JSU e-Center
1230 Raymond Road
Jackson, MS 39204

***@jsums.edu
(601)-979-1783
http://www.jsums.edu

On Thu, October 4, 2007 1:04 pm, Vincent Fox wrote:
>
> Wondering if anyone out there is running a LARGE Cyrus
> user-base on a single or a couple of systems?
>
> Let me define large:
>
> 25K-30K (or more) users per system
> High email activity, say 2+ million emails a day
>
> We have talked to UCSB, which is running 30K users on a single
> Sun V490 system. However they seem to have fairly low activity
> levels with emails in the hundred-thousands range not millions.
>
> So far that's the only other place we've talked to. Everyone else
> seems to spread out 5K users on a large number of backends.
>
> We have 60K users, and were trying to run them spread across
> 2 systems and ran into problems when Fall quarter started and
> the load just skyrocketed. Email volume today 3.6 million.
Xue, Jack C
2007-10-04 22:26:59 UTC
Permalink
At Marshall University, we have 30K users (200MB quota) on Cyrus. We use
a Murder aggregation setup which consists of 2 frontend nodes, 2 backend
nodes and a master node (all Dell 1855 blades). We then further
divide the users into 2 storage partitions on each backend (4 Cyrus
partitions in total, all mounted on an EMC SAN).

We have never experienced performance problems. The daily mail processed
on our SMTP gateway usually amounts to 2 million messages; 90%-95% of
them are SPAM and are rejected/quarantined at the gateway. So, our
internal servers only process 100,000 to 200,000 messages a day.


Jack C. Xue

-----Original Message-----
From: info-cyrus-***@lists.andrew.cmu.edu
[mailto:info-cyrus-***@lists.andrew.cmu.edu] On Behalf Of Vincent
Fox
Sent: Thursday, October 04, 2007 2:05 PM
To: info-***@lists.andrew.cmu.edu
Subject: LARGE single-system Cyrus installs?


Wondering if anyone out there is running a LARGE Cyrus
user-base on a single or a couple of systems?

Let me define large:

25K-30K (or more) users per system
High email activity, say 2+ million emails a day

We have talked to UCSB, which is running 30K users on a single
Sun V490 system. However they seem to have fairly low activity
levels with emails in the hundred-thousands range not millions.

So far that's the only other place we've talked to. Everyone else
seems to spread out 5K users on a large number of backends.

We have 60K users, and were trying to run them spread across
2 systems and ran into problems when Fall quarter started and
the load just skyrocketed. Email volume today 3.6 million.
Vincent Fox
2007-10-04 22:33:58 UTC
Permalink
Xue, Jack C wrote:
> At Marshall University, We have 30K users (200M quota) on Cyrus. We use
> a Murder Aggregation Setup which consists of 2 frontend node, 2 backend
> nodes
Interesting, but this is approximately 15K users per backend, which is
where we are now, after 30K users per backend were crushed. I am much
more interested in exploring whether Cyrus hits some tipping point where
a single backend cannot handle more than X users.

That is our hypothesis right now: that the application has certain limits,
and if you go beyond a certain number of very active users on a
single backend, bad things happen.
Bron Gondwana
2007-10-05 00:08:34 UTC
Permalink
On Thu, Oct 04, 2007 at 03:33:58PM -0700, Vincent Fox wrote:
>
>
> Xue, Jack C wrote:
> > At Marshall University, We have 30K users (200M quota) on Cyrus. We use
> > a Murder Aggregation Setup which consists of 2 frontend node, 2 backend
> > nodes
> Interesting, but this is approximately 15K users per backend. Which is
> where we are now after 30K users per backend were crushed. I am much
> more interested in exploring whether Cyrus hits some tipping point where
> a single backend cannot handle more than X users.
>
> That is our hypothesis right now, that the application has certain limits
> and if you go beyond a certain number of very active users on a
> single backend bad things happen.

We ran over 100,000 users on a single backend for over a year without
problems, but then we had a RAID array failure (3 disks within a day)
with 2TB of data on a single RAID unit, and we learned that users don't
like it when it takes a week to rebuild their email server because you
just can't push the data any faster than that. These new big drives
take a LONG time to fill!

We don't do big instances like that any more - they're just too
unwieldy.

Bron.
David Carter
2007-10-05 08:53:16 UTC
Permalink
On Fri, 5 Oct 2007, Bron Gondwana wrote:

> We ran over 100,000 users on a single backend for over a year without
> problems, but then we had a RAID array failure (3 disks within a day)
> with 2Tb of data on a single RAID unit

We have pretty much given up on RAID 5 because of the reconstruct times
with large disks. Our new systems are 12 disk RAID 10 (plus hot spares).

I think that gives about the same usable capacity as your 2 x RAID5 +
2 x RAID1 setup, but better redundancy. There would be twice as much
work to restore the single RAID 10 set if it failed.

I plan some experiments with split meta next year. My gut feeling is that
12 slow disks will be better than 4 faster disks given the short command
queues in SATA NCQ, but I'm entirely willing to be proved wrong. Multiple
partitions would certainly help with any bottlenecks at the VFS layer.

I suppose that 8 SATA disks for the data and four 15k SAS disks for the
metadata would be a good mix.
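
For reference, the split-meta layout under discussion is configured in
imapd.conf along these lines (this assumes Cyrus 2.3's metapartition
support; the paths are illustrative):

```
# imapd.conf -- keep the hot cyrus.* metadata on the fast spindles
partition-default:     /var/spool/imap        # message files, e.g. on the SATA array
metapartition-default: /var/spool/imap-meta   # cyrus.* files, e.g. on the 15k SAS array
metapartition_files:   header index cache expunge squat
```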

--
David Carter Email: ***@ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Rob Mueller
2007-10-06 09:30:18 UTC
Permalink
> I suppose that 8 SATA disks for the data and four 15k SAS disks for the
> metadata would be a good mix.

Yes. As I mentioned, our iostat data shows that meta-data is MUCH hotter
than email spool data.

---
Checking iostat, a rough estimate shows meta data get 2 x the rkB/s and 3 x
the wkB/s vs the email spool, even though it's 1/20th the data size and we
have the status cache patch! Basically the meta data is "very hot", so
optimising access to it is important.
---

Splitting was definitely a big win for us.

Rob
Wesley Craig
2007-10-05 18:33:19 UTC
Permalink
On 04 Oct 2007, at 18:33, Vincent Fox wrote:
> Interesting, but this is approximately 15K users per backend.
> Which is
> where we are now after 30K users per backend were crushed. I am much
> more interested in exploring whether Cyrus hits some tipping point
> where
> a single backend cannot handle more than X users.

Cyrus has lots of inherent limits, but nothing as concrete as what
you're hypothesizing.
Vincent Fox
2007-10-05 19:22:34 UTC
Permalink
The iostat and sar data disagree with this being an I/O issue.

16 gigs of RAM with about 4-6 of it being used for Cyrus
leaves plenty for ZFS caching. Our hardware seemed more than
adequate to anyone we described it to.

Yes, beyond that it's anyone's guess.
Rob Mueller
2007-10-06 09:50:06 UTC
Permalink
> The iostat and sar data disagree with this being an I/O issue.
>
> 16 gigs of RAM with about 4-6 of it being used for Cyrus
> leaves plenty for ZFS caching. Our hardware seemed more than
> adequate to anyone we described it to.
>
> Yes, beyond that it's anyone's guess.

If it wasn't IO limit related, and it wasn't CPU limit related, then there
must be some other single resource that things were contending for.

My only guess then is it's some global kernel lock or the like.

When the load skyrocketed, it must have shown that lots of processes were
not in S (sleep) state. Were they in R (run) state or D (io wait) state?
Since you're on Solaris, you could use DTrace to find out what they were
actually doing/waiting for...

Rob
Wesley Craig
2007-10-06 17:58:36 UTC
Permalink
Personally, I've seen Solaris bottlenecking on file opens in large
directories. This was a while ago, but it was one of the major
reasons we switched to Linux -- the order-of-magnitude improvement in
directory scaling was sure handy for 80-90K users with no quota. The
kind of blocking I'm talking about didn't show up in sar or iostat,
despite being "IO" in nature. Of course it was some sort of
algorithmic inefficiency in the directory data structures, not the
speed of moving data on & off the disk. As a general statement, though,
finding bottlenecks like the one UC Davis is complaining about is done
by process sampling. No guessing required.

:wes

On 06 Oct 2007, at 05:50, Rob Mueller wrote:
>> The iostat and sar data disagree with this being an I/O issue.
>>
>> 16 gigs of RAM with about 4-6 of it being used for Cyrus
>> leaves plenty for ZFS caching. Our hardware seemed more than
>> adequate to anyone we described it to.
>>
>> Yes, beyond that it's anyone's guess.
>
> If it wasn't IO limit related, and it wasn't CPU limit related,
> then there must be some other single resource that things were
> contending for.
>
> My only guess then is it's some global kernel lock or the like.
>
> When the load skyrocketed, it must have shown that lots of
> processes were not in S (sleep) state. Were they in R (run) state
> or D (io wait) state? Since you're on Solaris, you could use DTrace
> to find out what they were actually doing/waiting for...
Dale Ghent
2007-10-08 18:56:15 UTC
Permalink
On Oct 6, 2007, at 5:50 AM, Rob Mueller wrote:

> If it wasn't IO limit related, and it wasn't CPU limit related,
> then there
> must be some other single resource that things were contending for.
>
> My only guess then is it's some global kernel lock or the like.
>
> When the load skyrocketed, it must have shown that lots of
> processes were
> not in S (sleep) state. Were they in R (run) state or D (io wait)
> state?
> Since you're on Solaris, you could use DTrace to find out what they
> were
> actually doing/waiting for...

The lockstat command is where one would enter this territory.

lockstat -D 20 sleep 5

Will show you the top 20 contended locks.

lockstat -kgIW sleep 5

Will show you which calls are taking the most time to complete in the
kernel during a 5 second sample period.

/dale

--
Dale Ghent
Specialist, Storage and UNIX Systems
UMBC - Office of Information Technology
ECS 201 - x51705
Rudy Gevaert
2007-10-05 06:52:10 UTC
Permalink
Vincent Fox wrote:
> Wondering if anyone out there is running a LARGE Cyrus
> user-base on a single or a couple of systems?
>
> Let me define large:
>
> 25K-30K (or more) users per system
> High email activity, say 2+ million emails a day

Our user base is split over 6 backends:

+----------+-----------------+
| mailhost | count(mailhost) |
+----------+-----------------+
| mail1 | 20857 |
| mail2 | 5655 |
| mail3 | 3237 |
| mail4 | 20926 |
| mail5 | 5645 |
| mail6 | 3296 |
+----------+-----------------+

You can see that mail1 and mail4 have a lot more users, but those
users are students, who have smaller mailboxes.

The other hosts are primarily for staff; e.g. mail3 reaches 1000
concurrent users every day, and mail6 nearly reaches 1000. I can't give
you any figures for email received at the moment.

Rudy

--
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Rudy Gevaert ***@UGent.be tel:+32 9 264 4734
Directie ICT, afd. Infrastructuur ICT Department, Infrastructure office
Groep Systemen Systems group
Universiteit Gent Ghent University
Krijgslaan 281, gebouw S9, 9000 Gent, Belgie www.UGent.be
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Paul M Fleming
2007-10-10 02:09:53 UTC
Permalink
You can also use

vm.lower_zone_protection=size_in_mb

to protect portions of low memory. This doesn't help with caching issues but
can help prevent the kernel from getting cornered and resorting to oom-killer.

We haven't tested everything in our environment at 64-bit, so we've used
lower zone protection to give the kernel buffers some working space. This
becomes an issue under 32-bit kernels when attempting disk and network IO
at gigabit speeds. It also forces more of the page cache into high memory.

Paul


On Tue, 9 Oct 2007, Andrew Morgan wrote:

> On Sat, 6 Oct 2007, Rob Mueller wrote:
>
> > As it turns out, the memory leaks weren't critical, because the pages do
> > seem to be reclaimed when needed, though it was annoying not knowing exactly
> > how much memory was really free/used. The biggest problem was that with
> > cyrus you have millions of small files, and with a 32bit linux kernel the
> > inode cache has to be in low memory, severely limiting how many files the OS
> > will cache.
> >
> > See this blog post for the gory details, and why a 64-bit kernel was a nice
> > win for us.
> >
> > http://blog.fastmail.fm/2007/09/21/reiserfs-bugs-32-bit-vs-64-bit-kernels-cache-vs-inode-memory/
>
> Yesterday I checked my own Cyrus servers to see if I was running out of
> lowmem, and it sure looked like it. Lowmem had only a couple MB free, and
> I had 2GB of free memory that was not being used for cache.
>
> I checked again today and everything seems to be fine - 150MB of lowmem
> free and almost no free memory (3GB cached)! Grrr.
>
> Anyways, I was looking into building a 64-bit kernel. I'm running Debian
> Sarge (I know, old) on a Dell 2850 with Intel Xeon (Nocona) CPUs and 4GB
> RAM. My kernel version is 2.6.14.5, built from kernel.org sources. It
> has "High Memory Support (64GB)" selected.
>
> When I run menuconfig, I'm not seeing any obvious place to switch from
> 32-bit to 64-bit. Could you elaborate a bit about how you switched to a
> 64-bit kernel? Also, are you running a full 64-bit distro, or just a
> 64-bit kernel?
>
> Thanks,
> Andy
>
Eric Luyten
2007-11-09 11:05:14 UTC
Permalink
> It seems silly to spend all the money for a T2000 with redundancies
> and SAN and etc. And then have it choke up when it hits (for us) about
> 10K users. It seems everyone we talk to scratches their head why this
> system with all its cores would choke. We have found even older Sun
> V210 are handling 4K users with negligible load.
>
> Our working hypothesis is that CYRUS is what is choking up at a certain
> activity level due to bottlenecks with simultaneous access to some shared
> resource for each instance.

Vincent (and others),


I have been following this thread with quite some interest since the start.

We operate a single-instance Cyrus server (2.2.13) on an 8-processor Sun
V1280 with 32 GB of RAM and Solaris 9.
1.3 TB of mail data, 50k+ users, 23M+ messages in the spool, 200k deliveries
per day (the latter figure being considerably lower than yours, IIRC).


Our server is happily humming away with a load average around 2 (two).

Most of the time we stay under 1,000 IMAP+POP processes combined.
About fifteen percent of our user base is exclusively using POP clients.

Installing an IMAP proxy on our Webmail frontend did cut the number of
active IMAP processes by two-thirds. Webmail sessions are kept "alive"
for a couple of minutes, thereby eliminating many IMAP server process
execs and terminations/wrapups.


Another thought: if your original problem is related to a locking issue
on shared resources, visible upon imapd process termination, then the
rate of writing new messages to the spool need not be a directly
contributing factor.
Were you experiencing the load problem while (briefly) halting deliveries
to the mail spool ?


Eric.
Vincent Fox
2007-11-09 17:40:25 UTC
Permalink
Eric Luyten wrote:
> Another thought : if your original problem is related to a locking issue
> of shared resources, visible upon imapd process termination, the rate of
> writing new messages to the spool does not need to be a directly contri-
> buting factor.
> Were you experiencing the load problem while (briefly) halting deliveries
> to the mail spool ?
>

Yep we tried stopping mail delivery.

If there was something that 3 admins could do to alleviate load, we did it.

The bigger problem I am seeing is that Cyrus, in our usage, doesn't
seem to ramp load smoothly or even predictably. It goes fine up to a
certain point, and then you hit a brick wall without much warning. You
add a small chunk of extra users or load and suddenly everything goes
to hell. Keeping the user count per instance appropriate was the only
thing, over multiple days of desperate "try this", that did the job.

A generous engineering cushion of capacity seems more critical than usual.
Jure Pečar
2007-11-09 18:10:37 UTC
Permalink
On Fri, 09 Nov 2007 09:40:25 -0800
Vincent Fox <***@ucdavis.edu> wrote:

> If there's something that 3 admins could do to alleviate load we did it.
>
> The bigger problem I am seeing is that Cyrus doesn't in our
> usage seem to ramp load smoothly or even predictably. It goes
> fine up to a certain point, and then you hit a brick wall without
> very much in the way of warning. You add that small chunk of
> extra users or load and suddenly everything goes to hell. Keeping
> the user-count per instance appropriate was the only thing we did
> over multiple days of desperate "try this" that did the job.
>
> A generous engineering cushion of capacity seems more critical than usual.

In my experience, the "brick wall" you describe is what happens when disks
reach a level of random IO that they cannot keep up with.

In cases such as yours, the only reasonable thing is to announce some kind
of "extended maintenance" to users, so they don't bog you down with
complaints, and then go methodically through the system, testing and
eliminating possible causes one by one until you zero in on the root cause.
If this is some Solaris "feature" as you suspect, then I think a DTrace
expert is the person you're looking for.

I'm still on Linux and was thinking a lot about trying out Solaris 10, but
stories like yours make me think again about that ...


--

Jure Pečar
http://jure.pecar.org/
John Madden
2007-11-09 18:28:05 UTC
Permalink
On Fri, 2007-11-09 at 19:10 +0100, Jure Pečar wrote:
> I'm still on linux and was thinking a lot about trying out solaris 10,
> but
> stories like yours will make me think again about that ...

Agreed -- with the things I see from the Solaris (and zfs) and Sparc
hardware in general, my money's still on Linux/LVM/Reiser/ext3.

250,000 mailboxes, 1,000 concurrent users, 60 million emails, 500k
deliveries/day. For us, backups are the worst thing, followed by
reiserfs's use of the BKL, followed by the need to use a ton of disks to
keep up with the i/o.

John



--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
***@ivytech.edu
Bron Gondwana
2007-11-11 00:18:21 UTC
Permalink
On Fri, Nov 09, 2007 at 01:28:05PM -0500, John Madden wrote:
> On Fri, 2007-11-09 at 19:10 +0100, Jure Pečar wrote:
> > I'm still on linux and was thinking a lot about trying out solaris 10,
> > but
> > stories like yours will make me think again about that ...
>
> Agreed -- with the things I see from the Solaris (and zfs) and Sparc
> hardware in general, my money's still on Linux/LVM/Reiser/ext3.
>
> 250,000 mailboxes, 1,000 concurrent users, 60 million emails, 500k
> deliveries/day. For us, backups are the worst thing, followed by
> reiserfs's use of the BKL, followed by the need to use a ton of disks to
> keep up with the i/o.

For us, backups are hardly a blip on the radar :) The joy of writing
your own custom backup system that knows more about Cyrus internals
than just about anything else. It starts with some stat calls, and
if any of the cyrus.header, cyrus.index or cyrus.expunge files have
changed, it locks them all and then streams them all to the backup
server.

The backup server then parses them and decides (based on GUID) if
there are any data files it hasn't yet fetched. If so, it fetches
them and checks the sha1 of the fetched file against the GUID.

The whole thing takes a couple of seconds per user and requires
less IO than even using direct IMAP calls would.
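
The verification step relies on the fact that newer Cyrus message GUIDs
are the SHA-1 of the raw message file, so the backup server can check a
fetched file without any extra metadata. A minimal sketch of that check
(the function names and demo file layout are mine, not our actual code):

```python
import hashlib
import os
import tempfile

def message_guid(path):
    """SHA-1 hex digest of a message file; in newer Cyrus this
    equals the message GUID recorded in cyrus.index."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def needs_fetch(index_guid, local_path):
    """Fetch if we don't have the file yet, or if its content no
    longer matches the GUID (missing or corrupt local copy)."""
    return not os.path.exists(local_path) or message_guid(local_path) != index_guid

# tiny demo with a fake spool message
raw = b"From: a@example.org\r\n\r\nhello\r\n"
fd, demo = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

guid = message_guid(demo)
print(guid == hashlib.sha1(raw).hexdigest())  # prints: True
```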

Now our big IO user is cyr_expire. We run it once per week, and
it's a killer. I'd be tempted to run it a lot more frequently if
it didn't have such a high baseline IO cost on top of the actual
message unlinks (though the unlinks are the real killer)

Bron ( and the BKL, *sigh*. I just installed an external RAID
unit with 8x1TB drives in it for data. That's 6TB/300GB == 20
data partitions plus 20 meta partitions to go with it. That's
a lot of BKL! )
David Carter
2007-11-13 10:24:22 UTC
Permalink
On Sun, 11 Nov 2007, Bron Gondwana wrote:

>> 250,000 mailboxes, 1,000 concurrent users, 60 million emails, 500k
>> deliveries/day. For us, backups are the worst thing, followed by
>> reiserfs's use of the BKL, followed by the need to use a ton of disks to
>> keep up with the i/o.
>
> For us backups are hardly a blip on the radar :) The joy of writing
> your own custom backup system that knows more about Cyrus internals than
> just about anything else. It starts with some stat calls, and if any of
> the cyrus.header, cyrus.index or cyrus.expunge files have changed then
> it will lock them all then stream them all to the backup server.

Cyrus is pretty ideal for fast incremental updates to a backup system:
hence replication. You shouldn't need to lock anything with delayed
expunge, delayed delete and fast rename in place.

--
David Carter Email: ***@ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Bron Gondwana
2007-11-13 12:02:12 UTC
Permalink
On Tue, Nov 13, 2007 at 10:24:22AM +0000, David Carter wrote:
> On Sun, 11 Nov 2007, Bron Gondwana wrote:
>
> >> 250,000 mailboxes, 1,000 concurrent users, 60 million emails, 500k
> >> deliveries/day. For us, backups are the worst thing, followed by
> >> reiserfs's use of the BKL, followed by the need to use a ton of disks to
> >> keep up with the i/o.
> >
> > For us backups are hardly a blip on the radar :) The joy of writing
> > your own custom backup system that knows more about Cyrus internals than
> > just about anything else. It starts with some stat calls, and if any of
> > the cyrus.header, cyrus.index or cyrus.expunge files have changed then
> > it will lock them all then stream them all to the backup server.
>
> Cyrus is pretty ideal for fast incremental updates to a backup system:
> hence replication. You shouldn't need to lock anything with delayed
> expunge, delayed delete and fast rename in place.

If you're planning to lift a consistent copy of a .index file, you need
to lock it for the duration of reading it (read lock at least).

Yeah - replication is one way to do it. We happen to read from the
masters at the moment, but it would be pretty trivial to switch to
using the replicas (change a $Store->MasterSlot() to
$Store->ReplicaSlot() at one place in the code in fact) if we wanted to.

But since I would like a consistent snapshot of the mailbox state,
I lock the cyrus.header and then the cyrus.index and then (if it's
there) the cyrus.expunge. That means no sneaky process could (for
example) delete the mailbox and create another one with the same
name while I was busy downloading the last file - giving me totally
bogus data. This is particularly important because I store things
by mailbox uniqueid rather than imap path (with pointers from the
imap path of course) so that a folder rename turns into a symlink
delete (well, replacement with one having an empty target anyway)
and a symlink create in the tar file.

Bron ( and right now I'm running the process to finish the upgrade
from MD5 based to SHA1 based internal identifiers in the
backup system, since all our indexes are upgraded )
David Carter
2007-11-13 13:55:25 UTC
Permalink
On Tue, 13 Nov 2007, Bron Gondwana wrote:

> If you're planning to lift a consistent copy of a .index file, you need
> to lock it for the duration of reading it (read lock at least).

mailbox_lock_index() blocks flag updates (but this doesn't seem to be
something that imapd worries about when FETCHing data). You don't need to
worry about expunge or append events once the mailbox is open.

> But since I would like a consistent snapshot of the mailbox state, I
> lock the cyrus.header and then the cyrus.index and then (if it's there)
> the cyrus.expunge. That means no sneaky process could (for example)
> delete the mailbox and create another one with the same name while I was
> busy downloading the last file - giving me totally bogus data.

chdir() into the mailbox data directory: with delayed delete and fast
rename it shouldn't matter if the mailbox is replaced under your feet.
That's the way replication worked on my 2.1 systems, prior to split-meta.

(Locking isn't a big deal, but safe concurrent access is always nice).

--
David Carter Email: ***@ucs.cam.ac.uk
University Computing Service, Phone: (01223) 334502
New Museums Site, Pembroke Street, Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Vincent Fox
2007-11-09 18:35:44 UTC
Permalink
Jure Pečar wrote:
> In my experience the "brick wall" you describe is what happens when disks
> reach a certain point of random IO that they cannot keep up with.
>

The problem with a technical audience is that everyone thinks they have
a workaround or probable fix you haven't already thought of. No offense;
I am guilty of it myself, but it's sometimes very hard to say "I DON'T
KNOW" and dig through telemetry and instrument the software until you
know all the answers.

With something as complex as Cyrus, this is harder than you think.
Unfortunately when it comes to something like a production mail service
these days it's nearly impossible to get the funding and manhours and
approvals to run experiments on live guinea pigs to really get to the
bottom of problems. We throw systems at the problem and move on.

But in answer to your point, our iostat numbers for busy and service
time didn't indicate any I/O issue. That was the first thing we looked
at, of course. Even by eyeball, our array drives are more idle than busy.
Michael Bacon
2007-11-13 21:41:14 UTC
Permalink
At the risk of being yet one more techie who thinks he has a workaround...

I'm back (in the past two months) doing Cyrus administration after a
three-year break. I ran the Cyrus instance at Duke University before, and
am now getting up to speed to run the one at UNC. At Duke we started as a
multi-host install, and moved to a single instance just as I was leaving.
Here at UNC, we've been on a single instance for years. Both places have
been Solaris all along, and both places had over 50k users and received
several million messages a day.

Part of the way we handle it here is with massive hardware -- an
8-processor Sun 6800 with the processor boards swapped out to UltraSPARC
IVs. Even that is a couple of years old at this point. That said, our CPU
load is really pretty minimal.

While we're on a very old version of Cyrus right now (1.6), I think,
reading this, that I've got a good feel for what you're looking at.
There's been a lot of talk about the linked list in the kernel and the
fact that it freezes all processes with that file mmap'ed when the file
gets written. If scanning the linked list were really the problem, I
think we would have seen a total system meltdown here a long time ago.

I'm much more inclined to think that what you're running into is all of
the processes freezing during the latency period for the re-write of the
mailboxes file. This won't show up as I/O blocking on your disk, as there
won't be any real contention for that file or even for the channel. But
the latency of the write, while only a few milliseconds, is going to kill
you if your mailboxes file gets big.

I haven't had any role yet in the design and configuration of UNC's system,
but there's one thing we have that I think saves us an enormous amount of
pain. Since we're still on 1.6, and hence using the "plain text" mailboxes
format, bear in mind that all changes to the mailboxes database involve a
lock on the file, a complete rewrite of the file next to it on the file
system, and a rename() system call. This is SLOOOWWW. How are we not dead?

Solid state disk for the partition with the mailboxes database.

This thing is amazing. We've got one of the gizmos with a battery backup
and a RAID array of Winchester disks that it writes off to if it loses
power, but the latency levels on this thing are non-existent. Writes to
the mailboxes database return almost instantaneously when compared to
regular spinning disks. Based on my experience, that's bound to be a much
bigger chunk of time than traversing a linked list in kernel memory.

For anyone doing a big Cyrus install, I would strongly recommend this.

Michael Bacon
ITS - UNC Chapel Hill

--On Friday, November 09, 2007 10:35 AM -0800 Vincent Fox
<***@ucdavis.edu> wrote:

> Jure Pečar wrote:
>> In my experience the "brick wall" you describe is what happens when disks
>> reach a certain point of random IO that they cannot keep up with.
>>
>
> The problem with a technical audience is that everyone thinks they have
> a workaround or probable fix you haven't already thought of. No offense;
> I am guilty of it myself, but it's sometimes very hard to say "I DON'T
> KNOW" and dig through telemetry and instrument the software until you
> know all the answers.
>
> With something as complex as Cyrus, this is harder than you think.
> Unfortunately when it comes to something like a production mail service
> these days it's nearly impossible to get the funding and manhours and
> approvals to run experiments on live guinea pigs to really get to the
> bottom of problems. We throw systems at the problem and move on.
>
> But in answer to your point, our iostat numbers for busy or service time
> didn't
> indicate there to be any I/O issue. That was the first thing we looked
> at of course.
> Even by eyeball our array drives are more idle than busy.
>
>
> ----
> Cyrus Home Page: http://cyrusimap.web.cmu.edu/
> Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
> List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Vincent Fox
2007-11-14 18:21:10 UTC
Permalink
Michael Bacon wrote:
>
> Solid state disk for the partition with the mailboxes database.
>
> This thing is amazing. We've got one of the gizmos with a battery
> backup and a RAID array of Winchester disks that it writes off to if
> it loses power, but the latency levels on this thing are
> non-existent. Writes to the mailboxes database return almost
> instantaneously when compared to regular spinning disks. Based on my
> experience, that's bound to be a much bigger chunk of time than
> traversing a linked list in kernel memory.
>
> For anyone doing a big Cyrus install, I would strongly recommend this.
>

Thanks for the idea Michael.

I am thinking, when our Sun DTrace testing starts, of throwing in one
config where the DBs are run out of tmpfs, in order to exercise whether
latency to those databases is causing the pileup. I have also seen a
posting from Pascal that ZFS mirrored configs have latency issues which
may be contributing.

I'm not ready to point any fingers but it certainly seems worth
investigating.

It's a pity I can't find any Sun SSDs that could just slot into
our existing SAN setups.
Michael Bacon
2007-11-14 20:20:30 UTC
Permalink
Sun doesn't make any SSDs, I don't think, but while I'm not certain, I
think the RamSan line (http://www.superssd.com/products/ramsan-400/) has
some sort of partnership with Sun. To be honest, I'm not sure which brand
we're using, but like RamSan, it's a FC disk that slots into our SAN like
any other target.

I'd love to find out what your dtrace output says, though.

-Michael

--On Wednesday, November 14, 2007 10:21 AM -0800 Vincent Fox
<***@ucdavis.edu> wrote:

> Michael Bacon wrote:
>>
>> Solid state disk for the partition with the mailboxes database.
>>
>> This thing is amazing. We've got one of the gizmos with a battery
>> backup and a RAID array of Winchester disks that it writes off to if
>> it loses power, but the latency levels on this thing are
>> non-existent. Writes to the mailboxes database return almost
>> instantaneously when compared to regular spinning disks. Based on my
>> experience, that's bound to be a much bigger chunk of time than
>> traversing a linked list in kernel memory.
>>
>> For anyone doing a big Cyrus install, I would strongly recommend this.
>>
>
> Thanks for the idea Michael.
>
> I am thinking when our Sun Dtrace testing starts, to see if I can throw
> in one config where the DBs are run out of tmpfs in order to exercise
> whether latency to those databases is causing the pileup. I have also
> seen a posting from Pascal that ZFS mirrored configs have latency issues
> which may be contributing.
>
> I'm not ready to point any fingers but it certainly seems worth
> investigating.
>
> It's a pity I can't find any Sun SSD drives that could just slot into our
> existing SAN setups.
>
>
Rob Banz
2007-11-14 21:32:21 UTC
Permalink
On Nov 14, 2007, at 15:20, Michael Bacon wrote:

> Sun doesn't make any SSDs, I don't think, but while I'm not certain, I
> think the RamSan line (http://www.superssd.com/products/ramsan-400/)
> has
> some sort of partnership with Sun. To be honest, I'm not sure which
> brand
> we're using, but like RamSan, it's a FC disk that slots into our SAN
> like
> any other target.
>
> I'd love to find out what your dtrace output says, though.

So, is it just your mailboxes db that lies on the SSD? Or your entire
'meta' partition?
Michael Bacon
2007-11-14 22:27:37 UTC
Permalink
The whole "meta" partition as of 1.6 (so no fancy splitting of mailbox
metadata), minus the proc directory, which is on tmpfs.

-Michael

--On Wednesday, November 14, 2007 4:32 PM -0500 Rob Banz <***@nofocus.org>
wrote:

>
> On Nov 14, 2007, at 15:20, Michael Bacon wrote:
>
>> Sun doesn't make any SSDs, I don't think, but while I'm not certain, I
>> think the RamSan line (http://www.superssd.com/products/ramsan-400/)
>> has
>> some sort of partnership with Sun. To be honest, I'm not sure which
>> brand
>> we're using, but like RamSan, it's a FC disk that slots into our SAN
>> like
>> any other target.
>>
>> I'd love to find out what your dtrace output says, though.
>
> So, is it just your mailboxes db that lies on the SSD? Or your entire
> 'meta' partition?
>
>
Vincent Fox
2007-11-15 04:15:56 UTC
Permalink
This thought has occurred to me:

ZFS prefers reads over writes in its scheduling.

I think you can see where I'm going with this. My WAG is something
related to Pascal's, namely latency. What if my write requests to
mailboxes.db
or deliver.db start getting stacked up, due to the favoritism shown to
reads?
The actual usage mix on a Cyrus system ends up being more writes than reads.
Despite having channels that seem under-utilized perhaps the stacking of
small latencies hits a tipping point that is causing the slowdown.

We have all Cyrus lumped in one ZFS pool, with separate filesystems for
imap, mail, sieve, etc. However, I do have an unused disk in each array,
such that I could set up a simple ZFS mirror pair for /var/cyrus/imap so
that the databases are in their own pool. Or, I suppose, even a UFS
filesystem with directio and all that jazz set. Perhaps I wouldn't get
all the effects of a RAM-SAN 500, but it could be a worthwhile
improvement.

I liked having one big pool, but if it works, c'est la vie.
Michael Bacon
2007-11-15 15:04:34 UTC
Permalink
Interesting thought. We haven't gone to ZFS yet, although I like the idea
a lot. My hunch is it's an enormous win for the mailbox partitions, but
perhaps it's not a good thing for the meta partition. I'll have to let
someone else who knows more about ZFS and write speeds vs. read speeds
chime in here.

I have heard tell of funny behavior that ZFS does if you've got
battery-backed write caches on your arrays. (i.e., ZFS aggressively
flushes those caches, making them essentially useless.) Here's the best
article I was able to find on it in a short google:

http://blogs.digitar.com/jjww/?itemid=52

Only do that, of course, if you can trust your drives to write the data in
the caches in case of a power outage, and if you don't have any non-battery
backed caches somewhere else on your system.

-Michael

--On Wednesday, November 14, 2007 8:15 PM -0800 Vincent Fox
<***@ucdavis.edu> wrote:

> This thought has occurred to me:
>
> ZFS prefers reads over writes in its scheduling.
>
> I think you can see where I'm going with this. My WAG is something
> related to Pascal's, namely latency. What if my write requests to
> mailboxes.db
> or deliver.db start getting stacked up, due to the favoritism shown to
> reads?
> The actual usage mix on a Cyrus system ends up being more writes than
> reads. Despite having channels that seem under-utilized perhaps the
> stacking of small latencies hits a tipping point that is causing the
> slowdown.
>
> We have all Cyrus lumped in one ZFS pool, with separate filesystems for
> imap, mail, sieve, etc. However, I do have an unused disk in each array
> such that I could setup a simple ZFS mirror pair for /var/cyrus/imap so
> that the databases are in their own pools. Or even I suppose a UFS
> filesystem with directio and all that jazz set. Perhaps I wouldn't get
> all the
> effects of a RAM-SAN 500 but it could be a worthwhile improvement.
>
> I liked having one big pool, but if it works, c'est la vie.
>
>
>
Pascal Gienger
2007-11-15 15:50:46 UTC
Permalink
Michael Bacon <***@email.unc.edu> wrote:

> I have heard tell of funny behavior that ZFS does if you've got
> battery-backed write caches on your arrays.


/etc/system:

set zfs:zfs_nocacheflush=1


is your friend. Without that, ZFS' performance on hardware arrays with
large RAM caches is abysmal.

Some arrays can also be configured to ignore these flush requests, while
still honoring them when the internal battery storage is faulted or
regenerating (together with switching to write-through mode).

Pascal
Vincent Fox
2007-11-15 17:57:36 UTC
Permalink
>
> /etc/system:
>
> set zfs:zfs_nocacheflush=1

Yep already doing that, under Solaris 10u4. Have dual array controllers in
active-active mode. Write-back cache is enabled. Just poking in the 3510FC
menu shows cache is ~50% utilized so it does appear to be doing some work.
Ian G Batten
2007-11-16 12:10:23 UTC
Permalink
On 15 Nov 07, at 1504, Michael Bacon wrote:

> Interesting thought. We haven't gone to ZFS yet, although I like
> the idea
> a lot. My hunch is it's an enormous win for the mailbox
> partitions, but
> perhaps it's not a good thing for the meta partition. I'll have to
> let
> someone else who knows more about ZFS and write speeds vs. read speeds
> chime in here.

We're finding it a real win for the meta-partition. We're handling
~1000 users on a 2-way stripe by two-way mirror on the internal disks
in a T2000 for the meta-data, with the message data coming in over
NFS. We do see a few spikes of write operations (this is one
instance from zpool iostat -v 1):

capacity operations bandwidth
pool used avail read write read write
------------ ----- ----- ----- ----- ----- -----
pool1 52.1G 25.9G 4 657 3.96K 3.71M
mirror 26.0G 13.0G 4 354 3.96K 1.42M
c0t0d0s4 - - 0 135 0 1.42M
c0t1d0s4 - - 0 126 63.4K 1.42M
mirror 26.0G 13.0G 0 302 0 2.29M
c0t2d0s4 - - 0 112 0 2.29M
c0t3d0s4 - - 0 109 0 2.29M
------------ ----- ----- ----- ----- ----- -----


but it's showing no signs at all of being IO bound on the metadata.
The spikes are really just spikes for a second: the typical level is
about 10 ops / disk / sec.

ian
Wesley Craig
2007-11-15 18:29:54 UTC
Permalink
On 14 Nov 2007, at 23:15, Vincent Fox wrote:
> We have all Cyrus lumped in one ZFS pool, with separate filesystems
> for
> imap, mail, sieve, etc. However, I do have an unused disk in each
> array
> such that I could setup a simple ZFS mirror pair for /var/cyrus/
> imap so
> that the databases are in their own pools. Or even I suppose a UFS
> filesystem with directio and all that jazz set.

About 30% of all I/O is to mailboxes.db, most of which is read. I
haven't personally deployed a split-meta configuration, but I
understand the meta files are similarly heavy I/O concentrators.

:wes
Rob Mueller
2007-11-15 23:25:24 UTC
Permalink
> About 30% of all I/O is to mailboxes.db, most of which is read. I
> haven't personally deployed a split-meta configuration, but I
> understand the meta files are similarly heavy I/O concentrators.

That sounds odd.

Given the size and "hotness" of mailboxes.db, and in most cases the size of
mailboxes.db compared to the memory your machine has, basically the OS
should end up caching the entire thing in memory. If it has to keep going
back to disk to get parts of it, it suggests something is wrong with the OS
caching eviction policy.

I wonder if it's worth setting up a separate process that just mmaps the
whole file and then uses mlock() to keep the pages in memory, that should
mean it removes all read IO for mailboxes.db.

Rob
Pascal Gienger
2007-11-16 06:39:43 UTC
Permalink
Rob Mueller <***@fastmail.fm> wrote:

>
>> About 30% of all I/O is to mailboxes.db, most of which is read. I
>> haven't personally deployed a split-meta configuration, but I
>> understand the meta files are similarly heavy I/O concentrators.
>
> That sounds odd.
>
> Given the size and "hotness" of mailboxes.db, and in most cases the size
> of mailboxes.db compared to the memory your machine has, basically the
> OS should end up caching the entire thing in memory.

Solaris 10 does this in my case. Via dtrace you'll see that open() on the
mailboxes.db and read-calls do not exceed microsecond ranges. mailboxes.db
is not the problem here. It is entirely cached and rarely written
(creating, deleting and moving a mailbox).

Pascal
Rob Mueller
2007-11-16 06:39:47 UTC
Permalink
>> About 30% of all I/O is to mailboxes.db, most of which is read. I

> Solaris 10 does this in my case. Via dtrace you'll see that open() on the
> mailboxes.db and read-calls do not exceed microsecond ranges. mailboxes.db
> is not the problem here. It is entirely cached and rarely written
> (creating, deleting and moving a mailbox).

So what does "30% of all I/O" mean in that original statement? Does that
mean "30% of all IO requests from application to OS" or does that mean "30%
of all IO requests from OS to IO device"? I assumed you meant the second
because usually accessing data cached in memory isn't actually considered IO
by most people...

Rob
Michael Bacon
2007-11-16 16:59:51 UTC
Permalink
--On Friday, November 16, 2007 7:39 AM +0100 Pascal Gienger
<***@uni-konstanz.de> wrote:

> Solaris 10 does this in my case. Via dtrace you'll see that open() on the
> mailboxes.db and read-calls do not exceed microsecond ranges.
> mailboxes.db is not the problem here. It is entirely cached and rarely
> written (creating, deleting and moving a mailbox).

This is where I think the actual user count may really influence this
behavior. On our system, during heavy times, we can see writes to the
mailboxes file separated by no more than 5-10 seconds.

If you're constantly freezing all cyrus processes for the duration of those
writes, and those writes are taking any appreciable time at all, you're
going to have a stuttering server with big load averages.

Again, it's not I/O throughput to be worried about here -- it's latency.
If you don't have write caches in front of your disk, even with RAID you're
still at the mercy of drive latency in the millisecond range. Not a
problem if those writes are once every five minutes, but if you're at peak
load on a big system and seeing them every couple of seconds, that's brutal.

-Michael
Rob Mueller
2007-11-17 09:09:20 UTC
Permalink
> This is where I think the actual user count may really influence this
> behavior. On our system, during heavy times, we can see writes to the
> mailboxes file separated by no more than 5-10 seconds.
>
> If you're constantly freezing all cyrus processes for the duration of
> those writes, and those writes are taking any appreciable time at all,
> you're going to have a stuttering server with big load averages.

This shouldn't really be a problem. Yes the whole file is locked for the
duration of the write, however there should be only 1 fsync per
"transaction", which is what would introduce any latency. The actual writes
to the db file itself should be basically instant as the OS should just
cache them.

Still, you have a point that mailboxes.db is a global point of contention,
and it is accessed a lot, so blocking all processes on it for a write could
be an issue.

Which makes me even more glad that we've split up our servers into lots of
small cyrus instances, even less points of contention...

Rob
Ian G Batten
2007-11-19 08:50:16 UTC
Permalink
On 17 Nov 07, at 0909, Rob Mueller wrote:
>
> This shouldn't really be a problem. Yes the whole file is locked
> for the
> duration of the write, however there should be only 1 fsync per
> "transaction", which is what would introduce any latency. The
> actual writes
> to the db file itself should be basically instant as the OS should
> just
> cache them.

One thing that's worth noting for ZFS-ites is that on ZFS, you can
have multiple writer threads in a file simultaneously, which UFS can
only do for directio under certain conditions I can't recall. That's
a win for overlapping transactions into a file-based database.
We're not hitting mailboxes.db remotely rapidly enough for this to be
an issue, but I can imagine it being so for big shops.

In production releases of ZFS fsync() essentially triggers sync()
(fixed in Solaris Next). So if you anticipate a lot of writes (and
hence fsync()s) to mailboxes.db then you don't want mailboxes.db in
the same ZFS filesystem as things with lots of un-sync'd writes going
on. I've broken up /var/imap for ease of taking and rolling back
snapshots, but it has the handy side-effect of isolating deliver.db
and mailboxes.db from all the metadata partitions.

In my darker moments, by the way, I'm tempted to put deliver.db into
tmpfs. For planned reboot I could copy it somewhere stable, and I
could periodically dump it out to disk. But if I lost it, the
consequences aren't serious, and it's most of the write load through
that particular filesystem.

ian

mailhost-new# zfs list -t filesystem | grep imap; df /var/imap/proc
pool1/mailhost-space/imap              1.34G  24.6G   346M  /var/imap
pool1/mailhost-space/imap-seen          105M  24.6G  22.4M  /var/imap/user
pool1/mailhost-space/meta-partition-1  2.48G  24.6G   972M  /var/imap/meta-partition-1
pool1/mailhost-space/meta-partition-2  12.4G  24.6G  4.82G  /var/imap/meta-partition-2
pool1/mailhost-space/meta-partition-3  4.86G  24.6G  1.60G  /var/imap/meta-partition-3
pool1/mailhost-space/meta-partition-7  5.60G  24.6G  1.41G  /var/imap/meta-partition-7
pool1/mailhost-space/meta-partition-8  14.0G  24.6G  5.39G  /var/imap/meta-partition-8
pool1/mailhost-space/meta-partition-9  1.08G  24.6G   415M  /var/imap/meta-partition-9
pool1/mailhost-space/sieve             5.26M  24.6G  1.62M  /var/imap/sieve
/var/imap/proc     (swap   ):   514496 blocks  2356285 files
mailhost-new#



>
> Still, you have a point that mailboxes.db is a global point of
> contention,
>> and it is accessed a lot, so blocking all processes on it for a write
> could be
> an issue.



>
> Which makes me even more glad that we've split up our servers into
> lots of
> small cyrus instances, even less points of contention...
>
> Rob
>
Bron Gondwana
2007-11-19 10:30:31 UTC
Permalink
On Mon, Nov 19, 2007 at 08:50:16AM +0000, Ian G Batten wrote:
>
> On 17 Nov 07, at 0909, Rob Mueller wrote:
>>
>> This shouldn't really be a problem. Yes the whole file is locked for the
>> duration of the write, however there should be only 1 fsync per
>> "transaction", which is what would introduce any latency. The actual
>> writes
>> to the db file itself should be basically instant as the OS should just
>> cache them.
>
> One thing that's worth noting for ZFS-ites is that on ZFS, you can have
> multiple writer threads in a file simultaneously, which UFS can only do for
> directio under certain conditions I can't recall. That's a win for
> overlapping transactions into a file-based database. We're not hitting
> mailboxes.db remotely rapidly enough for this to be an issue, but I can
> imagine it being so for big shops.
>
> In production releases of ZFS fsync() essentially triggers sync() (fixed in
> Solaris Next). So if you anticipate a lot of writes (and hence fsync()s)
> to mailboxes.db then you don't want mailboxes.db in the same ZFS filesystem
> as things with lots of un-sync'd writes going on. I've broken up
> /var/imap for ease of taking and rolling back snapshots, but it has the
> handy side-effect of isolating deliver.db and mailboxes.db from all the
> metadata partitions.

Skiplist requires two fsync calls per transaction (single
untransactioned actions are also one transaction), and it
also locks the entire file for the duration of said
transaction, so you can't have two writes happening at
once. I haven't built Cyrus on our Solaris box, so I don't
know if it uses fcntl there, it certainly does on the Linux
systems, but it can fall back to flock if fcntl isn't
available.

> In my darker moments, by the way, I'm tempted to put deliver.db into tmpfs.
> For planned reboot I could copy it somewhere stable, and I could
> periodically dump it out to disk. But if I lost it, the consequences
> aren't serious, and it's most of the write load through that particular
> filesystem.

Sounds pretty reasonable to me.

>>
>> Still, you have a point that mailboxes.db is a global point of contention,
>> and it is accessed a lot, so blocking all processes on it for a write could
>> be
>> an issue.
>
>
>
>>
>> Which makes me even more glad that we've split up our servers into lots of
>> small cyrus instances, even less points of contention...

Yeah, it's nice. It's a pain that the entire mailboxes.db blocks
on writes, but it sure keeps the skiplist format simple. I'd be
interested to see if there are cases where a transaction is kept
open longer than it needs to be though.

Bron.
Andrew McNamara
2007-11-20 04:40:58 UTC
Permalink
>> In production releases of ZFS fsync() essentially triggers sync() (fixed in
>> Solaris Next).
[...]
>Skiplist requires two fsync calls per transaction (single
>untransactioned actions are also one transaction), and it
>also locks the entire file for the duration of said
>transaction, so you can't have two writes happening at
>once. I haven't built Cyrus on our Solaris box, so I don't
>know if it uses fcntl there, it certainly does on the Linux
>systems, but it can fall back to flock if fcntl isn't
>available.

Note that ext3 effectively does the same thing as ZFS on fsync() - because
the journal layer is block based and does not know which block belongs
to which file, the entire journal must be applied to the filesystem to
achieve the expected fsync() semantics (at least, with data=ordered,
it does).

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
Bron Gondwana
2007-11-20 05:04:23 UTC
Permalink
On Tue, 20 Nov 2007 15:40:58 +1100, "Andrew McNamara" <***@object-craft.com.au> said:
> >> In production releases of ZFS fsync() essentially triggers sync() (fixed in
> >> Solaris Next).
> [...]
> >Skiplist requires two fsync calls per transaction (single
> >untransactioned actions are also one transaction), and it
> >also locks the entire file for the duration of said
> >transaction, so you can't have two writes happening at
> >once. I haven't built Cyrus on our Solaris box, so I don't
> >know if it uses fcntl there, it certainly does on the Linux
> >systems, but it can fall back to flock if fcntl isn't
> >available.
>
> Note that ext3 effectively does the same thing as ZFS on fsync() -
> because
> the journal layer is block based and does not know which block belongs
> to which file, the entire journal must be applied to the filesystem to
> achieve the expected fsync() semantics (at least, with data=ordered,
> it does).

Lucky we run reiserfs then, I guess...

Bron.
--
Bron Gondwana
***@fastmail.fm
Vincent Fox
2007-11-20 06:51:43 UTC
Permalink
Bron Gondwana wrote:
> Lucky we run reiserfs then, I guess...
>
>

I suppose this is inappropriate topic-drift, but I wouldn't be
too sanguine about Reiser. Considering the driving force behind
it is in a murder trial last I heard, I sure hope the good bits of that
filesystem get turned over to someone who gives it a future.
Bron Gondwana
2007-11-20 07:21:23 UTC
Permalink
On Mon, 19 Nov 2007 22:51:43 -0800, "Vincent Fox" <***@ucdavis.edu> said:
> Bron Gondwana wrote:
> > Lucky we run reiserfs then, I guess...
> >
> >
>
> I suppose this is inappropriate topic-drift, but I wouldn't be
> too sanguine about Reiser. Considering the driving force behind
> it is in a murder trial last I heard, I sure hope the good bits of that
> filesystem get turned over to someone who gives it a future.

There are a bunch of people who know a fair bit about it and have been
happy to help debug issues, including quite recently. Besides, it's
pretty stable now and isn't bitrotting too badly.

That said, we're hanging out for btrfs to be stable - it would be nice,
and it's sort of inherited a bit from zfs and a bit from reiserfs in its
ways of doing things.

Bron ( running local Maildirs on it right now, synced with offlineimap to
FM. I wouldn't dream of running it production yet - it dies horribly
if you ever fill it more than about 70% )
--
Bron Gondwana
***@fastmail.fm
Michael R. Gettes
2007-11-20 13:32:27 UTC
Permalink
I am wondering about the use of fsync() on journal'd file systems
as described below. Shouldn't there be much less use of (or very
little use of) fsync() on these types of systems? Let the journal
layer do its job and not force it within cyrus? This would likely
save a lot of system overhead. It makes sense to use it on non-journal'd
fs. I also wonder whether modern arrays even respect FULLFSYNC
given their more complex nature and I/O scheduling algorithms. It
may be time that fsync() (and fcntl(F_FULLFSYNC)) have become moot
since there is likely little way to influence, in an effective
targeted way, I/O behavior in complex environments these days.

/mrg

On Nov 19, 2007, at 23:40, Andrew McNamara wrote:

>>> In production releases of ZFS fsync() essentially triggers sync()
>>> (fixed in
>>> Solaris Next).
> [...]
>> Skiplist requires two fsync calls per transaction (single
>> untransactioned actions are also one transaction), and it
>> also locks the entire file for the duration of said
>> transaction, so you can't have two writes happening at
>> once. I haven't built Cyrus on our Solaris box, so I don't
>> know if it uses fcntl there, it certainly does on the Linux
>> systems, but it can fall back to flock if fcntl isn't
>> available.
>
> Note that ext3 effectively does the same thing as ZFS on fsync() -
> because
> the journal layer is block based and does not know which block belongs
> to which file, the entire journal must be applied to the filesystem to
> achieve the expected fsync() semantics (at least, with data=ordered,
> it does).
>
> --
> Andrew McNamara, Senior Developer, Object Craft
> http://www.object-craft.com.au/
> ----
> Cyrus Home Page: http://cyrusimap.web.cmu.edu/
> Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
> List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Ian G Batten
2007-11-20 15:34:12 UTC
Permalink
On 20 Nov 07, at 1332, Michael R. Gettes wrote:

> I am wondering about the use of fsync() on journal'd file systems
> as described below. Shouldn't there be much less use of (or very
> little use) of fsync() on these types of systems? Let the journal
> layer do its job and not force it within cyrus? This would likely
> save a lot of system overhead.

fsync() forces the data to be queued to the disk. A journaling
filesystem won't usually make any difference, because no one wants to
keep an intent log of every 1 byte write, or the 100 overwrites of
the same block. If you want every write() to go to disk,
immediately, the filesystem layout doesn't really matter: it's just a
matter of disk bandwidth. Journalling filesystems are more usually
concerned with metadata consistency, so that the filesystem isn't
actively corrupt if the music stops at the wrong point in a directory
create or something.
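In code terms, the pattern under discussion is just write() followed by fsync() before success is reported. A hedged sketch (not Cyrus source):

```c
/* Sketch of the write-then-fsync pattern discussed above (not Cyrus
 * source).  The write() alone may leave the data in the buffer cache;
 * fsync() is what forces it to stable storage (or the journal) before
 * we report success. */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Append a record and make it durable; returns 0 on success. */
int durable_append(int fd, const char *rec, size_t len)
{
    if (write(fd, rec, len) != (ssize_t)len)
        return -1;              /* short write or error */
    if (fsync(fd) != 0)
        return -1;              /* data not known to be on disk */
    return 0;
}
```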

ian
David Lang
2007-11-20 17:56:37 UTC
Permalink
On Tue, 20 Nov 2007, Ian G Batten wrote:

> On 20 Nov 07, at 1332, Michael R. Gettes wrote:
>
>> I am wondering about the use of fsync() on journal'd file systems
>> as described below. Shouldn't there be much less use of (or very
>> little use) of fsync() on these types of systems? Let the journal
>> layer do its job and not force it within cyrus? This would likely
>> save a lot of system overhead.
>
> fsync() forces the data to be queued to the disk. A journaling
> filesystem won't usually make any difference, because no one wants to
> keep an intent log of every 1 byte write, or the 100 overwrites of
> the same block. If you want every write() to go to disk,
> immediately, the filesystem layout doesn't really matter: it's just a
> matter of disk bandwidth. Journalling filesystems are more usually
> concerned with metadata consistency, so that the filesystem isn't
> actively corrupt if the music stops at the wrong point in a directory
> create or something.

However, an fsync on a journaled filesystem just means the data needs to be
written to the journal; it doesn't mean that the journal needs to be flushed to
disk.

on ext3 if you have data=journal then your data is in the journal as well, and
all that the system needs to do on an fsync is write things to the journal (a
nice sequential write), and everything is perfectly safe. if you have
data=ordered (the default for most journaled filesystems) then your data isn't
safe when the journal is written, and two writes must happen on an fsync (one for
the data, one for the metadata)

for cyrus you should have the same sort of requirements that you would have for
a database server, including the fact that without a battery-backed disk cache
(or solid state drive) to handle your updates, you end up being throttled by
your disk rotation rate (you can only do a single fsync write per rotation, and
that's good only if you don't have to seek). RAID 5/6 arrays are even worse, as
almost all systems will require a read of the entire stripe before writing a
single block (and its parity block) back out, and since the stripe is
frequently larger than the OS readahead, the OS throws much of the data away
immediately.

if we can identify the files that are the bottlenecks it would be very
interesting to see the result of putting them on a solid-state drive.

David Lang
Ian G Batten
2007-11-21 09:50:25 UTC
Permalink
On 20 Nov 07, at 1756, David Lang wrote:

>
> however a fsync on a journaled filesystem just means the data needs
> to be
> written to the journal, it doesn't mean that the journal needs to
> be flushed to
> disk.
>
> on ext3 if you have data=journaled then your data is in the journal
> as well and
> all that the system needs to do on a fsync is to write things to
> the journal (a
> nice sequential write),

Assuming the journal is on a distinct device and the distinct device
can take the load. It isn't on ZFS, although work is in progress.
One of the many benefits of the sadly underrated Solaris Disksuite
product was the metatrans devices, which at least permitted metadata
updates to go to a distinct device. When the UFS logging code went
into core Solaris (the ON integration) that facility was dropped,
sadly. My Pillar NFS server does data logging to distinct disk
groups, but mostly --- like such boxes tend to do --- relies on 12GB
of RAM and a battery. A sequential write is only of benefit if the
head is in the right place and the platter is at the right rotational
position and the write is well-matched to the transfer rate of the
spindle: if the spindle is doing large sequential writes while also
servicing reads and writes elsewhere, or can't keep up with writing
tracks flat out, the problems increase.
>
> for cyrus you should have the same sort of requirements that you
> would have for
> a database server, including the fact that without a battery-backed
> disk cache
> (or solid state drive) to handle your updates, you end up being
> throttled by
> your disk rotation rate (you can only do a single fsync write per
> rotation, and
> that's good only if you don't have to seek), RAID 5/6 arrays are even
> worse, as
> almost all systems will require a read of the entire stripe before
> writing a
> single block (and its parity block) back out, and since the stripe is
> frequently larger than the OS readahead, the OS throws much of the
> data away
> immediately.
>
> if we can identify the files that are the bottlenecks it would be very
> interesting to see the result of putting them on a solid-state drive.

I've split the meta-data out into separate partitions. The meta data
is stored in ZFS filesystems in a pool which is a RAID 0+1 4-disk
group with SAS drives; the message data is coming out of the lowest
QoS on my Pillar. A ten-second fsstat sample of VM operations shows
that by request (this measures filesystem activity, not the implied
disk activity) it's the meta partitions taking the pounding:

map addmap delmap getpag putpag pagio
0 0 0 45 0 0 /var/imap
11 11 11 17 0 0 /var/imap/meta-partition-1
290 290 290 463 5 0 /var/imap/meta-partition-2
139 139 139 183 3 0 /var/imap/meta-partition-3
66 66 66 106 10 0 /var/imap/meta-partition-7
347 347 342 454 16 0 /var/imap/meta-partition-8
57 57 57 65 5 0 /var/imap/meta-partition-9
4 4 8 4 0 0 /var/imap/partition-1
11 11 22 14 0 0 /var/imap/partition-2
1 1 2 1 0 0 /var/imap/partition-3
6 6 12 49 10 0 /var/imap/partition-7
15 15 28 457 0 0 /var/imap/partition-8
1 1 2 2 0 0 /var/imap/partition-9

Similarly, by non-VM operation:

new name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
0 0 0 2.26K 0 6.15K 0 0 0 45 1.22K /var/imap
0 0 0 356 0 707 0 0 0 6 3.03K /var/imap/meta-partition-1
3 0 3 596 0 902 0 6 135K 90 305K /var/imap/meta-partition-2
0 0 0 621 0 1.08K 0 0 0 3 1.51K /var/imap/meta-partition-3
3 0 3 1.04K 0 1.70K 0 6 149K 36 650K /var/imap/meta-partition-7
0 0 0 2.28K 0 4.24K 0 0 0 7 1.87K /var/imap/meta-partition-8
0 0 0 18 0 32 0 0 0 2 176 /var/imap/meta-partition-9
2 2 2 22 0 30 0 1 2.37K 2 7.13K /var/imap/partition-1
3 4 12 84 0 157 0 1 677 3 7.51K /var/imap/partition-2
1 1 1 1.27K 0 2.16K 0 0 0 1 3.75K /var/imap/partition-3
2 2 4 35 0 56 0 1 3.97K 36 279K /var/imap/partition-7
1 2 1 256 0 514 0 0 0 1 3.75K /var/imap/partition-8
0 0 0 0 0 0 0 0 0 0 0 /var/imap/partition-9


And looking at the real IO load, ten seconds of zpool iostat (for the meta
data and /var/imap):

capacity operations bandwidth
pool used avail read write read write
------------ ----- ----- ----- ----- ----- -----
pool1 51.6G 26.4G 0 142 54.3K 1001K
mirror 25.8G 13.2G 0 68 38.4K 471K
c0t0d0s4 - - 0 36 44.7K 471K
c0t1d0s4 - - 0 36 0 471K
mirror 25.8G 13.2G 0 73 15.9K 530K
c0t2d0s4 - - 0 40 28.4K 531K
c0t3d0s4 - - 0 39 6.39K 531K
------------ ----- ----- ----- ----- ----- -----

is very different to ten seconds of sar for the NFS:

09:46:34 device %busy avque r+w/s blks/s avwait avserv

[...]
nfs73 1 0.0 3 173 0.0 4.2
nfs86 3 0.1 12 673 0.0 6.5
nfs87 0 0.0 0 0 0.0 0.0
nfs89 0 0.0 0 0 0.0 0.0
nfs96 0 0.0 0 0 0.0 1.8
nfs101 1 0.0 1 25 0.0 8.0
nfs102 0 0.0 0 4 0.0 9.4

The machine has a _lot_ of memory (32GB) so it's likely that all mail
that is delivered and then read within ten minutes never gets read
back from the message store: the NFS load is almost entirely write as
seen from the server.

ian
David Lang
2007-11-21 18:15:21 UTC
Permalink
On Wed, 21 Nov 2007, Ian G Batten wrote:

>> however a fsync on a journaled filesystem just means the data needs to be
>> written to the journal, it doesn't mean that the journal needs to be
>> flushed to
>> disk.
>>
>> on ext3 if you have data=journal then your data is in the journal as well
>> and
>> all that the system needs to do on a fsync is to write things to the
>> journal (a
>> nice sequential write),
>
> Assuming the journal is on a distinct device and the distinct device can take
> the load. It isn't on ZFS, although work is in progress.

I was responding to the comments about ext3 and other journaled filesystems as
alternatives to zfs, and the claim that doing an fsync on one of them required
flushing the entire journal. Sorry if I wasn't clear enough about this.

David Lang
Gabor Gombas
2007-11-22 10:53:29 UTC
Permalink
On Tue, Nov 20, 2007 at 09:56:37AM -0800, David Lang wrote:

> for cyrus you should have the same sort of requirements that you would have for
> a database server, including the fact that without a battery-backed disk cache
> (or solid state drive) to handle your updates, you end up being throttled by
> your disk rotation rate (you can only do a single fsync write per rotation, and
> that's good only if you don't have to seek). RAID 5/6 arrays are even worse, as
> almost all systems will require a read of the entire stripe before writing a
> single block (and its parity block) back out, and since the stripe is
> frequently larger than the OS readahead, the OS throws much of the data away
> immediately.

You're mixing things up. Readahead has absolutely zero influence on when
data is evicted from the cache.

> if we can identify the files that are the bottlenecks it would be very
> interesting to see the result of puttng them on a solid-state drive.

On Linux you can use blktrace to log every I/O operation; you could use
it to catch I/O ops that take longer than expected. It works below the
filesystem level, however, so you need to use something like debugfs to
translate the sector numbers back to inodes/file names.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
David Lang
2007-11-23 15:12:05 UTC
Permalink
On Thu, 22 Nov 2007, Gabor Gombas wrote:

> On Tue, Nov 20, 2007 at 09:56:37AM -0800, David Lang wrote:
>
>> for cyrus you should have the same sort of requirements that you would have for
>> a database server, including the fact that without a battery-backed disk cache
>> (or solid state drive) to handle your updates, you end up being throttled by
>> your disk rotation rate (you can only do a single fsync write per rotation, and
>> that's good only if you don't have to seek). RAID 5/6 arrays are even worse, as
>> almost all systems will require a read of the entire stripe before writing a
>> single block (and its parity block) back out, and since the stripe is
>> frequently larger than the OS readahead, the OS throws much of the data away
>> immediately.
>
> You're mixing things up. Readahead has absolutely zero influence on when
> data is evicted from the cache.

if the system is set to do a 1M readahead, and to do that readahead it needs to
read in 5M of data to verify the integrity, the system doesn't keep all 5M of
data in its cache, only the 1M that is its readahead (or at least in some
cases this is true)

David Lang
Rob Banz
2007-11-20 19:02:11 UTC
Permalink
We went through a similar discussion last year in OpenAFS land, and
came to the same conclusion -- basically, if your filesystem is
reasonably reliable (such as ZFS is), and you can trust your
underlying storage not to lose transactions that are in-cache during a
'bad event', the added benefit of fsync() may be less than its
performance cost.

-rob

On Nov 20, 2007, at 08:32, Michael R. Gettes wrote:

> I am wondering about the use of fsync() on journal'd file systems
> as described below. Shouldn't there be much less use of (or very
> little use) of fsync() on these types of systems? Let the journal
> layer due its job and not force it within cyrus? This would likely
> save a lot of system overhead. It makes sense to use on non-journal'd
> fs. I also wonder whether modern arrays even respect FULLFSYNC
> given their more complex nature and I/O scheduling algorithms. It
> may be time that fsync() (and fcntl(F_FULLFSYNC)) have become moot
> since there is likely little way to influence, in an effective
> targeted way, I/O behavior in complex environments these days.
Pascal Gienger
2007-11-20 19:57:21 UTC
Permalink
Rob Banz <***@nofocus.org> wrote:

>
> We went through a similar discussion last year in OpenAFS land, and
> came to the same conclusion -- basically, if your filesystem is
> reasonably reliable (such as ZFS is), and you can trust your
> underlying storage not to lose transactions that are in-cache during a
> 'bad event', the added benefit of fsync() may be less than its
> performance cost.

Wouldn't it be nice to have a configuration option to completely turn off
fsync() in Cyrus? If you want, with a BIG WARNING in the doc stating NOT TO
USE IT unless you know what you're doing. :)



Pascal (in the process of reconfiguring our SAN to do more cyrus checks)

PS: Putting deliver.db on tmpfs seems to be a nice idea, but in current
cyrus code you may not give extra paths to individual cyrus databases. Our
actual deliver.db on one machine is ca. 600 MB, so it would be no
problem to store it completely on tmpfs.
Ken Murchison
2007-11-20 20:38:32 UTC
Permalink
Pascal Gienger wrote:
> Rob Banz <***@nofocus.org> wrote:
>
>> We went through a similar discussion last year in OpenAFS land, and
>> came to the same conclusion -- basically, if your filesystem is
>> reasonably reliable (such as ZFS is), and you can trust your
>> underlying storage not to lose transactions that are in-cache during a
>> 'bad event', the added benefit of fsync() may be less than its
>> performance cost.
>
> Wouldn't it be nice to have a configuration option to completely turn off
> fsync() in Cyrus? If you want, with a BIG WARNING in the doc stating NOT TO
> USE IT unless you know what you're doing. :)

It's already in imapd.conf(8):

skiplist_unsafe

--
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Rob Banz
2007-11-20 20:54:35 UTC
Permalink
On Nov 20, 2007, at 15:38, Ken Murchison wrote:

> Pascal Gienger wrote:
>> Rob Banz <***@nofocus.org> wrote:
>>> We went through a similar discussion last year in OpenAFS land, and
>>> came to the same conclusion -- basically, if your filesystem is
>>> reasonably reliable (such as ZFS is), and you can trust your
>>> underlying storage not to lose transactions that are in-cache
>>> during a
>>> 'bad event', the added benefit of fsync() may be less than its
>>> performance cost.
>> Wouldn't it be nice to have a configuration option to completely
>> turn off fsync() in Cyrus? If you want, with a BIG WARNING in the
>> doc stating NOT TO USE IT unless you know what you're doing. :)
>
> It's already in imapd.conf(8):
>
> skiplist_unsafe

Well shiver me timbers! Y'all rock.

-rob
Andrew McNamara
2007-11-22 00:25:26 UTC
Permalink
>>> Wouldn't it be nice to have a configuration option to completely
>>> turn off fsync() in Cyrus? If you want, with a BIG WARNING in the
>>> doc stating NOT TO USE IT unless you know what you're doing. :)
>>
>> It's already in imapd.conf(8):
>>
>> skiplist_unsafe
>
>Well shiver me timbers! Ya'll rock.

Note, however, that fsync() still serves a purpose with data journalling -
without it, your application writes may be sitting in buffer cache,
and may be arbitrarily re-ordered before hitting the disk. The fsync()
should not return until the blocks have reached the disk (or journal),
and thus forms a synchronisation point (which is critical for maintaining
sanity of on-disk data structures like skiplists).
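As a concrete (hypothetical) illustration of that synchronisation point, a two-fsync commit might look like this. The record layout and COMMIT marker are invented for the example, not Cyrus's actual on-disk format:

```c
/* Hypothetical sketch of a two-fsync transaction commit, illustrating
 * why the fsync() barriers matter: without them the commit marker
 * could reach the disk before the data it commits.  The COMMIT marker
 * and layout are invented for this example, not Cyrus's format. */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int txn_commit(int fd, const void *record, size_t len)
{
    static const char marker[8] = "COMMIT\n";

    if (write(fd, record, len) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0)          /* barrier 1: data before marker */
        return -1;
    if (write(fd, marker, sizeof marker) != (ssize_t)sizeof marker)
        return -1;
    if (fsync(fd) != 0)          /* barrier 2: the commit is durable */
        return -1;
    return 0;
}
```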

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
John Madden
2007-11-20 21:17:01 UTC
Permalink
> > Wouldn't it be nice to have a configuration option to completely
> > turn off
> > fsync() in Cyrus? If you want, with a BIG WARNING in the doc stating
> > NOT TO
> > USE IT unless you know what you're doing. :)
>
> It's already in imapd.conf(8):
>
> skiplist_unsafe

I see most of our writes going to the spool filesystems, not so much the
meta filesystem, so I'd prefer to see something where we can keep the
main databases fsync()ing properly but allow the individual mailboxes to
just rely on filesystem journaling. Is there a
cacheandindexfile_unsafe? :)

John



--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
***@ivytech.edu
Rob Banz
2007-11-20 21:01:20 UTC
Permalink
On Nov 20, 2007, at 14:57, Pascal Gienger wrote:

> Rob Banz <***@nofocus.org> wrote:
>
>>
>> We went through a similar discussion last year in OpenAFS land, and
>> came to the same conclusion -- basically, if your filesystem is
>> reasonably reliable (such as ZFS is), and you can trust your
>> underlying storage not to lose transactions that are in-cache
>> during a
>> 'bad event', the added benefit of fsync() may be less than its
>> performance cost.
>
> Wouldn't it be nice to have a configuration option to completely
> turn off fsync() in Cyrus? If you want, with a BIG WARNING in the
> doc stating NOT TO USE IT unless you know what you're doing. :)

That's basically what we did with the AFS volserver & fileserver.
Oddly, when the patch got integrated into the OpenAFS code, they
didn't like the idea of it being an option and made it the default ;)
Marco Colombo
2007-11-23 16:03:07 UTC
Permalink
Andrew McNamara wrote:
> Note that ext3 effectively does the same thing as ZFS on fsync() - because
> the journal layer is block based and does not know which block belongs
> to which file, the entire journal must be applied to the filesystem to
> achieve the expected fsync() semantics (at least, with data=ordered,
> it does).

Well, "does not know which block belongs to which file" sounds weird. :)

With data=ordered, the journal holds only metadata. If you fsync() a
file, "ordered" means that ext3 syncs the data blocks first (with no
overhead, just like any other filesystem, of course it knows what blocks
to write), then the journal.

Now, yes, the journal possibly contains metadata updates for other files
too, and the "ordered" semantics requires the data blocks of those files
to be synced as well, before the journal sync.

I'm not sure if a fsync() flushes the whole journal or just up to the
point it's necessary (that is, up to the last update on the file you're
fsync()ing).

data=writeback is what some (most) other journalled filesystems do.
Metadata updates are allowed to hit the disk _before_ data updates. So,
on fsync(), the FS writes all data blocks (still required by fsync()
semantics), then the journal (or part of it), but if updates of other
files' metadata are included in the journal sync, there's no need to
write the corresponding data blocks. They'll be written later, and
they'll hit the disk _after_ the metadata changes.

If power fails in between, you can have a file whose size/time is
updated, but contents not. That's the problem with data=writeback, but
it should be noted that's pretty normal for other journalled
filesystems, too. It applies only to files that were not fsync()'ed.

I think that if you're running into performance problems, and your
system is doing a lot of fsync(), data=ordered is the worst option.

data=journal is fsync()-friendly in one sense: it does write
*everything* out, but in one nice sequential (thus extremely fast) shot.
Later, data blocks will be written again to the right places. It doubles
the I/O bandwidth requirements, but if you have a lot of bandwidth, it
may be a win. We're talking sequential write bandwidth, which is hardly
a problem.

data=writeback is fsync() friendly in the sense that it writes only the
data blocks of the fsync()'ed file plus (all) metadata. It's the lowest
overhead option.

If you have a heavy sustained write traffic _and_ lots of fsync()'s,
then data=writeback may be the only option.

I think some people are scared by data=writeback, but they don't realize
it's just what other journalled FS do. I'm not familiar with ReiserFS,
but I think it's metadata-only as well.

data=ordered is good, for general purpose systems. For any application
that uses fsync(), it's useless overhead.

I've never hit performance problems; my numbers are 200 users with 2000
messages/day delivered via lmtp, and _any_ decent PC handles that load
easily, so I've never considered turning data=ordered to data=writeback
for my filesystems. Now that I think about it, I've also forgotten to set
noatime after the last HW upgrade (what a luxury!).

/me fires vi on /etc/fstab and adds 'noatime'
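For concreteness, the resulting fstab line might look something like this (device, mount point and options here are hypothetical, and data=writeback carries the caveats discussed above):

```
# hypothetical spool mount: noatime kills the per-read atime write,
# data=writeback gives the lowest-overhead journalling mode
/dev/sdb1  /var/spool/imap  ext3  rw,noatime,data=writeback  0  2
```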

.TM.
Dale Ghent
2007-11-16 19:19:35 UTC
Permalink
On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote:

>> Solaris 10 does this in my case. Via dtrace you'll see that open()
>> on the
> mailboxes.db and read-calls do not exceed microsecond ranges.
> mailboxes.db
> is not the problem here. It is entirely cached and rarely written
> (creating, deleting and moving a mailbox).


Hmm, I'm wondering if the Cyrus devs would be receptive to the idea of
implementing some dtrace probes in Cyrus.

Stuff such as mailbox open/close, IMAP operations such as SELECTs,
message retrievals, and so on.

I run cyrus on my personal server now, so maybe I'll fool around with
that idea.

/dale

--
Dale Ghent
Specialist, Storage and UNIX Systems
UMBC - Office of Information Technology
ECS 201 - x51705
Ken Murchison
2007-11-16 19:56:04 UTC
Permalink
Dale Ghent wrote:
> On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote:
>
>>> Solaris 10 does this in my case. Via dtrace you'll see that open()
>>> on the
>> mailboxes.db and read-calls do not exceed microsecond ranges.
>> mailboxes.db
>> is not the problem here. It is entirely cached and rarely written
>> (creating, deleting and moving a mailbox).
>
>
> Hmm, I'm wondering if the Cyrus devs would be receptive to the idea of
> implementing some dtrace probes in Cyrus.
>
> Stuff such as mailbox open/close, IMAP operations such as SELECTs,
> message retrievals, and so on.

We'd probably accept a patch, as long as it's portable.

--
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Dale Ghent
2007-11-16 20:52:38 UTC
Permalink
On Nov 16, 2007, at 2:56 PM, Ken Murchison wrote:

> Dale Ghent wrote:
>> On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote:
>>>> Solaris 10 does this in my case. Via dtrace you'll see that
>>>> open() on the
>>> mailboxes.db and read-calls do not exceed microsecond ranges.
>>> mailboxes.db
>>> is not the problem here. It is entirely cached and rarely written
>>> (creating, deleting and moving a mailbox).
>> Hmm, I'm wondering if the Cyrus devs would be receptive to the idea
>> of implementing some dtrace probes in Cyrus.
>> Stuff such as mailbox open/close, IMAP operations such as SELECTs,
>> message retrievals, and so on.
>
> We'd probably accept a patch, as long as it's portable.


Portable in what sense, exactly?

Currently the only OSes which offer DTrace are OSX 10.5 and Solaris 10
(and Solaris Next), so would I be correct to assume that you mean that
a dtrace feature would have to work on those two OSes?

/dale

--
Dale Ghent
Specialist, Storage and UNIX Systems
UMBC - Office of Information Technology
ECS 201 - x51705
Ken Murchison
2007-11-16 20:56:01 UTC
Permalink
Dale Ghent wrote:
> On Nov 16, 2007, at 2:56 PM, Ken Murchison wrote:
>
>> Dale Ghent wrote:
>>> On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote:
>>>>> Solaris 10 does this in my case. Via dtrace you'll see that open()
>>>>> on the
>>>> mailboxes.db and read-calls do not exceed microsecond ranges.
>>>> mailboxes.db
>>>> is not the problem here. It is entirely cached and rarely written
>>>> (creating, deleting and moving a mailbox).
>>> Hmm, I'm wondering if the Cyrus devs would be receptive to the idea
>>> of implementing some dtrace probes in Cyrus.
>>> Stuff such as mailbox open/close, IMAP operations such as SELECTs,
>>> message retrievals, and so on.
>>
>> We'd probably accept a patch, as long as it's portable.
>
>
> Portable in what sense, exactly?
>
> Currently the only OSes which offer DTrace are OSX 10.5 and Solaris 10
> (and Solaris Next), so would I be correct to assume that you mean that a
> dtrace feature would have to work on those two OSes?

I don't care if it only works on Solaris 10, but the code can't get in
the way of it compiling and running on any other non-Dtrace system.

--
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Dale Ghent
2007-11-16 21:19:07 UTC
Permalink
On Nov 16, 2007, at 3:56 PM, Ken Murchison wrote:

> I don't care if it only works on Solaris 10, but the code can't get
> in the way of it compiling and running on any other non-Dtrace system.


Oh naturally. That's what autoconf and #ifdef HAVE_DTRACE is for.
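The #ifdef HAVE_DTRACE pattern being described might look roughly like this. The provider header and probe names (cyrus_probes.h, CYRUS_MAILBOX_OPEN) are hypothetical; on a DTrace system the header would be generated with `dtrace -h` from a .d provider file, and on any other system the probes compile away to nothing:

```c
/* Sketch of the autoconf/#ifdef HAVE_DTRACE pattern.  All probe and
 * header names here are hypothetical, not actual Cyrus identifiers. */
#include <assert.h>

#ifdef HAVE_DTRACE
#include "cyrus_probes.h"       /* generated: dtrace -h -s cyrus_probes.d */
#else
/* no-op stubs so non-DTrace builds compile unchanged */
#define CYRUS_MAILBOX_OPEN(name)  ((void)(name))
#define CYRUS_MAILBOX_CLOSE(name) ((void)(name))
#endif

int mailbox_open(const char *name)
{
    CYRUS_MAILBOX_OPEN(name);   /* fires only when a consumer attaches */
    /* ... real open logic would go here ... */
    return 0;
}
```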

/dale

--
Dale Ghent
Specialist, Storage and UNIX Systems
UMBC - Office of Information Technology
ECS 201 - x51705
Wesley Craig
2007-11-16 20:26:17 UTC
Permalink
On 15 Nov 2007, at 18:25, Rob Mueller wrote:
>> About 30% of all I/O is to mailboxes.db, most of which is read. I
>> haven't personally deployed a split-meta configuration, but I
>> understand the meta files are similarly heavy I/O concentrators.
>
> That sounds odd.

Yeah, it's not right. I was reading my iostat output backwards. In
fact, it's writes and presumably an artifact of having system logs on
the same device as mailboxes.db. Sorry for the confusion.

:wes
Bron Gondwana
2007-11-15 23:39:12 UTC
Permalink
On Thu, Nov 15, 2007 at 01:29:54PM -0500, Wesley Craig wrote:
> On 14 Nov 2007, at 23:15, Vincent Fox wrote:
> > We have all Cyrus lumped in one ZFS pool, with separate filesystems
> > for
> > imap, mail, sieve, etc. However, I do have an unused disk in each
> > array
> > such that I could setup a simple ZFS mirror pair for /var/cyrus/
> > imap so
> > that the databases are in their own pools. Or even I suppose a UFS
> > filesystem with directio and all that jazz set.
>
> About 30% of all I/O is to mailboxes.db, most of which is read. I
> haven't personally deployed a split-meta configuration, but I
> understand the meta files are similarly heavy I/O concentrators.

Which is a good argument for checkpointing it (gah, I hate that term -
it's so non-specific. I've spent some time working on terminology
maps for this stuff, and "repack" is the current winner, mainly due
to being shorter than the runner-up "consolidate")

What was I saying again? Oh - yes. The current skiplist metric is that
the mailboxes.db has to be twice the size of the last checkpointed
size plus 16k before it re-checkpoints. Given that a checkpoint takes
approximately 2 seconds on our systems, and it means that you're not
seeking all over the place any more, it would almost certainly be a
win.
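Read as code, that metric is just a one-line predicate (assuming it means 2 x last checkpointed size + 16k; a sketch of the stated rule, not the actual Cyrus source):

```c
/* Sketch of the repack trigger described above, assuming the metric
 * means current >= 2 * last_checkpointed + 16KB.  Not the actual
 * Cyrus source. */
#include <assert.h>
#include <stdint.h>

int should_repack(uint64_t current, uint64_t last_checkpointed)
{
    return current >= 2 * last_checkpointed + 16 * 1024;
}
```

So a freshly checkpointed 100 MB mailboxes.db wouldn't repack again until it passed roughly 200 MB.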

That said, we don't have a single machine where the memory pressure
is tight enough to ever push mailboxes.db out of the cache, so it's
not ever going to be hitting the disk for reads anyway!

Bron.
Pascal Gienger
2007-11-22 07:54:03 UTC
Permalink
Vincent Fox <***@ucdavis.edu> wrote:

> This thought has occurred to me:
>
> ZFS prefers reads over writes in its scheduling.
>
> I think you can see where I'm going with this. My WAG is something
> related to Pascal's, namely latency. What if my write requests to
> mailboxes.db
> or deliver.db start getting stacked up, due to the favoritism shown to
> reads?

I got substantial benefits from setting compression=on and recordsize=32K
on the filesystem where deliver.db resides. After talking with our SAN
staff it turned out that storage was our problem - it has trouble with
concurrent write and read calls; the system won't answer read requests if
the write channel is "full". I don't know whether it is a firmware issue or
a limitation of the storage system.

Lowering ZFS' recordsize and activating compression on that partition cut
down i/o rate and things are going normal here again.

Thanks to all who helped!

Pascal

PS: The mirror resilvering problem was a misconfiguration of a brocade
switch... Sometimes you can't see the forest for the trees (German
proverb: "Man sieht den Wald vor lauter Bäumen nicht")...
Vincent Fox
2007-11-09 18:41:51 UTC
Permalink
Jure Pečar wrote:
>
> I'm still on linux and was thinking a lot about trying out solaris 10, but
> stories like yours will make me think again about that ...
>
>

We are, I think, an edge case; plenty of people run Solaris Cyrus with
no problems.
To me ZFS alone is enough reason to go with Solaris. I would only go
back to
the bad old days of running fsck on 100+ gig filesystems if you put a gun
to my head.
Jim Howell
2007-11-12 13:31:36 UTC
Permalink
Hi,
We run a 35,000 mailbox system with no problems on Solaris. A few
years back we did have a bad time using Berkeley DB with cyrus, and
switching to skiplist fixed that. I believe that problem has since been
solved, though. I would recommend using multiple spools, however; having
one big one would scare me to death.
Jim


Vincent Fox wrote:
> Jure Pečar wrote:
>
>> I'm still on linux and was thinking a lot about trying out solaris 10, but
>> stories like yours will make me think again about that ...
>>
>>
>>
>
> We are, I think, an edge case; plenty of people run Solaris Cyrus with no
> problems.
> To me ZFS alone is enough reason to go with Solaris. I would only go
> back to
> the bad old days of running fsck on 100+gig filesystems if you put a gun
> to my head.
>
>
>

--
Jim Howell
Cornell University
CIT Messaging Systems Manager
email: ***@cornell.edu
phone: 607-255-9369