Redo Log Contention Associated with Redo Sizing and Numbers in Oracle
The last post about this was a little philosophical, so let's get back to the numbers. To make the numbers easy to read, let's fix a few initial conditions.
Conditions:
- a simple 10 GB insert into a non-indexed table (to keep things simple)
- a db_block_size of 8192 (the most common)
- 500 datafiles in the database

The datafile count may seem unrelated, but as you'll see it's deeply interconnected with redo log processing, and therefore with the transaction performance of the database. Also, since everything goes through undo, our 10 GB gets duplicated; we'll talk about archived logs later, so let's leave those aside for the moment.
Let's put the numbers down and talk a little about them.
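To make this concrete, here's a back-of-envelope sketch, assuming (hypothetically) 512 MB redo logs: the 10 GB of inserts, plus roughly as much again through undo, gives about 20 GB of redo to push through.

-- Hypothetical sizing: 512 MB redo logs, ~20 GB of redo, 500 datafiles
SELECT ROUND(20 * 1024 / 512)       AS log_switches,
       ROUND(20 * 1024 / 512) * 500 AS datafile_header_writes
FROM dual;
-- => 40 switches, and 40 x 500 = 20000 extra single-block header writes

Halve the redo log size and you double both numbers; that's the IO difference we're about to discuss.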
Why is there an IO difference? Every redo switch (when we move from one redo log to the next) triggers a few actions.
One of them is a process that writes a block to the header of every datafile to record "we are here"; of course, this has to wait until the changes contained in that redo are effectively written to the datafiles.
These waits don't add up immediately, because the processes are asynchronous (everybody keeps doing their own work as much as possible), but with few redo groups you eventually end up with all of them "active", trying to catch up between the header updates and the datafile writes. Oracle has a process, CKPT, dedicated to this coordination.
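If you want to see whether your database is already paying for this, the standard wait events around log switches are a good place to look; a quick sketch against v$system_event:

-- Waits caused by log switches; "(checkpoint incomplete)" is the freeze
-- described below, where reusing a redo group has to wait for CKPT/DBWR
SELECT event, total_waits, time_waited
FROM   v$system_event
WHERE  event IN ('log file switch completion',
                 'log file switch (checkpoint incomplete)');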
Anyway:
point 1:
the more datafiles, the more headers to write at each switch; and the smaller the redo logs, the more switches, so the more writes to the datafile headers (see the query below)
(I'm not adding these header writes to the undo accounting because I don't know whether they go through undo, and we'd go crazy on a recursive chase: writing the marks, undoing the changes to the marks, and so on)
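To see what this multiplication looks like on your own database, a quick sketch against the standard v$datafile and v$log views:

-- Headers to update per switch (one per datafile) vs. redo group sizing
SELECT (SELECT COUNT(*) FROM v$datafile)            AS datafiles,
       (SELECT COUNT(*) FROM v$log)                 AS redo_groups,
       (SELECT MIN(bytes) / 1024 / 1024 FROM v$log) AS smallest_group_mb
FROM dual;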
point 2:
the fewer groups you have, the more likely you are to reach the "1 Current, the rest Active, no Inactive" status, where all your redo logs are waiting for the headers and datafile writes to catch up. At that point the database freezes in small bursts.
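You can watch for that status directly; in the query below, if every group except the CURRENT one shows ACTIVE, the next switch will have to wait:

-- ACTIVE = still needed for crash recovery, not yet reusable
SELECT group#, bytes / 1024 / 1024 AS mb, status
FROM   v$log
ORDER  BY group#;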
point 3: before you think of putting the whole database in one big file and forgetting about this carnival of extra header IOs, remember that RMAN backups (and recoveries) parallelize across datafiles, and recovering a 32 GB datafile is a little faster than recovering a 1 TB datafile. So, as with everything in life, it's a balance.
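A quick way to see which side of that balance you're on is to look at how your datafile sizes are distributed (dba_data_files is the standard view):

-- A few huge files limit RMAN parallelism; thousands of tiny ones multiply header IO
SELECT file_name, ROUND(bytes / 1024 / 1024 / 1024, 1) AS gb
FROM   dba_data_files
ORDER  BY bytes DESC;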
Note 1: why don't we see more complaints about all of this?
The answer is "controller cache": we can afford all this inefficiency at the database level because there's a SAN or controller cache hiding the IO disaster from our eyes.
We have a cushion of at least 32 GB (often more) ready to absorb the extra IOs, giving super-fast response times and spreading out the IO, so we don't notice the issue.
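If you're curious how much write volume that cushion is actually absorbing, the cumulative system statistics give a rough idea (these are standard v$sysstat statistic names):

-- Redo bytes generated, LGWR write calls, and data blocks written
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('redo size', 'redo writes', 'physical writes');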
Note 2: why should we worry, then, if the SAN masks all of this?
Because SAN (or controller) caches are not infinite. As the volume grows, eventually we'll write more than the size of the cache in a few seconds, and then suddenly people will notice a catastrophic performance issue they can't explain, because "everything was OK a second ago".
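A simple way to see those bursts coming is to track the log switch rate over time (v$log_history is the standard view):

-- Log switches per hour; spikes here are the redo bursts the cache has to soak up
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
FROM   v$log_history
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY 1;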
In the next post, I'll talk a little about what happens next
(archive logs, tapes and Data Guard).