I had a look a system here in the Netherlands for a company that was having severe I/O performance problems. They tried to switch to DirectIO (filesystemset_option=setall), but they discovered that the IO performance got worse. So they quickly turned it back to ASYNC. The System Admins told the DBAs several times that there were no OS or File System issues. However they did recently migrate from one SAN to another with Volumne Manager Mirrorring. And the local was remotely mirrored, supposedly in asynchronous mode.
So then I had a look at the system. The system has 24 CPUs, 57GB of internal memory and 14 Oracle databases running on it. The two most important databases did according to statspack around 1700 I/Os per second. So I did a quickscan of the Physical Reads per SQL statement and found that some of the statements were missing indexes (based on Tapios Rules of Engagement), I also noticed that the buffer caches were really small.
After adding some indexes and increasing the buffer cache(s) it was discovered that the writes were still a bit (understatement) slow. To get the focus of the database and more onto the system and OS, I decided to use the test program for writes from Jonathan Lewis. The tests performed were 1K, 10000 blocks sequential writes, 8K, 10000 blocks random writes and 16K, 10000 blocks sequential writes. The tests were run on different mounts points and the interesting things were observered. The tests also performed slow on the database mount points. So the database was no longer the root of problem, something else was. Now the system administrators had to pay attention :) Then it was quickly discovered that volume manager mirroring was still enabled for the database mount point. When that was discovered and turned off, the performance improved but was still not good enough (acutally it was 3-4 times too slow). Then it was discovered that the remote SAN had it is cache disabled and the asynchronous operations became synchronous. When that was fixed, the writes almost returned to normal.
- Don’t rely on SA to tell you what not the problem is, proof him wrong.
- Indexes still matter
- Buffer Caches are way too small for todays hardware :)
- Use the test IO program more often :)