I have been looking at the Google Visualization API for a while and finally found a way to produce the required JSON to render graphs directly from an Oracle database. I will make a white paper available on how to do this. Check out this link: http://lab4.oraperf.com/demogoogle.html. The procedure handles the SQL-like query language that the Google API uses, but not all functionality yet (pivot, offset and format are not yet supported).
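
To give an idea of the direction (a minimal hand-rolled sketch, not the procedure from the white paper; the EMP table and column names are just placeholders), the JSON for a Google Visualization DataTable can be built from a simple cursor loop:

  -- Minimal sketch: emit a Google Visualization DataTable as JSON
  -- (placeholder EMP data; the real procedure also handles the API's query string)
  DECLARE
    l_first BOOLEAN := TRUE;
  BEGIN
    dbms_output.put_line('{"cols":[{"label":"ename","type":"string"},'
                      || '{"label":"sal","type":"number"}],"rows":[');
    FOR r IN (SELECT ename, sal FROM emp) LOOP
      IF NOT l_first THEN dbms_output.put_line(','); END IF;
      dbms_output.put('{"c":[{"v":"' || r.ename || '"},{"v":' || r.sal || '}]}');
      l_first := FALSE;
    END LOOP;
    dbms_output.put_line(']}');
  END;
  /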

More on this later.

Posted by: akolk | March 4, 2009

PROTOCOL VIOLATION DETECTED

This error showed up in a trace file (udump) and caused all kinds of strange errors in the user application at a customer where I was working. A quick search on Metalink pointed me to Doc ID 601709.1, titled “What does the message PROTOCOL VIOLATION DETECTED mean in the trace files”. The message seems to be related to bug 4669305 (which is not published by Oracle Support), but it moved me in the direction of ORA-600 [12333] and ORA-3106. There I found a simple test case where someone had a reproducible case in SQL*Plus: select 1 from dual where rownum < ###. This caused the message to show up in the trace file and complain about the bind variable value ‘###’. That looked funny, as this was not a bind variable, but of course the database had CURSOR_SHARING set to SIMILAR or FORCE. My customer had CURSOR_SHARING set to SIMILAR. After we changed that to EXACT the problem went away :)
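
The literal-replacement behaviour is easy to see for yourself (a sketch in SQL*Plus; EXACT is the default setting, and the literal below is arbitrary):

  -- With literal replacement on, literals are turned into system binds
  ALTER SESSION SET cursor_sharing = FORCE;   -- SIMILAR behaves the same way here
  SELECT 1 FROM dual WHERE ROWNUM < 42;       -- the literal 42 is replaced by a bind

  -- Back to literals as literals; this is what made the problem go away
  ALTER SESSION SET cursor_sharing = EXACT;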

Posted by: akolk | February 20, 2009

Seminar: Deploying Oracle on Amazon Web Services

We are organizing a seminar about Oracle on Amazon Web Services. This event will be held on June 16th and 17th in The Netherlands. The exact location will be announced later, and more information will follow.

Seminar Description

If you say the words “Cloud Computing” frequently in meetings and at the workplace, your colleagues will realize that you are very intelligent, and you will be promoted. In addition, cloud computing improves upon traditional hosted infrastructure in many ways: deployment of new host resources takes only minutes, and applications can scale up quickly without costly hardware upgrades. In this one-day seminar we will learn the pros, cons and details of deploying Oracle databases on the Amazon Elastic Compute Cloud (EC2). We will also learn how to leverage Amazon’s Web Services to maximize performance and availability using features such as RMAN, Data Guard, ASM and RAC.

If you are interested in this seminar, please sign up here for more information.

Posted by: akolk | February 19, 2009

Oracle on 32-bit Windows: ORA-4030 errors

Over the last year I have seen a number of customers that have problems with Oracle on Windows: they frequently get an ORA-4030 error. Oracle on Windows is implemented differently from Oracle on UNIX. The big difference is that Oracle on Windows uses one process shared by all clients (threads in this case). This process has a limited address space, which is 2GB by default and can be enlarged to 3.5GB. So this process space has to accommodate all the Oracle sessions (threads). If you have a quick look at the process space, it basically consists of 4 parts (there are probably more, but I am not a Windows geek/expert):

  1. The Oracle Executable
  2. The Oracle SGA
  3. The Heap (for all threads/sessions)
  4. The Stack

This has a number of implications: you can’t really tune the SGA to a very large size; you can only support a certain number of users; and the heap can easily get fragmented, which means that large allocations will fail, along with other related problems.

So what can you really do to make this work? Reduce SGA_MAX_SIZE, or even remove the parameter and rely on your db_block_buffers and shared_pool_size settings. Or you can reduce the number of sessions/threads that connect to Oracle. But your best bet is to reduce the amount of memory needed for the SGA, as sketched below. Another option is to switch to 64-bit Windows and 64-bit Oracle, which will also allow for a much larger process space.
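
As a rough illustration (the parameters are standard, but the sizes are placeholder assumptions for a 32-bit system, not recommendations):

  -- Check how the SGA is currently laid out
  SELECT name, ROUND(value/1024/1024) AS mb FROM v$sga;

  -- Keep the total SGA well under the 32-bit process space limit
  ALTER SYSTEM SET sga_max_size     = 800M SCOPE = SPFILE;
  ALTER SYSTEM SET shared_pool_size = 200M SCOPE = SPFILE;
  -- Restart the instance for sga_max_size to take effect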

I will update later with some more info.

Posted by: akolk | February 12, 2009

forcedirectio: Another victim

I see it all the time: people using Best Practices and ending up in a big mess afterwards. This time it is the mount option forcedirectio. According to a NetApp best practice for Oracle, one should always use forcedirectio for the file systems that store the Oracle files. So people migrating to these systems read the white papers and best practices and then run into performance problems. A quick diagnosis shows that it is all related to I/O. Of course the NAS is blamed and NetApp gets a bad reputation. But it is not only NetApp; the same is true for all vendors that advise you to use forcedirectio.

What does forcedirectio do?

It basically bypasses the file system buffer cache, and because of that it uses a shorter and faster code path in the OS to get the I/O done. That is of course what we want; however, you are now no longer using the file system buffer cache. Depending on your OS and its defaults, a large portion of your internal memory could be used as the FS buffer cache. Most DBAs don’t dare to set the Oracle buffer cache bigger than 2-3 GB and don’t dare to use raw devices, so the FS cache is used very heavily. It is not uncommon to see an Oracle database use 2 to 10 times more caching in the FS than in the Oracle buffer cache. I have seen a system that used 20 to 40 times more caching in the FS than in the Oracle buffer cache.

So just imagine what happens if one then bypasses the FS buffer cache :) Most I/Os will become physical I/Os. Especially the reads will suffer. If the reads suffer, the end user response time will suffer directly. And if end users are unhappy, managers will suddenly remember that end users are important, and they will show their faces and take an interest again.

So how can we fix all of this? Easy. Just remember that if you remove the FS caching, you want to cache the data somewhere else! And don’t rely on the NAS or SAN cache. The best and cheapest place to cache the data is the Oracle buffer cache. This will help to improve your Buffer Cache Hit Ratio again :)

So if you use forcedirectio, have a look at your Oracle buffer cache size!
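
A minimal sketch of that check (standard parameter and view names; the 8G value is a placeholder assumption, not a recommendation):

  -- Is Oracle itself doing direct I/O?
  show parameter filesystemio_options

  -- How big is the buffer cache right now?
  SELECT component, ROUND(current_size/1024/1024) AS mb
  FROM   v$sga_dynamic_components
  WHERE  component = 'DEFAULT buffer cache';

  -- If the FS cache is bypassed, give that memory back to Oracle
  ALTER SYSTEM SET db_cache_size = 8G SCOPE = BOTH;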

I had a look at a system here in the Netherlands for a company that was having severe I/O performance problems. They tried to switch to direct I/O (filesystemio_options=setall), but they discovered that the I/O performance got worse. So they quickly turned it back to ASYNC. The system admins told the DBAs several times that there were no OS or file system issues. However, they did recently migrate from one SAN to another with Volume Manager mirroring. And the local SAN was remotely mirrored, supposedly in asynchronous mode.

So then I had a look at the system. The system has 24 CPUs, 57GB of internal memory and 14 Oracle databases running on it. The two most important databases did, according to Statspack, around 1700 I/Os per second. So I did a quick scan of the physical reads per SQL statement and found that some of the statements were missing indexes (based on Tapio’s rules of engagement), and I also noticed that the buffer caches were really small.
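
Such a quick scan can be done with something along these lines (a sketch against the standard v$sql view; the cut-off is an arbitrary assumption):

  -- SQL statements doing the most disk reads per execution
  SELECT   sql_id, executions,
           ROUND(disk_reads / GREATEST(executions, 1)) AS reads_per_exec
  FROM     v$sql
  WHERE    disk_reads > 10000          -- arbitrary cut-off
  ORDER BY reads_per_exec DESC;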

After adding some indexes and increasing the buffer cache(s), it was discovered that the writes were still a bit (understatement) slow. To get the focus off the database and more onto the system and OS, I decided to use the write test program from Jonathan Lewis. The tests performed were 10000 sequential writes of 1K blocks, 10000 random writes of 8K blocks, and 10000 sequential writes of 16K blocks. The tests were run on different mount points, and interesting things were observed: the tests also performed slowly on the database mount points. So the database was no longer the root of the problem; something else was. Now the system administrators had to pay attention :) It was quickly discovered that Volume Manager mirroring was still enabled for the database mount points. When that was turned off, the performance improved but was still not good enough (actually it was 3-4 times too slow). Then it was discovered that the remote SAN had its cache disabled, so the asynchronous operations became synchronous. When that was fixed, the writes almost returned to normal.

Lessons learned:

  • Don’t rely on the SA to tell you what the problem is not; prove him wrong.
  • Indexes still matter
  • Buffer Caches are way too small for today’s hardware :)
  • Use the test IO program more often :)

Posted by: akolk | February 10, 2009

Google Sync for Windows Mobile for Contacts and Calendar

I just started to use this new Google feature and installed it for my Google Apps domain (www.miraclebenelux.nl), and it works great. I can now sync my Contacts and Google Calendar automatically. It uses the ActiveSync utility from Windows Mobile, and you have to enable this feature for your Google Apps domain.

Now the only thing missing is the Tasks list from the mail view, and I am all set :)

Posted by: akolk | February 7, 2009

Using Amazon CloudFront

Last year, a customer was running into trouble with static product images on their website. The images were stored in an OCFS2 file system, and every so often OCFS2 would hang and the site slowly became unresponsive. We had talked a number of times about using a Content Delivery Network (CDN), also to improve the download streams to the client. The number of concurrent downloads from a domain is limited and differs per browser, so increasing the number of domains for your site helps to improve concurrent downloads. While we were discussing this (and after another problem with OCFS2), Amazon AWS sent out an email about the availability of CloudFront. Here you can store static content, and it will be cached on the servers around the world that are closest to the browsers requesting it. So we decided to implement this.

Step 1 was to sign up for the Amazon AWS service. Then we had to create an S3 bucket and upload the static content. There you run into one of the performance issues with the Cloud: uploading large amounts of data to the (European) S3 bucket is limited by your internet UPLOAD speed. Most people use an ADSL connection with a high DOWNLOAD speed but a much lower UPLOAD speed, so uploading 30 GB of data will take some time. An S3 bucket is basically a raw datastore; you have to tell it which objects in the datastore are directories and which are files. Uploading the data was done with the JetS3t program, which has a command-line interface (CLI) and is written in Java, so it can run on different platforms. After 3 days (a weekend) all the data was online and available from CloudFront. We implemented a monitoring service with Zabbix to watch the performance and availability of the CloudFront service. We noticed a couple of performance degradations, but the service has been available 100% of the time.
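
To put a number on the upload bottleneck (assuming a typical 1 Mbit/s ADSL upload speed; your speed will differ): 30 GB is about 30 × 8 × 1024 ≈ 245,760 Mbit, and at 1 Mbit/s that is roughly 245,760 seconds, or about 2.8 days. That matches the weekend it took here.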

So was it worth all the effort that was put into it? The reason we did it was that OCFS2 seemed to have a performance and stability problem. Well, 2 months later we discovered the real reason for that problem. There was a firewall between the read-only nodes and the one read-write node of the OCFS2 cluster. This firewall was doing some housekeeping and lost track of connections while doing so, so certain connections and messages between the OCFS2 nodes got lost. That caused the hangs and performance degradations. So the reason for switching has been fixed, but we haven’t switched back to the old implementation.

This Amazon CloudFront service turned out to be a nice solution for serving static content.

More to follow later.

Posted by: akolk | February 5, 2009

Shared Pool Latch Contention

Recently I was looking at a system that had some shared pool instability. Once a week, during the day, it would start flushing objects and loading them back in when needed. This resulted in large library cache pin and library cache load lock waits. That problem was attacked with some simple changes. One of the problems was that this customer had changed the reserved size minimal alloc (a hidden parameter) from the default 5120 bytes to 51200 bytes. As a result the (large) reserved shared pool wasn’t used.

Another strange problem was that every hour there was a spike in shared pool latch waits. It turned out that a DBA had built a script to check the shared pool (queries against x$ksmsp), and that caused the problems. While querying this view, Oracle needs to hold the shared pool latch, so if there are many small pieces to walk through, the latch can be held for a long time. When we killed the script, the spikes also disappeared :)
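
For illustration, the script was running something along these lines (a typical x$ksmsp free-space summary; the exact query is an assumption). Every execution walks all shared pool chunks while holding the shared pool latch, so on a large, fragmented pool this is expensive:

  -- Summarise shared pool chunks by class (x$ tables require SYSDBA)
  SELECT   ksmchcls AS chunk_class, COUNT(*) AS chunks,
           ROUND(SUM(ksmchsiz)/1024/1024) AS mb
  FROM     x$ksmsp
  GROUP BY ksmchcls;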

Posted by: akolk | January 30, 2009

Stability is your friend

Oracle and Microsoft are putting more and more automatic and self-everything features into their databases. There are of course many reasons why that makes sense (for Oracle and Microsoft), but does it make sense for all Oracle systems and their DBAs? I don’t think so. Consider this:

All these automatic and self-tuning features will manage resources and make decisions that can and will change the behavior of your system. Now consider that you are the DBA of a mission critical Oracle system. Do you want a system that runs well enough and is stable, or do you want a system that sometimes runs perfectly and sometimes runs badly? Let me know.
