Database design and RAID 0
This article is presented by Dick Correa, DTI Data Recovery’s head programmer and RAID Data Recovery engineer.
Databases, by design, are probably one of the single most demanding applications in regards to I/O throughput. This article will take a look at this fact and offer a solution using RAID 0 technologies. In order to do that, let’s first take a look at a standard database and dissect its functioning.
Databases are comprised of two basic elements. A flat file, and an index into that flat file. For example, one may have a database of patients. Each patient should have basic information, such as name, address, phone, insurance carrier, etc, etc. Each one of the patients, and their information is called a record. There is one record for each patient and a record is stored in a file, a ‘flat file’. Flat file meaning a file which is not sorted. For illustration lets say that each record, will take approximately 300 bytes of storage. As each new patient is added, the new patients record is appended to the end of the file. I have personally worked with many hospitals and doctors and I can safely say that there can be millions of patients in one hospital over a period of 2-3 years.
We now have a flat file, with millions of patients, and in walks Mr. Johnson. Now we need to look up Mr Johnsons record. Are we going to look at EVERY record to find Mr Johnsons record? How long would that take? How much pressure on the server would that transfer? Compound this with the fact that there may be several clerks querying the database simultaneously. How do we, as database administrators, stop the I/O bottle neck, and ultimately a long wait between queries?
First of all, we use indexing. ISAM is one method, B+tree is another. We design indexes into the flat file. The key here is that all indexes are sorted and therefore search engines can be designed to use that fact. ISAM uses ascending ascii as a parameter, B+Tree uses a modified ISAM and a balanced search engine. The point here is that the indexes are usually very small, and sorted. For example, an index may only contain the last name of a patient. So, we enter Johnson, we use the index to search for Johnson, then we look at the index pointer which tells us where in the flat file Mr. Johnsons record is. Now, there may be many Johnsons in the index, so we may have another index of just social security numbers. So, the key to indexes is the sorting methodology used, and index design. How does RAID 0 figure into all of this?
RAID 0 uses several drives across separate I/O ports to perform as a single drive. The RAID is striped using a user defined size and the data is stored using the stripe as a basis. For example, lets take a four drive RAID 0, with a stripe size of 8K. We have a 256k block of data that we are trying to save. The procedure executes like this, 8K to drive 0, 8K to drive 1, 8K to drive 2, 8K to drive 3. 32k written. Then the procedure starts again and continues until all 256K is written. The RAID card that controls the array does not wait for each write to finish until it goes to the next drive, it notifies all 4 drives simultaneously and instead of one 8K block being written, a 32K stripe is written, in the same amount of time. For purposes of illustration lets say that it takes 1 second to write an 8K block to a single drive. With 256K of data to write it would take 32 seconds to write the data to a single drive. With a four drive RAID 0 it would take one quarter of the time, or 8 seconds. The data is the same, the data size is the same, but having a RAID 0 makes your throughput four times faster.
Now we can take our example one step further. When writing to the drive it is the head positioning and notification that takes all the time, the actual write is theoretically instantaneous. With that in mind it is better to write as much data as possible once the head is in position. How do we accomplish that with RAID 0? Instead of a stripe size of 8K, lets increase the size to 32K. We can use the same write speed in our example since we understand that it is head positioning, and not the actual write that takes the time. So, using the same math we can speed up throughput from 32 seconds to a single drive, to 8 seconds for a RAID 0 with an 8K stripe, to 2 seconds with an increase in the stripe size to 32K. Ultimately you can take this and change the stripe size to 64K and take only one second to write a 256K block of data.
Although this is very simplified in order to make a point it is easily seen that a RAID 0 configured properly will in fact increase throughput and ultimately speed up database access.
When designing a storage system for a database look at record size, cache sizes on the RAID card as well as the drives. How well does the firmware handle look ahead, on the RAID card as well as the drives themselves? Does the database cache in chunks that can be optimized to work well with your RAID caching schema? These are only a few factors that must be taken into consideration when designing your database.
Lastly, although RAID 0 may seem to be the end all and be all to database design, it has one drawback. It contains no native recovery functions. In other words, it is not like RAID 1 which is mirrored, or like RAID 5 which uses XOR math and single drive parity. However, with RAID 1 capacity is diminished and I/O speed is severely throttled. With RAID 5 you may have the same capacity, but the firmware still must calculate parity, and that parity has to be written to the drive. In other words for every write there is a calculation and an extra write. With a large and active database the I/O bottleneck could be significant.
So with RAID 0, institute a well planned backup system that takes all possibilities into account include restoration drills. Design your database so that it works well with your I/O system. Do those things and you will find that RAID 0 can be the answer to all of your storage requirements.
If ever you require RAID 0 Data Recovery please call DTI at (866) 438-6932