The feature that our NetApp team has been primarily working on is SCSI T10 Protection Information (PI). Ultimately, this standard adds an additional guarantee that the data we are storing is correct. Hard drives already provide a CRC for each sector and some other internal magic to ensure data correctness, but PI extends such guarantees end to end.
This standard requires that the sector size on hard drives be increased from 512 bytes to 520 bytes. These extra 8 bytes will then be split up into three fields:
Application Tag (2 bytes): A field controllable by the file system, database, or other application actually storing the data.
Guard Tag (2 bytes): An additional CRC, computed over the sector's 512 bytes of data.
Reference Tag (4 bytes): This contains the logical block address, that is, where on disk the data is supposed to land.
The goal of all of this stuff is to be able to prove (to some extent) that the data is correct and went to the desired location. For example, when users get some data back, they can request the guard tag and verify the correctness of the data themselves, instead of relying on black-box disk drives.
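To make the layout concrete, here is a minimal Python sketch of building and checking the 8 PI bytes for one sector. It assumes the T10 on-disk field order (guard, application, reference tag, each big-endian, which differs from the order listed above), the CRC-16/T10-DIF polynomial 0x8BB7 for the guard tag, and a reference tag holding the low 32 bits of the LBA. The function names are made up for illustration and have nothing to do with the real firmware.

```python
import struct

T10_DIF_POLY = 0x8BB7  # generator polynomial of CRC-16/T10-DIF (assumed guard CRC)
SECTOR_SIZE = 512      # user data bytes per sector
PI_SIZE = 8            # guard (2) + application (2) + reference (4)


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: init 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def build_pi(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Return the 8 PI bytes for one 512-byte sector (hypothetical helper)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected a 512-byte sector")
    guard = crc16_t10dif(sector)
    ref_tag = lba & 0xFFFFFFFF  # assumption: low 32 bits of the LBA
    return struct.pack(">HHI", guard, app_tag, ref_tag)


def verify_pi(block: bytes, expected_lba: int) -> bool:
    """Check guard and reference tags of a 520-byte block (hypothetical helper)."""
    if len(block) != SECTOR_SIZE + PI_SIZE:
        raise ValueError("expected a 520-byte block")
    sector, pi = block[:SECTOR_SIZE], block[SECTOR_SIZE:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref_tag == (expected_lba & 0xFFFFFFFF)
```

A 520-byte block is then just `sector + build_pi(sector, lba)`. `verify_pi` fails if either the data was corrupted (guard mismatch) or the block came from the wrong address (reference mismatch), which are the two guarantees described above.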
The first step is getting our hard disk manufacturers to add support for this new standard. Since it's new, our drive certification department can't really test the drives, so we get 'em and have to figure out whether they actually work at all. Most of the time they do, but sometimes we'll spend weeks hunting a bug only to find out it's a manufacturer firmware problem. While all this is going on, we need to make sure we satisfy our vendors' needs, which can be all over the board.
The LSI controller firmware code base consists of over 2 million lines of C/C++, has four layers of preprocessing, relies strongly on function pointer passing, and has gotta be wicked fast. All of this translates into a monster code base that is quite cryptic to look at! Increasing the size of a block on disk has a huge impact on this code, and it was very difficult to get everything correct. Sure, we've got a define in there that we just flipped from 512 to 520, but it's not that simple...
When moving stuff between controller caches, persistent cache backup devices, and to/from disk, we need to know whether to verify or insert the correct data for these new 8 bytes. This gets really tricky when using a RAID 6 algorithm, where the Q parity can't have correct PI data! This is because Q parity is used to reconstruct any data piece, but unlike P parity, which is a simple XOR, Q parity is computed using Galois Field Multipliers (GFM). In addition, sometimes we use an intermediate RAID volume to store data temporarily while doing a critical operation. Because this data isn't in its real home, the reference tag can't be correct, and thus we need to know to forward it without checking it. On top of that, the chips and their errata on the different platforms and interfaces need to be surfaced and solved. This can be incredibly tricky because the existing hardware was not intended for PI, and we often have to play games to trick it into doing what we want. If we make mistakes anywhere, we end up hanging ourselves because we think something failed and have no idea why the PI is incorrect. The user sees this as their arrays going offline...
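For readers who haven't seen RAID 6 math, the difference between P and Q can be sketched in a few lines of Python. This is only an illustration: it assumes the commonly used field GF(2^8) with polynomial 0x11D and generator 2, and the function names are invented for the example. Our controllers do this in hardware, and nothing here reflects the real firmware.

```python
from functools import reduce
from operator import xor

GF_POLY = 0x11D  # assumed field polynomial: x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return result


def gf_pow(base: int, exp: int) -> int:
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exp):
        result = gf_mul(result, base)
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse: a^254, since a^255 == 1 for nonzero a."""
    return gf_pow(a, 254)


def p_parity(blocks: list[bytes]) -> bytes:
    """P parity: plain byte-wise XOR across the stripe."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def q_parity(blocks: list[bytes]) -> bytes:
    """Q parity: XOR of each block scaled by 2^i in GF(2^8)."""
    out = bytearray(len(blocks[0]))
    for i, block in enumerate(blocks):
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            out[j] ^= gf_mul(coeff, byte)
    return bytes(out)


def rebuild_from_q(blocks: list[bytes], missing: int, q: bytes) -> bytes:
    """Recover blocks[missing] from Q and the surviving data blocks."""
    partial = bytearray(q)
    for i, block in enumerate(blocks):
        if i == missing:
            continue
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            partial[j] ^= gf_mul(coeff, byte)
    inv = gf_inv(gf_pow(2, missing))
    return bytes(gf_mul(inv, byte) for byte in partial)
```

With P, a lost block is just the XOR of everything that survived. With Q, every byte has been scaled by a per-drive coefficient, so recovery means undoing a field multiplication. The bytes sitting in a Q block are therefore not a simple XOR of anything the drives hold.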
I've added PI support to a number of existing operations, and from an implementation standpoint there are some very interesting problems to solve. Simply because it requires manipulating user cache data, this feature is prone to all kinds of problems. On top of that, if we as developers make a mistake, we have to protect the customer and shut down our controllers! This is a huge deal and thus a very sensitive issue.
Because these operations are so critical, we have been moving towards an automated test-driven development methodology - something that LSI hasn't fully adopted in the past. For every feature we add PI support to, we develop legacy functional tests and PI correctness tests. This is often more work than the actual feature development itself, but has already saved us from countless pitfalls!
All in all, it's exciting to be part of such bleeding-edge technology. Contributing to discussions with manufacturers and vendors has been very interesting and has opened my eyes to what truly goes into delivering and supporting enterprise-level systems.