International Data Corporation (IDC) is a provider of market intelligence and advisory services for the information technology, telecommunications, and consumer technology markets. A recent presentation titled “Beyond Organisational Boundaries: Answering the Enterprise Computing Challenge” analyzed worldwide spending on Servers, Power & Cooling, and Management/Administration. As noted, the “Second Critical Challenge” is controlling Management and Administration cost. On the near horizon, as hardware costs rise slowly or even decline, is cost parity between New Server Spending and Power & Cooling. Over the decade Power & Cooling has gone from 10% of spend to nearly 75%.
Energy cost is trending higher, as seen in the EIA data “U.S. Electric Utility Sales, Revenue and Average Retail Price of Electricity”. The following shows Average Retail Price for the Industrial Sector, by State, over the last two decades. Relative price stability has been replaced with sharper state-by-state disparities and rising cost.
The trend is clearly towards increasing cost of the electricity needed to power and cool the data center. Economic cycles, with their associated increases and decreases in demand, are modulating the upward trend. Embedded within the cost are renewable energy requirements and regulation. A recent IEEE Spectrum article, “Trouble Brewing for Wind?”, notes that operation and maintenance (O&M) costs are increasing sharply, to two or three times higher than initially projected. The referenced Wind Energy Operations & Maintenance Report estimates world O&M cost at 27 U.S. cents per kWh with credits offsetting 20 c/kWh. The cost of the credits must be financed by other means, ultimately driving up the average retail cost of electricity. Unsubsidized, this form of alternative energy would most likely quintuple cost and force significant increases in Power & Cooling, placing P&C on par with Management/Administration cost. Overall this highlights the importance of proactive data center planning, from capacity through adoption of new technologies, with complementary facilities and IT strategies. The life cycle of a data center places you directly in the path of higher P&C cost.
Following a recent announcement of a partnership between Seagate and LSI, LSI announced the bootable SSS6200 PCIe form factor solid state storage solution. Capacity is initially targeted at 300GB, scalable up to 1TB as flash technology evolves. The 300GB solution operates within the PCIe power envelope of 25W. In addition, the front-end is PCIe 2.0 x8 while the flash module connectivity is 6G SAS. The primary competitor of the SSS6200 is Fusion-io’s ioDrive product line, which currently is not bootable. This technology offers IT groups the opportunity to increase service levels as well as reduce server, storage and energy footprints, due to an orders-of-magnitude increase in random performance over rotating media RAID.
More to come ……
Both Facebook and MySpace are now using Fusion-io Solid State Storage to address the performance challenges of Social Networking. As James Hamilton notes in his Perspectives blog entry ‘Scaling at MySpace’, social networking applications are difficult to implement. He calls them “hairballs” because the many-to-many relationships are difficult to partition and scale. Massive daily growth of the hairball makes social networking architectures leading edge.
MySpace and Facebook employ a 3-tier architecture. MySpace infrastructure is comprised of a Web tier, a Cache tier and a Database tier. MySpace also uses Memcached for in-memory key-value chunks of data paged from SQL Server. Solid state storage is ideal for hot data, the cache tier.
Fusion-io’s case study ‘MySpace Uses Fusion Powered I/O to Drive Greener and Better Data Centers’ highlights that MySpace was able to cut “hardware need by 60%”. It was clarified in the ‘Scaling at MySpace’ comments that the HW reduction was for the cache tier. As summarized in ‘Scaling at MySpace’, the infrastructure consists of:
· 3,000 web servers
· 800 cache servers
· 440 SQL servers
· 440 SQL Server Systems hosting over 1,000 databases
· Each running on an HP ProLiant DL585
o 4 dual core AMD processors
o 64 GB RAM
· Storage tier: 1,100 disks on a distributed SAN (3PAR)
· 1PB of SQL Server hosted data
Applying the 60% reduction factor to the cache tier would indicate that it initially comprised 2,000 servers. These servers were provisioned with 10-12 15K rpm SAS drives and a RAID storage controller. After the upgrade the cache tier was provisioned at 800 cache servers with two 320GB ioDrives each, for a total of 1,600 ioDrives. In summary, it appears MySpace was able to decommission roughly 1,200 cache servers, 2,000 SAS RAID controllers and 20,000 SAS HDDs by deploying Fusion-io ioDrives in this random hot data application.
While energy footprint was not the sole reason for deploying solid state storage in the cache tier, it is noteworthy that the overall IT power reduction was most likely near 700 kW.
· 1,200 cache servers @ 250 watts each = 300 kW
· 2,000 PCIe SAS RAID controllers @ 25 watts each = 50 kW
· 20,000 15K rpm SAS HDDs @ 18 watts each = 360 kW
Add back in 1,600 Fusion-io ioDrives and factor in a PUE of 1.6, and this would be a reduction of roughly 1,070 kW at the facility level. At $0.10 per kWh a savings of about $940K/yr would be realized. This kind of real-world data is valuable when considering the deployment of new technology. Realistic ROIs can now be determined for right-sizing of capacity and performance. In addition there are some significant ramifications for existing technologies, such as traditional PCIe RAID, when used in hot data or random workload applications.
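The arithmetic behind these figures can be reproduced with a short script. This is only a sketch: it treats the per-device wattages as continuous power draw, and the 25 W figure for each ioDrive is my assumption (the PCIe power envelope), not a number taken from the case study. The function and variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def facility_kw_saved(removed, added, pue):
    """Net facility-level power reduction in kW.

    removed / added: lists of (device_count, watts_per_device).
    pue: Power Usage Effectiveness, applied to the net IT load.
    """
    it_watts = sum(n * w for n, w in removed) - sum(n * w for n, w in added)
    return it_watts / 1000 * pue


# Decommissioned hardware, as estimated in the post.
removed = [
    (1_200, 250),   # cache servers
    (2_000, 25),    # PCIe SAS RAID controllers
    (20_000, 18),   # 15K rpm SAS HDDs
]

# ASSUMPTION: 25 W per ioDrive (PCIe slot power envelope).
added = [(1_600, 25)]

kw_saved = facility_kw_saved(removed, added, pue=1.6)
dollars_per_year = kw_saved * HOURS_PER_YEAR * 0.10  # $0.10 per kWh

print(f"Facility power reduction: {kw_saved:,.0f} kW")
print(f"Estimated annual savings: ${dollars_per_year:,.0f}")
```

Under these assumptions the result lands within rounding of the 1,070 kW and $940K/yr quoted above. Changing the ioDrive wattage or the PUE shows how sensitive the savings estimate is to each input.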
As a follow-up to “Latency – How far have we come in a decade”, I thought it would be useful to add the recent Solid State Storage data to the mix. As already discussed under SATA SSD Product Announcements, there are several SATA SSDs available, from consumer up through enterprise class. The advantage of SATA SSDs is the ability to leverage an existing infrastructure and to also take advantage of the 3 primary values of RAID controllers – physical drive capacity aggregation, volume management and performance. But the RAID controller can also add latency to the software stack. The Fusion-io product line integrates PCI express, a storage controller, Flash Translation Layer (FTL) and NAND flash into a single PCIe solid state storage solution.
From the specification tab of Fusion-io’s ioDrive and ioDrive Duo web site:
This is a significant latency and throughput improvement over HDD-based RAID and the current crop of SSDs. As can be seen on the system level chart shown below, Fusion-io’s PCIe solid state storage incurs only a roughly 1.5 orders of magnitude increase in latency for off processor complex references. This is significant and can greatly benefit those aspects of the compute and storage hierarchy that are latency sensitive.
Where and how is solid state storage used? Can system level capacity and performance be improved? Are there limitations? Aspects to watch out for? That is a discussion for another day ……
The latest wave of product releases is revealing some interesting aspects of SSDs. You have to read beyond the headlines to find the nuggets. Let’s look at one nugget derived from various articles:
Seagate ‘Pulsar’ is spec’ing 4K random reads of 30,000 IOPs while sustained 4K random writes range from 10,500 IOPs down to 2,600 IOPs as a function of drive capacity. The I/O queue depth is listed as Q=32. The SSD controller and firmware are 3rd party and can only be inferred from Seagate’s investment strategy.
OCZ’s Vertex-2 Pro is based on SandForce’s SF-1500 SSD processor. At Benchmark Reviews, OCZ’s preliminary specification for 4K random write is 19,500 IOPs, higher than Seagate’s Pulsar, but interestingly the I/O queue depth is listed as Q=8, not Q=32.
Forgoing an SSD technology ‘deep-dive’, keep in mind that storage I/O ‘blocks’ are not the same as flash blocks. Micron provides several NAND Flash Technical Notes that can be found here. The Small-Block vs. Large-Block NAND Flash Devices note describes the array organization of large-block NAND devices. A NAND flash ‘page’ is 2Kbytes+64bytes. A ‘block’ is 64 pages, or 128Kbytes+4Kbytes. Read and Program operations can be performed on a Page, but Erase operations must be performed at the Block level. Once a file system workload has consumed the free pages of the SSD with Program cycles, the ‘dirty’ flash blocks must be Erased and returned to the free pool for future Page level programming. Managing the logical-to-physical translation, erase-before-write requirements and the many endurance requirements is the responsibility of several integrated and co-dependent algorithms, commonly packaged and referred to as a Flash Translation Layer (FTL).
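To make the page/block asymmetry concrete, here is a toy page-mapped FTL in Python. It is a sketch for illustration only, not how any vendor's controller works: the class name `ToyFTL`, the greedy "most stale pages" victim choice and the single reserve block are all my assumptions, and wear leveling, read disturb and ECC are ignored. Logical pages are mapped to physical pages, an overwrite marks the old page stale, and whole blocks are erased only after their surviving pages have been copied elsewhere.

```python
import random

STALE = -1  # marker for a programmed page whose data has been superseded


class ToyFTL:
    """Toy page-mapped flash translation layer (illustrative assumptions only)."""

    def __init__(self, num_blocks, pages_per_block=64):
        if num_blocks < 3:
            raise ValueError("need at least 3 blocks")
        self.ppb = pages_per_block
        # Expose fewer logical pages than physical ones (over-provisioning),
        # so garbage collection can always find a block with stale pages.
        self.logical_pages = (num_blocks - 2) * pages_per_block
        # Each slot holds a logical page number, STALE, or None (erased).
        self.blocks = [[None] * self.ppb for _ in range(num_blocks)]
        self.l2p = {}  # logical page -> (block, page)
        self.free_blocks = list(range(1, num_blocks))
        self.active, self.cursor = 0, 0
        self.host_writes = self.page_programs = self.block_erases = 0

    def write(self, lpn):
        """Host write of one logical page."""
        if not 0 <= lpn < self.logical_pages:
            raise ValueError("logical page out of range")
        if self.cursor == self.ppb:  # active block is full
            self._advance()
        self.host_writes += 1
        self._program(lpn)

    def _program(self, lpn):
        old = self.l2p.get(lpn)
        if old is not None:  # pages cannot be rewritten in place
            self.blocks[old[0]][old[1]] = STALE
        self.blocks[self.active][self.cursor] = lpn
        self.l2p[lpn] = (self.active, self.cursor)
        self.cursor += 1
        self.page_programs += 1

    def _advance(self):
        if len(self.free_blocks) > 1:
            self.active, self.cursor = self.free_blocks.pop(0), 0
            return
        # Only the reserve block is left: garbage-collect the dirtiest block.
        closed = [b for b in range(len(self.blocks)) if b not in self.free_blocks]
        victim = max(closed, key=lambda b: self.blocks[b].count(STALE))
        survivors = [p for p in self.blocks[victim] if p != STALE]
        self.active, self.cursor = self.free_blocks.pop(0), 0
        for lpn in survivors:  # copy still-valid pages before erasing
            self._program(lpn)
        self.blocks[victim] = [None] * self.ppb  # block-level erase
        self.block_erases += 1
        self.free_blocks.append(victim)

    def write_amplification(self):
        return self.page_programs / self.host_writes


if __name__ == "__main__":
    ftl = ToyFTL(num_blocks=64)
    rng = random.Random(0)
    for _ in range(100_000):
        ftl.write(rng.randrange(ftl.logical_pages))
    print("write amplification:", round(ftl.write_amplification(), 2))
    print("block erases:", ftl.block_erases)
```

On a fresh device every host write costs exactly one page program. Once the free pool is exhausted, random overwrites force the collector to copy surviving pages before each erase, so page programs exceed host writes. That extra background work is the kind of overhead that separates peak from sustained write performance.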
The preliminary SSD specifications reveal that I/O queue depth may be an important factor in SSD write performance. Asymmetric R/W throughput would be consistent with the level of FTL work required to translate the logical 4Kbyte writes into physical flash block program-erase cycles, but this is not immediately obvious to early adopters of the technology. Planning your SSD adoption strategy will require a thorough understanding of the storage hierarchy workload characteristics.
In summary, buyer beware. Your SSD plan may look good on paper at 30,000 IOPs per drive but in production sustain less than 5,000 IOPs per drive under heavy write queuing. Artificially limiting queue depths may have other deleterious effects. Does $30/GB for SSDs still make sense at 5,000 IOPs? What other factors flying below the headlines come into play? In-depth, real-world Capacity, Performance and Cost analyses are the only way to Right-Size a solution for success.
LSI and Seagate announced a partnership to enter the PCI express-based enterprise Solid-State Storage market. Go to PR Newswire for the press release. By combining LSI PCIe SAS host adapters and Seagate’s recently announced SSD Drive technology, the collaboration should establish both companies in the SSD enterprise segment. Although the press release did not specifically mention ‘Pulsar’, the only SSD ‘Drive’ technology that Seagate offers is Pulsar. The board-level solution will therefore require, as a minimum, a repackaging of Seagate Pulsar technology from the 2.5″ drive form factor to something that would enable a “board-level solution”. I would expect the new form factor to be consistent with Seagate’s high-volume qualification and manufacturing capabilities.
LSI does offer high-performance RAID and non-RAID SAS/SATA host controllers that could aggregate and manage multiple Pulsar technology drives. Last year LSI demonstrated one million IOPs with three 6Gb/s 8-port LSISAS2108 controllers. This is roughly 42,000 IOPs per SAS port, consistent with the SSD peak performance levels shown under “SATA SSD Product Announcements – Pre/Post CES 2010”. Since SSDs already protect the data and have a more predictable reliability profile, maybe the most important value of this collaboration would be the aggregation and management of SSDs rather than actual RAID data protection. As noted elsewhere, this is a direct challenge to Fusion-io. Both this collaboration and Fusion-io are addressing the latency gap between the processor complex and storage. In a follow-on post I will add these solutions to “Latency – How far have we come in a decade?”
As of this post neither LSI nor Seagate is the supplier of the SSD Controller and the Flash Translation Layer (FTL) IP. For Pulsar drives the SSD controller and FTL are 3rd party IP. The question left on the table for either LSI or Seagate – Is this a wait-and-see strategy related to NAND-based SSD technology or is there more to come?
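The per-port figure follows directly from the demonstration numbers. A quick check, assuming the load was spread evenly across all ports (the variable names are mine):

```python
total_iops = 1_000_000       # aggregate IOPs in the LSI demonstration
controllers = 3              # LSISAS2108 controllers
ports_per_controller = 8     # 6Gb/s SAS ports on each controller

# Assumes an even distribution of I/O across every port.
iops_per_port = total_iops / (controllers * ports_per_controller)
print(round(iops_per_port))  # 41667
```

At roughly 42,000 IOPs per port, a single SSD running near its peak specification could keep each port busy.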
So far 2010 has been an active year for SSD product announcements and early evaluation snapshots. Several products based on SandForce controllers were announced, while Micron and Marvell announced a 6Gbps SATA drive that appears to have the performance lead. The primary flash suppliers see an opportunity to add higher value to the product line. Existing storage suppliers need to introduce new technology in the 15K rpm HDD space. Partnerships abound as each establishes a market presence. On the date of this post Seagate had not released the name of its 3rd-party controller, although they are an investor in SandForce. It is an interesting strategy for Seagate as they own neither the Flash Translation Layer IP nor the Flash IP for their Pulsar product.
A brief survey of announcements, specifications and early evaluations provides a perspective on the emerging landscape.
Overall product specifications were somewhat spotty as each vendor nails down their product definition for a target market. This post was not intended to be an extensive overview of the technology associated with using NAND flash as a storage drive but there are some important take-aways.
- As expected SSDs significantly outperform HDDs on small block random performance. For 4K random short duration workloads an SSD drive provides roughly 100X IOPs over 15K HDD drives.
- For 4K random longer duration workloads the advantage drops to 4X to 16X over 15K HDD drives. Controller workload for managing the write wear leveling, read disturbance and overall endurance of the NAND array increases over the life cycle of an SSD. A ‘clean’ SSD requires little to no endurance management as the next write can be written to an unused cell. As free cells are consumed and others released for erasure, the endurance algorithm overhead increases and consumes controller resources, resulting in lower performance. This can be seen in Anandtech’s recent review of Micron’s RealSSD C300 from CES 2010. Random 4K writes peak early and, after roughly 25 minutes or 75 Mbytes of data, drop sharply to a sustained level. Each vendor specifies this differently. Seagate specifies a ‘Peak’ performance level that can be achieved under shorter durations and corresponds to a clean SSD. Seagate also specifies a ‘Sustained’ performance level that may be seen if the workload consumes cells and results in higher endurance algorithm overhead. Some call this a ‘dirty’ SSD. Plextor has taken the approach of specifying a single performance number, whether ‘Clean or Dirty’. At first glance Plextor performance appears low, but when viewed in the context of Peak and Sustained characteristics for SSDs, it may be sufficient for a particular system environment.
- The comparison is missing warranty or MTTF life times. NAND flash does have finite Program-Erase cycles for both MLC (lower) and SLC (higher). By throttling the rate at which these cycles are consumed the life expectancy and time to BER uptick can be extended. I suspect each vendor is performing a rigorous reliability analysis to determine product specifications in this area. It is also possible that vendor-customer relationships drive a unique combination of performance and reliability operational points for a particular market.
In summary, we are now seeing a richer selection of SSD products from flash vendors and traditional storage vendors who can supply OEMs at scale. SSDs will allow storage solutions to offer lower latency, especially for random workloads. But as always, buyer beware. That 100X advantage over HDDs can disappear under some workloads. A thorough evaluation of the system level operational environment should be completed before deploying SSDs in production. A 10X SSD performance droop could present some interesting system level response time problems, and obvious issues with management for spending large $$/GB on the latest / greatest storage.
I’d like to bring your attention to Google Fellow Jeff Dean’s keynote talk at LADIS 2009 on “Designs, Lessons and Advice from Building Large Distributed Systems”. The slides (PDF) can be found here. In this study a single server has 16GB of DRAM and 2TB of storage. This is Google’s building block of a large distributed architecture that now exceeds 500K nodes. The aspect of interest for this post is slide 24, “Numbers Everyone Should Know”, which is a summary of system level latencies. Google employs commodity servers and SATA drives for storage. The Google File System (GFS) is architected, as Jeff noted, around “Things will crash. Deal with it!” It is dealt with through a large distributed system that copies information 3 to 8 times. GFS does not employ RAID for data protection. Data protection is inherent in the system architecture.
As highlighted in the Leveraging the Cloud for Green IT CMG paper, large distributed systems like Amazon’s AWS are also based on commodity servers and storage. A decade ago the server of choice for performance benchmarking was the Dual Pentium III with the 440BX chipset. This machine consistently provided the performance team the highest throughput, both MB/s and IOPs. So, how far have we come in a decade? Let’s look at latency, from the processor to the rotating media storage, by comparing the PIII/440BX system to Jeff’s “Numbers Everyone Should Know”. Keep in mind Jeff’s numbers are not high end enterprise numbers, but the building blocks of some distributed Cloud offerings. Later we will look at bandwidth.
Now your first reaction may be that today’s systems perform much better, so how could this be true? Well, today’s architectures do have next generation microarchitectures, multiple cores, wider data paths, deeper buffers, etc. Parallelism is much greater, but under light loads the latency relationships, within orders of magnitude, have changed little over the years.
A glaring gap exists between the server complex and external storage. More on this later.
IDC recently released their 2010 predictions for enterprise data storage. TechTarget covered it in a January 4, 2010 article titled:
Enterprise data storage outlook 2010: industry predictions for storage pros
I’d like to highlight a couple of points from the TechTarget article:
“……full recovery from the most recent downturn is still a ways off as 2010 arrives. That means capital and operational efficiencies will continue to rule purchasing decisions this year….”
“….IT will undergo “a shift away from capital cost efficiencies to operational cost efficiencies……”
“….IDC said business units will look to establish greater independence from IT……”
The “independence from IT” prediction needs some color added. Businesses will always value IT; the question is which aspect of IT is on the table. As discussed in the paper, the business unit maintained control of the Business Intelligence application and process. What changed was the underlying platform provisioning, which dynamically grew into a services-oriented infrastructure offered as a commodity. I believe that in 2010 we will begin to see strategic planning, direction and execution of the disaggregation of Business Intelligence applications from the underlying IT platform. The benefits will be agility at scale and, if planned correctly, lower cost.
I should note that this is not virtualization, the consolidation of several applications onto a single server. The “greater independence from IT” will be business units minimizing the in-house IT infrastructure requirements of their applications and dynamically provisioning (greater independence) with on-demand IT. This is a shift from CapEx to OpEx cost models. It is also a shift in application architecture thinking.
The Leveraging the Cloud for Green IT paper, which highlighted the move from on-premise traditional CapEx cost model to an OpEx cost model enabled by Cloud services, demonstrated a process to achieve those business goals.
The IDC predictions are for storage, but as everyone knows, storage requirements are driving many IT decisions, from servers to data center provisioning. By sheer size, services and servers will increasingly follow the data.
InnoDisk, a provider of flash based storage devices for industrial applications and embedded systems, announced a ‘Matador II’ PCIe SSD on December 22, 2009. An EE Times Asia press release can be found here. Development began 3 years ago, and the product offers PCIe x16, 2TB capacity and “internal RAID allocation functions”. The target market is stated as “enterprise server, high-end applications, cloud computing, etc”. Iometer throughput screen shots on the announcement page show 22,354 IOPs and 917 MB/s. Not earth shattering, and somewhat low compared to the Fusion-io product line. Pricing was not available at the time of this post.
As previously noted in the “WD enters SAS HDD market, the entry point implications” post, SSDs are targeting the 15K rpm HDD market that requires high random performance and low latency. The EE Times press release touts bandwidth, not a primary requirement for SSD storage. Given the lack of specifics it is not clear that the InnoDisk product is competitive in this market segment. Hopefully more details are forthcoming.