Thanks, nice essay. Just a nit: your scenarios don't seem to reflect the full impact of electricity. An H100 draws nearly 1 kW, so because we're assuming full utilization, the $0.01/0.03/0.10 per kWh charges can approximately be subtracted from the hourly rental rate. In that case, looking at the $1/hr scenarios, I'd expect the 3-year revenue projections to differ by a few percent, not a small fraction of a percent.
Author here: it is subtracted from the rental rates - you can run through the math in the full spreadsheet here (linked at the end of the article too)
https://docs.google.com/spreadsheets/d/1kZosZmvaecG6P4-yCPzMN7Ha3ubMcTmF9AeJNDKeo98/edit?usp=sharing
I didn't go too deeply into the spreadsheet itself, because the article was slowly entering the "too much information" category.
Thanks! This is correct for the $4.50 rental rate, but needs to be fixed for the other ones. For example, your cell E26 should subtract E$10 (electricity fee), not E25 (IRR from previous scenario).
Ah, you're right, that's a mix-up on my side. Surprisingly, the change isn't as large as I feared it would be (given how big a mistake it is).
There is also some feedback about using a lower IRR as the comparison baseline (10% is high), so I will adjust and update.
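The electricity point in this thread can be checked with a quick calculation. The sketch below is not the spreadsheet's actual formula; it assumes a flat 1 kW draw per H100 (the commenter's approximation), 100% utilization, and a 3-year window of 365-day years. The function names are made up for illustration.

```python
# Rough check of how electricity cost changes 3-year revenue per GPU.
# Assumption from the comment above: an H100 draws about 1 kW at full utilization.
GPU_POWER_KW = 1.0
HOURS_3Y = 3 * 365 * 24  # 26,280 hours at 100% utilization


def net_revenue_3y(rental_per_hour, electricity_per_kwh, power_kw=GPU_POWER_KW):
    """3-year revenue per GPU after subtracting electricity from the hourly rate."""
    net_hourly = rental_per_hour - electricity_per_kwh * power_kw
    return net_hourly * HOURS_3Y


def relative_drop(rental_per_hour, electricity_per_kwh, power_kw=GPU_POWER_KW):
    """Fraction of gross 3-year revenue consumed by electricity."""
    gross = rental_per_hour * HOURS_3Y
    return 1 - net_revenue_3y(rental_per_hour, electricity_per_kwh, power_kw) / gross


for rental in (1.00, 4.50):
    for price in (0.01, 0.03, 0.10):
        drop = relative_drop(rental, price)
        print(f"${rental:.2f}/hr at ${price:.2f}/kWh: {drop:.2%} of revenue goes to electricity")
```

Under these assumptions, electricity takes 1%, 3%, and 10% of revenue at $1/hr, which is why the $1/hr scenarios should differ by whole percentage points, while the same charges are a much smaller share at $4.50/hr.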
Hi Eugene, thanks for the article! I was wondering how you are thinking about the colocation cost? I usually think about that as another opex item along with electricity cost.
The capex items can be found listed here:
https://docs.google.com/spreadsheets/d/1Ft3RbeZ-w43kYSiLfYc1vxO41mK5lmJpcPC9GOYHAWc/edit?usp=sharing
I added in a naive 50k per node for facility/colocation as a capex item, leaving only electricity as opex. This uses a 1 (GPU) * 1.2 (system overhead) * 1.2 (facility overhead) multiple.
I was honestly split between capex and opex, and for nearly all other server scenarios I would have left it as opex.
But I decided to roll it into capex, given how many of the larger clusters are being deployed into purpose-built datacenters, or retrofitted datacenters / server rooms, with upgraded cooling and power, all of which carry substantial upfront costs.
I also have low faith that the GPUs will last longer than 6-8 years, considering the failure rates.
You folks probably have way more experience than me in projecting the facility cost.
Thanks for sharing the capex items!
For facility cost, we typically think about it as an opex item, as the capex for datacenter equipment and fit-out will be done by the colocation operator, and we always think of it as being split into different layers. In most cases the GPUs and the datacenter are indeed owned by separate parties, but some operators are vertically integrated and own both. In that case we still like to consider it opex, i.e. transfer pricing.
Anyway, if I use $150/kW/month over a 5-year expected lifetime, I get $89k of costs over the 5 years. I also typically add some support engineers and misc direct costs of about $4k a year. But I think it's best to convert this to opex to get a more accurate IRR. If you put it as a capex item, it unfairly lowers your IRR, because in reality you will be paying these costs over the 5 years and not upfront.
H100s could be deployed into datacenters without very high rack power density: you just put only one H100 per rack, so with H100s you may not always need a retrofit or purpose-built datacenter.
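To illustrate the capex-vs-opex point, here is a minimal sketch that compares the IRR when the same $89k facility cost is paid upfront versus spread monthly over 5 years. The $150/kW/month rate, the $89k total, and the 5-year lifetime come from the comment above. The node capex and monthly revenue are hypothetical placeholders, not figures from the article or its spreadsheets, and the bisection `irr` helper is a simple stand-in rather than anyone's actual model. The $4k a year of support costs is left out to keep the comparison clean.

```python
MONTHS = 60            # 5-year expected lifetime (from the comment)
COLO_RATE = 150.0      # $/kW/month (from the comment)
COLO_TOTAL = 89_000.0  # $ over 5 years (from the comment)

# Node power implied by the quoted figures; an inference, not a stated number.
implied_kw = COLO_TOTAL / (COLO_RATE * MONTHS)

# HYPOTHETICAL inputs, chosen only to make the example concrete.
NODE_CAPEX = 250_000.0      # $ upfront per node
MONTHLY_REVENUE = 11_680.0  # $ per node per month, net of electricity


def npv(rate, cashflows):
    """Net present value of per-period cashflows, first entry at t=0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))


def irr(cashflows, lo=-0.5, hi=1.0, tol=1e-10):
    """Per-period IRR by bisection.

    Assumes one upfront outflow followed by inflows, so NPV falls as the
    rate rises and crosses zero exactly once inside [lo, hi].
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cashflows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Facility cost paid upfront (capex) vs. spread evenly per month (opex).
capex_flows = [-(NODE_CAPEX + COLO_TOTAL)] + [MONTHLY_REVENUE] * MONTHS
opex_flows = [-NODE_CAPEX] + [MONTHLY_REVENUE - COLO_TOTAL / MONTHS] * MONTHS

irr_capex = irr(capex_flows)
irr_opex = irr(opex_flows)

print(f"implied node power: {implied_kw:.1f} kW")
print(f"annualized IRR, facility as capex: {(1 + irr_capex) ** 12 - 1:.1%}")
print(f"annualized IRR, facility as opex:  {(1 + irr_opex) ** 12 - 1:.1%}")
```

The total dollars paid are identical in both cases; only the timing differs. Because the capex treatment pulls the whole facility cost to month zero, it yields the lower IRR whenever the project's IRR is positive.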
Very strong work. I have numerous confirming datapoints about GPU supply,
If you can elaborate (on datapoints), would love to hear more
Excellent deep dive, well done Eugene. Really crisp insights
Another case: last year, there was a 15% profit margin for getting H100s into China. This year, the margin is only 1%.
Excellent article! Maybe a silly question, but looking at the H100 instances on AWS, Azure, and Google Cloud, the prices for H100 are still above $4 per GPU on 3-year commitments and over $10 on-demand. Is there any trade-off in using these resellers instead of the three big cloud providers, or is this IaaS commoditized? I’m thinking about this more to assess the ROI of cloud providers—should their prices converge to around $2 as well (given that the free market is clearing at this price), or are their services/features so different from the resellers that it is a different market?
Hi Eugene. Great essay, thanks! I'm new to your substack, so pardon my question if it's been answered already in another post. Any thoughts on inference economics, especially in light of chain-of-thought models like o1 and future open-source models that will do the same?
Wildly interesting. Thanks for sharing.
GPUs will become much harder to accumulate from now on.
I wonder how that will affect AI development?
Eugene, my question is: if the demand for GPUs is supposedly falling because only a very small number of companies need them, and this number continues to decline, and all the demand is going to inferencing, then why is demand for Blackwell even higher? And why is Hopper demand still so strong in this current quarter?
Time lag is a major factor. Before this article, I was aware of new H100 clusters still being pitched, fundraised, and built; these datacenter buildouts run on cycles of six months or more.
For Blackwell superclusters, it's very likely a handful of companies with billion-plus funding that are purchasing the bulk of the orders. Microsoft, OpenAI, Facebook, and X.ai have all confirmed publicly that they purchased Blackwell chips.
Due to the unique requirements for training, I have little doubt that, as long as those billions get raised, the purchase cycle for Blackwell will remain strong.