Thursday, May 7, 2026
Technology

Training a Top Open-Source AI Model Costs Around $75 Million and 4,000 Tons of CO2


Original source: Stanford Online
This article is an editorial summary and interpretation of that content. The ideas belong to the original authors; the selection and writing are by Streamed.News.


This video from Stanford Online covered a lot of ground. Six segments stood out as worth your time. Everything below links directly to the timestamp in the original video.

Training the best publicly available AI model today costs about as much as a mid-budget Hollywood film — and the bill grows tenfold with each new generation.


Training a Top Open-Source AI Model Costs Around $75 Million and 4,000 Tons of CO2

A detailed cost breakdown of training Meta's Llama 3 400B — currently the most capable publicly available large language model — puts the total bill at roughly $75 million: approximately $52 million in GPU rental costs for 16,000 Nvidia H100 chips running for about 70 days, plus an estimated $25 million in staff salaries. The model's training run consumed around 4,000 metric tons of CO2 equivalent, comparable to 2,000 round-trip transatlantic flights. Notably, the training computation came in just below the threshold established by a Biden administration executive order requiring special government scrutiny of frontier AI models.

The figures illustrate how rapidly the economics of AI development are escalating. Each new generation of frontier models increases compute roughly tenfold, meaning that if current trajectories hold, the carbon footprint alone could become a serious concern within two or three model generations — even though today's numbers remain manageable relative to other industrial activities.
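
The arithmetic behind these claims is easy to reconstruct. The sketch below assumes a rental price of about $2 per H100-hour and the standard estimate of training FLOPs as 6 × parameters × tokens; both are common rules of thumb rather than figures disclosed by Meta, and the 15-trillion-token count comes from the data discussion later in this piece.

```python
# Back-of-envelope reconstruction of the training bill. All inputs are either
# quoted in the talk or assumptions; the $2/GPU-hour rate is an assumption.
gpus = 16_000             # Nvidia H100s
days = 70                 # length of the training run
usd_per_gpu_hour = 2.0    # assumed market rental rate

gpu_hours = gpus * days * 24
print(f"GPU-hours: {gpu_hours:,}")                                 # 26,880,000
print(f"Rental cost: ${gpu_hours * usd_per_gpu_hour / 1e6:.0f}M")  # ~$54M

# Rule of thumb: training FLOPs ~= 6 * parameters * tokens.
params, tokens = 400e9, 15e12
print(f"Training FLOPs: {6 * params * tokens:.1e}")   # ~3.6e25, below 1e26

# If compute (and, naively, emissions) grow ~10x per generation:
for gen in (1, 2, 3):
    print(f"+{gen} generation(s): ~{4_000 * 10**gen:,} tons CO2e")
```

On these assumptions the rental figure lands within a few percent of the $52 million cited, and the estimated 3.6 × 10^25 training FLOPs sits just under the 10^26 threshold that triggered reporting requirements under the executive order.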

"Every new generation the number of flops essentially multiplies 10x, or at least that's what they try, if they have enough energy and if they can buy enough GPUs."

▶ Watch this segment — 54:56


AI Scaling Laws Let Researchers Predict Model Performance Years in Advance — and Expose Architecture Tweaks as Largely Irrelevant

Since around 2020, researchers have established that the relationship between computational resources and AI model performance follows a reliable log-linear pattern: double the data or the model size and performance improves by a predictable amount, with no sign of plateauing. This predictability has transformed how companies allocate resources — instead of tuning the final large model directly, teams now run experiments on smaller models across different scales, fit a curve to the results, and use it to forecast which configuration will perform best once scaled up massively.
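
As a concrete illustration of that workflow, the sketch below fits a straight line to loss versus compute in log-log space and extrapolates three orders of magnitude upward; the small-scale results are invented for the example.

```python
import numpy as np

# Invented results from four small-scale training runs:
# total training compute (FLOPs) and final validation loss.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.20, 2.75, 2.36, 2.03])

# Scaling laws are straight lines in log-log space:
# log10(loss) = a * log10(compute) + b.
a, b = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

# Forecast performance at a frontier-scale budget 1,000x larger.
target = 1e24
predicted = 10 ** (a * np.log10(target) + b)
print(f"Predicted loss at {target:.0e} FLOPs: {predicted:.2f}")
```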

A landmark paper known as Chinchilla quantified the optimal balance: for pure training efficiency, a model should see about 20 tokens of data per parameter. For real-world deployment, where inference costs accumulate over time, the practical optimum shifts to roughly 150 tokens per parameter, favouring smaller models that are cheaper to run repeatedly. The broader implication is counterintuitive: incremental architecture innovations — new activation functions, layer tweaks — matter far less than raw data quality and scale, a conclusion the lecturer described as one the research community has been slow to accept.
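
Plugged into a model of the size discussed above, the two ratios give very different data budgets (a quick sketch; the 20 and 150 tokens-per-parameter figures are the ones quoted in the lecture):

```python
params = 400e9  # a Llama 3 400B-scale model

# Chinchilla-style compute-optimal ratio vs. inference-aware ratio.
train_optimal = 20 * params    # ~20 tokens per parameter
deploy_optimal = 150 * params  # ~150 tokens per parameter

print(f"Compute-optimal budget:  {train_optimal / 1e12:.0f}T tokens")   # 8T
print(f"Inference-aware budget: {deploy_optimal / 1e12:.0f}T tokens")   # 60T
```

Read the other way round, a fixed 15-trillion-token corpus is compute-optimal for a roughly 750-billion-parameter model but inference-optimal for one around 100 billion parameters — which is why deployed models tend to be smaller and trained longer.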

"Once you start thinking about it in scaling law terms, you really realize that all the architecture differences that we can make — the small minor ones — all they do is maybe change a little bit the intercept. But really, that doesn't matter."

▶ Watch this segment — 40:43


Building an AI on 'the Internet' Requires Filtering Out 99% of What It Contains

The process of preparing data to train a large language model is far more intensive than the phrase 'trained on the internet' suggests. Starting from roughly 250 billion web pages — about a petabyte of raw HTML — engineers must extract readable text, strip boilerplate like headers and footers, remove harmful or private content, deduplicate paragraphs that appear thousands of times across the web, and then apply both rule-based and machine-learning filters to discard low-quality documents, ultimately throwing away roughly 99% of the raw crawl. One standard technique uses Wikipedia's outbound links as a proxy for quality, training a classifier to identify and favour content that resembles sources Wikipedia considers credible. Even after that aggressive filtering, usable datasets have grown from around 150 billion tokens in earlier academic benchmarks to roughly 15 trillion tokens for today's leading models.
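
A minimal sketch of the Wikipedia-links trick, using scikit-learn in place of the fastText-style classifier production pipelines typically use; the training documents below are invented placeholders.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Positive class: text from pages Wikipedia links out to (assumed high quality).
# Negative class: random pages from the raw crawl. Both lists are placeholders.
wiki_linked = ["The mitochondrion is an organelle found in eukaryotic cells.",
               "In 1905, Einstein published four papers in Annalen der Physik."]
random_crawl = ["CLICK HERE!!! best deals best deals best deals free shipping",
                "home | about | contact | login | cart (0) | sitemap"]

X_text = wiki_linked + random_crawl
y = [1] * len(wiki_linked) + [0] * len(random_crawl)

# Hashed bag-of-words features keep memory flat across billions of documents.
vec = HashingVectorizer(n_features=2**18, ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.transform(X_text), y)

# Keep a crawled document only if it looks like reference-grade text.
doc = "Photosynthesis converts light energy into chemical energy in plants."
keep = clf.predict_proba(vec.transform([doc]))[0, 1] > 0.5
print("keep" if keep else "discard")
```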

Companies rarely disclose their data practices publicly, driven partly by competitive advantage and partly by legal exposure around copyright. The final stage of pre-training typically involves briefly fine-tuning on a small corpus of high-quality material — such as Wikipedia — at a reduced learning rate, essentially allowing the model to 'overfit' on the best available text before deployment.
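
In code, that final stage is nothing exotic: a short training pass at a fraction of the main run's learning rate. The sketch below uses a toy model and random data purely to show the shape of the schedule; it is not a recipe any particular lab has published.

```python
import torch
from torch import nn

model = nn.Linear(64, 64)   # toy stand-in for the pre-trained LM
main_lr = 3e-4              # learning rate used for the bulk of pre-training

# Final 'annealing' pass: high-quality data only, at a much lower rate.
opt = torch.optim.AdamW(model.parameters(), lr=main_lr / 10)
for step in range(100):     # brief, relative to the main run
    batch = torch.randn(32, 64)        # stand-in for Wikipedia-grade tokens
    loss = nn.functional.mse_loss(model(batch), batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```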

"Collecting world data is a huge part of practical large language model training. Some might say it's actually the key."

▶ Watch this segment — 28:32


Fine-Tuning an AI Assistant Requires as Few as 2,000 Examples Because It Teaches Style, Not Knowledge

The process of turning a raw language model into a useful AI assistant — known as supervised fine-tuning — requires far less data than researchers once assumed. Research has shown that scaling the number of training examples from 2,000 to 32,000 yields minimal improvement, because fine-tuning does not inject new knowledge into the model. Instead, it instructs the model to respond like one specific type of user — one who answers questions directly — rather than mirroring the full diversity of writing styles it encountered during pre-training. The knowledge was already absorbed; fine-tuning is essentially a formatting lesson.
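
In practice, the 'formatting lesson' is often implemented by wrapping every example in a fixed template and masking the loss over the prompt, so gradients flow only through the response. The sketch below shows that widespread pattern (not attributed to the lecture), with a whitespace split standing in for a real tokenizer.

```python
# Wrap each example in a fixed template; train only on the response tokens.
TEMPLATE = "### Instruction:\n{q}\n\n### Response:\n"

def build_example(q: str, a: str):
    prompt = TEMPLATE.format(q=q)
    tokens = (prompt + a).split()   # whitespace split stands in for a tokenizer
    n_prompt = len(prompt.split())
    # -100 is the usual ignore_index: no loss is computed on prompt tokens.
    labels = [-100] * n_prompt + tokens[n_prompt:]
    return tokens, labels

tokens, labels = build_example("What causes tides?",
                               "Mainly the Moon's gravitational pull.")
for tok, lab in zip(tokens, labels):
    print(f"{tok!r:20} -> {lab!r}")
```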

The Stanford-developed Alpaca project demonstrated a practical shortcut: using an existing OpenAI model (text-davinci-003) to generate 52,000 synthetic instruction-response pairs from just 175 human-written seed tasks, then fine-tuning a smaller model on that synthetic data. The result performed comparably to early chatbot systems built with expensive human-labelled corpora. The episode helped launch an entire subfield of synthetic data generation, now central to how both academic and commercial AI teams reduce the human labour required to build assistant models.
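
The generation loop itself is simple. Below is a self-instruct-style sketch in the spirit of Alpaca; `complete` is a hypothetical stand-in for a call to the teacher model and returns a canned answer so the example runs end-to-end.

```python
import json
import random

def complete(prompt: str) -> str:
    # Hypothetical stand-in for the teacher model's API; returns a canned
    # example so this sketch runs without network access.
    return json.dumps({"instruction": "Name three primary colours.",
                       "response": "Red, yellow, and blue."})

# Invented stand-ins for two of the 175 human-written seed tasks.
seeds = [
    {"instruction": "Explain photosynthesis to a child.",
     "response": "Plants use sunlight to turn air and water into food."},
    {"instruction": "Suggest a title for an essay about rivers.",
     "response": "Where the Water Goes."},
]

def generate_pair() -> dict:
    shots = random.sample(seeds, k=2)
    prompt = ("Here are example tasks:\n"
              + "\n".join(json.dumps(s) for s in shots)
              + "\nWrite one new, different task in the same JSON format:")
    return json.loads(complete(prompt))

# Repeating this ~52,000 times and fine-tuning a small model on the output
# reproduces the shape of the Alpaca pipeline.
print(generate_pair())
```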

"All you learn is how to format your desired answers. Your pre-trained model essentially models the distribution of every user on the internet — all you tell your model is: you should actually be optimising more for this type of user than another one."

▶ Watch this segment — 59:34


A Simpler Technique Called DPO Has Largely Replaced the Reinforcement Learning Method Behind ChatGPT

ChatGPT's original alignment method — Reinforcement Learning from Human Feedback using an algorithm called PPO — involved a multi-stage process: collecting human preference rankings, training a separate reward model on those rankings, and then running a notoriously finicky reinforcement learning loop. A later method called Direct Preference Optimization, or DPO, collapses the same outcome into a single maximum-likelihood training step: increase the probability of generating responses humans preferred, decrease the probability of those they rejected. Under certain mathematical assumptions, both approaches converge to the same optimal solution, but DPO requires no separate reward model and no reinforcement learning infrastructure.
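
The objective itself fits in a few lines. Below is a sketch of the standard DPO loss operating on pre-computed, summed per-response log-probabilities; the β value and the numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on summed per-response log-probabilities."""
    # How much more the policy prefers each response than the frozen reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the preferred response's margin above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Illustrative log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -13.0]),
                torch.tensor([-12.5, -15.5]), torch.tensor([-13.0, -14.0]))
print(loss.item())
```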

The practical consequences have been significant. PPO's complexity — with clipping, rollout loops, and poorly documented edge cases — made it difficult to implement reliably outside well-resourced labs. DPO, equivalent under those assumptions yet far simpler to run, has become the standard approach in the open-source AI community and is increasingly used in industry. The hypothesis that supervised fine-tuning itself may contribute to hallucination — by training models to produce plausible-sounding answers to prompts they never encountered during pre-training — has added further urgency to finding alignment methods that don't amplify false confidence.

"With PPO you had to collect human preferences, then train a reward model with maximum likelihood, then use reinforcement learning. Now all you do is basically maximum likelihood — much simpler."

▶ Watch this segment — 1:09:27


AI Benchmarks Are Broken by Length Bias: Verbose Models Score Higher Even When They're Not Better

Evaluating aligned AI models is fundamentally harder than measuring raw language model performance, because standard metrics like perplexity and validation loss become meaningless once a model has been trained to maximise human preferences rather than predict text distributions. The most trusted public benchmark, Chatbot Arena, addresses this by having real users blindly compare outputs from two models and vote on which is better — aggregated across hundreds of thousands of comparisons. A cheaper automated alternative, AlpacaEval, uses GPT-4 as a judge and achieves 98% correlation with Chatbot Arena rankings at a fraction of the cost.
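
Leaderboards like Chatbot Arena's are typically derived from those pairwise votes with an Elo- or Bradley-Terry-style rating model. The sketch below, with invented votes, shows the basic mechanics of that aggregation.

```python
from collections import defaultdict

# Elo-style aggregation of pairwise votes (simplified; votes are invented).
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_b", "model_c"), ("model_a", "model_b")]  # (winner, loser)

ratings = defaultdict(lambda: 1000.0)
K = 32  # update step size

for winner, loser in votes:
    # Expected score of the winner under the current ratings.
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected)
    ratings[loser] -= K * (1 - expected)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```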

Both approaches share a critical vulnerability: a systematic preference for longer answers. An experiment showed that prompting GPT-4 to be verbose pushed its win rate to 64%, while instructing it to be concise dropped the figure to 20% — against the same underlying model as baseline. This length bias is more dangerous with automated judges than with humans, because a human eventually rejects a five-page answer to a simple question, while an AI judge may keep rewarding verbosity indefinitely. Applying causal inference techniques to statistically control for response length substantially reduces the distortion.
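
The correction works roughly like this: regress the judge's verdicts on the length gap between the two answers, then read off the win rate at a length gap of zero. The sketch below uses invented data and a plain logistic regression — a simplification of what length-controlled evaluations such as AlpacaEval's actually fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2_000

# Invented setup: the candidate is systematically more verbose than the
# baseline, and the judge has a mild preference for longer answers.
length_gap = rng.normal(300, 200, n)   # candidate minus baseline, characters
true_quality = 0.3                      # candidate's real edge, in logits
length_bias = 0.004                     # judge's per-character verbosity bias
p_win = 1 / (1 + np.exp(-(true_quality + length_bias * length_gap)))
judge_prefers_candidate = rng.random(n) < p_win

# Regress verdicts on the length gap; the intercept is the quality signal.
clf = LogisticRegression().fit(length_gap.reshape(-1, 1),
                               judge_prefers_candidate)
controlled = 1 / (1 + np.exp(-clf.intercept_[0]))  # win rate at equal lengths

print(f"raw win rate: {judge_prefers_candidate.mean():.2f}")   # inflated
print(f"length-controlled win rate: {controlled:.2f}")         # closer to truth
```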

"If we ask GPT-4 to be slightly more verbose — we just say in the prompt 'be verbose in your answers' — it gets a win rate of 64%. And if we ask it to be concise, it gets 20%. So there's a huge variance depending on whether you ask it to be concise."

▶ Watch this segment — 1:23:42


Summarised from Stanford Online · 1:44:31. All credit belongs to the original creators. Streamed.News summarises publicly available video content.
