Moving Towards Greener LLMs
It’s no surprise that, with the upsurge of AI and chatbot usage, serious social and economic implications follow: people (particularly students) become over-reliant, jobs that require little emotional intelligence become automated, privacy invasions and data leaks become more frequent, and so on. Perhaps an issue that is often brushed over, deliberately or not, is the environmental impact of large language models (LLMs).
To begin: what are LLMs? They’re all about word prediction, the ability to process and generate human language based on a prompt from the user. We’ve come into contact with plenty of LLMs, including ChatGPT, Llama, Gemini, and Claude. To generate accurate responses, these models are trained on vast amounts of text, and the training process itself requires a hefty amount of energy. For example:
GPT-3: 1,287,000 kWh
GPT-4: 50,000,000 to 60,000,000 kWh
Llama 3: 500,000 kWh
These figures are only estimates, as various factors contribute to an LLM’s energy consumption. Model size, usually measured in the number of parameters, is a big factor (larger models consume more energy), and so is the quantization method, in which high-precision numerical representations are converted to lower-precision ones.
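To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization in Python; the weight values are made up for illustration:

```python
import numpy as np

# Minimal sketch of symmetric int8 quantization: map float32 weights
# onto 127 integer steps, then reconstruct approximate floats.
weights = np.array([0.42, -1.37, 0.08, 2.05], dtype=np.float32)

scale = np.abs(weights).max() / 127                     # one scale for the tensor
quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller storage
dequantized = quantized.astype(np.float32) * scale      # approximate originals

print(quantized)     # [ 26 -85   5 127]
print(dequantized)   # close to the originals, with small rounding error
```

Each weight now occupies one byte instead of four, so the model needs less memory traffic per operation, which is where the energy savings come from.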
These are only the factors during LLM training. The infrastructure chosen to operate the LLMs is also a big deciding factor. Graphics processing units (GPUs) handle the main computation of LLMs, including interpreting the query, using mathematical operations to determine context, and generating a response. NVIDIA’s A100 GPU draws a maximum of around 400 watts, while the newer H100 SXM draws a maximum of around 700 watts; these are currently the two most popular GPUs on the market. Tensor processing units (TPUs) are less widely available and less versatile, but similarly help speed up machine learning workloads. The second-generation TPU draws less power, at around 200-250 watts per chip; v3 and v4 draw around 300 to 400 watts as a trade-off for optimal performance.
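As a back-of-the-envelope illustration of how wattage translates into training energy, here is a quick calculation. The cluster size and training duration below are hypothetical placeholders, not figures from any real training run:

```python
# Back-of-the-envelope training energy from power draw x time x GPU count.
num_gpus = 1000          # hypothetical cluster size
watts_per_gpu = 700      # H100 SXM maximum power draw, per the text above
training_days = 30       # hypothetical training duration

hours = training_days * 24
energy_kwh = num_gpus * watts_per_gpu * hours / 1000  # W*h -> kWh
print(f"{energy_kwh:,.0f} kWh")  # 504,000 kWh for this hypothetical run
```

Even this modest hypothetical run lands in the same ballpark as the Llama 3 figure above, which shows how quickly GPU-hours add up.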
Once an LLM is actually deployed and serving users (a stage known as inference), many more factors add to energy consumption, which makes estimating the energy used per query much harder. Rough estimates are as follows:
Llama 3.1 8B: 114 J
GPT-4o: 1,224 J
Claude 3 Opus: 14,580 J
Gemini 2.0 Flash: 79.2 J
For reference, a household LED bulb uses around 1,000 J in two to three minutes, so each of these numbers is small on its own. But hundreds of millions of people use LLMs daily, each sending multiple queries, and the energy use multiplies quickly: ChatGPT alone processes over 1 billion queries a day.
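To see how the per-query figures scale, here is a rough calculation using the GPT-4o estimate from the list above and the 1-billion-queries-a-day figure:

```python
# Scaling a single query's energy estimate to a daily query volume.
joules_per_query = 1224          # GPT-4o estimate from the list above
queries_per_day = 1_000_000_000  # "over 1 billion queries a day"

daily_joules = joules_per_query * queries_per_day
daily_kwh = daily_joules / 3.6e6  # 1 kWh = 3.6 million joules
print(f"{daily_kwh:,.0f} kWh per day")  # ~340,000 kWh per day
```

That is roughly a quarter of GPT-3’s entire training energy spent every single day, on inference alone.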
And all of this covers only text-focused models. Models that generate video, such as OpenAI’s Sora or Google’s Lumiere, require significantly more computational power: a 5-second video from CogVideoX takes 3.4 million joules of energy, and users often regenerate such videos repeatedly before deeming one sufficient. By 2028 (in just 3 years!), AI’s energy demand is projected to be equivalent to powering 22% of US households, and by 2030, data centres are predicted to produce 2.5 billion tons of greenhouse gases.
Despite these grim statistics, researchers are working on ways to curb AI’s energy consumption. Several techniques help do so, one of which is prompt caching (built, of course, on natural language processing).
Prompt caching is all about recycling prompts. When you send the model a prompt, it converts that text into numerical form through tokenization. This step, along with computing the relationships between words, requires computational power. Prompt caching stores these computations in a cache (a fast-access data store) so that identical or overlapping prompts don’t have to be reprocessed. This not only reduces latency (the delay between prompt and answer) but also saves the LLM energy. The technique already manifests in chatbots: conversation history is cached so that even an obscure reference to an earlier topic can be understood by the model without recomputing everything.
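Here is a deliberately simplified sketch of the idea in Python. Real inference servers cache per-token attention states (so-called KV caches) rather than finished results, and `expensive_encode` below is just a placeholder for tokenization plus model computation, but the memoization principle is the same:

```python
import hashlib

# Simplified prompt cache: memoize the expensive prompt-processing step,
# keyed by the prompt text, so repeats skip recomputation.
cache = {}

def expensive_encode(prompt):
    """Placeholder for tokenization + transformer processing."""
    return [ord(c) for c in prompt]

def encode_with_cache(prompt):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in cache:
        cache[key] = expensive_encode(prompt)  # pay the cost once
    return cache[key]  # identical prompts are served from the cache

encode_with_cache("Summarize our conversation so far.")  # computed
encode_with_cache("Summarize our conversation so far.")  # cache hit: no recompute
```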
With the benefits of sustainability come trade-offs in security and performance. In the case of prompt caching, there is the glaring risk of data leakage and privacy violations from cyberattacks. With other techniques like model distillation, where a larger “teacher” model helps train a smaller “student” model, research has been done on energy-efficient distillation processes, which are not only more sustainable but also reduce computational costs. Nonetheless, the student model risks performance degradation, and the distillation process itself can be complex.
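For the curious, here is a minimal sketch of the classic distillation loss in PyTorch; the logits are random placeholders standing in for real teacher and student outputs:

```python
import torch
import torch.nn.functional as F

# Knowledge-distillation loss: the student learns to match the teacher's
# softened output distribution rather than hard labels.
T = 2.0  # temperature: higher T softens the distributions
teacher_logits = torch.randn(8, 32000)  # batch of 8, toy vocabulary of 32k
student_logits = torch.randn(8, 32000, requires_grad=True)

soft_targets = F.softmax(teacher_logits / T, dim=-1)
log_student = F.log_softmax(student_logits / T, dim=-1)

# KL divergence between teacher and student, scaled by T^2 as in the
# original distillation formulation.
distill_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * T**2
distill_loss.backward()  # gradients update only the (smaller) student
```

The payoff is that, after training, only the small student runs at inference time, cutting the per-query energy cost.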
Recently, an arguably more efficient energy-conservation technique has gained prominence: Mixture of Experts (MoE). First proposed in 1991 by Geoffrey Hinton and his collaborators, the framework is clever: a problem is divided into components, and multiple “expert” subnetworks are trained to handle different parts of it. A gating network (router) determines which experts are suited to which inputs. Computationally, such an inherently sparse model reduces costs because only a fraction of the parameters are activated for any given input. MoEs can also scale to large models with billions of parameters (GPT-4 is widely reported to use the approach), which simultaneously enhances performance.
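A toy MoE layer makes the sparsity visible. This PyTorch sketch, with made-up dimensions, routes each input to only 2 of 4 experts; production routers add refinements such as renormalizing the selected gate weights and load balancing:

```python
import torch
import torch.nn as nn

# Minimal mixture-of-experts layer: a gating network scores the experts,
# and only the top-k experts actually run for each input.
class TinyMoE(nn.Module):
    def __init__(self, dim=16, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # the "router"
        self.k = k

    def forward(self, x):
        # Score all experts, then keep only the k best per input.
        weights, idx = self.gate(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):  # run only the selected experts
            for i, expert_id in enumerate(idx[:, slot]):
                out[i] += weights[i, slot] * self.experts[int(expert_id)](x[i])
        return out

y = TinyMoE()(torch.randn(3, 16))  # 3 inputs, each routed to 2 of 4 experts
```

Because the other experts stay idle for each input, compute (and thus energy) scales with the experts used per token rather than with the model’s total parameter count.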
As AI becomes increasingly integrated into our daily lives, we must consider not only the social and ethical consequences of LLMs, but also their environmental footprint. While innovation has made LLMs faster and more accessible, it has also come at a cost to sustainability. Thankfully, the field is evolving. We must balance privacy, performance, and efficiency to allow AI and our planet to coexist.
References
Liu, J., Tang, P., Wang, W., Ren, Y., Hou, X., Heng, P.-A., Guo, M., & Li, C. (2024). A Survey on Inference Optimization Techniques for Mixture of Experts Models. arXiv. https://arxiv.org/html/2412.14219v2
Masanet, E., Shehabi, A., Lei, N., Smith, S., & Koomey, J. (2020). Recalibrating global data center energy-use estimates. Science, 367(6481), 984–986. https://doi.org/10.1126/science.aba3758
Mehta, S. (2024, July 3). How Much Energy Do LLMs Consume? Unveiling the Power Behind AI. Association of Data Scientists. https://adasci.org/how-much-energy-do-llms-consume-unveiling-the-power-behind-ai/
Nvidia’s H100 microchips projected to surpass energy consumption of entire nations. (n.d.). Electronic Specifier. https://www.electronicspecifier.com/news/analysis/nvidia-s-h100-microchips-projected-to-surpass-energy-consumption-of-entire-nations
O’Donnell, J., & Crownhart, C. (2025, May 20). We did the math on AI’s energy footprint. Here’s the story you haven’t heard. MIT Technology Review. https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/
Skiborowski, M. (2023). Synthesis and design methods for energy-efficient distillation processes. Current Opinion in Chemical Engineering, 42, 100985. https://doi.org/10.1016/j.coche.2023.100985