What Is an LLM? Everything You Need to Know About Large Language Models

Published March 14, 2026 by Joel Thyberg

LLM, which stands for Large Language Model, is one of the most frequently mentioned terms in today's technology landscape. But what is it actually, and what separates a large language model from every other program that handles text?

The short version: an LLM is a neural network trained on enormous amounts of text. It has learned to predict the next word. And that single task, executed at sufficient scale, is enough to produce systems that can reason, answer questions, and solve complex problems. It is one of the most counterintuitive insights in modern AI, and it is a good place to start.

Short answer

An LLM is a neural network trained on text at massive scale, with the ability to understand and generate language in a way that can resemble human reasoning.

Why it matters

LLMs are the foundation of most AI systems being built today, from chatbots and document analysis to autonomous AI agents and coding tools.

Common misconception

That an LLM "knows" things in the same way as a search engine. A language model generates the likely next token. It does not look up facts.

What Is a Language Model and an LLM?

The term "language model" describes a system that models the probability distribution over text: given a sequence of words, how likely is it that the next word is X? That sounds simple, and in its basic form it is.
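The basic form really is simple enough to fit in a few lines. A toy sketch: count, for each word in a tiny corpus, which word follows it, and turn the counts into probabilities. (Real LLMs learn these probabilities with neural networks over tokens, not word counts; the corpus here is made up.)

```python
from collections import Counter, defaultdict

# A tiny corpus: count, for each word, which word follows it.
corpus = "the cat sat on the mat and the cat slept".split()
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(word):
    """P(next word | current word), estimated from raw counts."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))   # 'cat' is twice as likely as 'mat'
```

Everything that separates this from GPT-4o is scale and architecture, not the underlying question being asked.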

A large language model (LLM) is a language model in a completely different size class. Billions of parameters. Trained on hundreds of billions or trillions of tokens gathered from the internet, books, code, and scientific papers. Training requires thousands of GPUs for months and costs millions of dollars.

It is the scale, not the principle, that creates the capabilities. And that is also why they are called large language models.

What Do 7B, 70B, and Parameters Mean in Practice?

When people talk about model size, they often mean the number of parameters. The parameters are the weights the model learned during training. Once training is finished, those values are largely fixed, and inference, meaning actual use of the model, consists of running the input through all layers with those weights and computing the next token step by step.

That means every answer is, in practice, the result of a very large number of matrix operations. For each new token, the model has to read its context, run it through the network, and calculate the probability of each possible next token. That is why inference is so computationally heavy even when the model is already fully trained.

This is where hardware becomes decisive. A CPU can run a language model, but it is much worse at the massive parallelism these calculations require. A GPU or other accelerator is built to perform many such operations at the same time, which makes inference dramatically faster. That is why modern language models are almost always run on GPUs or similar specialized hardware.

Memory is the other major constraint. For inference to be fast, you want the model weights to fit in fast memory on the accelerator, often VRAM or HBM. Normal consumer cards often sit somewhere around 8 to 24 GB of VRAM, which makes them useful for smaller open-weight models or quantized models that have been compressed. Larger models, by contrast, require much more memory, sometimes several GPUs at once or different forms of offloading that make execution slower.
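A rough back-of-the-envelope calculation shows why: each parameter is stored as a number, so the precision you store it at translates directly into gigabytes. (A sketch that ignores activation memory and the KV cache, which come on top of the weights.)

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate memory for the weights alone, in GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# A 7B model in 16-bit precision vs. 4-bit quantization:
print(weight_memory_gb(7, 2))    # ~14 GB: tight even on a 16 GB card
print(weight_memory_gb(7, 0.5))  # ~3.5 GB: fits comfortably in 8 GB
print(weight_memory_gb(70, 2))   # ~140 GB: multiple GPUs or offloading
```

This is the whole appeal of quantization: the same 7B model that barely fits on a consumer card at 16-bit precision runs easily at 4 bits, at some cost in quality.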

This is also where it becomes important to be precise about what model size actually means. On the open-weight side, you often see names like 7B, 13B, 70B, or 405B, meaning roughly that many billions of parameters. On the closed-source side, for example ChatGPT, Gemini, and Claude, the parameter count is often not public. Some modern models also use mixture-of-experts, which means the total parameter count does not always tell the whole story about how heavy the model is to run or how much is activated per token.

That means what you experience as "ChatGPT" on your computer is normally not running on your own machine at all. When you write to ChatGPT, Gemini, or Claude, your request is sent to the provider's servers, the model runs there on large clusters of accelerators, and the answer is then sent back to you. That is an important practical difference from smaller open-weight models, which can actually be run locally if they fit within the hardware you have available.

Three Types of Tasks

Not all language models do the same thing. There are three fundamental types of tasks that models can be built for:

Classification

One label per text

The model takes a text and assigns a label. Positive or negative? Which topic? Which intent? One answer, not a chain of reasoning.

BERT · DistilBERT · RoBERTa

Tagging and extraction

Labels on parts of the text

The model marks parts of the text or assigns several labels where needed. That can mean parts of speech, names of people and companies, or other structured information that should be extracted.

BERT variants · RoBERTa · token classification

Generation

New text as output

The model produces new text: answers, summaries, code, and analyses. This is what most people mean by "AI" today. It is the focus of the rest of this article.

GPT-4o · Claude · Gemini · Llama

Classification and tagging are powerful for specific, bounded tasks. But generation is what drives most modern business applications, and that is what we focus on from here.

The Transformer Family: Three Architectures

Almost all modern language models build on the same basic architecture: the Transformer, introduced in the paper "Attention is All You Need" (2017). But the transformer architecture is used in three different ways, with clearly different strengths:

Encoder-only

The BERT family

Reads the entire text at once and builds a rich representation. Excellent for classification, named entity recognition, and semantic search. Cannot generate text.

BERT · DistilBERT · RoBERTa

Encoder-decoder

The T5 family

Encodes the input and decodes it into output. Good for tasks with a clear input-output structure: translation, summarization, and structured transformation of text.

T5 · mT5 · BART

Decoder-only

The GPT family

Generates text token by token. This architecture has scaled exceptionally well: more parameters and more data often lead to stronger capabilities. It is the architecture behind today's AI wave.

GPT-4o · Claude · Gemini · Llama

Why did decoder-only have such a breakthrough? Scale. Each new generation of GPT models has reinforced that more parameters and more training data often lead to stronger capabilities, a pattern that has been especially clear in the decoder-only family.

How Is an LLM Trained?

Pretraining at Scale

The foundation of an LLM is pretraining: the model is exposed to enormous amounts of text and learns to predict the next token. Not a specific answer. Not a specific task. Just: given everything that came before, what is most likely to come next?

This happens over trillions of tokens, from books, web pages, code, scientific papers, and discussion forums. Training takes weeks on thousands of specialized GPUs and costs millions of dollars. That is why only a small number of labs in the world train frontier-class models.

Attention, Why Transformers Changed Everything

Earlier architectures, such as RNNs and LSTMs, processed text sequentially, one word at a time, with limited backward memory. Long texts were a problem. The model "forgot" early parts.

The transformer solved this with self-attention: for each token, the model computes how relevant every other token in the whole sequence is. Not only the neighbors, but all of them. The word "it" at the end of a sentence can connect directly to the subject twenty words earlier. A reference in one paragraph can connect to the definition in another.

That is what allows modern models to handle contexts of hundreds of thousands of tokens and understand complex dependencies in long documents. The title "Attention is All You Need" was a provocative claim in 2017, and it turned out to be right.
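The core computation is small enough to write out. A minimal sketch of scaled dot-product self-attention in plain Python: every token scores every other token by similarity, and the output for each token is a weighted mix of all of them. (Real models use learned query/key/value projections and optimized tensor libraries; the vectors here are made up.)

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Each token attends to every token, weighted by dot-product similarity.
    tokens: list of vectors (standing in for query, key, and value at once)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token in the sequence, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)   # how strongly this token attends to each other token
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

# Three toy token vectors; the first and third are similar, so they attend
# strongly to each other regardless of the token in between.
print(self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]))
```

Note that nothing in the loop depends on distance: the first and last token are scored against each other exactly like neighbors, which is what the RNN could not do.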

Self-attention: each token can connect directly to all others regardless of distance. In the example phrase "cat sat on mat", line thickness shows how strongly "mat" attends to each token, including "cat" early in the sentence.

More Than Predicting Tokens: Emergent Knowledge

What makes LLMs fascinating, and sometimes difficult to understand, is that no capabilities are programmed in explicitly. The model is trained only on next-token prediction. Yet the following emerge:

  • factual knowledge about the world
  • logical inference and reasoning
  • the ability to write and debug code
  • understanding of intentions and context
  • multilingual translation

This is called emergent capability: capacities that arise from large-scale training without being trained directly. It explains why an LLM can answer questions it has never "seen the answer to". It also explains why it sometimes generates plausible but incorrect answers. The model has an internal model of the world, but it does not look up facts. It generates what is statistically the next likely token.

Instruction Tuning: From Raw Model to AI Assistant

A pretrained model does not know how to have a conversation. It continues text, nothing more. To turn it into an assistant that answers questions and follows instructions, instruction tuning is used: an extra training step with human-annotated examples of good and bad answers, often combined with RLHF, reinforcement learning from human feedback.

That is the step that turns a raw GPT model into a product like ChatGPT. If you want to adapt the behavior further for your own business, the next step is finetuning.


How Does an LLM Generate Text?

Generation happens token by token. The model takes all preceding text, your question plus everything it has written so far, and calculates the probability of each possible next token. It picks one, with some randomness controlled by temperature, and adds it to the sequence. Then it starts again. Token by token until the answer is complete.
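The loop itself is simple; all the intelligence lives in the probability computation. A sketch with a stand-in model (a function returning made-up token probabilities) that shows how temperature reshapes the distribution before a token is sampled:

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: low temperature sharpens it
    (the top token dominates), high temperature flattens it."""
    logits = [math.log(p) / temperature for p in probs.values()]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return dict(zip(probs.keys(), [e / s for e in exps]))

def generate(model, prompt, max_tokens, temperature=0.7):
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = apply_temperature(model(tokens), temperature)
        choice = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(choice)   # append and repeat: autoregression
    return tokens

# Stand-in "model": always proposes the same next-token distribution.
toy_model = lambda tokens: {"powerful": 0.6, "popular": 0.3, "slow": 0.1}
print(generate(toy_model, ["Transformers", "are"], max_tokens=3))
```

At temperature near zero the top token is picked almost every time; at high temperature the model becomes more varied and more error-prone.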

Autoregressive generation: the context (all tokens so far, e.g. "Transformers are") runs through the LLM, which computes probabilities for the next token; one token (e.g. "powerful") is sampled, added to the context, and the loop repeats until the answer is complete.

That has three practical consequences worth knowing:

  • Longer input costs more. The model processes the entire context for every new token it generates.
  • The context window is a hard limit. If too much is included, the oldest material falls away.
  • Hallucination is built in. The model always picks a likely next token, even when the right answer would be "I do not know".

What Is a Token?

A token is not a word. It is a piece of text, sometimes a word, sometimes part of a word, sometimes punctuation or whitespace. How text is split into tokens directly affects cost, quality, and how the model interprets your input. If you want to go deeper into why models use tokens instead of words or raw binary data, read our deep dive on the tokenizer.
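A toy illustration of the principle (real tokenizers such as cl100k_base use byte-pair encoding with vocabularies of roughly 100,000 entries learned from data; the vocabulary here is hand-picked):

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization: at each position, take the longest
    vocabulary entry that matches, falling back to single characters.
    Real BPE works differently but produces the same kind of subword pieces."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

vocab = {"the", " ", "token", "izer", "ization"}
print(tokenize("the tokenizer", vocab))   # ['the', ' ', 'token', 'izer']
```

Common words survive as single tokens while rarer words split into pieces, which is why token counts, and therefore costs, vary so much between languages and writing styles.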

Interactive tokenizer: the Swedish example sentence "AI läser inte ord som vi gör. Den läser tokens." ("AI does not read words the way we do. It reads tokens.") is 47 characters and 10 words, but splits into 14 tokens under the cl100k_base encoding (via gpt-tokenizer). Type any text and see how it gets split into tokens, and what that means for cost and processing.

The Context Window

Everything the model can "see" when generating an answer has to fit inside the context window. Your instruction, the conversation history, and loaded documents compete for the same space. Anything that does not fit cannot be taken into account by the model. If you want to go deeper into which tokens actually count, why they cost money, and why context engineering has become so important, read our deep dive on the context window.
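In practice this means every serious system does some form of context management. A minimal sketch: keep the system prompt and drop the oldest conversation turns until everything fits in a token budget. (Token counts are approximated by word counts here; a real system would use the model's own tokenizer.)

```python
def fit_context(system_prompt, history, budget):
    """Keep the system prompt plus the most recent turns that fit.
    Each turn is a string; 'tokens' are approximated as whitespace words."""
    count = lambda text: len(text.split())
    used = count(system_prompt)
    kept = []
    for turn in reversed(history):          # newest first
        if used + count(turn) > budget:
            break                           # the oldest material falls away
        kept.append(turn)
        used += count(turn)
    return [system_prompt] + list(reversed(kept))

history = ["first question and answer " * 10,
           "second exchange " * 5,
           "latest question"]
print(fit_context("You are a helpful assistant.", history, budget=30))
```

Real systems are more sophisticated, summarizing old turns instead of dropping them, but the constraint they are working around is exactly this one.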

The context window is the model's working memory. Understand what fits, and what falls out.

How Do You Use an LLM in Practice?

An LLM is rarely a finished system by itself. It is the foundation that other solutions are built on. There are several main ways to work with an LLM in a business context:

Give the model your data

RAG, Retrieval-Augmented Generation

Instead of retraining the model on your documents, relevant information is retrieved for every question. The model answers based on current, correct data, not on training weights from a year ago.
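A minimal sketch of the retrieval step, scoring documents by word overlap with the question. (Real systems use embedding vectors and a vector database, but the shape of the flow is the same; the document texts here are made up.)

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many question words they share."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

documents = [
    "Our refund policy allows returns within 30 days.",
    "The office is open Monday to Friday.",
    "Shipping takes 3 to 5 business days.",
]
context = retrieve("what is your refund policy for returns", documents)
prompt = (f"Answer using this context: {context[0]}\n\n"
          "Question: what is your refund policy for returns")
print(prompt)   # the model now answers from retrieved data, not training weights
```

The retrieved text is simply placed in the prompt, which is also why RAG competes for the same context window as everything else.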

Read more about RAG ->

Automate flows

AI agents

The model gets tools and can act, search for information, run code, and write to systems. Instead of answering a question, it performs a task step by step with limited human involvement.
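The control flow can be sketched as a loop: the model's output either names a tool to call or gives a final answer, and tool results are fed back in. (The model here is a stand-in function and the text-based tool protocol is made up; real agent frameworks use structured tool-calling APIs.)

```python
# Toy tools the "agent" may call.
tools = {
    "search": lambda query: f"search results for {query!r}",
    "calculate": lambda expr: str(eval(expr)),  # toy only: never eval untrusted input
}

def run_agent(model, task, max_steps=5):
    """Loop: ask the model what to do, execute tools, feed results back."""
    transcript = [task]
    for _ in range(max_steps):
        action = model(transcript)   # e.g. "TOOL search GPU prices" or "ANSWER ..."
        if action.startswith("ANSWER "):
            return action[len("ANSWER "):]
        _, tool_name, arg = action.split(" ", 2)
        transcript.append(tools[tool_name](arg))  # model sees the result next step
    return "gave up"

# Stand-in "model": search first, then answer based on what came back.
def scripted_model(transcript):
    if len(transcript) == 1:
        return "TOOL search GPU prices"
    return f"ANSWER based on: {transcript[-1]}"

print(run_agent(scripted_model, "Find current GPU prices"))
```

The `max_steps` cap and the limited tool registry are where the "limited human involvement" gets its safety boundaries.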

Read more about AI agents ->

Specialize the model

Finetuning

When prompting and RAG are not enough, the model can be further trained on your own data so it consistently follows a specific behavior, tone, or format, permanently built into the model.

Read more about finetuning ->

Understand cost and capacity

Tokenizer and context window

Cost, capacity, and quality in an LLM system are tightly connected to how tokenization and context windows work. Central for anyone building or evaluating AI systems.

Read more about the tokenizer ->
Read more about the context window ->

Summary

An LLM is trained to predict the next token, and from that single task, executed at enough scale, capabilities emerge that nobody programmed in explicitly. The transformer architecture and self-attention are the technical breakthroughs that made this possible.

For practical use, three things are central:

  • an LLM generates likely text. It does not look up facts, which makes hallucination an inherent trait to design around.
  • everything that does not fit in the context window cannot be taken into account by the model. That shapes the architecture of every serious system.
  • more tokens cost more. Tokenization directly affects what a system costs to run in production.

The rest is about building the right layers around the model: RAG for current data, agents for automation, finetuning for specialization.