A large language model is a statistical engine that predicts the next token in a sequence, where a token is roughly a short fragment of text. The model is trained on public and licensed text using a transformer architecture, which is a specific neural-network design that uses attention mechanisms to weight how much each prior token should influence the prediction of the next one. Billions of parameters encode the patterns learned during training, and a softmax layer at the end converts those parameter weights into a probability distribution over the next possible token. That is the mechanism, stripped of marketing.
Three properties matter for security practitioners. I find these are the properties that change how someone thinks about AI once they sit with them.
First, the model has no inherent notion of truth, safety, or intent. Every constraint on its behavior comes from post-training alignment work, which is a separate and fragile process that adjusts the model's tendency to produce certain outputs. That alignment can be undone with the right inputs, which is what jailbreaks exploit. Treat the model as a powerful text completion engine that has been taught to behave, not as a reasoning system that has values.
Second, everything inside the context window is treated as trustworthy input by default. The context window is the span of text the model is currently attending to, including the system prompt, the user's message, prior assistant turns, retrieved documents, and tool outputs. The model does not know which of those came from you and which came from an adversary. This is the root of prompt injection. If an adversary can influence anything the model sees, the adversary can influence what the model produces.
Third, the model cannot reliably inspect its own weights or explain its own behavior. When you ask a model why it produced a particular output, the answer it gives is another prediction, not an introspection. That has deep implications for how we debug AI security failures. The empirical approach is the only approach.
Once those three properties are internalized, most of the AI security threat map follows. You do not need to understand gradient descent or backpropagation to secure AI systems. You need to understand that the system behaves according to statistical patterns in its training and whatever is in its current context, and both of those surfaces are reachable by someone who wants to manipulate it.
A practical implication for the next time you evaluate an AI security tool. Ask the vendor what inputs the model sees during inference. Ask what prompt template they use, and whether customers can modify it. Ask what they log. Ask what happens when the model fails. The answers to those four questions tell you most of what you need to know about whether the tool is safe to deploy.
Key takeaways
- LLMs predict tokens. They do not reason or verify facts.
- Alignment reduces bad output probabilistically. It does not eliminate it.
- The context window is one trust boundary. Adversary-influenceable inputs are the risk.
- Ask vendors about inputs, prompt template, logging, and failure modes. Those four answers are most of the signal.
Sources
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1706.03762
- OWASP Foundation (2025). OWASP Top 10 for LLM Applications. OWASP Project. https://owasp.org/www-project-top-10-for-large-language-model-applications/