6 changes: 5 additions & 1 deletion docs/getting-started/index.mdx
@@ -4,6 +4,8 @@ sidebar_custom_props:
icon: /img/cpu.svg
---

import Newsletter from '@site/src/components/Newsletter';

# Getting started

Before you can run an LLM in production, you first need to make a few key decisions. These early choices will shape your infrastructure needs, costs, and how well the model performs for your use case.
@@ -12,4 +14,6 @@ Before you can run an LLM in production, you first need to make a few key decisions.
import DocCardList from '@theme/DocCardList';

<DocCardList />
```

<Newsletter />
6 changes: 5 additions & 1 deletion docs/inference-optimization/index.mdx
@@ -4,6 +4,8 @@ sidebar_custom_props:
icon: /img/speed.svg
---

import Newsletter from '@site/src/components/Newsletter';

# Inference optimization

Running an LLM is just the starting point. Making it fast, efficient, and scalable is where inference optimization comes into play. Whether you're building a chatbot, an agent, or any LLM-powered tool, inference performance directly impacts both user experience and operational cost.
@@ -14,4 +16,6 @@ If you're using a serverless endpoint (e.g., OpenAI API), much of this work is a
import DocCardList from '@theme/DocCardList';

<DocCardList />
```

<Newsletter />
5 changes: 4 additions & 1 deletion docs/inference-optimization/llm-inference-metrics.md
@@ -11,6 +11,7 @@ keywords:

import LinkList from '@site/src/components/LinkList';
import Button from '@site/src/components/Button';
import Newsletter from '@site/src/components/Newsletter';

# Key metrics for LLM inference

@@ -176,4 +177,6 @@ Using a serverless API can abstract away these optimizations, leaving you with l
* [Mastering LLM Techniques: Inference Optimization](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)
* [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/pdf/2411.00136)
* [Throughput is Not All You Need](https://hao-ai-lab.github.io/blogs/distserve/)
</LinkList>

<Newsletter />
6 changes: 5 additions & 1 deletion docs/infrastructure-and-operations/index.mdx
@@ -4,6 +4,8 @@ sidebar_custom_props:
icon: /img/setting.svg
---

import Newsletter from '@site/src/components/Newsletter';

# Infrastructure and operations

LLMs don't run in isolation. They need robust infrastructure behind them, from high-performance GPUs to deployment automation and comprehensive observability. A strong model and solid inference optimization determine how well your application performs. But it’s your infrastructure platform and inference operation practices that determine how far you can scale and how reliably you can grow.
@@ -12,4 +14,6 @@ LLMs don't run in isolation. They need robust infrastructure behind them, from h
import DocCardList from '@theme/DocCardList';

<DocCardList />
```

<Newsletter />
5 changes: 4 additions & 1 deletion docs/introduction.md
@@ -12,6 +12,7 @@ keywords:
---

import Features from '@site/src/components/Features';
import Newsletter from '@site/src/components/Newsletter';

# LLM Inference Handbook

@@ -44,4 +45,6 @@ You can read it start-to-finish or treat it like a lookup table. There’s no wr

## Contributing

We welcome contributions! If you spot an error, have suggestions for improvements, or want to add new topics, please open an issue or submit a pull request on our [GitHub repository](https://github.com/bentoml/llm-inference-handbook).

<Newsletter />
6 changes: 5 additions & 1 deletion docs/llm-inference-basics/index.mdx
@@ -5,6 +5,8 @@ sidebar_custom_props:
collapsed: false
---

import Newsletter from '@site/src/components/Newsletter';

# LLM inference basics

LLM inference is where models meet the real world. It powers everything from instant chat replies to code generation, and directly impacts latency, cost, and user experience. Understanding how inference works is the first step toward building smarter, faster, and more reliable AI applications.
@@ -13,4 +15,6 @@ LLM inference is where models meet the real world. It powers everything from ins
import DocCardList from '@theme/DocCardList';

<DocCardList />
```

<Newsletter />
6 changes: 5 additions & 1 deletion docs/llm-inference-basics/what-is-llm-inference.md
@@ -7,6 +7,8 @@ keywords:
- LLM inference, AI inference, inference layer
---

import Newsletter from '@site/src/components/Newsletter';

# What is LLM inference?

LLM inference refers to using trained LLMs, such as GPT-4, Llama 4, and DeepSeek-V3, to generate meaningful outputs from user inputs, typically provided as natural language prompts. During inference, the model processes the prompt through its vast set of parameters to generate responses like text, code snippets, summaries, and translations.
@@ -69,4 +71,6 @@ Understanding LLM inference early gives you a clear edge. It helps you make smar
- **If you're a technical leader**: Inference efficiency directly affects your bottom line. A poorly optimized setup can cost 10× more in GPU hours while delivering worse performance. Understanding inference helps you evaluate vendors, make build-vs-buy decisions, and set realistic performance goals for your team.
- **If you're just curious about AI**: Inference is where the magic happens. Knowing how it works helps you separate AI hype from reality and makes you a more informed consumer and contributor to AI discussions.

For more information, see [serverless vs. self-hosted LLM inference](./serverless-vs-self-hosted-llm-inference).

<Newsletter />
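
---

The `Newsletter` component itself is not part of this diff; every changed file only imports it from `@site/src/components/Newsletter` and renders `<Newsletter />` at the end of the page. For readers unfamiliar with this Docusaurus pattern, below is a minimal, hypothetical sketch of what such a component could look like. The file path, props, and the `/api/subscribe` endpoint are assumptions for illustration, not the actual implementation in the repository.

```tsx
// src/components/Newsletter/index.tsx
// Hypothetical sketch only — the real component is not shown in this PR.
import React, {useState} from 'react';

export default function Newsletter(): JSX.Element {
  const [email, setEmail] = useState('');
  const [submitted, setSubmitted] = useState(false);

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault();
    // Placeholder endpoint: the actual signup backend is an assumption.
    await fetch('/api/subscribe', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({email}),
    });
    setSubmitted(true);
  };

  if (submitted) {
    return <p>Thanks for subscribing!</p>;
  }

  return (
    <form onSubmit={handleSubmit}>
      <label htmlFor="newsletter-email">Subscribe for updates</label>
      <input
        id="newsletter-email"
        type="email"
        value={email}
        onChange={(e) => setEmail(e.target.value)}
        required
      />
      <button type="submit">Subscribe</button>
    </form>
  );
}
```

Because the component is imported with the `@site` alias and rendered as the last element of each MDX page, adding it to a new doc only requires the same two lines this PR adds: the import in the frontmatter section and `<Newsletter />` at the bottom.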