Studios pivot to on-device LLM NPCs

Seeing more announcements of in-game dialogue running locally via 7B distilled models, targeting sub-60 ms turns on RTX 40-series and high-end mobile NPUs. From an AI behavior standpoint this changes loop design — shorter contexts, stricter tool contracts, and deterministic state handoffs — plus it finally pencils out on cost per session. Anyone here shipping this yet, and how are you enforcing style and guardrails without tanking latency?


We shaved turn latency by prefilling the KV cache once for the NPC persona + tool contracts and reusing that prefix across turns; on a 4090 it dropped to about 35 ms with a 7B Q5_K_M in llama.cpp (ggml-org/llama.cpp) and kept voices consistent. Caveat: invalidate that cache any time the tool schema changes or you'll see odd refusals and stuck "function-call" loops.
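For anyone wanting to wire up the invalidation part, here is a rough sketch of one way to do it: fingerprint everything baked into the cached prefix (persona text plus tool schema) and rebuild whenever the fingerprint changes. This is not our production code and it does not touch llama.cpp directly; `PrefixCache`, `prefix_fingerprint`, and the `build_state` callback are hypothetical names, and `build_state` stands in for whatever your runtime does to run the prefill and hand back a reusable state.

```python
import hashlib
import json
from typing import Any, Callable, Optional


def prefix_fingerprint(persona: str, tool_schema: dict[str, Any]) -> str:
    """Stable hash of everything that is baked into the cached prefix."""
    # sort_keys makes the hash independent of dict key ordering.
    canonical = json.dumps(
        {"persona": persona, "tools": tool_schema},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PrefixCache:
    """Holds one prefilled state and rebuilds it when the fingerprint changes."""

    def __init__(self, build_state: Callable[[str, dict[str, Any]], Any]) -> None:
        # Hypothetical callback: runs the prefill and returns an opaque state.
        self._build_state = build_state
        self._fingerprint: Optional[str] = None
        self._state: Any = None
        self.rebuilds = 0  # counter for logging/metrics

    def get(self, persona: str, tool_schema: dict[str, Any]) -> Any:
        fingerprint = prefix_fingerprint(persona, tool_schema)
        if fingerprint != self._fingerprint:
            # Persona or schema changed: the old prefix is stale, so rebuild.
            self._state = self._build_state(persona, tool_schema)
            self._fingerprint = fingerprint
            self.rebuilds += 1
        return self._state
```

Call `cache.get(persona, tool_schema)` at the start of each turn; it only pays the prefill cost when the persona or schema actually changed. Hashing the canonical JSON rather than comparing object identity means a hot-reloaded but identical schema does not trigger a rebuild.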
