# Enforcing Deterministic JSON Outputs in Local LLMs: Analyzing llama.cpp Release b9590

> A critical patch for LFM2 and LFM2.5 models restores schema constraints, enabling reliable agentic workflows on edge devices.

**Published:** June 10, 2026
**Author:** PSEEDR Editorial
**Category:** edge
**Content tier:** free
**Accessible for free:** true
**Editorial format:** analysis
**News quality eligible:** true
**Source count:** 1
**Word count:** 896


**Tags:** llama.cpp, JSON Schema, Edge AI, LFM2, Local LLMs, Open Source

**Canonical URL:** https://pseedr.com/edge/enforcing-deterministic-json-outputs-in-local-llms-analyzing-llamacpp-release-b9

---

The recent [llama.cpp b9590 release](https://github.com/ggml-org/llama.cpp/releases/tag/b9590) addresses a critical gap in local agentic workflows by fixing JSON schema enforcement for LFM2 and LFM2.5 models. By correcting a chat template handler that previously ignored schema constraints, this update underscores the growing engineering priority of bringing deterministic, cloud-grade structured outputs to resource-constrained edge environments.

## The Mechanics of the Schema Omission

In the evolving landscape of local large language models (LLMs), the ability to constrain outputs to a specific format is a strict requirement for programmatic integration. Prior to this release, developers utilizing LFM2 and LFM2.5 models within the llama.cpp ecosystem encountered a silent failure in output formatting. According to the release notes, the specialized chat template handler for these models was exclusively configured to build grammars for tool-calling. Tool-calling grammars are typically narrow, designed only to output specific function names and their immediate arguments.

Consequently, when developers passed a broader `json_schema` parameter within the `response_format` object-intending to enforce a custom, user-defined data structure-the handler ignored the instruction entirely. Pull Request #24377 directly targets this omission. By patching the chat template handler, the inference engine now correctly parses the requested JSON schema and translates it into a grammar that constrains the model's token generation. This ensures that the generated text strictly adheres to the defined JSON structure, rather than relying on the model's probabilistic adherence to prompt-based formatting instructions.

## Implications for Local Agentic Workflows

The significance of PR #24377 extends beyond a simple bug fix; it represents a necessary maturation of the local inference stack. In cloud-based LLM APIs, structured output guarantees have become a standard feature, allowing developers to safely route model responses directly into databases, APIs, or deterministic code execution paths. Replicating this reliability on local hardware is challenging but essential for privacy-preserving or offline agentic workflows.

Prompt engineering alone is insufficient for production-grade data extraction. Even highly capable models will occasionally inject conversational filler, omit required keys, or violate data types. By enforcing JSON schemas at the inference level via grammar-based sampling, llama.cpp allows developers to build robust pipelines around LFM2 and LFM2.5 models. The engine evaluates the allowed tokens at each generation step, physically preventing the model from outputting invalid syntax. This deterministic behavior eliminates the need for complex, error-prone post-processing or retry logic that is typically required when an LLM hallucinates a malformed JSON key or forgets a closing bracket. For edge devices operating autonomously without human oversight, this reliability is a foundational requirement.

## Hardware Matrix and Build Limitations

The b9590 release continues llama.cpp's strategy of broad hardware support, shipping with an extensive matrix of pre-built binaries. These include configurations for macOS (Apple Silicon and Intel), Linux (Vulkan, ROCm 7.2, OpenVINO), Windows (CUDA 12/13, Vulkan), Android, and openEuler. This wide distribution ensures that developers can deploy structured-output-capable models across highly heterogeneous edge and server environments, from high-end NVIDIA GPU clusters to low-power ARM CPUs.

However, the release notes explicitly mark several platform configurations as disabled. Notably, macOS Apple Silicon builds with KleidiAI enabled, Windows x64 builds utilizing SYCL, and specific openEuler configurations (such as those relying on ACL Graph) are currently unsupported in this tag. The explicit disabling of these builds highlights the ongoing engineering friction involved in maintaining a unified C++ inference engine across rapidly diverging hardware acceleration frameworks. For enterprise teams relying on SYCL for Intel GPUs or KleidiAI for optimized ARM execution, this release introduces a temporary deployment blocker, forcing a choice between schema enforcement and optimal hardware acceleration.

## Open Questions and Edge Trade-offs

While the restoration of JSON schema constraints is a critical improvement, several technical questions remain unaddressed in the release documentation. First, the specific architectural lineage and origin of the LFM2 and LFM2.5 models-likely referring to Liquid Foundation Models-are not detailed in the release notes, leaving ambiguity regarding the exact tokenization quirks or chat template structures that necessitated a specialized handler in the first place.

Furthermore, grammar-based JSON schema enforcement introduces notable performance overhead. Translating a complex JSON schema into a finite state machine (FSM) and subsequently masking logits at every step of token generation requires continuous CPU cycles. For resource-constrained edge devices, this overhead can significantly degrade tokens-per-second (TPS) throughput. The release does not provide benchmarks quantifying the latency penalty of this FSM evaluation for LFM2 models.

Additionally, the technical rationale behind disabling KleidiAI and SYCL for this specific build remains opaque. It is unclear whether these exclusions are due to compilation failures, unresolved bugs in the underlying acceleration libraries, or incompatibilities introduced by recent changes to the core tensor operations.

Ultimately, the b9590 release of llama.cpp delivers a highly targeted but vital capability for developers building local AI agents. By ensuring that LFM2 and LFM2.5 models respect strict JSON schemas, the project bridges the gap between raw generative capability and the deterministic reliability required by modern software engineering, even as the ecosystem continues to navigate the complexities of hardware fragmentation and inference overhead.

### Key Takeaways

*   llama.cpp release b9590 resolves a bug where the LFM2/LFM2.5 template handler ignored the json\_schema parameter.
*   The fix enables deterministic JSON outputs via grammar-based sampling, critical for local agentic workflows.
*   The release includes a wide array of pre-built binaries, though specific configurations like macOS KleidiAI and Windows SYCL are disabled.
*   Performance overhead from finite state machine (FSM) evaluation during grammar enforcement remains an unbenchmarked trade-off for edge devices.

---

## Sources

- https://github.com/ggml-org/llama.cpp/releases/tag/b9590
