Zhipu AI Releases Open-AutoGLM Framework for Autonomous Android Control

New open-source tool uses Visual Language Models to navigate complex app ecosystems without accessibility APIs

· 3 min read · PSEEDR Editorial

Zhipu AI has open-sourced Open-AutoGLM, a framework enabling autonomous control of Android devices through visual language models (VLMs). Released in December 2025, the system utilizes the AutoGLM-Phone-9B model and Android Debug Bridge (ADB) to perceive screen content and execute complex workflows across over 50 mainstream Chinese applications, signaling a significant advance in on-device agentic AI.

As the industry shifts focus from conversational AI to agentic systems capable of executing tasks, Zhipu AI has established a foothold in the mobile sector with the release of Open-AutoGLM. This open-source framework, built upon the company's AutoGLM-Phone-9B model, is designed to automate interactions on Android devices by combining visual perception with standard command-line tools. Unlike traditional automation scripts that rely on hard-coded element selectors or accessibility-service APIs, Open-AutoGLM operates by visually analyzing the device screen and issuing commands via the Android Debug Bridge (ADB).

The framework's architecture addresses the complexity of modern mobile interfaces, which often feature dynamic content that breaks conventional automation tools. By utilizing a Visual Language Model (VLM), the system can interpret graphical user interfaces (GUIs) similarly to a human user. According to the release documentation, the agent can execute a sequence of actions, such as clicking, swiping, and text input, based on natural language instructions. For instance, a user can command the system to "Open Taobao and search for wireless headphones," and the agent will autonomously navigate the app, handle search queries, and scroll through results.
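The perceive-and-act loop described above boils down to translating model decisions into ADB `input` commands. The sketch below illustrates that translation layer; the action dictionary schema is an assumption for illustration, not Open-AutoGLM's actual output format:

```python
import subprocess

def action_to_adb(action: dict) -> list[str]:
    """Translate a model-emitted action into an ADB shell command.

    The action schema here ({"type": ..., "x": ..., ...}) is illustrative;
    consult the Open-AutoGLM repository for the real action space.
    """
    kind = action["type"]
    if kind == "tap":
        return ["adb", "shell", "input", "tap", str(action["x"]), str(action["y"])]
    if kind == "swipe":
        return ["adb", "shell", "input", "swipe",
                str(action["x1"]), str(action["y1"]),
                str(action["x2"]), str(action["y2"]), str(action.get("ms", 300))]
    if kind == "text":
        # 'input text' treats a literal space as an argument break; ADB uses %s
        return ["adb", "shell", "input", "text", action["text"].replace(" ", "%s")]
    raise ValueError(f"unsupported action type: {kind}")

def execute(action: dict) -> None:
    """Run the translated command against the connected device."""
    subprocess.run(action_to_adb(action), check=True)
```

Keeping the translation step separate from execution makes the agent's decisions easy to log and replay, which matters for the debugging workflow the Web-Station dashboard supports.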

A key differentiator for Open-AutoGLM is its verified compatibility with the Chinese mobile ecosystem, which is known for its "super-app" structures and heavy reliance on mini-programs. The framework has been tested against more than 50 popular applications, including WeChat, Taobao, JD.com, Meituan, and Xiaohongshu. This broad compatibility suggests that the underlying VLM has been trained on a diverse dataset of UI interactions specific to these platforms, allowing it to handle complex workflows like ordering food or booking tickets without specialized API integrations for each service.

The release offers flexibility in deployment, catering to both researchers and application developers. The AutoGLM-Phone-9B model can be hosted locally for those prioritizing data privacy and latency, or accessed via third-party APIs for lighter-weight client implementations. The repository includes a "Web-Station" interface, providing a visual dashboard where users can monitor the agent's real-time screen view and the step-by-step execution of tasks. This transparency is crucial for debugging agentic behaviors, which are often non-deterministic.

Safety remains a primary concern for autonomous agents capable of financial transactions or social interactions. Zhipu AI has implemented a "human-in-the-loop" protocol for sensitive actions. According to the documentation, the system is designed to pause and request user confirmation before finalizing payments or sending messages. Furthermore, because the system relies on ADB, users maintain the ability to physically intervene or terminate the connection instantly if the agent behaves unexpectedly. However, the reliance on visual recognition means the system may still struggle with sudden pop-ups or UI updates that differ significantly from its training data.
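A human-in-the-loop gate of the kind described can be as simple as intercepting actions from a sensitive category before execution. The sketch below is illustrative; the action types and prompt wording are assumptions, not the framework's documented protocol:

```python
# Hypothetical set of action types requiring confirmation; the real
# framework's sensitive-action taxonomy may differ.
SENSITIVE_ACTIONS = {"pay", "transfer", "send_message"}

def confirm_gate(action: dict, ask=input) -> bool:
    """Return True if the action may proceed, pausing for user approval
    when the action type is sensitive. `ask` is injectable for testing."""
    if action["type"] not in SENSITIVE_ACTIONS:
        return True
    answer = ask(f"Agent wants to perform '{action['type']}'"
                 f" ({action.get('detail', 'no detail')}). Proceed? [y/N] ")
    return answer.strip().lower() == "y"
```

Defaulting to denial (anything other than an explicit "y" blocks the action) is the conservative choice for operations that move money or send messages on the user's behalf.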

The open-sourcing of Open-AutoGLM places Zhipu AI in direct competition with other multimodal agent frameworks. By lowering the barrier to entry for building on-device agents, the release is likely to accelerate the development of personalized assistants that go beyond simple voice commands. While currently optimized for the Chinese Android ecosystem, the underlying methodology of VLM-driven ADB control presents a scalable model for global mobile automation.
