UI-TARS Desktop: ByteDance's AI Agent That Actually Sees and Controls Your Computer
The gap between AI reasoning and AI execution has been the defining frustration of the agent era. Models can plan complex workflows, write sophisticated code, and reason through multi-step problems. But when it comes to actually clicking a button, filling a form, or navigating a desktop application — they're blind. They live in text. They need APIs. And most software doesn't have them.
ByteDance's UI-TARS Desktop closes that gap. At 31,000+ GitHub stars and over 3,000 forks, it's one of the most significant open-source agent projects to emerge from a major tech company. It doesn't just think about tasks: it sees your screen, moves your mouse, and types your keystrokes. And on the GUI benchmarks reported in its research paper, it outperforms GPT-4o and Claude while doing it.
───
What UI-TARS Desktop Actually Is
UI-TARS Desktop is not one thing. It's a multimodal AI agent stack shipping two projects under one umbrella:
Agent TARS — a CLI and Web UI for general-purpose agent work. It brings GUI Agent and Vision capabilities into your terminal, browser, and product. It's built on MCP (Model Context Protocol) and connects to real-world tools: calendars, databases, email, APIs. Think of it as the "headless" side.
UI-TARS Desktop — a native desktop application that literally sees your screen and controls your computer. It takes screenshots, understands what it's looking at through vision-language models, and acts through mouse and keyboard input. Think of it as the "embodied" side.
Together they form something closer to how humans actually work than any agent that's come before. We don't just reason in language — we look, we point, we click. UI-TARS gives agents the same interface.
───
The Core Innovation: Visual Grounding
What makes UI-TARS different from Claude Code, OpenClaw, Hermes Agent, or any terminal-only agent is simple: it sees pixels, not just text.
The system uses the UI-TARS-1.5 vision-language model (available in 7B and 72B parameter versions), trained on approximately 50 billion tokens of screenshot data. This isn't a model that reads HTML and guesses where buttons might be. It parses screenshots — understanding element types, spatial relationships, bounding boxes, visual descriptions, and layout structure.
When you tell it to "install the autoDocstring extension in VS Code," it doesn't run a command. It:
1. Looks at your screen and recognizes VS Code isn't open
2. Clicks the VS Code icon
3. Waits for the window to fully load (it sees the loading state)
4. Identifies the Extensions tab in the sidebar by its visual position
5. Clicks it — and if the click misses, it notices the UI didn't change and tries again
6. Types "autoDocstring" into the search field
7. Watches for the install button to appear
8. Clicks Install and waits for confirmation
Every step is reasoned through visually. When something goes wrong, it doesn't crash — it notices the screen didn't change as expected and self-corrects. This is fundamentally different from API-based automation that breaks on the first unexpected dialog box.
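The retry behavior in step 5 can be illustrated with a minimal sketch. This is not UI-TARS code: FakeScreen and click_and_verify are hypothetical names, and a "screenshot" is reduced to a string so the example runs anywhere. It only shows the shape of the loop: act, compare the screen before and after, and retry when nothing changed.

```python
from dataclasses import dataclass


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a desktop; a 'screenshot' is just a string."""
    state: str = "desktop"
    missed_clicks: int = 1  # how many clicks to silently drop, simulating a miss

    def screenshot(self) -> str:
        return self.state

    def click(self, target: str) -> None:
        if self.missed_clicks > 0:
            self.missed_clicks -= 1  # the click misses, so the screen stays the same
            return
        self.state = f"{target} panel open"


def click_and_verify(screen: FakeScreen, target: str, max_attempts: int = 3) -> int:
    """Click, compare screenshots taken before and after, and retry on no change.

    Returns the number of attempts it took for the screen to change.
    """
    for attempt in range(1, max_attempts + 1):
        before = screen.screenshot()
        screen.click(target)
        after = screen.screenshot()
        if after != before:  # the UI reacted, so treat the step as done
            return attempt
    raise RuntimeError(f"screen never changed after clicking {target!r}")


screen = FakeScreen()
attempts = click_and_verify(screen, "Extensions")
print(attempts, screen.screenshot())
```

A real agent would compare images and reason about why the click missed rather than test strings for equality; the sketch keeps only the control flow.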
───
Benchmark Dominance
The research paper behind UI-TARS, published by ByteDance and Tsinghua University, reports state-of-the-art performance across 10+ GUI benchmarks. The numbers are striking:
| Benchmark | UI-TARS 72B | GPT-4o | Claude 3.5 | Gemini 1.5 Pro |
| -------------- | ----------- | ------ | ---------- | -------------- |
| VisualWebBench | 82.8% | 78.5% | 78.2% | — |
| WebSRC | 93.6%* | — | — | — |
| ScreenQA-short | 88.6% | — | — | — |
*7B model result on WebSRC
In OSWorld — which tests open-ended computer tasks — and AndroidWorld — 116 programmatic tasks across 20 mobile apps — UI-TARS consistently leads. The researchers note that Claude Computer Use "performs strongly in web-based tasks but significantly struggles with mobile scenarios," while UI-TARS "exhibits excellent performance in both website and mobile domains."
This cross-domain capability is significant. Most computer-use agents are web-specialized. UI-TARS works across desktop apps, mobile interfaces, and web applications — same model, same approach.
───
How It Works Under the Hood
UI-TARS's training pipeline is what makes the visual understanding possible:
Screenshot-based training data with parsed metadata — element descriptions, types, bounding boxes, visual descriptions, element functions, and text content. The model learns not just what a button looks like, but what it does.
State transition captioning — the model identifies and describes differences between two consecutive screenshots. This lets it recognize whether a click actually did something, a page loaded, or an error appeared.
Set-of-Mark (SoM) prompting — overlays distinct marks (letters, numbers) on specific screen regions. This gives the model a coordinate system for precise pointing: "click on the element marked 'B'" instead of "click somewhere in the top right."
Dual-system reasoning — the model performs both System 1 (fast, intuitive, automatic) and System 2 (slow, deliberate, multi-step) thinking. It plans, reflects, recognizes milestones, and corrects errors. When a click misses, it doesn't retry blindly — it reasons about why the click might have missed and adjusts.
Error correction training — researchers identified mistakes in training data, labeled corrective actions, and simulated recovery steps. The model learned not just to perform tasks, but to recover when things go wrong.
Short-term and long-term memory — handles immediate task context while retaining historical interactions to improve future decisions. Over time, the agent gets better at navigating interfaces it's seen before.
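To make the Set-of-Mark idea concrete, here is a small sketch of the bookkeeping behind it, assuming elements have already been detected as bounding boxes. The Box class and the assign_marks and resolve_click helpers are hypothetical names for illustration, not part of UI-TARS. Real Set-of-Mark prompting also draws the marks onto the screenshot; this sketch only keeps the mapping from mark to click point.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """A detected UI element: a description plus its bounding box in pixels."""
    description: str
    x: int
    y: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Give each element a letter mark (A, B, C, ...) in detection order."""
    if len(boxes) > len(ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 marks")
    return dict(zip(ascii_uppercase, boxes))


def resolve_click(marks: Dict[str, Box], mark: str) -> Tuple[int, int]:
    """Turn 'click on the element marked B' into pixel coordinates."""
    return marks[mark].center()


elements = [
    Box("search field", x=10, y=10, width=200, height=30),
    Box("install button", x=300, y=120, width=80, height=24),
]
marks = assign_marks(elements)
click_point = resolve_click(marks, "B")
print(click_point)
```

The point of the marks is that the model only has to name a letter, and the harness converts that into an exact coordinate.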
───
What the User Experience Looks Like
Open the desktop app, type a task in natural language, and watch it work:
• A thinking panel on the left shows the agent's step-by-step reasoning — what it sees, what it plans to do, why it's doing it
• The action window on the right shows your actual desktop as the agent controls it
• Every mouse movement, click, and keystroke is visible in real time
• The agent explains its reasoning before acting, so you can intervene before it does something wrong
This "explain-then-act" pattern builds trust in a way that opaque automation never can. You're not hoping the script worked — you're watching it work.
───
Agent TARS: The Terminal Side
For developers who prefer CLI workflows, Agent TARS provides the same capabilities in a terminal package:
Bash
npm install @agent-tars/cli@latest -g
agent-tars --provider anthropic --model claude-sonnet-4-6 --apiKey your-key
It supports multiple model providers (Volcengine/Doubao, Anthropic, OpenAI, and others), runs headful with a Web UI or headless as a server, and integrates with any MCP-compatible tool. The v0.3.0 release added streaming multi-tool support, runtime timing statistics, and an Event Stream Viewer for debugging agent data flow.
The hybrid browser agent is particularly clever: it can use visual grounding (looking at pixels), DOM analysis (reading HTML), or both — switching strategies based on what works better for the current page.
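One way to picture the hybrid approach is a fallback chain of locators. The sketch below is an assumption about how such switching could work, not Agent TARS's actual logic: make_locator and locate are hypothetical helpers, the lookups are faked with dictionaries, and the choice to try the DOM before falling back to vision is made up for the example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Point = Tuple[int, int]
Locator = Callable[[str], Optional[Point]]


def make_locator(known: Dict[str, Point]) -> Locator:
    """Fake locator: returns a click point for known targets, None otherwise."""
    return known.get


def locate(target: str, strategies: Sequence[Tuple[str, Locator]]) -> Tuple[str, Point]:
    """Try each strategy in order and report which one found the target."""
    for name, locator in strategies:
        point = locator(target)
        if point is not None:
            return name, point
    raise LookupError(f"no strategy could find {target!r}")


# "Play" is drawn on a canvas in this made-up page, so it has no DOM node.
dom = make_locator({"Submit": (120, 48)})
vision = make_locator({"Submit": (121, 50), "Play": (400, 300)})
strategies = [("dom", dom), ("visual", vision)]

submit = locate("Submit", strategies)
play = locate("Play", strategies)
print(submit, play)
```

Elements that exist in the HTML are found through the DOM, while canvas-rendered or otherwise opaque elements fall through to the visual path.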
For isolated execution, it supports the AIO Agent Sandbox, letting the agent run in a containerized environment without risking your actual machine.
───
Remote Operation: The Killer Feature
Version 0.2.0 introduced something that changes the game: Remote Computer Operator and Remote Browser Operator — completely free.
No configuration. Click a button. The agent controls any remote computer or browser. This turns UI-TARS from a personal automation tool into something usable for remote support, distributed testing, cloud-based workflows, and multi-machine orchestration.
The remote browser operator is particularly practical for web scraping and testing: run agents against websites from cloud machines with different IPs, screen resolutions, and browser configurations — all through the same interface.
The remote computer operator opens possibilities for server administration, legacy system interaction, and cross-platform workflows where you need AI on a machine you're not sitting in front of.
───
The Rough Edges
UI-TARS is massive and mature by open-source standards, but it's not without friction:
The 386 open issues tell you this is actively developed and actively used. The issue count comes from real-world usage, not neglect. But expect to hit edge cases.
Model dependency is real. The best results come from ByteDance's own UI-TARS-1.5 and Seed-1.5-VL/1.6 models. Running with third-party models like Claude or GPT-4o works but doesn't leverage the full visual training pipeline. The 72B model requires serious hardware to run locally.
Visual automation is slower than API automation. If you're running the same task 100 times a day, write a script. UI-TARS shines for complex, infrequent, multi-application workflows — not repetitive operations where milliseconds matter.
Safety concerns are non-trivial. An agent that can move your mouse and type on your keyboard is an agent that can delete files, send messages, or make purchases. The thinking-panel transparency helps, but careful supervision and sandboxing are essential for high-stakes operations.
The ByteDance factor. For users concerned about Chinese tech company involvement, the Apache 2.0 license and fully open-source code provide transparency. Everything runs locally. Nothing phones home to ByteDance servers unless you explicitly use their hosted model APIs.
───
How It Compares
| Aspect | UI-TARS | Claude Computer Use | holaOS | OpenClaw/Hermes |
| ----------------- | -------------------------------------- | -------------------------- | ----------------------- | --------------------- |
| Approach | Vision model sees screen + controls OS | Screenshot-based control via API tool calls | Shared visual workspace | Terminal + text tools |
| OS Control | Native mouse/keyboard | API-mediated | Via agent harness | Shell only |
| Mobile support | Yes (AndroidWorld-tested) | Limited (struggles, per the UI-TARS paper) | No | No |
| Remote operation | Built-in, free | Limited | Via VNC | Not applicable |
| Model flexibility | Best with own models | Claude-only | Multi-model | Multi-model |
| License | Apache 2.0 | Proprietary | Modified Apache 2.0 | MIT |
| Stars | 31K+ | N/A (API) | 4.5K+ | 370K+ (OC) |
UI-TARS occupies a unique position: it's the only project that combines visual screen understanding, native OS control, remote operation, and open-source licensing in one package. It's not better than terminal agents for everything — but for tasks that require actual GUI interaction, nothing else in open source comes close.
───
Who This Is For
If your agent work lives entirely in terminals, code editors, and APIs, Agent TARS's CLI might be all you need. The visual desktop component adds overhead you don't require.
But if you've ever needed an agent to:
• Navigate a legacy enterprise application with no API
• Book flights across multiple travel sites comparing prices
• Change system settings across different operating systems
• Test your application visually across different screen sizes
• Automate workflows that span three different desktop applications
• Run agents on remote machines without setting up complex infrastructure
...then UI-TARS Desktop solves a problem no terminal agent can touch.
It's also useful as a research platform. The models are openly released, and the paper documents the training pipeline and the benchmarks it was evaluated on. If you're working on agent evaluation or training, that gives you an open baseline to compare against.
───
Links:
• GitHub: github.com/bytedance/UI-TARS-desktop — 31K+ stars
• Website: agent-tars.com
• Paper: UI-TARS: Pioneering Automated GUI Interaction with Native Agents
• Quick Start: npx @agent-tars/cli@latest or clone the desktop repo
• Discord: discord.gg/HnKcSBgTVx
Space Agent: The AI That Rewrites Its Own Interface While You Watch
Every AI agent you've ever used shares a fundamental limitation: the interface is fixed. Someone else designed the chat window, the terminal, the web dashboard. The agent can think, reason, and execute, but it can't change how you see and interact with it. The interface is a wall the agent can talk through but never reshape.
Space Agent from the Agent Zero project tears that wall down. At 1,100+ GitHub stars and growing fast, it's not the biggest agent project. But it might be the most radical. It doesn't just have a web UI; it lives in the web UI. And it can change that UI at runtime, in real time, based on what you ask it to do.
───
What Space Agent Actually Is
Space Agent is a browser-native AI agent. Not "it has a web UI." Not "there's a frontend that talks to a backend agent." The agent literally executes inside your browser's JavaScript runtime. It runs in the same environment that renders the page you're looking at.
This architectural choice, described by creator Yan as "the agent lives and executes inside the browser's JavaScript runtime," unlocks capabilities no other agent has.
Because the agent and the UI share the same runtime, the agent can mutate the DOM directly. It can create new interface elements on the fly. It can build dashboards, widgets, games, and tools, not by generating static images or running a server-side process, but by writing and injecting JavaScript that runs right there in your tab.
You ask for a crypto dashboard. It fetches prices via API, generates renderer functions, and draws ticker prices, charts, and news feeds directly onto an infinite canvas grid, all instantly, all interactive.
You say "make me four world clocks for Tokyo, Rome, London, and New York." Sixty seconds later you have live-updating clocks. 7,000 tokens total, including refinements.
You want a notes app with folders, markdown support, and drag-and-drop? It builds one. From scratch. In your browser.
───
The Core Innovation: Client-Side Agency
This is the conceptual breakthrough that makes Space Agent different from everything else:
Most agent architectures stack layers. The agent runs on a server (Python, Node.js). You interact through an intermediary: Telegram, Discord, a terminal, or a pre-built web dashboard. The agent controls its own layer but cannot touch the layers above it.
Space Agent collapses the stack. The agent runs on the client side, the same layer that renders the UI. When it needs to show you something, it doesn't generate an artifact and hope the right software opens it. It writes code into the page you're looking at.
This matters for three reasons:
1. Instant, Interactive Output
When Space Agent builds a weather widget, it's not a generated image. It's a live, interactive component (JavaScript, HTML, and CSS) that updates, responds to clicks, and can be modified further. The agent can come back to it later and add features.
2. Token Efficiency Through Architecture
Space Agent doesn't use tool calling, structured output, or JSON schemas. Those add overhead, sometimes doubling the token cost of an interaction. Instead, Space Agent uses natural language for everything. When it needs to execute code, it appends a two-token marker (__javascript) followed by raw JavaScript. The browser detects the marker and runs the code.
Web browsing is similarly lean. A full session that navigates to Google, accepts cookies, searches for "agent zero," opens the official site, navigates to GitHub, and finds the oldest release adds just 6,000 tokens for the entire browsing part. The trick: website DOM trees get transcribed into lists of images, links, and text elements in "transient space," appended after the last caching breakpoint so they don't accumulate in chat history.
Yan is explicit about the philosophy: "It's much more important for a codebase to be optimized for agents than for humans."
3. True Persistence Without Infrastructure
Widgets persist across page refreshes because they are saved as text files rather than living only in the DOM. Spaces are just folders of text files: kilobytes, not gigabytes. No database, no Redis, no S3. Close the tab, come back in a week, and your workspace is exactly how you left it.
───
How It Works Under the Hood
Space Agent runs entirely on standard web technologies:
• Frontend: React + TypeScript, rendered in a browser tab
• Agent execution: Browser JavaScript runtime, with no WebAssembly and no plugins
• LLM communication: Direct API calls to your chosen provider (OpenRouter with 200+ models, Anthropic, OpenAI, or local models via WebGPU)
• Persistence: Widgets saved as text files. An optional thin Node.js backend handles file storage and permissions for multi-user setups
The backend, when used, is deliberately dumb. It serves files and manages permissions. It never runs agent logic. All reasoning and UI generation happens client-side. This means:
• No server costs for inference (you pay only for LLM API calls)
• No data leaves your machine unless you choose a cloud model
• Instant UI updates, with no page reloads and no WebSocket latency
• Full browser APIs available: Canvas, Audio, WebGL, file system access
───
The Skill System: Everything Is a SKILL.md
Space Agent's modularity is built into its foundation. Every capability (the browser, the spaces system, development documentation, even core framework pieces) lives as a SKILL.md file in a virtual filesystem. The agent can read these files. And it can modify them.
"Everything in the framework including the core is built as a module," Yan explains. "Modules can be added or removed at any time, and they can be developed by the agent itself."
No server restart. No rebuild. The agent extends its own capabilities at runtime. This creates a feedback loop that no other agent system has: the agent gets better over time not because the model improved, but because the agent wrote better tools for itself.
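The marker convention and the "transient space" idea described under Token Efficiency Through Architecture can be sketched in a few lines. This is an illustration only: Space Agent does this in browser JavaScript, and split_reply and build_request are hypothetical helpers written in Python to keep the example short. Only the __javascript marker itself comes from the description above.

```python
from typing import List, Optional, Tuple

MARKER = "__javascript"


def split_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Split a model reply into (message, code); code is None without a marker."""
    message, found, code = reply.partition(MARKER)
    if not found:
        return reply.strip(), None
    return message.strip(), code.strip()


def build_request(history: List[str], transient: Optional[str]) -> List[str]:
    """Send the stored history plus transient page content for this request only.

    The transient text is appended to a copy, so it never enters the history.
    """
    return history + ([transient] if transient else [])


reply = "Here are your clocks.\n__javascript\nrenderClocks(['Tokyo', 'Rome'])"
message, code = split_reply(reply)

history = ["user: find the oldest release"]
request = build_request(history, "page: 3 links, 1 image, 2 text blocks")
print(message, code, len(history), len(request))
```

Because the page transcript is added only to the outgoing request, the next turn starts from the same short history, which is the effect the article attributes to transient space.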
───
Documentation-Driven Development
Space Agent uses a hierarchical AGENTS.md system: Markdown documentation that lives alongside the code. When the agent needs to modify a widget or extend a capability, it reads the relevant documentation first to understand design intent, constraints, and patterns.
This solves the core problem that plagues AI-assisted development: context fragmentation. Without persistent documentation, agents rewrite working code because they forgot why it was written that way, introduce breaking changes by not understanding dependencies, and repeat the same mistakes across sessions.
With AGENTS.md files, the agent has persistent, searchable memory of design decisions. It's a blueprint for how all AI-assisted development should work.
───
Multi-User Architecture: Personal to Hierarchical
Space Agent supports deployment from personal use to team infrastructure:
• Per-user isolation: Users build in their own layer without affecting anyone else
• Group sharing: Teams can share tools, workflows, and behavior when ready
• Permission granularity: Read-only or read-write per folder, per group
• Home directory development: Users can create custom capabilities in their home directory that only they see
An accounting department could have specialized financial tools. An engineering team could share deployment scripts. Individual users keep personal workflows private. All in the same instance.
───
Time Travel and Admin Mode
Self-modifying systems raise an obvious question: what happens when the agent writes buggy code that breaks its own runtime?
Space Agent handles this better than any agent framework I've seen. Every user and group folder maintains an automatic Git repository. Break something? Revert to any previous state immediately.
The page won't even render? Admin mode splits the screen: one side runs static firmware that lets you browse files and time travel even when the main interface is completely broken. It's a control plane separate from the agent's workspace, inspired by what makes hardware devices recoverable.
───
Sandboxed Sharing
One of the most practical features: create something interesting and generate a share link. Recipients open it in their browser without installing anything.
"They don't need to worry about malicious code stealing their secrets," Yan notes. The shared workspace runs isolated in their browser, separate from their own Space Agent instance. No risk to their data, no installation friction.
This lowers the barrier between "I built something cool" and "someone else can use it" to almost zero. Expect to see interactive AI-built tools shared the way people share Notion templates or Figma mockups.
───
The Rough Edges
Space Agent is young. 47 open issues, 1,100 stars: this is an early-stage project from a small team (Agent Zero). It's not battle-tested at scale.
Browser-sandbox limitations are real. Everything runs in the browser sandbox. No native OS control. No desktop app integration beyond what JavaScript can do. UI-TARS can click system tray icons and change OS settings; Space Agent cannot.
Dependent on your LLM. Space Agent provides an efficient runtime, not better AI. If your chosen model can't handle the task, no amount of architectural cleverness helps. The system prompt is 9,000 tokens, leaving roughly 7,000 for conversation context, which is tight for complex multi-turn work.
WebGPU inference is aspirational. Local model inference via WebGPU works but requires "a beefy GPU." Yan is realistic: "I don't expect the speed to be great here." For practical use, you'll be calling cloud APIs.
The demo server's guest accounts are temporary. Deleted after a few days of inactivity. Self-hosting is straightforward but adds a Node.js process to manage.
───
How It Compares
| Aspect | Space Agent | UI-TARS Desktop | holaOS | OpenClaw/Hermes |
| -------------------- | ---------------------------- | --------------------------- | ----------------------- | --------------------- |
| Where agent runs | Browser JS runtime | Model backend + OS | Electron desktop | Server-side |
| Interface mutability | Agent builds UI in real time | Sees but doesn't reshape | Shared visual workspace | Fixed (terminal/text) |
| OS-level control | No (browser sandbox) | Yes (native mouse/keyboard) | Via agent harness | Shell only |
| Token efficiency | Very high (no JSON/tools) | Moderate | Moderate | Low (JSON tool calls) |
| Self-modification | Agent extends its own skills | No | No | No |
| Multi-user | Hierarchical, built-in | No | Workspace-level | Single-tenant |
| Time travel/rollback | Git-backed, admin mode | No | Via filesystem | Backups only |
| License | MIT | Apache 2.0 | Modified Apache 2.0 | MIT |
| Stars | 1.1K+ | 31K+ | 4.5K+ | 370K+ (OC) |
| Maturity | Early (Mar 2026) | Production (1yr+) | Mid | Varies |
Space Agent occupies a unique position. It's not trying to control your OS (UI-TARS) or be a shared desktop (holaOS) or be a personal terminal agent (OC/Hermes). It's trying to answer a different question: what happens when the agent is the interface?
───
Who This Is For
Space Agent is for builders who want their AI to build with them. If your workflow involves:
• Rapidly prototyping interactive dashboards and tools
• Creating custom visualizations from data without writing frontend code
• Exploring ideas visually ("show me what that would look like")
• Sharing interactive AI-built tools without recipients installing anything
• Extending your agent's capabilities over time through self-written skills
...Space Agent offers something no other agent can.
It's also for developers who care about token economics. The architectural efficiency gains (no JSON tool schemas, transient web browsing messages, two-token code execution markers) add up to real savings at scale. If you're paying per token, Space Agent's approach cuts overhead dramatically.
───
Links:
• Website: space-agent.ai (try it live, no install)
• GitHub: github.com/agent0ai/space-agent (1.1K+ stars, MIT license)
• Demo: Guest account with one click, just add your API key
• Self-host: git clone && npm install && node space serve
• Discord: discord.gg/B8KZKNsPpj
• DeepWiki: deepwiki.com/agent0ai/space-agent
inMusic Native: The All-in-One Production Ecosystem That Breaks Down DAW Walls
In a bold move that challenges the fragmented world of music production software, inMusic Brands has just unveiled inMusic Native, a unified music creation ecosystem that bridges hardware, software, and collaboration like never before.
Announced today, inMusic Native isn't just another digital audio workstation. It's a platform-agnostic environment designed to work seamlessly across popular DAWs including Ableton Live, Logic Pro, FL Studio, Cubase, and Studio One, while also offering a standalone native application (currently in beta for macOS and Windows). The platform brings together the legendary brands under the inMusic umbrella (Akai Professional, Alesis, Alto Professional, Denon DJ, M-Audio, Marantz Professional, Numark, and Rane) into a single creative workflow.
What Makes It Different
The core philosophy behind inMusic Native is "One Ecosystem, Any DAW." Rather than forcing users to abandon their preferred software, inMusic Native integrates directly into existing workflows via VST3, AU, and AAX plugin formats. It's not a replacement for your DAW; it's a supercharged layer on top.
Key components include:
• Unified Plug-in Suite: A massive collection of instruments, effects, and production tools drawn from inMusic's brand portfolio, all accessible through a single interface. Think classic MPC drum samplers, AIR Music Tech synths, Denon DJ effects, and studio-grade processors, all consistently organized and instantly recallable.
• Hardware Deep Integration: Instruments like the Akai MPC and M-Audio controllers gain new depth with automatic parameter mapping, template switching, and tactile control of the entire software environment. It's a level of hardware-software cohesion that's previously been locked behind brand-specific ecosystems.
• Native Collaboration Cloud: Real-time project sharing, version history, and feedback loops with collaborators using different DAWs. One user might be in Ableton, another in Logic; they can work on the same session elements without file conversion gymnastics.
• Sound Content Marketplace: Browse, preview, and download expansions, sample packs, and artist-curated presets from within the interface. It's a direct pipeline from discovery to creation without ever leaving the creative flow.
• Cross-DAW Session Portability: A proprietary session format that lets you open your work in any supported DAW while preserving track routing, plugin settings, and automation. This is a holy grail for producers who switch between environments or collaborate across studios.
Why This Matters
The DAW market has been Balkanized for decades. Users invest years mastering a specific workflow, and switching costs are astronomical. inMusic Native doesn't ask you to switch; it enhances what you already use. That's a strategic masterstroke: by embedding itself as a plugin layer, inMusic bypasses the zero-sum game of DAW competition.
For current owners of Akai, M-Audio, or other inMusic hardware, the value proposition is immediate. These devices suddenly unlock new capabilities without any firmware updates. For producers drowning in third-party plugin management, the unified suite offers a clean, well-curated alternative with the pedigree of inMusic's decades-long history in music technology.
Pricing and Availability
inMusic Native enters public beta today for macOS and Windows, with a free tier offering core instruments and effects. A Pro subscription ($9.99/month) unlocks the full plug-in suite, collaboration cloud, and hardware integration features. Early adopters who sign up during the beta period receive a 30% lifetime discount.