Agentic AI · Game Engineering

VisionTactical

Two off-the-shelf LLMs play a 3D shooter from screenshots and game state alone. No RL, no training, zero gradient steps.

Year: 2025

Role: Agentic AI · Game Engineering

Scope: Vision-Language Models, Multi-Agent Systems, Game AI

Device: Autonomous Agent

Tools: Python, Ursina, Mistral 3.1, DeepSeek V3, Pynput, MSS

Link: github.com/zaineli/vision-tactical

01 Context

Human perception, robotically.

VisionTactical is an autonomous agent that plays a custom 3D first-person shooter. It tests a simple idea: can off-the-shelf language models play a real-time game using nothing but sight and reasoning, with no reinforcement learning and no training at all?

I built the game and the agents. The game is a 3D FPS in Ursina; the agents sit on top, perceiving the world through screenshots and JSON state and acting through simulated keyboard and mouse.
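To make the "JSON state" half of that perception concrete, here is a minimal sketch of what such a payload could look like. Every field name below (`player_position`, `ammo`, `enemies`, and so on) is an illustrative assumption, not the schema used in the repository.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class EnemyState:
    # Hypothetical fields: where the enemy stands and how much health it has left.
    position: Vec3
    health: int


@dataclass
class GameState:
    # Hypothetical fields describing the player and the visible enemies.
    player_position: Vec3
    player_health: int
    ammo: int
    enemies: List[EnemyState] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses; tuples serialize as JSON arrays.
        return json.dumps(asdict(self))
```

A text model can be handed the string from `to_json()` directly, which keeps the state it reasons over explicit and easy to log.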

02 The Problem

Real-time play, zero training budget.

Game-playing agents usually mean reinforcement learning: millions of frames, reward shaping, and a policy that only works in the world it trained on. I wanted the opposite: generality instead of a training loop, with both LLM calls fitting inside a real-time control budget.

03 Approach

A Commander and a Lieutenant.

Two roles. The Commander (Mistral 3.1), a vision model, looks at the screen and sets strategy from what it sees. The Lieutenant (DeepSeek V3), a text model, reads structured game state and turns the Commander's intent into the next concrete move.
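The handoff between the two roles can be sketched as a prompt builder for the Lieutenant plus a defensive parser for its reply. This is a sketch under assumptions: the prompt wording, the action vocabulary in `ALLOWED_ACTIONS`, and the fallback to `wait` are all hypothetical and are not taken from the project.

```python
import json
from typing import Any, Dict

# Hypothetical action vocabulary; the real project may use different names.
ALLOWED_ACTIONS = {
    "move_forward", "move_back", "strafe_left", "strafe_right",
    "turn", "shoot", "wait",
}


def lieutenant_prompt(intent: str, state: Dict[str, Any]) -> str:
    """Combine the Commander's intent with structured state into one text prompt."""
    return (
        "You are the Lieutenant. Commander intent: " + intent + "\n"
        "Game state (JSON): " + json.dumps(state) + "\n"
        "Reply with exactly one JSON object, for example {\"action\": \"shoot\"}. "
        "Allowed actions: " + ", ".join(sorted(ALLOWED_ACTIONS))
    )


def parse_action(reply: str) -> Dict[str, Any]:
    """Pull the outermost JSON object out of a model reply; fall back to 'wait'."""
    fallback = {"action": "wait"}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        return fallback
    try:
        action = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    name = action.get("action") if isinstance(action, dict) else None
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        return fallback
    return action
```

Falling back to a harmless action when the reply is malformed keeps one bad generation from stalling a real-time loop.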

The loop is deliberately simple: capture the screen and state, ask the Commander what's happening and what to do, ask the Lieutenant for the action, execute it with keyboard and mouse, repeat.
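Those steps can be sketched as a small driver that takes each stage as a callable. The function name `run_loop`, its parameters, and the returned history are assumptions made for illustration; the model calls and input handling are left to whatever is passed in.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Action = Dict[str, Any]


def run_loop(
    capture: Callable[[], Tuple[bytes, State]],
    commander: Callable[[bytes], str],
    lieutenant: Callable[[str, State], Action],
    execute: Callable[[Action], None],
    max_steps: int,
) -> List[Dict[str, Any]]:
    """Run the perceive, decide, act cycle for a fixed number of steps."""
    history: List[Dict[str, Any]] = []
    for _ in range(max_steps):
        screenshot, state = capture()        # screen pixels plus structured state
        intent = commander(screenshot)       # vision model: what is happening, what to do
        action = lieutenant(intent, state)   # text model: the next concrete move
        execute(action)                      # keyboard and mouse output
        history.append({"intent": intent, "action": action})
    return history
```

Passing the stages in as callables means the loop can be exercised with stubs before any model or game window is involved.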

System architecture: an Ursina game environment, screenshot + JSON state capture, the Commander/Lieutenant decision layer, and pynput action execution.

04 Key Decisions

What shaped it

Split sight from tactics

A vision model owns perception and strategy; a text model owns precise actions on structured state.
Each model does what it's best at, and neither has to do both.

Own the whole stack

I built the 3D game in Ursina, along with the state schema the agents read, so they had a world to act in.
Screenshots via MSS, control via pynput. The agent uses the same inputs a human would.
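A rough sketch of the capture and control ends, assuming the `mss` and `pynput` packages are installed and a display is available. The action names, the key bindings in `KEY_FOR_ACTION`, and the default hold time are hypothetical; only the library calls themselves (`mss.mss().grab`, `mss.tools.to_png`, and the pynput `Controller` methods) are standard.

```python
import time
from typing import Any, Dict, List, Tuple

# Hypothetical mapping from action names to WASD keys.
KEY_FOR_ACTION = {
    "move_forward": "w", "move_back": "s",
    "strafe_left": "a", "strafe_right": "d",
}


def capture_png(monitor_index: int = 1) -> bytes:
    """Grab one monitor with MSS and return it as PNG bytes."""
    import mss
    import mss.tools
    with mss.mss() as sct:
        shot = sct.grab(sct.monitors[monitor_index])
        return mss.tools.to_png(shot.rgb, shot.size)


def plan_inputs(action: Dict[str, Any]) -> List[Tuple]:
    """Translate one action dict into low-level input events (pure function)."""
    name = action.get("action")
    if name in KEY_FOR_ACTION:
        return [("key", KEY_FOR_ACTION[name], float(action.get("seconds", 0.2)))]
    if name == "turn":
        return [("mouse_move", int(action.get("dx", 0)), int(action.get("dy", 0)))]
    if name == "shoot":
        return [("click",)]
    return []  # unknown actions produce no input


def apply_inputs(events: List[Tuple]) -> None:
    """Send planned events through pynput; imported lazily because it needs a display."""
    from pynput.keyboard import Controller as KeyboardController
    from pynput.mouse import Button, Controller as MouseController
    keyboard, mouse = KeyboardController(), MouseController()
    for event in events:
        if event[0] == "key":
            _, key, seconds = event
            keyboard.press(key)
            time.sleep(seconds)   # hold the key for the requested duration
            keyboard.release(key)
        elif event[0] == "mouse_move":
            mouse.move(event[1], event[2])  # relative mouse movement
        elif event[0] == "click":
            mouse.click(Button.left)
```

Keeping `plan_inputs` free of side effects lets the mapping be checked without touching the real keyboard or mouse; `apply_inputs(plan_inputs(action))` is the only call that moves anything.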

No RL, on purpose

No reward shaping, no training run, no fine-tune. Only prompting and a tight loop.
The bet: frontier models already have enough of a world model to adapt in-context.

05 Craft Moment

Lose, then adapt, then win.

The whole thesis lives in two clips. First the agent gets demolished by a single enemy: the loop runs, but the strategy is wrong. Then, with the same models and no retraining, it reads the situation differently and wins. Adaptation without a single gradient step.

Attempt one: outplayed.

Attempt two: adapts and wins.

06 What Shipped

The system

Custom 3D FPS

A first-person shooter built in Python + Ursina, with a clean state schema.

Commander (vision)

Mistral 3.1 reads the screen and sets strategy.

Lieutenant (text)

DeepSeek V3 turns intent into the next keyboard/mouse action.

Perceive-act loop

MSS screen capture in, pynput control out. Open-sourced under MIT.

The arena, built in Ursina: a low-poly FPS with deployable cover and a live enemy. The world the agents actually play in.

07 Outcome

Where it landed

A working proof that two prompted models, split by role, can play a real-time 3D game and adapt across attempts. Built end to end, game included.

Training

None, prompting only

Agents

Vision + text, 2 roles

Code

Open source · MIT

08 What I Learned

Portable lessons

01 Split the problem by what each model is best at.

02 Perception plus reasoning can replace a training loop.

03 Give an agent the same controls a human has, and it adapts.