jmerelnyc/Photo-agents

This is a research-grade framework for building agents that watch your screen, interpret what they see, and gradually write their own reusable skills as they complete desktop tasks. Instead of relying on a hardcoded library of actions, the system lets an LLM observe the visual context, execute a task, and then encode what worked into a persistent memory layer. The next time a similar situation appears, the agent pulls from that accumulated skill bank rather than reasoning from scratch. In theory, it gets faster and more capable over many runs without human intervention. The appeal for AI engineers is real: layered, vision-grounded memory is a genuine architectural problem that most agent frameworks sidestep. If you are prototyping an autonomous workflow tool or building a desktop RPA replacement with adaptive behavior, this gives you a starting skeleton worth studying. The main reservation is that the project is clearly early and experimental: documentation is limited, there is no production hardening, and executing self-written code on a live machine carries obvious risk if guardrails slip. -> Best for: AI engineers prototyping adaptive desktop automation agents who can tolerate rough edges and want to study self-evolving skill architectures firsthand.