A new benchmark study shows that leading AI agents can successfully complete only a tiny fraction of economically valuable remote work projects at a professional standard comparable to human freelancers.
The Remote Labor Index (RLI), developed by researchers from Scale AI and the Center for AI Safety, tested six frontier AI systems on 240 real freelance projects sourced from platforms such as Upwork. Automation rates ranged from 2.5% for the top performer, Manus, down to 0.8% for Gemini 2.5 Pro, with Grok 4 and Sonnet 4.5 at 2.1%, GPT-5 at 1.7%, and the ChatGPT agent at 1.3%.
These projects, spanning 23 categories such as product design, 3D animation, architecture, game development, data visualization, audio-video editing, and scientific document preparation, represented more than 6,000 hours of paid human labor valued at approximately $140,000. Each included the original client brief and the accepted human deliverable as the gold standard.
The study, first published in October 2025 and widely covered in early January 2026 (including by the Washington Post on January 8), measured the automation rate: the percentage of projects an AI completed at a quality level a reasonable client would accept. Results placed performance near the floor across the board.
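For concreteness, the metric reduces to simple arithmetic: accepted deliverables divided by total projects. The short Python sketch below reconstructs approximate accepted-project counts from the published rates; the counts are inferred (rate × 240) rather than reported figures, so treat them as rounded estimates.

```python
# Minimal sketch of the RLI automation-rate arithmetic.
# Rates are the published per-model figures; accepted counts are
# inferred from them (rate/100 * 240) and therefore approximate.

TOTAL_PROJECTS = 240

reported_rates = {  # model -> automation rate (%)
    "Manus": 2.5,
    "Grok 4": 2.1,
    "Sonnet 4.5": 2.1,
    "GPT-5": 1.7,
    "ChatGPT agent": 1.3,
    "Gemini 2.5 Pro": 0.8,
}

for model, rate in reported_rates.items():
    accepted = round(rate / 100 * TOTAL_PROJECTS)
    print(f"{model}: ~{accepted} of {TOTAL_PROJECTS} projects accepted ({rate}%)")
```

Even the best result works out to roughly 6 accepted projects out of 240, which is what "near the floor" means in practice.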
Common failure modes included:
- Poor quality (45.6% of deliverables): inconsistencies, low-grade visuals, missing assets, or errors like incorrect colors, overlapping text, or missing data in visualizations.
- Incompletion (35.7%): tasks left unfinished.
- Technical issues (nearly 1 in 5 cases): corrupt files, empty outputs, or basic execution failures (e.g., producing an 8-second video when 8 minutes were requested).
The researchers noted that AI agents struggle with multi-step coordination, self-verification, visual understanding, and integrating multiple skill layers, failures that compound in realistic freelance workflows. The agents did achieve scattered successes on narrower tasks, such as text-heavy data visualization, audio cleanup, and simple image generation, but full end-to-end project autonomy remains far out of reach.
Jason Hausenloy, one of the RLI researchers, stated: “Current models are not close to being able to automate real jobs in the economy.”
The benchmark provides a grounded counterpoint to hype around AI agents, emphasizing that lab benchmarks and isolated tasks do not reflect the complexity of paid remote labor. The team plans to update the index as newer models emerge.
The findings, first detailed in the October 2025 arXiv paper, gained renewed attention through the January 2026 media coverage; no major updates or retractions have appeared as of January 11, 2026.

