On the OSWorld benchmark, which evaluates a model's ability to use a computer, humans typically score around 70-75%, while Claude scored just 14.9%. But that's nearly double the score of the ...