harbor
A benchmark that evaluates how well AI agents can extend existing AI research by implementing research experiments. Original benchmark: https://github.com/tinlaboratory/rexbench. Website: https://rexbench.com/.
uvx harbor run -d rexbench@1.0
uvx harbor run -d rexbench@1.0 -t cogs
uvx harbor run -d rexbench@1.0 -t othello
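To run several tasks in sequence, the per-task commands above can be wrapped in a small loop. This is a minimal sketch, not part of harbor: the function name `run_rexbench_tasks` and the `DRY_RUN` variable are hypothetical helpers introduced here, and the only harbor flags used are the `-d` and `-t` forms already shown. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by harbor): issue one harbor command per task name.
# DRY_RUN is an assumption of this sketch; it defaults to 1, which only prints the commands.
run_rexbench_tasks() {
  local task
  for task in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Print the command instead of executing it.
      echo "uvx harbor run -d rexbench@1.0 -t ${task}"
    else
      # Same invocation as the per-task commands above, with the task name substituted.
      uvx harbor run -d rexbench@1.0 -t "${task}"
    fi
  done
}
```

For example, `run_rexbench_tasks cogs othello` prints the two per-task commands, and `DRY_RUN=0 run_rexbench_tasks cogs othello` executes them one after the other.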