researchcodebench

v1.0

ResearchCodeBench evaluates AI agents' ability to implement algorithms from academic papers. It contains 212 code implementation tasks drawn from 20 ML/AI research problems published at top-tier venues (ICLR, NeurIPS, CVPR, COLM). The tasks test paper comprehension, algorithm understanding, and precise code implementation, and are backed by 1,449 lines of reference code.

uvx harbor run -d researchcodebench@1.0
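
When only a handful of tasks are of interest, the per-task form of the command (`-t <task>`) can be generated in a loop. The sketch below is a dry run: it only prints the commands and does not invoke `harbor`. The helper name `task_cmd` and the three chosen tasks are illustrative, and the sketch assumes nothing about the CLI beyond the `-d` and `-t` flags shown on this page.

```shell
# Dataset identifier as shown on this page.
DATASET="researchcodebench@1.0"

# Hypothetical helper: print the command line for a single task.
task_cmd() {
  printf 'uvx harbor run -d %s -t %s\n' "$DATASET" "$1"
}

# Dry run over a hand-picked subset of the task names listed on this page.
for task in \
  siss_importance_sampling_weights \
  tanh-init_identity_matrix \
  tanh-init_update
do
  task_cmd "$task"
done
```

To execute the tasks rather than print them, the output can be reviewed first and then piped to `sh`.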

Tasks (212)

siss_importance_sampling_weights

uvx harbor run -d researchcodebench@1.0 -t siss_importance_sampling_weights

69581ca

siss_subtracted_importance_sampled_scores_importance_sampling_with_mixture

uvx harbor run -d researchcodebench@1.0 -t siss_subtracted_importance_sampled_scores_importance_sampling_with_mixture

69581ca

tabdiff_compute_total_noise_for_categorical_features_with_learnable_k

uvx harbor run -d researchcodebench@1.0 -t tabdiff_compute_total_noise_for_categorical_features_with_learnable_k

69581ca

tabdiff_compute_total_noise_for_numerical_features_with_learnable_rho

uvx harbor run -d researchcodebench@1.0 -t tabdiff_compute_total_noise_for_numerical_features_with_learnable_rho

69581ca

tabdiff_initialize_the_learnable_feature-wise_parameter_k_for_categorical_features

uvx harbor run -d researchcodebench@1.0 -t tabdiff_initialize_the_learnable_feature-wise_parameter_k_for_categorical_features

69581ca

tabdiff_initialize_the_learnable_feature-wise_parameter_rho_for_numerical_features

uvx harbor run -d researchcodebench@1.0 -t tabdiff_initialize_the_learnable_feature-wise_parameter_rho_for_numerical_features

69581ca

tabdiff_make_sure_learnable_parameter_ks_for_categorical_features_are_positive

uvx harbor run -d researchcodebench@1.0 -t tabdiff_make_sure_learnable_parameter_ks_for_categorical_features_are_positive

69581ca

tabdiff_make_sure_learnable_parameter_rhos_are_greater_than_rho_offset

uvx harbor run -d researchcodebench@1.0 -t tabdiff_make_sure_learnable_parameter_rhos_are_greater_than_rho_offset

69581ca

tanh-init_identity_matrix

uvx harbor run -d researchcodebench@1.0 -t tanh-init_identity_matrix

69581ca

tanh-init_identity_matrix_else

uvx harbor run -d researchcodebench@1.0 -t tanh-init_identity_matrix_else

69581ca

tanh-init_proposed_weight_initialization

uvx harbor run -d researchcodebench@1.0 -t tanh-init_proposed_weight_initialization

69581ca

tanh-init_update

uvx harbor run -d researchcodebench@1.0 -t tanh-init_update

69581ca