ml-dev-bench

v1.0

ML-Dev-Bench: a benchmark for testing AI agents on machine learning development tasks, including model implementation, training, debugging, and optimization.

uvx harbor run -d ml-dev-bench@1.0

Tasks (33)

ml_dev_bench_basic_vision_finetuning
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_basic_vision_finetuning
044856a
ml_dev_bench_bert_eval_debug
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_bert_eval_debug
044856a
ml_dev_bench_boolq_performance
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_boolq_performance
044856a
ml_dev_bench_channel_vit_implementation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_channel_vit_implementation
044856a
ml_dev_bench_channel_vit_implementation_easy
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_channel_vit_implementation_easy
044856a
ml_dev_bench_channel_vit_implementation_no_test
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_channel_vit_implementation_no_test
044856a
ml_dev_bench_cifar_10_lt_performance
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_cifar_10_lt_performance
044856a
ml_dev_bench_cifar100_performance
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_cifar100_performance
044856a
ml_dev_bench_dataset_not_available_download
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_dataset_not_available_download
044856a
ml_dev_bench_dataset_preprocess
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_dataset_preprocess
044856a
ml_dev_bench_full_train_workflow_performance_test
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_full_train_workflow_performance_test
044856a
ml_dev_bench_full_train_workflow_setup_test
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_full_train_workflow_setup_test
044856a
ml_dev_bench_improve_cifar10_baseline
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_improve_cifar10_baseline
044856a
ml_dev_bench_improve_segmentation_baseline
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_improve_segmentation_baseline
044856a
ml_dev_bench_lora_implementation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_lora_implementation
044856a
ml_dev_bench_mcts_implementation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_mcts_implementation
044856a
ml_dev_bench_mla_implementation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_mla_implementation
044856a
ml_dev_bench_mla_implementation_hidden_tests
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_mla_implementation_hidden_tests
044856a
ml_dev_bench_nan_loss_debug
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_nan_loss_debug
044856a
ml_dev_bench_noisy_dataset_download
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_noisy_dataset_download
044856a
ml_dev_bench_noisy_label_annotation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_noisy_label_annotation
044856a
ml_dev_bench_normalization_bug
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_normalization_bug
044856a
ml_dev_bench_parse_logs
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_parse_logs
044856a
ml_dev_bench_ppo_implementation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_ppo_implementation
044856a
ml_dev_bench_pretrained_bert_base_uncased_load
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_pretrained_bert_base_uncased_load
044856a
ml_dev_bench_pretrained_model_load_from_torchvision
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_pretrained_model_load_from_torchvision
044856a
ml_dev_bench_shape_mismatch_output
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_shape_mismatch_output
044856a
ml_dev_bench_shape_mismatch_train
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_shape_mismatch_train
044856a
ml_dev_bench_small_dataset_overfit
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_small_dataset_overfit
044856a
ml_dev_bench_training_files_debug
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_training_files_debug
044856a
ml_dev_bench_var_implementation
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_var_implementation
044856a
ml_dev_bench_vit_debugging
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_vit_debugging
044856a
ml_dev_bench_wandb_logging
uvx harbor run -d ml-dev-bench@1.0 -t ml_dev_bench_wandb_logging
044856a
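
Every per-task command above has the same shape: the dataset command plus `-t <task name>`. To run a subset, such as only the debugging tasks, the commands can be generated rather than copied by hand. The following is a minimal sketch, not part of harbor itself. The helpers `build_command` and `run_tasks` are hypothetical names introduced for illustration, and the only flags used are the `-d` and `-t` flags shown in the listing. It defaults to a dry run that prints the commands, and only launches them when `dry_run=False` is passed.

```python
import shlex
import subprocess

# Dataset identifier exactly as it appears in the listing above.
DATASET = "ml-dev-bench@1.0"


def build_command(task=None, dataset=DATASET):
    """Return the argv list for one run; task=None targets the whole dataset."""
    argv = ["uvx", "harbor", "run", "-d", dataset]
    if task is not None:
        argv += ["-t", task]
    return argv


def run_tasks(tasks, dry_run=True):
    """Print one command per task and, unless dry_run, execute each in order."""
    commands = [build_command(task) for task in tasks]
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # Assumption: uvx is on PATH; check=True stops at the first failure.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # A hand-picked subset of task names copied from the list above.
    run_tasks([
        "ml_dev_bench_nan_loss_debug",
        "ml_dev_bench_normalization_bug",
        "ml_dev_bench_shape_mismatch_train",
    ])
```

Building each command as an argument list rather than a single string avoids shell quoting issues, and `shlex.join` is used only to print a copy-pasteable form.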