Creating more domain-specific versions of common machine learning benchmarks

Standard benchmarks (e.g., for image recognition or natural language inference) are usually built from general-domain data and rarely cover specialized domains such as science and technology.

Domain-specific versions of standard benchmarks could help improve state-of-the-art models for those domains and, more generally, help determine how well models generalize across different kinds of data.
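One simple way to derive a domain-specific version of an existing benchmark is to filter its items by domain-relevant vocabulary. The sketch below illustrates this with a hypothetical keyword list and example items; real efforts would likely use a trained domain classifier or human review instead of keyword matching.

```python
# Minimal sketch: deriving a domain-specific benchmark subset by keyword
# filtering. The keyword set and example items are hypothetical.

SCI_TECH_KEYWORDS = {"enzyme", "voltage", "algorithm", "protein", "quantum"}

def is_domain_specific(text, keywords=SCI_TECH_KEYWORDS):
    """Return True if the text mentions any domain keyword."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return bool(words & keywords)

def filter_benchmark(items):
    """Keep only benchmark items whose text falls in the target domain."""
    return [item for item in items if is_domain_specific(item["text"])]

items = [
    {"text": "The enzyme catalyzes the reaction.", "label": 1},
    {"text": "The cat sat on the mat.", "label": 0},
]
subset = filter_benchmark(items)  # keeps only the science-domain item
```

A keyword filter is only a first pass; in practice one would also re-validate labels on the filtered subset, since label quality can differ across domains.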


Conducting large-scale and/or high-quality human curation of data

Creating datasets or benchmarks often requires large quantities of data that have undergone high-quality human curation (e.g., labeling or annotating individual data items).

Strategies for incentivizing, funding, organizing, and quality-checking such human curation efforts could have high leverage: they directly increase the utility of the data, and they are also likely to improve AI model training and benchmarking.
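A standard quality check for human curation is to have multiple annotators label the same items and measure chance-corrected agreement. The sketch below implements Cohen's kappa for two annotators from scratch; the annotator labels are hypothetical example data.

```python
# Minimal sketch: quality-checking annotations with Cohen's kappa,
# a chance-corrected agreement statistic for two annotators.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from the label marginals."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k]
              for k in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two annotators on the same five items.
ann_a = ["pos", "pos", "neg", "neg", "pos"]
ann_b = ["pos", "neg", "neg", "neg", "pos"]
kappa = cohens_kappa(ann_a, ann_b)  # ≈ 0.615: moderate agreement
```

Low kappa on a pilot batch is a signal to clarify annotation guidelines or retrain annotators before scaling the effort up.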