While AI models can quickly master benchmarks and surpass human baselines, they often fall short in the real world. The solution, most researchers argue, is not to abandon these benchmarks—but to make them better.
By strategically adding stickers to a stop sign, for example, researchers in 2018 fooled standard image recognition systems into seeing a speed limit sign instead. And a 2018 project called Gender Shades found that the accuracy of gender identification for commercial face-recognition systems dropped from 90% to 65% for dark-skinned women’s faces.
Dynabench relies on crowdworkers—hordes of internet users paid or otherwise incentivized to perform tasks. Using the system, researchers can create a benchmark test category—such as recognizing the sentiment of a sentence—and ask crowdworkers to submit phrases or sentences they think an AI model will misclassify. Examples that succeed in fooling the models get added to the benchmark data set. Models train on the data set, and the process repeats.
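The human-and-model-in-the-loop round described above can be sketched in a few lines. This is an illustrative toy, not Dynabench's actual implementation: the keyword-based "model" and the sample submissions are hypothetical stand-ins.

```python
# Sketch of a Dynabench-style collection round: crowdworker submissions
# that fool the current model are kept for the next benchmark round.

def model_predict(sentence):
    # Toy stand-in for a sentiment classifier: naive keyword matching,
    # exactly the kind of shortcut adversarial examples are meant to expose.
    return "positive" if "great" in sentence.lower() else "negative"

def collection_round(submissions, benchmark):
    """Keep only the (sentence, label) pairs the current model misclassifies."""
    for sentence, true_label in submissions:
        if model_predict(sentence) != true_label:
            benchmark.append((sentence, true_label))  # model was fooled: keep it
    return benchmark

benchmark = []
submissions = [
    ("This movie was great", "positive"),         # model gets it right: discarded
    ("Not great, to put it mildly", "negative"),  # fools the keyword model: kept
]
benchmark = collection_round(submissions, benchmark)
# In the real system, the model would now retrain on the hardened
# benchmark and the next round of crowdworker submissions would begin.
```

The point of the loop is that the benchmark co-evolves with the models: each round's surviving examples are, by construction, ones the current generation cannot handle.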
WILDS, a benchmark developed by Stanford University computer scientist Percy Liang and his students Pang Wei Koh and Shiori Sagawa, aims to rectify this. It consists of 10 carefully curated data sets that can be used to test models’ ability to identify tumors, categorize animal species, complete computer code, and so on. Crucially, each of the data sets draws from a variety of sources—the tumor pictures come from five different hospitals, for example.
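Why does drawing from several sources matter? Because a single averaged accuracy can hide a collapse on one source. The sketch below (illustrative only, not the WILDS library's API; the hospital names and predictions are made up) shows the kind of per-domain scoring that makes such a collapse visible.

```python
# Illustrative per-domain evaluation: score a model separately on each
# source (e.g., each hospital the tumor images came from), so the
# worst-performing domain is not averaged away.
from collections import defaultdict

def per_domain_accuracy(predictions, labels, domains):
    """Accuracy broken down by the domain each example came from."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, dom in zip(predictions, labels, domains):
        total[dom] += 1
        correct[dom] += int(pred == label)
    return {dom: correct[dom] / total[dom] for dom in total}

preds  = ["tumor", "normal", "tumor", "normal"]
labels = ["tumor", "tumor",  "tumor", "normal"]
doms   = ["hospital_A", "hospital_A", "hospital_B", "hospital_B"]

accs = per_domain_accuracy(preds, labels, doms)
worst = min(accs.values())
# Pooled accuracy is 0.75, but hospital_A is only at 0.5 -- the
# distribution shift a single aggregate score would have hidden.
```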
Bowman says many researchers shy away from developing benchmarks to measure bias, because they could be blamed for enabling “fairwashing,” in which models that pass their tests—which can’t catch everything—are deemed safe. “We were sort of scared to work on this,” he says. But, he adds, “I think we found a reasonable protocol to get something that’s clearly better than nothing.” Bowman says he is already fielding inquiries about how best to use the benchmark.
Bowman has a different approach to closing off shortcuts. For his latest benchmark, posted online in December 2021 and called QuALITY, he hired crowdworkers to generate questions about text passages from short stories and nonfiction articles. He hired another group to answer the questions after reading the passages at their own pace, and a third group to answer them hurriedly under a strict time limit.
A more radical rethinking of scores acknowledges that often there’s no “ground truth” against which to say a model is right or wrong. People disagree on what’s funny or whether a building is tall. Some benchmark designers just toss out ambiguous or controversial examples from their test data, calling it noise.
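One alternative to tossing out contested examples is to keep every annotator's vote as a soft label distribution and score the model against that distribution. The sketch below is one hypothetical way to do this (the votes, labels, and distance metric are illustrative, not drawn from any particular benchmark).

```python
# Sketch: score a model against the distribution of annotator votes
# rather than a single "ground truth" label.
from collections import Counter

def soft_label(annotations):
    """Turn raw annotator votes into a probability distribution."""
    counts = Counter(annotations)
    n = len(annotations)
    return {label: c / n for label, c in counts.items()}

def total_variation(p, q):
    """Distance between two label distributions (0 = identical)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0) - q.get(l, 0)) for l in labels)

# Five annotators split 3-2 on whether a caption is funny.
votes = ["funny", "funny", "funny", "not_funny", "not_funny"]
target = soft_label(votes)                    # {"funny": 0.6, "not_funny": 0.4}

model_pred = {"funny": 0.7, "not_funny": 0.3}
score = total_variation(model_pred, target)   # small distance: the model's
                                              # uncertainty mirrors human disagreement
```

Under this kind of scoring, a model that confidently picks one side of a genuinely split example is penalized, while one that reproduces the human split is rewarded.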