A Google study finds that the standard three to five human raters per test example often aren't enough for reliable AI ...