Experiments
We wanted to know which models and tasks would benefit most from our curation process. As baselines for our experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment) using crowdsourced labels. Each crowdsourced data set has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.
We compared each of these four baseline conditions against the corresponding curated condition, in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning, as described above. All models plateaued before reaching parity with the experts’ internal alignment, so we stopped at 6 iterations (~400 fine-tuning and ~250 evaluation samples) for the lower complexity task and 5 iterations (~250 fine-tuning and ~150 evaluation samples) for the higher complexity task. (Note that the lower complexity task had a larger variety of examples, which may account for the longer time needed to converge.) Both data sets had a final class balance of ~40% positive examples.
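To make the loop concrete, here is a minimal sketch of this kind of iterative curation-and-fine-tuning procedure. The callables passed in (`curate_fn`, `finetune_fn`, `eval_fn`) are hypothetical placeholders standing in for the actual pipeline, not code from these experiments.

```python
# Minimal sketch of an iterative curation loop, under assumed placeholder callables.
from typing import Callable, List, Tuple


def run_curation(
    model,
    curate_fn: Callable[..., Tuple[List, List]],  # hypothetical: returns (fine-tuning batch, eval batch)
    finetune_fn: Callable,                         # hypothetical: fine-tunes the model on curated examples
    eval_fn: Callable[..., float],                 # hypothetical: alignment (e.g., Cohen's Kappa) vs. expert labels
    kappa_ceiling: float,                          # experts' internal alignment, e.g., 0.81
    max_iters: int = 10,
):
    finetune_set, eval_set = [], []
    prev_kappa = 0.0
    for _ in range(max_iters):
        # Select the next curated batch of examples for fine-tuning and evaluation.
        new_ft, new_ev = curate_fn(model)
        finetune_set += new_ft
        eval_set += new_ev

        # Fine-tune on all curated examples so far, then measure alignment with experts.
        model = finetune_fn(model, finetune_set)
        kappa = eval_fn(model, eval_set)

        # Stop once the model plateaus below the experts' internal agreement.
        if kappa <= prev_kappa and kappa < kappa_ceiling:
            break
        prev_kappa = kappa
    return model, finetune_set, eval_set
```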
The table below provides an overview of the size and quality of the data used in each condition. Experts reached an average pairwise Cohen’s Kappa of .81 (on the lower complexity task) and .78 (on the higher complexity task) through the curation process. We consider these the ceiling for model performance. To assess the quality of our crowdsourced data, we calculated Kappa alignment between crowdsourced annotations and experts based on our full curated set, which was .59 (lower complexity) and .41 (higher complexity).
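For readers who want to reproduce this style of agreement measure, the snippet below computes pairwise Cohen’s Kappa with scikit-learn. The annotator labels are illustrative made-up values, not data from these experiments.

```python
# Pairwise Cohen's Kappa between annotators, computed with scikit-learn.
# Labels here are illustrative only, not the data used in these experiments.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotators = {
    "expert_1":     [0, 0, 1, 0, 1, 0, 0, 1],
    "expert_2":     [0, 0, 1, 0, 1, 0, 1, 1],
    "crowdsourced": [0, 0, 0, 0, 1, 0, 0, 0],
}

# Report Kappa for every annotator pair; averaging the expert pairs gives the
# kind of "expert ceiling" figure quoted above.
for a, b in combinations(annotators, 2):
    kappa = cohen_kappa_score(annotators[a], annotators[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```

Cohen’s Kappa corrects for chance agreement, which is why it is a more informative alignment measure than raw accuracy given the strong class imbalance (~95% benign) in the crowdsourced data.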
