Experiments
We performed experiments on four datasets, where three datasets correspond to downstream generative tasks and one to a classification task. Generative tasks are typically harder than classification tasks. This is because the generative tasks are evaluated by next-token prediction accuracy, which requires the synthetic data to preserve fine-grained textual information from the private data. In contrast, the classification tasks only require maintaining the co-occurrence patterns between labels and words in the private data.
The three generative tasks are chosen to cover a diverse set of practical scenarios: PubMed (medical paper abstracts), Chatbot Arena (human-to-machine interactions), and Multi-Session Chat (human-to-human daily dialogues). To evaluate the quality of the generated synthetic data, we follow the setup of Aug-PE: we train a small downstream language model on the synthetic data and then compute the next-token prediction accuracy on the real test data, as sketched below.
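The following is a minimal sketch of this downstream evaluation, assuming the Hugging Face transformers library; the distilgpt2 checkpoint, hyperparameters, and the helper name next_token_accuracy are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch: after fine-tuning a small LM on the synthetic corpus
# (standard causal-LM loss), measure next-token prediction accuracy on
# real test texts. Model choice and max_length are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def next_token_accuracy(model, tokenizer, texts, max_length=512):
    """Fraction of positions where the model's top-1 prediction
    matches the actual next token in the real test texts."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).input_ids
            if ids.size(1) < 2:
                continue
            logits = model(ids).logits
            preds = logits[:, :-1].argmax(dim=-1)  # prediction for each position
            targets = ids[:, 1:]                   # the actual next tokens
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / max(total, 1)

# Usage (hypothetical variable): acc = next_token_accuracy(model, tokenizer, real_test_texts)
```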
The classification task is performed on the OpenReview (academic paper reviews) dataset. To evaluate the quality of the generated synthetic data, we train a downstream classifier on the synthetic data and compute the classification accuracy on the real test data (see the sketch below).
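A minimal sketch of this protocol, assuming scikit-learn; the TF-IDF plus logistic-regression pipeline and the helper name downstream_classification_accuracy are illustrative stand-ins, not necessarily the actual downstream classifier.

```python
# Minimal sketch: fit a classifier on labeled synthetic reviews, then
# score accuracy on the real test set. Pipeline choice is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def downstream_classification_accuracy(syn_texts, syn_labels,
                                       real_texts, real_labels):
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(syn_texts, syn_labels)   # train on synthetic data only
    preds = clf.predict(real_texts)  # evaluate on real test data
    return accuracy_score(real_labels, preds)
```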
To mitigate concerns regarding data contamination, we carefully analyzed the chosen datasets. Our analysis showed no overlap between our pre-training data and the downstream datasets.
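For concreteness, the sketch below shows one standard way such an overlap check can be performed, via long n-gram matching; the 13-gram window and function names are assumptions for illustration, not necessarily the exact analysis procedure we used.

```python
# Illustrative n-gram overlap check for data contamination: flag test
# documents that share any long n-gram with the pre-training corpus.
# The 13-gram window is a common convention, assumed here for illustration.
def ngrams(text, n=13):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated_fraction(pretrain_texts, test_texts, n=13):
    pretrain_grams = set()
    for t in pretrain_texts:
        pretrain_grams |= ngrams(t, n)
    hits = sum(1 for t in test_texts if ngrams(t, n) & pretrain_grams)
    return hits / max(len(test_texts), 1)
```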