
Evaluating progress of LLMs on scientific problem-solving


Programmatic and model-based evaluations

Tasks in CURIE are varied and have ground-truth annotations in mixed and heterogeneous form, e.g., as JSONs, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take different forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to the programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
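To make the programmatic side concrete, here is a minimal sketch of two of the metrics named above. It assumes whitespace tokenization and a balanced F-measure for ROUGE-L, and axis-aligned boxes for intersection-over-union; the function names (`rouge_l_f1`, `box_iou`) are illustrative and are not taken from the CURIE codebase.

```python
from typing import Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as a balanced F-score over whitespace tokens (an assumption)."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Same grid, two notations: no tokens in common under this tokenization.
    print(rouge_l_f1("[p, q, r]", "p × q × r"))
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Under this tokenization the two grid-point notations share no tokens, so a purely lexical metric treats an equivalent answer as a complete miss. This is the kind of case that motivates the model-based metrics.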

(1) LMScore: Prompts an LLM, asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We consider the weighted average of the log-likelihood scores of the tokens to produce a final confidence.
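One possible reading of the weighted average is sketched below. It assumes the judge LLM exposes a log-likelihood for each of the three rating tokens, normalizes those into probabilities, and weights them with 1.0, 0.5 and 0.0. Both the weights and the `lm_score` function are assumptions made for illustration; the text above does not specify them.

```python
import math
from typing import Dict

# Assumed weights for the 3-point scale; not specified in the text above.
RATING_WEIGHTS: Dict[str, float] = {"good": 1.0, "okay": 0.5, "bad": 0.0}


def lm_score(label_logprobs: Dict[str, float]) -> float:
    """Combine rating-token log-likelihoods into one confidence in [0, 1].

    `label_logprobs` maps each rating token ("good", "okay", "bad") to the
    log-likelihood the judge model assigned to it (hypothetical input).
    """
    missing = set(RATING_WEIGHTS) - set(label_logprobs)
    if missing:
        raise ValueError(f"missing log-likelihoods for: {sorted(missing)}")

    # Softmax over the three ratings, shifted by the max for stability.
    top = max(label_logprobs[label] for label in RATING_WEIGHTS)
    unnorm = {
        label: math.exp(label_logprobs[label] - top) for label in RATING_WEIGHTS
    }
    total = sum(unnorm.values())

    # Probability-weighted average of the rating weights.
    return sum(RATING_WEIGHTS[label] * p / total for label, p in unnorm.items())


if __name__ == "__main__":
    # A judge that mostly prefers "good" yields a score close to 1.
    print(lm_score({"good": -0.1, "okay": -2.5, "bad": -6.0}))
```

A continuous score of this kind keeps the judge's uncertainty, which a hard choice of the single most likely rating would discard.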

(2) LLMSim: Used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties and values of materials from a research document, and provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall and F1 scores across all documents.
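The scoring step that follows the matching can be sketched as below. The LLM matching itself is not shown: the sketch assumes it has already produced, for each ground-truth record, the index of at most one matching predicted record (or `None`). The `DocMatch` container and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DocMatch:
    """Matching result for one document (hypothetical container)."""
    # One entry per ground-truth record: index of the matched predicted
    # record, or None if the LLM judge found no match.
    matches: List[Optional[int]]
    num_predicted: int


def doc_prf(doc: DocMatch) -> Tuple[float, float, float]:
    """Precision, recall and F1 for a single document."""
    matched_gt = sum(m is not None for m in doc.matches)
    # Count each predicted record once, even if matched repeatedly.
    matched_pred = len({m for m in doc.matches if m is not None})
    precision = matched_pred / doc.num_predicted if doc.num_predicted else 0.0
    recall = matched_gt / len(doc.matches) if doc.matches else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_prf(docs: List[DocMatch]) -> Tuple[float, float, float]:
    """Average per-document precision, recall and F1 across documents."""
    if not docs:
        return 0.0, 0.0, 0.0
    scores = [doc_prf(d) for d in docs]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


if __name__ == "__main__":
    docs = [
        DocMatch(matches=[0, 2, None], num_predicted=4),
        DocMatch(matches=[0], num_predicted=1),
    ]
    print(mean_prf(docs))
```

Because the output is an unordered list, scoring depends only on which records were matched and never on their position.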
