We tested MAmmoTH2-8B-Plus on Alpaca Eval 2.0 and Arena Hard, but our results are far lower than reported. Specifically:

| | Alpaca Eval 2.0 | Arena Hard |
|---|---|---|
| Reported | 18.5 | 16.6 |
| Reproduced | 12.53 | 9.7 |
Note that we also tested other models, such as SimPO (Llama-3-Instruct-8B-SimPO-v0.2), on Alpaca Eval 2.0, and there the gap is much smaller (reproduced 51.54 vs. reported 53.7).
We use vLLM 0.4.0 as the inference backend and gpt-4-1106-preview as the judge.
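In case it helps to pinpoint the issue, here is a minimal sketch of how we generate outputs with vLLM before scoring. The HF repo id, sampling parameters, and example instruction below are assumptions for illustration, not necessarily our exact settings:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed HF repo id for MAmmoTH2-8B-Plus; substitute the actual checkpoint path.
MODEL = "TIGER-Lab/MAmmoTH2-8B-Plus"

# Format each instruction with the model's own chat template so the prompt
# matches what the model saw during fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
instruction = "What are the names of some famous actors that started their careers on Broadway?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=MODEL)
# Illustrative sampling parameters, not necessarily the authors' settings.
params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=4096)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

We then score the collected outputs with the `alpaca_eval` CLI (e.g. `alpaca_eval --model_outputs outputs.json`), which, as we understand it, uses gpt-4-1106-preview as the annotator by default for Alpaca Eval 2.0.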
Do you have any idea where the problem is? Thanks.