We tested MAmmoTH2-8B-Plus on Alpaca Eval 2.0 and Arena Hard, but our results are far lower than reported. Specifically:

| | Alpaca Eval 2.0 | Arena Hard |
|---|---|---|
| Reported | 18.5 | 16.6 |
| Reproduced | 12.53 | 9.7 |
Note that we also tested other models, such as SimPO (Llama-3-Instruct-8B-SimPO-v0.2), on Alpaca Eval 2.0, and there the gap is much smaller (reproduced 51.54 vs. reported 53.7).
We use vLLM 0.4.0 as the inference backend and gpt-4-1106-preview as the judge.
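In case it helps to pinpoint the issue, here is a minimal sketch of how we generate outputs with vLLM before scoring. The HF repo id, sampling parameters, and example instruction below are assumptions for illustration, not necessarily our exact settings:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumed HF repo id for MAmmoTH2-8B-Plus; substitute the actual checkpoint path.
MODEL = "TIGER-Lab/MAmmoTH2-8B-Plus"

# Format each instruction with the model's own chat template so the prompt
# matches what the model saw during fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
instruction = "What are the names of some famous actors that started their careers on Broadway?"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=MODEL)
# Illustrative sampling parameters, not necessarily the authors' settings.
params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=4096)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

We then score the collected outputs with the `alpaca_eval` CLI (e.g. `alpaca_eval --model_outputs outputs.json`), which, as we understand it, uses gpt-4-1106-preview as the annotator by default for Alpaca Eval 2.0.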
Do you have any idea where the problem is? Thanks.