Hello @OmkarThawakar, I used the LLM360 Analysis repo to run the eval for the siqa task:
python Analysis360/eval/harness/main.py --device cuda:0 --model=hf-causal-experimental --batch_size=auto:1 --model_args="pretrained=MBZUAI/MobiLlama-05B,trust_remote_code=True,dtype=bfloat16" --tasks=social_iqa --num_fewshot=0 --output_path=Analysis360-MobiLlama-05B.json
It only gives an accuracy of 0.3327, which is essentially random chance (about 0.333), since there are only three choices.
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| social_iqa | 0 | none | 0 | acc | 0.3327 | ± | 0.0107 |
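As a quick sanity check on the "close to random" claim, here is a minimal sketch that compares the reported accuracy with chance level for a three-way multiple-choice task. The numbers are copied from the result above; the variable names are just for illustration, and the comparison against one standard error is a rough rule of thumb rather than a formal test.

```python
# Compare the reported accuracy against chance for a 3-way multiple-choice task.
acc = 0.3327       # reported social_iqa accuracy
stderr = 0.0107    # reported standard error
num_choices = 3    # three answer options per question, as noted above

chance = 1.0 / num_choices        # expected accuracy of random guessing
z = (acc - chance) / stderr       # distance from chance in units of stderr
within_one_stderr = abs(acc - chance) <= stderr

print(f"chance={chance:.4f}, z={z:.2f}, within one stderr: {within_one_stderr}")
```

The reported score sits well inside one standard error of chance, so the model is effectively guessing under this setup.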
Could you share how you ran the siqa evaluation? Thanks