diff --git a/federatedscope/llm/eval/eval_for_rougel/README.md b/federatedscope/llm/eval/eval_for_rougel/README.md
index 305d728b0..d3da1a99c 100644
--- a/federatedscope/llm/eval/eval_for_rougel/README.md
+++ b/federatedscope/llm/eval/eval_for_rougel/README.md
@@ -1,15 +1,19 @@
 # Rouge-L
 
+## Dolly-15K
 To assess the performance of our fine-tuned model, we leverage the Rouge-L 
-metric and conduct experiments with a large number of clients, utilizing the 
-Dolly-15K dataset as our training corpus. The Dolly-15K dataset encompasses 
-a total of 15,015 data points, distributed across eight distinct tasks. For 
-a more comprehensive evaluation, we allocate the final task exclusively for 
-evaluation purposes, while dedicating the remaining ones to the training 
-phase. Our experimental setup involves a network of 200 clients, utilizing a Dirichlet distribution for data partitioning to emulate non-IID conditions across the client base.
+metric and conduct experiments with a large number of clients, utilizing the Dolly-15K dataset as our training corpus. 
+The Dolly-15K dataset encompasses a total of 15,015 data points, distributed across eight distinct tasks. For a more comprehensive evaluation, we allocate the final task exclusively for evaluation purposes, while dedicating the remaining ones to the training phase. Our experimental setup involves a network of 200 clients, utilizing a Dirichlet distribution for data partitioning to emulate non-IID conditions across the client base.
 
 To do the evaluation, run
 ```bash
-python federatescope/eval/eval_for_rougel/eval.py --cfg 
-federatescope/llm/baselime/xxx.yaml
+python federatescope/eval/eval_for_rougel/eval_dolly.py --cfg federatescope/llm/baselime/xxx.yaml
+```
+
+## Natural Instructions
+We also leverage the Rouge-L metric and conduct experiments with a large number of clients, utilizing the Natural Instructions (NI) dataset as our training corpus.  In the NI dataset, we allocate each of the 738 training tasks exclusively to a distinct client for model training, thereby cultivating a non-IID setting characterized by feature distribution skew. Meanwhile, evaluation is performed on separate test tasks.
+
+To do the evaluation, run
+```bash
+python federatescope/eval/eval_for_rougel/eval_ni.py --cfg federatescope/llm/baselime/xxx.yaml
 ```
\ No newline at end of file
diff --git a/federatedscope/llm/eval/eval_for_rougel/eval.py b/federatedscope/llm/eval/eval_for_rougel/eval_dolly.py
similarity index 100%
rename from federatedscope/llm/eval/eval_for_rougel/eval.py
rename to federatedscope/llm/eval/eval_for_rougel/eval_dolly.py