ChronoPlay RAG Leaderboard

A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

Liyang He¹, Yuren Zhang¹, Ziwei Zhu², Zhenghui Li³, Shiwei Tong¹,†
¹Tencent ²The Chinese University of Hong Kong ³Independent Researcher
†Corresponding author: shiweitong@tencent.com

ChronoPlay is a novel framework for the automated, continuous generation of Retrieval-Augmented Generation (RAG) benchmarks for games. This leaderboard evaluates RAG systems on three popular games: Dune: Awakening, Dying Light 2, and PUBG Mobile.

📊 Evaluation Methodology: The original paper evaluates models on separate temporal segments to capture the dual dynamics of game evolution. This leaderboard instead reports one holistic evaluation per game: each system is assessed on the complete dataset for that game, which gives a single measure of overall performance and makes different RAG approaches easy to compare at a glance.

📝 How to Submit Your Results?

  1. Fork this repository
  2. Create a new JSON file in the submissions/ directory (e.g., my_system.json)
  3. Fill in your results in the following format:
    {
      "system_name": "My RAG System v1.0",
      "description": "Dense retrieval with BM25 reranking and GPT-4 for generation",
      "games": {
        "dune": {
          "topk": 3,
          "recall": 0.85,
          "f1": 0.78,
          "ndcg": 0.82,
          "correctness": 0.88,
          "faithfulness": 0.91
        },
        "dying_light_2": {
          "topk": 3,
          "recall": 0.83,
          "f1": 0.76,
          "ndcg": 0.80,
          "correctness": 0.86,
          "faithfulness": 0.89
        },
        "pubg_mobile": {
          "topk": 3,
          "recall": 0.81,
          "f1": 0.74,
          "ndcg": 0.78,
          "correctness": 0.84,
          "faithfulness": 0.87
        }
      }
    }
  4. Submit a Pull Request
  5. After the PR is merged, the leaderboard will be automatically updated

Note: All metric scores (recall, f1, ndcg, correctness, faithfulness) must be decimals between 0 and 1. The R, G, and Total Score values are calculated automatically by the frontend. You must provide results for all three games.
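
Before opening a Pull Request, it can help to check a submission file locally against the rules above. The following is a minimal sketch, not the repository's own validation: the function names `validate_submission` and `validate_file` are hypothetical, and the checks that `topk` is a positive integer and that `system_name` and `description` are non-empty strings are assumptions inferred from the example rather than documented requirements.

```python
import json

# Game keys and metric names as they appear in the example submission.
REQUIRED_GAMES = ("dune", "dying_light_2", "pubg_mobile")
METRICS = ("recall", "f1", "ndcg", "correctness", "faithfulness")


def validate_submission(data):
    """Return a list of problems found; an empty list means every check passed."""
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    # Assumption: both descriptive fields should be non-empty strings.
    for key in ("system_name", "description"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    games = data.get("games")
    if not isinstance(games, dict):
        return problems + ["'games' must be an object"]
    for game in REQUIRED_GAMES:
        entry = games.get(game)
        if not isinstance(entry, dict):
            problems.append(f"missing results for game '{game}'")
            continue
        topk = entry.get("topk")
        # Assumption: topk is a positive integer (the example uses 3).
        if isinstance(topk, bool) or not isinstance(topk, int) or topk < 1:
            problems.append(f"{game}.topk must be a positive integer")
        for metric in METRICS:
            value = entry.get(metric)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not 0.0 <= value <= 1.0:
                problems.append(f"{game}.{metric} must be a number between 0 and 1")
    return problems


def validate_file(path):
    """Load a submission JSON file and run the checks above on it."""
    with open(path, encoding="utf-8") as handle:
        return validate_submission(json.load(handle))
```

Calling `validate_file("submissions/my_system.json")` returns a list of human-readable problems, so an empty list suggests the file matches the format described here. A score entered as a percentage (for example 85 instead of 0.85) or a missing game would each show up as one entry in that list.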

📚 Citation

If you use this leaderboard or the ChronoPlay benchmark in your research, please cite our paper:

@article{he2025chronoplay,
  title={ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks},
  author={He, Liyang and Zhang, Yuren and Zhu, Ziwei and Li, Zhenghui and Tong, Shiwei},
  journal={arXiv preprint arXiv:2510.18455},
  year={2025}
}

Resources: Paper | Code | Dataset