ChronoPlay RAG Leaderboard

A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

Liyang He¹, Yuren Zhang¹, Ziwei Zhu², Zhenghui Li³, Shiwei Tong¹,†
¹Tencent  ²The Chinese University of Hong Kong  ³Independent Researcher
†Corresponding author: shiweitong@tencent.com

ChronoPlay is a framework for the automated, continuous generation of Retrieval-Augmented Generation (RAG) benchmarks for games. This leaderboard evaluates RAG systems on three popular games: Dune: Awakening, Dying Light 2, and PUBG Mobile.

Evaluation Methodology: Building on ChronoPlay's dual-dynamic temporal analysis, this leaderboard reports one aggregate evaluation per game. Each system is scored on the complete dataset for that game, so a single row summarizes its overall RAG performance across all temporal segments.

About ChronoPlay

Understanding the dual dynamics that shape game knowledge and player behavior over time.

01 — Core Concept

Dual Dynamics in Gaming

Games evolve through two concurrent dynamics: Knowledge Evolution, where game content continuously updates through patches, expansions, and balance changes; and User Interest Drift, where player attention shifts across topics over time. ChronoPlay captures both dynamics to create temporally-aware benchmarks.

Figure 1. Knowledge Evolution (top) tracks game content changes across versions; User Interest Drift (bottom) captures shifting player attention patterns.

02 — Data Pipeline

Dual-Source Data Synthesis Pipeline

ChronoPlay synthesizes benchmark data from two complementary sources: authoritative game knowledge bases (wikis, patch notes) and player community discussions (forums, social media). This dual-source approach ensures both factual accuracy and real-world relevance.

Figure 2. The data synthesis pipeline combines authority knowledge with community-driven user personas to generate authentic, high-quality QA pairs.

03 — Update Mechanism

Dual-Dynamic Update Mechanism

The benchmark is kept current by two mechanisms: knowledge updates based on named entity recognition (NER), which track changes in game content, and interest drift detection, which monitors shifts in player attention. Together they keep the benchmark aligned with the changing game landscape.

Figure 3. The update mechanism monitors official sources for knowledge changes and community forums for interest drift, triggering benchmark regeneration when significant shifts are detected.
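
This page does not say how a "significant shift" in player interest is measured. As a rough illustration only, the sketch below compares the topic mix of two time windows using Jensen-Shannon divergence and flags drift when it exceeds a threshold. The divergence measure, the 0.1 threshold, the topic labels, and all function names are assumptions made for this example. They are not ChronoPlay's actual detection method.

```python
import math
from collections import Counter


def topic_distribution(topics, vocabulary):
    """Turn a list of topic labels into relative frequencies over a fixed vocabulary."""
    counts = Counter(topics)
    total = sum(counts[t] for t in vocabulary)
    if total == 0:
        raise ValueError("no topic labels from the vocabulary in this window")
    return [counts[t] / total for t in vocabulary]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def interest_drifted(previous_topics, current_topics, threshold=0.1):
    """Hypothetical trigger: True when the topic mix moved more than `threshold`."""
    vocabulary = sorted(set(previous_topics) | set(current_topics))
    p = topic_distribution(previous_topics, vocabulary)
    q = topic_distribution(current_topics, vocabulary)
    return js_divergence(p, q) > threshold
```

Each input list would hold one topic label per community question in a window, for example the questions gathered before and after a patch. A True result would correspond to the regeneration trigger described in Figure 3.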

Leaderboard

Comparing RAG system performance across all three game domains.

How to Submit Your Results

  1. Fork this repository
  2. Create a new JSON file in the submissions/ directory (e.g., my_system.json)
  3. Fill in your results in the following format:
    {
      "system_name": "My RAG System v1.0",
      "description": "Dense retrieval with BM25 reranking and GPT-4 for generation",
      "games": {
        "dune": {
          "topk": 3,
          "recall": 0.85,
          "f1": 0.78,
          "ndcg": 0.82,
          "correctness": 0.88,
          "faithfulness": 0.91
        },
        "dying_light_2": {
          "topk": 3,
          "recall": 0.83,
          "f1": 0.76,
          "ndcg": 0.80,
          "correctness": 0.86,
          "faithfulness": 0.89
        },
        "pubg_mobile": {
          "topk": 3,
          "recall": 0.81,
          "f1": 0.74,
          "ndcg": 0.78,
          "correctness": 0.84,
          "faithfulness": 0.87
        }
      }
    }
  4. Submit a Pull Request
  5. After the PR is merged, the leaderboard will be automatically updated

Note: All metric scores (recall, f1, ndcg, correctness, faithfulness) must be decimals between 0 and 1. The R, G, and Total Score values are calculated automatically by the frontend and are not part of the submission file. You must provide results for all three games.
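
Before opening a pull request, it may help to check a submission file against the rules above. The following is a minimal sketch of such a check and is not part of the repository. The function names are hypothetical, the field names are taken from the example format, and the requirement that topk be a positive integer is an assumption inferred from that example.

```python
import json

REQUIRED_GAMES = ("dune", "dying_light_2", "pubg_mobile")
SCORE_FIELDS = ("recall", "f1", "ndcg", "correctness", "faithfulness")


def validate_submission(data):
    """Return a list of problems; an empty list means the submission looks valid."""
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]
    problems = []
    for key in ("system_name", "description"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    games = data.get("games")
    if not isinstance(games, dict):
        return problems + ["'games' must be an object"]
    for game in REQUIRED_GAMES:
        entry = games.get(game)
        if not isinstance(entry, dict):
            problems.append(f"missing results for game '{game}'")
            continue
        topk = entry.get("topk")
        # Assumption: topk is a positive integer, as in the example file.
        if isinstance(topk, bool) or not isinstance(topk, int) or topk < 1:
            problems.append(f"{game}.topk must be a positive integer")
        for field in SCORE_FIELDS:
            value = entry.get(field)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not 0.0 <= value <= 1.0:
                problems.append(f"{game}.{field} must be a number between 0 and 1")
    return problems


def validate_file(path):
    """Load a submission JSON file from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        return validate_submission(json.load(handle))
```

Calling validate_file("submissions/my_system.json") returns an empty list for a file that follows the format, or one message per problem otherwise, such as a missing game or a score entered as 85 instead of 0.85.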

Citation

If you use this leaderboard or the ChronoPlay benchmark in your research, please cite our paper:

@article{he2025chronoplay,
  title={ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks},
  author={He, Liyang and Zhang, Yuren and Zhu, Ziwei and Li, Zhenghui and Tong, Shiwei},
  journal={arXiv preprint arXiv:2510.18455},
  year={2025}
}

Resources: Paper | Code | Dataset