AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

1College of Civil Engineering, Tongji University
2Arcplus Group East China Architectural Design & Research Institute Co., Ltd.
3Shanghai Qi Zhi Institute
4College of Design and Innovation, Tongji University
(† denotes the corresponding author)
News
  • 2025.10.9: Our project homepage has launched!
AECBench

An overview of the AECBench framework and tasks.

  • 🌟 Hierarchical Cognitive Framework: A five-level framework enabling fine-grained evaluation of LLM capabilities in AEC tasks.
  • 🌟 High-Quality Benchmark Dataset: 4,800 expert-reviewed questions across 23 tasks reflecting real-world AEC scenarios.
  • 🌟 Automated Evaluation Pipeline: An "LLM-as-a-judge" approach for scoring long-form responses, with code and data released as open source.

High-Quality Benchmark Dataset


Building lifecycle knowledge dimensions: the multifaceted knowledge required across the building lifecycle, encompassing 11 key domains and a wide range of specialized topics radiating from these central areas.


Designed evaluation tasks of AECBench: the evaluation framework comprises 23 tasks spanning the cognitive levels, all derived from real-world AEC scenarios and curated by domain engineers.


Workflow of constructing the dataset: the dataset was built through three steps (data collection, data cleaning, and data review). The review stage uses a two-round mechanism: each item is first reviewed by a mid-level engineer and then confirmed by an expert.
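
To make the two-round review concrete, here is a minimal sketch of how a benchmark item and its review status could be tracked in Python. The field names, ReviewStatus values, and methods are illustrative assumptions, not the released data schema.

from dataclasses import dataclass
from enum import Enum

class ReviewStatus(Enum):
    PENDING = "pending"                       # awaiting first-round review
    ENGINEER_APPROVED = "engineer_approved"   # passed the mid-level engineer's review
    EXPERT_CONFIRMED = "expert_confirmed"     # confirmed by a domain expert
    REJECTED = "rejected"

@dataclass
class AECBenchItem:
    task: str                # one of the 23 evaluation tasks (hypothetical field)
    cognitive_level: str     # e.g., "Knowledge Memorization"
    question: str
    answer: str
    status: ReviewStatus = ReviewStatus.PENDING

    def engineer_review(self, approved: bool) -> None:
        # Round one: a mid-level engineer checks correctness and clarity.
        self.status = ReviewStatus.ENGINEER_APPROVED if approved else ReviewStatus.REJECTED

    def expert_confirm(self, confirmed: bool) -> None:
        # Round two: an expert confirms only items that passed round one.
        if self.status is ReviewStatus.ENGINEER_APPROVED:
            self.status = ReviewStatus.EXPERT_CONFIRMED if confirmed else ReviewStatus.REJECTED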

Evaluation


Automated evaluation pipeline for open-ended questions
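
The pipeline can be read as a simple scoring loop: a judge model receives the question, a reference answer, and the candidate's response, and returns a structured score. Below is a minimal sketch of that idea, assuming a generic chat-completion backend; call_llm is a hypothetical stand-in, and the 0-10 rubric is illustrative rather than the exact prompt used in the pipeline.

import json
import re

JUDGE_PROMPT = """You are a senior AEC engineer grading an answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 0 to 10 for correctness and completeness.
Reply as JSON: {{"score": <integer>, "rationale": "<one sentence>"}}"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: route this to your chat-completion API of choice.
    raise NotImplementedError

def judge_open_ended(question: str, reference: str, candidate: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    return json.loads(match.group(0)) if match else {"score": 0, "rationale": "unparseable reply"}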


One-shot example for multiple-choice questions
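
In the one-shot setup, a single worked demonstration precedes the test question so the model learns the expected answer format. A minimal sketch of that prompt assembly follows; the demonstration item is a made-up placeholder, not an actual AECBench question.

ONE_SHOT_DEMO = (
    "Question: Which material property governs a beam's deflection under a given load?\n"
    "A. Density  B. Elastic modulus  C. Hardness  D. Thermal conductivity\n"
    "Answer: B\n\n"
)

def build_mcq_prompt(question: str, options: list[str]) -> str:
    # Label options A, B, C, ... and pose the test question in the same
    # format as the demonstration, ending at "Answer:" for the model to fill.
    labeled = "  ".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return ONE_SHOT_DEMO + f"Question: {question}\n{labeled}\nAnswer:"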

Final performance


Result: Performance of models on Knowledge Memorization / Understanding / Reasoning / Calculation tasks


Result: Performance of models on Knowledge Application tasks


Discussion: Performance decline on questions involving tables in design codes


Discussion: Scatter plots with LOWESS curves of model performance across evaluation tasks: (a) original performance of LLMs; (b) performance of DeepSeek-V3 with calibration methods
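
For readers who want to reproduce this style of plot, here is a minimal sketch of a scatter plot with a LOWESS curve using statsmodels and matplotlib; the synthetic scores array is a placeholder for per-task model performance, not our actual results.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.arange(23).astype(float)                    # 23 evaluation tasks
scores = 0.8 - 0.01 * x + rng.normal(0, 0.05, 23)  # placeholder per-task scores

smoothed = lowess(scores, x, frac=0.5)  # returns (x, y) pairs sorted by x
plt.scatter(x, scores, label="per-task score")
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red", label="LOWESS curve")
plt.xlabel("task index")
plt.ylabel("score")
plt.legend()
plt.show()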

A glimpse of AECBench

(A gallery of sample questions from the AECBench dataset.)

BibTeX

 @misc{liang2025aecbench,
      title={AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field},
      author={Chen Liang and Zhaoqi Huang and Haofen Wang and Fu Chai and Chunying Yu and Huanhuan Wei and Zhengjie Liu and Yanpeng Li and Hongjun Wang and Ruifeng Luo and Xianzhong Zhao},
      year={2025},
      eprint={2509.18776},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}