AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

1College of Civil Engineering, Tongji University
2Arcplus Group East China Architectural Design & Research Institute Co., Ltd.
3Shanghai Qi Zhi Institute
4College of Design and Innovation, Tongji University
(† denotes the corresponding author)
News
  • 2025.10.9: Our project homepage has launched!
AECBench

An overview of the AECBench framework and tasks.

  • 🌟 Hierarchical Cognitive Framework: A five-level framework enabling fine-grained evaluation of LLM capabilities in AEC tasks.
  • 🌟 High-Quality Benchmark Dataset: 4,800 expert-reviewed questions across 23 tasks reflecting real-world AEC scenarios.
  • 🌟 Automated Evaluation Pipeline: An "LLM-as-a-judge" approach for scoring long-form responses, with code and data released as open source.

High-Quality Benchmark Dataset


Building lifecycle knowledge dimensions: the multifaceted knowledge required across the building lifecycle, encompassing 11 key domains and a wide range of specialized topics radiating from these central areas.


Designed evaluation tasks of AECBench: the evaluation framework comprises 23 tasks spanning the cognitive levels, all derived from real-world AEC scenarios and curated by domain engineers.


Workflow of constructing the dataset: the dataset was built through three steps (data collection, data cleaning, and data review). The review stage uses a two-round mechanism: each item is first reviewed by a mid-level engineer and then confirmed by an expert.
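
To make the two-round review concrete, here is a minimal sketch of how a benchmark item and its review status could be tracked in Python. The field names, ReviewStatus values, and methods are illustrative assumptions, not the released data schema.

from dataclasses import dataclass
from enum import Enum

class ReviewStatus(Enum):
    PENDING = "pending"                       # awaiting first-round review
    ENGINEER_APPROVED = "engineer_approved"   # passed the mid-level engineer's review
    EXPERT_CONFIRMED = "expert_confirmed"     # confirmed by a domain expert
    REJECTED = "rejected"

@dataclass
class AECBenchItem:
    task: str                # one of the 23 evaluation tasks (hypothetical field)
    cognitive_level: str     # e.g., "Knowledge Memorization"
    question: str
    answer: str
    status: ReviewStatus = ReviewStatus.PENDING

    def engineer_review(self, approved: bool) -> None:
        # Round one: a mid-level engineer checks correctness and clarity.
        self.status = ReviewStatus.ENGINEER_APPROVED if approved else ReviewStatus.REJECTED

    def expert_confirm(self, confirmed: bool) -> None:
        # Round two: an expert confirms only items that passed round one.
        if self.status is ReviewStatus.ENGINEER_APPROVED:
            self.status = ReviewStatus.EXPERT_CONFIRMED if confirmed else ReviewStatus.REJECTED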

Evaluation


Automated evaluation pipeline for open-ended questions
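
The pipeline can be read as a simple scoring loop: a judge model receives the question, a reference answer, and the candidate's response, and returns a structured score. Below is a minimal sketch of that idea, assuming a generic chat-completion backend; call_llm is a hypothetical stand-in, and the 0-10 rubric is illustrative rather than the exact prompt used in the pipeline.

import json
import re

JUDGE_PROMPT = """You are a senior AEC engineer grading an answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 0 to 10 for correctness and completeness.
Reply as JSON: {{"score": <integer>, "rationale": "<one sentence>"}}"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: route this to your chat-completion API of choice.
    raise NotImplementedError

def judge_open_ended(question: str, reference: str, candidate: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    return json.loads(match.group(0)) if match else {"score": 0, "rationale": "unparseable reply"}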


One-shot example for multiple-choice questions
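
In the one-shot setup, a single worked demonstration precedes the test question so the model learns the expected answer format. A minimal sketch of that prompt assembly follows; the demonstration item is a made-up placeholder, not an actual AECBench question.

ONE_SHOT_DEMO = (
    "Question: Which material property governs a beam's deflection under a given load?\n"
    "A. Density  B. Elastic modulus  C. Hardness  D. Thermal conductivity\n"
    "Answer: B\n\n"
)

def build_mcq_prompt(question: str, options: list[str]) -> str:
    # Label options A, B, C, ... and pose the test question in the same
    # format as the demonstration, ending at "Answer:" for the model to fill.
    labeled = "  ".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return ONE_SHOT_DEMO + f"Question: {question}\n{labeled}\nAnswer:"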

Final performance


Result: Performance of models on Knowledge Memorization / Understanding / Reasoning / Calculation tasks


Result: Performance of models on Knowledge Application tasks


Discussion: Performance decline on questions involving tables in design codes


Discussion: Scatter plots with LOWESS curves of model performance across evaluation tasks: (a) original performance of LLMs; (b) performance of DeepSeek-V3 with calibration methods
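
For readers who want to reproduce this style of plot, here is a minimal sketch of a scatter plot with a LOWESS curve using statsmodels and matplotlib; the synthetic scores array is a placeholder for per-task model performance, not our actual results.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.arange(23).astype(float)                    # 23 evaluation tasks
scores = 0.8 - 0.01 * x + rng.normal(0, 0.05, 23)  # placeholder per-task scores

smoothed = lowess(scores, x, frac=0.5)  # returns (x, y) pairs sorted by x
plt.scatter(x, scores, label="per-task score")
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red", label="LOWESS curve")
plt.xlabel("task index")
plt.ylabel("score")
plt.legend()
plt.show()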

A glimpse of AECBench

(A gallery of sample questions from the AECBench dataset.)

BibTeX

 @misc{liang2025aecbench,
      title={AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field},
      author={Chen Liang and Zhaoqi Huang and Haofen Wang and Fu Chai and Chunying Yu and Huanhuan Wei and Zhengjie Liu and Yanpeng Li and Hongjun Wang and Ruifeng Luo and Xianzhong Zhao},
      year={2025},
      eprint={2509.18776},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}