Existing code summarization approaches primarily leverage Abstract Syntax Trees (ASTs) and sequential information from source code to generate code summaries, while often overlooking the dependencies among code elements and the hierarchical structure of code. However, effective summarization requires a holistic analysis of code snippets from three distinct aspects: lexical, syntactic, and semantic information. In this paper, we propose a novel code summarization approach utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs capture essential code features at the lexical, syntactic, and semantic levels within a hierarchical structure. HierarchyNet processes each layer of the HCR separately, employing a Heterogeneous Graph Transformer, a Tree-based CNN, and a Transformer Encoder. In addition, our approach demonstrates superior performance compared to fine-tuned pre-trained models, including CodeT5 and CodeBERT, as well as large language models in zero-/few-shot settings, such as StarCoder and CodeGen.
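The repository contains the full implementation. As a rough orientation only, the following is a minimal sketch of the layer-per-encoder idea in plain PyTorch; the stand-in modules (a 1-D convolution for the Tree-based CNN, a linear projection for the Heterogeneous Graph Transformer), the dimensions, and the concatenation fusion are illustrative assumptions, not the actual HierarchyNet code.

```python
# Illustrative sketch only -- not the actual HierarchyNet implementation.
import torch
import torch.nn as nn

class HierarchySketch(nn.Module):
    """One encoder per HCR layer: lexical (tokens), syntactic (AST), semantic (graph)."""
    def __init__(self, d_model=256, vocab_size=50000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Lexical layer: Transformer encoder over the code token sequence.
        self.lexical_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # Syntactic layer: 1-D convolution standing in for the Tree-based CNN over AST node features.
        self.tree_cnn = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # Semantic layer: linear projection standing in for the Heterogeneous Graph Transformer.
        self.graph_enc = nn.Linear(d_model, d_model)

    def forward(self, token_ids, ast_node_feats, graph_node_feats):
        lex = self.lexical_enc(self.tok_emb(token_ids))                      # (B, T, d)
        syn = self.tree_cnn(ast_node_feats.transpose(1, 2)).transpose(1, 2)  # (B, N_ast, d)
        sem = torch.relu(self.graph_enc(graph_node_feats))                   # (B, N_graph, d)
        # Naive fusion: concatenate the layer-wise representations for a downstream summary decoder.
        return torch.cat([lex, syn, sem], dim=1)

# Quick shape check with random inputs.
model = HierarchySketch()
out = model(torch.randint(0, 50000, (2, 16)), torch.randn(2, 12, 256), torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 38, 256])
```

In the repository, the semantic layer operates on heterogeneous dependency graphs, presumably via DGL, which is listed among the dependencies below.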
All source code is written in Python. Besides PyTorch, we also use several other libraries such as DGL, scikit-learn, pandas, and jsonlines.
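If you are setting up a fresh environment, the dependencies can typically be installed with pip (exact versions are not pinned here; check the repository for a requirements file):

```bash
pip install torch dgl scikit-learn pandas jsonlines
```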
- Datasets: All the datasets used in the paper are publicly accessible.
- Data preprocessing: The folder `preprocessing` is used to prepare data in the proper format before training. Go to this folder for more information (a small inspection sketch is shown below).
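  After preprocessing, the splits can be inspected with the `jsonlines` library. This is only a minimal sketch: the file path and record fields shown here are assumptions; the actual format is documented in the `preprocessing` folder.

  ```python
  # Hypothetical path and fields -- check the preprocessing folder's README for the real format.
  import jsonlines

  with jsonlines.open("data/train.jsonl") as reader:
      for i, record in enumerate(reader):
          print(sorted(record.keys()))  # e.g. source code, AST, and reference summary fields
          if i == 2:
              break
  ```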
- Modify the configuration file in the folder `c2nl/configs` such that all the paths are valid (a quick sanity check is sketched below).
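  A simple way to verify the paths is to load the config and test each path-like value. This sketch assumes a YAML config (PyYAML required), which may not match the actual format used in `c2nl/configs`; adapt as needed.

  ```python
  # Hypothetical check -- the config file name and format are assumptions.
  import os
  import yaml  # pip install pyyaml

  with open("c2nl/configs/config.yaml") as f:
      cfg = yaml.safe_load(f)

  for key, value in cfg.items():
      if isinstance(value, str) and ("/" in value or value.endswith((".json", ".jsonl", ".txt"))):
          print(f"{key}: {value} [{'OK' if os.path.exists(value) else 'MISSING'}]")
  ```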
- Train model:

  ```bash
  cd c2nl
  bash main/train.sh
  ```
- Baselines: The examined baselines are grouped into three categories:
  - Training from scratch: PA-former, CAST, NCS
  - Fine-tuning pretrained models: CodeT5, CodeBERT
  - In-context learning: StarCoder and CodeGen-Multi 2B (see the prompt sketch below)
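For the in-context-learning baselines, prompting roughly follows the pattern sketched below. The model id, prompt wording, and decoding settings here are illustrative assumptions and not the exact configuration used in the paper.

```python
# Hypothetical zero-shot summarization prompt using Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-2B-multi"  # "bigcode/starcoder" also works but requires access approval
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

code = "def add(a, b):\n    return a + b"
prompt = f"# Summarize the following Python function in one sentence.\n{code}\n# Summary:"
input_ids = tok(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=30)
print(tok.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```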