On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages



Description:: Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, when trained on a large corpus of code, have recently shown promising results on Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. This work studies the impact of PLMs on a low-resource programming language corpus, with Ruby as the study subject. A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus achieves higher performance than fine-tuning on a corpus written in just one programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with code in another; for example, Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks, Code Summarization and Code Search; 2) the strategy for selecting programming languages that works well when fine-tuning multilingual PLMs for Ruby; and 3) the performance of the fine-tuned PLMs on Ruby for different code lengths. For the third question, we bin the Ruby code by its number of tokens; understanding performance across code lengths will enable developers to make more informed decisions about using PLMs on their own code. This dataset, containing the PLMs and their fine-tuned models (over a hundred trained and fine-tuned models), was generated by researchers at the University of British Columbia, Singapore Management University, and JetBrains.
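
The length-based binning mentioned in the description could look roughly like the sketch below. It is only an illustration: the record does not state the tokenizer or the bin boundaries used, so the whitespace tokenizer, the `BIN_EDGES` values, and the function names are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin boundaries in tokens; the record does not give the real ones.
BIN_EDGES = [50, 100, 200]


def count_tokens(code: str) -> int:
    # Assumption: whitespace splitting stands in for the actual tokenizer.
    return len(code.split())


def bin_label(n_tokens: int) -> str:
    # Find the first edge strictly greater than n_tokens.
    i = bisect_right(BIN_EDGES, n_tokens)
    lower = 0 if i == 0 else BIN_EDGES[i - 1]
    if i == len(BIN_EDGES):
        return f"{lower}+"
    return f"{lower}-{BIN_EDGES[i] - 1}"


def bin_snippets(snippets):
    # Group code snippets by the label of the token-count bin they fall into.
    bins = defaultdict(list)
    for snippet in snippets:
        bins[bin_label(count_tokens(snippet))].append(snippet)
    return dict(bins)
```

Per-bin scores for a task such as Code Search could then be computed by evaluating a fine-tuned model separately on each group returned by `bin_snippets`.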

Author(s):: Chen, Fuxiang
University of British Columbia

Source Repository:: FRDR
Publisher(s):: Federated Research Data Repository / dépôt fédéré de données de recherche

URL:: https://doi.org/10.20383/102.0563

Publication date:: 2022-03-23

Keywords:: PLM, Fine-tuned Models, Pre-trained Language Models, Ruby, CodeBERT


Citation


APA Citation:: Chen, F. (2022). On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages [Data set]. FRDR. https://doi.org/10.20383/102.0563


