sdmay23-20 • Automated annotation of source code datasets for ML applications

Project Overview

AI-for-code applications such as AI-assisted programming, code summarization, and defect detection are on the rise. These applications help software developers improve productivity and make programming accessible to domain experts, who are increasingly writing code for scientific applications. As more machine learning models are used for such tasks, there is a need for better-annotated datasets that focus on complex characteristics of code beyond surface-level features. The field of Natural Language Processing has numerous such datasets focusing on specific linguistic features, which has led to significant advancements. Our aim is to create meaningful labels for source code characteristics and to automatically annotate source code with those labels. The annotated datasets can be used to train machine learning models as well as to probe pretrained models to determine whether they have learned human-recognizable characteristics or merely spurious correlations.

Team Members

Maxwell Sutcliffe

Client Communication Coordinator

Senior in Software Engineering.

Robby Rice

Digital Content Coordinator

Senior in Software Engineering.

Gavin Canfield

Quality Control

Senior in Computer Engineering.

Tanner Dunn

Agile Framework Organizer

Senior in Software Engineering.

Amon McAllister

Individual Component Design

Senior in Software Engineering.

Weekly Reports

Report 1
Report 2
Report 3
Report 4
Report 5
Report 6
Report 7
Report 8

Design Documents

User Needs
Requirements
Project Plan
Design Context and Exploration
Proposed Design
Testing
Final Planning Document
S E 492 Final Document

Final Slides

491 Final Slides
492 Final Slides