CSER 2024 Spring will be held on the campus of Queen's University, Kingston on June 10-11.
CSER meetings seek to motivate engaging discussions among faculty, graduate students and industry participants about software engineering research in a broad sense, as well as the intersection between software engineering, other areas of computer science, and other disciplines.
CSER brings together (primarily) Canadian-based software engineering researchers, including faculty, graduate students, industry participants, and any others who are interested.
| Milestone | Deadline |
|---|---|
| Proposal submission | Friday, May 10, 2024, 11:59pm (EDT) |
| Acceptance announcement | Wednesday, May 15, 2024 |
| Early registration | Wednesday, May 22, 2024 |
Abstract: Foundation Models (FMs) and the systems they power (e.g., ChatGPT and GitHub Copilot) have democratized access to a wide range of advanced technologies, so that the average person can now complete intricate tasks, from poem writing to software development, that previously only domain experts or field specialists could perform. However, due to the limitations of existing FM technologies, these systems suffer from a few serious drawbacks that limit their broader applicability. Instead of simply waiting for FM technologies to mature, AI engineering researchers and practitioners must develop innovative engineering solutions to mitigate or resolve these issues. One promising direction is agents: artificial entities capable of perceiving their surroundings using sensors, making decisions, and then taking actions in response using actuators. In this talk, I will present various advances in the area of FM-powered agent-oriented software engineering and its associated challenges.
Bio: Zhen Ming (Jack) Jiang is an associate professor in the Department of Electrical Engineering and Computer Science, York University. Before his academic career, he worked with the Performance Engineering team at BlackBerry (RIM), where tools resulting from his research are used daily to monitor and debug the health of several ultra-large commercial software systems within BlackBerry. His current research interests lie at the intersection of AI and SE. He is one of the co-organizers of the Shonan meeting on “Foundation Models and Software Engineering: Challenges and Opportunities” as well as the 2024 FM+SE Summit.
Abstract: Software developers spend a significant portion of their workday trying to understand and review the code changes of their teammates. Currently, most code reviewing and change comprehension is done using textual diff tools, such as the commit diff in GitHub or Gerrit. Such diff tools are insufficient, especially for complex changes that move code within the same file or between different files. Abstract Syntax Tree (AST) diff tools brought several improvements that make source code changes easier to understand. However, they still have constraints and limitations that negatively affect their accuracy. In this keynote, I will demonstrate these limitations using real case studies from open-source projects. At the same time, I will show how the AST diff generated by our tool addresses these limitations. Finally, I will introduce the benchmark we created based on commits from the Defects4J and Refactoring Oracle datasets, and present the precision and recall of state-of-the-art AST diff tools on our benchmark. Vive la révolution!
Bio: Nikolaos Tsantalis is an Associate Professor in the Department of Computer Science and Software Engineering at Concordia University, Montreal, Canada. His research interests include software maintenance, software evolution, software design, empirical software engineering, refactoring recommendation systems, and refactoring mining. He has developed tools, such as the Design Pattern Detection tool, JDeodorant and RefactoringMiner, which are used by many practitioners, researchers, and educators. He has been awarded three Most Influential Paper awards at SANER 2018, SANER 2019 and CASCON 2023, two ACM SIGSOFT Distinguished Paper awards at FSE 2016 and ICSE 2017, two ACM Distinguished Artifact awards at FSE 2016 and OOPSLA 2017, and four Distinguished Reviewer Awards at MSR 2020, ICPC 2022, ASE 2022 and ICSME 2023. He served as the Program co-chair for ICSME 2021 and SCAM 2020. He currently serves on the Editorial Board of the IEEE Transactions on Software Engineering as Associate Editor. Finally, he is a Senior member of the IEEE and the ACM, and holds a license from the Association of Professional Engineers of Ontario (PEO).
PDF version of the program: CSER_24_Program
Abstract: Software chatbots are used in many aspects of the software development process. These chatbots work alongside developers, providing assistance in various software development tasks, from answering development questions to running tests and controlling services. While numerous chatbots exist and their ability to support software practitioners is encouraging, little is known about the development challenges and usage benefits of software engineering chatbots.
In this talk, I will present our in-depth study of the most pressing and difficult challenges faced by practitioners in developing chatbots, as well as our chatbot, MSRBot, which answers software development and maintenance questions.
Based on our studies, we have identified two critical challenges in chatbot development: selecting a Natural Language Understanding (NLU) platform for chatbot implementation and curating a high-quality dataset to train the NLU. To guide chatbot developers in designing more efficient chatbots, we assess the performance of multiple widely used NLUs on representative software engineering tasks. Additionally, we propose approaches for augmenting software engineering chatbot datasets. Our work helps advance the state of the art in the use of chatbots in software engineering.
Bio: Ahmad Abdellatif is an Assistant Professor in the Department of Electrical and Software Engineering at the University of Calgary. Before that, he worked as a postdoctoral researcher at DASLab in the Department of Computer Science and Software Engineering at Concordia University. His research interests and expertise are in Software Engineering, with a special interest in Software Chatbots, Mining Software Repositories, and Engineering AI-based Systems. His work has been published in top-ranked Software Engineering venues, such as IEEE Transactions on Software Engineering (TSE), the Empirical Software Engineering journal (EMSE), the International Conference on Software Engineering (ICSE), and the International Conference on Mining Software Repositories (MSR). You can find more about him at https://aabdllatif.github.io/.
Abstract: Continuous integration is a DevOps practice in which software changes are frequently and automatically built, tested, and deployed. The primary objectives of continuous integration are to identify and address bugs quickly and to improve software quality. However, the complexity of modern software systems and the lack of debugging information can make it challenging to locate and understand the root causes of bugs. In this talk, I will share my research insights into the challenges of automated debugging in DevOps and how addressing them contributes to my broader goal of improving software quality and reliability. I will discuss how my past research, with a particular focus on developing new debugging innovations to minimize development and maintenance costs, has contributed to the field of automated debugging. Additionally, I will present my vision for the future of automated debugging in DevOps.
Bio: An Ran Chen is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Alberta. His research interests cover various software engineering topics, including automated debugging, software testing, mining software repositories, and DevOps practices. His work has been published in flagship journals and selected as a featured article in the TSE journal. He is also the co-chair of the Testing and Tools track for the IEEE International Conference on Software Testing, Verification and Validation (ICST) 2024. Prior to pursuing research, An Ran worked as a web and software developer at the Bank of Canada and McGill University. For more information, please visit his personal website at https://anrchen.github.io/home/.
Abstract: Given a program P that exhibits a certain property ψ (e.g., a C program that crashes GCC when it is being compiled), the goal of program reduction is to minimize P to a smaller variant P′ that still exhibits the same property, i.e., ψ(P′). Program reduction is important and widely demanded for testing and debugging. For example, all compiler/interpreter development projects need effective program reduction to minimize failure-inducing test programs to ease debugging.
In this talk, I will present Perses, a novel framework for effective, efficient, and general program reduction. The key insight is to exploit, in a general manner, the formal syntax of the programs under reduction and ensure that each reduction step considers only promising, syntactically valid variants to avoid futile efforts on syntactically invalid variants.
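For readers unfamiliar with program reduction, the sketch below shows the general shape of such a reduction loop in plain Python. It is only an illustration of the problem statement above, not the Perses algorithm: it greedily drops chunks of lines, whereas Perses walks the program's formal syntax so that every candidate variant is syntactically valid. The GCC-failure oracle is just one possible instance of the property ψ.

```python
import subprocess
import tempfile
from pathlib import Path

def exhibits_property(program_text: str) -> bool:
    """Oracle for psi(P): returns True if the candidate program still triggers
    the behaviour of interest. Here, as an example, a GCC failure is signalled
    by a non-zero exit code; a real oracle would be more precise."""
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(program_text)
        path = f.name
    result = subprocess.run(["gcc", "-c", path, "-o", "/dev/null"],
                            capture_output=True)
    Path(path).unlink(missing_ok=True)
    return result.returncode != 0

def reduce_program(lines: list[str]) -> list[str]:
    """Greedy line-based reduction: repeatedly drop chunks of lines as long
    as the property is preserved. Perses instead walks the program's syntax
    tree so that every candidate is syntactically valid."""
    changed = True
    while changed:
        changed = False
        chunk = max(1, len(lines) // 2)
        while chunk >= 1:
            i = 0
            while i < len(lines):
                candidate = lines[:i] + lines[i + chunk:]
                if candidate and exhibits_property("\n".join(candidate)):
                    lines = candidate        # keep the smaller variant
                    changed = True
                else:
                    i += chunk               # this chunk is needed; move on
            chunk //= 2
    return lines
```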
Bio: Dr. Chengnian Sun is an Assistant Professor in the Cheriton School of Computer Science at the University of Waterloo. His primary research interests encompass software engineering and programming languages. His work focuses on the design and implementation of techniques, tools, and methodologies that enhance software reliability and developer productivity. He has published more than 50 peer-reviewed papers at top-tier conferences and in journals such as ICSE, ASPLOS, PLDI, FSE, and TOSEM.
These works have generated over 3,700 citations. Before joining UWaterloo, he was a full-time software engineer at Google headquarters, working on Java/Android compiler toolchains and machine learning libraries for Google Search. Prior to Google, he spent three wonderful years as a postdoctoral fellow at the University of California, Davis, working on compiler validation techniques, which have detected more than 1,600 bugs in GCC and LLVM. He holds a Ph.D. in Computer Science from the National University of Singapore.
Abstract: Software ecosystems, such as npm, Maven, and PyPI, have completely changed how we develop software. By providing a platform of reusable libraries and packages, software ecosystems have enabled developers to write less code, increasing productivity and improving the quality of delivered software. However, this level of code reuse has created significant challenges in software maintenance: developers struggle to select well-maintained libraries among the myriad of options, dependency maintenance issues abound, and vulnerable dependencies are widespread, risking the integrity of delivered software.
In this talk, I will present the challenges of dependency management in the era of software ecosystems, how my past research has contributed to the field, and my vision for a more transparent and proactive approach to dependency management.
Bio: Diego Elias Costa is an Assistant Professor in the CSSE department of Concordia University. Before that, he was an Assistant Professor in the Computer Science department at UQAM, Canada. He received his Ph.D. in Computer Science from Heidelberg University, Germany. His research interests cover a wide range of software engineering topics, including SE4AI, dependency management, performance testing, and software engineering bots. His work has been published in journals such as IEEE TSE, EMSE, and TOSEM, and at premier venues such as ICSE, FSE, and ASE. You can find more about him at https://diegoeliascosta.github.io/.
Abstract: Mining insights from the social and textual data in software engineering involves analyzing non-code elements like chat logs and GitHub collaboration patterns. These elements provide a window into team communication and collaboration dynamics, which could be crucial for the success of software projects.
The challenge lies in effectively analyzing and interpreting vast amounts of unstructured social and textual data to extract meaningful insights. In this talk, I will share insights from research into analyzing social and textual interactions in software engineering. The talk will explore how these interactions, when effectively analyzed, can uncover interesting insights into how software is developed and maintained. The presentation will also highlight ongoing and future research initiatives that aim to derive more knowledge from this type of data by leveraging the power of Large Language Models.
Bio: Dr. El Mezouar obtained a PhD in Computing from Queen's University in 2019, where she was a member of the Software Evolution and Analytics Lab. Prior to that, she completed her M.Sc. in Software Engineering at Al Akhawayn University in Morocco. She joined the Department of Mathematics and Computer Science at the Royal Military College as an assistant professor in 2022. Dr. El Mezouar's main research field is Empirical Software Engineering. She uses methodologies such as machine learning, statistics, and qualitative techniques to better understand software development phenomena. She analyzes historical data (particularly textual data) using NLP techniques to provide approaches and techniques that can support software practitioners in the workplace.
Abstract: Individuals often develop software based on their own cognitive preferences and perspectives. However, given the diverse ways in which people process information, it becomes crucial to examine how we can effectively test and implement software with inclusivity in mind. This presentation will delve into the relationship between inclusivity and technology, addressing two main questions: What are inclusivity bugs? And how can we find and fix them in software products? I will introduce a Why/Where/Fix systematic inclusivity debugging process to help find inclusivity bugs (using the GenderMag cognitive walkthrough method), localize the Information Architecture (IA) faults behind them, and then fix the IA to remove the inclusivity bugs found. Additionally, I will share insights from various teams using GenderMag to enhance inclusivity in their products and processes.
Bio: Dr. Mariam Guizani is an Assistant Professor in the Department of Electrical and Computer Engineering at Queen’s University. She holds a PhD and a second MSc in Computer Science from Oregon State University and was a Fulbright fellowship recipient. She also holds a BSc and an MSc degree in Software Engineering. At the intersection of Software Engineering and Human-Computer Interaction, her research centers around designing diversity and inclusion processes and tools for sustainable socio-technical ecosystems. More specifically, her research focuses on improving the state of diversity and inclusion in open-source software (OSS). The broader impact of her work applies to academia, industry, and large OSS organizations. Dr. Mariam Guizani has worked together with Google and the Apache Software Foundation (ASF) for several years to investigate OSS projects’ experiences and implement actionable interventions. Her research at Microsoft Research was recognized by GitHub for its contribution to the currently deployed GitHub Discussion Dashboard, GitHub Blocks, and their future roadmap. She has also collaborated with foundations and departments such as Wikimedia and the Oregon State University IT department to empower communities to dismantle cognitive barriers in software. Her research has been published at both ACM and IEEE conferences. Dr. Mariam Guizani has been invited to present her work at academic and industry venues including ICSE, CSCW, Google, GitHub, and the Linux Foundation Open-Source Summit.
Abstract: Large language models (LLMs) have been increasingly adopted for Software Engineering (SE) tasks and are showing ever-better performance on benchmarks such as code generation and bug fixing. One common trend in the application of LLMs to SE tasks is to integrate pre-trained or fine-tuned models with program analysis techniques. Moreover, the adoption and evaluation of LLMs for SE tasks still face many challenges, e.g., the complex software development, evolution, and testing workflows in practice. In this talk, I will demonstrate how traditional program analysis techniques are used in the era of LLMs, with examples from my own work on LLM-based test completion (TeCo) and code-comment co-evolution (CoditT5). Based on that, I will discuss the path forward for building more accurate, robust, and interpretable LLM-based solutions for SE.
Bio: Pengyu Nie is an Assistant Professor in the Cheriton School of Computer Science at the University of Waterloo. His research interest is improving developers' productivity during software development, testing, and maintenance. Specific topics include execution-guided models for test completion and lemma naming, learning to evolve code and comments, and frameworks for maintaining executable comments and specifications.
Pengyu obtained his Ph.D. in 2023 and M.Sc. in 2020 from The University of Texas at Austin, advised by Milos Gligoric. He received his B.Sc. from the University of Science and Technology of China (School of the Gifted Young) in 2017.
Abstract: Runtime monitoring is often deployed in high-criticality systems to catch incorrect behaviour before it can cause failures. Often, however, the operators of these systems are interested in more than a warning klaxon at the moment a fault presents itself. Monitors that calculate operational information for system comprehension have more utility than simple error detectors. This talk will explore some current thinking in the field of Runtime Verification around monitors that do more than report specification violations and some of the accompanying challenges. These challenges include specification formalisms, computational complexity, and information transmission.
Bio: Sean Kauffman is an Assistant Professor in the Department of Electrical and Computer Engineering at Queen's University. Before coming to Queen's, Sean obtained his Ph.D. in Electrical and Computer Engineering at the University of Waterloo and spent two years as a Postdoctoral Researcher at Aalborg University in Denmark. Sean came back to academia after spending more than ten years working in industry as a software engineer, most recently as a principal software engineer at Oracle.
Sean's research focuses on safety-critical software and fits into the broad themes of Formal Methods, Runtime Verification, Anomaly Detection, and Explainable AI. He has collaborated on research with partners such as NASA's Jet Propulsion Laboratory, the Embedded Systems Institute, QNX, and Pratt and Whitney Canada. Sean's teaching philosophy focuses on fostering engagement, using techniques like active learning, productive failure, and peer instruction.
Abstract: Given either a specification written in natural language or an input program, automated program generation techniques produce a program according to the given specification or by modifying the input program. Automated program generation is a powerful technique that can be used to find bugs in software systems that take programs as input, or to fix bugs in the input programs. In this talk, I will share our latest results on automated program generation for (1) fixing bugs in large language model (LLM)-based automated program generation and (2) testing static program analyzers. For the first part of the talk, I will present our study that categorizes common mistakes made by LLMs like Codex, and our insights on applying automated program repair techniques to fix mistakes made by Codex. For the second part of the talk, I will introduce several automated testing techniques that find bugs in static analyzers using semantic-preserving transformations and annotation synthesizers.
Bio: Shin Hwei Tan is an Associate Professor (Gina Cody Research Chair) at Concordia University. Before moving to Concordia University, she was an Assistant Professor at the Southern University of Science and Technology in Shenzhen, China. She obtained her PhD degree from the National University of Singapore and her B.S. (Hons) and M.Sc. degrees from UIUC, USA. Her main research interests are in automated program repair, software testing, and open-source development. She is an Associate Editor for TOSEM and the Guest Editor in Chief for the New Frontier in Software Engineering track in TOSEM. She has also served on the program committees of top-tier software engineering conferences, where she won three best reviewer awards (FSE 2020, ASE 2020, ICSE 2022 NIER track).
Abstract: The world is going mobile. Android has surpassed its counterparts and become the most popular operating system all over the world. The openness and fast evolution of Android are the key factors behind its rapid growth. However, these characteristics have also created a notorious problem: Android fragmentation. There are numerous different Android device models and operating system versions in use, making it difficult for app developers to exhaustively test their apps on these devices. An Android app can behave differently on different device models, inducing various compatibility issues that reduce software reliability. Such fragmentation-induced compatibility issues (compatibility issues for short) are well recognized as a prominent problem in Android app development. In this talk, I will introduce the problem of Android compatibility issues, review past efforts to address them, and discuss potential research opportunities in this area.
Bio: Lili Wei is an assistant professor in the Department of Electrical and Computer Engineering at McGill University. Prior to joining McGill University, she received her Ph.D. degree and worked as a post-doctoral fellow at the Hong Kong University of Science and Technology. Her research interests lie in program analysis and testing with a focus on mobile apps, smart contracts, and IoT software. Her research outcomes have been recognized by several awards, including an ACM SIGSOFT Distinguished Paper Award, an ACM SIGSOFT Distinguished Artifact Award, a Google PhD Fellowship, and a Microsoft Research Asia PhD Fellowship. She also actively serves the software engineering research community and received a Distinguished Reviewer Award from ASE 2022. More information can be found on her personal website: https://liliweise.github.io
Topics on Foundation Models for Software Engineering
Panelists:
Secrets leading to a Successful Research Career
Panelists:
List of the presenters:
Abstract: The rapid progress of AI-powered programming assistants, such as GitHub Copilot, has facilitated the development of software applications. These assistants rely on large language models (LLMs), which are foundation models (FMs) that support a wide range of tasks related to understanding and generating language. LLMs have demonstrated their ability to express UML model specifications using formal languages like the Object Constraint Language (OCL). However, the context size of the prompt is limited by the number of tokens an LLM can process. This limitation becomes significant as the size of UML class models increases. In this study, we introduce PathOCL, a novel path-based prompt augmentation technique designed to facilitate OCL generation. PathOCL addresses the limitations of LLMs, specifically their token processing limit and the challenges posed by large UML class models. PathOCL is based on the concept of chunking, which selectively augments the prompts with a subset of UML classes relevant to the English specification. Our findings demonstrate that PathOCL, compared to augmenting the complete UML class model (UML-Augmentation), generates a higher number of valid and correct OCL constraints using the GPT-4 model. Moreover, the average prompt size crafted using PathOCL significantly decreases when scaling the size of the UML class models.
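As a rough illustration of the chunking idea described above, the sketch below keeps only the UML classes that overlap with terms in the English specification before building the prompt. The class structure, helper names, and keyword-overlap heuristic are hypothetical; PathOCL's actual path-based selection and prompt template are more sophisticated.

```python
from dataclasses import dataclass

@dataclass
class UMLClass:
    name: str
    attributes: list[str]
    associations: list[str]   # names of directly associated classes

def select_relevant_classes(spec: str, model: list[UMLClass]) -> list[UMLClass]:
    """Keep only classes whose name or attributes appear in the English
    specification, plus their direct neighbours. This keyword overlap is a
    stand-in for PathOCL's path-based selection."""
    terms = {t.lower().strip(".,") for t in spec.split()}
    relevant = [c for c in model
                if c.name.lower() in terms
                or any(a.lower() in terms for a in c.attributes)]
    selected = {c.name for c in relevant}
    relevant += [c for c in model
                 if c.name not in selected
                 and any(n in selected for n in c.associations)]
    return relevant

def build_prompt(spec: str, classes: list[UMLClass]) -> str:
    """Augment the prompt with only the selected subset of the class model."""
    context = "\n".join(
        f"class {c.name}({', '.join(c.attributes)})" for c in classes)
    return f"UML context:\n{context}\n\nWrite an OCL constraint for: {spec}"
```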
Abstract: The exponential growth of the mobile app market underscores the importance of constant innovation. User satisfaction is paramount, and developers rely on user reviews and industry trends to identify areas for improvement. However, the sheer volume of reviews poses challenges in manual analysis, necessitating automated approaches. Existing automated approaches either analyze only a target app’s reviews, neglecting valuable insights from competitors, or fail to provide actionable feature enhancement suggestions. To address these gaps, we propose LLM-CURE (LLM-based Competitive User Review Analysis for Feature Enhancement), a novel approach powered by a large language model (LLM) to automatically generate suggestions for mobile app feature improvements by leveraging insights from the competitive landscape. LLM-CURE operates in two phases. First, it identifies and categorizes user complaints within reviews into high-level features using its LLM capabilities. Second, for each complaint, LLM-CURE analyzes highly rated features in competing apps and proposes potential improvements specific to the target application. We evaluate LLM-CURE on 70 popular Android apps. Our evaluation demonstrates that LLM-CURE outperforms baselines in assigning features to reviews and highlights its effectiveness in utilizing user feedback and competitive analysis to guide feature enhancement strategies.
Abstract: A large body of research has proposed machine learning-based solutions to suggest where to insert logging statements. However, before answering the question "where to log?", practitioners first need to determine whether a file needs logging in the first place. To address this question, we characterize the log density (i.e., the ratio of log lines over the total LOC) through an empirical study of seven open-source software projects. Then, we propose a deep learning-based approach to predict the log density based on syntactic and semantic features of the source code. We also evaluate the consistency of our model over time and investigate the problem of concept drift. Our log density models achieve an average accuracy of 84%, which is consistent across different projects and over time. However, performance can drop significantly when a model is trained on data from one time period and tested on datasets from different time periods.
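The log density metric defined above (log lines over total LOC) can be computed directly from source files; a minimal sketch follows. The regular expression for what counts as a logging statement is an assumption for illustration and may differ from the study's own extraction.

```python
import re
from pathlib import Path

# Assumed pattern for logging statements in Java-like source code; the
# study's own extraction of log lines may differ.
LOG_CALL = re.compile(r"\b(log(ger)?|LOG)\.(trace|debug|info|warn|error)\s*\(")

def log_density(source_file: Path) -> float:
    """Log density = number of logging statements / total lines of code."""
    lines = [ln for ln in source_file.read_text(errors="ignore").splitlines()
             if ln.strip()]                       # ignore blank lines
    if not lines:
        return 0.0
    log_lines = sum(1 for ln in lines if LOG_CALL.search(ln))
    return log_lines / len(lines)

# Example: log_density(Path("src/main/java/Server.java"))
```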
Abstract: Large-scale code reuse significantly reduces both development costs and time. However, the massive share of third-party code in software projects poses new challenges, especially in terms of maintenance and security. In this paper, we propose a novel technique to specialize dependencies of Java projects, based on their actual usage. Given a project and its dependencies, we systematically identify the subset of each dependency that is necessary to build the project, and we remove the rest. As a result of this process, we package each specialized dependency in a JAR file. Then, we generate specialized dependency trees where the original dependencies are replaced by the specialized versions. This allows building the project with significantly less third-party code than the original. As a result, the specialized dependencies become a first-class concept in the software supply chain, rather than a transient artifact in an optimizing compiler toolchain. We implement our technique in a tool called DepTrim, which we evaluate with 30 notable open-source Java projects. DepTrim specializes a total of 343 (86.6%) dependencies across these projects, and successfully rebuilds each project with a specialized dependency tree. Moreover, through this specialization, DepTrim removes a total of 57,444 (42.2%) classes from the dependencies, reducing the ratio of dependency classes to project classes from 8.7× in the original projects to 5.0× after specialization. These novel results indicate that dependency specialization significantly reduces the share of third-party code in Java projects.
Abstract: Data quality is crucial in the field of software analytics, especially for machine learning (ML) applications such as software defect prediction (SDP). Despite the extensive use of ML models in software engineering, studies have primarily focused on singular antipatterns, while a multitude of antipatterns exist in practice; this more comprehensive setting is often ignored. This study aims to develop a comprehensive taxonomy of data quality antipatterns specific to ML and evaluate their impact on the performance and interpretation of software analytics models. We identified eight ML-specific data quality antipatterns and 14 sub-types through a literature review. We conducted a series of experiments to determine the prevalence of data quality antipatterns in SDP data (RQ1), assess the impact of cleaning orders on model performance (RQ2), evaluate the effects of antipattern removal on model performance (RQ3), and examine the consistency of interpretation results from models built with different antipatterns (RQ4). Our taxonomy includes antipatterns such as Schema Violations, Data Miscoding, Inconsistent Representation, Data Distribution Antipatterns, Packaging Antipatterns, Label Antipatterns, and Correlation & Redundancy. In our case study of SDP, we found that the studied datasets contain several antipatterns, which often co-exist. The impact of learner variability on model performance is higher than the impact of the order in which these antipatterns are cleaned. However, in a setting where the other antipatterns have been cleaned out, some antipatterns, such as Tailed Distributions and Class Overlap, have a significant effect on certain performance metrics. Finally, models built from data with different antipatterns showed moderate consistency in interpretation results. This study provides empirical evidence on the critical role of data quality in ML for software analytics. Our findings indicate that while the order of data cleaning has a minor impact, practitioners should be vigilant in addressing specific antipatterns, especially when other antipatterns have already been cleaned. Researchers and practitioners should also consider the "data quality" aspect when relying on model interpretation results. Prioritizing the removal of key antipatterns should take precedence over the removal of all antipatterns to maintain the performance of ML models in software defect prediction.
Abstract: In this presentation, we will discuss our efforts to enhance the textual information in logging statements from two perspectives. Firstly, we will explore proactive methods for suggesting the generation of new logging texts. We propose automated deep learning-based approaches that generate logging texts by translating related source code into concise textual descriptions. Secondly, we will talk about retroactive analysis of existing logging texts. We present the first comprehensive study on the temporal relations between logging and its corresponding source code, which is subsequently utilized to successfully detect anti-patterns in existing logging statements.
Abstract: Most studies investigating refactoring practices in test code either considered refactorings typically applied on production code, e.g., from Martin Fowler’s Refactoring book, or a narrow test-refactoring context, e.g., test-smells. To fill this gap and consolidate an empirically validated comprehensive catalogue of test-specific refactorings, we employ a mixed-method approach combining different sources of information, including existing test evolution datasets, answers to survey questions by contributors of popular open-source GitHub repositories, and StackOverflow test-refactoring-related questions and answers. We present evidence that refactoring activities take place in specific test code components and address different quality aspects of tests, such as maintainability, reusability, extensibility, reliability, performance, and flakiness. Our study paves the way for tool builders to add support for test features in static analysis and code generation tools. Moreover, it establishes a catalogue of test-specific refactorings to aid practitioners in maintaining large test-code bases and researchers in conducting empirical studies on test evolution.
Abstract: Developers often need to replace the used libraries with alternate libraries, a process known as library migration. Doing this manually can be tedious, time-consuming, and prone to errors. Automated migration techniques can help alleviate some of this burden. However, designing effective automated migration techniques requires understanding the types of code changes required during migration. This work contributes an empirical study that provides a holistic view of Python library migrations. We manually label the state-of-the-art migration data and derive a taxonomy for describing migration-related code changes, PyMigTax. Leveraging PyMigTax and our labeled data, we investigate various characteristics of Python library migrations. Our findings highlight various potential shortcomings of current library migration tools. Overall, our contributions provide the necessary knowledge and foundations for developing automated Python library migration techniques.
Abstract: Technical question-and-answer (Q&A) sites such as Stack Overflow have become an important source for software developers to seek knowledge. However, code snippets on Q&A sites are usually uncompilable and semantically incomplete due to unresolved types and missing dependent libraries, which makes it harder for users to reuse or analyze Q&A code snippets. Prior approaches either are not designed for synthesizing compilable code or suffer from a low compilation success rate. To address this problem, we propose ZS4C, a lightweight approach that performs zero-shot synthesis of compilable code from incomplete code snippets using a Large Language Model (LLM). ZS4C operates in two stages. In the first stage, ZS4C uses an LLM, i.e., ChatGPT, to identify missing import statements for a given code snippet, leveraging our task-specific prompt template. In the second stage, ZS4C fixes compilation errors caused by incorrect import statements and syntax errors through collaborative work between ChatGPT and a compiler. We thoroughly evaluated ZS4C on a widely used benchmark called StatType-SO against the SOTA approach, SnR. Compared with SnR, ZS4C improves the compilation rate from 63% to 87.6%, a 39.3% improvement. On average, ZS4C infers more accurate import statements than SnR, with an improvement of 6.6% in F1.
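A minimal sketch of the two-stage workflow described above, assuming hypothetical `ask_llm` and `compile_java` helpers; ZS4C's actual prompt templates and repair logic are richer than this.

```python
import subprocess
import tempfile
from pathlib import Path

def ask_llm(prompt: str) -> str:
    """Placeholder for a ChatGPT call; plug in any chat-completion client."""
    raise NotImplementedError("connect this to your LLM client")

def compile_java(source: str) -> str:
    """Return javac's error output; an empty string means the code compiled.
    For simplicity, the source is assumed to be a complete compilation unit
    whose public class is named Snippet."""
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "Snippet.java"
        path.write_text(source)
        result = subprocess.run(["javac", str(path)],
                                capture_output=True, text=True)
        return result.stderr

def synthesize_compilable(snippet: str, max_rounds: int = 3) -> str:
    """Two-stage loop in the spirit of ZS4C: (1) ask the LLM for the missing
    import statements, (2) repair remaining compilation errors with the
    compiler in the loop."""
    # Stage 1: infer missing import statements for the snippet.
    imports = ask_llm("List the Java import statements needed by this code "
                      "snippet, one per line:\n" + snippet)
    program = imports + "\n" + snippet
    # Stage 2: compiler-guided repair.
    for _ in range(max_rounds):
        errors = compile_java(program)
        if not errors:
            return program
        program = ask_llm("Fix this Java code so that it compiles.\n"
                          "Compiler errors:\n" + errors + "\nCode:\n" + program)
    return program
```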
Abstract: Inheritance, a fundamental aspect of object-oriented design, has been leveraged to enhance code reuse and facilitate efficient software development. However, alongside its benefits, inheritance can introduce tight coupling and complex relationships between classes, posing challenges for software maintenance. Although there are many studies on inheritance in source code, there is limited research on its test code counterpart. In this paper, we take the first step by studying inheritance in test code, with a focus on redundant test executions caused by inherited test cases. We empirically study the prevalence of test inheritance and its characteristics. We also propose a hybrid approach that combines static and dynamic analysis to identify and locate inheritance-induced redundant test cases. Our findings reveal that (1) inheritance is widely utilized in test code, (2) inheritance-induced redundant test executions are prevalent, accounting for 13% of all executed test cases, and (3) the redundancies slow down test execution by an average of 14%. Our study highlights the need for careful refactoring decisions to minimize redundant test cases and identifies the need for further research on test code quality.
Abstract: Software process models are pivotal in fostering collaboration and communication within software teams. We introduce LCG, which leverages multiple LLM agents to emulate various software process models, namely LCGWaterfall, LCGTDD, and LCGScrum. Each model assigns LLM agents specific roles such as requirement engineer, architect, developer, tester, and scrum master, mirroring typical development activities and communication patterns. Utilizing GPT3.5 as the underlying LLM and baseline (GPT), we evaluate LCG across four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Results indicate LCGScrum outperforms other models, achieving an average 15% improvement over GPT. Analysis reveals distinct impacts of development activities on generated code, with design and code reviews contributing to enhanced exception handling, while design, testing, and code reviews mitigate code smells. Furthermore, variations in Pass@1 are notable across different GPT3.5 model versions, highlighting the stability of LCG across model versions. This stability underscores the importance of adopting software process models.
Abstract: With the increasing popularity of machine learning (ML), many open-source software (OSS) contributors are attracted to developing and adopting ML approaches. A comprehensive understanding of ML contributors is crucial for successful ML OSS development and maintenance. Without such knowledge, there is a risk of inefficient resource allocation and hindered collaboration in ML OSS projects. Existing research focuses on understanding the difficulties and challenges perceived by ML contributors through user surveys, and there is a lack of understanding of ML contributors based on their activities tracked from software repositories. In this paper, we aim to understand ML contributors by identifying contributor profiles in ML libraries. We further study contributors’ OSS engagement from three aspects: workload composition, work preferences, and technical importance. By investigating 7,640 contributors from 6 popular ML libraries (TensorFlow, PyTorch, Keras, MXNet, Theano, and ONNX), we identify four contributor profiles: Core-Afterhour, Core-Workhour, Peripheral-Afterhour, and Peripheral-Workhour. We find that: 1) project experience, authored files, collaborations, and geographical location are significant features of all profiles; 2) contributors in Core profiles exhibit significantly different OSS engagement compared to Peripheral profiles; 3) contributors’ work preferences and workload compositions significantly impact project popularity; 4) long-term contributors evolve towards making fewer, constant, balanced, and less technical contributions.
Abstract: In this work, we performed a case study on three large-scale public operational datasets and empirically assessed five different types of model update strategies for supervised learning regarding their performance, updating cost, and stability. We observed that active model update strategies (e.g., periodical retraining, concept drift guided retraining, time-based model ensembles, and online learning) achieve better and more stable performance than a stationary model.
Abstract: Currently, there is a paucity of research on predicting the execution time of quantum circuits on quantum computers. The execution time estimates provided by the IBM Quantum Platform have large margins of error and do not satisfactorily meet the needs of researchers in the field of quantum computing. We selected a dataset comprising over 1,510 quantum circuits and initially predicted their execution times on simulators, which yielded promising results with an R-squared value nearing 95%. Subsequently, for the estimation of execution times on quantum computers, we conducted ten-fold cross-validation with an average R-squared value exceeding 90%. These results significantly surpass those provided by the IBM Quantum Platform. Our model has proven effective in accurately estimating execution times for quantum circuits on quantum computers.
Abstract: Logs record important runtime information in modern software development. Log parsing, the first step in many log-based analyses, involves extracting structured information from unstructured log data. Traditional log parsers face challenges in accurately parsing logs due to the diversity of log formats, which directly impacts the performance of downstream log-analysis tasks. In this paper, we explore the potential of using Large Language Models (LLMs) for log parsing and propose LLMParser, an LLM-based log parser built on generative LLMs and few-shot tuning. We leverage four LLMs, Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B, in LLMParser. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (a 96% average parsing accuracy). We further conduct a comprehensive empirical analysis of the effect of training size, model size, and pre-training LLM on log parsing accuracy. We find that smaller LLMs may be more effective than more complex LLMs; for instance, Flan-T5-base achieves comparable results to LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained on logs from other systems does not always improve parsing accuracy. While using pre-trained Flan-T5-base shows an improvement in accuracy, pre-trained LLaMA results in a decrease (a drop of almost 55% in group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research directions of LLM-based log parsers.
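As an illustration of the few-shot setup described above, the snippet below builds a few-shot log-parsing prompt that asks a model to abstract dynamic values into `<*>` placeholders. The examples and template format are illustrative, not LLMParser's actual fine-tuning data or prompts.

```python
# Illustrative few-shot prompt for log parsing; LLMParser's actual templates
# and fine-tuning setup are more involved than this.
FEW_SHOT_EXAMPLES = [
    ("Connected to 10.0.0.5:8080 in 32ms",
     "Connected to <*>:<*> in <*>ms"),
    ("User alice logged in from 192.168.1.7",
     "User <*> logged in from <*>"),
]

def build_parsing_prompt(log_line: str) -> str:
    """Ask the model to abstract dynamic values into <*> placeholders."""
    parts = ["Extract the log template, replacing variable values with <*>."]
    for raw, template in FEW_SHOT_EXAMPLES:
        parts.append(f"Log: {raw}\nTemplate: {template}")
    parts.append(f"Log: {log_line}\nTemplate:")
    return "\n\n".join(parts)

# Example:
# print(build_parsing_prompt("Disconnected from 10.0.0.5 after 120s"))
```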
Abstract: Large language models (LLMs) have seen increased use in various software tasks, such as code generation, code summarization, and program repair. Among these tasks, code generation has been the most studied so far, with significant results. The state-of-the-art approach for LLM-based code generation uses an iterative process (either simple iterations or smarter agents, e.g., reinforcement learning agents) that provides feedback to the LLM when the results are not satisfactory. The feedback can be as simple as whether the test cases for the code pass, or more extensive, such as including execution traces of failed tests or ASTs of the code in the prompt. While effective, these methods can be prohibitively expensive in terms of both time and monetary cost due to the frequent calls to LLMs and large prompts (many tokens to incorporate the feedback). In this work, we propose a Cost-Effective Search-Based Prompt Engineering (CE-SBPE) approach, which leverages an Evolutionary Algorithm to guide the iterative process of prompt optimization toward the most effective prompts while minimizing the use of expensive LLMs in the loop. Our approach provides similar and sometimes better accuracy compared to baselines at only a small fraction of their cost.
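A minimal sketch of a search-based prompt-optimization loop in the spirit of the approach described above, with placeholder `fitness` and `mutate` functions; the authors' actual CE-SBPE design and its cost-saving mechanisms are not shown here.

```python
import random

def fitness(prompt: str) -> float:
    """Placeholder: score a prompt cheaply, e.g., the fraction of a small
    validation set of coding tasks whose generated code passes its tests."""
    raise NotImplementedError("plug in a cheap evaluation here")

def mutate(prompt: str) -> str:
    """Placeholder: apply a small edit, e.g., add, drop, or reword one
    instruction or example in the prompt."""
    raise NotImplementedError("plug in a prompt mutation operator here")

def evolve_prompt(seed_prompts: list[str], generations: int = 10,
                  population_size: int = 8) -> str:
    """Simple (mu + lambda)-style search over prompts: keep the best half of
    the population, refill it with mutated copies of the survivors, repeat."""
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(1, population_size // 2)]
        children = [mutate(random.choice(parents))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```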
Abstract: Refactoring enhances software quality without altering its functional behaviors. Understanding the refactoring activities of developers is crucial to improving software maintainability. With the increasing use of machine learning (ML) libraries and frameworks, maximizing their maintainability is crucial. Due to the data-driven nature of ML projects, they often undergo different refactoring operations (e.g., data manipulation), for which existing refactoring tools lack ML-specific detection capabilities. Furthermore, a large number of ML libraries are written in Python, which has limited tools for refactoring detection. PyRef, a rule-based and state-of-the-art tool for Python refactoring detection, can identify 11 types of refactoring operations. In comparison, Rminer can detect 99 types of refactoring for Java projects. We introduce MLRefScanner, a prototype tool that applies machine-learning techniques to detect refactoring commits in ML Python projects. MLRefScanner identifies commits with both ML-specific and general refactoring operations. Evaluating MLRefScanner on 199 ML projects demonstrates its superior performance compared to state-of-the-art approaches, achieving an overall 94% precision and 82% recall. Combining it with PyRef further boosts performance to 95% precision and 99% recall. Our study highlights the potential of ML-driven approaches in detecting refactoring across diverse programming languages and technical domains, addressing the limitations of rule-based detection methods.
Abstract: During code reviews, an essential step in software quality assurance, reviewers have the difficult task of understanding and evaluating code changes to validate their quality and prevent introducing faults to the codebase. This is a tedious process where the effort needed is highly dependent on the code submitted, as well as on the author’s and the reviewer’s experience, leading to median wait times for review feedback of 15-64 hours. Through an initial user study carried out with 29 experts, we found that re-ordering the files changed by a patch within the review environment has the potential to improve review quality, as more comments are written (+23%), and participants’ file-level hot-spot precision and recall increase to 53% (+13%) and 28% (+8%), respectively, compared to alphanumeric ordering. Hence, this work aims to help code reviewers by predicting which files in a submitted patch (1) need to be commented on, (2) need to be revised, or (3) are hot-spots (commented or revised). To predict these tasks, we evaluate two different types of text embeddings (i.e., Bag-of-Words and Large Language Model encodings) and review process features (i.e., code size-based and history-based features). Our empirical study on three open-source and two industrial datasets shows that combining the code embedding and review process features leads to better results than the state-of-the-art approach. For all tasks, F1-scores (median of 40-62%) are significantly better than the state-of-the-art (from +1 to +9%).
Abstract: Data quality is vital for user experience in products that rely on data, and taxonomies of data quality issues have been proposed as a basis for addressing data quality problems. However, although some of the existing taxonomies are near-comprehensive, their over-complexity has limited their actionability in developing solutions to data issues. Hence, recent research has proposed new sets of data issue categories that are more concise for better usability. Although more concise, labels that over-cater to the solution systems may cause the taxonomy to not be mutually exclusive; consequently, different categories sometimes overlap in determining the issue types. This hinders solution development and confounds issue detection. Therefore, based on observations from a literature review and feedback from our industry partner, we propose a comprehensive taxonomy of data quality issues. Our work aims to provide a widely generalizable taxonomy for modern data quality issue engineering and to help practitioners and researchers understand their data issues and estimate the effort required to fix them.
Abstract: Automatic software fault localization plays an important role in software quality assurance by pinpointing faulty locations for easier debugging. Coverage-based fault localization, a widely used technique, employs statistics on coverage spectra to rank code based on suspiciousness scores. However, the rigidity of statistical approaches calls for learning-based techniques. Among these, Grace, a graph-neural-network (GNN) based technique, has achieved state-of-the-art results due to its capacity to preserve coverage spectra, i.e., test-to-source coverage relationships, as a precise abstract-syntax-enhanced graph representation, mitigating the limitation of other learning-based techniques that compress the feature representation. However, such a representation struggles with scalability due to the increasing complexity of software and the associated coverage spectra and AST graphs. In this work, we propose a new graph representation, DepGraph, that reduces the complexity of the graph representation by 70% in nodes and edges by integrating the interprocedural call graph into the graph representation of the code. Moreover, we integrate additional features such as code change information into the graph as attributes so the model can leverage rich historical project data. We evaluate DepGraph using Defects4j 2.0.0, and it outperforms Grace by locating 20% more faults in Top-1 and improving the Mean First Rank (MFR) and the Mean Average Rank (MAR) by over 50%, while decreasing GPU memory usage by 44% and training/inference time by 85%. Additionally, in cross-project settings, DepGraph surpasses the state-of-the-art baseline with 42% higher Top-1 accuracy, and 68% and 65% improvements in MFR and MAR, respectively. Our study demonstrates DepGraph's robustness, achieving state-of-the-art accuracy and scalability for future extension and adoption.
Abstract: This paper explores the use of Large Language Models (LLMs), i.e., GPT-4, for Automated Software Engineering (ASE) tasks, comparing prompt engineering and fine-tuning approaches. Three prompt engineering techniques (basic prompting, in-context learning, and task-specific prompting) were evaluated against 18 fine-tuned models on code generation, code summarization, and code translation tasks. We found that GPT-4, even with the best prompting strategy, could not significantly outperform older/smaller fine-tuned models across all tasks. To qualitatively assess GPT-4 with different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. We categorize the different prompts used by the participants to show the trends and their effectiveness on each task. From our qualitative analysis, we find that GPT-4 with conversational prompts (i.e., when a human provides feedback back and forth with the model to achieve the best results) showed drastic improvement compared to GPT-4 with automatic prompting strategies.
Abstract: Machine Learning (ML) decision-making involves complex tradeoffs across various design stages, involving diverse strategic actors with evolving interests. This paper investigates the distribution of Responsible AI decisions and tradeoffs along different ML process stages and how they are made by different people. We model design decision points to identify conflict areas and include Strategic Actors in our modeling to understand their involvement and strategic interests. Our approach examines tradeoffs at the task refinement, goal refinement, and business design decision points, revealing how operationalization-level tradeoffs impact business-level decisions. By recognizing the strategic actors and their interests at each stage, we can better navigate ML design tradeoffs and conflicts, ensuring more effective and aligned decision-making. This research contributes to the understanding of ML design decision-making, supporting the development of more efficient and responsible AI systems.
Abstract: Developers often use configuration parameters to customize the behavior of software, ensuring it meets specific performance requirements or adapts to various deployment contexts. However, misconfigurations in software systems are common and can lead to performance degradation. Typically, developers need to conduct numerous performance tests to identify performance-sensitive parameters that significantly impact system performance. More specifically, developers need to manually adjust configuration values and monitor the corresponding changes in performance metrics, such as throughput, to measure the performance variations. In this paper, we propose SensitiveTeeth, a novel LLM agent-based approach for identifying performance-sensitive configurations, which utilizes static code analysis and LLM agent-based analysis to study the performance-sensitive configurations. Our evaluation of seven open-source systems demonstrates that our tool achieves higher accuracy in identifying performance-sensitive configurations and smaller performance overhead than other state-of-the-art approaches. SensitiveTeeth provides future research direction for LLM-based performance analysis of configurations.
Abstract: Log management tools such as ELK Stack and Splunk are widely adopted to manage and leverage log data, assisting DevOps teams in real-time log analytics and decision making. To enable fast queries and to save storage space, such tools split log data into small blocks (e.g., 16KB), then index and compress each block separately. Previous log compression studies focus on improving the compression of either large log files or log streams, without considering the compression of small log blocks, which is what modern log management tools actually need. Hence, we propose an approach named LogBlock to preprocess small log blocks before compressing them with a general compressor. Our evaluation on 16 log files shows that LogBlock improves the compression ratio of small log blocks of different sizes by a median of 5% to 21% compared to direct compression without preprocessing, outperforming state-of-the-art compression approaches.
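To make the small-block setting concrete, the sketch below compresses a log block-by-block, as log management tools do, and reports the resulting compression ratio; the `preprocess` hook marks where a reversible transformation such as LogBlock's preprocessing would plug in. The block size and compressor choice are illustrative assumptions.

```python
import zlib

BLOCK_SIZE = 16 * 1024   # 16KB blocks, as split by typical log management tools

def block_compression_ratio(log_text: str, preprocess=None) -> float:
    """Compress a log block-by-block (as log management tools do) and return
    the overall compression ratio (original bytes / compressed bytes).
    `preprocess` marks where a reversible transformation such as LogBlock's
    preprocessing would be applied before the general compressor."""
    data = log_text.encode()
    original = compressed = 0
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        original += len(block)
        if preprocess is not None:
            block = preprocess(block)
        compressed += len(zlib.compress(block, level=6))
    return original / max(compressed, 1)

# Example: block_compression_ratio(open("app.log", errors="ignore").read())
```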
Abstract: Traditional code metrics (e.g., cyclomatic complexity) do not reveal system traits that are tied to certain building blocks of a given programming language. We conjecture that taking these building blocks of a programming language into account can lead to further insights about a software system. In this vein, we introduce Knowledge Units (KUs), a new set of metrics that can capture a novel trait of software systems. We define a KU as a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. In this study, we operationalize our KUs via two certification exams for Java SE that are offered by Oracle, namely Java SE 8 Programmer I and II. This study aims to understand whether we can obtain richer results for software engineering tasks (e.g., more accurate identification of bug-prone code) when using KUs in combination with traditional code metrics. More generally, we seek to understand whether we can gain further insights using a different code analysis lens (KUs) which is derived from the same raw data (source code). We analyze 184 real-world Java software systems on GitHub and extract KUs from these systems. We find empirical evidence that KUs are different from and complementary to traditional metrics, thus indeed offering a new lens through which software systems can be analyzed. This result motivates us to study the suitability of KUs for classifying bug-prone code. Our KU models (models built with KUs) significantly outperform CC models (models built with traditional code metrics) for individual releases in the dataset. Combining traditional code metrics and KUs leads to even higher-performing models (CC+KU models), significantly outperforming both CC models and KU models. Our further investigation of the models shows that KUs can help identify severe bugs. Finally, we note that KUs can provide alternative insights into the occurrence of bugs or the context in which those bugs happen. Given our promising findings in this exploratory study, we encourage future studies to explore richer conceptualization and operationalization of KUs and further investigate the efficacy of KUs in analyzing software systems.
Abstract: In the rapidly evolving landscape of developer communities, Q&A platforms serve as crucial resources for crowdsourcing developers' knowledge. A notable trend is the increasing use of images to convey complex queries more effectively. However, the current state of the art in duplicate question detection has not kept pace with this shift, as it predominantly concentrates on text-based analysis. Inspired by advancements in image processing and by numerous studies in software engineering illustrating the promising future of image-based communication on social coding platforms, we delved into image-based techniques for identifying duplicate questions on Stack Overflow. When focusing solely on the text of Stack Overflow questions and omitting the images, automated models overlook a significant aspect of the question. Previous research has demonstrated the complementary nature of images to text. To address this, we implemented two methods of image analysis: first, integrating the text extracted from images into the question text, and second, evaluating the images based on their visual content using image captions. After a rigorous evaluation of our model, it became evident that the efficiency improvements achieved were relatively modest, approximately 1% on average. This marginal enhancement falls short of what could be deemed a substantial impact. On an encouraging note, our work lays the foundation for easy replication and hypothesis validation, allowing future research to build upon our approach and explore novel solutions for more effective image-driven duplicate question detection.
Abstract: The proliferation of machine learning models on platforms like Hugging Face (HF) presents both opportunities and challenges in resource management. Inconsistent release practices, characterized by ambiguous naming and versioning conventions, undisclosed changes to model weights, and inaccessible model training documentation and datasets, significantly hinder users' ability to select suitable models. Standardized practices are crucial to clarify model identities and ensure reproducibility. Therefore, this study aims to explore the release practices of Large Language Models (LLMs) on HF by identifying current shortcomings and proposing best practices. Using a mixed-methods design that combines quantitative and qualitative data analysis, we will build a comprehensive understanding of current practices. The outcome of this study will be a set of proposed standardized practices emphasizing transparent documentation and usability features, thereby enhancing the transparency of LLM repositories. Additionally, it will provide valuable insights for improving user experience and decision-making on HF and similar platforms.
Abstract: Recent advancements in Large Language Models (LLMs) showcased impressive capabilities with Retrieval-Augmented Generation (RAG) for software test generation. However, the potential impact of integrating various resources with RAG remains largely unexplored. This study examines how different external knowledge resources affect RAG-based unit test generation, focusing on the fundamental deep learning library, TensorFlow. We consider three types of domain knowledge: 1) relevant posts on Stack Overflow, 2) relevant posts on GitHub, and 3) API documentation. By mining frequent patterns of common API usage from Stack Overflow and GitHub, we enhance unit test generation prompts. Evaluation includes qualitative and quantitative analyses, such as compilation, test execution, and line coverage comparison. Results reveal a promising 22% enhancement in line coverage when combining all retrieved knowledge, with frequent pattern inclusion alone contributing to a 13% increase. This highlights the potential of diverse external resources in improving RAG-based unit test generation for Deep Learning libraries.
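As a rough illustration of how retrieved knowledge can be folded into a generation prompt, the sketch below assembles a unit-test prompt from three entirely hypothetical retrieval helpers; it is not the authors' pipeline, and the snippets shown are made up.

```python
# Sketch of RAG-style prompt construction for unit test generation.
# The retrieval helpers and their returned snippets are placeholders.
def retrieve_stack_overflow(api_name: str) -> list[str]:
    return ["from_tensor_slices expects all tensors to share the first dimension."]

def retrieve_github_usage(api_name: str) -> list[str]:
    return ["dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)"]

def retrieve_api_docs(api_name: str) -> str:
    return "from_tensor_slices(tensors): creates a Dataset whose elements are slices of the given tensors."

def build_prompt(api_name: str) -> str:
    context = "\n".join(
        ["# Stack Overflow:"] + retrieve_stack_overflow(api_name)
        + ["# GitHub usage patterns:"] + retrieve_github_usage(api_name)
        + ["# API documentation:", retrieve_api_docs(api_name)]
    )
    return (f"{context}\n\nWrite a pytest unit test that exercises "
            f"{api_name} and asserts on the output shape and dtype.")

print(build_prompt("tf.data.Dataset.from_tensor_slices"))
```

The assembled prompt would then be sent to the code-generating LLM, and the resulting test compiled, executed, and measured for line coverage.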
Abstract: Large Language Models (LLMs) have revolutionized software development by automating coding tasks, thereby saving time and reducing defects. This study evaluates the efficacy of CodeLlama in code refactoring—a key practice for improving code quality without changing its functionality. We compare the refactoring outputs from both base and fine-tuned versions of CodeLlama against traditional developer-led efforts. We find that CodeLlama reduces code smells by 44.1% more than developers on code that CodeLlama has successfully generated refactorings for. In particular, the refactored code generated by CodeLlama passes our unit tests 32.7% of the time. LoRA with a training dataset size of 5k is our most effective way of fine-tuning CodeLlama for refactoring tasks. Using the fine-tuned model results in a higher unit test pass rate compared to baseline CodeLlama, although it has a lower code smell reduction rate.
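For readers unfamiliar with LoRA, the sketch below shows what a parameter-efficient fine-tuning setup for a CodeLlama checkpoint typically looks like with the Hugging Face transformers and peft libraries; the rank, alpha, and target modules are illustrative defaults, not the settings used in this study.

```python
# Illustrative LoRA setup for CodeLlama with Hugging Face `peft`.
# Hyperparameters below are common defaults, not the study's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the small adapters are trained
```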
Abstract: Requirements specification and verification are crucial processes of software development, especially for Safety Critical Systems due to their complexity. To mitigate the ambiguity caused by natural language, Controlled Natural Languages (CNLs) have been introduced to constrain the specification while maintaining readability, generally through templates. However, existing CNLs do not provide an approach for constructing templates, rarely provide tool support, and target specific requirement types. In this poster, we present a model-driven approach for requirements specification using templates and requirements verification using domain models. The approach covers different types of requirements and provides a systematic process for creating templates. Using MDE eases the creation, evolution, and implementation of templates. We implemented our approach in a tool, MD-RSuT (Model-Driven Requirements Specification using Templates), for the specification, verification, and management of requirements. We evaluated our approach through three case studies, demonstrating its applicability across domains and showing that it yields requirements of better quality.
Abstract: Robot Operating System (ROS) is a common platform for mechatronics research. X-Plane is an instructor-grade flight simulator that can be used to drive physical flight simulators. This demo shows an X-Plane plugin that interfaces the flight simulator with ROS2 networks, allowing ROS2 software to act on information from the flight simulator and to modify information within it. The demo will show a simulation of an attack on ARINC 429 hardware that modifies the aircraft instruments in real time.
Abstract: "Legacy systems have high maintenance costs due to their reliance on deprecated technologies. These legacy systems are often too complex to be rewritten from scratch. Such systems could benefit from being migrated to an event-driven architecture (EDA). For instance, EDAs can improve flexibility, reliability and maintainability of software systems by decoupling software components. However, migrating a legacy system is not a straightforward process and may have significant costs. The goal of this project is to build an approach and a tool to support software developers in the migration of an existing system towards an EDA. As a first step towards achieving this goal, we conducted a survey with software professionals to study the state of the practice of legacy-to-EDA migration. The purpose of the survey is to better understand (1) how EDAs are used in industry, and (2) how software professionals migrate their systems to EDAs. The survey consisted of two parts: (1) an online survey consisting of 26 questions and (2) interview sessions with some of the participants. The survey was answered by 37 participants, of which two volunteered for an interview. In this poster, we present preliminary results of the survey. Our key findings are: (1) the main motivation behind a legacy-to-EDA migration is to decouple parts of the system, (2) the technologies preferred by professionals when implementing EDAs are the Java programming language and the Apache Kafka event broker, and (3) software professionals rely mainly on business processes, data flow diagrams and human expertise to guide the migration process."
Abstract: In large-scale industries, the pace at which Continuous Integration (CI) processes builds often limits development speed, since the codebase is compiled after every change. Research indicates that predicting CI build outcomes reduces unnecessary builds, cutting feedback time. While current methods for skipping builds rely on metadata or heuristics, they overlook the context of the change being made. Thus, we propose "BuildJudge," an approach to build outcome prediction that leverages LLMs to infer contextual information from code changes. Our evaluation across 20 projects from the TravisTorrent dataset shows that BuildJudge improves the F1-score of the failing class by 71.14%, the recall of the failing class by 111.30%, and the AUC by 17.41%, and reduces turnaround time by 23.36% compared with SmartBuildSkip. Our findings suggest the potential of incorporating contextual information such as code changes into build outcome prediction to facilitate more precise build scheduling, thereby effectively reducing CI costs and turnaround time.
Abstract: Neural Machine Translation (NMT) systems still face issues despite advancements. Metamorphic testing approaches involving token replacement have been introduced to test these systems, but selecting tokens for replacement remains a challenge. This work proposes two white-box approaches to identify vulnerable tokens whose perturbation is likely to induce translation bugs: GRI utilizes GRadient Information, while WALI uses Word ALignment Information. The approaches were evaluated on a Transformer-based translation system using the News Commentary dataset and 200 English sentences from CNN articles. Results show that GRI and WALI effectively generate high-quality test cases for revealing translation bugs. Compared to state-of-the-art automatic testing approaches, GRI and WALI performed better in two respects: 1) under a given testing budget, they revealed more bugs, and 2) given a testing goal, they required fewer testing resources. The token selection strategies thus allow more efficient bug detection in NMT systems.
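The gradient-based idea behind GRI can be pictured with a toy model: tokens whose embedding gradients have large norms are ranked as more vulnerable to perturbation. The snippet below is a conceptual stand-in only; it uses a tiny embedding-plus-linear model rather than a real NMT system.

```python
# Toy illustration of gradient-based token ranking (the intuition behind GRI).
# A tiny embedding + linear model stands in for the translation system.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 50, 8
embed = nn.Embedding(vocab, dim)
head = nn.Linear(dim, vocab)

tokens = torch.tensor([[3, 17, 42, 5]])            # one toy source sentence
emb = embed(tokens)
emb.retain_grad()                                   # keep per-token gradients
loss = nn.functional.cross_entropy(
    head(emb).view(-1, vocab), tokens.view(-1))     # proxy for the NMT loss
loss.backward()

saliency = emb.grad.norm(dim=-1).squeeze(0)         # one score per token
print(saliency.argsort(descending=True))            # most "vulnerable" tokens first
```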
Abstract: DApps, or decentralized applications, are software programs running on blockchain platforms, built with immutable smart contracts. However, maintaining and updating these DApps post-deployment poses challenges, as Ethereum lacks native solutions for smart contract maintenance. One popular method employed by developers is the upgradeability proxy contract (UPC), which utilizes the proxy design pattern: a proxy contract delegates calls to an implementation contract, allowing runtime reconfiguration for upgrades. Detecting UPCs accurately is essential for both researchers and practitioners. Researchers benefit from understanding real-world DApp maintenance, while practitioners require transparency for auditing and for understanding application behavior. To address this, we introduce UPC Sentinel, a novel algorithm leveraging static and dynamic analysis of smart contract bytecode. Evaluation on two datasets showcases its efficacy, with a near-perfect recall of 98.4% on the first dataset, and a perfect precision of 100% and a recall of 97.8% on the second, surpassing existing methods.
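At the bytecode level, proxies are recognizable by their use of the DELEGATECALL opcode (0xF4). The sketch below shows only this very first, simplified static signal; actually classifying upgradeability proxies, as UPC Sentinel does, requires far richer static and dynamic analysis.

```python
# Simplified static check: does EVM runtime bytecode contain DELEGATECALL (0xF4)?
# PUSH operands are skipped so data bytes are not mistaken for opcodes.
# This is a conceptual signal only, not UPC Sentinel's algorithm.
def uses_delegatecall(bytecode_hex: str) -> bool:
    code = bytes.fromhex(bytecode_hex.removeprefix("0x"))
    i = 0
    while i < len(code):
        op = code[i]
        if 0x60 <= op <= 0x7F:          # PUSH1..PUSH32: skip pushed operand bytes
            i += op - 0x60 + 1
        elif op == 0xF4:                # DELEGATECALL
            return True
        i += 1
    return False

# EIP-1167 minimal proxy runtime code (with a placeholder implementation address).
minimal_proxy = "363d3d373d3d3d363d73" + "be" * 20 + "5af43d82803e903d91602b57fd5bf3"
print(uses_delegatecall(minimal_proxy))   # True
```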
Abstract: Software developers often spend a significant amount of time seeking answers to questions related to their coding tasks. They typically search for answers online, post queries on Q&A websites, and more recently, participate in chat communities. However, many of these questions go unanswered or need a lot of follow-up and clarification. Automatically identifying possible ways to refine a developer query so that it adequately captures the problem and required context, such as software versions, could save time and effort. To address this issue, we first explore the use of Large Language Models (LLMs) for Named Entity Recognition (NER) to identify SE-related entities. We evaluate the performance of Mixtral 8x7B by prompting it for NER tasks across four popular programming chatrooms on Discord, and assess how effectively it can identify SE-related entities. Preliminary results show that the approach is very effective, with an accuracy of 0.89. We then investigate how the presence of specific SE-related entities in queries influences the likelihood and speed of receiving a response. Our next step is to propose query refinements with the goal of making it more likely that queries will receive answers.
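A minimal way to picture the prompting step is a plain instruction that asks the model to return SE entities as JSON; the prompt wording, entity types, and example message below are illustrative, not the study's exact setup.

```python
# Sketch of an NER prompt for SE-related entities in a chat message.
# The entity schema and example are made up for illustration.
def ner_prompt(message: str) -> str:
    return (
        "Extract software engineering entities from the message below. "
        "Return JSON with the keys: library, programming_language, version, error.\n\n"
        f"Message: {message}\nEntities:"
    )

print(ner_prompt("pandas 2.1 raises a KeyError when I call df.merge on Python 3.12"))
# The prompt would be sent to the LLM (e.g., Mixtral 8x7B) and the JSON reply
# parsed to determine which entity types the developer query already contains.
```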
Abstract: Developers rely on third-party library Application Programming Interfaces (APIs) when developing software. However, violating their constraints results in API misuse causing incorrect behavior. A study of API misuse of deep learning libraries showed that their misuses are different from misuses of traditional libraries. We speculate that these observations may extend beyond deep learning libraries to other data-centric libraries due to their similarities in dealing with diverse data structures and having a multitude of parameters, which pose usability challenges. Therefore, understanding the potential misuses of these libraries is important to avoid unexpected behavior. We conduct an empirical study of API misuses of five data-centric libraries by analyzing data from Stack Overflow and GitHub. Our results show that many deep learning API misuse characteristics extend to the data-centric libraries. Overall, our work exposes the challenges of API misuse in data-centric libraries and lays the groundwork for future research to help reduce misuses.
Abstract: ChatGPT has significantly impacted software development practices, providing substantial assistance to developers in a variety of tasks. However, the impact of ChatGPT as an assistant in collaborative coding remains largely unexplored. In this paper, we analyze a dataset of developers' shared conversations with ChatGPT in GitHub pull requests and issues. We manually examined the content of the conversations and characterized the sharing behaviors. Our main observations are: (1) Developers seek ChatGPT's assistance across 16 types of software engineering inquiries, the most frequent being code generation. (2) Developers frequently engage with ChatGPT via multi-turn conversations where each prompt can fulfill various roles, such as unveiling initial or new tasks. (3) In collaborative coding, developers leverage shared conversations with ChatGPT to facilitate their role-specific contributions, for example as authors of PRs or issues. Our work serves as a first step towards understanding the dynamics between developers and ChatGPT in collaborative software development.
Abstract: Assurance cases (ACs) are structured arguments that allow verifying the correct implementation of a system's non-functional requirements (e.g., safety, security), thereby helping prevent system failures, which may result in catastrophic outcomes (e.g., loss of lives). ACs support the certification of systems in compliance with industrial standards (e.g., DO-178C and ISO 26262). Identifying defeaters, i.e., arguments that challenge these ACs, is crucial for enhancing ACs' robustness and confidence. To automatically support that task, we propose a novel approach that explores the potential of GPT-4 Turbo, an advanced Large Language Model (LLM) developed by OpenAI, in identifying defeaters within ACs formalized using the Eliminative Argumentation (EA) notation. Our preliminary evaluation assesses the model's ability to comprehend and generate arguments in this context, and the results show that GPT-4 Turbo is very proficient in EA notation and can generate different types of defeaters.
Abstract: With the global rise of computational thinking in school curriculums, it has become important to design new educational tools and games that support collaboration, engagement, and equitable access. Access to computers and tablets can be cost prohibitive for many students, and we need to understand how to equitably support learning without requiring access to technology. To address this challenge, we created Run, Llama, Run – a cost-accessible collaborative educational game for learning computational thinking in kindergarten through grade 5. Run, Llama, Run has both a tangible version, which uses paper boards and 3D printed blocks, and a tangible/digital hybrid version, which uses the same 3D printed blocks but includes code block execution via a tablet. We designed these versions to assess the impact the interface has on learning with respect to making and fixing mistakes, as well as its impact on collaboration and engagement.
Abstract: The Unified Modelling Language (UML) is a visual modelling language that allows people to visualize and conceptualize software systems. Being primarily visual, made up of a combination of symbols and text, UML can be difficult for people with sight loss to use. We propose a tangible version of UML class diagrams to allow for easier access to this modelling language. Using a set of 3D printed blocks with QR codes and Braille, users can feel their way through and create a UML class diagram. This includes class blocks, connection blocks, inheritance blocks, and multiplicity blocks, as well as plans to expand further. In addition to the blocks, we propose an app system to allow for digital upload of the tangible UML blocks, using both a phone and a computer.
Abstract: "COBOL, developed in 1959, remains vital in finance and government for managing extensive data sets. Its longevity is supported by high-quality documentation that promotes easy maintenance and code reuse. Clear code explanations are essential for educating new developers, facilitating their understanding of legacy systems for effective navigation and refactoring. Currently, LLM can aid in generating explanations for programing code. Ensuring accuracy and avoiding hallucinations in these explanations is a challenge, necessitating the integration of relevant contextual information into prompts to enhance performance. Additionally, given the constraints of input window sizes, splitting large COBOL code into manageable segments is crucial. Understanding explanations at various levels, such as paragraph, file, and module levels, is also necessary to thoroughly comprehend and maintain these legacy systems."
Abstract: In this poster, we present a comprehensive study on four representative and widely adopted DNN models to investigate how different hyperparameters affect standard DNN models, and how hyperparameter tuning combined with model optimization affects optimized DNN models, in terms of various performance properties (e.g., inference latency or battery consumption). Our empirical results indicate that tuning specific hyperparameters has a heterogeneous impact on the performance of DNN models across different models and different performance properties. We also observe that model optimization has a confounding effect on the impact of hyperparameters on DNN model performance. Our findings highlight that practitioners can benefit from paying attention to a variety of performance properties and to the confounding effect of model optimization when tuning and optimizing their DNN models.
Abstract: In this work, we introduce APRAgent, a Large Language Model (LLM)-based Automatic Program Repair (APR) agent. APRAgent dynamically retrieves relevant skills by searching through its Skill Library and uses them to construct fixing examples as prompts for repairing buggy code. As part of this process, APRAgent continuously validates and updates its APR skill set. We evaluated our approach using the Defects4J benchmark dataset. The results show that APRAgent outperforms state-of-the-art approaches.
Abstract: Flaky tests, characterized by their non-deterministic outcomes, can compromise the efficiency of software testing by producing unreliable test failures that diminish developers' confidence in the test results and make it more challenging to verify the correctness of bug fixes. Understanding the root cause of flaky tests is beneficial because it clarifies how to address these tests within an existing test suite. Our work extends FlakyCat, an existing flaky test categorization method that uses a Siamese network alongside CodeBERT to group flaky tests based on their root causes. We have improved upon the predictive performance of FlakyCat by optimizing the model's architecture and data handling processes. Specifically, our approach, FlakyCatX, significantly enhances the model's F1-score from 0.71 to 0.89 on a dataset of flaky tests from GitHub projects.
Abstract: Runtime auto-remediation is crucial for ensuring the reliability and efficiency of distributed systems, especially within complex microservice-based applications. However, the complexity of modern microservice deployments often surpasses the capabilities of traditional manual remediation and existing autonomic computing methods. Our proposed solution harnesses large language models (LLMs) to automatically generate and execute Ansible playbooks to address issues within these complex environments. Ansible playbooks, a widely adopted YAML-based format for IT task automation, facilitate critical actions such as addressing network failures, resource constraints, configuration errors, and application bugs prevalent in managing microservices. We tune pre-trained LLMs using our custom-made Ansible-based remediation dataset, equipping these models to comprehend diverse remediation tasks within microservice environments. Once in-context tuned, these LLMs efficiently generate precise Ansible scripts tailored to the specific issues encountered, surpassing current state-of-the-art techniques with high functional correctness (95.45%) and average correctness (98.86%).
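One way to picture the generation step is a single chat call that turns an incident description into a playbook, which is then saved for validation and execution. The sketch assumes an OpenAI-compatible Python client purely for illustration; the model name, prompt, and incident are placeholders, and the real system relies on tuned models and a curated remediation dataset.

```python
# Illustrative only: ask an LLM for a remediation playbook and save it
# for later validation (e.g., lint, dry run) and execution with ansible-playbook.
from openai import OpenAI

client = OpenAI()
incident = "Pod 'checkout' is in CrashLoopBackOff because its memory limit is too low."

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[
        {"role": "system", "content": "You write valid Ansible playbooks in YAML only."},
        {"role": "user", "content": f"Write a playbook that remediates: {incident}"},
    ],
)
with open("remediation.yml", "w") as f:
    f.write(response.choices[0].message.content)
```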
Abstract: "GitHub Marketplace, launched in 2017, extends GitHub's role, offering a platform for developers to discover and utilize automation tools. This study examines GitHub Marketplace as a software marketplace, analyzing its characteristics, features, and policies, and exploring popular tools across 32 categories. The study utilizes a conceptual framework from software app stores to scrutinize 8,318 production tools. we identified clear gaps between the automation tools offered by researchers and the ones indeed being used in practice. We discovered that practitioners often use automation tools for tasks like “Continuous Integration” and “Utilities,” while researchers tend to focus more on “Code Quality” and “Testing”. Our research highlights a clear gap between research trends and industry practices. Recognizing these distinctions can aid researchers in building on existing work and guide practitioners in selecting tools aligned with their needs, fostering innovation and relevance in software production. Bridge the gap between industry and academia ensures that research remains pertinent to evolving challenges."
Abstract: It is important to include corresponding test cases when filing bug reports. Unfortunately, not all bug reports contain test cases that can be used to reproduce the issue. This causes problems for software quality, as developers may have trouble understanding and reproducing the reported issues. Existing approaches to test case reproduction from bug reports fail to generate test cases for complex scenarios that involve project-specific classes and functions. In our study, we leverage contextual information to assist large language models (LLMs) in generating the corresponding test cases from bug reports. Case studies on the Defects4J benchmark show that our approach outperforms the state-of-the-art approach in terms of test case reproduction from bug reports.
Abstract: In the dynamic landscape of software development and system operations, the demand for automating traditionally manual tasks has surged, driven by the need for continuous operation and minimal downtimes. Ansible, renowned for its scalability and modularity, emerges as a dependable solution. However, the challenge lies in creating an on-the-spot Ansible solution for dynamic auto-remediation, requiring a substantial dataset for tuning large language models (LLMs). Our research introduces KubePlaybook, a curated dataset for generating automation-focused remediation code scripts, achieving a 98.86% accuracy rate. Leveraging LLMs, our proposed solution automatically generates and executes Ansible playbooks, addressing issues in complex microservice environments. This approach combines runtime performance anomaly detection with auto-remediation, reducing downtime, enhancing system reliability, and driving higher revenue. We present a pipeline for automatic anomaly detection and remediation based on LLMs, bridging a gap often overlooked in prior works.
Abstract: Harnessing the capabilities of the Software Engineering (SWE) Agent and AutoCodeRover, our integration approach significantly enhances bug localization and automated code repair. The SWE Agent employs a Large Language Model (LLM), initially identifying bugs by reproducing them based on learned patterns and the bug's error data. If this initial attempt proves inadequate, AutoCodeRover's Spectrum-Based Fault Localization (SBFL) is activated to further pinpoint and suggest fixes for the bugs. We evaluated this integrated methodology using the SWE-bench Lite dataset, which comprises 300 real-world GitHub issues, offering a solid benchmark to test enhancements in bug detection and repair. The results showcase 1.4% improvements in both the precision of bug localization and the effectiveness of the proposed repairs, highlighting the potential of this dual-layered AI strategy to transform automated software maintenance processes.
Abstract: "Self-Admitted Technical Debt (SATD), a concept highlighting sub-optimal choices in software development documented in code comments or other project resources, poses challenges in the maintainability and evolution of software systems. Large language models (LLMs) have demonstrated significant effectiveness across a broad range of software tasks. Nonetheless, their effectiveness in tasks related to SATD is still under-researched. In this study, we investigate the efficacy of LLMs in both identification and classification of SATD. For both tasks, we investigate the performance gain from using more recent LLMs, specifically the Flan-T5 family, across different common usage settings. Our results demonstrate that for SATD identification, all fine-tuned LLMs outperform the best existing non-LLM baseline. In the SATD classification task, while our largest fine-tuned model, Flan-T5-XL, still led in performance, the CNN model exhibited competitive results, even surpassing four of six LLMs. Moreover, our experiments shows that incorporating contextual information, such as surrounding code, into the SATD classification task enables larger fine-tuned LLMs to improve their performance. Our study highlights the capabilities and limitations of LLMs for SATD tasks and the role of contextual information in achieving higher performance with larger LLMs."
Abstract: Recently, multiple benchmarks have been proposed to evaluate the code generation capability of Large Language Models (LLMs), but Python benchmarks with well-designed test cases for bug fixing are lacking. In particular, no existing benchmarks consider the rich development history data of a project. This paper validates the hypothesis that historical commit data is helpful for automatically fixing bugs by mining different dimensions of historical data from commit history and using them to empirically evaluate LLMs' potential for bug fixing. Specifically, we assess CodeLLaMa on a dataset containing 68 single-line bugs with test cases from 17 open-source software projects, seeking to understand which type of commit history data is most effective and how different prompt styles impact the bug-fix result. Based on the pass@k evaluation results, we found that historical commit data can enhance an LLM's performance in fixing real software bugs.
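The pass@k figure mentioned above is commonly computed with the unbiased estimator popularized by the Codex evaluation: with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k)/C(n, k). A small helper makes the calculation concrete (the sample counts below are arbitrary).

```python
# Unbiased pass@k estimator: n samples generated, c of them pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:                     # fewer than k failing samples: some draw must pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))              # 0.30
print(round(pass_at_k(n=10, c=3, k=5), 3))    # 0.917, any of 5 samples may pass
```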
Abstract: The trend of sharing images and the rise of image-based social networks have changed the landscape of social networks. This has also impacted social coding platforms, and previous studies showed that image sharing has become increasingly popular among software developers. To enhance issue reports, this study focuses on three primary objectives: (i) identifying Bugzilla issue reports that benefit from image attachments, (ii) identifying useful types of images, and (iii) conducting a qualitative and quantitative evaluation. The quantitative evaluation demonstrates that our tool achieves an average recall of 78% and an average F1-score of 74% in predicting the necessity of image attachments. Moreover, our qualitative evaluation with software developers showed that 75% of the developers found the recommendations of our method practically useful for issue reporting. This study, along with its associated dataset and methodology, represents the first research on enhanced issue report communication. Our results illuminate a promising trajectory for enhanced and visual productivity tools for developers.
Abstract: "Highly configurable systems enable customers to flexibly configure the systems in diverse deployment environments. The flexibility of configurations also poses challenges for performance testing. On one hand, there exist a massive number of possible configurations; while on the other hand, the time and resources are limited for performance testing, which is already a costly process during software development. Modeling the performance of configurations is one of the solutions to reduce the cost of configuration performance testing. Although prior research proposes various modeling and sampling techniques to build configuration performance models, the sampling approaches used in the model typically do not consider the accuracy of the performance models, leading to potential suboptimal performance modeling results in practice. In this paper, we present a modeling-driven sampling approach (CoMSA) to improve the performance modeling of highly configurable systems. The intuition of CoMSA is to select samples based on their uncertainties to the performance models. In other words, the configurations that have the more uncertain performance prediction results by the performance models are more likely to be selected as further training samples to improve the model. CoMSA is designed by considering both scenarios where 1) the software projects do not have historical performance testing results (cold start) and 2) there exist historical performance testing results (warm start). We evaluate the performance of our approach in four subjects, namely LRZIP, LLVM, x264, and SQLite. Through the evaluation result, we can conclude that our sampling approaches could highly enhance the accuracy of the prediction models and the efficiency of configuration performance testing compared to other baseline sampling approaches."
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing tasks. However, evaluating their performance on the critical functions of determining the need for external information retrieval and generating appropriate responses remains a challenge. We present a comprehensive benchmark designed to assess these LLM abilities. Our benchmark first evaluates an LLM's capacity to decide whether a given query requires external information retrieval, or if the model's own knowledge is sufficient to provide a response. For queries that necessitate retrieval, the benchmark then assesses the LLM's ability to generate coherent and informative responses. By testing LLMs on a diverse set of query types and knowledge domains, the benchmark provides insights on the design and development of more robust and reliable LLM-based knowledge retrieval process.
Abstract: In this work, we introduce a robust waiting strategy. Instead of waiting for a predetermined time or for the availability of a particular element, our approach waits for a desired state to be reached. This is achieved by capturing the Document Object Models (DOMs) at the desired point, followed by an offline analysis to identify the differences between the DOMs associated with every two consecutive test actions. Such differences are used to determine the appropriate waiting time when automatically generating tests.
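A minimal illustration of the idea: serialize the DOM before and after a recorded test action, diff the two snapshots offline, and wait for the element introduced by that diff rather than sleeping for a fixed duration. The snapshots and selector below are invented for the example.

```python
# Sketch: derive a state-based wait condition from the diff of two DOM snapshots.
import difflib

dom_before = "<ul id='cart'></ul><button id='add'>Add</button>"
dom_after = "<ul id='cart'><li>Item 1</li></ul><button id='add'>Add</button>"

changes = [line for line in difflib.ndiff([dom_before], [dom_after])
           if line.startswith(("+ ", "- "))]
print(changes)                      # shows the region that changed after the action

def state_reached(current_dom: str) -> bool:
    # Generated wait condition: the element added by the recorded diff is present.
    return "<li>Item 1</li>" in current_dom

print(state_reached(dom_after))     # True once the cart item has rendered
```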
Abstract: Software logs, whether they are event logs, system traces, or console logs, contain a large amount of raw runtime information. A common problem faced in the analysis of logs of large-scale, heterogeneous and parallel software systems is the interleaved sequences of events of different processes. We examine different approaches to separate those sequences to aid log analysis tasks, ranging from the usage of control flow in the source code to unsupervised learning techniques.
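When each log line carries an explicit process or thread identifier, separating the interleaved sequences is straightforward, as in the toy example below (the log format and regex are invented); the harder cases this work targets arise when no such identifier exists and control flow or learned models must be used instead.

```python
# Toy example: split an interleaved log into per-process sequences by a pid field.
import re
from collections import defaultdict

log = """\
10:00:01 [pid=12] request received
10:00:01 [pid=34] request received
10:00:02 [pid=12] query executed
10:00:03 [pid=34] response sent
10:00:03 [pid=12] response sent"""

sequences = defaultdict(list)
for line in log.splitlines():
    match = re.search(r"\[pid=(\d+)\]", line)
    if match:
        sequences[match.group(1)].append(line.split("] ", 1)[1])

for pid, events in sequences.items():
    print(pid, "->", events)
```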
Abstract: This exploratory study delves into the intricacies of leaderboards and associated systems for foundation models in the machine learning domain, highlighting their critical role in evaluating foundation models. The primary goal is to improve the understanding and functionality of leaderboard systems through a comprehensive analysis of their architecture, operations, and challenges. Using a three-stage methodology, we systematically gather multivocal literature and leaderboards, followed by an iterative review and expansion process. Our findings reveal significant insights into the distribution and categorization of leaderboards, introduce a reference architecture promoting transparency, and identify disparities in the prevalence of leaderboard components. The study also introduces "leaderboard operations" (LBOps) as a framework for managing these systems, with implications pointing towards the need for standardized evaluations to support responsible AI development. Furthermore, we explore the concept of "leaderboard smell", which refers to various operational issues that can degrade the effectiveness of leaderboard systems. The insights provided can help refine these systems for better performance and reliability, fostering more robust evaluations of foundation models.
Talk, poster and demo submissions are welcome from across the field of software engineering. Continuing the scope of prior editions, CSER encourages submissions focusing on software engineering challenges related to developing, maintaining, and operating complex software systems (e.g., Micro-services, FM/AI-based systems, ultra-large-scale systems, intelligent cyber-physical systems), as well as to the novel and emerging engineering methods, techniques and processes underneath such systems (e.g., FM/LLMs, AI, Blockchain, Quantum Computing, Data, Extended Reality).
We invite you to contribute by sending your proposals for:
Acceptance announcement: Monday, May 13, 2024.
You can follow the link below to register your ticket.
Registration type | Early Birds (By May 22nd) | Regular (After May 22nd)
---|---|---
Students | $220 (plus taxes) | $270 (plus taxes)
Non-students | $320 (plus taxes) | $370 (plus taxes)
If a CSER participant would like to attend SEMLA 2024 (Montreal, June 12-13, https://semla.polymtl.ca/), a promo code for attending SEMLA is:
CSER-2024S
(30% off for all ticket types)