Awesome-Data-centric-SE-Papers

Year-Id	Title	Venue Name
2023-3	Textbooks Are All You Need	arXiv
2023-2	Data Quality Matters: A Case Study of Obsolete Comment Detection	ICSE
2023-1	Data Quality for Software Vulnerability Datasets	ICSE
2022-3	On the Importance of Building High-quality Training Datasets for Neural Code Search	ICSE
2022-2	Data smells in public datasets	CAIN
2022-1	Data smells: categories, causes and consequences, and detection of suspicious data in AI-based systems	CAIN
2021-1	Data Quality Matters: A Case Study on Data Label Correctness for Security Bug Report Prediction	TSE

SE Dataset Papers

Year-Id	Title	Venue Name
2023-24	Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata.	ESE
2023-23	JEMMA: An extensible Java dataset for ML4Code applications.	ESE
2023-22	The software heritage license dataset (2022 edition).	ESE
2023-21	Data Quality for Software Vulnerability Datasets.	ICSE
2023-20	On the Reproducibility of Software Defect Datasets.	ICSE
2023-19	An Automated and Flexible Multilingual Bug-Fix Dataset Construction System.	ASE
2023-18	BugMiner: Automating Precise Bug Dataset Construction by Code Evolution History Mining.	ASE
2023-17	Compsuite: A Dataset of Java Library Upgrade Incompatibility Issues.	ASE
2023-16	DeepScenario: An Open Driving Scenario Dataset for Autonomous Driving System Testing.	MSR
2023-15	PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages.	MSR
2023-14	NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python.	MSR
2023-13	microSecEnD: A Dataset of Security-Enriched Dataflow Diagrams for Microservice Applications.	MSR
2023-12	SecretBench: A Dataset of Software Secrets.	MSR
2023-11	Defectors: A Large, Diverse Python Dataset for Defect Prediction.	MSR
2023-10	DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories.	MSR
2023-9	DACOS - A Manually Annotated Dataset of Code Smells.	MSR
2023-8	A Dataset of Bot and Human Activities in GitHub.	MSR
2023-7	Snapshot Testing Dataset.	MSR
2023-6	LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations.	MSR
2023-5	GitHub OSS Governance File Dataset.	MSR
2023-4	CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models.	FSE
2023-3	npm-follower: A Complete Dataset Tracking the NPM Ecosystem.	FSE
2023-2	Improving Fine-tuning Pre-trained Models on Small Source Code Datasets via Variational Information Bottleneck.	SANER
2023-1	CILIATE: Towards Fairer Class-Based Incremental Learning by Dataset and Training Refinement.	ISSTA
2022-33	A large-scale empirical study of commit message generation: models, datasets and evaluation.	ESE
2022-32	Making the Most of Small Software Engineering Datasets With Modern Machine Learning.	TSE
2022-31	An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets.	TOSEM
2022-30	Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones via Deep Learning.	TOSEM
2022-29	On the Importance of Building High-quality Training Datasets for Neural Code Search.	ICSE
2022-28	Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets.	ASE
2022-27	Towards Robust Models of Code via Energy-Based Learning on Auxiliary Datasets.	ASE
2022-26	Which bugs are missed in code reviews: An empirical study on SmartSHARK dataset.	MSR
2022-25	An Alternative Issue Tracking Dataset of Public Jira Repositories.	MSR
2022-24	ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction.	MSR
2022-23	ReCover: a Curated Dataset for Regression Testing Research.	MSR
2022-22	DISCO: A Dataset of Discord Chat Conversations for Software Engineering Research.	MSR
2022-21	SOSum: A Dataset of Stack Overflow Post Summaries.	MSR
2022-20	ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference.	MSR
2022-19	METHODS2TEST: A dataset of focal methods mapped to test cases.	MSR
2022-18	The Unsolvable Problem or the Unheard Answer? A Dataset of 24,669 Open-Source Software Conference Talks.	MSR
2022-17	DaSEA - A Dataset for Software Ecosystem Analysis.	MSR
2022-16	GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research.	MSR
2022-15	AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information.	MSR
2022-14	TriggerZoo: A Dataset of Android Applications Automatically Infected with Logic Bombs.	MSR
2022-13	Vul4J: A Dataset of Reproducible Java Vulnerabilities Geared Towards the Study of Program Repair Techniques.	MSR
2022-12	TwinDroid: A Dataset of Android app System call traces and Trace Generation Pipeline.	MSR
2022-11	Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques.	MSR
2022-10	A Time Series-Based Dataset of Open-Source Software Evolution.	MSR
2022-9	A Versatile Dataset of Agile Open Source Software Projects.	MSR
2022-8	FixJS: A Dataset of Bug-fixing JavaScript Commits.	MSR
2022-7	A Large-scale Dataset of (Open Source) License Text Variants.	MSR
2022-6	Lighting up supervised learning in user review-based code localization: dataset and benchmark.	FSE
2022-5	Python-by-contract dataset.	FSE
2022-4	RegMiner: mining replicable regression dataset from code repositories.	FSE
2022-3	PANDORA: Continuous Mining Software Repository and Dataset Generation.	SANER
2022-2	CoolTeD: A Web-based Collaborative Labeling Tool for the Textual Dataset.	SANER
2022-1	RegMiner: towards constructing a large regression dataset from code evolution history.	ISSTA
2021-16	Are datasets for information retrieval-based bug localization techniques trustworthy?	ESE
2021-15	GreenHub: a large-scale collaborative dataset to battery consumption analysis of android devices.	ESE
2021-14	AndroidCompass: A Dataset of Android Compatibility Checks in Code Repositories.	MSR
2021-13	Duets: A Dataset of Reproducible Pairs of Java Library-Clients.	MSR
2021-12	KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle.	MSR
2021-11	A Traceability Dataset for Open Source Systems.	MSR
2021-10	The Wonderless Dataset for Serverless Computing.	MSR
2021-9	Andromeda: A Dataset of Ansible Galaxy Roles and Their Evolution.	MSR
2021-8	ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference.	MSR
2021-7	QScored: A Large Dataset of Code Smells and Quality Metrics.	MSR
2021-6	Apache Software Foundation Incubator Project Sustainability Dataset.	MSR
2021-5	Andror2: A Dataset of Manually-Reproduced Bug Reports for Android apps.	MSR
2021-4	GE526: A Dataset of Open-Source Game Engines.	MSR
2021-3	EqBench: A Dataset of Equivalent and Non-equivalent Program Pairs.	MSR
2021-2	CrossVul: a cross-language vulnerability dataset with commit data.	FSE
2021-1	Is the Ground Truth Really Accurate? Dataset Purification for Automated Program Repair.	SANER
2020-18	A Framework and DataSet for Bugs in Ethereum Smart Contracts.	ICSME
2020-17	Defining a Software Maintainability Dataset: Collecting, Aggregating and Analysing Expert Evaluations of Software Maintainability.	ICSME
2020-16	Towards Robust Production Machine Learning Systems: Managing Dataset Shift.	ASE
2020-15	The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History.	MSR
2020-14	An Exploratory Study to Find Motives Behind Cross-platform Forks from Software Heritage Dataset.	MSR
2020-13	RTPTorrent: An Open-source Dataset for Evaluating Regression Test Prioritization.	MSR
2020-12	A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries.	MSR
2020-11	A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits.	MSR
2020-10	A Dataset for GitHub Repository Deduplication.	MSR
2020-9	A Dataset of Dockerfiles.	MSR
2020-8	A Dataset of Enterprise-Driven Open Source Software.	MSR
2020-7	A Mixed Graph-Relational Dataset of Socio-technical Interactions in Open Source Systems.	MSR
2020-6	On the Shoulders of Giants: A New Dataset for Pull-based Development Research.	MSR
2020-5	Dataset of Video Game Development Problems.	MSR
2020-4	GitterCom: A Dataset of Open Source Developer Communications in Gitter.	MSR
2020-3	How Often Do Single-Statement Bugs Occur?: The ManySStuBs4J Dataset.	MSR
2020-2	TestRoutes: A Manually Curated Method Level Dataset for Test-to-Code Traceability.	MSR
2020-1	Cross-Dataset Design Discussion Mining.	SANER

SEEDProtector

SEEDPoisoner

Paper Id	Title	Venue	Year	Target Task	Task Description	Used Data	Used LLMs
1	Backdooring Neural Code Search	ACL	2023	Code search	Given a natural language description (query), return related code snippets from a large code corpus.	CodeSearchNet	CodeBERT, CodeT5
2	Multi-target Backdoor Attacks for Code Pre-trained Models	ACL	2023	Defect detection	Predict whether the input code is vulnerable or not.	CodeXGLUE	PLBART, CodeT5
				Clone detection	Predict whether two programs are semantically equivalent.
				Code2Code translation	Translate a piece of Java (C#) code into its C# (Java) version.
				Text2Code	Generate the source code of class member functions in Java given the natural language description as well as the class context.
				Code refinement	Fix a piece of buggy Java code and generate its refined version.
3	CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning	WWW	2022	Code generation	Generate source code based on a natural language description.	CodeSearchNet	DeepCS, GPT-2, NCS-T
				Code search	Retrieve the related code snippets from a codebase given a natural language query.
				Code summarization	Summarize the code snippet into a summary sentence that describes its functionality.
4	You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search	FSE	2022	Code search	Input: a natural language description (query); output: related code snippets from a large code corpus.	CodeSearchNet	BiRNN, CodeBERT
5	You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion	USENIX Security	2021	Code completion	Input: the previous k tokens; output: the next token.	2800 repositories from GitHub	GPT-2

SEEDInspector

SEEDConsistencyChecker

Year-Id	Title	Venue Name
2023-1	Inconsistent Defect Labels: Essence, Causes, and Influence	TSE
2021-1	Deep Just-In-Time Inconsistency Detection Between Comments and Source Code	AAAI
2019-1	A Large-Scale Empirical Study on Code-Comment Inconsistencies	ICPC

SEEDAugmentor
