Detecting Machine-Generated Code: Unveiling Patterns in AI-Generated Programming

Published July 2, 2025

The rise of large language models (LLMs) like Codex and ChatGPT has transformed how we generate code. These models have accelerated software development, but they also blur the line between human- and machine-generated code, raising concerns about code authenticity and integrity. In our latest paper, "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers," we dive deep into this issue, offering insights and a new solution to detect machine-generated code.

Key Insights:

While previous detection methods such as DetectGPT have been successful at identifying machine-generated text, they struggle with code because of its strict syntactic structure. Our research explores the distinct characteristics of human- and machine-authored code, analyzing aspects such as the following (a toy measurement sketch appears after the list):

  • Lexical Diversity: Machines use a narrower spectrum of tokens, whereas human-written code tends to be more diverse.
  • Conciseness: Machines often produce concise code, while humans include more identifiers and comments.
  • Naturalness: Surprisingly, machine-generated code can appear more "natural" than human code (i.e., it can have lower perplexity under a language model) in certain scenarios, making it harder to detect using traditional methods.
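
To make these properties concrete, here is a toy sketch (not the paper's actual metrics, which are more refined) of how lexical diversity and comment density could be measured for a Python snippet using only the standard library:

```python
# Toy sketch: estimating lexical diversity and comment density for a Python
# snippet with the standard library only. The paper's actual metrics are more
# refined; this is just to illustrate the kind of signal involved.
import io
import tokenize

def lexical_stats(code: str) -> dict:
    """Return the distinct-token ratio and comment share of a code snippet."""
    tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))
    words = [t.string for t in tokens
             if t.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)]
    comments = [t for t in tokens if t.type == tokenize.COMMENT]
    return {
        "distinct_token_ratio": len(set(words)) / max(len(words), 1),  # lexical diversity
        "comment_ratio": len(comments) / max(len(tokens), 1),          # often higher in human code
    }

print(lexical_stats("def add(a, b):\n    # sum two numbers\n    return a + b\n"))
```

On typical samples, human-written code tends to show a higher comment ratio and a wider spread of identifiers than machine-generated code.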

Introducing DetectCodeGPT:

Building on these insights, we've developed DetectCodeGPT, a novel method that goes beyond existing perturbation-based approaches. Rather than perturbing code the way text detectors do, it targets the stylistic patterns identified above: by strategically perturbing the code's stylistic elements (such as spaces and newlines) and observing how the model's scoring of the code changes, we significantly improve detection accuracy while maintaining computational efficiency.
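
As a rough illustration of the perturbation idea (a minimal sketch, not the authors' implementation: the perturbation strategy, the scoring formula, and the choice of scoring model below are all simplifications or assumptions), the core loop looks roughly like this:

```python
# Minimal sketch of a whitespace-perturbation detector in the spirit of
# DetectCodeGPT. Assumptions: the perturbation strategy and scoring formula are
# simplified, and the scoring model below is an arbitrary choice; see the
# paper/repo for the real method and hyperparameters.
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "Salesforce/codegen-350M-mono"  # assumption: any causal code LM can score likelihoods
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def log_likelihood(code: str) -> float:
    """Average per-token log-likelihood of `code` under the scoring model."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

def perturb_whitespace(code: str, p: float = 0.1) -> str:
    """Randomly insert extra spaces or newlines at existing whitespace positions."""
    out = []
    for ch in code:
        out.append(ch)
        if ch in " \n" and random.random() < p:
            out.append(random.choice([" ", "\n"]))
    return "".join(out)

def detection_score(code: str, n_perturbations: int = 20) -> float:
    """Higher score -> more likely machine-generated: the original likelihood
    stands out above the likelihoods of its whitespace-perturbed variants."""
    original = log_likelihood(code)
    perturbed = [log_likelihood(perturb_whitespace(code)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)
```

The intuition, inherited from perturbation-based text detectors, is that machine-generated code tends to sit near a likelihood peak of the scoring model, so stylistic perturbations lower its likelihood more than they lower that of human-written code.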

Results:

Our experiments demonstrate that DetectCodeGPT outperforms state-of-the-art methods, improving detection AUC by 7.6%. Whether you're working on a software development team or researching AI and code generation, this tool could be a game-changer for maintaining code integrity and ensuring the authenticity of software artifacts.
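
For context on how such a number is computed (illustrative only; the paper's figures come from the authors' own benchmark, and the scores below are made up), AUC is obtained by scoring known human and machine samples and measuring how well the scores separate the two classes:

```python
# Illustrative AUC computation over hypothetical detector scores.
from sklearn.metrics import roc_auc_score

human_scores = [0.12, -0.30, 0.05]    # hypothetical scores for human-written code
machine_scores = [0.85, 0.64, 0.91]   # hypothetical scores for machine-generated code

labels = [0] * len(human_scores) + [1] * len(machine_scores)
scores = human_scores + machine_scores
print("AUC:", roc_auc_score(labels, scores))  # 1.0 here, since the toy scores separate perfectly
```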

To explore the full details of our research and try out DetectCodeGPT:

ICSE 2025 Paper: https://arxiv.org/html/2401.06461v2

Code: https://github.com/YerbaPage/DetectCodeGPT

Reference:

[1] Yuling Shi, Hongyu Zhang, Chengcheng Wan, and Xiaodong Gu. Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers. In Proceedings of the 47th International Conference on Software Engineering (ICSE 2025).
