Interacting with ChatGPT can sometimes feel like conversing with a single person with a consistent personality; at other times, it can feel like talking to several different people. The contrast became apparent to me when I recently compared essays written by ChatGPT with essays written by human authors. As a linguist, I know that every individual has a unique way of using language, known as an “idiolect,” which varies with factors such as native language, age, gender, and education.
In forensic linguistics, the application of linguistic analysis to legal and investigative questions, idiolects play a crucial role in determining authorship, detecting plagiarism, and profiling a writer's linguistic background. With the rise of large language models (LLMs) like ChatGPT, concerns have been raised that students may rely on these tools for writing assignments. To address these concerns, I set out to investigate whether LLMs have idiolects of their own.
To analyze the language produced by ChatGPT and similar AI models such as Gemini and Copilot, I compared their writing styles using datasets of texts on a single topic: diabetes. Applying techniques from computational stylistics, such as the Delta method introduced by John Burrows, which compares how often texts use the most frequent words, I found that ChatGPT and Gemini do have distinct writing styles, as shown by the linguistic “distances” between their texts.
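For readers curious how such a “distance” is computed, here is a minimal Python sketch of Burrows's Delta. It is an illustration under stated assumptions, not the pipeline used in the study: the function names, the simple regex tokenizer, the 50-word cutoff, and the use of the population standard deviation are all choices made for this example. The idea is to take the most frequent words in the whole corpus, convert each text's relative frequency of each word into a z-score, and average the absolute differences between two texts' z-scores.

```python
import re
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def burrows_delta(texts, n_words=50):
    """Return {(label_a, label_b): delta} for every pair of labelled, non-empty texts."""
    tokens = {label: tokenize(text) for label, text in texts.items()}

    # Vocabulary: the n_words most frequent words across the whole corpus.
    corpus = Counter()
    for toks in tokens.values():
        corpus.update(toks)
    vocab = [word for word, _ in corpus.most_common(n_words)]

    # For each word, z-score every text's relative frequency against the corpus.
    z_scores = {label: [] for label in tokens}
    for word in vocab:
        rel = {label: toks.count(word) / len(toks) for label, toks in tokens.items()}
        values = list(rel.values())
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # the word is used identically everywhere, so it carries no signal
        for label in tokens:
            z_scores[label].append((rel[label] - mu) / sigma)

    # Delta is the mean absolute difference between two texts' z-scores.
    deltas = {}
    for a, b in combinations(tokens, 2):
        diffs = [abs(x - y) for x, y in zip(z_scores[a], z_scores[b])]
        deltas[(a, b)] = mean(diffs) if diffs else 0.0
    return deltas
```

Calling `burrows_delta({"chatgpt": text_1, "gemini": text_2, "human": text_3})` with your own texts returns one distance per pair of labels. Smaller values mean more similar usage of the most frequent words, and the numbers only become meaningful with texts much longer than a few sentences.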
Examining trigrams, or sequences of three consecutive words, further revealed the differences between the idiolects of ChatGPT and Gemini. ChatGPT favored a formal, academic style, with phrases like “blood glucose levels,” while Gemini's trigrams were more conversational and explanatory, with simpler wording such as “high blood sugar.” These differences in word choice suggest that LLMs do have idiolects, which may arise from the principle of least effort, priming effects, and emergent abilities.
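The trigram comparison can be sketched in a few lines of Python. This is a simplified illustration rather than the analysis behind the findings above: the function names are hypothetical, the tokenizer is a bare regex, and trigrams are allowed to run across sentence boundaries.

```python
import re
from collections import Counter


def word_trigrams(text):
    """Count every sequence of three consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Zipping the token list with two shifted copies yields overlapping triples.
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def distinctive_trigrams(text_a, text_b, top=5):
    """Return the trigrams that text_a uses more often than text_b, most frequent first."""
    # Counter subtraction keeps only the trigrams with a positive surplus in text_a.
    surplus = word_trigrams(text_a) - word_trigrams(text_b)
    return [" ".join(trigram) for trigram, _ in surplus.most_common(top)]
```

Run on two collections of texts about the same topic, `distinctive_trigrams(collection_a, collection_b)` lists the three-word phrases that the first collection uses more often than the second, which is the kind of contrast illustrated by “blood glucose levels” versus “high blood sugar.”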
Recognizing these idiolects is essential for assessing what LLMs can do and how close they come to human-level intelligence. By identifying the lexical and grammatical patterns that characterize each model's writing, we can distinguish content generated by these models from content produced by human authors. This awareness can also help in attributing authorship and verifying the authenticity of written material in a range of contexts.
The presence of idiolects in LLMs sheds light on their evolving linguistic abilities and raises important questions about the intersection of artificial and human intelligence. As these models continue to be developed and updated, their distinctive writing styles offer insights into the complexities of language generation and into the potential for AI to mimic and adapt to human communication patterns.