
Detecting AI-Generated (LLM) Content In Articles

We hear more and more about the pros and cons of AI. There is a movement to regulate its use, movies about the dangers of sentient robots, and those who think AI will free humanity from boring work, or work that involves a lot of repetitive tasks. Going back to the 1950s and 1960s, what they called AI (or what we might call small shell scripts these days) was supposed to “Augment Human Intellect,” as the great Doug Engelbart wrote about in his 1962 report https://www.dougengelbart.org/content/view/138, or as Vannevar Bush described in “As We May Think” from 1945, available at https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/.

What is an LLM?

A large language model (LLM) is a type of artificial intelligence (AI) that can generate text, translate languages, write different kinds of creative content, and answer questions in an informative way. LLMs are trained on massive datasets of text and code, which allows them to learn the patterns of human language and design patterns used in writing code. This makes them very good at generating text that is similar to human-written text.

LLMs are still under development, but they have the potential to revolutionize the way we interact with computers. They can be used to create new and innovative products and services, and they can also be used to improve the quality of existing products and services, provided there’s enough of a data set for given domains to do so.

Here are some of the potential benefits of LLMs:

  • They can be used to create new and innovative products and services. For example, LLMs can be used to create new chatbots that can provide customer service, or to create new educational tools that can help students learn new concepts.
  • They can be used to improve the quality of existing products and services. For example, LLMs can be used to improve the accuracy of machine translation, or to improve the quality of search results.
  • They can be used to make computers more accessible to people with disabilities. For example, LLMs can be used to create text-to-speech software that can help people who are blind or visually impaired.

Yet LLMs have plenty of limitations, which include:

  • They can be inaccurate or misleading. LLMs are trained on massive datasets of text, but this does not mean that they are always accurate or reliable. LLMs can sometimes generate content that is inaccurate or misleading, especially if the training data is biased or incomplete.
  • They can plagiarize. LLMs can sometimes generate content that is plagiarized from other sources. This is because LLMs are trained on massive datasets of text, and they may not always be able to distinguish between original and plagiarized content.
  • They can be offensive or harmful. LLMs can sometimes generate content that is offensive or harmful. This is because LLMs are trained on massive datasets of text, and they may not always be able to distinguish between appropriate and inappropriate content.

Here are some tips for using LLMs safely:

  • Only use LLM generated content from reputable sources. There are many LLMs available, and not all of them are created equal. Use LLMs from reputable sources that have a good track record of generating accurate and reliable content.
  • Verify the accuracy and reliability of the content. Before using LLM generated content, verify its accuracy and reliability. This can be done by checking the source of the content, cross-referencing it with other sources, and using your own judgment.
  • Be aware of the potential dangers. It is important to be aware of the potential dangers of using LLM generated content. If you are unsure about whether or not to use LLM generated content, it is always best to err on the side of caution.

How to Determine If an Article or Script Was Written by an LLM

The massive datasets of text and code used to train LLMs can make it a tad bit easier to find content generated by LLMs. They are good at generating text that is similar to human-written text. However, there are some telltale signs that can help you determine if an article or script was written by an LLM. Here are a few things to look for:

  • Use of the passive voice: LLMs tend to use the passive voice more often than human writers. This is because the passive voice follows more easily codified patterns and so is often easier for LLMs to generate.
  • Use of generic language: LLMs tend to use generic language more often than humans. This is because they are not as good at generating specific and detailed language.
  • Use of simple sentences: LLMs tend to use simple sentences more often than human writers. This is because they are not as good at generating complex sentences.
  • Lack of citations: LLMs often do not cite their sources. This is because they are not as good at understanding the importance of citations.
  • Lack of original research: LLMs often do not conduct original research. This is because they are not as good at understanding the research process.

If you see a significant number of these patterns in an article or script, it is likely that it was written by an LLM. However, it is important to note that not all articles or scripts that contain these patterns were written by LLMs. Some may contain these patterns simply because they were written by inexperienced or generic writers. My publishers have asked me to use many of the same attributes that LLMs use, like shorter and simpler sentences, and not delving too deep into original research. The trend toward shorter, more generic content is one of the reasons LLMs have learned to write content that way.

Finding LLM-generated Content

We want to augment human intellect, but we want to do so safely. Part of this is not allowing content and/or code to escape our environment without being analyzed by a human. We can then apply a simplistic, programmatic approach to determine whether various forms of content were automagically generated. For example, the following is a script to get started if you want to detect whether an LLM was used to write an article:

import re

def is_llm_article(article):
  # Check for common LLM patterns
  patterns = [
    re.compile(r"This article was generated by an LLM."),
    re.compile(r"This article was created by a large language model."),
    re.compile(r"This article was written by an AI."),
  ]

  for pattern in patterns:
    if pattern.search(article):
      return True
  return False

if __name__ == "__main__":
  article = """
This article was generated by an LLM.
"""

  if is_llm_article(article):
    print("This article was written by an LLM.")
  else:
    print("This article was not written by an LLM.")

This script will check for common patterns that are found in LLM-generated articles. If any of these patterns are found, the script will return True, indicating that the article was likely written by an LLM. Otherwise, the script will return False. On a simple data set, it was about half right. But it was written by an LLM and so with no work we got to 50% accuracy.
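
If you want to sanity-check a detector like this one, a minimal sketch along these lines can score it against a small, hand-labeled sample. The articles and labels below are hypothetical placeholders; in practice you would plug in your own labeled data.

def measure_accuracy(detector, labeled_articles):
  # labeled_articles is a list of (article_text, was_llm_generated) pairs.
  correct = sum(1 for text, label in labeled_articles if detector(text) == label)
  return correct / len(labeled_articles)

# Hypothetical hand-labeled sample; swap in your own data.
sample = [
  ("This article was generated by an LLM.", True),
  ("I wrote this one myself over coffee.", False),
]
print("Accuracy:", measure_accuracy(is_llm_article, sample))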

Here are some additional patterns that could be added to the script:

  • Use of the passive voice
  • Use of generic language
  • Use of simple sentences
  • Lack of citations
  • Lack of original research

Sentence Complexity

Each one of these can be broken into its own script, with a weight applied for each aspect. While some articles may contain these patterns simply because they were written by writers who leverage a similar design pattern for their content, we can expand our easily written script into one that takes text from an article as input, assigns a complexity value to sentence structures, and outputs a number that represents the average complexity value for sentences:

import re

def get_sentence_complexity(sentence):

  # Count the number of words in the sentence.
  word_count = len(sentence.split())

  # Count the number of subordinate clause markers in the sentence.
  subordinate_clause_count = len(re.findall(r"\b(that|which|who|whom|whose|when|where|why|how)\b", sentence, re.IGNORECASE))

  # Calculate the complexity score.
  complexity_score = word_count + subordinate_clause_count
  return complexity_score

def get_average_sentence_complexity(article_text):

  # Split the text into sentences, dropping empty fragments.
  sentences = [s for s in article_text.split(".") if s.strip()]

  # Calculate the average complexity score.
  average_complexity_score = sum(get_sentence_complexity(sentence) for sentence in sentences) / len(sentences)
  return average_complexity_score

if __name__ == "__main__":
  article_text = """
  The quick brown fox jumps over the lazy dog.
  The dog saw the fox and ran away.
  The cat sat on the mat.
  """

  average_complexity_score = get_average_sentence_complexity(article_text)
  print("The average sentence complexity is", average_complexity_score)

This script first splits the text into sentences. Then, it calculates the complexity score for each sentence. Finally, it calculates the average complexity score across all of the sentences.

Passivity

Let’s expand the scope yet again. The complexity score is a measure of how difficult a sentence is to understand. A sentence with a high complexity score will be more difficult to understand than a sentence with a low complexity score. This is one of the many ways publishers told me to write, and so the LLM follows the design pattern set by modern Google crawling technologies and writes to be more appealing for that, rather than for a depth of understanding. We can also write a script that takes a text document as an input and provides a score for how passive the voice is on a scale of 1 to 10. This script works by first creating a spaCy document from the text. spaCy is a natural language processing library that can be used to analyze text. Once the document is created, the script finds all the passive voice sentences in the document. A sentence is considered to be in passive voice if the subject of the sentence is acted upon by the verb. For example, the sentence “The ball was thrown by the boy” is in passive voice, because the subject of the sentence (“ball”) is acted upon by the verb (“thrown”).

import spacy

def get_passive_voice_score(text):
  """
  This function takes a blob of text as an input and provides a score for how passive the voice is on a scale of 1 to 10.
  Args:
    text: A string containing the text to analyze.
  Returns:
    A score for how passive the voice is on a scale of 1 to 10.
  """

  # Load the English model and create a spaCy document from the text.
  nlp = spacy.load("en_core_web_sm")
  doc = nlp(text)
  sentences = list(doc.sents)
  if not sentences:
    return 0.0

  # Find all the passive voice sentences. A sentence counts as passive if it
  # contains a passive subject (nsubjpass) or passive auxiliary (auxpass).
  passive_voice_sentences = []
  for sent in sentences:
    if any(token.dep_ in ("nsubjpass", "auxpass") for token in sent):
      passive_voice_sentences.append(sent)

  # Calculate the score: the share of passive sentences, scaled to 0-10.
  score = len(passive_voice_sentences) / len(sentences)
  score = round(score * 10, 1)
  return score

Once all the passive voice sentences have been found, the script calculates the score. The score is calculated by dividing the number of passive voice sentences by the total number of sentences in the document, scaling the result to a 1-to-10 range, and rounding it to one decimal place.

For example, if the text contains 10 sentences and 5 of them are in passive voice, the ratio would be 0.5 and the score would be 5.0. This means that the text has a moderate amount of passive voice.

The score can be used to help writers identify areas where they can improve their writing style, like how Grammarly uses similar technologies in their products. For example, if a writer’s text has a high score, they may want to try to rewrite some of the sentences in active voice. Active voice is generally considered to be more concise and engaging than passive voice.
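
As a quick usage sketch, assuming the en_core_web_sm model is installed (for example via python -m spacy download en_core_web_sm) and the function above is in scope:

sample_text = (
  "The ball was thrown by the boy. "
  "The window was broken by the storm. "
  "The boy ran home."
)
print("Passive voice score:", get_passive_voice_score(sample_text))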

Detecting Generic Language

To expand the scope a little further, let’s look for generic language using a script. The below python script works by first creating a spaCy document from the file. spaCy is a natural language processing library that can be used to analyze text. Once the document is created, the script finds all the generic words in the document. A word is counted as generic if it is a stop word or appears in a small list of generic terms. Stop words are words that are commonly used in everyday language, such as “the”, “is”, and “and”. Generic words are words that have a general meaning and do not refer to anything specific, such as “thing” and “place”.

import spacy

def get_generic_language_score(file_path):
  """
  This function takes a file path as an input and provides a score for how generic the language is on a scale of 1 to 10.
  Args:
    file_path: The path to a file that contains text.
  Returns:
    A score for how generic the language is on a scale of 1 to 10.
  """

  # Words with a general meaning that do not refer to anything specific.
  generic_terms = {"thing", "things", "stuff", "place", "places", "item", "items"}

  # Load the English model and create a spaCy document from the file.
  nlp = spacy.load("en_core_web_sm")
  with open(file_path, "r") as handle:
    doc = nlp(handle.read())

  # Find all the generic words: stop words plus the generic terms above.
  words = [token for token in doc if token.is_alpha]
  if not words:
    return 0.0
  generic_count = sum(1 for token in words if token.is_stop or token.lower_ in generic_terms)

  # Calculate the score: the share of generic words, scaled to 0-10.
  score = generic_count / len(words)
  score = round(score * 10, 1)
  return score
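
As a quick usage sketch, with sample.txt standing in as a placeholder path for whatever file you want to score:

# Write a small placeholder file to score; in practice, point this at a real article.
with open("sample.txt", "w") as handle:
  handle.write("There are many things and stuff in this place. The thing is over there.")

print("Generic language score:", get_generic_language_score("sample.txt"))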

Citations

We can also analyze whether citations are present and, if so, how many there are. We’ll again begin by creating a spaCy document from the text. spaCy is a natural language processing library that can be used to analyze text. Once the document is created, the script finds all the citations in the document. A citation is considered to be a sentence that contains a reference to another work. For example, the sentence “This work is based on the research of Edge (2023)” contains a citation to the work of Edge (2023).

Once all the citations have been found, the script counts the number of citations. The number of citations is then returned.

For example, if the text contains 10 sentences and 2 of them are citations, the count would be 2. This means that the text contains 2 citations.

import re
import spacy

def get_citation_count(text):
  """
  This function takes a text blob as an input and provides a count of the number of citations.
  Args:
    text: A text blob.
  Returns:
    A count of the number of citations.
  """

  # Author-year references such as "Edge (2023)" or "(Edge, 2023)".
  citation_pattern = re.compile(r"\b[A-Z][A-Za-z]+\s*\(\d{4}\)|\([A-Z][A-Za-z]+,?\s+\d{4}\)")

  # Load the English model and create a spaCy document from the text.
  nlp = spacy.load("en_core_web_sm")
  doc = nlp(text)

  # Find all the sentences that contain a citation.
  citations = []
  for sent in doc.sents:
    if citation_pattern.search(sent.text):
      citations.append(sent)

  # Count the number of citations.
  citation_count = len(citations)
  return citation_count
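
A quick usage sketch, reusing the example sentence from above:

sample_text = "This work is based on the research of Edge (2023). The rest of this paragraph cites nothing."
print("Citations found:", get_citation_count(sample_text))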

The count can be used to help writers identify areas where they can improve their writing style (especially if that writing is academic in nature), but also to detect LLMs. For example, if a writer’s text has a low citation count, they may want to try to cite more sources. Citing sources is important for academic writing, as it helps to establish the credibility of the work.

Original Research

We can then build on the citation detection above to estimate whether the text contains original research. I feel less confident about the accuracy of this approach without more training, but it’s a start. The script finds all the sentences that contain new information by making an assumption that if a sentence does not contain a citation and is not a common knowledge statement, then it is net-new content. Common knowledge statements are statements that are generally known to be true, such as “The Earth is round” and “The sky is blue”.

Once all the sentences that contain new information have been found, the script counts the number of sentences. The number of sentences that contain new information is then divided by the number of sentences that do not contain citations. This gives the score. A score of 0.5 or higher indicates that the text contains original research. 

import re
import spacy

def is_original_research(text):
  """
  This function takes a text blob as an input and determines whether the text contains original research.
  Args:
    text: A text blob.
  Returns:
    A boolean value indicating whether the text contains original research.
  """

  # Author-year references such as "Edge (2023)" or "(Edge, 2023)".
  citation_pattern = re.compile(r"\b[A-Z][A-Za-z]+\s*\(\d{4}\)|\([A-Z][A-Za-z]+,?\s+\d{4}\)")

  # A tiny illustrative list of common knowledge statements; a real version would need a much larger knowledge source.
  common_knowledge = {"the earth is round", "the sky is blue"}

  # Load the English model and create a spaCy document from the text.
  nlp = spacy.load("en_core_web_sm")
  doc = nlp(text)
  sentences = list(doc.sents)

  # Find all the sentences that contain a citation.
  citations = [sent for sent in sentences if citation_pattern.search(sent.text)]
  citation_count = len(citations)

  # Find all the sentences that contain new information: no citation and not common knowledge.
  new_information_sentences = [
    sent for sent in sentences
    if not citation_pattern.search(sent.text)
    and sent.text.strip().lower().rstrip(".") not in common_knowledge
  ]
  new_information_count = len(new_information_sentences)

  # Calculate the score, guarding against division by zero.
  non_citation_count = max(len(sentences) - citation_count, 1)
  score = new_information_count / non_citation_count

  # Return a boolean value indicating whether the text contains original research.
  return score >= 0.5

A score of less than 0.5 indicates that the text does not contain original research. If a document has a low score, writers may want to try to include more new information in their writing in general. More specifically, the ability to get pre-written information from GPT or another LLM means we have time to augment human intellect with more (just as the Sumerians augmented human intellect with what we might consider the written word, and computers augmented human intellect by providing us with access to, and analysis of, larger troves of data).

So in the above examples, we’ve looked at the list of attributes and created scripts to isolate each. Each could be a standalone microservice that passes a json document to the next service along with a variable score from the previous tests, like:

{
  "text": "This is a text blob.",
  "scores": [1, 2.2, 5, 2, 9]
}
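
As a hedged sketch of what one of those microservices might look like, a minimal Flask service could accept the JSON document, append its own score, and return the updated document for the next service in the chain. The route name, port, and the idea of wiring in get_passive_voice_score here are illustrative assumptions rather than a finished design:

from flask import Flask, jsonify, request

# from passivity import get_passive_voice_score  # hypothetical module name for the earlier script

app = Flask(__name__)

@app.route("/score", methods=["POST"])
def score():
  # Read the JSON document passed along by the previous service.
  document = request.get_json()

  # Append this service's score (here, the passive voice score) to the list.
  scores = document.get("scores", [])
  scores.append(get_passive_voice_score(document["text"]))
  document["scores"] = scores

  # Hand the updated document back so it can be passed to the next service.
  return jsonify(document)

if __name__ == "__main__":
  app.run(port=5000)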

To average those numbers, we could use a script like the following:

import json

def average_numbers_in_array(json_document):
  """
  This function takes a JSON document as an input and averages the numbers in the "scores" field.
  Args:
    json_document: A JSON document as a string.
  Returns:
    A float representing the average of the numbers in the "scores" field.
  """

  # Parse the JSON document and get the "scores" field.
  numbers = json.loads(json_document)["scores"]

  # Calculate the average of the numbers.
  average = sum(numbers) / len(numbers)
  return average
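
A quick usage sketch, feeding in the example document from above:

document = '{"text": "This is a text blob.", "scores": [1, 2.2, 5, 2, 9]}'
print("Average score:", average_numbers_in_array(document))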

Once all is said and done, and the scripts have been strung together, we have a score. The findings can be wildly different. Some of the articles I’ve written over the years certainly return as suspected LLM content, but when run against content I know is LLM-generated, I’m up from a 50% rate to closer to 90%: within the threshold to make business decisions, but not within the threshold to make assurances that involve loss of life or risky stock trades (I’m not sure I’ll get there with the complexity involved in most any derivative market tbh).

The Dangers of Using LLM Generated Content

Again, LLMs are trained on massive datasets of text and code, which allows them to learn the patterns of human language – or to create a picture of how we may think (to quote Bush). They can be good at generating text or fail miserably, and should always have a set of human eyeballs on what they come up with (after all, they still hallucinate). Here are some of the dangers to keep in mind when using LLM generated content:

  • May be inaccurate or misleading: LLMs are trained on massive datasets of text, but this does not mean that they are always accurate or reliable. LLMs can sometimes generate content that is inaccurate or misleading, especially if the training data is biased or incomplete.
  • May be plagiarized: LLMs can sometimes generate content that is plagiarized from other sources. This is because LLMs are trained on massive datasets of text, and they may not always be able to distinguish between original and plagiarized content.
  • May be offensive or harmful: LLMs can sometimes generate content that is offensive or harmful. This is because LLMs are trained on massive datasets of text, and they may not always be able to distinguish between appropriate and inappropriate content.

If you are considering using LLM generated content, be aware of the potential dangers and verify the accuracy and reliability of the content before using it. Here are some tips for using LLM generated content safely:

  • Only use LLM generated content from reputable sources: There are many LLMs available, and not all of them are created equal. It is important to only use LLMs from reputable sources that have a good track record of generating accurate and reliable content.
  • Verify the accuracy and reliability of the content: Before using LLM generated content, it is important to verify its accuracy and reliability. This can be done by checking the source of the content, cross-referencing it with other sources, and using your own judgment.
  • Use it to get some of the work done, or augment, automate, and make your job better – not to replace work. 
  • Add sources and check it for plagiarism (there are plenty of APIs that can be used to automate doing so).

Caveats

Every type of organization will have different weights and measures, according to the type of content they generate. This is one of the reasons all of these scripts aren’t tied together in a mega script. People shouldn’t apply this globally. If the output was a number that everyone took for granted without understanding the nuance of their specific environments, it would be inaccurate and potentially misleading too often.
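
For example, a minimal sketch of a per-organization weighting scheme might look like the following. The check names and weight values are made-up placeholders; each organization would tune its own:

def weighted_llm_score(scores, weights):
  # scores and weights are dictionaries keyed by check name, e.g. "passivity" or "citations".
  total_weight = sum(weights.values())
  return sum(scores[name] * weights[name] for name in weights) / total_weight

# Hypothetical per-check scores (0-10) from the scripts above.
scores = {"complexity": 2.0, "passivity": 5.0, "generic": 6.5, "citations": 1.0, "original": 3.0}

# Made-up weights; tune these for your own organization and content types.
weights = {"complexity": 1.0, "passivity": 2.0, "generic": 2.0, "citations": 1.5, "original": 1.5}

print("Weighted LLM-likelihood score:", round(weighted_llm_score(scores, weights), 1))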

Finally, err on the side of caution, and increase scrutiny the more people actually rely on what these models come up with. Again, if lives or business opportunities are on the line, auto-generated content could be problematic. Long-term, we don’t know how the rights to the inputs used to train LLMs will affect who controls copies of derivative content (aka copyright).