Detecting AI
A Closer Look at AI Detection Tools and Their Efficacy
DRAFT
BY: Scott Lorenz
February 2023
We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks (OpenAI, 2023b).
First, let me start off by saying that I did not use Chat GPT or any other AI or language model to write this paper. Aside from Grammarly to catch my spelling and wording, I wrote this myself. I need to make that statement because I asked Chat GPT for help in this paper. I asked it to describe itself, how it defined AI, and where it gets its information, and I even asked for its opinion of one of my theories about why AI detectors believe the 4th Amendment of the U.S. Constitution was written by AI (See Section 9.6). When I do use Chat GPT's words, I clearly identify them and give proper credit.
This document has a linked table of contents that allows users to navigate to a particular section of the document of interest. Each AI checking tool I reviewed is listed in a separate section so you can easily navigate around in accordance with your particular interests. I have links to each AI checker and website so users can test them for themselves. Aside from the links to the websites of AI checkers, there are other links inside tables that lead to a shared folder on my Google Drive. Those links are either .png or .jpg screenshots or PDF documents. There are no executable files or programs associated with links to my Google Drive. The links to the websites are all to "well-known" AI checkers. These checkers usually have paid and free versions, and the free versions have pop-up ads on their web pages. I reviewed AI checkers that were near the top of the list when searching Google for "AI checker." I did not detect or experience any issues when using the sites, but, as always, use due diligence and your own judgment with anything on the internet.
This is not an exhaustive study of GPT-4 and its full capabilities, and I would not classify this as a scientific study. I tested GPT-4 from the point of view of a student taking my classes. I also tested some of the AI checkers used to detect the use of AI by students. Those results were very interesting and enlightening. Not only is GPT-4 as good as rumored, but there are also issues with detecting its use that should be considered before taking action in individual cases and drafting a policy regarding the use, detection, and investigation of suspected cases of academic dishonesty.
I put Chat GPT-4 to the test myself with objective tests, creative and technical writing, discussion boards, review questions, and using APA style for in-text citations and bibliographies. There is a vast improvement with GPT-4 compared to the capabilities of GPT-3.5. The capabilities of Chat GPT are rapidly evolving, and all potential uses of AI are vast and unknowable. In our department (Protective Services), Chat GPT-4 is already a valuable tool for generating new test banks for our police academy. GPT-4 is also capable of understanding substantive and procedural criminal law and is a breakthrough over recent AI tools, which had trouble penetrating the outline levels and wording of statutory penal codes and applying the meaning of procedural law to real-world scenarios.
I demonstrated Chat GPT-4 in two of my face-to-face classes and allowed my students to throw whatever they wanted at it. Anecdotally, my students ranged from being vaguely aware of AI and its uses to being already familiar with and using AI themselves. After seeing its capabilities, they unanimously agreed that using Chat GPT-4 to create assigned material and take exams is dishonest and cheating. The question of whether students using AI to write essays and take exams is an act of plagiarism is an easy question to answer. It is. The challenge we will face is how to detect and prevent this behavior.
I tested ten (10) popular AI detectors using AI and human text and created screenshots of the results. I also took a dive into how these detectors work and their limitations. I am a visual person, so I created diagrams and figures with screenshots to better illustrate the subject. Some tables contain links to screenshots stored in a shared "anyone with a link" folder on my Google Drive. What I found was interesting and enlightening. I think there are some good tools that we can use to help us detect AI. However, we must understand that we are likely witnessing a remarkable leap forward in technology and our society. What is different about this revolution compared to revolutions of the past is how rapidly changes occur. Plagiarism has always been a challenge for higher education institutions, and, at best, we manage it. However, the challenges AI presents will make plagiarism and cheating much more difficult to manage. Trying to stay on the leading edge of this issue is a fool's errand for now, but that does not mean we should not try to mitigate AI's impact on our academic integrity. One significant challenge we will face is that the very tools we are using right now to catch AI are also tools that students can use to fool us.
The concept of Artificial Intelligence (AI) has been widely attributed to Alan Turing, who had the idea that machines could think. More accurately, he posed the question, "Can machines think?" in an October 1950 article published in the journal Mind (Turing, 1950). The title of the 2014 movie The Imitation Game, starring Benedict Cumberbatch, was a heading in his article. Turing designed a machine that helped crack the Enigma Code, which the Germans used in World War II to send encrypted messages (Imperial War Museums, 2023). Turing's question about machines being able to think is referred to as the Turing Test, though many researchers do not consider it the birth of AI (Council of Europe, 2023). In the summer of 1956, John McCarthy (Dartmouth, later the Massachusetts Institute of Technology) and Marvin Minsky (Harvard, later MIT) were credited with coining the term Artificial Intelligence in their so-titled research project (McCarthy et al., 1955).
2. References
Turing, A. M. (1950). Computing Machinery and Intelligence. Mind, LIX(236), 433–460. URL: https://academic.oup.com/mind/article/LIX/236/433/986238. doi:10.1093/mind/LIX.236.433.
Imperial War Museums (2023). How Alan Turing Cracked The Enigma Code. URL: https://www.iwm.org.uk/history/how-alan-turing-cracked-the-enigma-code.
Council of Europe (2023). History of Artificial Intelligence. URL: https://www.coe.int/en/web/artificial-intelligence/history-of-ai.
McCarthy, J., Minsky, M., Rochester, N., & Shannon, C. (1955). A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence. URL: http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
John McCarthy asks and answers the question "What is artificial intelligence?" in his 2007 article.
It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable (McCarthy, 2007).
The definition of AI varies depending on who is asked and in what context. Businesses define the term differently depending on their focus or specialty. With deference to John McCarthy, IBM describes what AI does: "Artificial intelligence leverages computers and machines to mimic the problem-solving and decision-making capabilities of the human mind" (IBM, 2023). According to the Brookings Institution, the lack of a uniformly agreed-upon definition creates problems, and AI is often confused with super-powered robots: "The lack of clarity around the term enables technology pessimists to warn AI will conquer humans, suppress individual freedom, and destroy personal privacy through a digital '1984'" (West, 2018). Marketing also plays a part in the definitions of AI. The company formerly known as Facebook, Meta, defines the metaverse as "...a term used to broadly describe the virtual world that will result from advances in AI, virtual reality, and augmented reality" (Marketing Artificial Intelligence Institute, 2023). When I asked Chat GPT-4, it defined AI as follows: "Artificial Intelligence (AI) refers to the development of computer systems or algorithms that can perform tasks typically requiring human-like intelligence, understanding, and problem-solving abilities" (OpenAI, 2021).
2.1 References
McCarthy, J. (2007). What Is Artificial Intelligence?
IBM (2023). What is Artificial Intelligence (AI)? URL: https://www.ibm.com/topics/artificial-intelligence.
West, D. M. (2018). What is artificial intelligence? URL: https://www.brookings.edu/research/what-is-artificial-intelligence/.
Marketing Artificial Intelligence Institute (2023). Meta/-Facebook AI: What Businesses Need to Know. URL: https://www.marketingaiinstitute.com/blog/how-facebook-uses-artificial-intelligence-and-what-it-means-for-marketers.
OpenAI (2021). ChatGPT (Version GPT-4) [Computer software]. URL: https://chat.openai.com.
For my fellow Terminator fans, there is no need to start searching for Sarah Connor just yet. Chat GPT-4 is a remarkable achievement, and I found myself in awe of its ability to create. It doesn't simply replicate; it creates new, creative, logical, coherent content based on the mountains of raw data it has been fed. Not only could it accurately describe the legal concepts associated with Miranda v. Arizona, it wrote a poem about the case when I asked - and it was damn good. With that said, we are barely scratching the surface of the capabilities of AI, and we are likely decades away from a machine being consciously aware of itself. Despite the rumors and the screaming former tech-company employees turned whistle-blowers, Chat GPT-4 and the AI developed by other companies are not sentient beings. Not even close.
OpenAI is the name of the company that created Chat GPT. As the name suggests, the company describes its mission as making AI available to everyone.
OpenAI is an AI research and deployment company. Our mission is to ensure that artificial general intelligence benefits all of humanity (OpenAI, 2023a).
3. References
OpenAI (2023a). About OpenAI. URL: https://openai.com/about.
I created an account and subscribed to ChatGPT Plus, which is $20 a month. My interest in Chat GPT is curiosity and its use as a tool to create test questions. I recently wrote over 700 multiple-choice questions just involving penalty classifications of offenses in the Texas Penal Code. Academy cadets must be able to correctly identify the title of a particular offense and its penalty classification from the elements described in a crime scene scenario. Until GPT-4, I never had great success with AI-based question generators. The Texas Penal Code is written in outline form and is not what I would call an easy read for humans or machines. So I was excited to put Chat GPT-4 to the test, which I did alongside Cliff Osborne, the director of the police academy. Cliff spends a good portion of his time creating hundreds of test questions for each unit of the academy to prepare cadets for the very challenging TCOLE Licensing Exam. We found that Chat GPT-4 could not only answer questions about penal code offenses correctly, but it could also design multiple-choice questions based on scenarios it made up to fit a particular section of the penal code.
To appreciate the question I am about to ask Chat GPT-4, it is helpful to provide the Texas Penal Code definition of murder (the actual statute), Section 19.02 - 1st Degree Murder. Specifically, I am interested in 19.02(b)(3) for a multiple-choice question.
Sec. 19.02. MURDER. (a) In this section:
(1) "Adequate cause" means cause that would commonly produce a degree of anger, rage, resentment, or terror in a person of ordinary temper, sufficient to render the mind incapable of cool reflection.
(2) "Sudden passion" means passion directly caused by and arising out of provocation by the individual killed or another acting with the person killed which passion arises at the time of the offense and is not solely the result of former provocation.
(b) A person commits an offense if he:
(1) intentionally or knowingly causes the death of an individual;
(2) intends to cause serious bodily injury and commits an act clearly dangerous to human life that causes the death of an individual; or
(3) commits or attempts to commit a felony, other than manslaughter, and in the course of and in furtherance of the commission or attempt, or in immediate flight from the commission or attempt, he commits or attempts to commit an act clearly dangerous to human life that causes the death of an individual.
(c) Except as provided by Subsection (d), an offense under this section is a felony of the first degree.
(d) At the punishment stage of a trial, the defendant may raise the issue as to whether he caused the death under the immediate influence of sudden passion arising from an adequate cause. If the defendant proves the issue in the affirmative by a preponderance of the evidence, the offense is a felony of the second degree.
Texas has separate statutes for manslaughter, capital murder, and criminally negligent homicide. There are actually three ways to commit 1st Degree Murder in Texas: 19.02(b)(1), 19.02(b)(2), and 19.02(b)(3). Each of those potential scenarios is called an element of the offense; understanding them is crucial to properly charging someone with a criminal offense. I have highlighted the elements of 19.02(b)(3), which is commonly referred to as "felony murder." A murder under this scenario does not require the intent to kill. It only requires the intent to commit a felony and then the commission of an act clearly dangerous to human life, during the commission of the felony or flight after it, which causes the death of anyone. Other AI test generators were useless at creating scenarios and could not even come close to differentiating elements inside individual statutes. So I did not have high hopes when I asked Chat GPT-4 the following question.
Lorenz: create a multiple choice question about murder as defined in 19.02(a)(3)
Chat GPT-4:
During a jewelry store heist, Taylor and Jordan are attempting to steal a valuable diamond necklace. Which of the following scenarios would constitute murder according to the Texas Penal Code Section 19.02(a)(3)?
A. Taylor and Jordan successfully steal the necklace without any physical confrontation or harm to anyone.
B. Taylor threatens the store owner with a fake gun, but no one is injured during the robbery.
C. While trying to escape, Taylor fires a warning shot into the air, unintentionally causing a ricochet that kills a bystander.
D. Taylor and Jordan abandon the robbery attempt after getting scared and leave the store without causing any harm.
Answer: C
Chat GPT-4 created a scenario that not only matched the definition of murder, but it created a scenario that matched only the elements of 19.02(b)(3) of that statute, just as I asked. GPT-4 was able to understand differences in elements in the same criminal statute, imagine a scenario matching those elements that might occur, and then craft a multiple-choice question based on that understanding.
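For readers who want to script this kind of prompting rather than use the chat interface, here is a minimal sketch using OpenAI's Python package as it existed in early 2023 (the openai.ChatCompletion endpoint). The prompt mirrors the one above; the API key placeholder and exact prompt wording are my own assumptions, not part of the test I ran.

```python
# A minimal sketch, assuming the early-2023 openai package (< 1.0),
# where chat requests went through openai.ChatCompletion.create.
import openai

openai.api_key = "sk-..."  # placeholder; use your own key

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": ("create a multiple choice question about murder "
                    "as defined in Texas Penal Code 19.02(b)(3)"),
    }],
)

# The generated scenario, choices, and answer come back as plain text.
print(response.choices[0].message.content)
```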
I wanted to use Chat GPT-4 to take one of my exams from Fundamentals of Criminal Law, CRIJ 1310. The exam was 50 questions, all multiple choice. The book for that class is an OER, and thus I wrote my own test bank for it. My questions were all written in a particular way, using four names as the answers to each question (Scott, Jon, Brad, and Cliff). They were traditional multiple-choice questions in which each of those characters gave an answer, and I would ask the student which character got it right. Using those four names allows me to change the answers to the entire test bank using "find and replace" commands. This way, when my exams end up on Course Hero, I can change Cliff to Susie, or I can turn Brad into Scott, etc. It also creates scenario-based questions that cause students to think differently about what is being asked.
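As a side note for anyone maintaining a test bank the same way, the name rotation described above is trivial to script. The sketch below is a hypothetical illustration (the filename and the particular swaps are made up); it uses placeholders so that swaps like Brad-to-Cliff don't collide with names already in the file.

```python
from pathlib import Path

def rotate_names(path: str, swaps: dict) -> None:
    """Swap character names across an entire test bank file."""
    text = Path(path).read_text(encoding="utf-8")
    # Two passes via placeholders so overlapping swaps don't collide.
    for i, old in enumerate(swaps):
        text = text.replace(old, f"@@{i}@@")
    for i, new in enumerate(swaps.values()):
        text = text.replace(f"@@{i}@@", new)
    Path(path).write_text(text, encoding="utf-8")

# Hypothetical usage: retire "Cliff" and reuse "Brad" as "Cliff".
rotate_names("crij1310_testbank.txt", {"Cliff": "Susie", "Brad": "Cliff"})
```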
I began taking the exam with a plan that I would not read the questions or think for myself. I simply copied and pasted each entire multiple-choice question, with its choices, into Chat GPT and hit enter. Chat GPT would then tell me the answer is "C. Scott" or "B. Jon". I entered that answer and moved to the next question, again not reading or thinking for myself. ChatGPT Plus is a paid version of OpenAI's access to Chat GPT. For 20 dollars a month, I get faster response times even in peak traffic. However, even with that service, I am limited to 25 questions every 3 hours because of the overwhelming demand for access to GPT-4, and I had already asked some questions that day before beginning my exam. On question 18, I exhausted my 25 questions for that 3-hour period. I could continue to ask questions, only now I was diverted to Chat GPT-3.5. So I started the exam using GPT-4, and from question 18 to the end, I was using GPT-3.5. I finished the exam in 17 minutes at an easy, comfortable pace of cutting, pasting, and entering the answer Chat GPT provided. I scored 82% on the exam. When I looked at the results, I noticed that all of the questions I missed came after question 17, meaning GPT-3.5 answered every question I missed. A few hours later, I went back into Chat GPT-4 and used the same cut-and-paste method to ask only the questions that GPT-3.5 missed. GPT-4 got all of them correct, so GPT-4 would have scored 100% on the exam.
Here is one of the questions Chat GPT-3.5 missed from my exam. GPT-3.5 told me that it could not answer this question because the information needed to answer it was not there. Who acted recklessly was the ultimate question. To know that, one would have to understand the culpable mental states in Chapter 6 of the penal code and then know that manslaughter under 19.04 of the Texas Penal Code is defined as "recklessly causing the death of an individual." So while no one's conduct was described in the multiple-choice scenario, it was possible to arrive at the correct answer, "D. Scott", by knowing the culpable mental states and the elements of each type of criminal homicide. GPT-3.5 could not make this type of connection or inference: Scott was convicted of manslaughter under Sec. 19.04, manslaughter under 19.04 is recklessly killing someone, therefore Scott acted recklessly. My students have trouble with this question until I explain how to make that connection. Chat GPT-4 nailed the answer without a glitch. You can see the PDF of the entire exam here if you like.
This is one of the questions that perplexed GPT-3.5.
Four men were arrested for their conduct which caused the death of an individual in Texas.
Brad and Cliff were charged with murder TX PC 19.02(b)(1).
Jon was charged with criminally negligent homicide TX PC 19.05(a).
Scott was charged with manslaughter TX PC 19.04(a).
According to the Texas Penal Code.
Who acted recklessly?
a. Brad
b. Jon
c. Cliff
d. Scott
Chat GPT 3.5's Answer
There is no information in the scenario to determine who acted recklessly. The culpable mental state of reckless conduct is not mentioned for any of the individuals.
Chat GPT-4's Answer
D. Scott - CORRECT
Miranda v. Arizona is a very famous case that most people are familiar with, so I was not surprised that Chat GPT-4 could accurately describe the relevant facts and legal repercussions of that case. However, GPT-4's answers to my two follow-up questions were amazing.
Lorenz: tell me about Miranda vs Arizona
Chat GPT-4: Miranda v. Arizona was a landmark decision by the United States Supreme Court in 1966 that established the principle that criminal suspects must be informed of their constitutional rights before being interrogated by law enforcement officials. The case centered around Ernesto Miranda, who was accused of kidnapping and raping an 18-year-old woman in 1963.
During police interrogation, Miranda confessed to the crime without being informed of his constitutional rights to remain silent and to have an attorney present. He was subsequently convicted based on his confession. However, his lawyers argued that the confession was inadmissible because he was not informed of his rights, as required by the Fifth Amendment, which protects against self-incrimination, and the Sixth Amendment, which guarantees the right to counsel.
The case was ultimately brought before the Supreme Court, and the court ruled 5-4 in favor of Miranda. The majority opinion, written by Chief Justice Earl Warren, held that prosecutors could not use statements made by defendants during custodial interrogation unless they demonstrated that the defendants were informed of their rights and voluntarily waived them.
As a result of this ruling, law enforcement agencies across the United States began to inform suspects of their "Miranda rights" prior to questioning, which typically includes statements such as:
You have the right to remain silent.
Anything you say can and will be used against you in a court of law.
You have the right to consult with an attorney before speaking to the police and to have an attorney present during questioning.
If you cannot afford an attorney, one will be provided for you at no cost.
Miranda v. Arizona has had a significant and lasting impact on the American criminal justice system, shaping the way law enforcement officials handle interrogations and ensuring that suspects are aware of their constitutional rights.
Lorenz: write a poem about what you just wrote
Chat GPT-4:
In the land of the free, a tale unfolds,
Of a man named Miranda, and rights untold.
Kidnapped and accused, his fate lay near,
But the truth was veiled, the path unclear.
In a room with detectives, questions arose,
His confession extracted, as the pressure imposed.
But in the shadows of justice, a flicker of doubt,
Were his rights neglected, his voice cast out?
To the highest court, the battle ascended,
Where nine wise minds would soon be blended.
A landmark decision, a ruling so grand,
Changed the course of justice throughout the land.
"Before you speak, your rights must be known,"
Declared Chief Justice Warren from his throne.
Silence and counsel, protection and grace,
Now a cornerstone of the American legal space.
From coast to coast, the words now resound,
To the ears of the accused, these rights are bound:
"Remain silent you may, and counsel demand,
For in the halls of justice, we shall now stand."
The legacy of Miranda, woven through time,
A safeguard of freedom, a legal paradigm.
For in the balance of power, the truth must be sought,
And the rights of the many, forever be taught.
Lorenz: describe miranda v arizona in one sentence where all the words start with A
Chat GPT-4:
Accused's arrest announcement altered, avowing absolute attorney access, avoiding self-accusation abetment.
I asked GPT-4 to write APA-style papers on particular topics and asked for a bibliography and in-text citations. Sometimes it would fail to include the in-text citations or the bibliography until I asked again. Sometimes it would give a disclaimer. Ultimately, it would include in-text citations and a bibliography, but many times the URL links in the bibliography were no longer valid. One example of such a disclaimer came after it gave me an explanation of capital punishment and I asked Chat GPT-4 to "cite your sources".
As an AI language model, I don't provide direct citations for specific information within my responses. My knowledge is based on the large-scale training dataset that includes diverse sources, such as articles, websites, and books up to September 2021. In this case, my response was based on general knowledge about capital punishment and its history, debates, and legal challenges.
If you are looking for specific sources on capital punishment, I would recommend visiting websites of reputable organizations like Amnesty International, the Death Penalty Information Center, and the National Institute of Justice, as well as referring to landmark court cases like Furman v. Georgia and Gregg v. Georgia.
For academic or research purposes, you may want to search for scholarly articles or books on the topic using platforms like Google Scholar, JSTOR, or your local library's database (OpenAI, 2021).
This disclaimer explains why GPT-4 has some issues creating accurate bibliographies with working URLs and why its answers are attributed to general sources of information rather than specific locations or documents. GPT-4 can and does generate working URLs, but because of the way it is fed information and its lack of a real-time connection to the web, this is one area where issues can occur. Section 5.1 briefly discusses the search engine issue.
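Since broken bibliography URLs are one of the telltale issues described above, they are also easy to flag mechanically. Here is a small sketch using the requests library; this is not how any AI checker works, just a way to spot links that no longer resolve.

```python
import requests

def check_urls(urls):
    """Return {url: True/False} for whether each link still resolves."""
    results = {}
    for url in urls:
        try:
            # HEAD is cheap; some servers reject it, so a GET fallback
            # could be added for stricter checking.
            r = requests.head(url, allow_redirects=True, timeout=10)
            results[url] = r.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

print(check_urls(["https://chat.openai.com", "https://example.com/gone"]))
```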
4.4 References
OpenAI (2021). ChatGPT (Version GPT-4) [Computer software]. URL: https://chat.openai.com.
No. I checked Chat GPT-4's explanation of reasonable suspicion for plagiarism using Grammarly, EasyBib, and Quetext, all of which found no plagiarism. I also checked many other samples of GPT-4's output and its answers to my questions about all sorts of topics. GPT-4 is creating new content from the content it has been fed. However, imagine a student writing an APA-style paper with no in-text citations and a bibliography consisting of the words "See all the data on the Internet up to 2021". We would not give the student credit for that work even if the content of the paper passed every plagiarism checker we tried. I am not suggesting that AI based on language models be sued for copyright violations or that we demand in-text citations for everything they do. I am merely pointing out that the conventional notions we have about how students must relay information from the internet to us can't apply to AI. As of now, AI is not a search engine, but it likely will be in the near future.
Chat GPT-4 is not connected to the internet for the purposes of real-time searches, as it will constantly remind you that its training data extends only to September 2021. When I asked Chat GPT-4 who won the 2022 CrossFit Games, it told me it did not know, as that occurred after its update. But it did know who won the 2021 CrossFit Games. Ironically, Chat GPT-4 may not be "self-aware" as the conspiracy-theory, Hollywood-based explanations proclaim, but it can often explain things better than we can, including giving some kind of coherent account of where its knowledge originates. To wit, I asked Chat GPT-4 if it considered itself a search engine.
As an AI language model, I am not a search engine in the traditional sense. A search engine like Google, is designed to index and retrieve web pages based on user queries. My primary function is to understand and generate human-like text based on the context and information provided by users.
While I have been trained on a vast amount of text data from various sources, including the internet, I do not actively crawl or index the web. My responses are based on the knowledge and patterns extracted from the training data, which has a cutoff date in September 2021. This means that my knowledge is not up-to-date and may not include the most recent information or events.
In summary, while I can provide information and answer questions based on my training data, my primary function is to generate human-like text, and I am not a search engine in the traditional sense (OpenAI, 2021).
5.1 References
OpenAI (2021). ChatGPT (Version GPT-4) [Computer software]. URL: https://chat.openai.com.
The initial reaction humans had to the idea of talking to machines varied, but once we embraced the idea and built it into our phones and various IoT devices, most people were comfortable with it. Much of our initial frustration was based on the voice assistants not understanding what we said or responding when we did not ask for their input. I have numerous Amazon devices in my house and have written several papers on how to perform forensic exams on them. I have also tested Amazon's claims about privacy with those devices regarding what was stored locally and what was transmitted to Amazon's cloud. Amazon devoted a lot of time and effort to its voice assistant, Alexa, to better understand and respond to humans. Ironically, by eliminating false hits and making machines more like humans, we have created new challenges and problems and are now more disturbed than ever. The problem we are now facing in the academic world is not new to the business world, primarily with search engines, news, and social media sites. How do we detect AI?
This seems like a ridiculous question for educators, and the answer for us is obvious. But, as I mentioned previously, educators are late to this game. Educators need at least a basic understanding of why everyone else has been trying to detect AI, because many of the tools educators are now using to detect AI were developed for purposes other than catching cheaters. This does not invalidate these tools; much of what is used in science and medicine is based on tools or procedures applied outside of their initially intended purposes. I learned how to "JTAG" (Joint Test Action Group) mobile phones in 2014. This involved using a microscope and soldering onto tiny ports and pads called "taps" on the logic boards of phones to extract the contents of a phone in a criminal investigation. Like educators, law enforcement was using a tool designed for something else and exploited by others before the police began using it themselves. The manufacturer installed the JTAG taps on the phones for repair and diagnostics. JTAG taps were also used on the logic boards of gaming consoles, and gamers who wanted to cheat learned to exploit these repair taps to give their characters unlimited life or some other advantage. Law enforcement realized that these taps were a way into these devices and started using the technique to extract evidence in criminal investigations. Thus, unintended consequences always arise, to the benefit or detriment of someone. I will refer back to this paragraph when I start talking about the tools designed for and used by educators to catch cheaters.
There is a universal concern in education about the use of Chat GPT by students. However, there is just as big a concern about the use of AI for generating content for general consumption on the internet. That includes news, social media posts, business reviews, and marketing. SEO stands for Search Engine Optimization, which involves driving internet traffic to a particular business website using techniques that promote the website in the results of searches on engines like Google (Duo, 2019).
Google objects to certain tactics that unfairly promote one business above another, so it takes steps to detect and stop this from happening. That detection includes the detection of AI-generated or modified content. Google has devoted considerable research and resources to detecting AI-altered and "plagiarised" content, in which content is stolen from other sites using a method called scraping. That scraped content can be modified and used to promote or demote certain sites, thereby gaining a market advantage (Schwartz, 2022). However, as with everything related to AI, there is a catch and, recently, a clarification.
Depending on how it is used, AI can be a legitimate tool for many things, including developing content to be used in marketing. So if a really good business uses AI to develop really good ad content, what is wrong with that? Google has since clarified its stance regarding SEO and AI-generated content and is apparently claiming that it will reward high-quality content, no matter how it was created - even by AI. So, Google considers using AI to "manipulate" search engine rankings to be spam and, therefore, a violation of its policy, but using AI for good content creation is okay (Kashyap, 2023).
6.2 References
Duo, M. (2019). What Does SEO Stand For? (+ 7 Beginner’s Tips for Ranking a Site). URL: https://kinsta.com/blog/what-does-seo-stand-for/.
Schwartz, B. (2022). Google: We Have Algorithms To Detect & Demote AI Altered Plagiarized Content. URL: https://www.seroundtable.com/google-ai-plagiarized-content-34495.html.
Kashyap, K. (2023). AI-Generated Content Not Against Its Policies: Google Clarifies. URL: https://www.spiceworks.com/marketing/content-marketing/news/ai-generated-content-not-against-google-policies/.
For AI detection, the answer is both. This is far outside my area of expertise, but there is a point to this brief description of SEO and the issues surrounding it. Google was or is developing methods to detect AI, for many different reasons. Some of the AI checkers I reviewed were created to help businesses optimize their content by using AI to create it. However, they want AI-created content not to look like AI. Therefore, they create their own AI checkers to make sure they are not penalized in search engine rankings because of their own AI-created content. Whether an AI checker was created to catch cheaters in education or for SEO, I only care whether it actually works. I mention this because you will notice, while on sites designed for SEO and marketing, that the atmosphere and purpose are different. Whether the goal is to detect AI or to create undetectable AI, the same tools (AI detectors) are used for both.
There are several free tools available to check for AI-generated content. My initial experience with AI checkers was based on a Google search for "AI checker". As a side note, I recommend leaving the "s" off of "AI checker", as a search for AI "checkers" returns matches related to the game of checkers. Quill (https://www.quill.org/) is a site designed to help students become better writers and critical thinkers. A direct link to their AI checker is https://aiwritingcheck.org/. Quill claims to have tested its algorithm on 15,000 essays with 80-90 percent accuracy, but Quill advises caution when using the tool to check for academic dishonesty.
Writer (https://writer.com/) is a site designed to be an AI writing assistant similar to Grammarly. Writer.com has an AI detection tool that allows you to paste a sample of your created or suspected AI content into their AI checker at (https://writer.com/ai-content-detector/). The checker then displays the results as a percentage that estimates how much of the suspected content is human-generated.
So, with my initial search, I found a couple of AI checkers designed for writing analysis and assistance. As a test, I pasted a paragraph from this paper, which I wrote without any writing assistance, into Writer.com's content checker. I wrote this paper from a first-person perspective and used the word "I" twice in that paragraph. As expected, Writer.com identified my writing as "100% human-generated content" (Figure 1). Quill also identified my writing as "Text Written by Human" (Figure 1). Therefore, despite what my friends say about me, I am human.
Figure 1: Lorenz’s Normal Writing - Content From This Paper
After the discussion at the department chair meeting on 2-17-23 about AI and the use of AI checkers to determine whether something was written by AI, I decided to conduct a quick test of an AI checker. My interest in digital tools is not just their results or conclusions but how they arrive at those results or conclusions. I initially suspected that the check was merely an analysis of writing style. If that were the case, I wanted to know if I could write in a style that would identify me as AI. When I conduct tests, I like to start with broad, obvious assumptions and then narrow down and fine-tune the parameters. I wrote a paragraph about a wall, intentionally written in a way I would describe as clunky or robotic. I needed a sample of at least 100 words for Quill.org to perform the analysis, and I wrote the following paragraph.
Now I wish I were here at the wall with you, and I wish you could be at the wall with me. The wall is one of the best places in the world and most people enjoy the wall. We all want to see the wall but not many of us get to experience the wall. The wall is where I first found out that I had trouble seeing and may need glasses. The wall told me that I needed glasses so I decided to have an eye exam. If I were to ever visit the wall again, I think I would do it differently this time.
I didn’t put any thought into the 107 words I wrote except being intentionally robotic when writing about "the wall". I copied and pasted my first attempt to write like AI into Quill.org (https://aiwritingcheck.org/), and that site immediately identified my paragraph as "Written by AI". But Quill gives no score or percentage estimating my paragraph’s level of AI versus human creation. I then checked my paragraph about the wall with Writer.com (https://writer.com/ai-content-detector/), which identified my writing as 1% human-generated, therefore 99% AI (see Figure 2).
Figure 2: Lorenz’s Attempt to Write Like AI
This test confirmed that there is no objective analysis of metadata or checking of my IP address, as the submission method does not require me to attach any document that might contain metadata or creation times. The analysis by the AI checkers was based purely on the only thing they could see: my writing style, spelling, grammar, etc. The fact that an AI checker identifies my writing as AI does not invalidate these tools, as they can only look at the words submitted, and I was intentionally being as robotic as possible. Writing about a wall like "bad AI" was easy. That robotic, repetitive, stiff writing or speaking, reminiscent of early Hollywood depictions of talking machines, is easy to replicate. Each sentence is repetitive and stands on its own, without regard to the subject of the paragraph. My paragraph about the wall is more of an old stereotype of AI’s writing ability. However, current AI checkers still consider this robotic writing style a strong indicator of AI, despite the fact that voice assistants have been much more fluid than my wall paragraph for years.
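To make that point concrete, here is a toy sketch of the kind of surface features a style-only checker could compute from pasted text. This is my own illustration, not any vendor's algorithm; on my wall paragraph, the repetition of "the wall" would show up immediately in numbers like these.

```python
from collections import Counter

def surface_features(text: str) -> dict:
    """Crude style signals computable from the words alone (no metadata)."""
    words = text.lower().split()
    sentences = [s for s in text.split(".") if s.strip()]
    counts = Counter(words)
    return {
        # A low ratio of unique words to total words signals repetition.
        "type_token_ratio": len(counts) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Share taken by the single most common word (here, likely "the" or "wall").
        "top_word_share": counts.most_common(1)[0][1] / len(words),
    }
```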
I did not start using Chat GPT until after GPT-4 was released. So after playing with it for a while, I decided to test GPT-4 and test the two AI checkers to see if the submission was detected as AI or human. I was already impressed with GPT-4’s explanation of some fairly complicated procedural and substantive criminal law court decisions and statutes. So I asked GPT-4 to explain the legal term "reasonable suspicion."
Lorenz Prompt: write two paragraphs describing reasonable suspicion in simple terms
Chat GPT-4: Reasonable suspicion is an important concept in law that helps protect people’s rights while allowing the police to do their job. It’s a bit different from probable cause because it doesn’t require as much evidence, but it still means that police officers need to have a solid reason to think that someone might be up to something illegal. They can’t just stop and question someone because they have a hunch or because they don’t
like the way the person looks. They need to be able to point to specific facts or circumstances that make them think something might be going on. For example, if a police officer sees someone walking around a store nervously and constantly looking over their shoulder, they might have reasonable suspicion to think that person is planning to shoplift. In this case, the officer could stop and question the person to find out more about what’s going on. However, if the officer just sees someone walking down the street and doesn’t notice anything out of the ordinary, they wouldn’t have a good enough reason to stop them. Reasonable suspicion helps strike a balance between giving the police the tools they need to keep people safe and making sure that everyone’s rights are respected.
I copied and pasted Chat GPT-4’s answer about reasonable suspicion into Writer.com’s AI Content Detector, which determined that the text was 100% Human-Generated Content (see Figure 3). I then pasted the same text into Quill.org’s AI checker, which determined that the text was generated by AI. This example involving Chat GPT-4’s explanation of reasonable suspicion helps reveal something about the algorithms AI checkers use to make predictions about whether content was created by AI or a human. It also reveals that different AI checkers use different algorithms or criteria to predict the presence of AI-created content.
Figure 3: GPT-4 Reasonable Suspicion Explanation Passed for Human on One AI Checker
At the risk of being picked apart by English professors or anyone more skilled at writing, I am slightly reluctant to say my paragraph in Sample 3 is "well written". I was striving for something much better than my writing about the wall in Sample 1, which is not a very high bar. My primary objective was to again fool AI checkers into thinking I was AI. However, with this sample, I wanted to write well and still write like AI. My first attempt to write like AI in Sample 1 was intentionally bad and robotic. Not only was it repetitive in the extreme, but what I wrote in that paragraph was also absurd. I referred to a wall as a living, sentient thing that spoke to me. If bad, inaccurate writing by a human is obviously AI according to AI checkers, can good, accurate writing by a human be mistaken for AI? Thus I set out to create a writing sample about probable cause. I wrote something that is more formal than I would write to students in an online class. Unlike my writing about the wall, what I wrote about was "correct" regarding the subject of my writing. I have described probable cause more times than I can remember - in person and in writing. However, this time I was definitely more conscious of the words I was using. Attempting to write "well" like AI affected my choice of words and sentence structure. I wouldn’t describe Sample 3 as an accurate, unbiased test of AI checkers because I set out to imitate more modern AI. (Figure 4 shows the results of Quill.org and Writer.com)
Probable cause is a legal term used to describe circumstances that must exist before an arrest can be made, or a search can be conducted. Probable cause is a metric used by a neutral and detached magistrate to determine whether the facts and circumstances described in a search warrant affidavit are sufficient to justify a search or seizure. Probable cause is also the standard used by a grand jury in determining whether to issue an indictment charging that a specific crime has been committed. The determination of probable cause is based on a reasonable evaluation of facts and circumstances known at the time of the evaluation.
Figure 4: Lorenz’s Attempt to Write Well Like AI
In a criminal or civil investigation, I use forensic tools to perform a mobile phone examination, which begins with an extraction of the contents of the phone. I use programs and techniques designed to examine, recover, and preserve evidence in its original form. Text messages, photographs, documents, emails, and other digital artifacts have metadata associated with them. Metadata includes things like the date and time a message was sent or received and the phone number from which a text originated. If a student submits a Microsoft Word document as an attachment in Blackboard, it can have metadata that may tell us who created it and when. However, text pasted into that document, pasted into Blackboard, or typed into Blackboard has no metadata that tells us when or by whom the text was created. Even though Blackboard can provide the IP address of the student, the IP address doesn’t tell us whether the student’s submission is content originating from another source or location.
When AI checkers analyze text pasted into a website, none of what I described above occurs. AI checkers use algorithms to determine whether the suspected text appears to be created by an AI or a human, based on style, grammar, and sentence structure. There is no metadata to examine or compare. For digital artifacts, metadata is the part of each artifact that is typically unseen, while the "content" is the part we see. For example, in a text message from one person to another, "I love you." is the content of the message. Photographs sent via text message, chat apps, or multimedia messaging services (MMS) are frequently stripped of their metadata, leaving only the image and no way to determine when, where, by whom, and with what camera it was created.
One of the most critical parts of digital forensics investigations involves frequent testing and validation of the tools and techniques used to analyze digital evidence. This is especially important when the results of that analysis can affect a person’s fate in a criminal or civil investigation. There are many ways to validate the results of forensics tools. The best way is to test the forensic tools on evidence you create yourself. I am most certain about a photograph I took myself. After all, I used my own phone and knew exactly where I was standing and what time it was. I can pull the photograph directly from my phone so that it retains its metadata, and I can test any number of forensic tools to make sure they are all capable of seeing and displaying the same information about the photograph.
A digital photograph contains metadata called EXIF (Exchangeable Image File) data, which can be examined to reveal evidence of its origin. This includes the date and time the photograph was taken and information about the device or camera used to take it. The camera settings can be recovered from the photograph, and depending on the settings of the device used to capture it, the GPS location can be obtained. A single photograph can contain some or all of these data. When I recover the photograph, I can preserve it, and, to ensure that the evidence has not been tampered with, a hash value can be generated so that any copy can be tested and verified as a bit-for-bit duplicate of the original. This hash value is essentially a digital fingerprint: a long code unique to any piece of digital evidence. As an example, I took a photograph of my lawn mower with my iPhone 13 Pro Max. That photograph was uploaded to my Google Photos, and I downloaded it on my PC. Even after all of that, it retains its EXIF information. (See Figure 5)
Figure 5: Photo of Lawn Mower Taken on an iPhone 13 Pro Max
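For illustration, here is a minimal sketch of pulling EXIF data and a hash from a photo like the one in Figure 5, using the Pillow imaging library and Python's hashlib. The filename is a hypothetical stand-in for my lawn mower photo.

```python
import hashlib
from PIL import Image
from PIL.ExifTags import TAGS

def fingerprint_and_exif(path: str):
    """Return a SHA-256 digest of the file plus its readable EXIF tags."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    exif = {TAGS.get(tag_id, tag_id): value
            for tag_id, value in Image.open(path).getexif().items()}
    return digest, exif

digest, exif = fingerprint_and_exif("lawn_mower.jpg")  # hypothetical file
print(digest)                # the bit-for-bit digital fingerprint
print(exif.get("Model"))     # e.g., "iPhone 13 Pro Max"
print(exif.get("DateTime"))  # capture date and time, if present
```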
If I were asked to prove a particular person used Chat GPT to create a forgery or commit a crime, I would use a search warrant to seize and search the person’s computer so that I may examine the hard drive(s) for internet history showing visits to OpenAI and other digital evidence demonstrating how the questionable material was created or obtained. Of course, obtaining that kind of metadata and evidence is impossible when it comes to checking a student’s work for plagiarism or the use of AI.
That leaves the AI-checker and us with only the content of the student’s submission to determine if a human or AI created it. The content submitted by a student on Blackboard could have been written minutes or years earlier by a student, or it could have been created by Chat GPT-4. Date or time information, GPS information, IP addresses, or any other digital information that can be generated with original artifacts is not transferred to Blackboard, emails, or any other digital interface when the text is copied and pasted.
This is not an attempt to undermine or discourage the use of AI checkers; they are very useful tools, and I am currently using them. However, we (humans) sometimes believe that a particular process is scientific or forensic just because the process is digital and unseen. It is tempting to assume that something complex and scientific is occurring in the background with AI checkers. The algorithms AI checkers use may be complex, but an algorithm cannot analyze what is not there, namely, metadata. Therefore, the opinion provided by an AI checker is based on style, structure, punctuation, and other traits common to human writing that distinguish it from AI writing. The title of this section begins with the word "current". Publicly available AI is a relatively new phenomenon, and thus the issues we are all experiencing with its release have yet to be fully realized. The problems associated with new technology often outpace safety, ethical analysis, and regulation. There are scientific detection models already deployed, in development, or in the theoretical stage.
Even if AI checkers are not using metadata to determine the presence of AI, it is still possible to test and validate their performance. This can be done by submitting material that we know is AI and material that we know is human and verifying the results. We can also conduct tests that allow us to identify in more detail what criteria the AI checkers use in their algorithms. The first iterations of AI were easier to detect than their successors, as they were somewhat stiff and repetitive. When I set out to write a paragraph that AI checkers would identify as AI, I was confident that I could, as long as the analysis of my writing did not include metadata and was based purely on style. Chatting with computers began long before Chat GPT’s release; we have been conversing with Siri, Alexa, Cortana, and Google’s voice assistants for years. Those interactions improved with time, as did Chat GPT. Spotting early AI was easy because AI’s writing was bad.
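That validation idea can be put in simple terms: run known-human and known-AI samples through a checker, record its verdicts, and tally the error rates. Here is a minimal sketch of that bookkeeping (my own construction, independent of any particular checker):

```python
def score_detector(results):
    """results: list of (truth, verdict) pairs, each "ai" or "human"."""
    tp = sum(1 for t, v in results if t == "ai" and v == "ai")
    fn = sum(1 for t, v in results if t == "ai" and v == "human")
    fp = sum(1 for t, v in results if t == "human" and v == "ai")
    tn = sum(1 for t, v in results if t == "human" and v == "human")
    return {
        "accuracy": (tp + tn) / len(results),
        # Human work flagged as AI: the costly error for educators.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }

# Example: 3 AI samples, 3 human samples, one error of each kind.
print(score_detector([("ai", "ai"), ("ai", "ai"), ("ai", "human"),
                      ("human", "human"), ("human", "human"), ("human", "ai")]))
```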
The two samples I created (Sample 1 and Sample 3) are based on my attempts to write like AI. Sample 2 was generated by Chat GPT-4 when I asked it to describe reasonable suspicion. I tested all of these samples in eight AI checkers. The results are given in Table 7.1 "Testing Writing Samples in this Paper." A link is provided to each of the websites with AI-checking tools. The links lead directly to the submission area of each of these websites. Each of these AI checkers has some unique methods to display the results. Some use the term "Fake" to describe the presence of AI, and some sites quantify the degree of human or AI content with percentages. Additionally, some of the tools highlight particular paragraphs or sentences in a submission and rate those parts of a submission as suspected AI or not. I created a screenshot of each of the results of my writing submissions in that table so that you can see the screenshots by clicking on each result.
7.1 Testing Writing Samples in this Paper
Sample     Author   Topic                  Perplexity**  Burstiness**
Sample 1   Lorenz   The Wall (Bad AI)      30.833        13.167
Sample 2   GPT-4    Reasonable Suspicion   17.250        4.432
Sample 3   Lorenz   Probable Cause         18.500        5.196
*OpenAI is currently developing its own tool to detect AI, which I did not find until after completing my writing samples. OpenAI requires 1,000 characters as an optimal sample size. Two of my samples were below that threshold, so I added text to those samples when performing this test. The added text on Samples 1 and 3 was not part of the tests that I ran on the other AI checkers.
**GPTZero - Lower perplexity and burstiness scores indicate that the text is more likely to be AI. See 8.9.3 for more information.
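To give a feel for what the scores in Table 7.1 measure, here is a sketch of computing perplexity with the freely available GPT-2 model via the Hugging Face transformers library. GPTZero's exact formulas are not public, so treat this as an assumption-laden approximation: perplexity is the exponential of the model's average per-token loss, and I stand in for burstiness with the spread of per-sentence perplexities, which matches public descriptions of the idea.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean cross-entropy): low values mean the text is predictable."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def burstiness(text: str) -> float:
    """Rough proxy: standard deviation of per-sentence perplexity."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    scores = [perplexity(s) for s in sents]
    mean = sum(scores) / len(scores)
    return (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
```

Consistent with Table 7.1, model-generated text like Sample 2 would tend to sit lowest on both measures, while my human-written samples score higher and more unevenly.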
When I asked Chat GPT-3.5 how to detect AI language, it described "unusual and repetitive phrasing" as the second characteristic of AI. However, the first characteristic GPT-3.5 lists is "consistency and coherence". GPT-3.5 further explained that the language would be "well structured and grammatically correct". GPT-4 does not mention grammar. GPT-3.5 describes AI models as "consistent and coherent in their writing", while GPT-4 states that AI "may lack consistency in tone, style, or content". Figure 6 shows a side-by-side comparison of the responses.
Lorenz: can you determine if something is written by a human or AI
GPT-3.5 on 4-03-23:
As an AI language model, it is difficult for me to determine with certainty whether a piece of text has been written by a human or by another AI language model. However, there are some indicators that can suggest whether a text has been generated by an AI language model, such as:
1. Consistency and coherence: AI language models are designed to be consistent and coherent in their writing, which means that the text they generate often appears to be well-structured and grammatically correct. However, this can also result in text that sounds robotic or lacks the nuance and variation that a human writer might naturally include.
2. Unusual or repetitive phrasing: Depending on the training data and programming of the AI language model, it may produce unusual or repetitive phrasing that a human writer might not typically use. This can be a sign that the text has been generated by an AI language model.
3. Speed of production: AI language models can produce large volumes of text at a very rapid pace, which may be difficult for a human writer to match. If a large amount of text has been produced in a short time, this could be an indication that it was generated by an AI language model.
However, it’s important to note that these indicators are not foolproof and that there are many factors that can affect the quality and style of writing generated by AI language models. Ultimately, the best way to determine whether a text has been written by a human or by an AI language model may be to examine the context in which it was produced and to evaluate its content and style against what you know about human and AI writing capabilities.
Lorenz: can you determine if something is written by a human or AI
GPT-4 on 4-07-23:
As an AI language model, I can try to help you determine if a piece of text is written by a human or an AI, but it may not always be accurate. AI-generated text has become increasingly sophisticated over time, making it more difficult to differentiate it from human-written text. However, there are some general guidelines that can be helpful in making an educated guess:
1. Repetition: AI-generated text may sometimes repeat phrases or ideas, whereas human writers generally try to avoid unnecessary repetition.
2. Consistency: AI-generated text may lack consistency in tone, style, or content. It might switch between different perspectives or make contradictory statements.
3. Irrelevance: AI-generated text might include irrelevant or unrelated information that doesn’t contribute to the overall coherence of the text.
4. Overuse of certain phrases: AI-generated text may overuse certain phrases, making the writing sound unnatural or formulaic.
5. Lack of context or understanding: AI-generated text may not always fully understand the context or nuances of a topic, resulting in text that seems off or nonsensical.
It’s worth noting that these guidelines are not foolproof, and sometimes even human-written text may display some of these characteristics. Additionally, as AI models continue to improve, the differences between human-written and AI-generated text will become increasingly subtle.
Figure 6: GPT-4’s Answers to How to Detect AI
So, this initial test with known AI-created content resulted in one AI checker identifying it as 100% human-created and another identifying it as AI with no indication of its degree of confidence. I wanted to see what changes I could make to the paragraphs so that Quill.org would identify the content as human. I removed five commas and changed "For example" to "So" at the beginning of a sentence. Those changes made Quill.org change its determination from AI to human-created content. Figure 7 shows the commas I removed and the change I made to the beginning of a sentence.
Figure 7: Making GPT-4 Reasonable Suspicion Pass Quill.org as Human
After creating the figures in this paper, I checked Quill.org again using the same original Chat GPT-4 explanation of reasonable suspicion to better understand what type and number of errors were required to trick the AI checker into believing that GPT-4 was human. The removal of three commas and one period at the end of a sentence also changed the prediction to human. I then misspelled reasonable as "resonable" and officer as "oficer". These two misspellings alone made GPT-4’s creation pass as human.
7.4 Modifications Causing False Negatives in AI Samples
For this test, I used Sample 2, which was GPT-4’s original output when asked to describe reasonable suspicion. Most AI checkers identified that submission as AI. This test attempts to identify what modifications will cause a false negative by identifying AI as human.
Mod-1: This modification involves adding an extra space between random words in GPT-4’s output.
Mod-2: This modification involves changing letters in words to create misspellings.
Mod-3: This modification involves removing all the commas from Sample 2.
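I made these modifications by hand, but they are easy to automate for anyone who wants to reproduce the test. Below is a minimal Python sketch of the same three modifications; the function names and the perturbation counts are my own illustration, not anything taken from the AI checkers themselves.

import random

random.seed(42)  # fixed seed so the perturbations are reproducible

def mod1_extra_spaces(text, n=5):
    # Mod-1: add an extra space after n randomly chosen words.
    words = text.split(" ")
    for i in random.sample(range(len(words)), min(n, len(words))):
        words[i] += " "  # joining will produce a double space here
    return " ".join(words)

def mod2_misspell(text, n=2):
    # Mod-2: change letters in random positions to create misspellings.
    chars = list(text)
    letters = [i for i, c in enumerate(chars) if c.isalpha()]
    for i in random.sample(letters, min(n, len(letters))):
        chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def mod3_strip_commas(text):
    # Mod-3: remove every comma from the sample.
    return text.replace(",", "")

Each function takes a writing sample as a string and returns the perturbed version, ready to be pasted into a checker.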
**Note: OpenAI is currently developing its own tool to detect AI, which I did not discover until after completing my writing samples. OpenAI requires 1,000 characters as an optimal sample size. Two of my samples were below that threshold, so I added text to those samples when conducting this test. The added text on Samples 1 and 3 was not part of the tests that I ran on the other AI checkers.
AI checkers vary in design and output. A conclusion displayed simply as AI or human gives little information about what is taking place or whether there are degrees of certainty. If we know that there is no metadata involved, how accurate or certain are the detectors? Can these detectors be certain? In this section, I take a deeper look at some of the detectors in an attempt to determine their methods and accuracy. This includes whether the detector designers have tested them, are transparent about their methodology, and have an opinion or disclaimer about their accuracy. Then, the final test I throw at each detector is the "mixed test," in which I take a sample of my normal writing and a sample of writing generated by Chat GPT-4 and combine them into a single paragraph for pasting into each detector.
Lorenz Normal Sample: My mixed sample includes a paragraph from this paper, which represents my normal writing style. Unlike Samples 1 and 3, where I consciously tried to write like AI, my normal writing is much different. Although I try to write well, I am focused on the information I am conveying instead of trying to sound more or less human. My sample is 111 words containing 701 characters and 188 syllables. This represents 54.93% of the mixed sample.
I put Chat GPT-4 to the test myself with objective tests, creative and technical writing, discussion boards, review questions, and using APA style for in-text citations and bibliographies. There is a vast improvement with GPT-4 compared to the capabilities of GPT-3.5. The capabilities of Chat GPT are occurring rapidly, and all potential uses for AI are vast and unknowable. In our department, Chat GPT-4 is already a valuable tool in generating fresh test banks for our police academy. GPT-4 is also capable of understanding substantive and procedural criminal law and is a vast improvement over recent AI tools, which had trouble penetrating the outline levels and wording of statutory penal codes.
Chat GPT-4 Sample: I used Chat GPT-4’s output from Sample 2, which is the first paragraph it wrote about reasonable suspicion. That paragraph consists of 101 words containing 575 characters and 144 syllables. This represents 45.06% of the mixed sample.
Reasonable suspicion is an important concept in law that helps protect people’s rights while allowing the police to do their job. It’s a bit different from probable cause because it doesn’t require as much evidence, but it still means that police officers need to have a solid reason to think that someone might be up to something illegal. They can’t just stop and question someone because they have a hunch or because they don’t like the way the person looks. They need to be able to point to specific facts or circumstances that make them think something might be going on.
Quill.org is a non-profit writing assistant designed for elementary, middle, and high school students (Quill, 2023). In addition to writing tools and tutorials, Quill also has an AI checker. Quill allows pasting of up to 400 words into its site and requires a minimum of 100 words. Quill provides one of two responses: AI Prediction: Text Written by Human or AI Prediction: Text Written by AI.
We estimate, based on testing with 15k essays, that this tool is accurate 80-90% of the time. For this reason, we'd like to encourage teachers to exercise caution when using this tool to detect academic dishonesty. AI Writing Check is a stopgap tool for educators to use this school year until more advanced AI detection tools are made widely available.
Quill provides the following disclaimers:
Three Important Things to Know
This tool is not perfectly accurate - it will misidentify some pieces of writing. No AI text detection tool is perfectly accurate. This tool is accurate about 80% to 90% of the time, meaning that one or two of every ten pieces of writing that were flagged as human-written will actually be written by AI.
Students can fool the system by rewriting the text. Students can fool algorithms that identify AI writing by rewriting the bot's writing or by using another AI tool to rewrite the text for them. If a student sufficiently changes the writing, it is difficult for an AI tool to detect it as AI writing.
When you identify AI generated writing, treat this as a learning opportunity. Students are curious about AI, and we should encourage students to learn how to use these tools. If a student has misrepresented AI writing as their own, we have created a toolkit that may help you talk to your student about why students should cite their sources and not misrepresent someone else's work as their own.
Writer.com is an AI writing tool or assistant designed for businesses and marketing departments to help them remain consistent and stay on message within an organization. Writer has an AI checker into which you can paste writing samples. The AI checker is not explicitly designed to catch cheaters but instead is used to check a company's content for language that appears to have been created by AI, as this could affect a company's rank on search engines. Writer's AI checker displays results as the percentage of the content it estimates to be human-generated.
One of Writer.com’s pages "Why better search algorithms will encourage better writing" also provides more detailed information about their detector and detecting AI content as it relates to content created by SEO specialists (Duh, 2022).
8.2.1. References
Duh, K. (2022). Why better search algorithms will encourage better writing. URL: https://writer.com/blog/ai-content-seo/.
Writer.com has an FAQ section that is not on the same page as the submission area. The FAQ includes the following information.
Q: The detector is scoring something I wrote as AI-generated. Is it accurate? The AI content detector checks if, given a question or input, the text provided is likely to follow the same pattern of words that a large language model would produce. A person’s writing can also trigger the detector if they write using very similar word sequences. The detector will not be 100% accurate, but it can help give an indication of the likelihood that something was AI-generated.
Q: Can I avoid having my content get flagged as AI-generated? The distinction between “written by an AI” and “written by a human” isn’t always straightforward. High-level AI detectors work by evaluating how “predictable” the text is. In other words, how similar is this text to other texts on the same subject that are already out there?
Below are a few ways to avoid your content being flagged as AI-generated:
Making sure that your text is unpredictable — that you’re bringing real value to your readers by saying something that hasn’t already been said before, many times elsewhere
Providing a piece of content that provides unique value to its target audience
CopyLeaks is a site that advertises that it helps create and protect original content. Copyleaks has an AI Content Detector that allows you to paste text into the site. The results given will be either "This is human text" or "AI Content Detected."
Copyleaks also allows you to hover over text to reveal a percentage of probability for AI or human content. It did not highlight and differentiate my text from AI by color, but its percentage estimate of how much of the text was human vs. AI was accurate. The sample buttons above the paste window allow you to test their samples, which include GPT3, GPT4, ChatGPT, Human, and AI + Human. Those samples do show highlighted distinctions between AI and human, with percentage estimates of each. I gave this tool a FAIL because it did not distinguish between AI and human in my mixed submission and always labeled the mixed submission as human, while correctly distinguishing between the two samples when I submitted them independently.
If I had not known ahead of time which part of the mixed sample was AI and which was human, this result might have made me suspicious, but I would not have punished a student based only on it. I would be confused by this result. The result shows human text but "55.3% probability for human". What does that mean? If that estimate is based on a specific section of text identified as AI, it should say so. There is a paid version of Copyleaks, but I could not determine whether the paid version would give me more detail. The subscription fee is based on credits, and each credit covers 250 words of a scanned document.
Copyleaks makes use of AI to detect AI and claims an accuracy rate of 99.12%, including highlighting and color-coded text, which will display a percentage of the probability of AI or human content detected. "As the only enterprise AI-content detection solution available, and with 99% accuracy plus LMS and API integration, AI Content Detector is the most comprehensive and accurate AI text detection solution available anywhere."
I found no disclaimers on false positives or issues that would mitigate the accuracy of their model. Their AI detector page has information about its features and accuracy. Copyleaks also has a browser extension which I installed on Chrome. That extension worked with the same result as shown in the screenshot.
Corrector.app is a free application for writing and grammar that also has an AI Content Checker. Corrector.app was created by Grammar Buddies LTD, which records show was incorporated on February 15, 2022. They advertise that all of their content checkers are free, but there are many pop-up ads on the site, and at times those ads have been so numerous that they appeared to slow the performance of the AI checker. I located the company on Gov.uk under company number 13916177.
There is one officer listed as director of the company. When I looked for more about the team behind this app, I found only a page titled Meet Our Team, with no person specifically identified and a generic-looking stock photo of a diverse group of people happily working together.
The AI Content Checker goes to work as soon as you paste the text into the text window and provides the results as a percentage of "Fake." The tool seemed to correctly identify GPT-4 content as 99% Fake and my normal writing as 0.17% Fake, but when I combined those same two samples into a single paragraph, it did not distinguish between the AI and human parts and instead listed the combination as 0.02% Fake. There is no button to push to start the analysis or clear the last submission, as it just refreshes when text is added.
I could not find any warning or disclaimer about the difficulties in detecting AI content. When you scroll down on the AI Content Checker, you find claims that it can detect every type of AI on the planet, including GPT-2 through GPT-4. Ads litter the site in the side margins and between every paragraph; I could see ads for everything I have ever looked at on Amazon. For all of these reasons, I did not get a warm feeling that the information on this site was transparent, and mixing AI and human content in my test rendered incorrect results. I do think it is capable of some type of AI detection, but my validation tests of mixed content are not encouraging.
Content at Scale is an automated content site designed to help SEO-focused content publishers. The team is well-identified on their site under the About Us page. Further down that page, they describe their unique and proprietary method of detecting AI-generated writing: they use "multiple NLP neural networks..." So their AI detector was designed to test content for AI for SEO purposes. The site has an impressive interface and output, which displays results in three categories: predictability, probability, and pattern. The sample text is also highlighted, sentence by sentence, in a color code designed to distinguish which parts of the writing are AI or human and to what degree.
I was impressed with this site’s explanation of its methods and tools and found them to be very open and transparent. Their tool is easy to use and allows you to make multiple pastes into the text window before clicking the button to check for AI content. The tool was able to clearly distinguish between my normal writing and Chat GPT-4 content.
Their disclaimer is in the FAQ section further down the page, below their AI detector. There's always a potential for false positives, but our AI detector is best-in-class for detecting AI content. Compare it to any AI detection service (paid or free) or even OpenAI's own AI Classifier and you will see our results speak for themselves. That said, this should be used as part of an overall evaluation. Ask yourself, "is the content providing value?" That's what matters. Educators should use this as one point of many when evaluating student papers.
Crossplag.com advertises itself as a plagiarism checker with products including crossplag for education, crossplag for individuals, and an AI Content Detector. Crossplag also markets itself to students (Crossplag for Individuals) in which it states, "Plagiarism is a serious academic misconduct that can get you into trouble. Use our anti-plagiarism software to avoid plagiarism, and get higher grades in your classes. Join us for free today!"
Crossplag's AI Content Detector did not distinguish my text from Chat GPT-4's on my mixed test. Even when I moved Chat GPT-4's text to the front of the paragraph and my writing to the last part of the paragraph, the detector still showed the sample as 99% human. When I removed my writing from the mixed sample, the AI Content Detector identified that sample as 84% AI. Please note that I created a free account with Crossplag using my Google account, but I did not have the paid version of Crossplag, which may provide access to a more robust version of their checker.
Crossplag provides a link to a blog post about RoBERTa, an optimized method for pretraining self-supervised NLP systems built on Google's BERT (Bidirectional Encoder Representations from Transformers), in which the researchers claim, "Our optimized method, RoBERTa, produces state-of-the-art results on the widely used NLP benchmark, General Language Understanding Evaluation (GLUE)."
Crossplag does provide a disclaimer in plain view next to their AI Content Detector window: Disclaimer: AI Content Detector is based on a fine-tuned model of RoBERTa using the OpenAI dataset. AI Content Detector is in the testing phase and is not yet available for use by institutions. AI Content Detector generates results by analyzing up to 1,000 words at a time. AI Content Detector, at this point, works for English text only.
Sapling provides tools to help businesses communicate, offering Customer Relationship Management (CRM) support with an app that "sits on top" of CRMs and messaging platforms, providing real-time suggestions and feedback. Sapling's AI Detector allows you to paste content into a text window and provides feedback as a percentage of "Fake." The AI Detector also highlights the suspected AI content of each submission in red. Sapling's AI Detector was able to accurately identify the mixed sample of my normal writing and Chat GPT-4. In addition, Sapling's Detector breaks down each sentence, highlighting suspected AI sentences in lighter shades of red and leaving human text unhighlighted. A link is also generated so that the results can be shared for up to three days. Here is the link provided for my mixed test on April 14, 2023 - https://sapling.ai/ai-content-detector/d1523fbb89d22c0334020be35bfe4546.
Sapling’s FAQs can be reached by scrolling down the page under their AI Detector and include questions on accuracy and false positives.
Accuracy must be measured on a specific test or benchmark. There are also multiple measurements of "accuracy" for detection tools. These measurements balance catching as many AI-generated texts as possible while keeping false positives low. On our internal benchmarks, Sapling catches more than 97% of AI-generated texts while keeping false positives below 3%. Please note that these benchmarks tend to use longer texts and may not be representative of your text.
Sapling can have false positives. The shorter the text is, the more general it is, and the more essay-like it is, the more likely it is to result in a false positive. We are working on improving the system so that this occurs less frequently.
Sapling’s disclaimer is part of their instructions on how to use the AI Detector.
Type or paste text above to score. Note that the AI detector becomes much more accurate after 50 or so words. The token count (approximately the word count) will be shown as part of the score output.
No current AI content detector (including Sapling’s) should be used as a standalone check to determine whether text is AI-generated or written by a human. False positives and false negatives will regularly occur.
The top section will show the overall score and highlight portions of the text that appear to be AI-generated.
The bottom section will highlight sentences that appear to be AI-generated.
The detector for the entire text and the per-sentence detector use different techniques, so use them together (along with your best judgement) to make an assessment.
Can OpenAI detect itself? OpenAI has developed an AI checker which they call AI Text Classifier. They advertise it as a free tool to "spark discussions on AI literacy" and provide more detailed information, including Educator considerations for ChatGPT. Their Classifier has five designations for submitted text: very unlikely, unlikely, unclear if it is, possibly, or likely AI-generated. I submitted the mixed sample, and it designated my sample as unclear if it is AI-generated.
On their AI Text Classifier page, there is a section below the submission area titled as a question.
How effectively does the classifier detect documents co-written with AI? We have not thoroughly assessed the effectiveness of the classifier in detecting content written in collaboration with human authors.
OpenAI has many explanations and disclaimers regarding the use of Chat GPT and their AI Text Classifier. There is a section above their Classifier called Current Limitations.
Requires a minimum of 1,000 characters, which is approximately 150 - 250 words. The classifier isn’t always accurate; it can mislabel both AI-generated and human-written text. AI-generated text can be edited easily to evade the classifier. The classifier is likely to get things wrong on text written by children and on text not in English, because it was primarily trained on English content written by adults.
GPTZero was designed by a senior at Princeton University majoring in computer science with a minor in journalism. He noticed the reports about students using GPT to plagiarize writing assignments, and many of his teachers contacted him about it; he released his AI detector in January 2023 (Bowman, 2023). I gave this tool a "Fail" on the mixed test, but I still believe it is a very useful and valid tool. I find its generation of numeric scores for Perplexity and Burstiness to be very useful, and this tool works for me as a way to validate and test known material, as seen in section 9.4.
8.9 References
Bowman, E. (2023). A college student created an app that can tell whether AI wrote an essay. NPR. URL: https://www.npr.org/2023/01/09/1147549845/gptzero-ai-chatgpt-edward-tian-plagiarism.
Under GPTZero’s FAQ there is information on their model and its accuracy.
We test our models on a never-before-seen set of human and AI articles from a section of our large-scale dataset, in addition to a smaller set of challenging articles that are outside its training distribution. We classify 99% of the human-written articles correctly, and 85% of the AI-generated articles correctly, when we set a threshold of 0.65 on the completely-generated-prob returned by our API (human if below 0.65, AI if above 0.65). Our classifier achieves an AUC score of 0.98. GPTZero works robustly across a range of AI language models, including but not limited to ChatGPT, GPT-3, GPT-2, LLaMA, and AI services based on those models.
GPTZero has an FAQ page that includes disclaimers and explanations.
The nature of AI-generated content is changing constantly. While we build more robust models for GPTZero, we recommend that educators take these results as one of many pieces in a holistic assessment of student work. There always exist edge cases with both instances where AI is classified as human, and human is classified as AI. The accuracy of our model increases as more text is submitted to the model. As such, the accuracy of the model on the document-level classification will be greater than the accuracy on the paragraph-level, which is greater than the accuracy on the sentence level. The accuracy of our model also increases for text similar in nature to our dataset. While we train on a highly diverse set of human and AI-generated text, the majority of our dataset is in English prose, written by adults. Our classifier is not trained to identify AI-generated text after it has been heavily modified after generation (although we estimate this is a minority of the uses for AI-generation at the moment). Currently, our classifier can sometimes flag other machine-generated or highly procedural text as AI-generated, and as such, should be used on more descriptive portions of text.
In addition to making a determination of "written entirely by AI," GPTZero displays a numeric score in two categories: perplexity and burstiness. The meaning of these scores was not explained very well on the site, but a report by NPR provided a simple explanation. Perplexity measures the complexity of the text: the more perplexed GPTZero is by a submission, the higher the perplexity score and the more likely the text was created by a human. Lower perplexity scores are more likely to be AI. Burstiness represents variation among sentences. Humans have more variation in their sentences, mixing longer, more complex sentences with shorter ones. Thus, a higher burstiness score is more likely to be human (Bowman, 2023).
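GPTZero's exact formulas are not public, so the following Python sketch is only my attempt to make the two ideas concrete. It assumes the Hugging Face transformers library and the small public GPT-2 model, which is not necessarily what GPTZero uses: perplexity is computed from the model's average next-token loss, and burstiness is approximated as the spread of per-sentence perplexities.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids makes the model return the mean next-token
        # cross-entropy; exponentiating it gives perplexity.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def burstiness(sentences):
    # A rough proxy: the standard deviation of per-sentence
    # perplexities. Uniform, AI-like text scores low.
    scores = [perplexity(s) for s in sentences]
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

doc = ["The officer saw the car swerve.",
       "He waited.",
       "Then, without any warning at all, the driver bolted."]
print(perplexity(" ".join(doc)), burstiness(doc))

On text like the samples in this paper, lower numbers from both functions point toward AI, which matches the way NPR describes GPTZero's scores.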
8.9.3 GPTZero's Scoring
Even though I classified this tool as a "Fail", I still find it useful for validation purposes. The tool accurately detects the known samples of human and AI writing that I present to it. When a tool provides a numerical score as a sort of degree of AI, I find that useful. With this tool, lower scores move toward AI and higher scores toward human writing. In my experiments, I played around with this tool and the GLTR and compared samples of human and AI text in each to see if the results of GPTZero matched the results of the GLTR.
The Giant Language model Test Room (GLTR), also referred to as "glitter," was designed by the MIT-IBM Watson AI Lab and Harvard NLP. This model was developed in 2019 to test GPT-2, which is also part of the disclaimer in this section, as the model might not work as well on more recent versions like Chat GPT-4. Another issue with this tool is that it does not provide a clear prediction about whether a submission is real or fake, AI or human. It does provide all of the data needed to make that prediction, but the call is left to the user.
The GLTR provides more detail on what is happening behind the scenes than any other tool I have found. This is basically a word-level model. Each word is analyzed in terms of predictability based on the word or words that preceded it. A word highlighted in green means that the word was one of the top ten predicted words given the words before it. Yellow means the top 100, red means the top 1,000, and violet (purple) means that the word was not in the top 1,000 most predicted next words.
Each text is analyzed by how likely each word would be the predicted word given the context to the left. If the actual used word would be in the Top 10 predicted words the background is colored green, for Top 100 in yellow, Top 1000 red, otherwise violet. Try some sample texts from below and see for yourself if you can spot the difference between machine generated text and human generated text or try your own. (Tip: hover over the words for more detail)
The histograms show some statistic about the text: Frac(p) describes the fraction of probability for the actual word divided by the maximum probability of any word at this position. The Top 10 entropy describes the entropy along the top 10 results for each word (MIT-IBM Watson AI lab and Harvard NLP, 2019).
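To see what a GLTR-style analysis looks like in code, here is a minimal Python sketch, again assuming the transformers library and the public GPT-2 model that GLTR itself was built on. For each token, it counts how many vocabulary entries the model considered more probable; that count is the rank GLTR buckets into green, yellow, red, and violet.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def color(rank):
    if rank < 10: return "green"
    if rank < 100: return "yellow"
    if rank < 1000: return "red"
    return "violet"

def gltr_ranks(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, seq_len, vocab)
    for pos in range(1, ids.shape[1]):
        probs = logits[0, pos - 1].softmax(dim=-1)
        actual = ids[0, pos].item()
        # Rank = how many tokens the model preferred over the real one.
        rank = int((probs > probs[actual]).sum())
        yield tokenizer.decode([actual]), rank, color(rank)

for token, rank, c in gltr_ranks("The right of the people to be secure in their persons"):
    print(f"{token!r:12} rank={rank:5d} {c}")

This is a word-level (strictly, token-level) view: unpredictable human writing should scatter across yellow, red, and violet, while AI output stays mostly green.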
8.10 References
MIT-IBM Watson AI lab and Harvard NLP (2019). GLTR (glitter) v0.5. URL: http://gltr.io/dist/index.html.
Details regarding the design, purpose, limitations, and precision of the GLTR can be found in the 2019 research article describing the tool. The article, GLTR: Statistical Detection and Visualization of Generated Text, is available in PDF format. The researchers used 50 articles from each of the three human and generated data sources. The details of the validation are found under the heading Empirical Validation on page 4. The authors conclude the following.
We show how detection models can be applied to analyze whether a text is automatically generated using only simple statistical properties. We apply the insights from the analysis to build GLTR, a tool that assists human readers and improves their ability to detect fake texts (Gehrmann et al., 2019, 5).
8.10.1 GLTR Internal Validation
The GLTR was based on GPT-2's language model and thus may not be as accurate on text generated by other language models, including GPT-4. They provide the following disclaimer highlighted above the submission box so it can't be missed.
Disclaimer: This version of GLTR was made in 2019 to test against GPT-2 text. It might not be helpful to detect texts for recent models (ChatGPT). Follow us on mastadon for updates: hen@vis.social (or as long as Twitter is alive: @hen_str) (MIT-IBM Watson AI lab and Harvard NLP, 2019).
8.10.2 References
MIT-IBM Watson AI lab and Harvard NLP (2019). GLTR (glitter) v0.5. URL: http://gltr.io/dist/index.html.
Once I learned how the GLTR worked and how to read and visualize its results, I tested the mixed sample and could clearly see the visual difference between the two parts. Although the GLTR was created on GPT-2 and the AI part of the mixed sample was created with GPT-4, I assumed that human language and word-selection probability have not changed drastically since 2019 to a degree that would make the model useless.
The mixed test results of PASS or FAIL can look troubling on their face, but the details of each failure were reassuring to me. All of the failures resulted from the AI detector failing to detect AI when mixed with human text. Thus, the AI detectors labeled the sample as human even though it contained almost 50% AI-generated text. If I were to make a justice system comparison, because that is what I know, I would say this is where we want the failures to occur. It is better that the AI checker generates false negatives (calling something human that is not) as opposed to false positives (identifying a human text as AI). The test should illustrate that there are factors that interfere with the performance of some AI checkers. Mixing human and AI text together can obfuscate the presence of AI. The instances where AI checkers identified my writing as written by AI involved me attempting to write like AI. Sample 1 and Sample 3 of my writing were not natural for me, and in fact, it was difficult for me to create those false positives.
This section represents my very humble attempt to understand some of the very complicated science of language models and machine learning and translate that into something simple that I can use as a practical guide to understanding the methodology behind AI checking tools. The GLTR was created for this purpose. The color-coded output of the GLTR was intended for human eyes. Simplified patterns that humans can see and understand. The short answer to the question in this section is that AI is predictable and human language is not as predictable. GPT was based on language models. Language models analyze mountains of data to find the most commonly used or predictable words and emulate that method of communication in its output.
A language model is like a giant database or repository of not just words but books, articles, newspapers, and websites. This data is processed for the language model so that it can be referenced or used for various types of applications or purposes, including conversation. I asked Chat GPT-4 the following question: What is a language model in simple terms?
A language model, in simple terms, is a computer program designed to understand and generate human language. It learns patterns and structures in text data to predict and create new sentences or phrases. These models are often used for tasks like translation, speech recognition, text completion, and answering questions (OpenAI, 2021).
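To make that answer concrete, here is a toy next-word predictor in Python. It only counts which word follows which in a tiny made-up corpus; real language models learn far richer patterns over billions of words, but the principle of predicting the next word from what came before is the same.

from collections import Counter, defaultdict

corpus = ("the officer stopped the car because the officer "
          "saw the driver run the red light").split()

# Count, for every word, which words follow it and how often.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(prev, k=3):
    # The most probable next words after `prev` in this tiny corpus.
    return following[prev].most_common(k)

print(predict("the"))  # [('officer', 2), ('car', 1), ('driver', 1)]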
So, even though humans are more spontaneous and less predictable, humanity, as a whole, still generates patterns that can be analyzed that are predictable. Paragraphs written by humanity have predictability. A paragraph written by a human is less predictable. The colorful display generated by the GLTR was developed to help humans see evidence that a particular text was constructed by a language model and not a human. So for a bumpkin like me, green is AI, and purple is human. Yellow and red are the probability steps between green and purple. So starting with green, as most like AI, the path to becoming more human is yellow, then red, and then purple for the most human or most unpredictable. This color-coded display also helped participants visualize the difference between fake and real text in the experiments run by MIT and Harvard when testing the GLTR (MIT-IBM Watson AI lab and Harvard NLP, 2019).
9.1 References
OpenAI (2021). ChatGPT (Version GPT-4) [Computer software]. URL: https://chat.openai.com.
While the GLTR uses colors to visualize AI, GPTZero uses the same theory of predictability in their perplexity and burstiness scores. More perplexity in writing means more human, and thus lower perplexity scores are more likely AI. Burstiness is the variation of perplexity in a document sample. So lower burstiness scores are more likely AI. Thus, I can use GLTR and GPTZero to validate each other. Documents written by AI should be mostly green in the GLTR and have relatively low perplexity and burstiness scores in the GPTZero model. I put that to the test by breaking my mixed sample apart and separating my writing from GPT-4’s writing. Despite GPTZero’s problem with identifying mixed AI and human text, it was in agreement with the GLTR when I separated my writing from Chat GPT-4’s writing.
The GLTR was built on the GPT-2 language model. GPTZero was built on ChatGPT, GPT-3, LLaMA, and AI services based on those models (GPTZero, 2023). Although these are two different AI detection tools, they are based on the same concept of language models and probabilities. The color visualization of the GLTR shows a clear difference between my writing and GPT-4's writing. GPTZero's scores show a clear difference as well, with my Perplexity and Burstiness scores much higher than GPT-4's, meaning my writing is much less predictable than AI.
Lorenz Writing: My writing (human and less predictable) has more yellow, red, and purple words. GPTZero also finds my writing to be "entirely human" with a Perplexity score of 140.000 and a Burstiness score of 82.146.
Chat GPT-4 Writing: Chat GPT-4's writing is much more dominated by green, with only two purple words and four red words. Notice also how low GPT-4's scores were in GPTZero's analysis. GPTZero predicts GPT-4's writing is "entirely AI" with a Perplexity score of 18.000 and a Burstiness score of 3.367.
Another visualization to help understand language models and prediction is a handy GLTR tool, which allows you to hover over a single word in your writing sample to see how predictable that word is in the language model used. Even punctuation, such as commas and periods, is predicted. Remember that the predictability of each word or punctuation mark is based on what is to the left of the word or punctuation mark. Predictability is not just based on the word to the left, but perhaps the entire sentence or group of words to the left.
As an example, (figure 8-1) shows the word "the" at the top of the list of predicted words after the word "to" in the context of that sentence; therefore, its top_k pos: is 0. The top 10 predicted words cause a word to be highlighted in green. Notice that (figure 8-2) shows that the word "and" is the second most predictable choice, and therefore its top_k pos: is 1. It is still a top 10 word, so it is still highlighted green. (Figure 8-4) shows the word "myself" is #20 on the predictability scale and is highlighted yellow because it is in the top 100 words. (Figure 8-6) shows that the word "department" is 1,263 on the predictability list, and because it falls outside the top 1,000, it is colored purple. Notice that the comma in (figure 8-3) is also predicted as the most likely choice after the word "writing." Each word or punctuation mark is rated in this way, generating the color-coded sample of text, which, when viewed as a whole, provides a visual representation of AI or human text.
Figure 8: GLTR Results and Word Analysis
The Bill of Rights is the first 10 amendments to the U.S. Constitution and was ratified on December 15, 1791 (US Census Bureau, 2022). James Madison sponsored the Bill of Rights and went on to be the fourth President of the United States (National Archives, 2021). Charles Babbage first imagined that a machine could be used to do the work of humans sometime around 1812 or 1813 (Harris & Pollette, 2023). It is widely agreed that the first programming language was written by Ada Lovelace in 1843 (Gregersen, 2023). To avoid creating an argument regarding the definition of a computer or disparaging who actually invented "the first computer," the first modern computers were built about 75-85 years ago. The point is that it is highly unlikely James Madison or our founding fathers had access to computers or programming languages in the late 1700s. So, it is interesting to me that the 4th, 5th, and 6th Amendments to our U.S. Constitution are considered to have been written by AI by 7 of the 10 AI checkers I tested in this paper.
9.6 References
US Census Bureau (2022). Bill of Rights Day (1791): December 15, 2022. URL: https://www.census.gov/newsroom/stories/bill-of-rights-day.html.
National Archives (2021). Constitution of the United States—A History. URL: https://www.archives.gov/founding-docs/more-perfect-union.
Harris, W., & Pollette, C. (2023). Who Invented the First Computer? URL: https://science.howstuffworks.com/innovation/inventions/who-invented-the-computer.htm.
Gregersen, E. (2023). Ada Lovelace: The First Computer Programmer | Britannica. URL: https://www.britannica.com/story/ada-lovelace-the-first-computer-programmer.
I confess to experimenting a little more with the GLTR than was needed for the original intent of this project, which was to test some common AI checkers to see if they worked. Since the GLTR knows the most probable next word in every sentence and ranks those words from 0 (most probable) to possibly tens of thousands (least probable), I wanted to use it to create a "very" AI (predictable) sentence and a "very" human (unpredictable) sentence. I started typing the word "The" and then typed one word at a time to try to create a single sentence where each word had the rank of zero - the most probable word and thus the most probable sentence. I did this without regard for what the sentence was about or where it was going. Keep in mind that each new word adds to the complexity of the probability of the next word, as the language model's probability for each new word considers the words that came before it. The sentence should be completely green, and the frac(p) histogram should classify each word as 1.00. I could also theoretically use the GLTR to make a sentence that is all purple. When I am trying to understand something complicated, seeing extremes in data helps me visualize and better understand how something works. By creating the two extreme sentences shown in (figure 9), I could better understand how AI is detected in the GLTR and how AI like Chat GPT-4 creates sentences based on the same language models. The unpredictable sentence I created needed to be an understandable sentence, but I wanted to create as much purple as possible and use words much further down the list of probable words in language models. My playing around in the GLTR helped me understand the Bill of Rights anomaly I discovered later.
Figure 9: The Perfect Sentence
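My rank-0 sentence was built by hand, one word at a time, but the same experiment can be run directly against GPT-2 (the model behind the GLTR) with a few lines of Python: start with "The" and repeatedly append the single most probable next token. This is what is called greedy decoding, and the result should come out solid green in the GLTR.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits
        # argmax = the rank-0 (most probable) next token, every time.
        next_id = logits[0, -1].argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0]))  # a maximally "green" sentence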
When I was comparing output by Chat GPT to human writing, I would sometimes select something I wrote previously, or sometimes I would just Google something I knew was human just to compare. I happened to do that with the 4th Amendment one day, and the results surprised me. The 4th Amendment has all the characteristics of AI. When I pasted it into the GLTR, it was almost completely green, and thus, by the GLTR model, almost certainly AI.
The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
I checked the 4th Amendment in GPTZero, and it determined the text to be human, but the 4th Amendment generated very low perplexity and burstiness scores (the lower the scores, the more likely AI). The 4th Amendment is a single sentence, and more information is better when making a determination. I wanted to make sure this was not an error and that I had enough material for the models to test, so I combined the 4th, 5th, and 6th Amendments into one paragraph. The result from the GLTR was the same, showing all green (figure 10). When I pasted that combined paragraph into GPTZero, it was considered to be entirely written by AI with a Perplexity score of 1.667 and a Burstiness score of 1.155. When compared to other submissions known to be created by AI, these scores were much lower (figure 12). This combined paragraph also received the most clear-cut results I have ever seen from ContentAtScale's AI Detector, with all of the text highlighted in red, pronouncing it a 98% probability to be AI.
Figure 10: Combined Amendments in the GLTR
I examined the combined Amendments in the GLTR, hovering over each word to instantly see its probability ranking. You can also see this visually on the top k count and the frac(p) histogram. The combined Amendments are 243 words, 1,401 characters, and 379 syllables. Only 3 words in the amendments were not in the top 10 (green) probable words. A total of 114 words and punctuation marks were ranked 0, meaning they were the most probable word or punctuation mark in the language model. It took me 10 minutes, a computer, and a mathematical model created by MIT and Harvard to construct a simple sentence where all words were ranked 0. Of course, we don't need an AI checker to tell us the 4th Amendment was not written by AI. All we need is Google to tell us, or Grammarly, which immediately pegged it as 100% plagiarized. However, AI checkers based on language models do not look for plagiarism or even grammar. They are only measuring probability.
Figure 11: Combined Amendments ContentAtScale.ai
Figure 12: Combined Amendments in GPTZero
Language models are fed information. That information was created by humans throughout time and now occupies the internet and digital world in the way of books, websites, blogs, audio, and video. Some of the words I have written are on the internet and may be fed into a language model. However, it is not likely that my words would affect the ranking of English words in a language model. But what if something was written that became the subject of millions of writings, books, websites, movies, legal codes, and thousands of legal decisions? If all of that material is fed to a language model, and language models are based on patterns of words and the probability that one word is more likely than another to follow the words preceding it, then the frequency of words and phrases could be affected by what is well known and popular.
The data comes from humans in various forms, and it could be from AI when humans use AI to create new content. Large language models have been fed datasets that include almost everything written on the internet over a prolonged period of time (Lee, 2023). Larger language models have been trained on Wikipedia, news stories, scientific papers, and novels (Dey, 2022). The 4th Amendment is quoted in its entirety in millions of pieces of data on the internet, including sources like Wikipedia (Wikipedia, 2023). There are ways to quantify this, including something simple I did. I used Google to search for the entire text of that quote. I also searched for documents containing a reference to "the 4th Amendment," which may not have quoted it in its entirety.
9.8 References
Lee, A. (2023). What Are Large Language Models Used For and Why Are They Important? URL: https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/.
Dey, V. (2022). How MIT is training AI language models in an era of quality data scarcity. URL: https://venturebeat.com/ai/how-mit-is-training-ai-language-models-in-an-era-of-quality-data-scarcity/.
Wikipedia (2023). Fourth Amendment to the United States Constitution. URL: https://en.wikipedia.org/w/index.php?title=Fourth_Amendment_to_the_United_States_Constitution&oldid=1148519570.
Figure 13: Ubiquity Makes the 4th Amendment Look Like AI
I do not think James Madison was an alien or that he had access to AI in 1789. My assumption, then, is that the 4th Amendment and our Bill of Rights make up some of the most ubiquitous words in our language. That, combined with the unique language of the 4th Amendment, means that language models see the word order as a very common part of our language. The right of the people to be secure in their ____, houses, papers, and effects... The answer to the fill-in-the-blank question is "persons". The first line of the 4th Amendment contains unique language, as most humans don't talk that way, but the 4th Amendment has been uploaded and quoted so often that language models predict that the #1 most probable word in that blank is "persons". Remember, that prediction is based on the context of the words to the left, so "persons" is not always the logical word that follows the word "their" in other sentences.
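Whether "persons" really tops a given model's list is easy to check for yourself. The sketch below asks GPT-2 (my stand-in here; the models behind the commercial checkers are unknown to me) for its top predictions after the opening words of the 4th Amendment.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The right of the people to be secure in their"
ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # scores for the next token

top = logits.topk(5).indices
print([tokenizer.decode([int(i)]) for i in top])  # is " persons" near rank 0?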
GOOGLE SEARCHES "Entire Amendment in Quotes"
4th Amendment - About 104,000 results (0.45 seconds)
5th Amendment - About 35,200 results (0.59 seconds)
6th Amendment - About 39,500 results (0.56 seconds)
GOOGLE SEARCHES "By Title"
"the 4th amendment" - About 73,900,000 results (0.59 seconds)
"the fourth amendment" - About 82,700,000 results (0.61 seconds)
"the fourth amendment to the us constitution" - About 17,000,000 results (0.70 seconds)
"the 4th amendment to the us constitution" - About 81,400,000 results (0.62 seconds)
GOOGLE SCHOLAR HITS
"4th Amendment" with "case law" checked - About 722,000 results (0.08 sec)
"4th Amendment" with "articles" checked - About 655,000 results (0.08 sec)
My theory to explain why some very reputable AI detectors see the 4th Amendment as AI is based simply on the volume and ubiquity of the 4th Amendment and the fact that it is quoted in its entirety in thousands of sources of data, which are then fed to the language models used by AI and AI detectors. I think I am correct, and it would explain everything based on my admittedly limited and superficial knowledge of language models. But am I correct? I was having difficulty finding the answer using traditional sources such as Google and Google Scholar. I found supporting data regarding the popularity of words affecting probability but nothing on the 4th Amendment specifically. So, I decided to ask Chat GPT-4 that question. First, let me say that if I had asked this question as an attorney in a court of law, an objection to it would likely be sustained as a "leading question". In fact, it was a leading question, because the first question I asked Chat GPT-4 in the same chat string did not explicitly address the issue of popularity and ubiquity. In order to be fair and transparent, I am providing you with the transcript of both questions I asked Chat GPT-4.
Lorenz: Why do AI content detectors think the 4th amendment is written by AI?
Chat GPT-4: AI content detectors might classify the text of the Fourth Amendment as being written by AI because of several factors:
1. Language patterns: AI-generated text may share specific language patterns or characteristics with well structured legal texts, such as the Fourth Amendment. Both can be concise, clear, and use specific legal terminology, which could confuse AI detectors.
2. Training data: AI content detectors are trained on a large dataset of human and AI-generated text. If the training data does not contain enough examples of legal texts, the model may struggle to correctly identify the source of such content.
3. Context: AI content detectors may lack the necessary context to differentiate between AI-generated text and a historical legal document like the Fourth Amendment. Without context, the model might only analyze the structure and wording, leading to misclassification.
4. Limitations of AI: AI models, including content detectors, are not perfect and can make mistakes. This is especially true for complex or nuanced texts, like legal documents. It’s essential to continuously improve AI content detectors and provide them with a diverse range of training data, including legal texts and historical documents, to reduce the likelihood of misclassification in the future.
Below is my immediate "leading" question to Chat GPT-4 as a direct follow-up to the previous question.
Lorenz: Does the frequency and ubiquity of the 4th amendment in many sources of data influence the probability of those words in language models and, therefore, make the language in the 4th amendment seem like it was put together by AI based on language models?
Chat GPT-4: Yes, the frequency and ubiquity of the Fourth Amendment in many sources of data can influence the probability of those words in language models, and as a result, make the language in the Fourth Amendment seem like it was generated by AI based on language models.
Language models, like GPT-4, are trained on vast amounts of text data from various sources, including legal documents, historical texts, and online content. The more frequently a particular text or set of words appears in the training data, the more likely it is that the model will learn and assign a higher probability to those words and their combinations.
The Fourth Amendment is a well-known and often-cited part of the U.S. Constitution. Given its significance and widespread presence in legal and historical contexts, it is likely that language models have encountered the text multiple times during training. As a result, the model may have learned the patterns, structure, and phrasing of the Fourth Amendment quite well.
When the language of the Fourth Amendment is analyzed by an AI content detector, the similarities in phrasing and structure to AI-generated text might lead the detector to misclassify the text as AI-generated. This can be due to the fact that AI-generated text often mimics the patterns and structures found in the training data, which, in this case, would include the Fourth Amendment.
In summary, the frequency and ubiquity of the Fourth Amendment in the training data of language models can lead to a higher probability of those words and their combinations, making the text seem as if it was generated by AI, even though it is a historical legal document.
It does not appear that it does, and it denies that it does, and it would be foolish if it did. Most mobile phone carriers, like AT&T and Sprint, don't store the text messages of their users; if they do, it is for a very limited time. They don't want to be subpoenaed into thousands of legal cases each year involving the use of their service. Google will release search history and other information, but only in response to search warrants issued by law enforcement agencies. We can't subpoena Chat GPT-4 or OpenAI to find out if a piece of suspected text was written by their AI. Every time I asked Chat GPT-4 to tell me about Miranda vs. Arizona, it gave me a different answer. Every time I asked it to write a poem about Miranda vs. Arizona, it created a different poem. The information was correct, and the poems were good and creative, but they were different. Even asking Chat GPT-4 the identical question will yield many different responses.
Lorenz: Do you remember what you write
Chat GPT-4: As an AI language model, I don’t have the ability to recall specific instances of text that I have generated in the past, nor do I have the capability to remember specific conversations. My responses are based on the text and context provided by users during each interaction, and I don’t possess personal memory or experiences like humans do. If you have any questions or need more information, please feel free to ask.
My students always ask me what tool or method is best for solving crimes. I always tell them it depends on many different things, but there is no one tool that works better than everything else for everything. DNA is fantastic evidence for putting someone at a crime scene who doesn’t belong at the crime scene. DNA is useless when both the suspect and the victim live together, and one kills the other with a weapon they both own and share. Context is everything when it comes to solving a puzzle, a crime, or catching students cheating.
To further complicate the issues educators have with AI detection tools, AI doesn’t have DNA. DNA can positively identify a human being to a degree of accuracy that is beyond scientific certainty. AI detectors don’t come close to that. At best, we are making educated guesses based on the AI of today and how a particular AI was constructed. What language models were used to build the AI’s knowledge base? How does this affect the results of one or more of the various AI detectors? What language models were used to build the AI detector? How does that factor into the accuracy of the detector’s results?
As I discussed earlier, AI checkers are not making determinations that are the result of some precise test that conclusively identifies AI in text. Unlike plagiarism tools, which help us detect plagiarized content, AI checkers don't point to a place where we can verify the answer for ourselves. If a plagiarism checker locates my student's content on Course Hero, I can go to Course Hero and see for myself. So before I give the student a zero or kick him out of class, I can be comfortable and verify or validate what the plagiarism checker has told me. Red light cameras in the City of Killeen used a computer program to determine if someone ran a red light. However, tickets were not issued until a human at the police department viewed the video of the incident to confirm the conclusion of the software. One person I knew was assigned just for that purpose. If an AI checker tells us a student's submission is AI, where do we go to verify that?
The results of the limited test that I conducted are reassuring. The failures are the AI detectors failing to detect AI and thus classifying something as human that is not - false negatives. I found no examples in which AI detectors identified human writing as AI, save the two samples where I was intentionally trying to write like AI. That is where we want the failures to occur - erring on the side of allowing fake submissions to slip through. My tests were limited in number, and much more testing would need to be done.
There are those who are trying to solve some of the problems created by AI, including the problem of plagiarism. One possible solution is a hidden watermark on AI-generated text. An article in the MIT Technology Review reviews the progress and future of detection methods, including the subject of watermarking. In that article, John Kirchenbauer, a researcher involved in watermarking AI at the University of Maryland, calls this "the Wild West" (Heikkila, 2023). This is not like watermarking photographs; it is a much more complicated, algorithmic process. The process is described in the following paper, which is awaiting peer review (Kirchenbauer et al., 2023).
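Reduced to its core idea, the Kirchenbauer et al. (2023) scheme works like this: before each token is generated, the previous token seeds a pseudorandom split of the vocabulary into a "green list" and a "red list," and the generator softly favors green tokens. A detector that knows the seeding rule simply counts green tokens; human text lands near chance, watermarked text far above it. The toy Python sketch below follows the paper's idea, but all the constants and function names are my own, not the authors' code.

import torch

VOCAB_SIZE = 50257   # GPT-2's vocabulary size, as an example
GREEN_FRAC = 0.5     # fraction of the vocabulary marked "green"
DELTA = 2.0          # logit boost applied to green tokens

def green_list(prev_token):
    # The previous token seeds the RNG, so the same green list can be
    # recomputed at detection time without storing anything.
    g = torch.Generator().manual_seed(prev_token)
    perm = torch.randperm(VOCAB_SIZE, generator=g)
    return perm[: int(GREEN_FRAC * VOCAB_SIZE)]

def bias_logits(logits, prev_token):
    # Generation side: softly favor green tokens before sampling.
    biased = logits.clone()
    biased[green_list(prev_token)] += DELTA
    return biased

def detect_z_score(tokens):
    # Detection side: count how often each token falls in its
    # context's green list, then test against the chance rate.
    hits = sum(t in set(green_list(p).tolist())
               for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = GREEN_FRAC * n
    variance = GREEN_FRAC * (1 - GREEN_FRAC) * n
    return (hits - expected) / variance ** 0.5  # large z => watermarked

Notice that detection needs no access to the model that generated the text, only the seeding rule, which is part of what makes the approach attractive for AI checkers.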
10.3 References
Heikkila, M. (2023). A watermark for chatbots can expose text written by an AI. URL: https://www.technologyreview.com/2023/01/27/1067338/a-watermark-for-chatbots-can-spot-text-written-by-an-ai/.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. URL: http://arxiv.org/abs/2301.10226. https://doi.org/10.48550/arXiv.2301.10226
As I referenced in section 6.1, there are unintended consequences to everything. The very tools we use to detect AI can also be used by students to test whether their AI-generated work will pass our AI checkers. Chat GPT-4 allows me to hit the "regenerate response" button as often as I like, or I can ask it to rewrite in a casual style. Each time it generates different text, and using AI checkers, I can keep trying until I create something that passes as human.
Yes, but not to discipline students. They are effective tools, but context is everything, and it is important to know what they are and what they are not. I encourage everyone to use and test them as much as possible and make their own determinations about how much reliance they can place on those tools. Much more research needs to be done. We should consult with each other and use multiple AI checkers to help verify one another. As long as we understand the capabilities and limitations, these tools can be very effective and useful, even if they are not 100% conclusive.
The short answer is no. The simple reason is that AI checkers cannot reliably protect against false positives. Remember, a false positive occurs when an AI detector determines that something written by a human was written by AI. I first conducted this research in February of 2023. Most institutions were using AI detectors as a tool to detect AI and discipline students, and some institutions had no policy for how AI detectors could or should be used.
Today, most institutions do not allow AI checkers to be used to decide academic-integrity cases. That is the subject of another paper...

References
Bowman, E. (2023). A college student created an app that can tell whether AI wrote an essay. NPR. URL: https://www.npr.org/2023/01/09/1147549845/gptzero-ai-chatgpt-edward-tian-plagiarism.
Council of Europe (2023). History of Artificial Intelligence - Artificial Intelligence - www.coe.int. URL: https://www.coe.int/en/web/artificial-intelligence/history-of-ai.
Dey, V. (2022). How MIT is training AI language models in an era of quality data scarcity. URL: https://venturebeat.com/ai/how-mit-is-training-ai-language-models-in-an-era-of-quality-data-scarcity/.
Duh, K. (2022). Why better search algorithms will encourage better writing. URL: https://writer.com/blog/ai-content-seo/.
Duo, M. (2019). What Does SEO Stand For? (+ 7 Beginner’s Tips for Ranking a Site). URL: https://kinsta.com/blog/what-does-seo-stand-for/.
Gehrmann, S., Strobelt, H., & Rush, A. (2019). GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 111–116). Florence, Italy: Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/P19-3019. doi:10.18653/v1/P19-3019
GPTZero (2023). GPTZero Frequently Asked Questions. URL: https://app.gptzero.me/app/faq.
Gregersen, E. (2023). Ada Lovelace: The First Computer Programmer | Britannica. URL: https://www.britannica.com/story/ada-lovelace-the-first-computer-programmer.
Harris, W., & Pollette, C. (2023). Who Invented the First Computer? URL: https://science.howstuffworks.com/innovation/inventions/who-invented-the-computer.htm.
Heikkila, M. (2023). A watermark for chatbots can expose text written by an AI. URL: https://www.technologyreview.com/2023/01/27/1067338/a-watermark-for-chatbots-can-spot-text-written-by-an-ai/.
IBM (2023). What is Artificial Intelligence (AI)? | IBM. URL: https://www.ibm.com/topics/artificial-intelligence.
Imperial War Museums (2023). How Alan Turing Cracked The Enigma Code. URL: https://www.iwm.org.uk/history/how-alan-turing-cracked-the-enigma-code.
Kashyap, K. (2023). AI-Generated Content Not Against Its Policies: Google Clarifies. URL: https://www.spiceworks.com/marketing/content-marketing/news/ai-generated-content-not-against-google-policies/.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. URL: http://arxiv.org/abs/2301.10226. https://doi.org/10.48550/arXiv.2301.10226
Lee, A. (2023). What Are Large Language Models Used For and Why Are They Important? URL: https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/.
Marketing Artificial Intelligence Institute (2023). Meta/Facebook AI: What Businesses Need to Know. URL: https://www.marketingaiinstitute.com/blog/how-facebook-uses-artificial-intelligence-and-what-it-means-for-marketers.
McCarthy, J. (2007). What is Artificial Intelligence?
McCarthy, J., Minsky, M., Rochester, N., & Shannon, C. (1955). A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence. URL: http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html.
MIT-IBM Watson AI lab and Harvard NLP (2019). GLTR (glitter) v0.5. URL: http://gltr.io/dist/index.html.
National Archives (2021). Constitution of the United States—A History. URL: https://www.archives.gov/founding-docs/more-perfect-union.
OpenAI (2021). ChatGPT (Version GPT-4) [Computer software]. URL: https://chat.openai.com.
OpenAI (2023a). About OpenAI. URL: https://openai.com/about.
OpenAI (2023b). GPT-4. URL: https://openai.com/research/gpt-4.
Quill (2023). Quill.org | Interactive Writing and Grammar. URL: https://www.quill.org/.
Schwartz, B. (2022). Google: We Have Algorithms To Detect & Demote AI Altered Plagiarized Content. URL: https://www.seroundtable.com/google-ai-plagiarized-content-34495.html.
Turing, A. M. (1950). I.—Computing Machinery and Intelligence. Mind, LIX, 433–460. URL: https://academic.oup.com/mind/article/LIX/236/433/986238. doi:10.1093/mind/LIX.236.433.
US Census Bureau (2022). Bill of Rights Day (1791): December 15, 2022. URL: https://www.census.gov/newsroom/stories/bill-of-rights-day.html.
West, D. M. (2018). What is artificial intelligence? URL: https://www.brookings.edu/research/what-is-artificial-intelligence/.
Wikipedia (2023). Fourth Amendment to the United States Constitution. URL: https://en.wikipedia.org/w/index.php?title=Fourth_Amendment_to_the_United_States_Constitution&oldid=1148519570.