Friday, October 30, 2020

Psychic AI and Plagiarism Detection

Artificial Intelligence is used to sell a lot of baloney. It would be bad enough if it were used only to teach badly and provide poor assessments of student work, but AI is also being hawked as a means of rooting out plagiarism. For an example of this phenomenon at its worst, let's check in on a little webcast from Mark Boothe at Canvas Learning Management System. He's talking to Shouvik Paul at Copyleaks, a plagiarism checking company and partner of Canvas. I'm going to watch this so you don't have to--and you shouldn't. But you should remember the names just in case somebody at your place of work suggests actually using these products.

We start with a quick intro emphasizing Copyleaks' awesomeness. And then Boothe hands it over to Paul, the Chief Revenue Officer at Copyleaks, because when you want to talk about a product, you definitely want to talk to the revenue people at the company. Incidentally, sales and marketing has been Paul's entire career--no computer or education background anywhere in sight. But this is going to be a sales pitch for thirty-some minutes. Great.

First, Paul offers general background on Copyleaks. An AI company, building "very cool" stuff. That includes a product that does grading of essays on standardized tests. It takes humans hours, but their AI can grade those papers "within seconds" within 1% accuracy of a human grader. Spoiler alert: no, it can't. They have offices around the world.

So they were working on ed tech, and "as we all know" everyone from universities through K-12 is using some kind of plagiarism detection (oh my lord-- does that mean there are first grade teachers out there running student paragraphs through Turnitin?). Paul says they found that some of the technology out there was outdated, meaning that when you're out there in education dealing with students, "it's such a cat and mouse game--they're always looking for new ways to beat the system." So we're going to adopt a cynical premise about those awful students as a starting point. Great.

"Let's face it. What's the first thing a student's going to do? They're going to YouTube, and they're going to type in something like 'how to cheat plagiarism check.' Right?" And he is showing us on screen many many many videos on how to beat the plagiarism detection software out there. The most common recommendation is to paraphrase.

So they asked themselves why plagiarism detectors weren't detecting paraphrasing and the answer they came up with is "paraphrasing is really complicated" which I guess is more esteem-affirming than "software is really stupid and doesn't actually understand words." That admission would also have implications for a software product that claims it can grade an essay in a second.

But take "I'm going to Utah." Paul points out there are only a few ways to say that, which is just wrong, unless you're not very bright or don't have a very big vocabulary. I toyed briefly with listing all the many ways one could write that sentence, but none of us have the time for that. Paul points out that millions of people might write it the same way, but that doesn't mean they're plagiarizing--there's just only "that many" ways to say it. It occurs to me that an important factor here is the reasons that someone would want to write the sentence in the first place, but I don't believe we're thinking that hard about the problem.

All of that is beside the point of paraphrasing, so he heads back to that, suggesting that they figured (he keeps saying "we" and how much do you want to bet that the revenue and sales teams were not actually a part of any of these product development discussions) that AI could be used to detect paraphrasing (because AI is like magic fairy dust sprinkled by unicorns pooping rainbows). 

And boy does he believe in magic, because he says the goal was to spot paraphrasing done with the intent of beating the system, and if there were software that could somehow read the writer's intent, that would be beyond amazing, since even humans have trouble sometimes detecting author's intent in a piece of writing.

But they have many customers, including big time legit colleges and universities. It's becoming clear that their marketing niche is all about the paraphrasing thing. They're uncovering more of that, Paul says. 

Other sales points he'd like to hit on. They're doing this in over 100 languages. They can also check code for computer science departments "being computer programmers ourselves." (Paul has never been a computer programmer.) Paul also claims they've figured out how to make the software self-improving so that it will stay caught up with the hot new plagiarism techniques. These are very "top level" things they are doing.

Now some info about how exactly they work, right after he sings the praises of Canvas, with whom they are deeply integrated. An unbelievable number of people are calling them to ask if they can help with Canvas. And they can! It's turnkey! The controls are easy! They accept many files! They will catch paraphrasing! Here are many screenshots of the software controls. I am afraid that Mark Boothe may have left, or fallen asleep, or something.

Oh, hey-- that's interesting/alarming. I can compare a paper to a paper from a couple years ago. Who was storing that, and why? They get papers from all these schools and "add them to our internal database." And colleges and universities can add their own stored documents to the database, and they can check those for exact matches and--ta-dah--paraphrases.

Here are some screen examples of what the results would look like. It's pretty typical-- highlighted naughty parts with listing of where the match was found. Color-coded, so you can see exact matches or--yahoo--paraphrases. 

Now he's talking about intent again. Since there are only a few ways to write a sentence, we'd get too many false positives from people just landing close, somehow. But here comes the AI. "It's essentially understanding the logic." No, it's not. Okay, he's really stumbling through this explanation, so let me try to piece it together. See, if the page was blank, someone might have just written that sentence to write that sentence. But the AI is going to look at the context. And it's going to look for any other indications that "there is intent to cheat, deceive or to plagiarize." He's not going to tell us what those would be, though he looks at the sample paper and says the intent is pretty obvious, by which he apparently means that the paper is already loaded with cut-and-paste theft. Which is an interesting argument, since one might also argue that a student who depends on that much cut-and-paste is showing that he lacks the ambition to paraphrase anything. 

He's still stammering, the point being that instead of millions and millions of hits, the AI is narrowing it to just "the most likely" ones intended to beat the system. Again, if these guys have developed software that can divine author's intent, they can be doing so much more in the world than catching student paraphrasers. Imagine if, for instance, all political and diplomatic documents could be run past software that detects and explains the author's intent. 

Coding examples now. Zzzzzzzzzzzzz.......

He has now entered that mode of the person making the sales pitch who has run out of pitch, but not out of time. Post-Covid era, universities want to save money. We take customer support really really seriously. Very hands on. Opening the floor for questions.

Hey! Mark is still here and awake. He asks if they specialize more in college or K-12, and golly bob howdy but if they don't do really well in both. Historically it's been more higher ed, but "post Covid" (he has now referenced being "post Covid" a couple of times as if it's "now" which is an odd thing to say if you're anyone other than a person on the Trump re-election team, because we certainly still seem to be in a "during Covid" place) a whole lot of K-12 want this for remote learning stuff to go with their new Canvas learning platform. But there really is no difference in the way the product works, the methodology, what they search, for one market or the other. Second grade paragraph about flowers, college thesis about quantum physics-- they both apparently get the same treatment.

They are actually working on a "paper" about plagiarism trends pre and post Covid. They are finding a huge jump in cheating-- a "very unusual spike." They're watching the trends.

Audience question-- what databases does this tool check against? Answer: a whole bunch. Many. 15,000 academic journals. A whole gamut. Follow-up question about moving from Blackboard to Canvas. Do they lose previously submitted papers? Answer: that's on you, basically. You do the exporting, if you can, good luck.

Mark calls for any last pitch from Shouvik, and he's going with, boy, people just keep hiring us, we are using AI, we are moving forward with machine learning, and that's why folks love us, including people outside the education world. Okay--this is interesting; the BBC uses them to hunt down people who are stealing BBC content and monetizing it on other sites. They promise better results. Call them for more details or a demo. Here's my email.

Boothe takes the wrapup. 

And institutions buy this. And students have to jump through this hoop, and have their work discredited if they miss a hoop because some software psychically read their intent. There are days when I think the term "artificial intelligence" should be banned.

