
Language detectives make the web less anonymous

By Maya Kaufman

August 6, 2015 / 6:00 AM EDT / CBS News

"Take dem bullets out the house," the text message read.

The text was allegedly sent by Rickey Cummings, a 23-year-old in Waco, Texas. In 2012, Cummings, a suspected Bloods gang member, was sentenced to death for murder in the 2011 shooting deaths of two men. The text, sent one day after the murders, was used as incriminating evidence.

Yet in July 2014, Cummings' lawyers, who believed Cummings hadn't written the text and were hoping to get a new trial, sought the opinion of forensic linguist Robert Leonard. Leonard is part of a growing field of experts who analyze language in criminal investigations to identify a message's author.

Leonard examined six anomalous text messages sent from Cummings' phone in the days following the murders. By comparing those texts with Cummings' known writing style, Leonard concluded the text "take dem bullets out the house" was not consistent with that writing style. The implication? Cummings may not have written the text that helped lead to his conviction.

What is forensic linguistics?

Forensic linguistics is the application of linguistics, or the study of language, to the law. Forensic linguists examine language to identify patterns or distinctive traits in the author's style or decipher meaning and intention. In recent years, forensic linguists have begun to offer those observations to help solve cases.

Whereas handwriting analysis was once a common investigative tool, forensic linguistics is emerging as a way to analyze newer forms of communication, such as text messages, emails and posts on social media. With these media, it is easier than ever to be anonymous -- whether by setting up social media accounts under an alias, gaining access to another user's account, or even taking someone else's phone. Anonymity allows people to express themselves more candidly, but it also hides the identity of stalkers, bullies, harassers and criminals. Technologies that shield all online traces of a message's sender are becoming more sophisticated.

In this murky and often deceptive digital environment, proponents say forensic linguistics may hold a key to unveiling who's behind malicious, anonymous posts by using the one thing they leave in the open: their words.

"The quest for who wrote this will fall more and more on, I think, forensic linguists," Leonard told CBS News. "If indeed the computer analysts ... can't find a certain message's history and there's no DNA on the envelope and there's no handwriting, we, I think, will have a bigger and bigger role."

Creating a linguistic profile

Two of the main focuses of forensic linguistics are authorship attribution (figuring out who wrote -- or didn't write -- a given text) and profiling (identifying the demographic characteristics of the author, if the actual person cannot be identified).

In the Cummings case, an example of authorship attribution, Leonard was tasked with figuring out whether Cummings had written the suspicious text messages on his phone. The lawyers gave him several groups of texts, separated by author, and asked him which group the suspicious messages seemed to belong to. Leonard observed that Cummings consistently replaced "s" with "5" and used "them," not "dem," in his messages. Therefore, the text "take dem bullets out the house" did not appear to match Cummings' writing style.
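For readers curious how such a comparison might look in software, here is a minimal Python sketch. It is not Leonard's method, which the article does not describe in detail. It only counts the two markers mentioned above ("5" for "s," and "them" versus "dem") and flags where a questioned text departs from an author's known habits. The function names and the two "known" messages are hypothetical placeholders, not case evidence; only the questioned text is quoted from the article.

```python
import re

def marker_profile(messages):
    """Count a few hand-picked style markers across a list of messages."""
    text = " ".join(messages).lower()
    tokens = re.findall(r"[a-z0-9']+", text)
    return {
        "them": tokens.count("them"),
        "dem": tokens.count("dem"),
        # Tokens that mix letters with the digit 5, e.g. "hou5e"
        "five_for_s": sum(1 for t in tokens if "5" in t and re.search(r"[a-z]", t)),
        # Tokens that contain an ordinary letter "s"
        "plain_s": sum(1 for t in tokens if "s" in t),
    }

def conflicts(known, questioned):
    """List markers where the questioned text departs from the known habit."""
    notes = []
    if known["them"] > 0 and known["dem"] == 0 and questioned["dem"] > 0:
        notes.append("uses 'dem' where the known writing uses 'them'")
    if known["five_for_s"] > 0 and known["plain_s"] == 0 and questioned["plain_s"] > 0:
        notes.append("writes 's' where the known writing substitutes '5'")
    return notes

# Hypothetical placeholder messages standing in for an author's known texts.
known_messages = ["call them when u get thi5", "tell them im at the hou5e"]
# The questioned text as quoted in the article.
questioned_messages = ["take dem bullets out the house"]

known_profile = marker_profile(known_messages)
questioned_profile = marker_profile(questioned_messages)
for note in conflicts(known_profile, questioned_profile):
    print(note)
```

A real analysis would draw on far more text and many more features, and a linguist would still have to judge whether the differences are meaningful. The sketch shows only the counting step.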

"People are able to manipulate the control of devices," Leonard said. "But they can't manipulate -- as well, anyway -- the mimicking the style of the person who they're claiming to be."

When investigators want to narrow down the suspect field in a case, forensic linguists will turn to profiling, looking for demographic clues in the texts in question. By looking at language use, forensic linguists can find clues suggesting an author's place of origin, gender, age, ethnicity, race, education level, social status and even religion.

In a famous example, the police in Illinois found themselves with several suspects in a kidnapping case and unsure how to proceed. They requested the help of linguist Roger Shuy, often considered the father of forensic linguistics in the United States after he provided linguistic analysis that helped solve a 1979 solicitation-to-murder case.

The police asked Shuy to analyze a ransom note. "Do you ever want to see your precious little girl again?" the note read. "Put $10,000 cash in a diaper bag. Put it in the green trash kan on the devil strip at corner 18th and Carlson." The note later demanded: "No kops!! Come alone!"

Shuy thought "kan" and "kops" were likely deliberate misspellings -- not characteristic of someone who was poorly educated or illiterate. The phrase "devil strip," he discovered, referred to the patch of grass between the sidewalk and the street. It was used only in Akron, Ohio.

When Shuy asked the cops if there was a well-educated suspect from Akron, Ohio, they said yes. They confronted the suspect, and he confessed.

Finding messages in new media

Online bullying, "trolling" and other forms of digital harassment are only likely to become more prevalent in the years to come, and forensic linguistics could play a larger role in tracking down anonymous attackers. Forensic linguists are working to make their techniques more accurate. For many, this means gathering more data to help better understand how people use language in different media.

Tim Grant, the director of the Centre for Forensic Linguistics at Aston University in the U.K. and vice president of the International Association of Forensic Linguists (IAFL), is compiling a database of text messages.

[Image: Two of Jack Grieve's maps showing the prevalence of different swear words. Credit: Jack Grieve]

His colleague at Aston, Jack Grieve, is building a collection of tweets to research regional language patterns in the U.S. He creates maps of words, such as swear words, to show where they are more or less common. Linguists can use that information to better predict where unknown authors -- of tweets and other media -- are from, like Shuy did with "devil strip."

"Sooner or later we'll get Twitter cases ... where social media will actually be the evidence," Grieve said. "In that case, we'll be in a very good position to handle it."

The way these types of analyses are done is also evolving. Patrick Juola, a professor of computer science at Duquesne University, works in computational linguistics, meaning he relies heavily on computer programs to do the analytic work.

"A computer doesn't substitute what a human can do, but it can do its own analysis," Juola said. "And it does its version of the analysis a lot better than a human can do."

This approach has its critics, who say computers are good for counting linguistic features but aren't good at understanding the nuances of the language data they work with. Grant argued computers can't understand how language use may vary based on genre, so they could present the wrong answer. A computer program, Leonard said, can only look for what it's been pre-loaded to look for. Furthermore, a computer can't explain its reasoning. "You can't cross-examine the computer software," Grant said.

Juola acknowledged such critiques but said that computer programs are currently about 80 percent accurate at both authorship attribution and profiling. An online program called Gender Guesser claims to be 60 to 70 percent accurate.

"People actually think about language in terms of very low-level features, which computers are great at picking up," Juola said.

Humans also have their limitations. Sometimes linguists working on a case are influenced by the story an attorney tells them. While they may believe their analysis is unbiased, that story might actually color the way they look at the evidence.

As the field evolves to reduce these obstacles, forensic linguists hope to increase awareness of the practice and what it has to offer in a world increasingly facing anonymous communication.

"I consider one important mission of our field as that of missionaries carrying knowledge about how language works to the rest of the community," Shuy wrote in an email to CBS News. "Arrogant perhaps? Maybe, but realistic I think."
