Postcard from the Shakespeare problem

There's a lot of work going into Artificial Intelligence software at the moment, and part of that involves something called NLP - Natural Language Processing. Writing software to understand English, particularly in the colloquial way that it's used today, is a tough problem.

A common approach is to take a piece of text and, as a first step, remove words that don't add to the meaning. The list of words like that is called a "stop list". It includes words such as "and", "the", "but", "is" and "as", plus a couple of hundred other similar words. The next stage stems the leftover words. Word stemming is a simple enough concept: each word is reduced to a basic term. For example, plurals are replaced by the singular version ("galaxies" becomes "galaxy", "cars" becomes "car") and all verb tenses become the root form - "passed" and "passing" would both become "pass".
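As a rough sketch of those two steps, here is a small Python example. The stop list is a hand-picked handful of words rather than a real one, and the stem function is a deliberately crude suffix stripper written just for this illustration; it is not a real stemming algorithm and will get plenty of words wrong. The names STOP_WORDS, stem and normalise are made up for the example.

```python
# A tiny, hand-picked stop list; real lists run to a couple of hundred words.
STOP_WORDS = {"a", "and", "as", "but", "is", "the", "this"}


def stem(word):
    """Crude suffix stripping; illustrative only, not a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"            # galaxies -> galaxy
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]   # passing, passed -> pass
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                  # cars -> car
    return word


def normalise(text):
    """Lower-case the text, drop stop words, then stem what is left."""
    words = text.lower().split()          # no punctuation handling here
    return [stem(w) for w in words if w not in STOP_WORDS]


print(normalise("The cars passed the galaxies"))  # ['car', 'pass', 'galaxy']
```

A production system would more likely use an established stemmer and a published stop list; the point here is only the shape of the pipeline.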

One of the reasons for doing this is to make searching in a set of documents more flexible. It helps avoid problems with mis-spellings, and with a search string that omits "the" or "this" or any other word in the stop list.

As an example, a document which contains the string "outside the courts" could be found with the search string "outside court".

But there's a problem. Arguably the best known phrase in the English language is the speech from Hamlet - "To be or not to be, that is the question". Unfortunately, almost all the words in that short sample will be in a stop list. So a search for "to be or not to be" has nothing left in it once the stop words are removed, and isn't going to return anything.
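The effect is easy to reproduce. The sketch below assumes a stop list containing just the words needed for this example, and a made-up matches function that treats a query with no surviving words as finding nothing; a real search engine may well behave differently. Stemming is left out because it makes no difference here.

```python
# Hypothetical stop list, limited to the words needed for this example.
STOP_WORDS = {"a", "and", "as", "be", "but", "is", "not", "or",
              "that", "the", "to"}


def content_words(text):
    """Lower-case, strip basic punctuation and drop stop words."""
    words = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def matches(query, document):
    """True if every surviving query word appears in the document."""
    wanted = content_words(query)
    if not wanted:
        return False  # nothing left to search for
    return set(wanted) <= set(content_words(document))


hamlet = "To be or not to be, that is the question."
print(content_words("to be or not to be"))    # []
print(content_words(hamlet))                  # ['question']
print(matches("to be or not to be", hamlet))  # False
```

The query collapses to an empty list, and the line itself is indexed as nothing more than "question".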

It's fixable of course, but it's a good example of why an obvious and often useful approach to a problem may not always help.

© 2015-2016 Woodbrook-Wilson