Sunday, March 1, 2009

Statistically Improbable Phrases within a CMS

See http://sip.s-anand.net/ for an explanation of what a Statistically Improbable Phrase (SIP) is. The site also gives you an immediate and fun way to play with the concept.

Here is my idea for a kind of Content Management System. The best use scenario is a company that makes standardized widgets.

The company's product names, model numbers, prices, and other specifics are recorded as the text corpus. During all data entry, even email composition, the live typing is compared against the corpus. A tag cloud, with the oddest words shown in high contrast, is always present in a box near the data entry field.
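
Here is a rough sketch of how the "oddest words" for that tag cloud might be scored. It is not a true SIP calculation (that would compare against a large background corpus); it just rates each typed word by how rarely it appears in the company corpus. The function names, the sample corpus, and the 1 / (1 + count) scoring rule are all my own assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def oddness_scores(typed_text, corpus_counts):
    """Rate each typed word from 0 to 1; 1.0 means never seen in the corpus.

    Assumed scoring rule (hypothetical): 1 / (1 + times seen in corpus).
    """
    return {
        word: 1.0 / (1 + corpus_counts.get(word, 0))
        for word in set(tokenize(typed_text))
    }

# Hypothetical tiny corpus of product specifics.
corpus = "Model 5100AGT ships with the 5100 bracket. Model 5100AGT price is 49 dollars."
corpus_counts = Counter(tokenize(corpus))

# Live typing, as it might appear in an email draft.
scores = oddness_scores("Please quote the 51000AGT bracket", corpus_counts)

# Oddest words first; a tag cloud would render these in the highest contrast.
ranked = sorted(scores, key=lambda w: (-scores[w], w))
```

Words the corpus has never seen score 1.0 and float to the top of the list, while familiar words like "bracket" sink.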

Corpora could have various granularities. At the extremes, a corpus of all (or almost all) of the company's online dialogue would be a large one, and a corpus of just model names, numbers, and prices would be a tiny one.

I am not talking about natural language processing, although that could be a more complex implementation of this role. I'm thinking more of a spellchecker-style data entry police that plugs all corporate dialogue into itself to show anomalous input. The easiest example: the company has a model number 5100AGT, I am writing an email and type 51000AGT, and it shows up as anomalous.
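
The model number example can be sketched with Python's standard difflib module, which finds near matches between strings. The catalog list, the check_token name, and the 0.8 similarity cutoff are assumptions of mine, not part of any real product.

```python
import difflib

# Hypothetical catalog of valid model numbers.
KNOWN_MODELS = ["5100AGT", "5200AGT", "7300BX"]

def check_token(token, known=KNOWN_MODELS, cutoff=0.8):
    """Classify a typed token against the catalog.

    Returns ("ok", None) for an exact match, ("anomalous", closest_model)
    for a near miss, and ("unknown", None) when nothing is similar.
    """
    if token in known:
        return ("ok", None)
    # get_close_matches returns the best candidates at or above the cutoff,
    # most similar first; n=1 keeps only the single closest one.
    close = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
    if close:
        return ("anomalous", close[0])
    return ("unknown", None)

typo_result = check_token("51000AGT")
exact_result = check_token("5100AGT")
other_result = check_token("hello")
```

The near miss is the interesting case: a token that is almost a model number is far more likely to be a typo than a brand new word, so it is the one worth flagging loudly.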

The interesting place all this goes is A.I.-ish. The machine sees when the company is staying within behavioral bounds, and when the company is deviating from a norm. Example: the company normally does business in the Midwest region around Chicago. Flights and hotels are regularly booked in Lansing, Minneapolis, and Milwaukee. The first trip by sales staff to Los Angeles stands out as a deviation from the norm, and in human terms can be read as a gain in market territory. One caveat: the A.I. should only compare sales trips with other sales trips. The legal team suddenly traveling outside the normal sphere is not a sign of expanding markets, but a sign of possibly larger legal problems.
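
A bare-bones sketch of the travel example, keeping a separate set of known destinations for each department so that sales trips are only judged against sales trips. The TravelNorms class and the sample booking history are hypothetical, and "deviation" here simply means "a city this department has never booked before".

```python
from collections import defaultdict

class TravelNorms:
    """Remember which cities each department has traveled to."""

    def __init__(self):
        self._seen = defaultdict(set)

    def learn(self, department, city):
        """Record a past trip as part of the department's norm."""
        self._seen[department].add(city)

    def is_deviation(self, department, city):
        """True if this department has never booked this city before."""
        return city not in self._seen.get(department, set())

# Hypothetical booking history.
norms = TravelNorms()
for dept, city in [
    ("sales", "Lansing"),
    ("sales", "Minneapolis"),
    ("sales", "Milwaukee"),
    ("legal", "Chicago"),
]:
    norms.learn(dept, city)

sales_la = norms.is_deviation("sales", "Los Angeles")
sales_milwaukee = norms.is_deviation("sales", "Milwaukee")
legal_milwaukee = norms.is_deviation("legal", "Milwaukee")
```

Note that Milwaukee is normal for sales but still a deviation for legal, because each department is measured only against its own history. What the deviation means (new market or new lawsuit) is still left to a human.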

This is not to deprecate the humble and effective selection menu. Very narrow data entry tasks are fully served by these menus. This SIP idea is more for the open end of company dialogue, where language is a little more natural but still redundant.

I may be showing my ignorance more than my brilliance in this post. Google has long had the Google Search Appliance (GSA). A really smart company would have already cobbled together a CMS hack on top of the GSA service.

JavaScript Regex Reference

http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/