How to [Define, Describe, Classify, Group, Find] Everything?
Up to Now
I’ve been seriously designing a software product for coming up on three years now, and spent a further two before that just convincing myself it was possible. The product operates on collected data and creates interesting relationships between them. Really, that is all there is to it. But like any simple objective, the road has become longer and more circuitous than we could ever have imagined.
The idea isn’t new, but the tools available today make it practical. Those tools are computers. Computers, though, need things represented as numbers. They crunch those numbers, and out comes a number. Sound familiar, Douglas Adams fans?
The problem for me was routine up until the point where I had to interpret text and establish it as events. The problem of representing textual language in computers is actually much older than AI (Artificial Intelligence).
To date, there have been many attempts at breaking down the spoken word, most of which are not useful for other languages. These approaches include:
- Topic Maps
- Word Order
- Phrase Chunking
- Computational Linguistics
There are certainly more, but these are at the heart of my current problem.
So which tool first? Well, there are two things I need to remember above all others:
- What I have to work with, and
- What output I require
For the former, I have vast amounts of text taken from all kinds of sources, all of it with any type of markup removed. The latter is a “scoring”, if you will, of that event (document/object) across several categories. Each category holds a signed value, and the score from the text is mathematically applied against that category’s score.
So, text goes in and numbers come out. What could be easier?
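To make the scoring idea concrete, here is a minimal sketch in Python. It assumes that "mathematically applied" means simple addition, which is only one possibility. The category names and the function `apply_scores` are hypothetical, made up for illustration.

```python
from typing import Dict


def apply_scores(category_totals: Dict[str, float],
                 text_scores: Dict[str, float]) -> Dict[str, float]:
    """Return new category totals with one document's scores applied.

    Assumption: applying a score means adding it. Categories the
    document does not mention keep their current value.
    """
    updated = dict(category_totals)  # leave the caller's dict untouched
    for category, delta in text_scores.items():
        updated[category] = updated.get(category, 0.0) + delta
    return updated


# Hypothetical running totals and one document's signed scores.
totals = {"optimism": 1.5, "risk": -0.5}
document_scores = {"optimism": -1.0, "urgency": 0.25}

new_totals = apply_scores(totals, document_scores)
print(new_totals)
```

Because the values are signed, a pessimistic document pulls a category down just as easily as an optimistic one pushes it up.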
Linguists right after World War II thought exactly the same way when they tried to create a computer program to translate communiqués of non-English origin. Doing it manually, even back then, was not a scalable, consistent method. And something like metadata or faceted classification is just too broad to be useful. The meaning of each sentence must be known, and English has many, many tricks to trip me up.
- “The journey is really expected to succeed.”
- “The journey is not really expected to succeed.”
Easy enough when said aloud, but dealing with the ambiguity of these two statements is a complex problem. So the objective would surely be to remove as much ambiguity as possible wherever a phrase, or even a word, could have two or more meanings in a sentence.
So what is required is a series of tools that act on the text one at a time. Each would receive the output of the previous step, and the last one would present the desired output.
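A chain of tools like this can be sketched as a list of functions, each fed the result of the one before. This is only an illustration under my own assumptions: the step names (`tokenize`, `drop_fillers`), the filler-word list, and `run_pipeline` are all hypothetical, and real steps would be far more involved.

```python
from functools import reduce
from typing import Any, Callable, List

Step = Callable[[Any], Any]

PUNCTUATION = ".,!?;:\"'"
FILLERS = {"the", "is", "to"}  # toy list, chosen only for this example


def tokenize(text: str) -> List[str]:
    """Split on whitespace, lowercase, and trim surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def drop_fillers(tokens: List[str]) -> List[str]:
    """Remove words that carry little meaning on their own."""
    return [t for t in tokens if t not in FILLERS]


def run_pipeline(text: str, steps: List[Step]) -> Any:
    """Feed the text through each step in order, passing results along."""
    return reduce(lambda data, step: step(data), steps, text)


result = run_pipeline("The journey is really expected to succeed.",
                      [tokenize, drop_fillers])
print(result)
```

The appeal of this shape is that a step can be swapped, reordered, or added without touching the others, as long as each accepts what the previous one produces.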
So ontology (the relationship of things) is really required toward the end of the process, while word order and tagging are handled up front. As an example of where word order applies in the two sentences above, the word “succeed” is the important bit, but to determine whether the sentence expresses a belief that success is possible or not, one has to go back several words, and even then all the words in between are relevant. Oh, and what was it we were actually talking about? Of course, “The journey”.
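The "go back several words" idea can be sketched as a look-back window in front of the key word. This is a crude heuristic and not a real parser. The negator list, the window size of five, and the function name `is_negated` are assumptions of mine.

```python
from typing import List

NEGATORS = {"not", "never", "no"}  # assumed, far from complete


def is_negated(tokens: List[str], target: str, window: int = 5) -> bool:
    """Report whether a negator appears within `window` words before `target`.

    Only the first occurrence of `target` is checked. A missing
    target counts as not negated.
    """
    if target not in tokens:
        return False
    idx = tokens.index(target)
    preceding = tokens[max(0, idx - window):idx]
    return any(t in NEGATORS for t in preceding)


def simple_tokens(sentence: str) -> List[str]:
    """Lowercase, drop a trailing period, and split on whitespace."""
    return sentence.lower().rstrip(".").split()


positive = simple_tokens("The journey is really expected to succeed.")
negative = simple_tokens("The journey is not really expected to succeed.")

print(is_negated(positive, "succeed"))
print(is_negated(negative, "succeed"))
```

A fixed window will misfire on longer or differently ordered sentences, which is exactly why the words in between matter and a proper tagging step is needed.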
English is full of these, and I could have written both sentiments many different ways, using all kinds of confusing punctuation and/or obscure words, and even confused a human listener. Thankfully, I’m dealing almost exclusively with text I can get off the Internet (RSS, W/S, Web scraping).
So which do we create first? The chicken (taxonomy, ontology, etc.) or the egg (tags, context, topic maps, etc.)? My gut tells me both, at the same time.