What kinds of applications do we need a semantic web for? Is the semantic web practical? These questions (among others) were posed by Jamie Taylor of Metaweb Technologies to a group of panelists at the Text Analytics Summit last week. The panelists were no lightweights. They included Vladimir Zelevinsky from Endeca, Ron Kaplan from Microsoft, and Kathleen Dahlgren from Cognition. I found this to be one of the most engaging segments of the Summit.
First of all, many people define the semantic web as a “web of meaning” or a “web of data” that will allow computer applications to exploit the data directly. Check out the W3C webpage for more information about definitions. The panelists at the Summit got into an interesting discussion about parsing data sources for the semantic web. Here are a few of the highlights. Please note that I asked some additional questions after the panel itself, so if you’re reading information you didn’t hear on the panel, that is the reason.
- What kinds of applications is the Semantic Web good for? It depends on what you want to know. For example, one of the panelists pointed out that you don’t need the semantic web to find a hardware store in Boston. However, more unusual queries might require it. Most people have had the experience of knowing what they are looking for, typing a five- or six-word query, and still not finding it. The panelists pointed out that entities (people, places, things) are relatively easy to extract; it is the relationships between the entities that are harder. Vladimir Zelevinsky explained it like this, matching each information retrieval need to an information retrieval technology:
- Known Item Search -> Keyword Search (e.g., Google – where you need to find what you know exists);
- Unknown Item Search -> Guided Navigation (e.g., Faceted search – where you need to explore the data space);
- Unknown Relationship Search -> Semantic Web (where you are looking not for separate items in the repository, in this case the web, but for the connection(s) between them).
The semantic web could pay off in applications that require understanding the relationships between these entities. Ron Kaplan also noted that semantic web technology provides a standard way of merging data from different sources, and that will probably enable some useful new applications.
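To make "unknown relationship search" and merging concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how any panelist's product works and not real RDF or SPARQL: the two source lists, the predicate names (`affiliated_with`, `spoke_on`), and the helper functions `merge` and `find_connection` are all assumptions made up for this example. The facts themselves come from the panel lineup described above.

```python
from collections import deque

# Two hypothetical data sources, each a list of (subject, predicate, object) triples.
source_a = [
    ("Vladimir Zelevinsky", "affiliated_with", "Endeca"),
    ("Ron Kaplan", "affiliated_with", "Microsoft"),
    ("Kathleen Dahlgren", "affiliated_with", "Cognition"),
]
source_b = [
    ("Vladimir Zelevinsky", "spoke_on", "Semantic Web panel"),
    ("Ron Kaplan", "spoke_on", "Semantic Web panel"),
    ("Kathleen Dahlgren", "spoke_on", "Semantic Web panel"),
]


def merge(*sources):
    """Merge triple lists into one set; identical triples collapse to one."""
    return set().union(*sources)


def find_connection(triples, start, goal):
    """Breadth-first search for a shortest chain of triples linking two entities.

    Edges are followed in both directions. Each step in the returned path is
    (from_node, predicate, to_node) in traversal order, which may be the
    reverse of how the triple was originally stated.
    """
    neighbors = {}
    for subj, pred, obj in triples:
        neighbors.setdefault(subj, []).append((pred, obj))
        neighbors.setdefault(obj, []).append((pred, subj))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for pred, nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, pred, nxt)]))
    return None  # no connection found


graph = merge(source_a, source_b)
path = find_connection(graph, "Endeca", "Microsoft")
for step in path:
    print(step)
```

Neither source alone links Endeca to Microsoft; the connection only appears once the two are merged, which is the point Kaplan was making about combining data from different sources.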
- Scaling the semantic web. Everyone seemed to agree that manually tagging documents is a brittle exercise. Vladimir Zelevinsky from Endeca suggested putting a parser on each machine. He said that because people type more slowly than one sentence per second, semantics could be injected into the document at the moment of creation. Of course, it is a bit more complex than this, but it was an interesting notion. Kathleen Dahlgren from Cognition said that NLP at scale was the wave of the future. NLP is complex, but it can be deeply distributed. Computers are getting faster and cheaper, and this can make NLP fast and scalable.
- Is it practical? There is a huge amount of data out there and it keeps changing. There is also a lot of duplicate information on the web. Is it economically viable to think about parsing the web? Ron Kaplan said he had done a back-of-the-envelope calculation using the following assumptions:
“The simple order-of-magnitude calculation goes as follows: There are roughly 2.5M seconds in a month, so an 8-core machine gives you 20M cpu seconds. If it takes 1 second on the average to process a sentence (an upper bound), then you can do 20M sentences per month. If a web page has on the average 20 sentences, you get 1M pages per month per machine. So, 1000 machines can do a billion pages per month. More if 1 second over estimates, less if 20 sentence/document underestimates.”
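The arithmetic in the quote is easy to check with a few lines of Python. This is only a sketch of the same order-of-magnitude estimate: the constant and function names are mine, the numbers are the rounded ones from the quote, and it assumes (as the quote implicitly does) that all cores stay fully busy with no overhead for crawling, I/O, or duplicates.

```python
# Rounded figures taken from the quoted estimate.
SECONDS_PER_MONTH = 2_500_000   # roughly 30 days
CORES_PER_MACHINE = 8
SECONDS_PER_SENTENCE = 1.0      # described in the quote as an upper bound
SENTENCES_PER_PAGE = 20


def pages_per_month(machines,
                    seconds_per_sentence=SECONDS_PER_SENTENCE,
                    sentences_per_page=SENTENCES_PER_PAGE):
    """Estimate how many web pages a cluster can parse in one month."""
    cpu_seconds = SECONDS_PER_MONTH * CORES_PER_MACHINE * machines
    sentences = cpu_seconds / seconds_per_sentence
    return sentences / sentences_per_page


print(f"{pages_per_month(1):,.0f} pages/month on one machine")
print(f"{pages_per_month(1000):,.0f} pages/month on 1000 machines")
```

Changing the defaults shows how sensitive the estimate is: halving the time per sentence doubles the throughput, while doubling the sentences per page halves it.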
So this is economically feasible, if there is a need. And that remains the question: is it economically viable and necessary to try to find the information in the long tail?