XML | PaulW: TheContentGuy


paulwlodarczyk 8:40 pm on August 19, 2008
Tags: Amazon, auto-tagging, catagorization, classification, ECM, EMC, IBM, metadata, named entity analysis, natural language processing, NLP, Semantics, statistically improbable phrases, XML

Automating CMS metadata – could that work? How?
In a previous post I asked the question, “What if a web service could automatically provide the CMS metadata when you go to check-in a new topic?” In this post I’ll discuss why you would want to do that, some of the candidate technologies, and what is necessary to make it real.

Check-in, metadata, and taxonomies. Anyone who’s worked with a content or document management system knows this scenario: You’re going to check in newly authored content, and a dialog box comes up asking you to enter some keywords to describe the content. This is metadata – data about your data. It’s important because if you fill it in properly, other people (and you, too) can find your content. If you leave it blank, then other users will need to rely on a full-text search or some machine indexing to find your content.

Many organizations have a formal system for classifying content called a taxonomy. Think of it like the naming of the sections in the Yellow Pages directory – it provides consistent category names. This avoids what I call “the Yellow Pages Problem,” where some people call those guys who represent you in court “lawyers” while other people call them “attorneys at law” (or worse things). When an organization uses a taxonomy, everyone uses consistent category names – that is, if they actually use it.

Compliance blues. Taxonomies can be configured into the CMS, so that category names can be selected in the check-in dialog box. While that saves the author the guesswork of remembering category names and avoids mistyping, it still requires the author to take action – and to get it right. This is a point of failure of many ECM initiatives: authors fail to classify content at check-in time, accept the default settings, or apply the wrong category (or too few categories when the content really crosses genres).

The problem is worse when the author isn’t a full-time writer, but instead a business contributor who creates content as part of their role in some business process. In these cases the author lacks the time, talent, or motivation to tag the content with the appropriate metadata. They may not see it as part of their job.

Cure for the blues? So can this process be automated? Absolutely. Technologies have existed for some years now to analyze unstructured content. Algorithms involve some combination of statistical, linguistic, and structural analysis.

Statistical methods look at the document as a “bag of words” – words or phrases that occur more frequently, or that are statistically “improbable,” are treated as important. Amazon uses SIPs – Statistically Improbable Phrases – to pull keywords out of books. This is purely statistical – the system doesn’t know what the words mean, just that they are “odd” and so probably meaningful.
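To make the statistical idea concrete, here is a rough sketch of scoring words by how much more often they show up in a document than in some background text. This is not Amazon's SIP algorithm (which works on phrases and isn't described here); the function names, the add-one smoothing, and the log-ratio score are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def improbable_terms(document, background, top_n=3):
    """Rank words that are unusually frequent in `document`.

    Illustrative stand-in only: compares add-one-smoothed relative
    frequencies in the document against a background text.
    """
    doc_counts = Counter(tokenize(document))
    bg_counts = Counter(tokenize(background))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for word, count in doc_counts.items():
        p_doc = (count + 1) / (doc_total + vocab)
        p_bg = (bg_counts[word] + 1) / (bg_total + vocab)
        # A large positive score means "odd" relative to the background.
        scores[word] = math.log(p_doc / p_bg)

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

Note that nothing here knows what the words mean. Common words like "the" score low simply because they are just as common in the background text.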

Linguistic methods actually analyze the natural language in the document. If you know what the subject, verb, and object are in a sentence, then you know what it is about. Linguistic methods have gotten better with improvements in algorithms and increases in computing power.

Structural methods leverage underlying markup in documents, like XML structural tags or even styling or text flow (e.g. recognizing terms in headers).
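As a small illustration of the structural approach, the sketch below pulls candidate terms out of heading elements in an XML document. The `title` tag name and the sample structure are assumptions on my part; a real schema would have its own element names.

```python
import xml.etree.ElementTree as ET


def heading_terms(xml_text, heading_tags=("title",)):
    """Return the text of every heading element, in document order.

    `heading_tags` is an assumed list of element names that mark headings.
    """
    root = ET.fromstring(xml_text)
    terms = []
    for element in root.iter():
        if element.tag in heading_tags:
            # itertext() also collects text nested inside inline markup.
            text = "".join(element.itertext()).strip()
            if text:
                terms.append(text)
    return terms
```

Text that appears in a heading is a strong hint about what the section covers, so these terms could be weighted more heavily than body text when proposing metadata.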

These methods not only provide automated metadata tagging (document categorization), they can also determine what type of document is being analyzed (document classification). They can also be used to identify Named Entities – named people, places, things, and events. It’s one thing to say this document is a Legal Brief (document type or class). It’s another to say that Legal Brief is about Patent Infringement (a category). It’s another thing still to say that it’s a case between Palm and Xerox (named companies) about handwriting recognition (a named technology). Named entities can be extracted and listed in metadata. They can also be tagged in-line in an XML document (this is often called “auto-tagging” – a post for another day).

Named entities are not addressed by taxonomies, but rather by lists or directories of named entities. A number of these named entity directories are available as web services. Several are kept evergreen by using Wikipedia to drive the ever-growing list of named entities.
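The simplest way to picture a directory-driven approach is a plain lookup: scan the text for any name that appears in the directory. The sketch below uses a tiny hard-coded dictionary as a stand-in for a real named entity directory or web service; the entries and the function name are hypothetical.

```python
import re

# Hypothetical directory; a real one would come from a named entity service.
ENTITY_DIRECTORY = {
    "Palm": "company",
    "Xerox": "company",
    "handwriting recognition": "technology",
}


def find_named_entities(text, directory=ENTITY_DIRECTORY):
    """Return (name, type) pairs for directory entries found in the text."""
    found = []
    for name, kind in directory.items():
        # Word boundaries keep "Palm" from matching inside "Palmer".
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((name, kind))
    return sorted(found)
```

A bare lookup like this cannot tell Palm the company from the palm of your hand. Real services combine the directory with linguistic or statistical context to resolve that kind of ambiguity.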

Making it real. So given this technology, how do you implement such a system? My preferred method is to customize the authoring environment so that the “Save” dialog box in the editor of choice presents the ECM system’s check-in dialog. This way the author does not take extra steps to check content in.

Also at check-in time, in the background, the customized editor performs a temporary save to the local file system, and automatically sends a copy of the document to a categorizer web service. This is a content categorizer application running on a server. That categorizer service would apply the organization’s standard taxonomy to the document, using some classification algorithm to define one or more categories for the document. The results can be applied in either of two ways:

1.    Classify the document automatically with no user intervention. This can be done completely in the background with no user interface, even as part of an automated check-in workflow.

2.    Classify the document automatically and have the user verify the results. This requires exposing the proposed metadata tags in the check-in dialog.

Categorizers often provide some scoring of the certainty of a given tag; this score can be used to make the call about whether the automatic tag is applied, or whether it needs (or allows) end user verification or editing. Business requirements determine what the best approach or best combination is.
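Here is a minimal sketch of how certainty scores could drive that decision. The keyword-overlap scoring is a deliberately naive stand-in for a real categorizer, and the taxonomy, thresholds, and function names are all assumptions for illustration.

```python
import re

# Hypothetical taxonomy: category name -> indicative keywords.
TAXONOMY = {
    "Patent Infringement": {"patent", "infringement", "claim"},
    "Employment Law": {"employee", "termination", "wages"},
}


def score_categories(text, taxonomy=TAXONOMY):
    """Score each category by the fraction of its keywords found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        category: len(keywords & words) / len(keywords)
        for category, keywords in taxonomy.items()
    }


def route_tags(scored_tags, auto_threshold=0.9, review_threshold=0.3):
    """Split tags into those applied automatically and those shown for review.

    Tags scoring below `review_threshold` are dropped entirely.
    """
    auto_applied, needs_review = [], []
    for tag, score in scored_tags.items():
        if score >= auto_threshold:
            auto_applied.append(tag)
        elif score >= review_threshold:
            needs_review.append(tag)
    return sorted(auto_applied), sorted(needs_review)
```

Setting `auto_threshold` above 1.0 would send every surviving tag to the author for verification (option 2), while setting `review_threshold` equal to `auto_threshold` would give the fully automatic behavior (option 1). The right values are a business decision.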

What are the barriers? The reason this technique isn’t used more often is the integration required between the authoring tools, the ECM solution, and the categorization technology. In today’s market these technologies are typically provided by independent software vendors, who have few incentives for bundling tightly integrated solutions (and wish to remain “vendor neutral” with their own technology). As the ECM marketplace continues to consolidate vertically we may see some content lifecycle vendors with more complete solutions (watch IBM and EMC). Services firms specializing in unstructured content and ECM can be one source for prepackaged solutions that combine ECM, authoring tools, and content classification into a seamless user experience – which is the key to success in deploying an automated solution.

At the end of the day, consideration of the needs and behavior of content authors and contributors (who are very often change-averse) is the most important step in adoption of a content lifecycle solution. Making content classification and categorization a “no brainer” through automation and a seamless user experience improves the likelihood of success.

- paulwlodarczyk 9:37 pm on September 14, 2008
  
  This post and comments are now being maintained at http://thecontentguy.net
paulwlodarczyk 3:10 pm on August 15, 2008
Tags: Calais, DITA, Linked Data, Search Monkey, Semantic Web, Thomson-Reuters, unstructured content, XML
Connecting the dots: How XML authoring enables the Semantic Web
I recently attended the Linked Data Planet conference where a number of pioneers in the field of Semantic Web shared their perspectives on the state of the art – and business – of helping the world tag their web pages for meaning. For those of you in the dark about semantic mark-up, it lets authors annotate their web pages with metadata (HTML attributes that don’t get displayed in the document) that describe what those pages are about.

So for example, when I say “New York” in an HTML document it’s ambiguous – do I mean the city, the state, the Yankees, the Mets, the Giants, the Jets, the song, the steak, the state of mind – you get the idea. Words are ambiguous – except in the context of the language in which they occur. So if I am writing about a sporting event you know from the context of the article that I mean the team, but the typical search engine does not. To a search engine, New York is just a string that occurs in the document with some frequency.

There are two ways to make sense out of words in a document. One is semantic analysis (I’ll leave that topic to another day). The other is semantic tagging – adding metadata to a document.

With metadata, I can define things precisely. I can state that this document is about the sports team, not the steak. I can do this by tagging the named entities in the document – the people, places, things, events, and facts – in an unambiguous way. I can also set those entities into relationships with each other. For example, a piece of text may refer to two companies involved in a merger. So I can tag the document being about Company A (thing number one) and Company B (thing number two) involved in a merger (an event, but also a relationship between the two named entities).
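To show what tagging a named entity "in an unambiguous way" might look like in a web page, here is a rough sketch that wraps each known entity in a span carrying an identifier. The `data-entity` attribute and the example.org URI are placeholders I made up for illustration; real semantic markup would follow a published vocabulary rather than this ad hoc attribute.

```python
import html
import re


def tag_entities(text, entities):
    """Wrap each known entity name in a span that carries its identifier.

    `entities` maps the name as it appears in the text to an identifier URI.
    The `data-entity` attribute is a hypothetical placeholder.
    """
    if not entities:
        return html.escape(text)

    # Work on escaped text so the output is safe to drop into an HTML page.
    escaped_map = {html.escape(name): uri for name, uri in entities.items()}
    # Longest names first, so a longer name wins over a shorter one it contains.
    names = sorted(escaped_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match):
        name = match.group(0)
        uri = html.escape(escaped_map[name])
        return f'<span data-entity="{uri}">{name}</span>'

    return pattern.sub(wrap, html.escape(text))
```

The reader still sees the plain words, but a machine reading the page gets an identifier that says which "New York" is meant.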

So semantic tagging adds meaning to documents that goes beyond the text, and it does it in an unambiguous way, which is handy. But it has traditionally faced two large hurdles: (1) it’s been relatively expensive to add semantic markup (either with investments in labor or technology) and (2) there has been little mass market for consuming this markup. Both of those hurdles are rapidly falling away.

Let’s address the second point first. Yahoo has introduced Search Monkey – a new technology that rates web pages not on the keywords and number of links to the page (the “wisdom of crowds”) but on the semantic markup that is embedded in the page (the wisdom of the author). This creates a substantial motive for adding the markup: Search Engine Optimization. Semantic markup makes your content more likely to be found and more relevant to the searcher.

Great, so how do you add semantic markup? For legacy content, you need to use some combination of people and automation to add markup to what you already wrote. Using people to tag content requires specialized skills that are in short supply. Natural language processing technologies for auto-tagging content have been around since the late 90s in lab settings; auto-tagging products are emerging in new and interesting forms in the marketplace today. Thomson Reuters’ Calais open source project is a great example. In the Calais demo, try pasting some non-proprietary text that describes what your company does. For example, I tried the “About Our Company” page we used in proposals at JustSystems, and it accurately tagged all of the named companies, legal entities, products, technologies, countries, and cities, and correctly identified JustSystems’ acquisition of XMetaL from Blast Radius as a business event.

Adding semantic markup to new web content as it is created – making it available as data – is the way to go. But what about other types of unstructured content, like documents, that might be published to the web and other channels? We’ve been doing this with XML and SGML documents all along, using semantic tags to unambiguously flag specific pieces of text for future discovery. This has ranged from tagging part numbers in a service manual (which could automate adding hyperlinks or improve search relevance), to tagging financial reports with XBRL to find specific facts within the MD&A or footnotes of an annual report (which could prevent another Enron). But the important concept here is this: when content is tagged, it can be treated as data.

More recent XML standards like DITA help authors focus on creating granular content – primarily for content reuse. But our customers are finding that DITA and other topic-oriented XML approaches are helping them break out of the document model – where loads of facts are locked up within documents. Think of a lengthy Policies and Procedures manual. The historical reason it’s all bound in one book is the convenience of publishing. Today – with electronic publishing on the web, intranets, and portals – you really only want to publish a single policy or procedure as it is added or revised. The book itself is obsolete when you can publish one procedure at a time.

In a DITA world, because of its granular nature, a single document (like a Policy manual that was one very large document in your document management system) may instead be managed as a collection of hundreds of DITA topics in your CMS or XML object store. The document would no longer exist as such; it becomes a collection of topics, more like records in a database. To effectively manage large collections of DITA topics, you need to specify metadata for each topic – just so that you can find any given topic again. So a typical DITA project would define the CMS metadata scheme and the taxonomy for classifying the DITA topics. For those of us in the XML document world, this is old hat.

So all this makes me ask:
- What if we start combining semantic web technologies and semantic document technologies?
- What if we combine technologies that auto-tag named entities with granular authoring approaches like DITA?
- What if you could automatically tag named entities within the DITA topic you are creating, tagging as you type?
- What if a web service could automatically provide the CMS metadata when you go to check-in a new topic?
- What if the publishing tools that transform your DITA to HTML could automatically add the semantic markup to your HTML pages that are published from your DITA content?
- How would that change how you publish business documents like policies and procedures to your employees?
- How would it change how you create marketing content for your web site?
- How would it change the way you create and manage your product technical content?
Could the secret to the semantic web be right under our noses?
- Mike Axelrod 4:34 pm on August 18, 2008
  
  Great idea for a new blog Paul! I can see how this post can lead into a whole set of articles around the questions you raise. Looking forward to the next post.
- paulwlodarczyk 9:38 pm on September 14, 2008
  
  This post and comments are now being maintained at http://thecontentguy.net
  – please visit us there!
