Real-world situations sometimes yield data that are too irregular to model effectively via the UML Class Diagrams stipulated by the IEPD process. WIJIS recently encountered such a situation, and found an effective and generally applicable solution well supported by NIEM: external unconstrained RDF attachments defined at the Instance, rather than the Class, level.


Bill Blondeau
WIJIS Data Architect


When UML Class Diagrams Don't Work Quite Right

So let's say, and why not, that you're doing the business analysis for an IEP development. You have begun to construct the UML Class Diagram that represents your model of the data. And you begin to realize, as you work it through, that the analysis is feeling uncomfortably open-ended.

Maybe the data in the problem domain really are highly idiosyncratic; or maybe it's just very difficult, for practical reasons, to assemble a high-confidence sample set of those data. Either way, the practical result is the same: you don't expect your model to hold up in the real world. You expect that new, unprecedented data will come along, in the course of real operations, and break it.

Up to a point, you can (and should) keep testing your class model against whatever data you can get your hands on, hoping to work the flaws out through persistent exposure to different sample data - the "exhaustion of variation" technique. But that works only if you have access to sufficiently representative sample data, and sometimes you don't.

Or, you can try to push your model in ever more abstract directions, trusting that a sufficiently abstract model definition will accommodate any variation. This does work: a sufficiently abstract model can accurately characterize any dataset. The trouble with that approach is that a "sufficiently abstract model" is not necessarily what you want or need. In design, we all want the least abstract model we can devise: one that describes the specific behaviors and characteristics of the problem domain. We want a model that is salient, accessible, and easily grasped. Abstraction diminishes these characteristics, denatures them, thins them out. The more abstract your model becomes, the less helpful it is as an actual design artifact.

So, sometimes there doesn't seem to be a good practical answer. When you can't build a satisfactory model using UML Class Diagrams, then it's probably best to admit that your problem space data are simply not very regular, and that attempting to find regularities in those data is pretty much pushing a rope.

WIJIS recently ran full-on into this problem, in constructing an IEPD for Drug Investigation data exchange.

Drug Investigation Info IEPD: NGA NIEM Policy Academy, 2008

The National Governors Association (NGA) Center for Best Practices, as part of its 2007-2008 NIEM Policy Academy, awarded WIJIS a grant to develop a NIEM compliant information exchange for representing Drug Case information. The foreseen business use for this IEPD was twofold:

With those funds, and funds awarded by the Bureau of Justice Assistance, WIJIS did create the IEPD. (Implementation of a planned proof-of-concept project that would allow Drug Case investigators to enter data into their own systems, and have it replicated in IEPD form to the state-provided DCI Case Management System, was abandoned when the primary partners - faced with sharply reduced funding - opted instead to discard their local systems and adopt the state alternative wholesale.)

The IEPD remains as an available work product, documented at the IEPD Clearinghouse (http://it.ojp.gov/iepd/) and posted at http://www.wijiscommons.org/specs/drug-case/. However, during the course of the IEPD business analysis, WIJIS did encounter the wild and wooly realm of narcotics intelligence data; and that took us in unanticipated directions. Our response to this particular problem led to the broadly applicable technique that is the subject of this article.

Narcotics Investigations: Case Data vs. Intelligence Data

Narcotics investigations have two sides.

There's the Case side. Case data is pretty much like other Law Enforcement case data. This is very familiar territory to anyone who works with IEPDs, because it's pretty much what IEPDs were designed to capture in the first place. It's regular, well-understood, and subject to ordinary Law Enforcement handling protocols.

And then there's the Intelligence data. This side, by contrast, was not much like anything we'd seen before. And the more we looked at it, the more we came to realize that we had a situation on our hands.

Narcotics Intelligence investigators tend to be independent-minded, immersed in their work, naturally suspicious of possible breaches of confidence and security, and improvisational. This leads to a proliferation of nonuniform, closed, often impromptu intelligence recordkeeping systems and techniques, with wide variations in data organization and structure. Accordingly, the data are hard to characterize, and it is hard even to obtain a good sample set from which to generalize.

Furthermore, it was quickly apparent that it was an absolute operational necessity to keep Case and Intel data - even Case and Intel data about the same investigation - cleanly separated. The handling requirements (operational and statutory) are so different that any message design that fails to keep them in rigorously separate buckets is going to be dead on arrival.

Just a few examples of some of those very crucial differences:

Case Data | Intel Data
Subject to Law Enforcement protocols for retention/archiving | 28 CFR governs retention
May be needed by prosecutors: rules of evidence must be observed; discovery process may compel disclosure to defense counsel | Inadmissible in court; never disclosed to Officers of the court or to principals in Court proceedings
May be subject to sunshine laws | Exempt from sunshine laws
Subject to audit; audit trail required | Can be investigated but not really audited (how do you "audit" speculations?)
Incorrect disclosure can compromise a prosecution | Incorrect disclosure can get someone killed

So, our challenge was unusual: design a message format for representing two different kinds of data, one straight and the other pretty weird; keep them from contaminating each other; but always maintain this separation without losing any information about how the two sides relate to each other.

...well, if it was easy, everybody would be doing it. The bleeding edge is a pretty interesting place to live, but you do bleed.

The Solution: Attaching External RDF to an IEPD Instance

The immediate observation we made was that the Case side was very tractable, and resolved beautifully into a classic IEPD. (As noted above, this IEPD has been accepted to the IEPD Clearinghouse at http://it.ojp.gov/iepd/; the IEPD itself can be found at http://www.wijiscommons.org/specs/drug-case/.) Having designed the IEPD, we did consider declaring victory and going home. Defining Intel data as out of scope (and therefore not our problem) would have been an entirely defensible action under the terms of the NGA grant. But, for various reasons, we decided to see if we could concoct any practical techniques to handle the Intel data as well.

The first realization we came to was that, on the Intel side, we were pushing a rope indeed. Any attempt to design UML Classes for Intel information, given the irregular nature of the data and the wide variability in its management from department to department (and from officer to officer), and our restricted access to sample data, tended to slump into a welter of special cases, exceptions, and inconsistencies.

So we quit pushing the rope. We gave up on attempting to devise a Class for Intel data, and turned to the problem of representing data at the Instance level instead.

The immediate hopeful answer to this problem was the W3C's RDF ("Resource Description Framework"), a Semantic Web technology.

Note: don't confuse this technique with representing NIEM in RDF. This is a very important point. NIEM is RDF compatible (about which more below), meaning that you could represent the NIEM datamodel in RDF instead of XML. To our knowledge, no such RDF representation has been officially defined and published, although the NIEM designers have reportedly done at least some initial exploratory work to that end.

This is a completely different thing. This RDF is very specifically not NIEM; it is in fact unconstrained, so it's not really anything. You could use NIEM types, should an RDF NIEM representation be published; or you could use lots of other defined RDF types; but this is very purposely not restricted to any RDF vocabulary, and in fact you could make up new association types as you go.

So then: what's RDF?

Any real overview of RDF is outside the scope of this article (the W3C's RDF Primer is actually pretty good if you're interested), but in brief, it simply associates Resources - precisely identified entities - with characteristics in a very open-ended fashion. It uses Uniform Resource Identifiers (URIs) as names for those Resources, because URI is a well-understood naming system that is valid everywhere; people often think the U in URI stands for Universal, and in fact it might just as well. A URI represents exactly one Resource, no matter where in the universe you go.

If it's something you can identify and name with a URI, it's a Resource - something you can write RDF statements about.

A Quick Look at RDF

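RDF is composed of triples, simple statements with three parts:

  • Subject: the Resource the statement is about
  • Predicate: the kind of association being asserted
  • Object: the Resource or Literal value with which the Subject is associated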

One more fun thing: RDF is a graphical language. Its fundamental expression is diagrammatic, rather than textual. Any textual RDF you see has been created in accordance with some defined serialization scheme.

In the RDF pictorial representation, Resources (Subject or Object) are represented as ellipses, Predicates as directed arrows originating at the Subject and pointing to the Object, and Literal values are drawn as rectangles.

diagram of RDF basic statement
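
For readers who think more easily in code than in ellipses and arrows, here is a minimal sketch of the same idea in Python. It is not a real RDF library, just an illustration of the three-part statement; the class names and the example.org URIs are made up for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """Something named by a URI (an ellipse in the diagram)."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A plain value (a rectangle in the diagram)."""
    value: str


@dataclass(frozen=True)
class Triple:
    """One RDF statement: Subject --Predicate--> Object."""
    subject: Resource               # always a Resource
    predicate: Resource             # the arrow, itself named by a URI
    obj: Union[Resource, Literal]   # a Resource or a Literal value


def describe(t: Triple) -> str:
    """Render a triple as a one-line textual arrow."""
    target = t.obj.uri if isinstance(t.obj, Resource) else repr(t.obj.value)
    return f"{t.subject.uri} --[{t.predicate.uri}]--> {target}"


# Hypothetical URIs, for illustration only.
record = Resource("http://example.org/records/1")
has_note = Resource("http://example.org/terms/hasNote")
statement = Triple(record, has_note, Literal("A handwritten note"))

print(describe(statement))
```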

Now you know enough to go on with.

What makes RDF such a useful fit for the Instance Data problem?

First, its flexibility. Every association definition is identified by URI, so creating a new kind of association on the fly is a very simple matter - you simply define the association type, assign it a URI in a domain under your administrative authority, and you can use it in an RDF document. (Note the contrast with UML Class diagrams, in which all associations must be known and understood in advance.) This would make it very easy to construct representations of Intel Instance data, no matter how odd, without losing important information, and without trying to force-fit unprecedented content or structure into a preconceived form.

Of course, this means that the RDF would be essentially unconstrained. (Note: RDF can be constrained by the RDF Vocabulary Description Language (a.k.a. "RDF-Schema") in much the same fashion as XML is constrained by XML Schema; but when you're defining your data on an instance-by-instance basis, there's no benefit in imposing any such constraints.)

RDF's other big advantage is that it doesn't get too entangled to decouple. Bear in mind that we need to ensure that Case and Intel data do not contaminate one another. RDF structures are easily analyzed and separated (using Set Theory math if necessary) in order to create boundaries between different sets of data. This makes it easy to identify - and hide - any associations that cross the boundary.
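
The set-based separation can be sketched in a few lines of Python. This is an illustration only, under some loud assumptions: triples are plain tuples of URI strings, every URI is hypothetical, Literal objects are left out for brevity, and we assume each Resource has already been classified as belonging to the Case side or the Intel side.

```python
# Hypothetical URI prefixes, for illustration only.
CASE = "http://example.org/case/"
INTEL = "http://example.org/intel/"
TERMS = "http://example.org/terms/"

# Each triple is (subject, predicate, object); objects here are all Resources.
triples = {
    (CASE + "witness/1", TERMS + "residesAt", CASE + "address/1"),
    (INTEL + "note/1", TERMS + "reportedBy", INTEL + "investigator/harper"),
    (INTEL + "note/1", TERMS + "concerns", CASE + "address/1"),
}

# Assumed to be known in advance: which Resources sit on which side.
case_resources = {CASE + "witness/1", CASE + "address/1"}
intel_resources = {INTEL + "note/1", INTEL + "investigator/harper"}


def partition(triples, case_resources, intel_resources):
    """Split triples into Case-only, Intel-only, and boundary-crossing sets."""
    case_only = {t for t in triples if {t[0], t[2]} <= case_resources}
    intel_only = {t for t in triples if {t[0], t[2]} <= intel_resources}
    crossing = triples - case_only - intel_only
    return case_only, intel_only, crossing


case_only, intel_only, crossing = partition(triples, case_resources, intel_resources)

# A Case-side consumer would be shown case_only and nothing else;
# the crossing set is exactly what has to be hidden from that consumer.
print(len(case_only), len(intel_only), len(crossing))
```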

But... the IEPD is Still Really Good

We certainly did not want to discard the Drug Case IEPD in order to accommodate the Intel side. RDF Instance representations are not as efficient or generally desirable as the traditional UML-based IEPD form. So, the final design question we tackled was:

Is it feasible, and logically rigorous, to define an RDF-to-IEPD Binding that will connect elements of a Class-based instance (our Case IEPD information) to external RDF instance data?

Why NIEM Worked

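Fortunately for us, the NIEM designers made some fundamental decisions that turned out to be very helpful:

  • NIEM is defined as an abstract data model, rather than as an XML-specific syntactical model.
  • NIEM is RDF-compatible.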

It's hard to overstate the importance of these two characteristics.

In order to demonstrate that the RDF-to-IEPD binding is logically consistent, it's necessary to go to the abstract data definitions involved. RDF definitions are sufficiently abstract, and the parts of an RDF association expression are simple enough, to permit easy reasoning about RDF structures.

But, if NIEM had been formulated as an XML-specific syntactical model (such as GJXDM, for instance) rather than an abstract model, any such reasoning would have had to depend on an abstract model reverse engineered out of the XML definitions. This may not sound like a hard thing to do; but, given our constrained resources, it would have been a dealkiller. There's a lot of hard work involved in reverse engineering a provable, logically rigorous model. (Hand-building such a model? Demonstrating, to an acceptable degree of certainty, that nothing was omitted or crosswired? ...um, let's declare victory and go home, instead.)

As regards the second design decision, the NIEM designers' decision to make NIEM RDF-compatible also saved us a crucial amount of time in the project. We were able to reason as follows:

We don't for a moment think that the NIEM designers sat around and said, "Hey, you know, someday somebody might want to attach extrinsic instance data to NIEM documents, so let's make it easy for them." This is simply an example of the kind of payoff that comes when designers invest the extra time and effort (usually catching some flak for doing so) to do it right, and not cut corners. That the NIEM designers did stick to these particular high roads deserves recognition.

These beneficial characteristics of the NIEM model meant that, with our limited resources, we could afford to confidently state that:

If every information item in an IEPD-compliant Instance document can be assigned a URI, then external RDF statements can be made about any such item.

That's it. That's our abstract design, right there.

The conceptual reasoning looks something like this:

Here's a UML class diagram such as we use to define an IEPD. The class has a very simple member topology: a tree with two levels of branching.

Simplified UML Class Diagram

However, that diagram describes common properties of all documents of its class, and is less useful as a means of visualizing the structure of individual instances. A better diagram is the familiar tree representation (so beloved of GUI designers, among others), which displays individual nodes in the number they actually occur in the instance (as well as, if desired, additional properties such as attributes):

Tree structure diagram with nodes

If we can reliably and unambiguously assign URIs to any desired node in the IEPD instance document, we can make RDF statements about that node. In effect, we are attaching RDF information externally to the IEPD XML document, without modifying the IEPD instance in any respect. This exactly satisfies all of our requirements.

RDF diagram including XML representation

Or, if it's easier to visualize with serialized XML tags instead of the tree node structure:

RDF diagram including XML representation

Our First-Cut Implementation

The abstract design above is extremely simple. The only necessary step in constructing an implementation is to determine a way of assigning a URI to any desired data item within the IEPD instance.

As stated, WIJIS is currently short on resources, but we did manage to do that much. It's only a quick, improvisational implementation, based on the characteristics of the document's XML serialization rather than the topology of the IEPD model itself. It will work well, but is limited to

In future we hope to have the opportunity to create a more generic solution: one based on the abstract topology of UML Class diagrams, rather than the DOM-specific topology of XML representations. But for now, this is broadly useful and comparatively easy to implement.

WIJIS already had a scheme we use for identifying individual source Record documents by URI. Briefly, we define a Record as follows:

Records, Submitters, and WIJIS RecordURIs

WIJIS defines a Record as a primarily sourced data representation

When we say "primarily sourced" we are saying that a Record is not an aggregate document composed of data drawn from other sources. To that end, we assert that the information represented in the record is, for information sharing purposes, the property and administrative responsibility of a single owner, called the Submitter. The Submitter is the party that is responsible for providing, and guaranteeing the uniqueness and consistency of, Record Key Data that distinguish each such Record from any other Record held by the Submitter.

Note that the WIJIS definition of a Record specifically excludes such derived information structures as reports or search results.

The remaining requirement was to precisely identify any item within such a document. The obvious quick-and-dirty expedient was to restrict our solution to XML (DOM) documents and to use XPath, which can usually specify individual data items.

By combining the RecordURI (which reliably identifies the individual document) with an XPath expression placed in the Query Part of the URI, we come up with the following general URI formulation:

(URI of source Record)?itemxpath=(XPath Expression)

Of course, if the RecordURI has a preexisting Query Part, "itemxpath" can be placed anywhere within the sequence of name-value pairs.

Note that this formulation establishes itemxpath as a reserved name in the URI's Query Part, excluding URIs that have a preexisting "itemxpath" name-value pair in their Query Part. (Since "itemxpath" is an arbitrary name, a suitable implementation-specific alternate name could of course be selected in the event of collisions with URIs in the implementation space.)
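
The formulation above might be implemented along the following lines. This is a sketch, not WIJIS code: the function names are hypothetical, and percent-encoding the XPath expression is our own assumption (the formulation as written does not say how characters such as spaces, quotes, and brackets should be carried in the Query Part).

```python
from typing import Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ITEM_PARAM = "itemxpath"  # the reserved name in the Query Part


def make_item_uri(record_uri: str, xpath: str) -> str:
    """Append itemxpath=<xpath> to the Query Part of a RecordURI."""
    parts = urlsplit(record_uri)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(name == ITEM_PARAM for name, _ in pairs):
        # The reserved name is already taken: this RecordURI is excluded.
        raise ValueError(f"RecordURI already has a {ITEM_PARAM!r} pair")
    pairs.append((ITEM_PARAM, xpath))
    # urlencode percent-encodes the XPath (an assumption of this sketch).
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def split_item_uri(item_uri: str) -> Tuple[str, str]:
    """Recover (RecordURI, XPath expression) from an item URI."""
    parts = urlsplit(item_uri)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    xpaths = [value for name, value in pairs if name == ITEM_PARAM]
    if len(xpaths) != 1:
        raise ValueError(f"expected exactly one {ITEM_PARAM!r} pair")
    rest = [(name, value) for name, value in pairs if name != ITEM_PARAM]
    return urlunsplit(parts._replace(query=urlencode(rest))), xpaths[0]


item_uri = make_item_uri(
    "http://wijiscommons.org/examples/drug_case/1/",
    "//Witness/LocationAddress",  # hypothetical XPath, for illustration
)
print(item_uri)
print(split_item_uri(item_uri))
```

Appending the pair at the end is just one choice; as noted above, "itemxpath" can sit anywhere among the name-value pairs, and split_item_uri does not care where it finds it.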

This is a simple and easily worked scheme, but it has one uncomfortable drawback: RDF requires uniqueness (cardinality <= 1) for all URI references. Introduction of an XPath expression creates a logical mismatch: because an XPath expression has no formal upper bound on its resolution set, a URI extended by XPath in the way we propose could return two or more items from the IEPD instance document. This is a nontrivial shortcoming.

In order to work around this problem, the itemxpath expression MUST resolve to either zero or one data items within the referenced document in order to be valid. And there is nothing in the XPath syntax or datamodel that can cleanly guarantee that outcome; we don't know whether a given XPath, applied to a given IEPD Instance document, is cardinality-valid until runtime. This, therefore, requires careful runtime cardinality checking on any IEPD item URI, and careful error handling when XPath evaluation returns more than one node.
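
A runtime cardinality check of this kind might look like the sketch below. The assumptions are substantial: the sample document and its element names are invented for illustration, and the check uses Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath (no XPath 2.0 "eq" comparisons, for instance). A real implementation would need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET
from typing import Optional


class AmbiguousItemError(ValueError):
    """Raised when an itemxpath expression selects more than one node."""


def resolve_item(root: ET.Element, xpath: str) -> Optional[ET.Element]:
    """Return the single node selected by xpath, or None if there is none.

    Raises AmbiguousItemError when the expression selects two or more
    nodes, since such a URI cannot name exactly one Resource.
    """
    matches = root.findall(xpath)
    if len(matches) > 1:
        raise AmbiguousItemError(
            f"{xpath!r} selected {len(matches)} nodes; expected at most 1"
        )
    return matches[0] if matches else None


# Hypothetical instance document; not the Drug Case IEPD schema.
SAMPLE = """
<DrugCase>
  <Witness><PersonName>A</PersonName></Witness>
  <Witness><PersonName>B</PersonName></Witness>
  <Suspect><PersonName>C</PersonName></Suspect>
</DrugCase>
""".strip()

root = ET.fromstring(SAMPLE)

suspect = resolve_item(root, ".//Suspect")                    # exactly one node
missing = resolve_item(root, ".//Victim")                     # zero nodes: valid
witness_b = resolve_item(root, ".//Witness[PersonName='B']")  # predicate narrows to one

try:
    resolve_item(root, ".//Witness")                          # two nodes: invalid
except AmbiguousItemError as err:
    print("rejected:", err)
```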

Note that this is not a flaw in our abstract design, fortunately. This problem stems only from our hackish use of XPath in this first-cut implementation. The "more generic" UML-based solution mentioned above would need to correct this problem at the logical level.

So, What Does One Of These Beasts Look Like?

Case Data

Let's start with a simplified Instance example of the Drug Case IEPD. It's pretty straightforward. Note in particular that all namespace declarations and prefixes have been omitted for visual simplicity.

For the purposes of this article, let's assume that this example document has a RecordURI of http://wijiscommons.org/examples/drug_case/1/.

Intel Data

Let's say that the Intel instance data we want to attach is two handwritten notes from a Narcotics Investigator identified only as "Harper":

"According to informant Shaggydog, the residence at 650 S Pk Street, listed here as the address of witness Tim Alex, is a known buy point for diverted oxycontin and hydrocodone."

"Suspect Bob Petro (street name 'Dolla') is reported on the street to have been a member of 'Aryan Warlocks' during incarceration at Stillwater Prison, MN."

We don't know who Harper is, and we do not have Harper's knowledge of either note's credibility or sourcing. Both notes are clearly pieces of hearsay. We have no idea whether either piece of information is pertinent to the case. Also, there's a particular point of concern: we don't know who Shaggydog is; if "Shaggydog" is a known street name of the informant, he (or she) could be compromised, possibly killed, if this information were disclosed for any reason.

This stuff clearly does not rise to the level of evidence and cannot safely be allowed to creep into Case records. But it's still our job to bundle it together, safely, without losing context.

Attachment points for the RDF

There are actually quite a few valid ways to attach a note like this to an IEPD Instance. "Attachment" in this sense simply means defining a stable, persistent URI for the part of the document to which we want to connect RDF predicates. (The attachment can be Subject or Object in an RDF triple, by the way. RDF arrow direction has no significance outside of the semantic sense of the Predicate's definition.)

The most obvious, and least effortful, way to attach the data is simply to attach both notes as literal data to the Document itself (remember - it has its own URI, so it can serve as either Subject or Object in any RDF statement).
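
As a rough sketch of that least-effort option, the following Python writes each note as one statement in N-Triples, a line-oriented W3C serialization of RDF, with the example RecordURI as Subject and the note text as a Literal Object. The predicate URI is purely hypothetical (any URI under your own administrative authority would do), as are the function names, and the note texts are shortened here.

```python
from typing import Iterable, List


def ntriples_literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def attach_notes(document_uri: str, predicate_uri: str,
                 notes: Iterable[str]) -> List[str]:
    """Build one N-Triples statement per note, all hung on the document."""
    return [
        f"<{document_uri}> <{predicate_uri}> {ntriples_literal(note)} ."
        for note in notes
    ]


RECORD_URI = "http://wijiscommons.org/examples/drug_case/1/"
NOTE_PREDICATE = "http://example.org/terms/intelNote"  # hypothetical predicate

lines = attach_notes(
    RECORD_URI,
    NOTE_PREDICATE,
    [
        "650 S Pk Street is a known buy point.",
        "Suspect Bob Petro (street name 'Dolla') is reported to be a gang member.",
    ],
)
for line in lines:
    print(line)
```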

But hey - this example isn't about doing the least we can get away with. The process of determining attachment structure provides an opportunity to encode some semantic enrichment into the compound IEPD/RDF document. Let's see what we can do.

For the first note, it's pretty obvious that the attachment point is witness Tim Alex's address, which is in fact a discrete item in the document. We can select that address node with the XPath expression //IncidentSubject/Witness/LocationAddress[LocationStreet/StreetNumberText/text() eq "650"][LocationStreet/StreetName/text() eq "S Pk Street"].

Combining this expression with our example Record URI, we get http://wijiscommons.org/examples/drug_case/1/?itemxpath=//IncidentSubject/Witness/LocationAddress[LocationStreet/StreetNumberText/text() eq "650"][LocationStreet/StreetName/text() eq "S Pk Street"] which is verbose, but precisely identifies that node, and only that node.

The second note can be attached, in a similar fashion and by analogous logical reasoning, to the node representing the Suspect, Robert Petro.

Making the RDF

In the following RDF examples, we will use [650 S Park St] and [Robert Petro] as aliases for the unwieldy URIs devised to represent the nodes we have selected as attachment points for the first and second notes respectively. These aliases are for present convenience only, and do not represent any accepted RDF notation.

In similar fashion, we will use square brackets around URI aliases in general - all of which, in these examples, are purely conjectural, without even the tenuous reality of the attachment points in our example. Note that these aliased URIs refer to distinct Resources if they are inside an ellipse, and to anonymous instances of semantic Types if they are tagging a Predicate arrow.

A couple of other observations about these RDF examples:

Witness Address 650 South Park
RDF diagram for intelligence note about 650 South Park


Suspect Robert Petro
RDF diagram for intelligence note about Bob Petro