Appendix B. A Comparison of Data Modeling Techniques (Syntactic Conventions)

Peter Chen first introduced entity/relationship modeling in 1976 [Chen 1976, 1977]. It was a brilliant idea that has revolutionized the way we represent data. It was a first version only, however, and many people since then have made improvements on it. A veritable plethora of data-modeling techniques have been developed.

Things became more complicated in the late 1980s with the advent of a variation on this theme called “object modeling”. Now there were even more ways to model the structure of data. This was mitigated somewhat in the mid-1990s with the introduction of the UML, a modeling technique intended to replace at least all the “object-modeling” ones. As will be seen in this appendix, it is not quite up to replacing other entity/relationship approaches, but it has had a dramatic effect on the object-modeling world.

This appendix presents the most important of these techniques and provides a basis for comparing them with each other.

Regardless of the symbols used, data or object modeling is intended to do one thing: describe the things about which an organization wishes to collect data, along with the relationships among them. For this reason, all of the commonly used systems of notation are fundamentally convertible one to another. The major differences among them are aesthetic, although some make distinctions that others do not, and some do not have symbols to represent all situations.

This is true for object-modeling notations as well as entity/relationship notations.

There are actually three levels of conventions to be defined in the data-modeling arena: The first is syntactic, about the symbols to be used. These conventions are the primary focus of this appendix. The second defines the organization of model diagrams. Positional conventions dictate how entity types are laid out. Richard Barker has defined a very effective set of positional conventions [Barker 1990]. These are described in Chapter 3 (page 113). Finally, there are conventions about how the meaning of a model may be conveyed. Semantic conventions describe standard ways for representing common business situations. These are described briefly in Chapter 4 (pages 114–132). You can find more information about these in books by David Hay [Hay, 1996] and Martin Fowler [Fowler, 1997].

These three sets of conventions are, in principle, completely independent of each other. Given any of the syntactic conventions described here, you can follow any of the available positional or semantic conventions. In practice, however, promoters of each syntactic convention typically also promote at least particular positional conventions, if not the semantic ones as well.

In evaluating syntactic conventions, it is important to remember that data modeling has two audiences. The first is the business community that uses the models and their descriptions to verify that the analysts in fact understand their environment and their requirements. The second audience is the set of systems designers, who use the structures in the models and the business rules implied by them as the basis for computer system designs.

Different techniques are better for one audience or the other. Models used by analysts must be clear and easy to read. This often means that these models may describe less than the full extent of detail available. First and foremost, they must be accessible by a non-technical viewer. Models for designers, on the other hand, must be as complete and rigorous as possible, expressing as much as possible.

The evaluation, then, will be based both on the technical completeness of each technique and on its readability.

Technical completeness is in terms of the representation of:

  • Entity types and attributes

  • Relationships

  • Unique identifiers

  • Sub-types and super-types

  • Constraints between relationships

A technique's readability is characterized by its graphic treatment of relationship lines and entity-type boxes, as well as its adherence to the general principles of good graphic design. Among the most important of these principles is that each symbol should have only one meaning, which applies wherever that symbol is used, and that each concept should be represented by only one symbol. Moreover, a diagram should not be cluttered with more symbols than are absolutely necessary, and the graphics in a diagram should be intuitively expressive of the concepts involved. Your author has written several articles on this subject [e.g., Hay, 1998].

Each technique has strengths and weaknesses in the way it addresses each audience. As it happens, most are oriented more toward designers than toward the user community. These techniques produce very intricate models and focus on making sure that all possible constraints are described. Alas, this is often at the expense of readability.

This document presents seven notation schemes:

  • Peter Chen: He's the man who started it all.

  • Information Engineering: Clive Finkelstein and James Martin combined data modeling with an approach to systems development.

  • Richard Barker: His is the notation used in Europe's SSADM methodology and by the Oracle Corporation.

  • IDEF1X: This technique is supported and extensively used by the United States Department of Defense.

  • Object Role Modeling (ORM): This is a different approach to modeling facts and data.

  • The Unified Modeling Language (UML): This is the latest technique supported in the object-oriented world.

  • The Extensible Markup Language (XML): This is not strictly a data-modeling language, but it demonstrates some interesting data-structure ideas.

For comparison purposes, the same example model is presented in the following sections using each technique. Note that the UML is billed as an “object modeling” technique, rather than as a data (entity/relationship) modeling technique, but as you will see, its structure is fundamentally the same. This comparison is in terms of each technique's symbols for describing entity types (or “object classes”, for the UML), attributes, relationships (or object-oriented “associations”), unique identifiers, sub-types, and constraints between relationships.

At the end of the individual discussions is your author's argument in favor of Mr. Barker's approach for use in requirements analysis, along with his argument in favor of UML to support object-oriented design and IDEF1X to support relational database design.

Peter Chen

Peter Chen invented entity/relationship modeling in the mid-1970s [Chen, 1977], and his approach remains widely used today. It is unique in its representation of relationships and attributes. Relationships are shown with a separate diamond-shaped symbol on the relationship line, and attributes are shown in separate circles, instead of as annotations on each entity type.

A sample model, representing Chen's method, is shown in Figure B.1. This same example will be used to demonstrate all the techniques that follow. The model shows entity types, attributes, and relationships. It also has examples of both a super-type/sub-type combination and a constraint between relationships.

Figure B.1. A Chen Model.

In the diagram, each PURCHASE ORDER is related to a single PARTY and to one or more examples of either one PRODUCT or one SERVICE.

The diagram also includes two entity types (EVENT and EVENT CATEGORY) in an unusual relationship. In most “one-to-many” relationships, the “one” side is mandatory (“... must be exactly one”), while the “many” side is optional (“... may be one or more”). In this example, the reverse is true: Each EVENT may be in one and only one EVENT CATEGORY (zero or one), and each EVENT CATEGORY must be a classification for one or more EVENTS (one or more). That is, an EVENT may exist without being classified, or it may be in one and only one EVENT CATEGORY. An EVENT CATEGORY can come into existence, however, only if there is at least one event to put into it.

Entity Types and Attributes

Entity types are represented by square-cornered boxes, with their attributes hanging off them in circles. An entity type's name appears inside the rectangle, and an attribute's name appears inside the circle. There are no special marks to indicate whether attributes are mandatory or optional, or whether they participate in the entity type's unique identifier.

Names of entity types and attributes are common terms, and in multiword names, the words are separated by hyphens.

Relationships

Mr. Chen's notation is unique among the techniques shown here in that a relationship is shown as a two-dimensional symbol—a rhombus on the line between two or more entity types.

Note that this relationship symbol makes it possible to maintain a “many-to-many” relationship without necessarily converting it into an associative or intersect entity type. In effect, the relationship itself is playing the role of an associative entity type. The relationship itself is permitted to have attributes. Note how “quantity”, “actual price”, and “line number” are attributes of the relationship Order-line in Figure B.1.

Note also that relationships do not have to be binary. As many entity types as necessary may be linked to a relationship rhombus.

Cardinality/Optionality

In Mr. Chen's original work, only one number appeared at each end, showing the maximum cardinality. That is, a relationship might be “one-to-many”, with a “1” at one end and an “n” at the other. This would not indicate whether or not an occurrence of an entity type had to have at least one occurrence of the other entity type.

In most cases, an occurrence of an entity type that is related to one occurrence of another must be related to one, and an occurrence of an entity type that is related to more than one may be related to none, so most of the time the lower bounds can be assumed. The event/event category model, however, is unusual. Having just a “1” next to event, showing that an event is related to one event category, would not show that it might be related to none. The “n” which shows that each event category is related to more than one event would not show that it must be related to at least one.

For this reason, the technique can be extended to use two numbers at each end to show the minimum and maximum cardinalities. For example, the relationship party-order between PURCHASE ORDER and PARTY shows “1,1” at the PURCHASE ORDER end, showing that each PURCHASE ORDER must be with no fewer than one PARTY and no more than one PARTY. At the other end, “0,n” shows that a PARTY may or may not be involved with any PURCHASE ORDER and could be involved with several. The EVENT/EVENT CATEGORY model has “0,1” at the EVENT end and “1,n” at the EVENT CATEGORY end.
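The minimum/maximum pairs described above can be sketched in a few lines of Python. This is only an illustration: the class name Cardinality, its methods, and the constant names are assumptions introduced here, not part of Mr. Chen's notation. A maximum of None stands for “n”.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cardinality:
    """A (min, max) pair written at one end of a relationship; max=None means "n"."""
    minimum: int
    maximum: Optional[int]

    def label(self) -> str:
        """Return the pair as it would be written on the diagram, e.g. "0,n"."""
        upper = "n" if self.maximum is None else str(self.maximum)
        return f"{self.minimum},{upper}"

    def allows(self, count: int) -> bool:
        """True if an occurrence may take part in `count` relationship occurrences."""
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum


# The four relationship ends discussed in the text.
PURCHASE_ORDER_END = Cardinality(1, 1)     # each PURCHASE ORDER: exactly one PARTY
PARTY_END = Cardinality(0, None)           # each PARTY: zero or more PURCHASE ORDERS
EVENT_END = Cardinality(0, 1)              # each EVENT: zero or one EVENT CATEGORY
EVENT_CATEGORY_END = Cardinality(1, None)  # each EVENT CATEGORY: one or more EVENTS
```

With a single number at each end, EVENT_END and PURCHASE_ORDER_END would both be written “1”, which is exactly the ambiguity the two-number extension removes.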

In an alternative notation, relationship names may be replaced with “E” if the existence of occurrences of the second entity type requires the existence of a related occurrence of the first entity type. See “Unique identifiers” below for more about this.

Names

Because relationships are clearly considered objects in their own right, their names tend to be nouns.

The relationship between purchase-order and catalogue-item, for example, is called order-line. Sometimes a relationship name is simply a concatenation of the two entity type names. For example, party-order relates party and purchase order.

Entity type and relationship names may be abbreviated.

Unique Identifiers

A unique identifier is any combination of attributes and relationships that uniquely identifies an occurrence of an entity type.

While Mr. Chen recognizes the importance of attributes as entity-type unique identifiers [Chen, 1977, p. 23], his notation makes no provision for showing this. If the unique identifier of an entity type includes a relationship to a second entity type, he replaces the relationship name with “E”, makes the line into the dependent entity type an arrow, and draws a second box around this dependent entity type. (Figure B.2 shows how this would look if the relationship to party were part of the unique identifier of PURCHASE ORDER). This still does not identify any attributes that are part of the identifier.

Figure B.2. Existence-Dependent Relationship.

Sub-types

A sub-type is a subset of the occurrences of another entity type, its super-type. That is, an occurrence of a sub-type entity type is also an occurrence of its super-type. An occurrence of the super-type is also an occurrence of exactly one or another of the sub-types.

Though not in Mr. Chen's original work, the technique was extended to include this by Mat Flavin [Flavin, 1981] and Robert Brown [Brown, 1993].

In this extension, sub-types are represented by separate entity-type boxes, each removed from its super-type and connected to it by an “isa” relationship. (Each occurrence of a sub-type “is a[n]” occurrence of the super-type.) The relationship lines are linked by a rhombus, and each relationship to a sub-type has a bar drawn across it. In Figure B.1, for example, PARTY is a super-type, with PERSON and ORGANIZATION as its sub-types. Similarly, a CATALOGUE ITEM must be either a PRODUCT or a SERVICE.

Constraints between Relationships

The most common case of constraints between relationships is the “exclusive or”, meaning that each occurrence of the base entity type must (or may) be related to occurrences of one other entity type, but not more than one. Such constraints appear in most of the techniques that follow.

Mr. Chen does not deal with constraints directly at all. This must be done by defining an artificial entity type and making the constrained entity types into sub-types of that entity type. This is shown in Figure B.1 with the entity type CATALOGUE ITEM, which has mutually exclusive sub-types product and service. Each purchase order has an order-line relationship with one CATALOGUE ITEM, where each CATALOGUE ITEM must be either a PRODUCT or a SERVICE.

Comments

Mr. Chen was first, so it is not surprising that his technique does not express all the nuances that have been included in subsequent techniques. It does not annotate characteristics of attributes, and it does not show the identification of entity types without sacrificing the names of the relationships.

While it does permit showing multiple inheritance and multiple type hierarchies, the multibox approach to sub-types takes up a lot of room on the drawing, limiting the number of other entity types that can be placed on it. It also requires a great deal of space to give a separate symbol to each attribute and each relationship. Moreover, it does not clearly convey the fact that an occurrence of a sub-type is an occurrence of a super-type.

Information Engineering

“Information engineering” was originally developed by Clive Finkelstein in Australia in the late 1970s. He collaborated with James Martin to publicize it in the United States and Europe [Martin & Finkelstein, 1981], and then Martin went on from there to become predominantly associated with it [Martin & McClure, 1985]. Mr. Finkelstein later published his own version [Finkelstein, 1989; Finkelstein, 1992]. Because of the dual origin of the techniques, there are minor variations between Mr. Finkelstein's and Mr. Martin's notations. The information-engineering version of our test case (with some of the notations from each version) is shown in Figure B.3.

Figure B.3. An Information-Engineering Model.

In the example, each PARTY is vendor in zero, one, or more PURCHASE ORDERS, each of which initially has zero, one, or more LINE ITEMS, but eventually it must have at least one LINE ITEM. Each LINE ITEM, in turn, is for either exactly one PRODUCT or exactly one SERVICE. Also, each EVENT classifies zero or one EVENT TYPE, while each EVENT TYPE must be (related to) one or more EVENTS.

Entity Types and Attributes

Mr. Finkelstein defines entity type in the designer's sense of representing “data to be stored for later reference” [Finkelstein, 1992, 23]. Mr. Martin, however, adopts the analyst's definition that “an entity type is something (real or abstract) about which we store data” [Martin & McClure, 1985, 249].

Entity types are shown in square-cornered rectangles. An entity type's name is inside its rectangle. Attributes are not shown at all. Mr. Finkelstein shows them in a separate document, the “entity type list”. Mr. Martin has another modeling technique, called “bubble charts”, specifically for modeling attributes, keys, and other attribute characteristics.

Names of entity types are common terms, and the words in multiword names are separated by spaces.

Relationships

Relationships are shown as solid lines between pairs of entity types, with symbols on each end to show cardinality and optionality.

Names

Mr. Martin names relationships with verbs, often only in one direction. Mr. Finkelstein doesn't name relationships at all.

Cardinality/Optionality

Each relationship in information engineering has two halves, with each half described by one or more symbols. If an occurrence of the first entity type may or may not be related to occurrences of the second, a small open circle appears near the second entity type. If it must have at least one occurrence of the second, a short line crosses the relationship line instead. If an occurrence of the first entity type can be related to no more than one of the second entity type (“one and only one”), another short line crosses the relationship. If it can be related to more than one of the second entity type (“one or more”), a crow's foot is put at the intersection of the relationship and the second entity-type box.

For example, in Figure B.3, a PARTY is vendor in zero, one, or more PURCHASE ORDERS. A PURCHASE ORDER, on the other hand, (is to) one and only one PARTY.

Mr. Finkelstein has a unique notation, also shown in the figure. Note that each purchase order initially may have one or more line items, but eventually it must have at least one. That is, it is possible to create a purchase order without having to fill in the line items immediately, but at least one must be added later. The bar across the line between the circle and the crow's foot shows this.
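As a rough illustration of how the symbols at one end of an information-engineering relationship combine, the following Python sketch returns the names of the symbols for a given optionality and cardinality. The function name ie_half_symbols, its parameters, and the symbol names are assumptions made for this sketch. Only the combinations described in the text are covered, and the order of the list is not meant to prescribe exact placement on the line.

```python
from typing import List


def ie_half_symbols(mandatory: bool, many: bool,
                    eventually_mandatory: bool = False) -> List[str]:
    """Name the symbols drawn near the second entity type for one relationship half.

    mandatory            -- the first entity type must have at least one of the second
    many                 -- the first entity type may have more than one of the second
    eventually_mandatory -- Finkelstein's "initially may be, eventually must be"
    """
    if eventually_mandatory and (mandatory or not many):
        # The text only describes this marking on an optional "many" end.
        raise ValueError("combination not described in the text")

    symbols = []
    if mandatory:
        symbols.append("bar")              # must have at least one
    else:
        symbols.append("open circle")      # may have none
        if eventually_mandatory:
            symbols.append("bar")          # bar between circle and crow's foot
    # Upper bound: a crow's foot for "more than one", another bar for "no more than one".
    symbols.append("crow's foot" if many else "bar")
    return symbols
```

For instance, the PARTY to PURCHASE ORDER half (“zero, one, or more”) would be ie_half_symbols(False, True), and the PURCHASE ORDER to PARTY half (“one and only one”) would be ie_half_symbols(True, False).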

Unique Identifiers

Unique identifiers are not represented in an information-engineering data model. Mr. Martin shows them separately in “bubble diagrams”.

Sub-types

Mr. Martin represents sub-types as nested boxes inside the super-type box. This is shown in the figure. Mr. Finkelstein portrays them as separate boxes, linked with “isa” relationship lines, as used in the Chen notation described above.

Constraints between Relationships

In information-engineering notation, a constraint between relationships is shown by the relationship halves of the three (or more) entity types involved meeting at a small circle. If the circle is solid, the relationship between the relationships is “exclusive or”, meaning that each occurrence of the base entity type must (or may) be related to occurrences of one other entity type, but not more than one. This is shown in the figure, where each LINE ITEM is for either one PRODUCT or is for one SERVICE, but not both. If the circle is open, it is an “inclusive or” relationship, meaning that an occurrence of the base entity type must (or may) be related to occurrences of one, some, or all of the other entity types.
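The difference between the two kinds of constraint can be made concrete with a small check. This is a minimal sketch under stated assumptions: the function name satisfies_constraint and its parameters are hypothetical, and the input is simply a count of related occurrences per constrained entity type.

```python
from typing import Dict


def satisfies_constraint(related: Dict[str, int], exclusive: bool, mandatory: bool) -> bool:
    """Check one occurrence of the base entity type against an "or" constraint.

    related   -- number of related occurrences for each constrained entity type
    exclusive -- True for "exclusive or" (solid circle), False for "inclusive or" (open circle)
    mandatory -- True for "must be", False for "may be"
    """
    used = sum(1 for count in related.values() if count > 0)
    if used == 0:
        # Related to none of the entity types: allowed only if the constraint is optional.
        return not mandatory
    # Exclusive: exactly one entity type may be used. Inclusive: one, some, or all.
    return used == 1 if exclusive else True
```

A LINE ITEM that is for one PRODUCT and no SERVICE passes the exclusive check, while one that is for both fails it and would pass only under an inclusive constraint.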

Comments

Information engineering is widely practiced. It is reasonably concise and attractive, consistent, and has a minimum of clutter. It is, however, missing important notations for attributes and unique identifiers, although some CASE tools have added these. Mr. Martin's approach to sub-types is compact and therefore desirable if models are to be presented to the nontechnical community, while Mr. Finkelstein's is not. Mr. Finkelstein's notation for “initially may be but eventually must be” is a very ingenious solution to a common modeling situation, not found in any other notation.

Richard Barker's Notation (as Used by Oracle Corporation)

The next notation was originally developed by the British consulting company CACI and is part of the European methodology, SSADM. It was subsequently promoted by Richard Barker [Barker, 1990] and adopted by the Oracle Corporation for its “CASE*Method” (subsequently renamed the “Custom Development Method” [Oracle, 1996]).

Figure B.4 shows our example as represented in this notation. In the diagram, each PURCHASE ORDER must be issued to one and only one PARTY and may be composed of one or more LINE ITEMS, each of which in turn must be for either one PRODUCT or one SERVICE. Also, each EVENT may be in one and only one EVENT TYPE, while each EVENT TYPE must be a classification for one or more EVENTS.

Figure B.4. A CASE*Method Data Model.

Entity Types and Attributes

Entity types in Barker's notation are shown as round-cornered rectangles. Attributes may be displayed inside the entity-type boxes.

Officially, attributes are shown with small open circles for optional attributes, solid circles for required attributes, and octothorps (#) for attributes which participate in unique identifiers. Often in practice, however (and throughout this Appendix), dots are used for all required and optional attributes not in a unique identifier.

Relationships

Relationships are shown as lines, with each half solid or dashed, depending on whether that part of the relationship is mandatory or not. The presence or absence of a crow's foot on each end shows that end as referring to, respectively, up to many or no more than one occurrence of that entity type. Naming conventions allow the relationship at each end to be read as a concise, disciplined, but easy-to-understand sentence.

Cardinality/Optionality

Relationships are in two parts, one representing the relationship going in each direction. In a relationship half, different symbols address the upper and lower boundaries of the relationship: A dashed line near the first, subject, entity type shows that the relationship is optional and means “zero or more” (read as “may be”), and a solid line represents a mandatory relationship that means “at least one” (read as “must be”). A “crow's foot” next to the second entity type represents “up to many” (read as “one or more”), while no crow's foot represents “up to one” (read as “one and only one”).

Names

The Barker notation is unique in the way it names relationships. Relationship names are prepositions or prepositional phrases, not verbs, so that normal and meaningful English sentences can be constructed from them. The sentences are of the structure:

Each
<entity type 1>
{must be | may be} (if the line is solid or dashed, respectively)
<relationship>
{one or more | one and only one} (if there is or is not a crow's foot, respectively)
<entity type 2>.

For example, in Figure B.4, “Each party may be a vendor in one or more purchase orders,” and “Each purchase order must be issued to one and only one party.”
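Because the sentence structure is so regular, it can be generated mechanically from the four facts that a relationship half carries. The sketch below illustrates this; the function name barker_sentence and its parameter names are assumptions introduced for the example.

```python
def barker_sentence(subject: str, relationship: str, obj: str,
                    mandatory: bool, many: bool) -> str:
    """Build the English sentence for one half of a Barker relationship.

    mandatory -- True for a solid line half ("must be"), False for dashed ("may be")
    many      -- True if there is a crow's foot ("one or more"), else "one and only one"
    """
    verb = "must be" if mandatory else "may be"
    quantity = "one or more" if many else "one and only one"
    return f"Each {subject} {verb} {relationship} {quantity} {obj}."


# The two halves of the PARTY / PURCHASE ORDER relationship from Figure B.4.
print(barker_sentence("party", "a vendor in", "purchase orders", mandatory=False, many=True))
print(barker_sentence("purchase order", "issued to", "party", mandatory=True, many=False))
```

The relationship name slots in unchanged only because it is a prepositional phrase; a verb-based name would not read correctly after “must be” or “may be”.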

Unique Identifiers

A unique identifier is any combination of attributes and relationships which uniquely identifies an occurrence of an entity type. Attributes which are parts of the definition of a unique identifier are shown preceded by octothorps (#). Relationships which are part of the definition of a unique identifier are marked by a short line across the relationship near the entity type being identified.

For example, in Figure B.4, each occurrence of PARTY is identified by its “Party ID”, and the unique identifier of LINE ITEM is a combination of the attribute “Line number” and the relationship “part of one and only one PURCHASE ORDER.” Since the marked relationship represents the fact that each LINE ITEM is partly identified by a particular PURCHASE ORDER, it implies that the PURCHASE ORDER'S unique identifier “PO number” participates in identification of the LINE ITEM as well. When implemented, a column derived from “PO number” will be generated in the table derived from LINE ITEM. It will serve as a foreign key to the table derived from PURCHASE ORDER and will be part of the primary key of the table that is derived from LINE ITEM.
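The implementation step described above can be sketched with Python's built-in sqlite3 module. This is an illustration only: the table and column names (purchase_order, line_item, po_number, line_number) are assumed renderings of the entity type and attribute names, the sample values are invented, and a real design would carry the other attributes as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE purchase_order (
    po_number INTEGER PRIMARY KEY
);
CREATE TABLE line_item (
    -- Column derived from "PO number": a foreign key AND part of the primary key.
    po_number   INTEGER NOT NULL REFERENCES purchase_order (po_number),
    line_number INTEGER NOT NULL,
    PRIMARY KEY (po_number, line_number)
);
""")
conn.execute("INSERT INTO purchase_order VALUES (1001)")
conn.execute("INSERT INTO line_item VALUES (1001, 1)")


def try_insert(po_number: int, line_number: int) -> bool:
    """Try to add a line item; return False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO line_item VALUES (?, ?)", (po_number, line_number))
        return True
    except sqlite3.IntegrityError:
        return False
```

Here a second line 1 on purchase order 1001 is rejected by the composite primary key, and any line item for a purchase order that does not exist is rejected by the foreign key.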

Note that Mr. Barker's notation distinguishes the unique identifier in the conceptual model from the “primary key” which identifies rows in a physical table. The unique identifier is shown, while the primary key is not. Similarly, since a foreign key is simply the implementation of a relationship, this is not shown explicitly here either.

Sub-types

Barker's notation shows sub-types as boxes inside super-type boxes, according to the approach to set theory laid out by Leonhard Euler in the 18th century. This has the advantage of taking up much less room on the diagram, and it emphasizes the fact that an occurrence of a sub-type is an occurrence of the super-type. The super- and sub-types are not simply related to each other. This does mean, however, that multiple inheritance (multiple super-types for one sub-type) and multiple type hierarchies (multiple ways of dividing a super-type into sub-types) cannot be represented by a Barker model.

In Barker's notation, sub-types are exclusive, meaning that overlapping sub-types are not allowed. Sub-types are also complete, meaning that sub-types are supposed to account for all occurrences of a super-type, although in practice this latter rule is often bent by adding the sub-type OTHER. In Figure B.4, PERSON and ORGANIZATION are sub-types of PARTY.
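The two rules can be expressed as set conditions on the occurrences of the super-type and its sub-types. The following Python sketch checks them; the function name check_subtypes and the sample party names are made up for illustration.

```python
from typing import Dict, Set


def check_subtypes(supertype: Set[str], subtypes: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Check a set of sub-type occurrences against the rules described in the text."""
    all_members = [occ for members in subtypes.values() for occ in members]
    covered = set(all_members)
    return {
        # Every sub-type occurrence is also an occurrence of the super-type.
        "subset": covered <= supertype,
        # Exclusive: no occurrence appears in more than one sub-type.
        "exclusive": len(all_members) == len(covered),
        # Complete: every super-type occurrence falls in some sub-type.
        "complete": supertype <= covered,
    }


# Hypothetical occurrences of PARTY and its sub-types.
parties = {"Alice", "Bob", "Acme Ltd"}
result = check_subtypes(parties, {"PERSON": {"Alice", "Bob"}, "ORGANIZATION": {"Acme Ltd"}})
```

Adding a catch-all sub-type such as OTHER is one way to make the “complete” condition hold when the named sub-types do not cover every occurrence.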

Constraints between Relationships

The only constraint between relationships available in Mr. Barker's notation is the exclusive or. An arc across two relationships represents the fact that each occurrence of an entity type must be (or may be) related to occurrences of one of two or more other entity types, but not more than one. For example, Figure B.4 shows that each LINE ITEM must be either for one PRODUCT or for one SERVICE.

Comments

Several things distinguish this notation from those described elsewhere. These are factors that make the Barker technique the most desirable to use in a requirements analysis project. The technique results in models that are much better for presenting to the public at large than those produced by any other.

First, this notation uses relatively few distinct symbols. There is only one kind of entity type. Whether it is a role, an intersection, or another kind of association between two entity types, it is represented by the same round-cornered rectangle. The full range of relationship types is shown by line halves, which may be solid or dashed, and by the presence or absence of a crow's foot on each end. Unique identifiers, where it is important to show them, are shown by either the hash marks next to an attribute, or a small mark across a relationship line, and dependency is implied by the use of a relationship in a unique identifier. Attributes may be shown with indicators of their optionality.

Other notations are, to varying degrees, more complicated than that.

Second, sub-types are shown as entity types inside other entity types. Most other notations place sub-types outside the super-type, connected to it with “isa” relationship lines. This takes up much more space on a diagram and does not convey as emphatically the fact that an occurrence of a sub-type is an occurrence of the super-type. Moreover, it is not easy to see that an attribute of or a relationship to a super-type is also an attribute of or a relationship to every sub-type of that super-type.

Third, Barker's notation permits “exclusive or” constraints between relationships, which show that an occurrence of one entity type may be related to occurrences of either of two or more other entity types. This is more than is available in some notations, and less than in others.

The last, and perhaps the most important, thing to distinguish this technique from the others is a rigorous naming standard for relationships. Relationship names are prepositions, not verbs. A little reflection should reveal why this is appropriate, since it is the preposition in English grammar, not the verb, that denotes a relationship. (Verbs suggest functions, which are featured in other kinds of models.) The implied verb in every relationship sentence is “to be”, expressed as either “must be” or “may be”.

Note that in the examples of notations without this discipline, the verbs often include “is” anyway.

This use of prepositions makes it possible to use common English sentences to represent relationships completely. It is not always easy to come up with just the right word, but the exercise of trying to do so significantly improves your understanding of the true nature of the relationship.

This discipline could certainly be followed with the other techniques, but none of the books your author has found to describe these techniques endorses it.

IDEF1X

IDEF1X is a data-modeling technique that is used by many branches of the United States Federal Government [FIPS 1993] [also see Bruce 1992]. The IDEF1X version of the sample model is shown in Figure B.5.

Figure B.5. An IDEF1X Model.

Entity Types and Attributes

Entity types in IDEF1X are shown by round-cornered or square-cornered rectangles. Round-cornered rectangles represent “dependent” entity types—those whose unique identifier includes at least one relationship to another entity type. “Independent” entity types, whose identifiers are not derived from other entity types, are shown with square corners.

The name of the entity type appears outside the box. The box is divided, with identifying attributes (here referred to as the “primary key”) above the division and nonidentifying attributes below.

In multiword entity type names, the words may be separated by hyphens, underscores, or blanks.

Relationships

In IDEF1X, relationships are asymmetrical: Different symbols for optionality are used, depending on the relationship's cardinality. Unlike the other notations, symbols cannot be parsed in terms of optionality and cardinality independently. Each set of symbols describes a combination of the optionality and cardinality of the entity type next to it.

In addition to a relationship line from an entity type, the foreign key that would implement the line in a relational database design is shown as an attribute of that entity type.

If a relationship is part of an entity type's unique identifier, it is shown as a solid line; if not, it is shown as a dashed line.
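The way IDEF1X ties both the box shape and the line style to the unique identifier can be sketched as follows. The class name EntityType, its fields, and the relationship labels are assumptions made for this example. The identifiers used for PARTY and LINE ITEM are the ones given in the discussion of Mr. Barker's notation and are assumed to carry over.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityType:
    name: str
    identifying_attributes: List[str] = field(default_factory=list)
    identifying_relationships: List[str] = field(default_factory=list)

    def is_dependent(self) -> bool:
        """Dependent: the unique identifier includes at least one relationship."""
        return len(self.identifying_relationships) > 0

    def box_shape(self) -> str:
        return "round-cornered" if self.is_dependent() else "square-cornered"

    def line_style(self, relationship: str) -> str:
        """Solid if the relationship is part of the identifier, dashed otherwise."""
        return "solid" if relationship in self.identifying_relationships else "dashed"


party = EntityType("PARTY", ["Party ID"])
line_item = EntityType("LINE ITEM", ["Line number"], ["part of PURCHASE ORDER"])
```

Under these assumptions PARTY is drawn with square corners, while LINE ITEM is round-cornered and its relationship to PURCHASE ORDER is a solid line.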

Table B.1 shows, for IDEF1X and the Barker notation, all the possible combinations of cardinality and optionality on both ends of the relationship.

Cardinality/Optionality

As seen in the table, optionality is shown differently for the “many” and the “one” sides of a relationship. Most of the time, a solid circle next to an entity type means zero, one, or more occurrences of that entity type. If there is no other symbol next to the entity type on this “many” side of a relationship, the relationship is optional. See lines 1-3, and 7 in Table B.1. That is, the solid circle stands for zero, one, or more (“may be... one or more”) if it is by itself. Adding the letter P makes the relationship mandatory (meaning “must be one or more”)[1]. Adding a “1” also makes the relationship mandatory, but this changes the cardinality of the relationship to exactly one. It changes the meaning of the solid circle from “may be one or more” to “must be one and only one”. (See lines 4, 6, 8, and 10 in the table.) Adding the letter Z keeps the relationship optional, but that changes the cardinality of the solid circle to “may be one and only one”.

So a solid circle may mean “must be” or “may be”, and it may mean “one or more” or “one and only one”, depending on the other symbols around it. That is to say, the solid circle does not convey any inherent meaning in itself.

Absence of a solid circle next to an entity type means that only one occurrence of that entity type is involved (“one and only one”). If there is no symbol next to the entity type on the “one” side of the relationship, the entity type is mandatory (“must be one and only one”), as shown in lines 1, 2, 4, 5, 11–18.

Placing a small diamond symbol next to the entity type means that the other entity type in the relationship may be related to one and only one occurrence (“zero or one”) of that entity type. (See lines 3, 6, 16, and 19.) This, then, is an alternative way to specify an optional one-and-only-one occurrence as an entity type. We saw above that you could also use a solid circle with a letter Z under it (see lines 21–24.)

Since the solid circle—which usually represents “may be one or more”—always appears on the “many” side of a relationship, the use of the solid circle in a many-to-many relationship makes each end optional. Adding the letter “P” on one or both ends makes the end so modified mandatory (see lines 7 through 10).

Table B.1. Comparison of Barker and IDEF1X Notations

 

Line | CASE*Method Notation | IDEF1X Notation | CASE*Method Description | IDEF1X Description
1 | (diagram) | (diagram) | Each A may be... one or more B's. Each B must be... one and only one A. (A partially identifies B.) | One to zero or more (dependent)
2 | (diagram) | (diagram) | Each A may be... one or more B's. Each B must be... one and only one A. | One to zero or more
3 | (diagram) | (diagram) | Each A may be... one or more B's. Each B may be... one and only one A. | Zero or one to zero or more
4 | (diagram) | (diagram) | Each A must be... one or more B's. Each B must be... one and only one A. | One to one or many
5 | (diagram) | (diagram) | Each A must be... one or more B's. Each B must be... one and only one A. (A partially identifies B.) | One to one or many (dependent)
6 | (diagram) | (diagram) | Each A must be... one or more B's. Each B may be... one and only one A. | One to zero or many
7 | (diagram) | (diagram) | Each A may be... one or more B's. Each B may be... one or more A's. | Zero or many to zero or many
8 | (diagram) | (diagram) | Each A must be... one or more B's. Each B must be... one or more A's. | One or many to one or many
9 | (diagram) | (diagram) | Each A may be... one or more B's. Each B must be... one or more A's. | Zero or many to one or many
10 | (diagram) | (diagram) | Each A must be... one or more B's. Each B may be... one or more A's. | One or many to zero or many
11 | (diagram) | (diagram) | Each A must be... one and only one B. Each B must be... one and only one A. | One to one
12 | (Same as 11) | (diagram) | (Same as 11) |
13 | (Same as 11) | (diagram) | (Same as 11) |
14 | (Same as 11) | (diagram) | (Same as 11) |
15 | (diagram) | (diagram) | Each A must be... one and only one B. Each B must be... one and only one A. (B partially dependent on A.) | One to one (dependent)
16 | (diagram) | (diagram) | Each A must be... one and only one B. Each B may be... one and only one A. | Zero or one to one
17 | (diagram) | (diagram) | Each A may be... one and only one B. Each B must be... one and only one A. | One to zero or one
18 | (diagram) | (diagram) | Each A may be... one and only one B. Each B must be... one and only one A. (A partially identifies B.) | One to zero or one (dependent)
19 | (diagram) | (diagram) | Each A may be... one and only one B. Each B may be... one and only one A. | Zero or one to zero or one
20 | (Same as 19) | (diagram) | (Same as 19) |
21 | (Same as 19) | (diagram) | (Same as 19) |
22 | (Same as 19) | (diagram) | (Same as 19) |

The two ways of showing that an occurrence of one entity type “must be” related to a single occurrence of another mean that there are four different ways to represent a mandatory one-to-one relationship. These are shown in lines 11–14. Similarly, optional one-to-one relationships can be shown in four different ways, as shown in lines 19–22. One-to-one relationships that are partly optional and partly mandatory can be shown in two ways, depending on which way the model is oriented, as shown in lines 16–17.

Note: The variations in notation above, which have no meaning in the conceptual model, turn out to be significant in the physical design. The difference between representations for “may be one and only one” has to do with the fact that the diamond implies an optional foreign key in the opposite entity type, while the circle with the Z simply says that there may or may not be a child occurrence. In the other cases of multiple representation of the same concept (lines 11–14 and 19–22), the culprit is again physical implementation. Each of the different symbol sets is implemented in a different way. Indeed, some of the symbol combinations cannot be implemented as expressed.

In other words, the symbols are deeply linked to the implementation of the tables, not the logic of the situation. Thus, IDEF1X is fundamentally a physical database design modeling technique, not one appropriate for doing conceptual design.

Names

A relationship name is a verb or verb phrase, where multiple words are separated by spaces. Relationships are identified in both directions.

Unique Identifiers

As stated above, a unique identifier is represented in IDEF1X by the primary key which will implement it in a relational database. Since all relationships are shown by foreign-key attributes, the primary key may consist of any combination of foreign-key and non-foreign-key attributes. If a foreign key is present in the unique identifier primary key, then the otherwise dashed relationship line becomes solid, and the entity-type box acquires round corners.
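To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions chosen to echo the running purchase-order example, not a transcription of Figure B.5. The foreign key in line_item takes part in the primary key (the solid-line case), while the foreign key in purchase_order does not (the dashed-line case).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE party (party_id INTEGER PRIMARY KEY);

-- Foreign key outside the primary key: IDEF1X would draw a dashed line.
CREATE TABLE purchase_order (
    po_number INTEGER PRIMARY KEY,
    party_id  INTEGER NOT NULL REFERENCES party(party_id)
);

-- Foreign key inside the primary key: IDEF1X would draw a solid line
-- and give the line_item box round corners.
CREATE TABLE line_item (
    po_number   INTEGER NOT NULL REFERENCES purchase_order(po_number),
    line_number INTEGER NOT NULL,
    PRIMARY KEY (po_number, line_number)
);
""")


def primary_key_columns(table):
    """List a table's primary-key columns in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, default, pk); pk is 0 for
    # non-key columns and the 1-based key position otherwise.
    keyed = sorted((row[5], row[1]) for row in rows if row[5] > 0)
    return [name for _position, name in keyed]
```

Inspecting the two tables with primary_key_columns shows the foreign key po_number inside the key of line_item, but party_id absent from the key of purchase_order.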

Sub-type

IDEF1X shows sub-types as separate entity-type boxes, each removed from its super-type and connected to it by an “isa” relationship. (Each occurrence of a sub-type “is a[n]” occurrence of the super-type.)

There are two kinds of sub-types. In Figure B.5, the circle with two horizontal lines under it is a complete subtyping arrangement: All occurrences of the parent must be occurrences of one or the other sub-type. A circle with only one horizontal line below it is an incomplete subtyping arrangement: The sub-types do not represent all possible occurrences of the super-type.

All sub-types extending from a single sub-type symbol are mutually exclusive. Sub-types may be shown to be overlapping by being descended from different sub-type symbols attached to the same super-type entity type.

In IDEF1X, the unique identifier of the sub-type must always be identical to the identifier of the super-type. This point is reinforced by including the foreign key (“(FK)”) designator next to the unique identifier of the sub-type, referring to the unique identifier of the super-type. Optionally, a “role name” may be prefixed to the foreign-key name in the sub-type. In Figure B.5, the role names “product-code” and “service-id” are prefixed to “item number” to form the primary keys of product and service. Note that, since the keys themselves remain identical to the key of the super-type, adding role names does not change their format in any way.

An attribute used to discriminate between the sub-types is placed next to the sub-type symbol. For example, “person-organization-type” is shown in Figure B.5 to distinguish between occurrences of person and organization. If the sub-types were implemented in a single table for the super-type, this would become a separate column for discriminating between occurrences of the different sub-types.
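The single-table implementation mentioned above might look like the following sketch, again using Python's sqlite3 module. The table layout, the column name person_organization_type, and the two permitted values are assumptions made for illustration. Making the column mandatory and restricting it to one value from a fixed list corresponds to a complete, mutually exclusive sub-type arrangement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    -- Discriminator column: every row is exactly one of the two sub-types.
    person_organization_type TEXT NOT NULL
        CHECK (person_organization_type IN ('PERSON', 'ORGANIZATION'))
)""")


def try_insert(row):
    """Insert a (party_id, name, type) row; report whether it was accepted."""
    try:
        conn.execute("INSERT INTO party VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

A row whose type is neither value, or is missing altogether, is rejected by the database rather than by application code.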

Constraints between Relationships

IDEF1X does not have an explicit way to represent constraints between relationships. Instead of saying “A” is related to “B” or to “C”, it is necessary to define an entity type, “D”, and then use the sub-type notation. Thus you would say “A” is related to “D”, which must be either a “B” or a “C”.

The ability to express exhaustiveness and exclusivity in sub-types does carry over to this situation.

This is shown in Figure B.5 with the creation of CATALOGUE ITEM as a super-type of PRODUCT and SERVICE.

Comments

IDEF1X symbols do not map cleanly to the concepts they are supposed to model. A concept that should be represented by a single symbol requires several together, and it requires different symbols under different circumstances. That is, particular situations can be represented by more than one set of symbols, while the same symbol can mean different things, depending on context. Which symbol is used to describe a particular situation is heavily dependent on the context of that situation and on how the relationship will be implemented, not just on the situation itself.

For example, the symbol to be used for optionality depends on the cardinality of the relationship. The solid circle symbol can mean anything, depending on its setting. Similarly, a cardinality/optionality combination may be represented in different ways. This is because what is being represented is not a conceptual structure, but an implementation method.

The effect of all this is that it is prohibitively difficult to teach a nontechnical viewer to read an IDEF1X diagram.

A dominant graphic feature of any relationship line is its being solid or dashed. Barker's notation uses this feature to distinguish between relationships that are required and those that are not. Among those relationships that are, those participating in a unique identifier may be simply marked with an extra line across them, but this level of detail is often not required.

In IDEF1X, however, the solidity of a line describes the participation of one entity type in the unique identifier (primary key) of the other. This requires the analyst to begin by analyzing dependency, before addressing the optionality or cardinality of the model's relationships.

In a real modeling situation, however, an analyst in fact normally starts by examining which entity types are required for which other entity types, and how many occurrences are involved. The details of keys or identifiers are typically not addressed until much later.

And corrections to the model are unnecessarily difficult: If you make a single error in cardinality or optionality (say, the one-to-one mandatory relationship should really be optional), then several symbols must be changed.

While it does permit showing multiple inheritance and multiple type hierarchies, the multibox approach to sub-types takes up a lot of room on the drawing, limiting the number of other entity types that can be placed on it. It also requires a great deal of space to give a separate symbol to each attribute and each relationship. Moreover, it does not clearly convey the fact that an occurrence of a sub-type is an occurrence of a super-type.

IDEF1X may be a good modeling tool to use as the basis for database design, but it does not follow the rules of good graphic design (as described at the beginning of this appendix), making it unnecessarily difficult to learn and difficult to use as a tool for analyzing business requirements jointly with users.

Object-Role Modeling (ORM)

NIAM was once an acronym for “Nijssen's Information Analysis Methodology”, but more recently, since G. M. Nijssen was only one of many people involved in the development of the method, it was generalized to “Natural language Information Analysis Method”. Indeed, practitioners now also use a still more general name, “Object-Role Modeling”, or ORM [Halpin, 2001].[2]

ORM takes a different approach from the other methods described here. Rather than representing entity types as analogues of relational tables, it shows relationships (that contain one or more “roles” in ORM parlance) to be such analogues. Like Mr. Barker's notation, it makes extensive use of language in making the models accessible to the public, and, unlike any of the other modeling techniques, it has a much greater capacity to describe business rules and constraints.

With ORM, it is difficult to describe entity types independently from relationships. The philosophy behind the language is that it describes “facts,” where a fact is a combination of entity types, domains, and relationships.

A sample ORM model is shown in Figure B.6.


Figure B.6. An ORM Model.

Entity Types and Attributes

As shown in Figure B.6, an entity type is portrayed by an ellipse containing its name. An ellipse may also represent a value type, which is similar to a domain. A value type's playing a role in a relationship with an entity type is equivalent to an “attribute” in an entity/relationship diagram.

Entity-type labels play roles as identifiers, and these may be shown as dashed ellipses, although as a shorthand, identifying value types may also be shown within the entity-type ellipse in parentheses, below the entity-type name. Nonidentifying attributes always are portrayed as roles played by value types—ellipses outside the entity-type ellipse.

Thus, relationships not only connect entity types to each other but also value types to entity types as attributes. ORM is unique in being able to raise the question: What is the exact relationship of an attribute to its entity type? In particular, it can describe the optionality and cardinality of attributes.

Attributes can be combined if they have the same domain or unit-based reference mode. For example, in Figure B.6, the list price of Product, the rate for a service, and the cost of a Line Item are all taken from the domain “Monetary amount”. Similarly, this figure asserts that product names and service names are taken from the same set of names.

Relationships

ORM presents relationships between two entity types as “roles” that entity types and domains play in the organization's structure. Relationships are assembled from one or more adjacent boxes containing role names and connected to the entity types by solid lines. Relationships are not limited to being binary. Ternary and higher-order relationships are permitted.

Where most methods portray entity types in terms that allow them to be translated into relational tables, ORM portrays the relationships to be converted to tables. That is, the two parts (or more) of the relationship become columns in a “relation” (table). In effect, these are the foreign keys to the two entity types. Attributes of one or more of the related entity types also then become part of a generated table.

A relationship may be “objectified” when it takes on characteristics of an entity type. This is most common in the case of many-to-many relationships. Note in Figure B.7 that the many-to-many relationship between purchase order and product has been circled. Instead of creating a formal entity type, as is done in many other systems of notation (and as was done above), the relationship simply becomes a “nested fact type” or “objectified relationship”. This nested fact type may then be treated as an entity type having other entity types or attributes related to it. In Figure B.7, for example, the nested fact type Line Item is bought in a quantity.


Figure B.7. Objectified Relationships.

This is an alternative to simply defining line item as an entity type, as was done in Figure B.6. That was done in that figure because of the exclusive relationship between it and product and service. (See the discussion of constraints between relationships, below.)

Cardinality/Optionality

Cardinality is addressed differently in ORM than in the other methods. Here it is tied up with the uniqueness of occurrences of a fact (relationship). By definition, each occurrence of a fact applies to a single occurrence of each entity type participating in the relationship. That is, if each Party may be the source of one or more Purchase Orders, then a Party's participation in the “source of” role is not unique. On the other hand, if each Purchase Order must be to one and only one Party, then Purchase Order's participation in the “to” role is unique. That is, there is only one occurrence of a Purchase Order's being to a Party.

An entity type's uniqueness with respect to a relationship is represented in ORM by a double-headed arrow. If the relationship is one-to-many, the arrow is on the side of the relationship closest to the “many” side (that is, closest to the entity type that is related to only one other thing). So in our Purchase Order / Party example, the arrow is under the Purchase Order “to” role, since it is unique.

As another example, in Figure B.6, the line item itself can appear only once in a “part of” role (that is, it can appear only once in a PURCHASE ORDER) because of the double-headed arrow under “part of”. Each PURCHASE ORDER is to one and only one party, since the arrow is over “to”. The PURCHASE ORDER, on the other hand, can be composed of more than one LINE ITEM, because LINE ITEM can appear in the set of relationship occurrences more than once. This is shown by the absence of the double-headed arrow on the PURCHASE ORDER's side of the relationship.

If the relationship is one-to-one, the arrow appears over each half. For example, there is a double-headed arrow over both sides of the relationship between the entity type PARTY and the domain “Name”. This means that each party can have at most one “Name”, and each “Name” can be used for at most one PARTY. This is a one-to-one relationship.

If the relationship is many-to-many, the arrow crosses both halves of the relationship, showing that both halves are required to identify uniquely each occurrence of the relationship. In the objectified relationship model (Figure B.7) note the arrowheads over both “bought via” and “to buy”.
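One way to see what a uniqueness arrow asserts is to test it against a sample population of a fact type. The sketch below is an informal illustration, not part of ORM itself; the function name is_unique and the sample data are hypothetical. A fact is a tuple with one value per role, and an arrow spanning some roles claims that no combination of values in those roles repeats.

```python
from collections import Counter


def is_unique(population, roles):
    """True if no combination of values in the given role positions repeats."""
    keys = [tuple(fact[r] for r in roles) for fact in population]
    return all(count == 1 for count in Counter(keys).values())


# Sample population of "Purchase Order is to Party":
# role 0 is the purchase order, role 1 is the party.
po_to_party = [("PO1", "Acme"), ("PO2", "Acme"), ("PO3", "Bolt")]
```

Here the arrow over role 0 alone holds, since each purchase order appears once, while an arrow over role 1 alone would not, since Acme appears twice. That pattern is the one-to-many case. An arrow spanning both roles holds for any population without duplicate facts, which is why it marks the many-to-many case.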

Optionality

A relationship may be designated as mandatory by placing a solid circle next to the entity type which is the subject of the fact. For example, in Figure B.6, each PURCHASE ORDER must participate in the “to” relationship with party.

Names

Entity type and attribute names are the real-world names of the things they represent. Relationship names are verb phrases, and it is permitted to use “is” or “has”. There is nothing to prevent use of the Barker convention, however, and that was done in this example. (This becomes problematic, however, in relationships that are not binary.) In some usages, past tense is used to designate temporal relationships that occurred at a point in time, while present tense is used to designate permanent relationships. Some standard abbreviations are used, such as “nr” (number, as a data type), and “US$” (money, as a data type). Spaces may be removed from multiword entity-type names, but all words in a name have an initial capital letter.

Unique Identifiers

As described above, labels may be shown as dashed ellipses, although as a shorthand, they also may be shown within the entity-type ellipse in parentheses, below the entity-type name. If nothing else is shown, these are the unique identifiers of the entity type. Where both a label and some other identifier are involved (such as a system-generated unique identifier), the unique identifier is shown under the name, and the label is shown as another attribute (albeit with the dashed ellipse). For example, in Figure B.6, party is shown as identified by “ID,” but it also is named with the label “party name”.

If two or more attributes or relationships are required to establish uniqueness for an entity type, a special symbol is used. In Figure B.6, the combination of “number” being line number for a Line Item and Purchase Order being composed of a Line Item are required to identify uniquely an occurrence of the Line Item entity type. This is shown by the uniqueness constraint, represented with a circled “u” between the line number of and the composed of roles. This implies that a given number (such as “2”), while it is a line number for only one line item, could apply to more than one purchase order, and a given purchase order could be related to more than one line number. The combination of “Line Number of” and purchase order must be unique, however.

Sub-types

A sub-type is represented as a separate entity type, with a thick arrow pointing from it to its super-type. In Figure B.6, organization and person are each sub-types of party, as shown by the arrows. In addition, a “type” attribute is defined as the flag which distinguishes between occurrences of the sub-types (“party type” in Figure B.6). If the sub-types are exhaustive (covering all occurrences of the super-type), a constraint is shown next to the “type” attribute. If they are exclusive (non-overlapping), a double-headed arrow is shown over half of the relationship between the entity type and its “Type”.

In Figure B.6, the sub-types of party are exclusive because of the double-headed arrow over “is an example of” party type, meaning that a party is an example of one and only one party type. They are exhaustive because only the options “P” (person) and “O” (organization) are available for Party Type.

Constraints between Relationships

In the ORM system of notation, constraints between relationships are shown as circles linked to the relationships involved. An “exclusion constraint” (shown in the figure between the Product and Service relationships from line item) says that one or the other relationship may apply, but not both. The “X” in the symbol means that a Line Item may not be for both a Product and a Service. The dot over the middle of the “X” means that a Line Item must be for a Product or a Service, but not both. If there were a dot by itself, it would mean that one or the other must apply, but both could apply as well. With no constraint, one or the other, both, or neither could apply.
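The four cases just described differ only in how many of the two relationships a given Line Item may participate in. The following sketch spells that out. The function name satisfies and the symbol labels "x", "dot", "x+dot", and "none" are hypothetical shorthand for the symbols described above, not ORM terminology.

```python
def satisfies(symbol, has_product, has_service):
    """Check one Line Item against a constraint between its two relationships."""
    count = int(has_product) + int(has_service)
    if symbol == "x":        # "X" alone: not both
        return count <= 1
    if symbol == "dot":      # dot alone: at least one, possibly both
        return count >= 1
    if symbol == "x+dot":    # "X" with a dot: exactly one
        return count == 1
    if symbol == "none":     # no constraint: anything goes
        return True
    raise ValueError(f"unknown constraint symbol: {symbol!r}")
```

Under this reading, a Line Item that is for neither a Product nor a Service satisfies the “X” alone but violates the “X” with a dot.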

(See the discussion of ORM constraints in Chapter 8 on pages 311–332.)

Comments

In many ways, ORM is the most versatile and most descriptive of the modeling techniques presented here. It has an extensive capability for describing constraints that apply to sets of entity types and attributes. It is not oriented just toward entity types and relationships, but toward objects and the roles they play—where an “object” may be an entity type or a value type (domain). It is constructed to make it easy to describe diagrams in English, although it lacks a discipline for constructing the English sentences.

Cardinality is shown via uniqueness constraints, and optionality is shown by making a relationship mandatory or not. Interestingly enough, this approach means the optionality and cardinality of attributes can be treated in exactly the same way. Must there be a value for an attribute? Can there be more than one?

Unlike all the flavors of entity/relationship modeling described here, ORM makes domains explicit.

All this expressiveness, however, is achieved at some aesthetic cost. An ORM model is necessarily much more detailed than an equivalent data model, and as a consequence, it is often difficult to grasp the shape or purpose of a particular drawing. Also, because all entity types, attributes, and relationships carry equal visual weight, it is hard to see which elements are the most important.

While it does permit showing multiple inheritance and multiple type hierarchies, the multibox approach to sub-types takes up a lot of room on the drawing, limiting the number of other entity types that can be placed on it. Moreover, it does not clearly convey the fact that an occurrence of a sub-type is an occurrence of a super-type. ORM also requires a great deal of space to give a separate symbol to each value type and relationship to it.

All of this could be mitigated by a CASE tool that permitted a model to be drawn in one form and then converted automatically to the other. The entity/relationship version could be used to convey the overall shape of the model and the important relationships, while the ORM version could portray the relationships in more detail.

The Unified Modeling Language (UML)

The Unified Modeling Language (UML) is not billed as a “data-modeling” but as an “object-modeling” technique. Instead of entity types, it models “object classes”. Close examination of its models, however, shows these to look suspiciously like entity/relationship models. Indeed, Ivar Jacobson even calls these classes in a business-oriented model entity type objects [Jacobson, 1992, p. 132].

Because of a confluence of ideas, techniques, personalities, and politics, UML promises to become a standard notation for representing the structure of data in the object-oriented community. It was developed when the “three amigos” of the object-oriented world, James Rumbaugh, Grady Booch, and Ivar Jacobson, among others, agreed to adopt as standard a variation on a notation originally developed by David Embley and his colleagues [Embley et al., 1992]. The UML was published by the Object Management Group in 1997 [OMG, 1998]. Messrs. Rumbaugh, Jacobson, and Booch have written significant texts on UML: a reference manual [Rumbaugh, Jacobson, & Booch, 1999], a user guide [Booch, Rumbaugh, & Jacobson, 1999], and a guide to their methodology [Jacobson, Booch, & Rumbaugh, 1999], although many other books on the subject are also available.

As a system of notation for representing the structure of data, when used for analysis, the UML static diagram is functionally the exact equivalent of any other data-modeling, entity-type/relationship modeling, or object-modeling technique. Its classes of entity-type objects are really entity types, and its associations are relationships. It has specialized symbols for some things that are already represented by the main symbols in other notations, and it lacks some symbols used in e/r diagrams. It does, however, have a more extensive ability to describe interrelationship constraints.

Yes, the UML does add the ability to describe the behavior of each object class/entity type, but the data-structure part of the technique is fundamentally no different from any other data-modeling technique in what it can represent. It also adds notation details most useful when it is applied to object-oriented design.

In addition, the UML includes other kinds of diagrams besides static object diagrams. These include use cases, activity diagrams, and others. They do not concern us here, however.

Figure B.8 shows the UML version of our example.


Figure B.8. A UML Model.

Entity Types (Object Classes) and Attributes

As stated above, in object models, entity types are called classes. A class in the UML static model is a square-cornered rectangle with three divisions. The top part contains the class name. The middle section contains a list of attributes. The bottom, if included, contains descriptions of behavior. Since the UML is used mostly for design, these behavior descriptions are usually in the form of pseudo-code, C++, or simply references to programs.

An attribute can be referred to by one or more of the following elements:

  • Stereotype: This extends the attribute concept, as defined by the person preparing the diagram. (See below.)

  • Visibility: In terms of the object-oriented code which may implement the class, is this attribute visible to all (+), to only those classes which are sub-types of this class (#), or to this class only (-)? This is only meaningful if a model is used for design. It is not meaningful in analysis models.

  • Name: This is the only required element.

  • Multiplicity: Object orientation is not constrained by the relational notion that an object may have only one value for an attribute. This parameter lets you define that it may have more than one, up to five, or whatever. If the lower limit is zero, then occurrences of the related entity type are optional.

  • Type: This is the data type of the attribute (number, character, etc.). The values for this depend on the model's environment.

  • Initial value: A default value can be specified here.

  • {other}: Additional named properties may be added, such as “tag=<value>”.

There are no spaces between the words in names. The class is called PurchaseOrder instead of Purchase Order.

The UML introduces the concept of stereotype, which is an additional annotation that can be used to enhance the standard UML notation. If you don't like something about UML, you can change it! A stereotype is identified by being surrounded by guillemets (« »), and can be used to extend entity type, attribute, and association definitions. In Figure B.8, the stereotype «ident» extends the model to denote unique identifiers. (See “Unique Identifiers” on page 375, below.)

Relationships (Associations)

A relationship is called an “association” in the object-oriented world. Rather than using graphic symbols, all the information on a UML association is conveyed by characters.

Cardinality/Optionality

Both cardinality and optionality are conveyed by characters in the form:

<lower limit>..<upper limit>

where the <lower limit> denotes the optionality (nearly always 0 or 1, although conceivably it could be something else), and the <upper limit> denotes the cardinality. The <upper limit> may be an asterisk (*) for the generic “more than one”, or it may be an explicit number, a set of numbers, or a range.

For example, “0..*” means “may be one or more” (zero, one, or more), and “1..1” means “must be exactly one”.

Since they are most common, “0..*” may be abbreviated “*”, and “1..1” may be abbreviated “1”.

In Figure B.8, for example, the fact that each Party may be a vendor in one or more purchase orders is shown by the string “0..*” next to Purchase Order. The “0” makes it optional (“may be”), and the * means that it can be any number. Similarly, the fact that each Purchase Order must be to one and only one Party is shown by the string “1..1” next to Party. The first 1 means that the relationship is mandatory (“must be”), and the second means that the purchase order may be to no more than one Party.
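The reading rules above are mechanical enough to express as a short Python sketch. The function names parse_multiplicity and read_multiplicity are hypothetical, and the sketch handles only the simple forms discussed here (a single number, “*”, or one lower..upper range), not sets of numbers.

```python
def parse_multiplicity(text):
    """Split a multiplicity string into (lower, upper); upper None means '*'."""
    text = text.strip()
    if text == "*":                 # abbreviation for 0..*
        return (0, None)
    if ".." not in text:            # a single number n abbreviates n..n
        n = int(text)
        return (n, n)
    lower, upper = text.split("..", 1)
    return (int(lower), None if upper == "*" else int(upper))


def read_multiplicity(text):
    """Turn a multiplicity string into a Barker-style phrase."""
    lower, upper = parse_multiplicity(text)
    optionality = "may be" if lower == 0 else "must be"
    if upper is None:
        cardinality = "one or more"
    elif upper == 1:
        cardinality = "one and only one"
    else:
        cardinality = f"up to {upper}"
    return f"{optionality} {cardinality}"
```

Applied to the two strings in Figure B.8, read_multiplicity gives “may be one or more” for “0..*” and “must be one and only one” for “1..1”, matching the readings in the text.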

Names

There are two primary ways to name associations. A simple verb phrase may name the association in its entirety. A triangle next to the name tells which way to read it. Alternatively, “roles” can be defined at each end to describe the part played by the class in the association. The concept of role is very close to the relationship names used in the Barker notation, so that convention could be applied here, as was done in Figure B.8.

“Part of/composed of”

Extra symbols represent the particular association where each object in one class is composed of one or more objects in the other class. (Each object in the second class must be part of one and only one object in the first class.) The association acquires a diamond symbol next to the parent (“composed of”) class. If the association is mandatory and the referential integrity rule is “cascade delete”—that is, deletion of the parent deletes all the children—this is called “composition” and the diamond is solid. This is shown for the PurchaseOrder/LineItem association in Figure B.8. If the association is optional to the parent (and therefore has the referential integrity rule “nullify delete”)—that is, a parent can be deleted without affecting the children—then the diamond is open and is called “aggregation”. The notation does not address the “restricted” rule, in which deletion of a parent is not permitted if children exist. Nor does it address referential integrity rules for any other kind of association.
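The two delete rules named above can be demonstrated with a relational sketch in Python's sqlite3 module. This is an illustration under stated assumptions rather than anything the UML prescribes: the attachment table is a hypothetical example of an aggregated part, and the column names are invented. The line_item table stands for composition (cascade delete), and attachment for aggregation (nullify delete).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;   -- SQLite enforces foreign keys only when asked

CREATE TABLE purchase_order (po_number INTEGER PRIMARY KEY);

-- Composition (solid diamond): mandatory parent, children die with it.
CREATE TABLE line_item (
    line_item_id INTEGER PRIMARY KEY,
    po_number    INTEGER NOT NULL
        REFERENCES purchase_order(po_number) ON DELETE CASCADE
);

-- Aggregation (open diamond): optional parent, children survive it.
CREATE TABLE attachment (
    attachment_id INTEGER PRIMARY KEY,
    po_number     INTEGER
        REFERENCES purchase_order(po_number) ON DELETE SET NULL
);

INSERT INTO purchase_order VALUES (1);
INSERT INTO line_item VALUES (10, 1), (11, 1);
INSERT INTO attachment VALUES (20, 1);
""")

# Deleting the parent exercises both rules at once.
conn.execute("DELETE FROM purchase_order WHERE po_number = 1")

remaining_line_items = conn.execute(
    "SELECT COUNT(*) FROM line_item").fetchone()[0]
attachment_parent = conn.execute(
    "SELECT po_number FROM attachment WHERE attachment_id = 20").fetchone()[0]
```

After the delete, no line items remain, while the attachment row survives with a null parent reference. The “restricted” rule, which the notation does not address, would correspond to ON DELETE RESTRICT in this sketch.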

Unique Identifiers

Unique identifiers are rarely referred to in the object-oriented world. When the behavior of objects in a class requires locating a particular occurrence of another class, however, the attribute used for locating that occurrence is shown in a box next to the entity type needing it. For example, in Figure B.8, “PO number” is required from the point of view of Party to locate a particular Purchase Order. This reflects the programming that will be required to navigate from Party to Purchase Order when the classes are implemented, but it is not meaningful in an analysis model.
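In implementation terms, a qualified association usually amounts to a keyed lookup held by the parent. The following sketch assumes two hypothetical classes, `Party` and `PurchaseOrder`, with invented method names; it is one plausible rendering, not the way any particular tool generates code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PurchaseOrder:
    po_number: str
    order_date: str


@dataclass
class Party:
    name: str
    # The qualifier: each Party reaches its own purchase orders by PO number.
    orders_by_po_number: Dict[str, PurchaseOrder] = field(default_factory=dict)

    def add_order(self, order: PurchaseOrder) -> None:
        self.orders_by_po_number[order.po_number] = order

    def find_order(self, po_number: str) -> Optional[PurchaseOrder]:
        return self.orders_by_po_number.get(po_number)


acme = Party("Acme Sporting Goods")
acme.add_order(PurchaseOrder("743453", "12 November, 1999"))
other = Party("Some other party")

print(acme.find_order("743453") is not None)   # True
print(other.find_order("743453"))              # None
```

The second lookup shows the limitation: the PO number identifies an order only from the point of view of one Party, not to the world at large.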

Alternatively, stereotypes can be used to designate attributes and relationships that constitute unique identifiers, in a structure very similar to that of the Barker notation. These are shown as «ident» in Figure B.8.

Sub-types

The UML shows sub-types as separate entity-type boxes, each removed from its super-type and connected to it by an “isa” relationship. (Each occurrence of a sub-type “is a[n]” occurrence of the super-type.)

Note in Figure B.8 that the sub-type structure is labeled {disjoint, complete}. This is equivalent to the rule in other notations that each occurrence of the super-type must be a member of one of the sub-types (complete), and an occurrence may not be a member of more than one sub-type (disjoint). In UML, this constraint is not required. The sub-type structure could be {overlapping, incomplete} or any other combination of the two.
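The two halves of the {disjoint, complete} label can be checked separately, as the following sketch suggests. The function name `check_subtypes` and the representation of occurrences as sets of identifiers are assumptions made for illustration.

```python
from typing import Dict, List, Set


def check_subtypes(supertype: Set[int],
                   subtypes: Dict[str, Set[int]]) -> Dict[str, List[int]]:
    """Report occurrences that break the {complete} or {disjoint} rule."""
    covered = set().union(*subtypes.values())
    counts: Dict[int, int] = {}
    for members in subtypes.values():
        for member in members:
            counts[member] = counts.get(member, 0) + 1
    return {
        # In the super-type but in no sub-type: violates {complete}.
        "uncovered": sorted(supertype - covered),
        # In more than one sub-type: violates {disjoint}.
        "overlapping": sorted(m for m, n in counts.items() if n > 1),
    }


parties = {1, 2, 3}
print(check_subtypes(parties, {"Person": {1}, "Organization": {2, 3}}))
# {'uncovered': [], 'overlapping': []}
```

A structure labeled {overlapping, incomplete} would simply skip both checks.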

Constraints between Relationships

Constraints between relationships are shown as dashed lines between pairs of associations. Such a line is called a constraint. If it is annotated {xor} or simply {or}, it is an exclusive or. In Figure B.8, a constraint says that each occurrence of LineItem must be (or may be) either for an occurrence of Product or for an occurrence of Service, but not both. If it were {ior}, however, it would be an inclusive or. (Each occurrence of the base entity type must be (or may be) related to either an occurrence of one entity type, or to an occurrence of the other, or both.) Indeed, the dashed line can represent any relationship desired between two associations.
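The difference between the exclusive and inclusive forms is easy to state as a pair of predicates. This is a minimal sketch: the function names and the convention that `None` means "no related occurrence" are assumptions of the example.

```python
def satisfies_xor(product, service, mandatory=True):
    """{xor}: related to a product or to a service, but not both."""
    linked = (product is not None) + (service is not None)
    return linked == 1 if mandatory else linked <= 1


def satisfies_ior(product, service, mandatory=True):
    """{ior}: related to a product, a service, or both."""
    linked = (product is not None) + (service is not None)
    return linked >= 1 if mandatory else True


print(satisfies_xor("X-23", None))     # True
print(satisfies_xor("X-23", "x-87"))   # False
print(satisfies_ior("X-23", "x-87"))   # True
```

The `mandatory` flag corresponds to the "must be (or may be)" alternative in the text: when it is false, a LineItem related to neither class is also acceptable.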

Comments

UML has a number of advantages over its predecessors:

  1. Constraints between relationships in the Barker notation are replaced by a simple line between two associations, which can be annotated to describe any relationship between them. The Barker constraint between relationships is represented in the UML by the annotation “xor”, but the UML can also represent interassociation relationships that the Barker notation cannot. This is useful for introducing many kinds of business rules.

  2. For business rules that are not simple relationships between two associations, the UML introduces a small flag that can include text describing any business rule.

  3. Attributes can be described in more detail than in other notations.

  4. The UML approach to optionality and cardinality makes it possible to express more complex upper limits, as in “each <entity type 1> may be related to zero, 3, 6–7, or 9 occurrences of <entity type 2>”.

  5. Overlapping and incomplete configurations of sub-types are allowed.

  6. Multiple inheritance and multiple type hierarchies are permitted.

These are valuable concepts. The first three could easily be added to other notations, with good effect. The fourth cannot, but it is rare that such a construct is needed, so its omission in other notations is not a serious practical problem. Such specific upper limits tend to be derived from business rules that might change, so it is not a good idea to include them in a conceptual data model. In the fifth case, the requirement that sub-types be complete and disjoint turns out to be a very useful discipline that produces much more rigorous models than if the restriction were relaxed. The final case describes a point which is controversial even in the object-oriented world. In your author's experience, nearly all examples that appear to require multiple-inheritance or multiple-type hierarchies can be solved by attacking the model from a different direction.

All of these may be valuable, however, if the model is being used to support design.

Other aspects of UML, however, are problematic if the models are to be presented to the public for requirements analysis.

First of all, in UML, cardinality and optionality are represented by numbers instead of graphic symbols. Yes, this has the advantage of permitting any kind of cardinality, such as 1, 4–6, 7, but requirements for such a statement are rare. It has the disadvantage, however, of making it an intellectual exercise to decode the symbols—instead of a visual processing one. You no longer “see” the relationship. You must “understand” it. The left side of the brain is used instead of the right. With information engineering or with Mr. Barker's notation, the entire process of decoding how many participants there are in a relationship is a visual one—and this makes the models much easier to read for those untutored in the notation.

The shorthand of using an asterisk for “may be one or more” and a one for “must be one and only one” in one sense simplifies the UML model, since these are the most common cardinalities and optionalities. On the other hand, it destroys the systematic semantic structure in which you automatically know both the upper and lower limits.

Second, the UML has added unnecessary symbols for specific kinds of relationships. The concepts of composition and aggregation are handled in entity-type/relationship diagrams by simply labeling a relationship part of and composed of. Having special symbols for two of the many possible kinds of relationships unnecessarily complicates the model.

More significantly, these additional symbols are incomplete. They represent the cascade delete and nullify delete rules for “composed of/part of” relationships, but what about the restricted delete rule? (You may not delete the parent at all if children exist.) And what about showing these rules for other relationships? Adding “C”, “R”, or “N” to an e/r diagram uniformly describes whether deletion of the parent is permitted and whether it calls for deletion of the children—regardless of the relationship. In addition, Entity-Type Life Histories more completely describe how entity-type occurrences may be created and under what circumstances they can be deleted (see Chapter 7, pages 262–282).

The justification for these symbols turns out to be that there are physical design implications for the aggregation and composition concepts. In an object-oriented implementation, it is possible for one object to be physically inside another object. Showing the diamonds on a UML design model provides information to the programmers. This is, however, both distracting and unnecessary in the conceptual model used for requirements analysis.

As stated previously, while it does permit showing multiple-inheritance and multiple-type hierarchies, the multi-box approach to sub-types takes up a lot of room on the drawing, limiting the number of other entity types that can be placed on it. Moreover, it does not clearly convey the fact that an occurrence of a sub-type is an occurrence of a super-type.

There are two other shortcomings of the UML, but these can be addressed, either through the use of stereotypes or by imposing discipline on the way the UML is used.

In the first case, the UML could be significantly improved by increased discipline in the use of relationship names. Most commonly a relationship name in the UML is a single verb that describes it in one direction. Were this the only option, it would be unacceptable. It is, however, possible to add “roles” to each end of the relationship. This provides the ability to portray how an entity type is viewed from the perspective of another entity type. Given this structure, it would be valuable if these role names were constrained to follow the Barker naming convention.

Second, the UML deals only partially with unique identifiers. The philosophy behind object orientation is that it isn't necessary explicitly to show unique identifiers. But then it turns out that, from the point of view of a parent entity type, it is often necessary to identify occurrences of a child entity type. So “qualified associations” allow this to be expressed. But you are allowed to identify an occurrence only to a parent entity type. You are not allowed to identify it to the world at large.

This means that, instead of a simple symbol attached to a relationship or attribute to indicate a unique identifier universally, you have to add a whole new box whose meaning is constrained and confusing at best.[3]

Note that this can be addressed using stereotypes as described above. In Figure B.8, “«ident»” was added to several attributes and a relationship to show their participation in unique identifiers.

This doesn't mean that the UML shouldn't be used for the physical design model. To the contrary, the additional expressiveness described here makes it eminently suitable for that purpose. (And designers are not the least bit bothered by the aesthetic objections raised above.) But the UML is fundamentally that—a design tool.

Extensible Markup Language (XML)

The last technique presented here isn't really a data-modeling language at all. Rather, it is a way of representing data structure in text, using specially defined “tags” or labels to describe the structure of that text. The data being described could be either from an entity-type/relationship model or from a database design.

The Extensible Markup Language (XML) is similar to the Hypertext Markup Language (HTML) that is used to describe pages for the World Wide Web. Both derive from the “Standard Generalized Markup Language”, or SGML: XML is a simplified subset of it, and HTML was originally defined as an application of it. SGML is a sophisticated tag language, which, “due to [its] complexity, and the complexity of the tools required,” as the Object Management Group has so delicately put it, “has not achieved widespread uptake” [OMG, 1997].

In each case, a set of “tags” are inserted into a body of text. In the case of HTML, the tags are predefined to be interpreted by a standard piece of software called a browser. The browser uses the tags to determine how various parts of the document should be displayed.

XML, on the other hand, allows tags to be defined by users and is not concerned with display at all. Rather, the tags can be defined to describe a data structure, and data can be transmitted over the Internet in that structure.

Because tags are defined by users, no existing software will automatically understand the tags. Software can read the definitions of tags and ensure that data transmitted using them follows them, but it cannot provide more interpretation to the structure unless it is specifically written to do so.

This means that XML is most useful within a community that defines the semantics of a set of tags in common for its purpose. For example, the chemical industry has set up an XML-based Chemical Markup Language, and astronomers, mathematicians and the like have similarly defined sets of tags for describing things in their respective fields.

What Is It?

Figure B.9 shows an example of XML used to describe a data record that might be presented in a document.

Example B.9. An XML Document.

<?xml version="1.0"?>
<!-- **** Purchasing **** -->
<PURCHASE_ORDER>
   <ISSUED_TO_PARTY>
      <party_id>234553</party_id>
      <name>Acme Sporting Goods</name>
      <party_type>Organization</party_type>
      <surname></surname>
      <corporate_mission>Get America moving</corporate_mission>
   </ISSUED_TO_PARTY>
   <po_number>743453</po_number>
   <order_date>12 November, 1999</order_date>
   <LINE_ITEM>
      <line_number>1</line_number>
      <quantity>12</quantity>
      <price>64.75</price>
      <product_service_indicator>
         product
      </product_service_indicator>
      <PRODUCT>
         <product_code>X-23</product_code>
         <description>Nike sneakers</description>
         <unit_price>75.00</unit_price>
      </PRODUCT>
   </LINE_ITEM>
   <LINE_ITEM>
      <line_number>2</line_number>
      <quantity>12</quantity>
      <price>64.75</price>
      <product_service_indicator>
         service
      </product_service_indicator>
      <SERVICE>
         <service_id>x-87</service_id>
         <description>Walking the dog</description>
         <rate_per_hour>12.00</rate_per_hour>
      </SERVICE>
   </LINE_ITEM>
   <LINE_ITEM/>
</PURCHASE_ORDER>

Note a few interesting things about this example.

First, as with HTML, each tag is surrounded by less-than and greater-than brackets (<>) and is usually followed by text. The text is in turn followed by an end tag, in the form </...>. A tag may have no content, in which case either the end tag follows immediately upon the tag (as in <surname></surname>), or the tag itself ends with a forward slash (as in <LINE_ITEM/>). Unlike with HTML, however, the end tag is always required in one of those two forms.

A second thing to note is that, in this case, the tag <PURCHASE_ORDER> is followed by a set of related tags describing its characteristics (columns and relationships from data models, in this case). In this particular case, the tag <PURCHASE_ORDER> has been defined such that it must be followed by exactly one tag for <ISSUED_TO_PARTY>, one for <po_number>, and so forth. You can't see this from the example, but the tag <corporate_mission> is optional. The tag <LINE_ITEM> is also optional, and it may occur more than once.

Although it is optional, all XML documents should begin with <?xml version="1.0"?> (or whatever version number is appropriate). Note that “xml” in this declaration must be in lower case.

Note that the structure is hierarchical, so that an element can be under only one other element, and there can be only one hierarchy in a document. In the example, therefore, party was only defined as <ISSUED_TO_PARTY> under <PURCHASE_ORDER>. If it were related to something else in the model, the description would have to be repeated.
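The hierarchy becomes visible as soon as a document is read by a parser. The sketch below uses Python's standard `xml.etree.ElementTree` module on an abbreviated purchase order in the shape of the example; the helper name `outline` and the shortened document are assumptions of this illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<PURCHASE_ORDER>
  <ISSUED_TO_PARTY>
    <party_id>234553</party_id>
    <name>Acme Sporting Goods</name>
    <surname></surname>
  </ISSUED_TO_PARTY>
  <po_number>743453</po_number>
  <LINE_ITEM>
    <line_number>1</line_number>
    <PRODUCT><product_code>X-23</product_code></PRODUCT>
  </LINE_ITEM>
  <LINE_ITEM>
    <line_number>2</line_number>
    <SERVICE><service_id>x-87</service_id></SERVICE>
  </LINE_ITEM>
</PURCHASE_ORDER>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and everything beneath it."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(DOCUMENT)
for depth, tag in outline(root):
    print("  " * depth + tag)
```

Every element except the root appears under exactly one parent, which is why the description of a party would have to be repeated if it were also related to something other than the purchase order.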

Comments are in the form <!-- . . . -->. Note that the double hyphens must be part of the comment. Note also that, unlike HTML, XML lets you use a comment to surround lines of code that you want to disable.

The meaning of a tag is defined in a document type definition (DTD). This is a body of code that defines tags through a set of elements. It is the DTD that allows you to specify a data structure. While an XML document contains data, the DTD contains the model of those data.

It is the DTD that is the analogy to the modeling techniques we have seen in this appendix.

Entity Types and Attributes

The DTD for the above example is shown in Figure B.10.

Example B.10. An XML Document Type Definition.

<!DOCTYPE PURCHASE_ORDER [
   <!ELEMENT PURCHASE_ORDER (ISSUED_TO_PARTY, po_number,
                             order_date, LINE_ITEM*)>
      <!ELEMENT ISSUED_TO_PARTY (party_id, name, party_type,
                                 surname?, corporate_mission?)>
         <!ELEMENT party_id (#PCDATA)>
         <!ELEMENT name (#PCDATA)>
         <!ELEMENT party_type (#PCDATA)>
         <!ELEMENT surname (#PCDATA)>
         <!ELEMENT corporate_mission (#PCDATA)>
      <!ELEMENT po_number (#PCDATA)>
      <!ELEMENT order_date (#PCDATA)>
      <!ELEMENT LINE_ITEM (line_number, quantity, price,
                           product_service_indicator, PRODUCT?,
                           SERVICE?)>
         <!ELEMENT line_number (#PCDATA)>
         <!ELEMENT quantity (#PCDATA)>
         <!ELEMENT price (#PCDATA)>
         <!ELEMENT product_service_indicator (#PCDATA)>
         <!ELEMENT PRODUCT (product_code, description, unit_price)>
            <!ELEMENT product_code (#PCDATA)>
            <!ELEMENT description (#PCDATA)>
            <!ELEMENT unit_price (#PCDATA)>
         <!ELEMENT SERVICE (service_id, description, rate_per_hour)>
            <!ELEMENT service_id (#PCDATA)>
            <!-- description is already declared above; an element
                 type may be declared only once -->
            <!ELEMENT rate_per_hour (#PCDATA)>
]>

The DTD for an XML document can be either part of the document or in an external file. If it is external, the DOCTYPE statement still occurs in the document, with the argument “SYSTEM -filename-”, where “-filename-” is the name of the file containing the DTD. For example, if the above DTD were in an external file called “xxx.dtd”, the DOCTYPE statement would read:

<!DOCTYPE PURCHASE_ORDER SYSTEM "xxx.dtd">

The file xxx.dtd would then contain only the ELEMENT declarations, without the enclosing DOCTYPE statement.

Note that the name specified in the DOCTYPE statement must be the same as the name of the highest-level ELEMENT.

Each element in the specification refers to a piece of information. An XML element is defined in terms of one or more predicates, where a predicate is simply a piece of information about an element. This may be either an attribute or an entity type in your data model. In the example above, <PURCHASE_ORDER> has as predicates <ISSUED_TO_PARTY>, <po_number>, <order_date>, and <LINE_ITEM>. <ISSUED_TO_PARTY> and <LINE_ITEM> are relationships to the parent entity type in the data model that this was based on. <po_number> and <order_date> are attributes from that model.

Cardinality/Optionality

Relationships are represented by the attachment of predicates to elements. In the absence of any special characters, this means that there must be exactly one occurrence of each predicate for each occurrence of the parent element. If the predicate is followed by a “?”, then the predicate is not required. If it is followed by a “*”, it is not required, but if it occurs, it may have more than one occurrence. If it is followed by a “+”, at least one occurrence is required, and it may have more than one.

In the example in Figure B.10, each <PURCHASE_ORDER> must have an <ISSUED_TO_PARTY>, a <po_number> and an <order_date>. In addition, a <PURCHASE_ORDER> may or may not have any <LINE_ITEM>s, but it could have more than one.
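The occurrence characters map directly onto lower and upper limits, in the same way that multiplicity strings do in the UML. The sketch below makes that mapping explicit. It is deliberately limited: `parse_content_model` is a hypothetical helper that understands only a flat, comma-separated list of predicates, not the nested groups or alternatives that a full DTD allows.

```python
from typing import Dict, Optional, Tuple

# Occurrence character -> (minimum, maximum); None means "no upper limit".
OCCURRENCE: Dict[str, Tuple[int, Optional[int]]] = {
    "": (1, 1),      # exactly one
    "?": (0, 1),     # optional, at most one
    "*": (0, None),  # optional, any number
    "+": (1, None),  # at least one
}


def parse_content_model(model: str) -> Dict[str, Tuple[int, Optional[int]]]:
    """Split a flat model such as '(a, b?, c*)' into {name: (min, max)}."""
    result = {}
    for raw in model.strip().strip("()").split(","):
        raw = raw.strip()
        suffix = raw[-1] if raw[-1] in "?*+" else ""
        name = raw[:-1] if suffix else raw
        result[name] = OCCURRENCE[suffix]
    return result


limits = parse_content_model(
    "(ISSUED_TO_PARTY, po_number, order_date, LINE_ITEM*)")
print(limits["po_number"])   # (1, 1)
print(limits["LINE_ITEM"])   # (0, None)
```

Read this way, `LINE_ITEM*` says the same thing as “0..*” in the UML, and an unadorned predicate says the same thing as “1..1”.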

Each predicate is then itself an element defined in turn by the predicates that follow. At the bottom of the tree in each case, “#PCDATA” (parsed character data) means that the element contains text, which the XML parser reads as character data rather than as further elements.

Names

Names in XML may not have spaces. XML is case sensitive. XML keywords are in all upper case. The case of a tag name in an element definition must be the same as was used if the element appeared as a predicate, and the case of an element used in an XML document must be the same as in its DTD definition.
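These naming rules are enforced by any XML parser, as a brief experiment shows. The sketch relies on Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` is an assumption of the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<unit_price>75.00</unit_price>"))   # True
print(is_well_formed("<unit price>75.00</unit_price>"))   # False: space in name
print(is_well_formed("<Po_number>1</po_number>"))         # False: case differs
```

A parser that merely checks well-formedness catches these errors without consulting a DTD, because the start tag and end tag of an element must match exactly.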

Note that there is nothing in XML to prevent you from specifying multivalued attributes, but in the interest of coherence for the data structure, following the rules of normalization is strongly recommended. By convention in the above example, elements that would be entity types in an entity/relationship model appear in upper case. Elements that would appear in that model as attributes are in lower case. Your naming conventions may be different.

Unique Identifiers

XML has no way to recognize unique identifiers.

Sub-types

XML has no way to recognize sub-types and super-types. Note, in the example above, that the attributes of <ISSUED_TO_PARTY> had to include both attributes of person and attributes of organization from our other models. The attribute <product_service_indicator> was included in <LINE_ITEM> to determine which case was involved. Similarly, <party_type> determined which kind of <ISSUED_TO_PARTY> a record referred to. Software would be required to enforce this.
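The kind of software check this implies might look like the following sketch, written against Python's standard `xml.etree.ElementTree` module. The function name `line_item_problems` and the wording of its messages are invented for illustration; the rule it applies (the indicator must agree with the one child element present) is an assumption about what such enforcement would involve.

```python
import xml.etree.ElementTree as ET


def line_item_problems(line_item):
    """Check that the indicator agrees with the PRODUCT or SERVICE child."""
    indicator = (line_item.findtext("product_service_indicator") or "").strip()
    has_product = line_item.find("PRODUCT") is not None
    has_service = line_item.find("SERVICE") is not None
    problems = []
    if has_product == has_service:
        # Both present or both absent: the DTD alone cannot rule this out.
        problems.append("exactly one of PRODUCT or SERVICE is expected")
    elif indicator != ("product" if has_product else "service"):
        problems.append("indicator does not match the child element")
    return problems


item = ET.fromstring(
    "<LINE_ITEM>"
    "<product_service_indicator> service </product_service_indicator>"
    "<PRODUCT><product_code>X-23</product_code></PRODUCT>"
    "</LINE_ITEM>")
print(line_item_problems(item))
# ['indicator does not match the child element']
```

A data-modeling notation with sub-types or an exclusive-or constraint states this rule in the model itself; with XML it has to live in code of this kind.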

Constraints between Relationships

XML has no way to describe constraints between relationships.

Comments

As noted above, XML isn't really a data-modeling language. It is not very sophisticated in its ability to represent the finer points of data structure. It shares the limitations of a relational database, for example, with no ability to recognize sub-types or constraints. It is being recognized, however, as a very powerful way to describe the essence of data structures for use as a template for transmitting data from one place to another.

While the tag structure does seem to be a good vehicle for describing and communicating database structure, the requirement for discipline in the way we organize data is more present than ever. XML doesn't care if we have repeating groups, monstrous data structures, or whatever. If we are to use XML to express a data structure, it is incumbent upon us to do as good a job with the tool as we can. (This is, of course, true of any modeling technique.)

In recognizing that XML is a good vehicle for describing database structure, the most obvious issue is that this will put greater responsibility on data administrators to define data correctly. XML will not do that. XML will only record whatever data design (good or bad) human beings come up with.

As Clive Finkelstein has said, the advent of XML is going to make data modelers and designers even more important than they are now. “After fifteen years of obscurity, data modelers can finally become overnight successes” [Finkelstein, 1999].

Recommendations

Because the orientation and purposes of data modeling are very different when supporting analysis than when supporting design, no one modeling technique currently available is appropriate for both. Those with the best aesthetics don't describe as many aspects of the issue as others, which are much less accessible.

The one exception to this is object role modeling, which is both rich in detail and relatively easy to read. It differs radically from the other modeling approaches, however, and has therefore been less successful in gaining acceptance.

Among those using the more common entity-type/relationship view of the world, Richard Barker's notation is clearly superior as a vehicle for discussing models with prospective system users, and the UML has advantages in supporting design—particularly object-oriented design.

For Analysis—Richard Barker's Notation

There are several arguments in favor of Mr. Barker's data-modeling syntax for use in requirements analysis:

Aesthetic simplicity

This notation is the easiest to present to a user audience. It is the simplest and clearest among those that are as complete. By using fewer kinds of symbols, Barker's technique keeps drawings relatively uncluttered, and fewer kinds of elements have to be understood. Simpler, less cluttered diagrams are more accessible to nontechnical managers and other end users.

It uses a line in two parts, each of which may be dashed or solid, to convey the entire set of optional or mandatory aspects of the relationship pair. The presence or absence of a crow's foot is all that is necessary to represent the upper limit of a relationship. The single symbol of a split line which is either solid or dashed, plus the presence or absence of a crow's foot, is aesthetically simpler than, say, information engineering, which requires combinations of four separate symbols to convey the same information.

In Barker's notation, the “dashedness” or solidness of a line (its most visible aesthetic quality) represents the optionality of the relationship, which is its most important characteristic to most users. IDEF1X, on the other hand, uses “dashedness” to represent the extent to which a relationship is in a unique identifier.

Other systems of notation add symbols unnecessarily: Chen's notation uses different symbols for objects that are implementations of relationships and objects that are tangible entity types; Chen also uses separate symbols for each attribute; IDEF1X also distinguishes between “dependent” entity types and “independent” ones. IDEF1X also uses different symbols at the different ends of relationships. The UML designates certain kinds of relationships (“part of” and “member of”) by either of two special symbols, depending on the referential integrity constraint in effect.

In each case, the additional symbols merely add to the complexity of a diagram and make it more impenetrable, without communicating anything that is not already contained in the simpler notation and names of Barker's notation.

James Martin's version of information engineering is the only one other than Barker's notation that represents sub-types inside super-types, thereby reinforcing the fact that each sub-type is a subset of its super-type, and saving diagram space in the process.

Also, other techniques introduce extra complexity by allowing relationship lines to meander all over the diagram. Barker's notation calls for a specific approach to layout which keeps relationship lines short and straight.

Completeness

Most of the techniques show the same things that Barker's notation technique does, although some are more complete than others. Each of them lacks something that Barker's notation has.

Information engineering does not show attributes; IDEF1X does not show constraints; only Mr. Martin's version of information engineering shows sub-types within super-types. Mr. Chen's notation, information engineering, and UML do not show unique identifiers. Only ORM has all of the same features that the Barker method has, but with its external attributes and sub-types it uses far too much space on the diagram.

In fairness, some of the techniques do things that Barker's does not. IDEF1X, ORM, and the UML show nonexhaustive sub-types, where the sub-types do not represent all occurrences of the super-type. (Barker's technique deals with this only indirectly—by defining a sub-type called “OTHER...”). The UML also shows nonexclusive sub-types, where an occurrence of the super-type can be an occurrence of more than one sub-type. Information engineering and the UML also show nonexclusive constraints between relationships, not available in Barker's technique.

These are all useful things.

The addition of processing logic to data models in the manner of object-modeling techniques (including behavior in the model) is also a very powerful idea. Clearly provision for describing the behavior of an entity type is something that could be added to Barker's notation. Whether it is more appropriate to extend this notation, in the manner of the UML, or to use separate models, such as entity-type life histories and state/transition diagrams, remains to be seen.

Language

Barker's notation requires the analyst to describe relationships succinctly and in clear, grammatically sound, easy-to-understand English. As mentioned above, where all the other techniques use verbs and verb phrases as relationship names, Barker's notation uses prepositional phrases. This is more appropriate, since the preposition is the part of speech that describes relationships. Verbs describe not relationships but actions, which makes them more appropriate for function models than data models. To use a verb to describe a relationship is to say that the relationship is defined by actions taken on the two entity types. It is better simply to describe the nature of the relationship itself.

Using verbs makes it impossible to construct a clean, natural English sentence that completely describes the relationship. “Each party sells in zero, one, or more purchase orders” is not a sentence one would normally use in conversation.

Moreover, finding the right prepositional phrase to capture the precise meaning of the relationship is often more difficult than finding a verb that approximately gets the idea across. The requirement to use prepositions then adds a level of discipline to the analyst's assignment. The analyst must understand the relationship very well to come up with exactly the right name for it. Correctly naming relationships often reveals that in fact there is more than one.

This requirement for well-built relationship sentences, then, improves the precision of the resulting model. In each modeling technique, Mr. Barker's naming conventions could be used, but analysts are not encouraged to do so.

For Object-Oriented Design—The UML

While Mr. Barker's notation is preferred as a requirements analysis tool, UML is more complete and detailed and therefore the most suited to support design—particularly object-oriented design.

The method for annotating optionality and cardinality is much more expressive of different circumstances than any of the other techniques. It can specifically say that an occurrence of an entity type is related to 1, 7–9, or 10 occurrences of another entity type.

The UML can describe many more constraints between relationships than can other notations. With proper annotation, it can describe both exclusive and inclusive or relationships, or any other that can be named.

For business rules that are not simple relationships between two associations, UML introduces a small flag that can include text describing any business rule.

Attributes can be described in more detail than in other notations.

Overlapping and incomplete configurations of sub-types are allowed.

“Multiple inheritance”, where a sub-type may belong to more than one super-type, is permitted, as are multiple type hierarchies. While these may not be desirable in analysis models, they could be useful as solutions to particular design problems.

In an object-oriented environment, the extra symbols address specific object-oriented situations.

For Relational Design—IDEF1X

For the reasons described above, it is not advisable to use IDEF1X in an analysis project, since the notation is far too complex to present to a non-technical audience. This complexity, however, is exactly what makes it a good tool for representing relational database design. Its notation highlights the existence of foreign keys, and these are documented explicitly. The differences in annotating optionality and cardinality reflect the different way these could be implemented.

Summary

The ideal CASE tool, then, will be one which supports Mr. Barker's techniques for doing requirements analysis, then has the facilities for converting entity-type definitions into either (1) table definitions or (2) class definitions that can be used by C++ or a similar language. It would then have the ability to represent these design artifacts in IDEF1X or the UML for further refinement.
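To suggest what such a conversion involves, here is a deliberately small sketch that turns one entity-type definition into both a table definition and a class definition. Everything in it is assumed for illustration: the dictionary layout of `ENTITY`, the `SQL_TYPES` mapping, the treatment of every attribute as mandatory, and the function names. It produces a Python class rather than a C++ one, and it does not reflect how any actual CASE tool works.

```python
from dataclasses import make_dataclass

# Hypothetical entity-type definition:
# (attribute name, Python type, part of the unique identifier?)
ENTITY = {
    "name": "PurchaseOrder",
    "attributes": [
        ("po_number", str, True),
        ("order_date", str, False),
    ],
}

# Assumed type mapping; a real tool would make this configurable.
SQL_TYPES = {str: "VARCHAR(255)", int: "INTEGER", float: "NUMERIC"}


def to_table(entity) -> str:
    """Render the entity type as a CREATE TABLE statement."""
    columns = [f"  {name} {SQL_TYPES[kind]} NOT NULL"
               for name, kind, _ in entity["attributes"]]
    key = ", ".join(name for name, _, ident in entity["attributes"] if ident)
    if key:
        columns.append(f"  PRIMARY KEY ({key})")
    return f"CREATE TABLE {entity['name']} (\n" + ",\n".join(columns) + "\n);"


def to_class(entity):
    """Render the entity type as a class with one field per attribute."""
    return make_dataclass(
        entity["name"],
        [(name, kind) for name, kind, _ in entity["attributes"]])


print(to_table(ENTITY))
PurchaseOrder = to_class(ENTITY)
print(PurchaseOrder("743453", "12 November, 1999"))
```

Note that the unique identifier survives the trip to the table definition as a primary key but leaves no trace in the class, which mirrors the difference in emphasis between IDEF1X and the UML.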



[1] Meaning that the relationship is Positively required.

[2] Your author is grateful to Dr. Halpin for providing information to supplement his book, and for his comments and suggestions about this appendix. Any remaining errors, however, are your author's and not his.

[3] As a measurement of how confusing this is, different authors themselves cannot agree on how to present it. Martin Fowler shows the qualifying attribute as presented here attached to the parent entity [Fowler 2000, p. 96]. Paul Harmon and Mark Watson, on the other hand, show the attribute next to the child entity [Harmon and Watson 1997, p. 172].
