Reference Schemes

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

5.3. Reference Schemes

With uniqueness and mandatory role constraints covered, it is time to consider reference schemes in more depth. With simple identification schemes, each entity is identified by associating it with a single value. For example, a country might be referenced using a code. For instance, Australia may be referenced by the definite description “The Country that has CountryCode ‘AU’”. We usually abbreviate simple reference schemes by parenthesizing the reference mode, or manner in which the value relates to the entity, for example, Country(.code). In practice, more complex reference schemes are often encountered, and we need to deal with them even as early as step 1 of the CSDP, when we verbalize facts.

Before examining these more difficult cases, let’s have a closer look at simple reference schemes. Table 5.3 contains data about a small UoD where people can be identified by their surnames and each city has a unique name.

Table 5.3. Personal details.
Person	City	Height (cm)	Chest (cm)	Mass (kg)	IQ
Adams	Brisbane	175	100	77	100
Brisbane	Sydney	182	91	80	120
Collins	Sydney	173	80	67	100
Darwin	Darwin	175	95	90	90

Figure 5.15 shows one way to schematize this. The reference schemes are abbreviated as reference modes in parentheses. Some role names have been added for discussion purposes.

Figure 5.15. A conceptual schema for Table 5.3 (reference schemes abbreviated).

Figure 5.16 repeats the schema, this time with the reference schemes depicted explicitly—each appears as a mandatory, 1:1 reference type. For example, in this schema each person has exactly one surname, and each surname refers to at most one person. Technically, this is called an injection (or 1:1 into mapping) from Person to Surname.

Figure 5.16. A conceptual schema for Table 5.3 (reference schemes explicated).

A double uniqueness bar on a role indicates the preferred reference scheme (also known as the preferred identifier or primary identifier) for the entity type playing the adjacent role. For example, in this UoD, surnames provide the preferred way to identify persons, and centimeter values provide the preferred way to reference lengths.

In this case, there is only one way to reference instances of a given entity type. In general however, ORM allows many reference schemes for the same entity type; in such cases, one of these eventually needs to be chosen as the preferred reference scheme.

The second generation ORM considered in this book appends a colon “:” when displaying unit-based reference modes (e.g., cm:) and requires a unit dimension (e.g., Length) for each unit. ORM tool display of unit dimensions may be toggled on/off at will. Since access to unit dimensions is always possible with unit-based reference modes, specific object types corresponding to roles played by the unit dimension may be used instead of the underlying unit dimension when setting out fact types. For example, our current example may be remodeled as shown in Figure 5.17.

Figure 5.17. An alternative conceptual schema for Table 5.3.

Unlike top-level entity types, value types may overlap with one another. For example, the populations of Surname and CityName overlap here since both include the character string ‘Brisbane’.

Assuming the data in Table 5.3, and an appropriate query language, consider the following dialog between a user and the CIP.

User:	`Which person lives in a city with the same name as his/her surname?`
CIP:	`Darwin`.
User:	`Which person is a city?`
CIP:	`None: Entity types Person and City are mutually exclusive`.

It is meaningful to compare the names for Person and City, since they are both character strings. But it is not sensible to compare persons and cities—they are completely different types of entity. Although the value types for IQ, mass, and length are all numeric, mass and length have different units, while IQ is dimensionless (i.e., it has no units). Conceptually, two numerically referenced entities may be compared or added only if they have the same unit (or both have no unit). Hence the following queries do not make sense:

List persons whose mass equals their IQ.
List persons whose height is more than twice their mass.
List height + mass for each person.

Of course, the numbers themselves may always be compared. Also, height and chest size are both based on the same domain and unit, so they are compatible. So the following queries are acceptable.

List persons whose mass number equals their IQ number.
List persons whose height number is more than twice their mass number.
List height + chest size for each person.

Most current database systems provide little or no support for the distinction between entities and their values. Operations specified in queries are interpreted as applying directly to the values. For example, assuming the data of Table 5.3 is stored in a relational database as the table Athlete (person, city, height, chest, mass, iq) the following SQL query is quite legal:

 select person  from Athlete
 where person = city
   and mass = iq and chest + mass > height

For the population in Table 5.3, this query returns ‘Darwin’, despite the fact that taken literally the query is nonsense. The comparisons evaluated in the query are actually between values, not entities. We could clarify this by a better choice of column names (e.g., use “surname” and “cityname” instead of “person” and “city”).

Values are constants with a predefined interpretation, and hence require no explicit reference scheme. For example, ‘Brisbane’ denotes itself (i.e., the character string ‘Brisbane’), and when written without quotes the numeral ‘77’ denotes the number 77 (Hindu-Arabic decimal notation is assumed). It is sometimes claimed that when written without quotes, a constant such as ‘Brisbane’ is enough to identify a single entity. However, some context is always required (at least implicitly). Even in Table 5.3, ‘Brisbane’ is used to refer to both a person and a city.

Earlier in the book we saw that entities are referenced by means of definite descriptions (e.g. “The City with name ‘Brisbane’”). Within the context of a reference type, a value may, somewhat loosely, be said to be an identifier for an entity if the value relates to only one entity of that type.

If only one value is used in a definite description, we have a simple 1:1 reference scheme. Sometimes however, two or more values may be required. In this case, we have a compound reference scheme. As an example, consider how computer files are identified in Table 5.4. Within a given folder, a file may be identified by its local filename, e.g., “flag.vsd”. But when many folders are involved, we need to combine the folder name with the local filename to know which file we are talking about.

Table 5.4. Details about computer files.
File		Size (kB)
Folder	Filenarne	Size (kB)
MyDocs	flag.vsd	35
MyDocs	orml.doc	1324
OzWork	flag.vsd	40

To reflect the two-part naming convention suggested by the two columns for File in Table 5.4, we could model the situation as shown in Figure 5.18. The mandatory roles and internal uniqueness constraints declare that each file is in exactly one folder, has exactly one (local) filename, and has exactly one size. The external uniqueness constraint can be used to provide a compound reference scheme and may be verbalized as shown.

Figure 5.18. File has a compound reference scheme.

If no other identification scheme exists for File, the external uniqueness constraint is the basis for the preferred way in which humans communicate about the files. To indicate this choice of compound, preferred reference, a circled double uniqueness bar is used, as shown in Figure 5.19.

Figure 5.19. The previous schema populated with references and elementary facts.

Using icons for entities, the reference types File is in Folder and File has FileName are populated as shown. In the fact table for File has Size, the file entries appear as value pairs. Although not apparent in the diagram, these are ordered (Folder, FileName) in the same order as the reference types were added to the schema—an ORM tool can display this order on request.

If the reference predicates are included in the definite descriptions for the files, any order can be used. For example, “the File that is in Folder ‘MyDocs’ and has FileName ‘flag.vsd’” is equivalent to “the File that has FileName ‘flag.vsd’ and is in Folder ‘MyDocs’”. In general, compound reference schemes involve a mandatory 1:1 map of objects to a tuple of two or more values.

Textually, a compound reference scheme for an object type A may be declared by listing the reference types, with the leading A removed, in parentheses after A. The relevant uniqueness and mandatory constraints on reference predicates are assumed, and if the reference schemes are declared first, the facts may be stated in shortened form. For example, the sample model may be specified textually as follows:

Reference schemes:	Folder (.name); File (is in Folder, has FileName); Size (RB:)
Fact type:	File has Size
Constraints:	Each File has at most one Size.
	Each File has at least one Size.
Fact instances:	File (‘MyDocs’, ‘flag.vsd’) has Size 35.
	File (‘MyDocs’, ‘orml.doc’) has Size 1324.
	File (‘OzWork’, ‘flag.vsd’) has Size 40.

In principle, we could model Files using a simple reference scheme by concatenating the folder name and local filename into one string with a separator (e.g. “”). For example, the files may then be named simply “MyDocsflag.vsd”. “MyDocsorml.doc” and “OzWorkflag.vsd”. This approach is normally undesirable because it requires extra derivation work to extract and group the components to produce various reports. We’ll say more about this issue later. Indeed, if we wish to issue queries about the file extension (e.g., “vsd”, “doc”) as well, a three-part reference scheme is desirable (using folder name, simple name, and file extension).

Identification schemes are relative to the particular UoD. Often, a simple scheme that works in a local context fails in a UoD of wider scope. This notion is important to bear in mind when merging schemas into a larger schema. For example, suppose that within a given university department each subject offered by that department can be identified by its title. For instance, consider the output report in Table 5.5, which indicates which subjects are offered by the Physics Department in which semesters.

Table 5.5. Physics department offerings.
Subject	Semester
Electronics	1
Mechanics	1
Optics	2

If designing a schema for just the Physics Department, we have a 1:1 correspondence between subjects and their titles. So this table may be schematized as the fact type Subject (.title) is offered in Semester (.nr). A similar schematization could be used for just the Mathematics Department, a sample output report for which is shown in Table 5.6.

Table 5.6. Mathematics department offerings.
Subject	Semester
Algebra	2
Calculus	1
Mechanics	1

But now suppose we need to integrate departmental schemas into an overall schema for the whole university. Our simple identification scheme for subjects will no longer work because, in this wider UoD, different subjects may have the same title. For example, the subject Mechanics offered by the Physics Department is different from the subject Mechanics offered by the Mathematics Department. A combined output report would look like Table 5.7.

Table 5.7. University subject offerings.
Department	Subject title	Semester
Physics	Electronics	1
	Mechanics	1
	Optics	2
Mathematics	Algebra	2
	Calculus	1
	Mechanics	1
...	...	...

The first row may be verbalized using a compound reference scheme as follows: “The Subject that is offered by the Department named ‘Physics’ and has the SubjectTitle ‘Electronics’ is offered in the Semester numbered 1”. This approach is schematized in Figure 5.20.

Figure 5.20. A schema for Table 5.7 using a compound reference scheme.

Although conceptually this picture is illuminating, in practice its implementation may be somewhat awkward, with two labels (one for the department and one for the title) needed to identify a subject. In such cases, a new identification scheme is often introduced to provide a simple 1:1 reference. For example, each subject may be assigned a unique subject code, as shown in Table 5.8.

Table 5.8. Subject codes provide a simple reference scheme.
Subject	Title	Department	Semester
PH101	Electronics	Physics	1
PH102	Mechanics	Physics	1
PH200	Optics	Physics	2
MP104	Algebra	Mathematics	2
MP210	Calculus	Mathematics	1
MA 109	Mechanics	Mathematics	1

In this output report there are two candidate identifiers for Subject. We could identify a subject by its code (e.g., ‘PH102’) or by combining its department and title (e.g., (’Physics’, ‘Mechanics’)). One of these is chosen as the preferred identifier, or standard means of referring to the entity. In this case, we would usually pick the subject code as the preferred identifier. We can indicate this choice by displaying the subject code identification scheme as reference mode in parentheses and displaying the external uniqueness constraint with a single uniqueness bar (see Figure 5.21).

Figure 5.21. A schema for Table 5.8 using subject code as Subject's primary identifier.

If a parenthesized reference mode is displayed, this is always taken to be the preferred reference scheme. Here, the Department and SubjectTitle predicates are treated just like any other elementary fact types; they are no longer considered to be reference types. Of course, the external uniqueness constraint between them must still be declared.

Choosing a preferred reference scheme from two or more existing schemes is perhaps not always a conceptual issue. At the conceptual level, its main uses are to note any business decisions on preferred identification schemes and to disambiguate abbreviated references that omit the reference predicate(s), e.g., “Patient 3452 has Temperature 37”. This is especially useful when populating fact tables with instances. At the logical level however, the choice of preferred reference typically has a major impact (e.g., determining the primary key of a table).

Here are some guidelines for selecting a preferred identifier. First, minimize the number of components. A single subject code is easier to enter than two values for department and title. Moreover, a compound identifier adds extra overhead in the later database implementation (joins, indexes, and integrity checks over composite columns require more effort).

Second, favor an identifier that is more stable. In a university environment, the same subject may undergo some title changes throughout its history and sometimes the department that offers a given subject changes. It is even possible for the same department-title combination to refer to different subjects at different times. However, it is extremely rare for a subject to have its code changed, and the same subject code is typically never reused for a different subject.

If an historical record is needed for an object, and its identifier changes though time, extra work is required to keep track of these changes so that we know when we are talking about the same object. This extra effort can be minimized by making the identifier as stable as possible. Ideally, the object has the same identifier throughout the lifetime of the application; this is known as a rigid identifier. Most organizations choose rigid identifiers such as EmployeeNr, ClientNr, SerialNr for their employees, clients, equipment, and so on.

To explain some basic concepts in a friendly way, we’ve used identifiers such as PersonName, Surname, or even Firstname in some examples. However, except for trivial applications, this kind of identification is unrealistic. Often, two people may have the same name. Moreover, people may change their name—women usually change their surname when they marry, and anyone may legally change their name for other reasons. A philosophy lecturer we know once changed his surname to “What” and his wife changed her surname to “Who”!

As another example, the schema in Figure 5.22 expands the UoD considered in Figure 5.19 to enable computer files to be identified across many computers. Here, an artificial fileId is introduced as the preferred identifier to provide a simple, rigid identifier that remains stable even when the file name is changed (a fairly common occurrence). In addition, a human friendly path name is provided as a multipart secondary identification scheme via the external uniqueness constraints.

Figure 5.22. Identifying files in a wider environment.

Concatenating the components of this secondary identification scheme along with separators leads to file names such as “www.myServer.comc:MyDocsflag.vsd”. In practice, many systems would also assign simple internal identifiers for other elements (e.g., folders and drives).

A third criterion sometimes used for selecting an identifier is that it be easy for users to recognize. This criterion is mainly used to minimize errors when users enter or access artificial identifiers such as subject or stock-item codes. This effectively endows codes with semantics, at least implicitly. For example, the subject code “CS114” is used in an Australian university for a particular subject in informatics. The first two characters “CS” indicate the discipline area (Computer Science), the first digit indicates the level (first), and the last two digits merely distinguish it from other computer science subjects at the same level. Such a code is less likely to be misread than an arbitrary code (e.g., “09714”).

Such “information-bearing” identifiers are sometimes called “semantic names”. They should be avoided if the semantic relationships involved are unstable, since the names would then often need changing. In the current example, however, the semantics are fairly stable—the subject discipline is unlikely ever to change, and the subject level would normally be stable too. The linking of the two letters to discipline (e.g., “MP” for pure mathematics and “MA” for applied mathematics) rather than department is better, since disciplines are more stable than departments.

If semantic names are used, we need to decide whether their semantics are to be modeled in the system or simply assumed to be understood by users. The semantics of subject codes may be modeled explicitly by including the following three binaries in the schema: Subject is in Discipline; Subject is at YearLevel; Subject has SerialNr. An external uniqueness constraint spans the roles played here by the three components.

Figure 5.23(a) shows the case where subject code is the preferred identifier. In this case, these three binaries would normally be derived, and appropriate derivation rules specified (e.g., the discipline code may be derived using a substring operator to select the first two characters of the subject code). The external uniqueness constraint is then derivable.

Figure 5.23. Different ways of exposing the semantics of subject codes.

An alternative approach is shown in Figure 5.23(b). Here the three-part identification scheme is used for preferred reference, and the subject code is derived by concatenation.

The major advantage of storing the components separately like this is that it facilitates queries concerning the components. In particular, if we wish to formulate queries for each instance of a reference component, then that component should be stored.

The schema in Figure 5.23(b) makes it easy to query properties for each discipline or year level (e.g., “How many subjects are offered for each discipline?”). As discussed in Section 12.10, the conceptual “for-each” construct corresponds to “group-by” in SQL.

The model in Figure 5.23(a) is extremely awkward for such queries, since groups can normally only be formed from stored columns. However, a simple code is more compact and offers much better performance for relational joins than a three-part identification scheme.

Is there a way of having your cake and eating it too? Yes, at least to some extent. In Figure 5.24 the three components are both derived and stored (as shown by the “**”). This allows fast joins on the subject code and grouped queries on the components.

Figure 5.24. The three components are derived and stored.

Semantic codes were often used in legacy systems, with no attempt to expose their semantics as separate components. If these legacy systems have to be retained in their current form, a separate system (e.g., a data warehouse) can be constructed to better support queries on the components. In copying the data from the legacy system to the new system, the codes are transformed into separate component fields.

Just how much of the semantics underlying an information-bearing name should be exposed in a conceptual schema depends on what kinds of queries we wish to issue.

For example, consider the output report shown in Table 5.9. Here room numbers have the format dd-ddd (where d denotes a digit). The first two digits provide the building number for the room’s building, the third digit indicates the room’s floor (in this UoD no building has more than nine floors), and the last two digits comprise a serial number to distinguish rooms on the same floor.

Table 5.9. Employee details.
EmpNr	Employee Name	Room
715	Adams A	69-301
139	Cantor G	67-301
503	HagarTA	69-5073
...	...	...

If we never wish to query the information system about buildings or floors, it’s fine to model rooms using the simple reference scheme Room(.nr). Now suppose instead that we also expect the system to output reports such as the extract shown in Table 5.10. Here we are interested in buildings, and even want to issue a query that lists the number of rooms for each building. In this case we must expose at least the building component of the reference scheme for buildings, as shown in Figure 5.25.

Table 5.10. Building details.
BuildingNr	Building Name	Nr moms
...	...	...
67	Priestly	100
68	Chemistry	100
69	Computer Science	150
...	...	...

Figure 5.25. The reference scheme for Room exposes the building component.

In the new model, the term “RoomNr” indicates a local room number (e.g., “301”) that identifies a room within a given building. It corresponds to the last three digits in a campus-wide room number such as “67-301”. If we are not interested in querying the system about floors, there’s no need to expose the room number semantics any further.

Now suppose that we wish to query the system about floors; for example, “How many rooms are there on floor 3 of building 67?” To facilitate this, we could expand the reference scheme to three components to expose the floor semantics. One way to model this is shown in Figure 5.26(a). This divides the previous room number (e.g., “301”) into a FloorNr (e.g., “3”) and a serial number (e.g., “01”).

Figure 5.26. Two ways of exposing the three-part reference scheme for rooms.

If we want to think about floors themselves, rather than just floor numbers, it is better to model Floor as an entity type as shown in Figure 5.26(b) Here Floor plays in the composite reference scheme for Room, and also has its own composite reference scheme.

Although less convenient, it is possible to use derivation functions to formulate queries about single instances of a reference component, without actually storing the component. For example, to list the number of rooms on floor 3 of building 67, we could use the following SQL query: select count(*) from Room where roomNr like ‘67%’ and roomNr like ‘%-3%’. Even though this approach is possible, we still recommend exposing the semantics. If ever in doubt as to whether to expose some semantics, it’s generally better to do so.

You should always store a reference component if you wish to query about a group of its instances; for example, “For each floor of building 67, how many rooms are there?” Using the schema of Figure 5.26(b), this may be formulated in ORM by selecting the path from Room to Building (i.e., Room is on a Floor that is in Building), adding the condition “= 67” for Building (implicitly applied to BuildingNr), and then requesting Floor and count(Room) for each Floor.

At the relational level, the query may be formulated in SQL using a group-by clause; for example, select floorNr, count(*) from Room where buildingNr = 67 group by floorNr. The SQL language is discussed in detail in Chapters 12 and 13.

Choosing reference schemes for entity types is an important aspect of modeling. As a strict guideline, an entity type’s preferred identification scheme must 1:1 map each entity of that type to a tuple of one or more values (either directly or indirectly). So, each preferred reference scheme provides an injection (mandatory, 1:1 mapping). For example, in Figure 5.26, each building maps directly to a unique building number, and each room maps (indirectly via the reference predicates) to a unique triple of values (building number, floor number, serial number).

In everyday life, weaker kinds of identification scheme are used in some contexts. For example, some people have identifying nicknames (e.g. “Tricky Dicky” denotes President Nixon, and “the great white shark” denotes the golfer Greg Norman). However in a typical application not everybody will have a nickname, so nicknames are not used for preferred reference (they are not mandatory).

Sometimes a l:many reference scheme is used. For example, a plant or animal type may have more than one identifying, common name. For instance, the bird species with the scientific name of Dacelo gigas may also be identified by any of the following common names: “kookaburra”, “laughing jackass”, “great brown kingfisher”, or “bushman’s clock”. As a botanical example, “gorse” and “furze” refer to the same plant.

In Figure 5.27, common names are used as the preferred identifier for bird species. This choice is signified by the double uniqueness constraint bar. A sample population is included to illustrate the l:many nature of this reference type. Do you spot any problem with this reference scheme?

Figure 5.27. This association is not 1:1, so should not be used for preferred reference.

This association must not be used for preferred reference because it is not 1:1. You can imagine the problems that would arise if we used “kookaburra” to identify a bird species in one part of a database and “laughing jackass” in another part. In the absence of any standard identifier like a scientific name, how would we know that we were talking about the same bird species?

For preferred reference, we must choose a mandatory, 1:1 scheme. Such a scheme may already exist (e.g., scientific name). If not, we create one with the help of the domain expert. This might be partly artificial (e.g., standard common name) or completely artificial (e.g., birdkindNr). Whatever we choose, we need to get the users of the information system to agree upon it.

The scientific name for a bird species is actually a semantic name with two components. For example, in Dacelo gigas, the “Dacelo” is the name of the genus, and “gigas” is a species name. By itself, a species name such as “gigas” or “americana” does not identify a species. For instance, Certhia americana (the brown creeper) is a different bird species from Parula americana (the northern parula).

If we wish to query about the genus or species name, we should unpack the naming semantics into a two-part reference scheme as shown in Figure 5.28; otherwise we can use the simple reference scheme BirdSpecies(.scientificName).

Figure 5.28. Many common names may be used for the same species.

Notice that we can still include facts about common names. Once this schema is fully populated, we could let users interact with the system via the common names, still using the scientific naming convention to establish identity.

Simple 1:1 naming schemes relate entities directly to values (e.g., subjects to subject codes). In life we sometimes identify entities by 1:1 relating them to other entities (e.g., “the Olympics that was held in the Year 1992”, “the Warehouse that is located in the City ‘Brisbane’”, “the Director who heads the Department ‘Sales’”). Such identification schemes may be chosen for primary reference, and depicted as mandatory 1:1 binaries (e.g., Olympics was held in Year (CE)). These binaries are then regarded as reference types, not elementary fact types.

However, this practice is rare. Instead such binary associations are usually modeled as elementary fact types, and other primary reference schemes are chosen (e.g., Olympics(.sequenceNr), Warehouse(.nr), Director(.empNr)). Alternatively, the relevant entity types may be removed by using suitably descriptive predicates. For example, the assertion “The Olympics of Year 1992 was located in City ‘Barcelona’” may be portrayed as a binary association between the entity types Year and City. If the original referential semantics are not needed, another alternative is to use simple value reference, but this is usually a last resort. For example, “Warehouse(.name) ‘Brisbane’” loses the semantics that this warehouse is located in the city Brisbane.

In compound reference schemes, each reference predicate is typically mandatory. In such cases, each entity of the same type is identified using the same number of values. In the real world however, we sometimes encounter identification schemes where some of the reference roles are optional, but their disjunction is still mandatory. This is called disjunctive reference. In these cases, different entities of the same type may be referenced by different numbers of values. For example, consider the botanical identification scheme depicted in Figure 5.29.

Figure 5.29. With botanical identification, some reference roles are optional.

Some kinds of plants are identified simply by a genus (e.g., Agrostis). Many other plant varieties are referenced by a combination of genus and species names (e.g., Acacia interior). Still others are identified by combining genus, species, and infraspecies, where the infraspecies itself is identified by a rank and infraname (e.g., Eucalyptus fibrosa ssp. nubila). So depending on the kind of plant, there may be one, two, or four values required to identify it.

The external uniqueness constraint over three roles indicates that each subtuple of (genus, species, infraspecies) refers to at most one plant kind. The other external uniqueness constraint declares that the (rank, infraname) pair identifies infraspecies. In the reference scheme for Plantkind, only the genus is mandatory. Moreover, for each genus there is at most one plant kind with only a genus name. And for each genus-species combination there is at most one plant kind with no infraspecies. In a relational database, each plant kind maps to a sequence of four values, three of which may be null, and each quadruple is unique (treating null just like any other value). Uniqueness constraints over optional roles are discussed in detail in Section 10.1.

Years ago when Peter Ritson and one of us formalized disjunctive reference within ORM, we displayed it with a percentage sign next to the external uniqueness marker, suggesting that “partial” sequences of components are allowed. Nowadays this notation is no longer used, and the basic external uniqueness symbol is used for both ordinary and disjunctive reference. You know the reference scheme is disjunctive if at least one reference role is optional. The minimum requirement for a legal reference scheme is that it provides a mandatory 1:1 mapping to a sequence of one or more values. So the disjunction of the reference roles must still be mandatory.

You may have noticed the arrow with a subset symbol running from the first role of Plant has Infraspecies to the first role of Plant has SpeciesName. This denotes the subset constraint that infraspecies is recorded only if species is. Subset constraints are discussed in detail in the next chapter.

Believe it or not, the reference scheme of Figure 5.29 is a simplified version of the actual identification scheme used in botany, where complications such as hybrids and cultivars also need to be catered for. Life can be messy!

In practice, try to avoid using a disjunctive reference scheme for preferred reference. With the current example, we should consider introducing an alternative simple identifier (e.g., plantkindNr). If the users agree to use this in their normal communication, this change can be made to the conceptual model. If not, we can still consider using it in the logical design. If we do this, we still need to model the disjunctive information as a “secondary reference” scheme. Although still messy to implement, moving the disjunction from a primary to a secondary reference simplifies and improves the performance for most implementations.

Another way to avoid disjunctive reference is to concatenate the reference components into a single name. Since this makes it difficult to issue queries about individual components, this is unlikely to be viable with our current example.

Yet another way to avoid disjunctive reference is to include special default values in the identification scheme. For example, if we use a symbol such as “--” or “#” for “does not exist” all the reference roles become mandatory, and this special value can be treated like any other value in enforcing uniqueness. Although this approach can work for implementing some cases, it is hardly conceptual and may be impractical if you require the user community to actually use the special values or if you need the same symbol to work with different data types (e.g., numbers as well as strings).

Sometimes reference chains can get lengthy. A value is identified by a constant. Each entity is referenced directly using a sequence of one or more objects, which in turn may be values or entities. In the latter case, the reference chain continues, recursively, until finally the referencing objects are all values. In this way, each object is ultimately referenced by one or more values that appear as entries in named columns of output reports or database tables.

Candidate identifiers for the same entity are said to be synonyms. In the information systems literature the term “homonym” is used for a label that refers to more than one entity. For example, the same surname “Jones” may refer to more than one person. Since “homonym” has different grammatical senses, another term such as “nonidentifying label” is preferable. At any rate, the “problem of homonyms” is solved either by augmenting the reference scheme until identification is achieved (e.g., combining surname and initials) or by using a completely different preferred identification scheme (e.g., social security number).

In cases considered so far, an object’s reference scheme has been the same throughout the whole schema. The next chapter examines some cases involving subtypes where this assumption is removed; this leads to context dependent reference. To end this section, let’s consider the case of variable reference where different units are used for the same physical quantity in the same application domain.

In Australia, pieces of lumber have their longest length specified in meters, but their breadth and depth are measured in millimeters. Because this is a standard in the Australian building industry, a report on lumber sizes might look like Table 5.11.

Table 5.11. Lumber details.
Lumber size	Length (m)	Breadth (mm)	Depth (mm)
A4	3	100	100
B7	2	100	75
C7	5	200	100

Suppose we want to compare the different linear measurements (e.g., Which sizes of lumber are 20 times as long as their breadth?) or compute volumes in standard units (e.g., What is the volume of lumber size C7 in cubic meters?). Figure 5.30(a) shows one way to model this situation.

Figure 5.30. Derived units are based on a standard unit.

Here the colon “:” indicates the unit-based nature of the reference modes. The asterisk “*” after “(mm:)” indicates that millimeter (mm) is a derived unit, unlike meter (m), which is a standard unit. Figure 5.30(b) displays the expanded form of the unit reference modes, revealing that each unit is based on the same unit dimension (Length). It also shows the conversion rule between the units. A derived unit is always placed on the left of the equals sign. We use the term “dimension” to include derived dimensions (e.g., Volume = Length³). Both sides of a conversion rule equation must have the same dimension (e.g., Length) or dimension expression (e.g., Mass/Volume).

Since several units might be used for the same physical quantity, we reduce the number of conversion rules by choosing a standard unit and just supplying rules to convert to this standard (rather than separate conversions between each unit pair).

ORM 2 regards object types with the same unit-based reference mode to be compatible. So, if Length and Pressure occur in the same model, it would be incorrect to use the reference schemes Length(mm:) and Pressure(mm:) since this implies that Length and Pressure are based on the same semantic domain. To avoid this problem, we need to rename one of the reference modes, for example, Pressure (mmHg:).

As a more abstract approach, we could model the metadata (length, etc.) in Table 5.11 as data, as shown in the populated model in Figure 5.31. The first fact from Table 5.11 could then be expressed by the facts: LumberSize ‘A4’ has Attribute ‘length’ of Numeric Value 3; Attribute ‘length’ has Unit ‘m’. The unit conversion rule is still required. This technique is called metadata demotion since it demotes metadata (data about data) to domain data. While this approach is very powerful, it is often unnatural for most users. If used at all, it should normally be transformed into a simpler external schema for user interaction.

Figure 5.31. Demoting metadata to data.

As a simpler example of treating different units as values, a financial application involving many currencies might model amounts of money using the compound reference scheme MoneyAmount(is in Currency(.code), has NumericValueO).

Demoting metadata (types) to data (instances) can help to minimize changes to the schema over time. An abstract model such as Figure 5.31 might be used to permit new attributes to be added without altering the database schema. For example, a medical application might use the ternary association Patient has Attribute of Value to anticipate new patient tests to be added later. In such cases, additional associations are often needed to control the use of units within their relevant contexts.

Sometimes, we may need to work with multiple unit systems. For example, we may need both metric and nonmetric measures as users gradually move from one unit system to another, or merge separate applications. In addition to supplying conversion rules to transform between different units, we must take care to expose the units to humans interacting with the system (recall the fate of the Mars Climate Orbiter).

Exercise 5.3

It is desired to identify a warehouse by its physical location. Design an appropriate identification scheme for the each of the following contexts:
1. UoD restricted to one suburb. The street in which the warehouse is located is identified by its name.
2. UoD restricted to one city. Each suburb is identified by its name.
3. UoD restricted to one country. Each city is a major city, identified by name.
4. UoD restricted to planet Earth. Countries are identified by name.
5. UoD restricted to Milky Way galaxy. Planets are identified by name.
The UoD is restricted to Earth, and we wish to store facts of the form: Warehouse has Floorspace; Warehouse contains Item in Quantity. Is the identification scheme discussed in Question 1(d) practical from the implementation point of view? If not, suggest a better scheme and support your scheme by comparing a sample population for the two fact types, using both approaches. Comment on the relative work needed to perform a join (conceptual or relational) between the fact types.
1. A triangle ABC is formed from points in a Cartesian coordinate system. The coordinates are pure numbers (no units). Assume that at any given time, only one shape can be displayed, which must be a triangle. Model the information displayed.
2. The map shown here indicates the approximate location of parking stations near a famous cathedral. The squares on the grid represent city blocks. Model the parking information stored on this map.
Members of a small social club are identified by the combination of their given names and surname. Each member has at least one and at most three given names. For example, one member is Eve Jones, another is Johann Sebastian Bach, and another is Eve Mary Elizabeth Jones. It is required that each component of their names be individually accessible. Draw a conceptual schema diagram for this situation.
Members of a small American gymnasium have their weight recorded in pounds (lb). The weight that each member can bench press is also recorded, but in kilograms (kg). It is desired to compare these two weights but retain the separate units. Specify a conceptual schema for this situation. Note that 1 lb = 0.454 kg and 1 kg = 2.205 lb.
The following table lists the common names by which beer drinks of various volumes may be ordered in hotels in the states of Australia. Volumes are measured in fluid ounces (oz). The sample data is significant. A double hyphen “—” indicates that beer drinks of that volume are not on sale in that state. For instance, in Queensland exactly three different beer drinks may be ordered. Schematize this UoD.
The following report is an extract from an information system that records the results of driving tests for various employees. A person may take as many tests as he/she wishes, but cannot take more than one test on the same day. A minimum score of 80 is required to pass the test. Sometimes a person takes another test simply to try for a better score. Schematize this UoD.
Test Date Driver Result
SSN Name Score P/F
101 1/2/00 539-31-1234 Hagar, David Paul 75 F
102 1/2/00 438-12-3456 Jones, Selena Linda 85 P
103 1/9/00 539-31-1234 Hagar, David Paul 80 P

Test	Date	Driver	Result
101	1/2/00	539-31-1234	Hagar, David Paul	75	F
102	1/2/00	438-12-3456	Jones, Selena Linda	85	P
103	1/9/00	539-31-1234	Hagar, David Paul	80	P

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Reference Schemes

Create new playlist

Sign In

Sign Up