3.5. CSDP Step 3: Trim Schema; Note Basic Derivations

Having drawn the fact types and performed a population check, we now move on to step 3 of our design procedure. Here we check to see if there are some entity types that should be combined. We also check to see if some fact types can be derived from the others by arithmetic computation.

CSDP step 3: Check for entity types that should be combined; note any arithmetic derivations


To understand the first part of this step, we need to know how the objects in the UoD are classified into types. Figure 3.22 shows the basic division of objects into entities (non-lexical objects) and values (lexical objects). Entities are identified by definite descriptions and may typically change their state, whereas values are constants (typically character strings or numbers). Values might include other objects directly representable on a medium (e.g., sounds), but such possibilities are ignored in this book.

Figure 3.22. Partitioning objects into types of entities and values.


These subdivisions are mutually exclusive (i.e., they have no instance in common). For example, no character string can be an entity. The division of a whole into exclusive parts is called a partition. You may think of it as cutting a pie up into slices. The slices are exclusive (they don’t overlap) and exhaustive (together they make up the whole pie).

Figure 3.22 gives an example of how the object types might be divided into entity types and value types for a particular business domain. Which kinds of objects exist depends on the UoD. Basically, objects are grouped into the same type if we want to record similar information about them.

For any UoD, there is always a top-level partitioning of its entities into exclusive types: these are called primitive entity types. We may introduce subtypes of these primitive types, especially if they have some specific roles to play. In ORM, subtypes are shown connected by an arrow to their supertype. Although shown separately, it is possible that subtypes of a given entity type may overlap.

However, primitive entity types never overlap. For example, no person can be a city. On a conceptual schema diagram, the visual separation of primitive entity types indicates that these types are mutually exclusive. The same is not true of subtypes. In Figure 3.23, for example, Person and City are mutually excusive but Manager and Woman need not be. Subtypes are discussed in detail in Chapter 6.

Figure 3.23. Person and City are primitive entity types and hence mutually exclusive.


Figure 3.24(a) shows two value types: Surname and Cityname. Even if these appear as top level types in an ORM schema diagram, it is possible that they overlap, as shown in the Euler diagram of Figure 3.24(b). Here the character strings ‘Denver’ and ‘Columbus’ are both surnames and city names. Although subtype relationships between value types are rarely displayed, they may exist implicitly. In rare cases where the supertype plays a role of interest, we might display the connections explicitly, as in Figure 3.24(c).

Figure 3.24. Overlapping value types have an implicit or explicit supertype.


Step 3 of the design procedure begins with a check to see if some entity types should be combined. At this stage we are concerned only with primitive entity types, not entity subtypes. So if you spot some entity types that do overlap, you should combine them into a single entity type, or at least introduce a common supertype (see Chapter 6).

For example, consider Table 3.11 which is based on the movie application discussed in Chapter 1. Suppose that as a result of applying steps 1 and 2 we arrived at the diagram shown in Figure 3.25. Do you see what’s wrong with this diagram?

Table 3.11. Some motion pictures.
Movie#Movie TitleDirectorStars
1CosmologyLee Lafferty 
2Kung Fu HustleStephen ChowStephen Chow
3The Secret GardenAlan GrintGennie James, Barret Oliver
............

Figure 3.25. A faulty conceptual schema.


Figure 3.25 displays MovieStar and Director as separate, primitive entity types. This implies that these types are mutually exclusive (i.e., no movie star can be a director). But is this the case? Our sample population lists the value “Stephen Chow” in both the Director column and the Stars column. Does this refer to the same person?

If in doubt you can ask a domain expert. In actual fact, it is the same person. As a result, we must combine the Movie Star and Director entity types into a single entity type as shown in Figure 3.26.

Figure 3.26. The result of applying step 3 to Figure 3.25.


This shows that it is possible to be both a star and a director. Of course, it does not imply that every movie star is a director. A later chapter discusses how to add subtypes later if necessary. For example, if some other facts are to be recorded only for directors, we form a Director subtype of Person for those additional facts.

One case where entity types may always be combined, if desired, is when they have the same unit-based reference mode. Here the entity is envisaged as a quantity of so many units (e.g., kilograms or U.S. dollars). Let’s look at a few examples.

Consider the output report of Table 3.12, and assume that “$” means U.S. dollar. Here $50 is a wholesale price and a markup. In both cases the $50 denotes the same amount of money, so wholesale prices and markups overlap. If the table population is significant, the set of retail prices does not overlap the set of markups. Nevertheless, it is meaningful to compare retail prices and markups since they have the same unit (U.S. dollars). For instance, article Al has a retail price that is three times its markup.

Table 3.12. Monetary details about articles on sale.
ArticleWholesale Price ($)Retail Price ($)Markup
A1507525
A28013050
A3507020
A410013030

In modeling this report, we must reveal that the entries in the final three columns are compatible because they all represent amounts of money. One way to schematize this UoD is shown in Figure 3.27(a). By assigning WholesalePrice, RetailPrice, and Markup the same unit-based reference mode (USD), we implicitly declare that they are all based on the same unit dimension.

Figure 3.27. Unit-based types are implicitly subtypes of their unit dimension.


The colon in “(USD:)” signifies that the reference mode is unit-based. When entering this in an ORM 2 tool, we declare the unit dimension (by selecting from a predefined list or adding our own). In this case, the unit dimension is Money Amount (or simply Money). As a display option, the unit dimension may be shown explicitly, as in Figure 3.27(b).

WholesalePrice, RetailPrice, and Markup are not primitive types. They are implicitly subtypes of Money Amount. Instead, we may collapse these entity types into Money-Amount and move their distinguishing semantics into the predicate readings, as in Figure 3.28. We may retain the names of the former entity types as role names, as shown.

Figure 3.28. An alternative to Figure 3.27, with derivation rule added.


The output report satisfies the following mathematical relationship between the values in the last three columns: markup = retail price - wholesale price. Assuming this is significant, the markup value may be derived from the wholesale and retail values by means of this rule. To minimize the chance of human error, we have the system derive this value rather than have humans compute it.

To indicate that a fact type is derived, an asterisk is appended to its predicate reading. Whether or not the derived fact type is displayed on the diagram, a derivation rule for it should be supplied. In this book, derivation rules are displayed as textual annotations.

A derivation rule may be specified in either attribute style or relational style. Attribute style uses role names to refer to attributes of an object type, typically cited in a for-each clause, as in Figure 3.28. For binary associations, role names may be treated as naming attributes of the object type at the opposite end of the association.

If the context of a rule is declared in a for-each clause, it may be assumed thereafter. If the context is not predeclared, it may be provided in situ by qualified names using either the dot notation familiar from programming and UML (e.g., Article.markup) or the of-notation (e.g., markup of Article). When validating rules with nontechnical domain experts, the of-notation is usually preferable to the dot notation.

Relational style instead uses predicate readings. For example, using “iff” for “if and only if, the markup rule may be declared in relational style as shown below. Subscripts distinguish variables introduced in the rule body (after “iff”) from variables of the same type (here MoneyAmount) that occur in the rule head (before “iff’).

Article has markup of MoneyAmount iff

Article retails for MoneyAmount1 and Article wholesales for MoneyAmount2 and

MoneyAmount = MoneyAmount1 - MoneyAmount2.

For brevity, relational style assumes that variables in the rule head are universally quantified and that variables introduced in the body are existentially quantified. So the rule may be stated explicitly thus: For each Article and MoneyAmount, Article has markup of Money-Amount iff Article retails for some MoneyAmount, and Article wholesales for some MoneyAmount, and MoneyAmount = MoneyAmount1 - MoneyAmount2.

Strictly speaking, the equation markup = retailPrice - wholesalePrice is a constraint between the markup, retail price, and wholesale price fact types, with one degree of freedom. Any one of the three could be derived from the other two. Sometimes we might wish to enter retail and wholesale prices and have the system derive the markup. At other times, we might wish to enter the wholesale price and markup to derive the retail price. We might also want to enter the retail price and markup to derive the wholesale price.

If we wish the system to support all these choices, then each of the three fact types is semiderived (instances may be asserted or derived) and its predicate is marked with a “-” (intuitively, half an asterisk) instead of “*”. This is sometimes useful in practice (e.g., exploring loan options based on interest rates and repayment periods).

In most derivation cases however, we decide beforehand that one specific fact type will always be the derived one. In this case, the derivation rule may be clearly specified as a definition (rather than merely an equation or biconditional) where the derived fact type is defined in terms of the others using “define ... as ...”. For example:

For each Article, define markup as retailPrice - wholesalePrice.

In such a definition, the derived fact type is said to be the definiendum (what is required to be defined). A fact type that is primitive (i.e., not defined in terms of others) is said to be an asserted fact type or base fact type. Derived fact types are defined in terms of other fact types (asserted or derived). For arithmetic derivation rules, such as our current example, attribute style is typically more compact and readable and hence is preferred. For logical derivation rules (see Section 5.5), relational style is often preferable.

Now consider the output report of Table 3.13, which shows details about windows on a computer screen. Try to schematize this before reading on.

Table 3.13. Window sizes.
WindowHeight (cm)Width (cm)Area (cm2)
14520
2620120
31015150
45525

The values in the last three columns are all numbers. Can we collapse Height, Width, and Area into one entity type? You might argue that Height and Width overlap since the value 5 is common, and that Width and Area overlap since each includes the value 20. However, Area is measured in square centimeters, quite a different unit from centimeters.

Heights and widths may be meaningfully compared since both are lengths: a length of 5 cm may be an instance of a height or a width. However, a length of 20 cm is not the same thing as an area of 20 cm2. If our final column was headed “Perimeter(cm)”, we could collapse three headings into one entity type as previously. But since Area is fundamentally a different type of quantity, we must keep it separate, as in Figure 3.29(a).

Figure 3.29. Schema for Table 3.13, where area is simply derived (a) and derived and stored (b).


As an alternative, the height and width fact types could be modeled using the fact types Window(.nr) has Height(cm:) and Window(.nr) has Width(cm:), where the centimeter unit is based on the unit dimension Length.

The derivation rule for area in Figure 3.29(a) is prepended by a single asterisk to indicate it is derived and not stored. The rule body includes another asterisk, which in this context means multiplication. In principle, any one of area, height, and width could be derived from the other two. In many cases, however, there simply is no choice as to which fact type is derived. For example, facts about sums and averages are derivable from facts about individual cases, but except for trivial cases we cannot derive the individual facts from such summaries.

A derived fact type that is not stored is derived-on-query (lazy evaluation), so the derived information is computed only when the information is requested. For example, if our Window schema is mapped to a relational database, no column for area is included in the base table for Window. The rule for computing area may be included in a view definition or stored query and is invoked only when the view is queried or the stored query is executed. In most cases, lazy evaluation is preferred (e.g., computing a person’s age from their birth date and current date).

A fully derived fact type that is also stored is derived-on-update (eager evaluation). Sometimes eager evaluation is chosen because it offers better performance (e.g., computing account balances). In this case, the information is stored as soon as the defining facts are entered and is updated whenever they are updated. Fact types that are fully derived and stored are marked with a double-asterisk “**”, as in Figure 3.29(b). When the schema is mapped to a relational database, a column is created for the derived fact type, and the computation rule is specified either as a trigger or as a computed column rule that is fired whenever the defining columns are updated (including inserts or deletes).

Now consider Figure 3.30. This might describe part of a UoD concerning practitioners in a medical clinic. Here the entity types Doctor, Dentist, and Pharmacist have a similar reference mode (name), but because this is not unit based, this is no reason to combine the types. If somebody could hold more than one of these three jobs then the overlap of the entity types would normally force a combination.

Figure 3.30. Should Doctor, Dentist, and Pharmacist be combined?


However, suppose that the entity types are mutually exclusive (i.e., nobody can hold more than one of these jobs). In this case we still need to consider whether to combine the entity types, since the same kind of information (their gender) is recorded for each.

In such cases, ask this question. Do we ever want to list the same kind of information for the different entity types in the same query? For example, do we want to make the request “List all the practitioners and their gender”? If we do, then we should normally combine the entity types as shown in Figure 3.31. If we don’t, then there may be grounds for leaving the schema unchanged. Section 6.6 examines this issue in more detail.

Figure 3.31. An alternative schema for the UoD of Figure 3.30.


Even if no doctor can be a dentist, the schema of Figure 3.30 permits a doctor and a dentist to have the same name (e.g., “Jones E”). In Figure 3.31, the use of “(.name)” with Practitioner implies that each instance of PractitionerName refers to only one Practitioner.

Suppose we add the constraint: Each Practitioner holds at most one Job. A graphical notation for this kind of constraint is discussed in the next chapter. With this constraint added, Figure 3.31 would forbid any doctor from having the same name as a dentist. If the original names did overlap, we would now need to rename some practitioners to ensure that their new names are distinct (e.g., “Jones, El” and “Jones, E2”).

Alternatively, we might choose a new simple identification scheme (e.g., Practition-erNr), or identify practitioners by the combination of their original name and job. Reference schemes are discussed in detail in Section 5.3.

To preserve the distinction between the different kinds of practitioner, we introduced the entity type Job and constrained job names to the set {‘doctor’, ‘dentist’, ‘pharmacist’}. Such “value constraints” are discussed in detail in Section 6.3.

The new schema is simpler since it replaced three binary fact types with two binaries. If we had even more kinds of practitioner (e.g., acupuncturist, herbalist) the saving would be even more worthwhile. If we have only two kinds of practitioner (e.g., doctor and pharmacist) both schemas would have the same number of fact types. But even in this case, the new version is generally favored. If additional information is required for specific kinds of practitioners, subtyping should be added, as discussed in Chapter 6.

In rare cases, entity types might overlap but we are not interested in knowing this, and collapsing the types is awkward. We may then leave the types separate so long as we declare that our model differs from the real world in this respect.

In performing CSDP step 3, the relevant questions to ask are as follows. Here, the derivation rules that concern us are of an arithmetic nature. These are usually fairly obvious. Logical derivations can be harder to spot, and are considered in a later step.

  1. Can the same entity belong to two entity types? If so, combine the entity types into one (unless such identities are not of interest).

  2. Can entities of two different types be meaningfully compared (e.g., to compute ratios) ? Do they have the same unit or dimension? If so, either combine the entity types into one, or assign the units the same unit dimension.

  3. Is the same kind of information recorded for different entity types, and will you ever need to list the entities together for this information ? If so, combine the entity types, and if needed add a fact type to preserve the original distinction.

  4. Is a fact type arithmetically derivable from others? If so, add a derivation rule. If you include the fact type on the diagram, mark it with “*” if fully derived, or “+” if semiderived. If the derived facts are also stored, append a final asterisk.

In addition to verbalization and population, a third way to validate a model is to see whether it enables sample queries to be answered, either directly from, or by derivation on, the fact populations. If you know what kinds of questions the system must be able to answer, navigate around the ORM model to see if you can answer them. If you can’t, your schema is incomplete and you should add the fact types and/or derivation rules needed to answer the queries.

Exercise 3.5

Perform steps 1-3 of the CSDP for the following output reports. In setting out derivation rules, you may use any convenient notation.

1.
SoftwareDistributorRetailer
Blossom 1234InfoWarePCland
  SoftKing
SQL++TechSourcePCland
WordLightTechSourceInfoWare
  SoftKing

2.
ProjectManagerBudgetSalaryBirth year
P1Smith J38000500001946
P2Jones42000550001935
P3Brown20000380001946
P4Smith T36000420001950
P5Collins36000380001956

3.
DeptBudgetNrStaffEmpNrSalarySalary total
Admin8000002E0150000 
   E023500085000
Sales9000003E0340000 
   E0435000 
   E0540000115000
Service9000002E0655000 
   E073500090000

4.
EmployeeProjectHoursExpenses
E4P824400
E4P926300
E5P814200
E5P916220
E6P816240
E6P914220

5.
Female staffMale staff
NameDeptNameDept
Sue BrightAdminSue BrightAdmin
Eve JonesAdminEve JonesAdmin
Ann SmithSalesAnn SmithSales

6.The following excerpt is from the final medal tally for the 1996 Olympics. Only the top five countries are listed here.

  1. Schematize this using binary fact types only.

  2. Schematize this using a ternary fact type.

7.Perform CSDP steps 1-3 for this report (13 cm = 5.25 inch; 9 cm = 3.5 inch).
MediumCapacityYear introducedDisk price (USD)Cost per MB (USD)
13 cm floppy160 KB19812.6016.25
9 cm floppy720 KB19853.504.86
Zip100 MB199516.650.17
CD-R650 MB19961.790.003
DVD-R9.4 GB20027.890.0008

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.151.126