16.3. Conceptual Query Languages

An information system may be modeled at any of four levels: conceptual, logical, physical, and external. In principle, it may also be queried at any of these levels. In practice however, although conceptual modeling is widely used, it is comparatively rare for systems to be queried at the conceptual level. Instead, queries are typically formulated either at the external level using forms, at the logical level using a language such as SQL, or at the physical level using a programming language. This section briefly indicates some problems with these lower level approaches and explains how these problems are avoided by conceptual query languages, especially those based on ORM.

At the external level, Query-By-Form (QBF) enables users to enter queries directly on a screen form by entering appropriate values or conditions in the form fields. This form-based interface is well suited to simple queries where the scope of the query is visible on a single form and no complex operations are involved. However, this cannot be used to express complicated queries. Moreover, saved QBF queries may rapidly become obsolete as the external interface evolves. For such reasons, QBF is too restrictive for serious work.

For relational databases, SQL and Query-By-Example (QBE) are more expressive. However, complex queries and even queries that are easy to express in natural language (e.g., who does not speak more than one language?) can be difficult for nontechnical users to express in these languages. Moreover, an SQL or QBE query often needs to be changed if the relevant part of the conceptual schema or internal schema is changed, even if the meaning of the query is unaltered. Finally, relational query optimizers ignore many semantic optimization opportunities arising from knowledge of constraints.

Logical query languages for postrelational DBMS’s (e.g., object oriented and object relational) suffer similar problems. Their additional structures (e.g., sets, arrays, bags, and lists) often lead to greater complexity in both user formulation and system optimization. For example, Object Query Language (OQL) extends SQL with various functions for navigation, as well as composing and flattening structures, thus forcing the user to deal directly with the way the information is stored internally (Cattell and Barry 2000). At the physical level, programming languages may be used to access the internal structures directly (e.g., using pointers and records), but this very low level approach to query formulation is totally unsuitable for end users.

Given the disadvantages of query formulation at the external, logical, or physical level, it is not surprising that many conceptual query languages have been proposed to allow users to formulate queries directly on the conceptual schema itself. Most of these language proposals are academic research projects, with at best prototype tool support, and are typically based on ER schemas (e.g., ERQL, Super, Hybris) or deductive models (e.g., CBQL). By and large, these are challenging for nai’ve users, and their use of attributes exposes their queries to instability, since attributes may evolve into entities or relationships as the application model evolves. References for these are included in the chapter notes.

Some commercial tools allow users to enter queries directly in English, and then translate these into database queries in a language such as SQL or MDX. Early tools of this type often suffered from problems with ambiguity and expressibility, as well as the correctness of the SQL code generated. Most of the natural language input tools we’ve looked at require you to spend considerable effort setting up a dictionary to relate conceptual terms to the underlying database structures, and most of them do an imperfect job at handling natural language input. This is not really surprising, given the inherent ambiguity and complexity of natural language.

One approach to address the ambiguity problem is taken by English Query, which was included some years ago in Microsoft SQL Server. When you enter a question in English, it’s rephrased to indicate the interpretation taken. Sometimes more than one rephrasing is given, and you can pick the one that conveys your intended meaning.

There are of course limits on what kinds of questions can be understood, and you need to spend time setting up the connections between your semantic model and the database structures. For example, if you want to make a join between tables other than a foreign key to primary key reference, you need to explicitly add this beforehand. Also, the SQL that’s generated might not be as efficient as you might construct yourself.

However, once you’ve done all this work, you can make it possible for nontechnical end users to query the database in plain English. Although common, predictable queries are best made available through prewritten procedures, English Query is especially useful for ad hoc queries. For example, with reference to the data mart schema in the previous section, and assuming adequate setting up, the question “Which region has the most stores?” might generate the following SQL code:

select top 1 with ties dbo.Store.region as "Region", count(*) as "count"
from dbo.Store
where dbo.Store.region is not null
group by dbo.Store.region
order by 2 desc

In addition to returning the region(s) with the most stores, this query returns the number of stores in those regions, which we might not want. Moreover, the where clause and the order by clause are both redundant. But these are small issues for the nontechnical user who is incapable of formulating the SQL query. For OLAP applications, English Query can also be used to generate MDX queries. For example, referencing the SalesCube discussed in the previous section, the question “List the three items with the most unit sales” might result in the restatement “Show the three items with the highest total unit sales” and the following MDX query:

select { Measures.[UnitsSold] } on columns,
        topcount([Item].[ItemCode].members, 3, Measures.[UnitsSold]) on rows
from [SalesCube]

As these examples illustrate, free input in natural language has a lot of potential for opening up databases to anybody who can write. As language recognition technology matures, it will also open up databases to anybody who can speak. However, we still have a long way to go before this becomes a totally reliable technology. In particular, it can be very difficult to set up semantic dictionaries to always capture the intended interpretation of user questions.

These problems of ambiguity and onerous dictionary preparation can be eliminated in one stroke by taking a different approach: conceptual-schema-based queries. Here the user issues queries by selecting paths through the existing conceptual schema, rather than using free text input, so no ambiguity can arise. As discussed later, if an ORM conceptual schema is used, there is no need to predeclare any join paths, since ORM object types are the semantic domains on which all joins are based. Since ORM is attribute-free in its base conceptual model, it also avoids the instability problems of attribute-based queries, which require reformulation when attributes evolve into relationships (see later example). Moreover, ORM’s sole data structure is the relationship type, providing a simple, sentence-based framework. For such reasons, we believe that ORM conceptual query technology has the greatest potential of any existing approach.

The first significant ORM-based query language was Reference and Idea Language (RIDL), a hybrid language with both declarative and procedural aspects (Meersman 1982a; Meersman et al. 1984). RIDL is a textual language for defining, populating, and querying models in NIAM, an early version of fact-oriented modeling. It was developed at the Control Data research laboratory in the University of Brussels, and was the first truly conceptual query language ever implemented. Let’s look at some RIDL examples. Consider a military conceptual schema with the following fact types: Officer has Rank; Officer was born in Year. In RIDL, these fact types would normally be declared as: Officer having Rank; Officer born-in Year. To find which sergeants were born before 1950, the following RIDL query may be used:

LIST Officer (having Rank ‘sergeant’ AND born-in Year < 1950)

In RIDL, reserved words are in uppercase. To illustrate the procedural side of RIDL, the following query lists pairs of officers of the same rank who were born before 1950:

FOR EACH o1, o2 IN Officer born-in Year < 1950 DO
    IF Rank OF o1 = Rank OF o2
       THEN LIST o1, o2
    END-IF
END-FOR

The original RIDL software mapped RIDL queries into internal statements that were directly executable on some early Control Data computers. RIDL was developed around the same time that SQL systems started to appear on the market. If you compare the aforementioned queries with equivalent SQL queries, you can see that RIDL was far ahead of its time. In later years, the data definition part of RIDL was implemented graphically in the RIDL* tool, but the query component was not supported. Another ORM query language is Language for Information Structure and Access Descriptions (LISA-D). Although very expressive, it is technically challenging for end users, and currently lacks tool support (ter Hofstede et al., 1996).

Like ORM, the Object-oriented Systems Modeling (OSM) approach avoids the use of attributes as a base construct. An academic prototype has been developed for the graphical query language OSM-QL based on this approach (Embley et al. 1996). For any given query, the user selects the relevant part of the conceptual schema, and then annotates the resulting subschema diagram with the relevant restrictions to formulate the query. Negation is handled by adding a frequency constraint of “0”, and disjunction is introduced by means of a subtype-union operator. Projection is accomplished by clicking on the relevant object nodes and then on a mark-for-output button.

Another ORM query language is ConQuer (the name derives from “CONceptual QUERy”). ConQuer is more expressive than OSM-QL, easier for novice users. Its tool implementations transform conceptual queries into SQL queries for a variety of back-end DBMSs, and do not require the user to be familiar with the conceptual schema or the ORM diagram notation. The first version of ConQuer was released in InfoAssistant (Bloesch and Halpin 1996). Feedback from this release led to the redesign of both the language and the user interface for greater expressibility and usability, resulting in a tool called ActiveQuery (Bloesch and Halpin 1997). Typical queries can be constructed by just clicking on objects with the mouse and adding conditions. Owing to acquisitions, the ActiveQuery technology is now owned by Microsoft.

A basic understanding of the ConQuer technology will help provide insights into what a truly conceptual query environment can offer. The ConQuer language and its tool support were designed for expressibility, clarity, simplicity, semantic stability, and semantic relevance. These design principles were explained in Section 3.1. A ConQuer query can be applied only to an ORM schema. Using a software tool, an ORM schema may be entered directly, or instead reverse engineered from an existing logical schema (e.g., a relational or object-relational schema). While reverse engineering is automatic, some refinement by a human improves the readability (e.g., the default names generated for predicates are not always as natural as a human can supply).

Although ConQuer queries are based on ORM, users don’t need to be familiar with ORM or its notation. A ConQuer query is set out in textual (outline) form (basically as a tree of predicates connecting objects) with the underlying constraints hidden, since they have no impact on the meaning of the query.

With ActiveQuery, a user can construct a query without any prior knowledge of the schema. On opening a model for browsing, the user is presented with an object pick list. When an object type is dragged to the query pane, another pane displays the roles played by that object in the model. The user drags over those relationships of interest. Clicking an object type within one of these relationships causes its roles to be displayed, and the user may drag over those of interest, and so on. In this way, users may quickly declare a query path through the information space, without any prior knowledge of the underlying data structures. Users may also initially drag across several object types. The structure of the underlying model is then used to automatically infer a reasonable path through the information space: this is called a point-to-point query.

Items to be displayed are indicated with a check mark “✓”. These check marks may be toggled on/off as desired. The query path may be restricted in various ways by the use of operators and conditions. As a simple example, suppose a company has branch offices in several cities around the world. Now consider the query: List each employee who lives in the city that is the location of branch 52. This may be set out in ConQuer thus:

This implicit form of the query may be expanded to reveal the reference schemes (e.g., EmployeeNr, BranchNr), and an equals sign may be included before “52”. Since any ConQuer query corresponds to a qualified path through an ORM schema, where all the object types and predicates are well defined, the meaning of the query is essentially transparent. ActiveQuery also generates an English verbalization of the query in case there is any doubt. This ensures semantic clarity.

Since ORM object types are semantic domains, they act as semantic “glue” to connect the schema. This facilitates not only strong typing but also query navigation through the information space, enabling joins to be visualized in the most natural way (recall Section 4.6). The underlying ORM schema fragment for query Ql is shown in the shaded path within Figure 16.5, along with the identification scheme for City.

Figure 16.5. Only the shaded path is relevant to query Q1.


Notice how City is used as a join object type for this query. If attributes are used instead, the query formulation becomes more cumbersome. If composite attributes are allowed we might use List Employee.empNr where Employee.city = Branch.city and Branch.branchNr = 52. If not, we might resort to List Employee.empNr where Employee.cityName = Branch.cityName and Employee.stateCode = Branch.stateCode and Employee.country = Branch.country and Branch.branchNr = 52. Apart from awkwardness, both of these attribute-based approaches violate the principle of semantic relevance. Since the identification scheme of City is not relevant to the question, the user should not be forced to deal explicitly with it.

Even if we had a tool that allowed us to formulate queries directly in ER or 00 models, and this tool displayed the attributes of the current object type for possible assimilation into the query (similar to the way ActiveQuery displays the roles of the highlighted object type), this would not expose immediate connections in the way that ORM does. For example, inspecting Employee.city does not tell us that there is some connection to Branch.city. At the relational level, we have two tables Employee (empNr. countryCode, stateCode, cityName,...) and Branch (branchNr, countryCode, stateCode, cityName,...). Since there are no foreign key references to relate the city attributes in these tables, free text query tools rely on the join connection being explicitly entered into the dictionary beforehand, which is a risky assumption. The only way to automatically reveal all join possibilities is to use the domains themselves as a basis for connectedness, and this is one of the distinguishing features of ORM.

Because ConQuer queries are based on ORM, they continue to produce the desired result as long as their meaning endures. In other words, you never need to change a ConQuer query if the English meaning of the question still applies. In particular, ConQuer queries are not impacted by typical changes to an application, such as the addition of new fact types or changes to constraints or the relative importance of some feature. This ensures semantic independence (i.e., the conceptual queries are independent of changes to underlying structures when those changes have no effect on meaning). This results in greater semantic stability than attribute-based approaches.

As a simple example, suppose that in our model each employee is identified by an employee number and has exactly one gender and at most one title. In the real world, an employee may have many titles (e.g., ‘Dr.’, ‘Prof.’, ‘Sir’), but we decide to record at most one of these.

Now suppose we want to list the titled employees and their gender. An ORM schema and ConQuer query for this are shown in Figure 16.6(a). There is an implicit and between the two branches stemming from Employee. The rest of the figure shows an equivalent UML class diagram, relational schema, and SQL query. Not only is the SQL code subconceptual (nulls are an implementation detail) but it is unstable, since a simple change to a conceptual constraint on the title relationship requires the code to be changed as well.

Figure 16.6. Schemas and queries to list titled employees and their gender.


For example, suppose we change our business rules to allow the recording of many titles for the same employee, and we also decide to record any required orders for listing titles (e.g., the combination ‘Prof. Dr.’ should be listed that way, not ‘Dr. Prof.’). From an ORM viewpoint, this new situation is depicted in Figure 16.7(a). The uniqueness constraint on the title association is relaxed from n:1 to m:n, and a title precedence fact type is added. These schema changes have no impact on the ConQuer query, whose meaning relates just to the shaded fact types, not to their constraints and not to the additional fact type. Since the shaded path has unchanged semantics, the original ConQuer query means the same, so it may be reused.

Figure 16.7. ORM has greater semantic stability than attribute-based approaches.


For attribute-based approaches, the business change requires major changes to both the model and the query. The rest of Figure 16.7 shows the new UML schema, the new relational schema, and the new SQL query. Substantial changes are needed because facts about titles are now stored in an association rather than an attribute. Another case forcing facts modeled with title and even gender attributes to be remodeled with associations was discussed in Section 10.1, to record restrictions of some titles (e.g., ‘Sir’) to specific genders. In contrast, the ConQuer query stays valid; all that changes is the SQL code that gets automatically generated from the query.

An OO query approach is often more problematic than an ER or relational approach because of the many extra choices on how facts are grouped into structures, and the user is exposed to these structures in order to ask a question. Moreover these structures may change drastically to maintain performance as the business domain evolves.

In the real world, UoD changes often occur, requiring changes to the database structures. Even more work is required to modify the code for stored queries. If we are working at the logical level, the maintenance effort can be very significant. We can minimize the impact of change to both models and queries by working in ORM at the conceptual level and letting a tool look after the lower-level transformations.

The ActiveQuery tool generates semantically optimized SQL queries, wherever possible using knowledge of ORM constraints to produce more efficient code. For example, if the title role were actually mandatory, the SQL generated by the previous ConQuer query would, in both cases, simplify to select empNr, gender from Employee. The tool can also generate different code for different back ends to cater for their different support for the SQL standard, sometimes generating chains of queries over temporary tables to enable the query to run at all.

The simple examples just given illustrate how ConQuer achieves semantic clarity, relevance, and stability. Let’s look briefly at its semantic strength or expressibility. The language supports the usual comparators (=, <, in, like, etc.), logical operators (and, or, not), and bag functions (count, sum, etc.), as well as a modal operator (possibly) for conceptual left outer joins. Subtype/supertype connections appear as “is” predicates.

For example, assuming that Manager is a subtype of Employee in the ORM model, the following query Q2 asks: List each manager and their cars (if any) where the managers do not work at branch 52.

The language is fully orthogonal and supports arbitrary correlation, using subscripted variables (e.g., Person1, Person2) when necessary to distinguish different instances of the same object type. As a simple correlation example, consider the query “Who supervises an employee who lives in the same city as the supervisor but was born in a different country from the supervisor?” Assuming birth and supervision fact types in the ORM model, query Q3 shows one way to express this in ConQuer.

When an object type appears more than once, ActiveQuery automatically appends subscripts to distinguish the occurrences. You can equate instances by equating their subscripts (e.g., City1). More generally, you can use comparators to compare instances (e.g., Country2 <> Country]). Try this in SQL. It’s not that hard, but you have to admit it’s easier in ConQuer!

As a final example, suppose the ORM model includes the m:n fact type Employee has Rating. Now consider the query “List each branch and those of its employees whose maximum individual rating exceeds the average of his/her branch” (remember that an employee may have many ratings). Query Q4 shows how to do this in ConQuer, illustrating the use of bag functions in for-clauses.

At the relational level, two tables are needed: Employee (empNr, branchNr,...); Achieves (empNr, rating). The SQL for this query is shown here. This is tricky, especially the final correlation between branch numbers, and even experienced SQL programmers might have difficulty with it. The equivalent ConQuer query would be easy for most end users who have even minimal experience with the language. This gives some indication of the potential of a language such as ConQuer for empowering end users.

select max(X1 .branchNr), X2.empNr
from   Employee as X1, Achieves as  X2
where X1 .empNr = X2.empNr
group by X2.empNr
having max(X2.rating) >
    (select avg(X4.rating)
    from Employee as X3, Achieves as X4
    where X3.empNr = X4.empNr
      and X3.branchNr = X1 .branchNr)

ActiveQuery allows you to define derived predicates, and store these definitions. These derived predicates (or “macros”) can then be used just like base predicates in other queries. A subtype may be thought of as a derived object type, with its definition provided by a ConQuer query. For example, the population of the subtype MalePatient can be obtained by executing the following ConQuery query: Find each Patient who is of Gender ‘M’. Given its generality, ConQuer could be adapted to provide a very high language for specifying derivation rules.

A constraint may be viewed as a check that a query searching for a violation of the constraint returns the null set. Hence constraints may also be expressed in terms of queries. Various high level constructs can be provided in the language to make it more natural than the not-exists-check-query form provided in SQL. Although there is no room here to go into detail, it should be clear that this approach is quite powerful.

ORM’s rich graphical constraint notation still needs to be supplemented by a textual language to provide a complete coverage of the kinds of constraints and derivation rules found in business domains. While ORM’s graphical constraints can be verbalized automatically, input textual languages such as ConQuer could be used to define other business rules mapped automatically to SQL. Although the ActiveQuery tool is currently no longer available, work on the NORMA tool is under way to extend the FORML textual language for ORM with the expressive power of ConQuer. This should result in a very high level textual language for both conceptual modeling (including fact types, constraints and derivation rules) and conceptual queries.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.196.175