Overusing Unstructured Data

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Now, to use this data, you have to indicate the table you want to join to, which is very much an unnatural way to do a join. You can use a universally unique GUID key so that all references to the data in the table are unique, eliminating the need for the specifically specified related table name. However, I find when this method is employed if the RelatedTableName is actually used, it is far clearer to the user what is happening.

A major concern with this method is that you cannot use constraints to enforce the relationships; you need either to use triggers or to trust the middle layers to validate data values, which definitely increases the costs of implementation/testing, since you have to verify that it works in all cases, which is something we trust for constraints; even triggers are implemented in one single location.

One reason this method is employed is that it is very easy to add references to the one table. You just put the key value and table name in there, and you are done. Unfortunately, for the people who have to use this for years and years to come, it would have just been easier to spend a bit longer and do some more work, because the generic relationship means that using a constraint is not possible to validate keys, leaving open the possibility of orphaned data.

A second way to do this that is marginally better is to just include keys from all tables, like this:

CREATE TABLE JournalEntry
(
     JournalEntryId  <int or uniqueidentifier> PRIMARY KEY,
     SalesOrderId <int or uniqueidentifier> NULL REFERERENCES
                                                   SalesOrder(SalesOrderId),
     TroubleTicketId <int or uniqueidentifier> NULL REFERERENCES
                                             TroubleTicket(TroubleTicketId),
     <other columns>
);

This is better, in that now joins are clearer and the values are enforced by constraints, but now, you have one more problem (that I conveniently left out of the initial description). What if you need to store some information about the reason for the journal entry? For example, for an order, are you commenting in the journal for a cancelation notice?

Extending the design, it seems like a decent idea that one JournalEntry might relate to more than one SalesOrder or JournalEntry. So, the better idea is to model it more like Figure 8-20.

Figure 8-20. Objects linked for maximum usability/flexibility

CREATE TABLE JournalEntry
(
     JournalEntryId  <int or uniqueidentifier> PRIMARY KEY,
     <other columns>
)
CREATE TABLE SalesOrderJournalEntry
(
     JournalEntryId <int or uniqueidentifier>
                      REFERENCES JournalEntry(JournalId),
     SalesOrderId <int or uniqueidentifier>,
                      REFERENCES SalesOrder(SalesOrderId),
     <SalesOrderSpecificColumns>
     PRIMARY KEY (JournalEntryId, SalesOrderId)
)
CREATE TABLE TroubleTicketJournalEntry
(
     JournalEntryId <int or uniqueidentifier>
                      REFERENCES JournalEntry(JournalId),
     TroubleTicketId <int or uniqueidentifier>,
                      REFERENCES TroubleTicket (TroubleTicketId),
     <TroubleTicketSpecificColumns>
     PRIMARY KEY (JournalEntryId, SalesOrderId)
)

Note that this database is far more self-documented as well, though it does make it harder to implement. You can easily find the relationships between the tables and join on them. Yes, there are a few more tables, but that can play to your benefit as well in some scenarios, but most important, you can represent any data you need to represent, in any cardinality or combination of cardinalities needed. This is the goal in almost any design.

Overusing Unstructured Data

As much as I would like to deny it, or at least find some way to avoid it, people need to have unstructured notes to store various bits and pieces of information about their data. I will confess that a large number of the systems I have created in my career included some column that allowed users to insert freeform text. In the early days, it was a varchar(256) column, then varchar(8000) or text, and now varchar(max). It is not something that you can get away from, because users need this scratchpad just slightly more than Linus needs his security blanket. And it is not such a terrible practice, to be honest. What is the harm in letting the user have a place to note some bit of information about their data? I know I personally have tons of OneNote notebooks with lots of unstructured data.

However, if given too much leeway, or too many generic buckets for text strewn about in a database, far too often what happens is that notes become a replacement for doing actual design. The notes section becomes a replacement for things that ought to be a full-blown column. Should we have a column for special dietary restrictions? Nah, just put it in the notes column. Once the users do something once and particularly find it useful, they will do it again. And they tell their buddies, “Hey, I have started using notes to indicate that the order needs processing. Saved me an hour yesterday.” And then it costs the programmers 200 hours sorting out unstructured data.

Probably the most common use of this I have seen that concerns me is contact notes. I have done this myself in the past, where I have a column that contains formatted text something like the following on a Customer table. Users can add new notes but usually are not allowed to go back and change the notes.

ContactNotes
--------------------------------------------------------------------------------
2008-01-11 – Stuart Pidd -Spoke to Fred on the phone.  Said that his wangle was
broken, referencing Invoice 20001.  Told him I would check and call back
tomorrow.
2008-02-15 – Stuart Pidd – Fred called back, stating his wangle was still
broken, and now it had started to dangle.  Will call back tomorrow.
2008-04-12 – Norm Oliser – Stu was fired for not taking care of one of our best
customers.
--------------------------------------------------------------------------------

This generally is not the best solution to the problem, even for a very small organization. The proper solution is almost certainly to take this data that is being stored into this text column and apply the rigors of normalization to it. Clearly, in this example, you can see three “rows” of data, with at least three “columns.” So instead of having a Customer table with a ContactNotes column, implement the tables like this:

CREATE TABLE Customer
(
      CustomerId   int   CONSTRAINT PKCustomer PRIMARY KEY
      <other columns>
)
CREATE TABLE CustomerContactNotes
(
       CustomerId  int,
       NoteTime     datetime,
       PRIMARY KEY (CustomerId, NoteTime),
       UserId  datatype, --references the User table
       Notes varchar(max)
)

You might even stretch this to the model we discussed earlier with the journal entries where the notes are a generic part of the system and can refer to the customer, multiple customers, and other objects in the database. This might even link to a reminder system to remind Stu to get back to Fred, and he would not be jobless. Though one probably should have expected such out of a guy named Stu Pidd (obviously).

Even using XML to store the notes in this structured manner would be an amazing improvement. You could then determine who entered the notes, what the day was, and what the notes were, and you could fashion a UI that allowed the users to add new fields to the XML, right on the fly. What a tremendous benefit to your users and, let’s face it, to the people who have to go in and answer questions like this, “How many times have we talked to this client by phone?”

The point of this section is simply this: educate your users. Give them a place to write the random note, but teach them that when they start to use notes to store the same specific sorts of things over and over, their jobs could be easier if you gave them a place to store their values that would be searchable, repeatable, and so on. Plus, never again would you have to write queries to “mine” information from notes.

Tip SQL Server provides a tool to help search text called Full Text Search. It can be very useful for searching textual data in a manner much like a typical web search. However, it is no replacement for proper design that makes a different column and row from every single data point that the users are typically interested in.

Summary

This chapter was dedicated to expanding the way you think about tables and to giving you some common solutions to problems that are themselves common. I was careful not to get too esoteric with my topics in this chapter. The point was simply to cover some solutions that are a bit beyond the basic table structures I covered in earlier chapters but not so beyond them that the average reader would say “Bah!” to the whole chapter as a waste of time.

The following are the “good” patterns we covered:

Uniqueness: Simple uniqueness constraints are often not enough to specify uniqueness for “real” data. We discussed going deeper than basic implementation and working through uniqueness scenarios where you exclude values (selective uniqueness), bulk object uniqueness, and discussed the real-world example of trying to piece together uniqueness where you can’t be completely sure (like visitors to a web site).
Data-driven design: The goal is to build your databases to be flexible enough that adding new data to the database that looks and acts like previous values does not require code changes. You do this by attempting to avoid hard-coded data that is apt to change and making columns for typical configurations.
Historical/temporal: Often the user needs to be able to see their data at previous points in time as it has changed over time. I presented strategies you can use to view your data at various points in history. Using a trigger gives you a lot of control, but the new temporal extensions in SQL Server 2016 are very awesome if you can live with the (not terribly large) constraints.
Hierarchies: We discussed several methods of implementing hierarchies, using simple SQL constructs to using hierarchyId, and introduced the different methods that have been created to optimize utilization with a bit of reduction in simplicity.
Large binary data: This pertains particularly to images but could refer any sort of file that you might find in a Windows file system. Storing large binary values allows you to provide your users with a place to extend their data storage.
Generalization: Although this is more a concept than a particular pattern, we discussed why we need to match the design to the users’ realistic needs by generalizing some objects to the system needs (and not to our nerdy academic desires).

We finished up with a section on anti-patterns and poor design practices, including some pretty heinous ones:

Undecipherable data: All data in the database should have some sort of meaning. Users should not have to wonder what a value of 1 means.
One domain table to cover all domains: This is yet another normalization issue, because the overarching goal of a database is to match one table with one need. Domain values may seem like one thing, but the goal should be that every row in a table is usable in any table it is relatable to.
Generic key references: It is a very common need to have multiple tables relate to another. It can also be true that only one table should be related at a time. However, every column should contain one and only one type of data. Otherwise, users have no idea what a value is unless they go hunting.
Overusing unstructured data: Basically, this hearkens back to normalization, where we desire to store one value per column. Users are given a generic column for notes regarding a given item, and because they have unplanned-for needs for additional data storage, they use the notes instead. The mess that ensues, particularly for the people who need to report on this data, is generally the fault of the architect at design time to not give the users a place to enter whatever they need, or to be fair, the users changing their needs over time and adapting to the situation rather than consulting the IT team to adjust the system to their ever changing needs.

Of course, these lists are not exhaustive of all of the possible patterns out there that you should use or not use, respectively. The goal of this chapter was to help you see some of the common usages of objects so you can begin to put together models that follow a common pattern where it makes sense. Feedback, particularly ideas for new sections, is always desired at [email protected].

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Overusing Unstructured Data

Create new playlist

Sign In

Sign Up

Table of Contents for
Overusing Unstructured Data