Expression Language Patterns
The expression language in SSIS might appropriately be referred to as the glue that holds the product together. Expressions in SSIS provide a relatively simple and easy-to-use interface to allow data developers to introduce dynamic logic into the ETL infrastructure. When you think through the various moving parts within Integration Services, you’ll likely discover that you can manipulate all of them in one way or another by using expressions.
Expressions provide a fast, effective, and, dare I say, fun way to solve specific ETL challenges. In this chapter, we’ll look into some of the basics of the expression language and I’ll describe a few instances where SSIS expressions are ideal (and a few where they might not be) for effectively solving difficult ETL problems.
Getting to Know the Expression Language
Before we dive into the design patterns surrounding the SSIS expression language, let’s spend a little time defining and becoming familiar with the nuances of the language. Reviewing the language-specific patterns can help you get up to speed and use the language correctly.
What Is the Expression Language?
The SSIS expression language is an interpreted language that has been built into the SSIS runtime environment. This specialized language is used to craft scalar-valued snippets of code (individually referred to as expressions) that you may use at various points within the SSIS environment.
The SSIS designer exposes dozens of interfaces where expressions can be used in place of hard-coded values, allowing the BI professional to leverage that flexibility to create dynamic and reusable elements within SSIS. Conceptually, it’s not unlike the product-specific dialects that exist in other Microsoft development environments. For example, when you are developing reports in SSDT or BIDS that you want to deploy to SQL Server Reporting Services, you can use Visual Basic for Applications (VBA) code to generate dynamic behavior during report execution and rendering.
As you explore the expression language, you’ll find it to be a very powerful addition to the natural capabilities of SQL Server Integration Services. It has a rich library of functionality that will be familiar to both developers and DBAs. Among the functional domains of the SSIS expression language are
The expression language serves two different roles within the package life cycle:
Expressions in SSIS may derive their comparisons or assignments from several fronts. Built-in system variables permit visibility into software environmental data such as package and container start times, machine environment information, package versioning metadata, and more. You can interrogate and manipulate the values of user-defined SSIS variables by using expressions and access values of package parameters to be leveraged elsewhere during package execution. In the data flow, expressions may interact with running data at the cellular level.
Expressions are value-driven at runtime. Unlike settings that are generally only configurable at design time (think data flow column definitions), expressions will calculate their values when the package is actually executed. Furthermore, a single expression may be evaluated many times (perhaps with a different result each time) during the execution life cycle of the package. Consider the case of the ForEach Loop, a container that loops through a specified set of objects or values until it reaches the end of said collection. Expressions that are manipulated within the loop may be updated dozens or even hundreds of times during this process.
Why Use Expressions?
The ability to use expressions is one of the greatest strengths of SQL Server Integration Services. Simply put, expressions help to fill in the small gaps. The expression language isn’t a tool in itself, but rather, it is an interface that helps other SSIS tools more effectively perform their respective functions. That’s all well and good, but in the interests of simplicity, why would an ETL developer choose to use expressions instead of other languages such as T-SQL, C#, or VB.NET? Here are a few compelling reasons to employ the expression language in your SSIS packages:
I’ve done a number of presentations for novice SSIS developers, and when I bring up the topic of the expression language, one question almost always seems to come up: “Where do I use this expression language stuff?” My answer: “Everywhere!” Part of the beauty of expressions is that they can be used almost anywhere within SSIS packages. You can employ expressions on the control flow in precedence constraints. It’s convenient to make your SSIS package variables dynamic by replacing their static values with expressions. You can leverage expressions within the data flow to manipulate data and even control the execution path. The bottom line is that you can manipulate many of the common properties of packages, tasks, constraints, and data flow elements by using expressions.
Although its syntax may seem unusual, the expression language isn’t difficult to learn. Anyone with logical scripting experience (even if that experience is limited to T-SQL) can quickly pick up on the basics and should be able to master the language with a reasonable amount of practice.
Language Essentials
Even for those who have experience scripting in other Microsoft development environments, the first exposure to the SSIS expression language can be a little unsettling. The syntax and functionality are unlike any other language, either interpreted or compiled. It appears to be a strange hybrid of several languages and is certainly a dialect all of its own.
Developers who have spent time using the C-style languages (C, C++, C#, Java) will recognize some of the syntactical nuances within the expression language:
Similarly, anyone experienced in T-SQL will find a great deal of familiar behavior within the SSIS expression language:
The SSIS expression language is quite powerful, with its wide variety of functions and operators. With native behavior including equality tests, type casts, string manipulation, and date arithmetic, the use of expressions within SSIS packages can help to overcome ETL challenges both large and small.
As useful as the expression language is, there are a few key limitations to its use. Bear in mind that these are relatively minor hang-ups; the SSIS expression language is not intended to be a full-featured programming language, but rather a lightweight tool to supplement the behavior of existing SSIS tasks and components. Among some of the challenges are the following:
Listing 10-1. Multiconditional Evaluation in T-SQL
SELECT CASE WHEN @TestCase = 3 THEN 'Test case = Solid'
WHEN @TestCase = 2 THEN 'Test case = Liquid'
WHEN @TestCase = 1 THEN 'Test case = Gas'
ELSE 'Unknown test case' END [TestCaseType]
Listing 10-2. Multiconditional Evaluation in the Expression Language
(TestCase == 1) ? "Test case = Gas" : (TestCase == 2 ? "Test case = Liquid" : (TestCase == 3 ? "Test case = Solid" : "Unknown Test Case"))
Despite its minor shortcomings, the SSIS expression language remains an integral part of the product, and as you’ll see later in this chapter, it has some very practical uses in a well-designed ETL ecosystem.
Putting the Expression Language to Work
Now that you understand what the expression language is (and is not), let’s talk about some design patterns where you might use it.
Although not as common as other uses, it is possible to use SSIS expressions to configure package-level properties. Here are a handful of properties that may be set at the package level by using expressions:
Consider the example of MaxConcurrentExecutables, which defines how many executables (packages, tasks, etc.) can run concurrently. By setting this property through an expression, the ETL developer would be able to dynamically control this value based on any criteria visible through an expression.
Although these properties are configurable by using expressions, it’s far more common to find package-level options set by using package parameters (with later versions of SSIS) or package configurations (SQL Server 2008 and earlier). It is usually best to share common values across package ancestries using parameters or configurations, which allow you greater flexibility and easier maintenance. I expose this particular design pattern more for the purpose of identifying it as an antipattern than for defining parameters for its use. Unless there’s some business case or regulation dictating otherwise, it’s a better long-term solution to externalize these values rather than rely on expressions.
As shown in Figure 10-1, you can configure each variable with a static value in the Value field or define a value expression that will be evaluated at runtime. Note that the variable window was improved starting in SQL Server 2012—in older versions, static values were shown in the Variable window, but you had to use the Properties window to view or alter an expression for a variable.
Figure 10-1. Expressions with variables
In practice, I often see expressions applied to variable values, and then the resulting variable is used as a property on a task or component (as opposed to it being used as an expression to set the property directly). I’m a fan of this design pattern for one simple reason: reusability. It’s not uncommon for components to share certain properties, and building expressions on each of those shared properties for every applicable component is both redundant and unnecessary. For those properties that will be shared across multiple tasks or components, it’s far easier to centralize the expression logic into a variable and then use that variable to set the shared properties. This approach allows for faster development as well as easier maintenance should the logic require changes down the road.
When using this design pattern, don’t forget that you can also “stack” variable values. In the expression statement, you can leverage other variables to set the value of the current variable.
Connection Managers
One of the most practical and common places to use SSIS expressions is the Connection Managers tray. Generally speaking, it’s typically preferable to store dynamic connection properties not in expressions, but rather as parameters, particularly when you’re dealing with structured data. Because of the sensitive and frequently changing nature of connection metadata (server names, user names, and passwords), most ETL professionals choose to externalize those settings to keep them stored securely and externally to the package so they can be globally changed (rather than modified package by package).
One recurring exception to this pattern is connections that interact with the file system. There are several cases where using expressions helps to lighten the load of processing file-based sources or destinations:
For these cases, you can use a little dab of expression language to dynamically build directory paths and file names that your connections in SSIS will use. For this example, let’s assume that you’re generating a flat text file from within your package, and you want to use a dynamic file name based on the current date. By setting the ConnectionString property from within the Properties window of the instance of the Flat File Connection Manager, you can manipulate the runtime value of the file name. As shown in Figure 10-2, you’re specifying the base file name and then appending the elements of the current date to build a customized file name.
Figure 10-2. Dynamic file name using expression
Note that the pattern just discussed could be further extended to include elements of time (hours/minutes/seconds) should your ETL requirements include a restraint for that level of granularity.
Since we’re not going in-depth into all the syntactical elements of the expression language, I’ll just point out a couple of things I’ve done here:
Remember that this pattern is highly flexible. It can be utilized with almost any file connection, whether it’s to be used as a source or a destination. You’re not limited to just flat file connections here either; you can extend this logic to some of the other SSIS connections as well. I’ve used this same design pattern when dealing with FTP data as both a source and destination. By embedding the same logic within the properties of an FTP source, you can programmatically “walk” the directory structure of a remote server when it is in a known and predictable format such as this.
Project-Level Connection Managers
When working with projects in the catalog deployment model, you can expose connection information across multiple packages in the same project by way of project connection managers. When you are using older versions of SSIS (2008 or earlier), or when you are working in the package deployment model in later versions of SSIS, any connection manager defined within a package is independent of those in other packages. Starting with SSIS in SQL Server 2012, however, you now have the ability to attach a connection manager to your workspace at the project level. These are accessible to all packages within the same project.
We’ll not go deeply into the new deployment model in this chapter, but it is important to point out how the use of expressions impacts project connection managers. Because they are attached to the project and not one particular package, the properties of these shared connections are common to all packages in the project. As such, any property setting on these project connection managers—including the use of expressions—would be immediately reflected in all packages in the project. This is a welcome and much needed improvement to the way packages interact with one another, but for those of us who have worked with previous versions of SSIS, it’s a bit of a paradigm shift. Don’t get caught off guard when an expression applied to a project connection in one package gets applied to the other packages in the project!
Control Flow
Within the control flow, there are a couple of different ways to implement SSIS expressions. First, each of the tasks and containers will expose several properties that are configurable using expressions. In addition, the paths between them (known as precedence constraints) allow ETL developers to customize the decision path when moving from one task/container to another.
Conditional Execution Through Expressions and Constraints
The essential function of the control flow is to manage the execution of package elements. By using precedence constraints, you can design a package so that tasks and containers are fired in the proper order and with the correct dependencies intact. For a simple example of this, think about a package that truncates and then loads a staging table. You can perform both of these tasks in the same package, but without a precedence constraint to cause the insert operation to occur after the TRUNCATE TABLE execution, you run the risk of inadvertently loading and then deleting the same data.
You can configure precedence constraints to manage flow-based successful completion of the preceding task (the default behavior), or you may set them to cause the task to execute only if the preceding task fails. In addition, you can set the constraint to Completion, allowing the downstream task to fire when the upstream task is finished, regardless of its outcome. Tasks may have multiple precedence constraints, and you may set these so that any or all of them must be satisfied before the task to which they are attached will execute. Figure 10-3 shows a fairly typical use of precedence constraints; note that the unlabeled arrows represent Success constraints, and the others are labeled as to their purpose. The dashed lines indicate that the task is configured to execute upon completion of either of the preceding tasks.
Figure 10-3. Precedence constraints
As useful as precedence constraints are, the domain of variability that they address is fairly limited: the only conditions that can be tested are whether a task completed as well as the success or failure of said task. In the brief example shown in Figure 10-3, you can probably infer that I’m downloading one or more files from an external source, loading the data from those files into staging tables, and then merging (upserting) the data into a database table. Although there’s nothing technically wrong here, there is room for improvement. For example, what happens if there are no files to be processed? In the example shown, the truncation of the staging table, the loop through the file system to find the downloaded files (even if none exist), and the merge operation will all be executed even if there are no files to process.
In the first job I ever held, I was responsible for, among other things, gathering stray shopping carts from the store parking lot and bringing them back inside. My boss once told me, “This job requires an excessive amount of walking, so do what you can to save steps.” All these years later, that advice still holds true. Why run through extra steps when you can simply skip past them if they are not needed? For the previous example, you can include a relatively simple expression to bypass the execution of the majority of the package when no files are found to process. Saving those steps saves CPU cycles, disk I/O, and other resources.
Precedence constraints also have the ability to use expressions to enforce proper package flow. In Figure 10-4, you’ll see that the evaluation operation is set to Expression to enforce both the execution value of the prior task as well as the value defined in the Expression box. For illustration purposes, assume that you’ve populated an SSIS variable to store the number of files downloaded in the Script task operation, and you’re using the expression to confirm that at least one file was processed. From here, you can either type the expression into the window manually or use the ellipsis button to open the Expression Editor (note that in earlier versions (2008 and earlier) of the product, you will have to enter the expression by hand without the benefit of the Expression Editor).
Figure 10-4. Precedence Constraint Editor
Refer back to the original package; you’ll see that the precedence constraint between the first Script task and the truncation SQL task now reflects the presence of an expression in the constraint (Figure 10-5).
Figure 10-5. Expression notation in precedence constraint
It’s worth noting that the example in Figure 10-5 shows a non-standard notation on the constraint. By default, only the function icon (fx) will appear when you are using an expression as part of a constraint. Assuming that the expression is not a lengthy one, I typically change the ShowAnnotation option of the constraint to ConstraintOptions, which will include the expression itself on the label of the constraint. This is an easy reminder of the expression used in the constraint, and it doesn’t require opening the properties window to see the expression.
In addition to the control flow uses of expressions, most every task and container in SSIS has its own properties that can be configured using expressions. The options for configuration using expressions will vary from one executable to the next, but there are elements common to most tasks and containers:
A common design pattern using a task-level expression is to employ the SqlStatementSource property of the Execute SQL task. In most cases, you can use this task combined with query parameters to create dynamic statements in T-SQL. However, some language constructs (such as subqueries) don’t always work well with parameters, exposing a need to build the SQL string in code. By using an expression instead of static text for the SqlStatementSource property, the ETL developer can have complete control of the T-SQL statement when query parameters don’t fit.
Note There was a limit on string size in SQL Server 2008 Release 2 that’s been greatly relaxed in 2012.
Data Flow Expressions
As we move from the control flow into the data flow, we find the more traditional use of expressions as part of our ETL strategy. Like the higher-level executables, we find that every component in the data flow is affected either directly or indirectly by SSIS expressions.
Lightweight data cleansing is one of the most common uses of the expression language within the data flow of SSIS. Most frequently used within the derived column transformation, expressions can be used for certain cleanup tasks, including these:
Often, you can minimize the need for data cleansing in the data flow simply through well-designed query statements in the extraction from the various data sources. However, sometimes cleanup at the source is just not an option. Many sources of data are nonrelational: consider text files and web services as data sources, for example, which generally do not have the option of cleaning up the data before its arrival into the SSIS space. Sometimes even relational sources fit in this box: I’ve encountered a number of scenarios where the only interface to the data was through a predefined stored procedure that could neither be inspected nor changed by the ETL developer. For cases such as these where source cleansing is not possible, using expressions within the data flow is a good second-level defense.
One design pattern that I use frequently is to trim out extra whitespace and convert blank strings to NULL values. As shown in the following, such an operation could be performed with a single, relatively simple expression:
(LEN(TRIM([Street_Address])) > 0) ? TRIM([Street_Address]) : (DT_WSTR, 100)NULL(DT_WSTR, 100)
Regarding data cleansing using the expression language, I will offer a brief word of caution: if you find yourself needing to do complex, multistep cleansing operations within your SSIS packages, consider using some other means to do the heavy lifting. As I mentioned earlier, the expression language is best suited for lightweight data cleansing; because complex expressions can be difficult to develop and debug, you might instead use a richer tool such as the Script task or Script component, or perhaps Data Quality Services, for these advanced transformations.
Sometimes you will find that you need to create forks in the road with ETL data flow. There are several reasons why you might need to create such branches within your data flow:
Figure 10-6 exposes this design pattern by showing the use of expression logic to break apart a data stream into multiple outputs. In this case, you are processing a billing file by using comparison expressions within the conditional split transformation (see the callout) to determine whether each row is paid on time, not yet due, or past due, and then you’re sending it to the appropriate output accordingly.
Figure 10-6. Using expressions to define multiple paths
One interesting caveat regarding the application of expressions within the data flow is the way in which SSIS exposes component-level expressions. Although the expression language is very useful within the pipeline of the data flow, most components do not actually expose properties that can be set using expressions. For those that do allow expressions on certain properties, these expressions are surfaced as elements of the data flow itself and will appear as part of the options in the Data Flow Properties window while working in the control flow.
As shown in Figure 10-7, you are using the expression properties of the data flow to access the ADO.NET data source within that data flow. As you can see, the identifier in the Property column shows that this expression belongs to the data source within the data flow, allowing you to set the SqlCommand property of that source. It’s useful to note here that I used the ADO.NET source purposefully in this example. Since this source does not currently allow the use of parameters, setting the SqlCommand property is often an acceptable substitute for dynamically retrieving data from a relational database using this component.
Figure 10-7. Data flow expression
Application of Business Rules
Although they share some of the same methods, the applying business rules differs conceptually from data cleansing. For the most part, data cleansing is considered to be universal: misspelled words, inconsistent casing, extraneous spacing, and the NULL-versus-blank-versus-zero quandary are all common problems that must be dealt with in almost every ETL process. Business rules, on the other hand, are specific use cases in which data is manipulated, extrapolated, or discarded based on custom logic that is specific to the business at hand. These rules may be general enough to apply to an entire industry (healthcare billing workflows, for example) or as specific as the arrangement of data to suit the preferences of an individual manager.
Generally speaking, the use of expressions to apply business logic works best when limited to a small number of simple business rule cases. As mentioned earlier, the expression language is not ideal for multiple test conditions and therefore may not be ideally suited for multifaceted and complex business rules. For enterprise-level business rule application, consider other tools in SSIS, such as the Script component or Execute SQL task (for operations that can be performed at the relational database level), or perhaps a separate tool such as SQL Server Data Quality Services or Master Data Services.
CHOOSING BETWEEN COMPLEX EXPRESSIONS AND OTHER TOOLS
In my experience, the majority of uses of SSIS expressions involve short, simple expressions. Interrogating the value of a variable, modifying the contents of an existing column, comparing two values, and other similar operations tend to require relatively brief and uncomplicated logic as an SSIS expression. However, there are many cases where a short-and-sweet expression just won’t get it done.
In these cases of more complicated logic, is an SSIS expression still the best choice? In some instances, the answer is no. As mentioned earlier, there are instances in the ETL cycle where the expression language is ill-suited to solve the problem. In cases where the logic required involves complexity that exceeds that which is practical or convenient for the SSIS expression language, a common pattern is to engage a separate tool to address the problem at hand. Some of the other methods for handling these complex logical scenarios are as follows:
There are no hard-and-fast rules defining when an expression may not be the best solution. However, there are a few design patterns that I tend to follow when deciding whether to use an expression or some other tool when applying dynamic logic in my SSIS packages. Typically, I will avoid using expressions in situations where
The bottom line is that not every ETL challenge within SSIS should be solved using expressions. The expression language was intended as a lightweight solution, and used in that context, it is an outstanding supplement to the product line. Try to think of SSIS expressions as spackle; small, light, elegant, and used pervasively, but in small doses. As effective as spackle is, a building contractor would never think of building an entire house using only spackle. As with any tool, expressions in SSIS are best used in proper context and should not be considered as a one-size-fits-all solution to every problem.
Conclusion
ETL can be hard. Often, it’s not the big design problems but the small “how do I . . .?” tactical questions that collectively cause the most friction during SSIS development. The SSIS expression language was designed for these types of questions. Its small footprint, somewhat familiar syntax, and extensive usability across the breadth of SSIS make it an excellent addition to the capabilities within Integration Services. Used properly, it can help to address a variety of problem domains and hopefully ease the burden on the ETL developer.
18.119.133.160