Entity Framework querying optimization

First, when trying to understand EF performance, we need to know what stages EF operates on in order to actually query the database server with our object query. We always must keep in mind that EF is an object model mapped to a physical model that is produced from database's metadata. When we make a query, it is made across the entity objects that are mapped to the known physical layer within the EF itself. This mapping later produces the right SQL, which is sent to the database server. This means that if a change is made against the database metadata, the known metadata in EF may become invalid and produce runtime errors.

Talking about performance, we must dive deep into SQL materialization. This knowledge of the stage list made by EF becomes critical.

Querying execution lifecycle

The first step when we make a query from scratch with EF involves metadata loading. This step reflects all entity class metadata and related mapping to the physical layer. When made, the result is cached in all application domains. This step definitely incurs high cost. Luckily, it takes place only once in an application's lifetime, but only in the case when we deal with multiple application domains.

The second step that occurs any time we perform any kind of database access (query or edit) is the connection opening (later closed immediately when the result is materialized). Although this does not cost so much because of CLR connection pooling, it adds some minor latency times to the first database connection, when the first connection is added to the pool.

Once the connection is available, Entity Framework needs compiling an in-memory query, representing the SQL query in object-oriented way. This in-memory query, called query view. For performance purposes, this query view is cached by Entity Framework itself per each AppDomain. This means that subsequent executions of the same query, will not produce a new query view, simply the previous query view is used and the related SQL statement is sent to the database server.

Now it is time for EF to create the proper SQL statements (related to the specific database dialect) by analyzing the expression tree of all LINQ expressions used for the object query. This adds some execution time, and is still cached for a single query.

The following steps are less incisive in terms of performance. These are SQL execution on database (out of our performance concern scope) entity type load by reading the metadata from step one, metadata validation, change-tracking activation (it's possible to disable this step if not needed), and finally, object materialization. The last step may increase execution time if multiple objects are materialized together.

Without any caching, EF would definitely perform 10 times worse or more than connected ADO.NET would in the data retrieval scenario, in which the query view compilation would always take quite some more time than the database SELECT execution itself.

Although the connection pooling reduce connection open/close times by caching such connections, any time we reuse a connection from the pool, some handshake happens (at least the authentication). When we deal with Internet accessed SQL Servers, like SQL Azure, a frequent connection pooling can produce a significant delay if we insert (for instance) hundreds of items with a new connection opening handshake per item. In this case, a connection retention like storing a single connection in a static variable, will avoid pooling overhead.

Note

More detail on ER performance consideration can be found at https://msdn.microsoft.com/en-us/library/cc853327.aspx.

Querying approaches

Entity Framework gives us the ability to query with different designs. By default, the approach of EF is to execute in the main query what we actually write in the query. Because of the object-oriented approach of EF entities, later, we can later access its properties, such as the navigation properties, to associate entities without having to put such navigation imperatively in the main query. The execution is available until the context is available. This also means that another statement is sent to the database server and another connection open/close happens per main item iterated, producing many database round trips. This default approach, named lazy loading, can also be disabled in the design view of EF itself. In such a case, if we want to access a navigation property with disabled lazy loading, we will need to manually load it or else we will simply get a null value.

Another querying approach is by preloading all the required navigation properties using an approach called eager loading. By executing an EF query in eager loading against a master-detail relationship, the result in terms of database querying is a SELECT statement producing a Cartesian product of all tables involved in the executed queries. This may produce a huge result set within the database server. Once such a set comes to EF, it creates a sort of in-memory data grouping to give objects the right hierarchical shape. With eager loading, no multiple round trips happen, but the big dataset may easily result in great network usage and client (EF side) materialization costs.

The eager and the lazy loading approach gives us the ability to use interesting OOP designs. For example, we could add computational properties or methods to any entity centralizing such logic. This becomes very useful for data-driven applications. Here's an example:

partial class InvoiceElement
{
    public double BaseAmount { get { return Amount * UnitPrice; } }
    public double TotalAmount { get { return Amount * UnitPrice * VatMultiplier; } }
}

partial class Invoice
{
    public double BaseAmount { get { return InvoiceElement.Sum(x => x.BaseAmount); } }
    public double TotalAmount { get { return InvoiceElement.Sum(x => x.TotalAmount); } }
}

This means that anytime we will need to know the total invoice amount, it will simply be available as a computed property. However, this also means that we always have to take in memory all invoice elements, even if we simply need a total amount.

The last available approach uses query shaping. Instead of executing a query on standard inferred entities from a physical database, we use EF to materialize only the required data, as happens when dealing with a View on a database server. Here is an example:

//the AsNoTracking disable change tracking
//this avoid wasting some time and resources
using (var db = new Data.InvoicingDBContainer())
{

    //the lazy execution - the default one
    var lazy_query = db.Invoice.AsNoTracking();

    //the eager execution
    //this will pre-execute in a super-join all requested navigation properties
    var eager_query = db.Invoice.AsNoTracking()
        .Include("Customer")
        .Include("InvoiceElement");

    //a new query with a new shape
    //this give us the ability to get
    //from the database server only what we need
    //without wasting network resources and
    //moving computation across the remote server
    var shaped_query = from i in db.Invoice.AsNoTracking()
                        select new
                        {
                            i.Number,
                            i.Date,
                            CustomerName = i.Customer.CompanyName,
                            BaseAmount = i.InvoiceElement.Sum(x => x.Amount * x.UnitPrice),
                            TotalAmount = i.InvoiceElement.Sum(x => x.Amount * x.UnitPrice * x.VatMultiplier),
                        };

    Console.WriteLine("Simple (lazy) query");
    TraceTime(() =>
    {
        //enumerate the result to execute the query
        foreach (var item in lazy_query)
        {
            var date = item.Date;
            var total = item.TotalAmount;

            //accessing navigation property in lazy way
            //will trigger context to load the related
            //navigation property for each entity in resultset
            var companyname = item.Customer.CompanyName;
        }
    });

    Console.WriteLine("Simple (eager) query");
    TraceTime(() =>
    {
        //enumerate the result to execute the query
        foreach (var item in eager_query)
        {
            var date = item.Date;
            var total = item.TotalAmount;

            //with eager loading no multiple round-trips are made
            //but at beginning a bigger result-set is loaded
            var companyname = item.Customer.CompanyName;
        }
    });

    Console.WriteLine("Shaped query");
    TraceTime(() =>
    {
        //enumerate the result to execute the query
        foreach (var item in shaped_query)
        {
            //all needed data is already within
            //query materialization
        }
    });
}

Performance thoughts

Analyzing performance concerns of the three approaches is not so easy. The first thing to learn is that each approach may perform well in different scenarios.

The following table shows the pros and cons of lazy loading, eager loading, and shaped query:

Approach

Pros

Cons

Lazy loading

  • All local properties always available and ready to be edited
  • Full OOP programming against navigation and local properties
  • Local-side execution of logic once data is materialized
  • Navigation available only until the context is available
  • Wastes many resources when a property is not needed, as lazy loading always makes a sort using SELECT * on the base entity
  • Lots of round trips to the DB for each entity navigation property during first usage

Eager loading

  • All local properties always available and ready to be edited
  • Full OOP programming against navigations and local properties
  • Local-side execution of logic once data is materialized
  • No round trips to DB
  • Lot of resources are wasted anytime a property is not needed because makes always a sort of SELECT * on base entity
  • High execution cost for network and object materialization

Shaped query

  • Server-side execution of logics within the query
  • No round trips to DB
  • Minimal resource usage on network and EF side
  • Only needed data is materialized
  • Unable to centralize logics within objects because of the data-driven approach
  • Read-only objects

Oddly, lazy loading always loads all entity data to reduce eventual round trips to the database by applying a Data Transfer Object (DTO) pattern with the database server, while navigation properties use the Lazy Load pattern, which is exactly the opposite of DTO.

The execution of the following example will give us interesting evidence of how EF performs different query approaches:

First execution
Simple (lazy) query
-> in 1,717 ms
Simple (eager) query
-> in 189 ms
Shaped query
-> in 181 ms

Second execution
Simple (lazy) query
-> in 1,153 ms
Simple (eager) query
-> in 97 ms
Shaped query
-> in 10 ms

The example executed on my laptop includes fake data of 100 invoices, each with 84 elements. Absolute values are definitely useless, while relative data analysis gives us several interesting features. First, we need to observe how EF metadata and query view generation add sensible time cost to each querying approach.

In terms of real execution time—visible from the second execution and later—by using lazy loading, the request of all invoice elements per invoice adds huge round-trip times to the whole querying mechanism. This elaboration time is more than 10 times than that of eager loading and more than 20 times that of shaped loading.

Compared to lazy loading, the eager loading approach performs better when considering latency time of a single round trip. However, upon analyzing the produced database query in SQL Server Profiler, we can see the concrete SQL statement produced. A huge and resource-wasting statement! Here is an example:

SELECT
    [Project1].[InvoiceID] AS [InvoiceID],
    [Project1].[CustomerCustomerID] AS [CustomerCustomerID],
    [Project1].[Number] AS [Number],
    [Project1].[Date] AS [Date],
    [Project1].[CustomerID] AS [CustomerID],
    [Project1].[CompanyName] AS [CompanyName],
    [Project1].[Address1] AS [Address1],
    [Project1].[Address2] AS [Address2],
    [Project1].[Address3] AS [Address3],
    [Project1].[TenantTenantID] AS [TenantTenantID],
    [Project1].[C1] AS [C1],
    [Project1].[InvoiceElementID] AS [InvoiceElementID],
    [Project1].[InvoiceInvoiceID] AS [InvoiceInvoiceID],
    [Project1].[Amount] AS [Amount],
    [Project1].[Description] AS [Description],
    [Project1].[UnitPrice] AS [UnitPrice],
    [Project1].[VatMultiplier] AS [VatMultiplier]
    FROM ( SELECT
        [Extent1].[InvoiceID] AS [InvoiceID],
        [Extent1].[CustomerCustomerID] AS [CustomerCustomerID],
        [Extent1].[Number] AS [Number],
        [Extent1].[Date] AS [Date],
        [Extent2].[CustomerID] AS [CustomerID],
        [Extent2].[CompanyName] AS [CompanyName],
        [Extent2].[Address1] AS [Address1],
        [Extent2].[Address2] AS [Address2],
        [Extent2].[Address3] AS [Address3],
        [Extent2].[TenantTenantID] AS [TenantTenantID],
        [Extent3].[InvoiceElementID] AS [InvoiceElementID],
        [Extent3].[InvoiceInvoiceID] AS [InvoiceInvoiceID],
        [Extent3].[Amount] AS [Amount],
        [Extent3].[Description] AS [Description],
        [Extent3].[UnitPrice] AS [UnitPrice],
        [Extent3].[VatMultiplier] AS [VatMultiplier],
        CASE WHEN ([Extent3].[InvoiceElementID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
        FROM   [dbo].[Invoice] AS [Extent1]
        INNER JOIN [dbo].[Customer] AS [Extent2] ON [Extent1].[CustomerCustomerID] = [Extent2].[CustomerID]
        LEFT OUTER JOIN [dbo].[InvoiceElement] AS [Extent3] ON [Extent1].[InvoiceID] = [Extent3].[InvoiceInvoiceID]
    )  AS [Project1]
    ORDER BY [Project1].[InvoiceID] ASC, [Project1].[CustomerID] ASC, [Project1].[C1] ASC

This statement produces a result-set of 8,400 rows. However, an execution time of 97 milliseconds is good for a set this size. Let's consider that the query is executed on a local database. If the same query were executed across a network, it would lead to a lot of wastage of network resources, as well as a higher execution time. Hence, it is important to use eager loading carefully.

Obviously, the shaped query performed the best with respect to latency times and overall resource usage. Take into consideration that although there are advantages, it is lacking in terms of scalability. In a system design where we have a single database system and multiple application servers—because of the database-side execution of computations—once the relational database ends its computational resources, it will be impossible to have better performance. It will be impossible improving performances although adding hundreds of application servers running our .NET application.

Regarding the quality of the SQL statement, a shaped query produce the most beautiful code. Here's an example:

SELECT
    [Project1].[InvoiceID] AS [InvoiceID],
    [Project1].[Number] AS [Number],
    [Project1].[Date] AS [Date],
    [Project1].[CompanyName] AS [CompanyName],
    [Project1].[C1] AS [C1],
    (SELECT
        SUM([Filter2].[A1]) AS [A1]
        FROM ( SELECT
            [Extent4].[Amount] * [Extent4].[UnitPrice] * [Extent4].[VatMultiplier] AS [A1]
            FROM [dbo].[InvoiceElement] AS [Extent4]
            WHERE [Project1].[InvoiceID] = [Extent4].[InvoiceInvoiceID]
        )  AS [Filter2]) AS [C2]
    FROM ( SELECT
        [Extent1].[InvoiceID] AS [InvoiceID],
        [Extent1].[Number] AS [Number],
        [Extent1].[Date] AS [Date],
        [Extent2].[CompanyName] AS [CompanyName],
        (SELECT
            SUM([Filter1].[A1]) AS [A1]
            FROM ( SELECT
                [Extent3].[Amount] * [Extent3].[UnitPrice] AS [A1]
                FROM [dbo].[InvoiceElement] AS [Extent3]
                WHERE [Extent1].[InvoiceID] = [Extent3].[InvoiceInvoiceID]
            )  AS [Filter1]) AS [C1]
        FROM  [dbo].[Invoice] AS [Extent1]
        INNER JOIN [dbo].[Customer] AS [Extent2] ON [Extent1].[CustomerCustomerID] = [Extent2].[CustomerID]
    )  AS [Project1]
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.219.239.118