Lookup programming

As mentioned earlier, complete dataset retrieval performs poorly in virtually any application scenario. When dealing with a huge dataset, trying to load more than a million rows at once in a single materialized Entity Framework query, or anything similar, will sooner or later result in an OutOfMemoryException.

As seen in the Stream-like querying section in Chapter 7, Database Querying, ADO.NET gives us the ability to execute queries without having to materialize all the data in memory at once, as we would when filling a DataSet or materializing a full Entity Framework query. Sometimes we need to execute a lot of logic that performs frequent lookups against a data source. Let's return to the previous example.
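As a refresher, here is a minimal sketch of such stream-like querying with ADO.NET: the data reader yields one row at a time, so memory usage stays flat regardless of the table size. The connection string and table name are placeholders; adjust them for your environment.

```csharp
using System;
using System.Data.SqlClient;

class StreamingReadDemo
{
    static void Main()
    {
        //placeholder connection string: adjust for your environment
        var connectionString = @"Data Source=.;Initial Catalog=TestDb;Integrated Security=True";
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Latitude, Longitude FROM Position", connection))
        {
            connection.Open();
            //ExecuteReader streams rows: only the current row is held in memory
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    var latitude = reader.GetFloat(0);
                    var longitude = reader.GetFloat(1);
                    //process one row at a time here
                }
            }
        }
    }
}
```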

What if a GPS device keeps sending the same position? In our system, we would take this position, produce a Stage 1 message, send it to the Stage 1 queue, and then pass it down to our grid engine, which would query our database to check whether the position is already known to us and, if so, skip it.

The best solution to avoid such round trips is to detect, already in the first stage, that the GPS position is present in our system, so that duplicated items can be skipped as early as possible. Obviously, a second check at a lower level is still mandatory, but this pre-check stage will noticeably improve the latency of the whole application.

Instead of checking for data duplication at the first stage with a direct database query, we can use a cache or a large in-memory collection. In our solution, storing all coordinates would soon break any system, so this is not a practical method in itself, but it gives us the opportunity to evaluate how the different local data-storage classes behave.

Again, we will read a simple CSV file for testing purposes, this time without limiting the number of extracted rows with the Take method:

var fname = @"C:\Temp\positions_export.csv";

//skip line 1 containing column names
//split each row on the semicolon char
var positions = File.ReadAllLines(fname).Skip(1)
    .Select(row => row.Split(';'))
    //parse csv data as "latitude;longitude"
    //(InvariantCulture assumes a dot as the decimal separator)
    .Select(x => new
    {
        Latitude = float.Parse(x[0], CultureInfo.InvariantCulture),
        Longitude = float.Parse(x[1], CultureInfo.InvariantCulture)
    })
    //avoid unnecessary duplications
    .Distinct()
    //produce a temporary key as a unique string taken from the hash
    //code of the anonymous instance (like any hashing, GetHashCode
    //does not guarantee uniqueness; we use it for testing purposes)
    .Select(x => new { x.Latitude, x.Longitude, TempID = x.GetHashCode().ToString() })
    .ToArray();
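Incidentally, File.ReadAllLines materializes the whole file in memory, which is exactly the pattern we are trying to avoid with query results; for larger files, File.ReadLines streams the lines lazily instead. A minimal sketch of the same pipeline over a streamed read (the file path is a placeholder, and the coordinates are kept as strings for brevity):

```csharp
using System;
using System.IO;
using System.Linq;

class LazyCsvDemo
{
    static void Main()
    {
        var fname = @"C:\Temp\positions_export.csv"; //placeholder path

        //File.ReadLines returns an IEnumerable<string> read lazily,
        //so the pipeline below never holds the whole file in memory
        var distinctCount = File.ReadLines(fname)
            .Skip(1)
            .Select(row => row.Split(';'))
            .Select(x => new { Latitude = x[0], Longitude = x[1] })
            .Distinct()
            .Count();

        Console.WriteLine(distinctCount);
    }
}
```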

With the help of a Stopwatch, let us evaluate how long it takes LINQ to find each item again among the objects in memory, comparing by value rather than by reference:

var s = new Stopwatch();
s.Start();
foreach (var p in positions)
{
    var found = positions.FirstOrDefault(x => x.Latitude == p.Latitude && x.Longitude == p.Longitude);
}
s.Stop();

Console.WriteLine("By lat/lon {0:N0}ms", s.ElapsedMilliseconds);

s.Reset();
s.Start();
foreach (var p in positions)
{
    var found = positions.FirstOrDefault(x => x.TempID == p.TempID);
}
s.Stop();

Console.WriteLine("By string equals {0:N0}ms", s.ElapsedMilliseconds);

Do consider that the cost is directly proportional to the number of rows. The following is the time taken for each item to find itself through a by-value equality check, with fewer than 6,000 rows in our CSV:

By evaluating lat/lon: 662ms
By evaluating the string key: 757ms

Are you wondering whether the cost comes from the double key (lat/lon)? No! A single string costs more to compare than a couple of floats. Remember that LINQ simply scans all the objects in memory; it is not a database index seek.

So, how can we reproduce the logic of a database index seek in our .NET code? Simply with the old Hashtable class, or the newer generic Dictionary<TKey, TValue> and HashSet<T> classes (HashSet is the most recent of the three).
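Of these, HashSet<T> is the leanest choice when all we need is an existence check (has this position been seen?) rather than retrieving a value: Contains is a hash lookup, not a linear scan. A minimal sketch with hard-coded sample keys standing in for the TempID values:

```csharp
using System;
using System.Collections.Generic;

class HashSetLookupDemo
{
    static void Main()
    {
        //sample keys; in the real code these would be the TempID values
        var knownKeys = new HashSet<string> { "12345", "67890" };

        //Contains is a hash lookup, not a linear scan as with FirstOrDefault
        Console.WriteLine(knownKeys.Contains("12345")); //True
        Console.WriteLine(knownKeys.Contains("99999")); //False
    }
}
```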

Here, we give the dictionary its key by passing key-selector and element-selector lambdas to ToDictionary:

var dictionary = positions.ToDictionary(x => x.TempID, x => new { x.Latitude, x.Longitude });

s.Reset();
s.Start();
foreach (var p in positions)
{
    var found = dictionary[p.TempID];
}
s.Stop();

Console.WriteLine("By string dictionary {0:N0}ms", s.ElapsedMilliseconds);

The result? The time taken to find each item again, by asking the dictionary to seek by the string key for the same number of items in memory, is 0ms! The drawback remains that we cannot put a whole table inside a dictionary instance. The other option is to use a local cache object to hold the lookup data. Such a cache instance will contain as many items as possible without ever compromising the application's stability. Obviously, this does not keep the whole table in memory (which is a bad idea), but it will substantially reduce the latency of your engines, bringing the whole system to a higher throughput.

The following example shows how to preload an in-memory cache and then count cache misses by verifying that every item is actually found in the cache. Bear in mind that a cache has its own memory management, which tries to retain the most used entries (items within the cache) and the most recently added ones, evicting the rest under memory pressure. Here's an example:

//create the cache instance (MemoryCache is in System.Runtime.Caching)
var cache = MemoryCache.Default;

//try putting all items within the cache
foreach (var p in positions)
    cache.Add(p.TempID, p, DateTimeOffset.Now.AddMinutes(30));

//cache miss counter
int miss = 0;

s.Reset();
s.Start();
foreach (var p in positions)
{
    var found = cache[p.TempID];
    if (found == null)
        miss++;
}
s.Stop();

Console.WriteLine("By MemoryCache {0:N0}ms with {1:N0} misses", s.ElapsedMilliseconds, miss);

The result comes to a latency of 3ms, with zero misses. This proves that although the dictionary remains the fastest class for item retrieval, a MemoryCache is not bad at all.

In this example, the cache has been fully preloaded at the start. In the real world, this is good practice: although it adds some initial latency, this upfront cost pays off in every later cache check. However, if such a design does not fit your needs, the lazy approach is available: instead of filling the cache at the beginning, it populates it on demand, at each cache miss.
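A minimal sketch of that lazy approach, assuming a hypothetical LoadPositionFromDatabase helper standing in for the real data-source query: on a cache miss we pay the round trip once and store the result for all subsequent checks.

```csharp
using System;
using System.Runtime.Caching;

class LazyCacheDemo
{
    static readonly MemoryCache cache = MemoryCache.Default;

    //hypothetical stand-in for the real database lookup
    static object LoadPositionFromDatabase(string tempId)
    {
        return new { TempID = tempId };
    }

    static object GetPosition(string tempId)
    {
        var found = cache[tempId];
        if (found == null)
        {
            //cache miss: pay the round trip once, then cache the result
            found = LoadPositionFromDatabase(tempId);
            cache.Add(tempId, found, DateTimeOffset.Now.AddMinutes(30));
        }
        return found;
    }

    static void Main()
    {
        var first = GetPosition("12345");  //miss: loads from the data source
        var second = GetPosition("12345"); //hit: served from the cache
        Console.WriteLine(ReferenceEquals(first, second)); //True
    }
}
```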
