Working with a business domain model

When designing an application, we often create an object model that mimics business domain concepts. The idea is to clearly articulate data in a form that feels most natural to the programmer.

Let's say we need to retrieve customer data from a relational database. Customer records may be stored in a CUSTOMER table, with each customer stored as a row in the table. When we fetch customer data from the database, we can construct a Customer object for each row and push it into an array. Similarly, when we work with NoSQL databases, we may receive data as JSON documents and convert them into an array of objects. In both cases, the data ends up represented as an array of objects. In Julia, applications are usually designed to work with objects defined using the struct keyword.
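As a minimal sketch of this pattern, assuming a hypothetical Customer type and some rows already fetched from the database (faked here as named tuples), the mapping might look like this:

struct Customer
    id::Int
    name::String
    email::String
end

# `rows` stands in for whatever the database driver returns;
# here we fake it with named tuples purely for illustration.
rows = [(id = 1, name = "Ada", email = "ada@example.com"),
        (id = 2, name = "Grace", email = "grace@example.com")]

# Convert each row into a Customer object and collect them in an array.
customers = [Customer(row.id, row.name, row.email) for row in rows]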

Let's take a look at a use case for analyzing taxi data coming from New York City. The data is publicly available as several CSV files. For illustration purposes, we have downloaded the data for December 2018 and truncated it to 100,000 records. 

First, we define a type called TripPayment, as follows:

struct TripPayment
    vendor_id::String
    tpep_pickup_datetime::String
    tpep_dropoff_datetime::String
    passenger_count::Int
    trip_distance::Float64
    fare_amount::Float64
    extra::Float64
    mta_tax::Float64
    tip_amount::Float64
    tolls_amount::Float64
    improvement_surcharge::Float64
    total_amount::Float64
end

To read the data into memory, we will take advantage of the CSV.jl package. Let's define a function to read the file into a vector:

using CSV

function read_trip_payment_file(file)
    # Data records start at row 3 of this particular file.
    f = CSV.File(file, datarow = 3)
    records = Vector{TripPayment}(undef, length(f))
    for (i, row) in enumerate(f)
        records[i] = TripPayment(row.VendorID,
                                 row.tpep_pickup_datetime,
                                 row.tpep_dropoff_datetime,
                                 row.passenger_count,
                                 row.trip_distance,
                                 row.fare_amount,
                                 row.extra,
                                 row.mta_tax,
                                 row.tip_amount,
                                 row.tolls_amount,
                                 row.improvement_surcharge,
                                 row.total_amount)
    end
    return records
end

Now, when we fetch the data, we end up with an array of TripPayment objects. In this example, we have 100,000 records to work with.
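Loading the file in the REPL might look like the following (the file name here is just a placeholder for wherever you saved the truncated CSV):

records = read_trip_payment_file("yellow_tripdata_2018-12.csv")
length(records)    # 100000 for our truncated file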

Now, suppose that we need to analyze this dataset. In many data analysis use cases, we simply calculate various statistics for some of the attributes in the payment records. For example, we may want to find the average fare amount, as follows:
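With mean from the Statistics standard library and a generator expression, one way to write this is:

using Statistics: mean

# Average fare across all records, computed lazily from the field values.
mean(r.fare_amount for r in records)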

This should already be a fairly fast operation because it uses generator syntax and avoids allocating an intermediate array.

Some Julia functions accept generator syntax, which is written just like an array comprehension but without the square brackets. It is very memory efficient because it avoids allocating memory for an intermediate array.
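As a quick illustration, unrelated to the taxi data, compare a comprehension with a generator:

# The comprehension allocates a temporary array before summing.
sum([x^2 for x in 1:1_000_000])

# The generator feeds values to sum lazily, with no intermediate array.
sum(x^2 for x in 1:1_000_000)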

The only cost is that it needs to access the fare_amount field of every record. Let's benchmark this operation to see how fast it is.
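With the BenchmarkTools package, the measurement can be written as follows (actual timings will vary by machine, so none are quoted here):

using BenchmarkTools

# Interpolating records with $ keeps global-variable access out of the timing.
@btime mean(r.fare_amount for r in $records);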

How do we know whether it runs at optimal speed? We don't know unless we try a different approach. Because all we are doing is calculating the mean of 100,000 floating-point numbers, we can easily reproduce that with a plain array. Let's copy the fare amounts into a separate array:

fare_amounts = [r.fare_amount for r in records];

Then, we can benchmark the mean function by passing the array as is:
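In the same style as before, the array-based benchmark might be:

# Again, $ interpolation keeps the global lookup out of the measurement.
@btime mean($fare_amounts);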

Whoa! What's happening here? It is 24x faster than before. 

In this case, the compiler was able to make use of more advanced CPU instructions, such as SIMD (single instruction, multiple data) vectorization. Because Julia arrays are dense, that is, the data is stored compactly in a contiguous block of memory, the compiler can fully optimize the operation.
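A quick way to see why the array version is so compiler-friendly (a small check, not part of the original walkthrough) is to inspect the element type: a Vector{Float64} stores its values inline and back to back, whereas iterating over records requires a field lookup for every element.

typeof(fare_amounts)               # Vector{Float64}
isbitstype(eltype(fare_amounts))   # true: the values are stored inline, contiguously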

Converting the data into an array seems to be a decent solution. However, imagine having to create these temporary arrays for every single field. That quickly becomes tedious, and it is easy to miss a field along the way. Is there a better way to solve this problem?
