The log-likelihood ratio (LLR) measures how unlikely it is that two events A and B are independent, that is, how much more often they occur together than chance alone (the individual event frequencies) would predict. In other words, the LLR indicates where a significant co-occurrence might exist between two events A and B, with a frequency higher than would be expected if the two event variables were independent.
It has been shown by Ted Dunning (http://tdunning.blogspot.it/2008/03/surprise-and-coincidence.html) that the LLR can be expressed based on binomial distributions for events A and B using a matrix k with the following entries:
| | A | Not A |
---|---|---|
| B | k11 | k12 |
| Not B | k21 | k22 |
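In terms of this matrix, the LLR can be written as:

$$\mathrm{LLR} = 2N\left[H(\mathrm{rowSums}(k)) + H(\mathrm{colSums}(k)) - H(k)\right]$$

Here, $N = k_{11} + k_{12} + k_{21} + k_{22}$ is the total number of observations and $H(p) = -\sum_i \frac{p_i}{N}\log\frac{p_i}{N}$ is the Shannon entropy that measures the information contained in the vector p (the counts are divided by N so that they sum to 1).
Note: $H(\mathrm{rowSums}(k)) + H(\mathrm{colSums}(k)) - H(k)$ is also called the Mutual Information (MI) of the two event variables A and B, measuring how the occurrences of the two events depend on each other.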
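As an illustration of the entropy form of the LLR, the following minimal sketch computes it for a 2x2 count matrix with NumPy. The function names `entropy` and `llr_from_counts` are hypothetical and not part of the original listing, and the natural logarithm is assumed.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (natural log) of the distribution obtained by normalising counts."""
    counts = np.asarray(counts, dtype=float).ravel()
    p = counts / counts.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

def llr_from_counts(k):
    """LLR = 2 * N * (H(row sums) + H(column sums) - H(k)) for a 2x2 count matrix k."""
    k = np.asarray(k, dtype=float)
    n = k.sum()
    if n == 0:
        return 0.0  # no observations: no evidence of association
    mutual_information = entropy(k.sum(axis=1)) + entropy(k.sum(axis=0)) - entropy(k)
    return 2.0 * n * mutual_information

# Two illustrative (made-up) count matrices
k_independent = [[10, 10], [10, 10]]  # A and B co-occur exactly as chance predicts
k_associated = [[20, 0], [0, 20]]     # A and B always occur together

print(llr_from_counts(k_independent))  # approximately 0
print(llr_from_counts(k_associated))   # 2 * 40 * ln(2), approximately 55.45
```

A value near zero means the counts are compatible with independence, while larger values indicate a stronger departure from it.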
This test is also called the G² test, and it has proven effective at detecting co-occurrences of rare events (especially in text analysis), so it is useful with sparse databases (or a utility matrix, in our case).
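The G² statistic is commonly written in terms of observed and expected counts, $G^2 = 2\sum_{ij} O_{ij}\ln(O_{ij}/E_{ij})$, where $E_{ij}$ is the count expected under independence (row total times column total, divided by the grand total). A minimal sketch of this textbook form, with the hypothetical function name `g2_statistic`:

```python
import numpy as np

def g2_statistic(k):
    """G^2 = 2 * sum(O * ln(O / E)) over the cells of a contingency table k."""
    observed = np.asarray(k, dtype=float)
    n = observed.sum()
    if n == 0:
        return 0.0
    # Expected counts if rows and columns were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    mask = observed > 0  # cells with zero observations contribute nothing
    return 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

# Made-up example of a rare co-occurrence in a large, sparse data set:
# the two events coincide 10 times, each occurs alone 5 times,
# and neither occurs in the remaining 100000 observations.
rare = [[10, 5], [5, 100000]]
print(g2_statistic(rare))
```

Because the statistic works directly on counts and does not rely on a normal approximation, it remains usable when the co-occurrence counts are small.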
In our case, the events A and B are the like or dislike of two movies A and B by a user, where a user is said to like a movie when the rating is greater than 3 (and to dislike it otherwise). Therefore, the implementation of the algorithm is given by the following class:
The constructor takes as input the utility matrix, the movie titles list, and the `likethreshold` that is used to define whether a user likes a movie or not (default 3). The function `loglikelihood_ratio` generates the matrix with all the LLR values for each pair of movies i and j, calculating the matrix k (`calc_k`) and the corresponding LLR (`calc_llr`). The function `GetRecItems` returns the recommended movie list for the user with ratings given by `u_vec` (the method does not predict the rating values).
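A minimal sketch of a class with the interface described above could look as follows. It is not the original listing: the class name `LLRRecommender`, the helper `_entropy`, the `n_items` parameter, the convention that a rating of 0 means "not rated", the choice to count only users who rated both movies, and the scoring rule in `GetRecItems` (sum of the LLR values with the movies the user liked) are all assumptions made for illustration.

```python
import numpy as np

class LLRRecommender:  # hypothetical class name
    def __init__(self, utility, movies, likethreshold=3):
        # utility: users x movies rating matrix, 0 = not rated (assumption)
        self.utility = np.asarray(utility, dtype=float)
        self.movies = list(movies)
        self.likethreshold = likethreshold
        self.llr = self.loglikelihood_ratio()

    def calc_k(self, a, b):
        """2x2 like/dislike counts for rating columns a and b."""
        both = (a > 0) & (b > 0)  # assumption: only users who rated both movies
        like_a = a[both] > self.likethreshold
        like_b = b[both] > self.likethreshold
        return np.array([[np.sum(like_a & like_b), np.sum(~like_a & like_b)],
                         [np.sum(like_a & ~like_b), np.sum(~like_a & ~like_b)]],
                        dtype=float)

    @staticmethod
    def _entropy(counts):
        counts = np.asarray(counts, dtype=float).ravel()
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log(p))

    def calc_llr(self, k):
        """LLR = 2 * N * (H(row sums) + H(column sums) - H(k))."""
        n = k.sum()
        if n == 0:
            return 0.0
        mi = self._entropy(k.sum(axis=1)) + self._entropy(k.sum(axis=0)) - self._entropy(k)
        return max(0.0, 2.0 * n * mi)  # clip tiny negative rounding errors

    def loglikelihood_ratio(self):
        """Symmetric movies x movies matrix of LLR values."""
        n_movies = self.utility.shape[1]
        llr = np.zeros((n_movies, n_movies))
        for i in range(n_movies):
            for j in range(i + 1, n_movies):
                k = self.calc_k(self.utility[:, i], self.utility[:, j])
                llr[i, j] = llr[j, i] = self.calc_llr(k)
        return llr

    def GetRecItems(self, u_vec, n_items=2):
        """Titles of unrated movies, ranked by summed LLR with the liked movies."""
        u_vec = np.asarray(u_vec, dtype=float)
        liked = u_vec > self.likethreshold
        scores = self.llr[:, liked].sum(axis=1)
        scores[u_vec > 0] = -np.inf  # never recommend already rated movies
        order = np.argsort(-scores)
        return [self.movies[i] for i in order[:n_items] if np.isfinite(scores[i])]

# Made-up ratings: movies A and B are always liked together, C is unrelated
utility = np.array([[5, 5, 5], [5, 5, 1], [5, 5, 5], [5, 5, 1],
                    [1, 1, 5], [1, 1, 1], [1, 1, 5], [1, 1, 1]])
rec = LLRRecommender(utility, ["Movie A", "Movie B", "Movie C"])
print(rec.GetRecItems([5, 0, 0], n_items=1))  # ['Movie B']
```

Note that the LLR is large for any strong dependence, including a negative one (users who like one movie and dislike the other), so a production version would probably also check the direction of the association before recommending.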