Hotel recommendation based on Hybrid Model using Implicit and Explicit Feedback


The objective of this paper is to design a hotel recommender system using a hybrid model. Recommender systems suggest items to users based on their preferences, with the aim of surfacing items the users will like. They have become important for businesses such as travel and hotels to promote their products. With the growth of social networks, users seek recommendations from websites like Twitter, TripAdvisor, Trivago, and Yelp. Users read reviews, ratings, and opinions, but the resulting recommendations do not take their own preferences into account. For instance, users with a limited budget may be recommended expensive hotels because of their high ratings. With thousands of hotels to choose from, it is difficult for users to find one that meets several criteria at once. Personalizing the search by user preference is therefore a pressing need for better hotel recommendation.

At present, most recommendation systems are content-based models that consider the search input but not the user's preferences. Existing recommender systems in the hotel industry use a straightforward strategy: in general, the system compares the profile of an existing user with specific features and uses them to predict the user's preferences (Ricci 2002; Gavalas & Kenteris, 2011). This is especially true of mobile recommender systems (Yang & Hwang, 2013). In such a system, the user is asked to provide information describing interests, needs, or limitations, which the system uses to make recommendations by correlating the user's responses against the available information or packages. These methods are known as content-based recommendations (Gavalas & Kenteris, 2011).

In this thesis, an intelligent hybrid hotel recommender model is applied, based on user preferences and item properties and exploiting implicit and explicit feedback. In addition, using data sources of different types, it employs a multi-criteria rating approach to better capture users' preferences and improve the accuracy of the recommendations. The system is designed in layers, using multiple sub-recommender systems that each address a specific aspect of the problem, with the aim of increasing the efficiency and effectiveness of the recommendations. The proposed system is trained on TripAdvisor data, collected from multiple sources and integrated into a single database. The final solution is verified and tested in different settings and scenarios to validate its accuracy.

Recommendation using Explicit Feedback: Social networks such as TripAdvisor and Zomato let users share ratings and opinions on hotels. Besides hotel ratings, there are other signals such as thumbs up and thumbs down on Facebook. These kinds of feedback are explicit feedback. Recommendation methods that use this feedback are categorized into content-based (CB) and collaborative filtering (CF).

Content-based (CB) methods (Lops et al., 2011; Pazzani and Billsus, 2007) generate recommendations by exploiting regularities in the item content. This is a common approach in recommendation systems. For instance, in hotel recommendation the content could be location, price, or star rating. To make a recommendation to a user, we need to find hotels whose features are similar to those of the items the user has previously rated. The similarity between hotels is computed with popular measures such as the Pearson Correlation Coefficient (PCC) and Vector Space Similarity. Content-based information filtering has proven effective in locating textual items relevant to a topic using techniques such as Boolean queries (Anick et al., 1990; Lee et al., 1993; Verhoeff et al., 1961), vector-space queries (Salton and Buckley, 1998), probabilistic models (Robertson and Sparck, 1976), neural networks (Kim and Raghavan, 2000), and fuzzy set models (Ogawa et al.).
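
The two similarity measures named above can be sketched as follows. This is a minimal illustration in Python rather than the thesis code; the hotel names, the three features, and their scaling are hypothetical, and cosine similarity is used here as the usual form of vector space similarity.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant vectors; treated as "no correlation" here
    return cov / math.sqrt(var_x * var_y)

def cosine(x, y):
    """Cosine (vector space) similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical hotels described by (price level, star rating, distance to centre),
# each feature already scaled to [0, 1]. These values are invented for illustration.
hotels = {
    "A": [0.20, 0.6, 0.10],
    "B": [0.25, 0.6, 0.15],
    "C": [0.90, 1.0, 0.80],
}

def most_similar(target, catalogue, sim=cosine):
    """Rank all other hotels by similarity to the target hotel."""
    scores = [
        (name, sim(catalogue[target], features))
        for name, features in catalogue.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `most_similar("A", hotels)` ranks hotel B ahead of hotel C, since B's feature vector points in almost the same direction as A's. In practice the features would need consistent scaling before either measure is meaningful.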

However, this approach has disadvantages. Firstly, it is limited by the number and types of features associated with the items. Secondly, users might be recommended hotels that are highly similar to the hotels they already liked, leading to a lack of diversity.

Collaborative filtering (CF) methods generate recommendations by analyzing preferences provided by users, such as purchase history or users' previous ratings and reviews of the items. The most popular and accurate collaborative filtering method is Matrix Factorization (MF) (Koren et al., 2009). This approach discovers the latent factor space shared between users and hotels, where the latent factors can be used to describe the characteristics of hotels and the tastes of users. The Tapestry text filtering system, developed by Nichols and others at the Xerox Palo Alto Research Center (PARC), applied collaborative filtering (Douglas, 1993; Harman, 1994). The GroupLens project at the University of Minnesota is a popular collaborative system.

Although this approach has been widely used, it has a few drawbacks, such as robustness, sparsity, and scalability (Claypool et al., 1999; Sarwar et al., 2000), that will be addressed in the hybrid model. Another drawback is the cold start problem, where recommendations are required for items that no user has rated yet.

The goal of the hybrid model is to resolve the two big problems and also to combine user preference with the popularity of hotels, creating a composite recommender. The main advantage of this model is the inclusion of algorithms that cover different aspects of the data and the subject problem to produce improved recommendations. Hybrid approaches can be divided into two groups:

  • The first group includes using collaborative and content-based filtering separately and then combining their predictions, using collaborative filtering as the basis and adding content-based capabilities to the system, or incorporating both approaches in one model (Adomavicius & Tuzhilin, 2005). Netflix, an American provider of Internet streaming media, is a famous example of the success of hybrid systems (Bennett & Lanning, 2007). Netflix analyzes the searching and watching preferences and habits of similar users (collaborative filtering) and recommends movies that have features similar to those the user has already rated highly (content-based filtering).
  • The other group is the sequential combination of content-based filtering and collaborative filtering. In such a system, a content-based filtering algorithm is first applied to find users who share similar interests. A collaborative algorithm is then applied to make predictions, as in RAAP (Delgado et al., 1998) and the Fab filtering system (Balabanovic and Shoham, 1990). RAAP is a content-based collaborative information filtering system that helps the user classify domain-specific information found on the WWW and also recommends these URLs to other users with similar interests. Similar interests among users are determined using a scalable Pearson correlation algorithm based on web page category. The Fab system uses content-based techniques instead of user ratings to create user profiles. The quality of its predictions therefore depends entirely on the content-based techniques: inaccurate profiles result in inaccurate correlations with other users and thus in poor predictions.
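
The simplest variant of the first group, running the two recommenders separately and combining their predictions, can be sketched as a weighted blend. This is an illustrative Python sketch, not the model built in the thesis; the weight `alpha`, the function names, and the example scores are assumptions.

```python
def hybrid_score(cf_score, cb_score, alpha=0.5):
    """Blend a collaborative and a content-based prediction.

    alpha is a hypothetical mixing weight (1.0 = pure CF, 0.0 = pure CB).
    If one recommender has no prediction (None), the other is used alone.
    """
    if cf_score is None and cb_score is None:
        return None
    if cf_score is None:
        return cb_score
    if cb_score is None:
        return cf_score
    return alpha * cf_score + (1 - alpha) * cb_score

def rank_hotels(cf_scores, cb_scores, alpha=0.5):
    """Rank every hotel scored by at least one recommender, best first."""
    hotel_ids = set(cf_scores) | set(cb_scores)
    blended = {
        h: hybrid_score(cf_scores.get(h), cb_scores.get(h), alpha)
        for h in hotel_ids
    }
    # Sort by descending score; break ties by hotel id for a stable order.
    return sorted(blended.items(), key=lambda pair: (-pair[1], pair[0]))

# Invented example: h3 has no ratings yet, so only the content-based score exists.
cf = {"h1": 4.0, "h2": 2.0}
cb = {"h2": 5.0, "h3": 3.0}
ranking = rank_hotels(cf, cb, alpha=0.5)
```

The fallback branch is what lets a blend like this still score a hotel that no user has rated, which is the cold start case described earlier.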

Hybrid systems can also be implemented by integrating feature-based methods with collaborative filtering to form a single learning model, which is often a matrix factorization model (e.g. Shan & Banerjee, 2010; Chen et al., 2012).

Common Algorithms

  • Association Rules: Association rules are a well-known technique in data mining, e.g. the Apriori algorithm (Agrawal & Srikant, 1994). Such techniques learn from the data and extract rules that predict the occurrence of an item based on the occurrences of other items. A number of studies have employed association rules in the context of recommender systems (e.g. Mobasher et al., 2001; Lin, Alvarez, & Ruiz, 2002). However, association rules need to be adapted to the application of the recommender system. In recommender systems any item can be recommended, so the association rules should be able to capture associations among any items, even if the support is small. This is challenging, since setting a small support threshold can lead to a very large set of associations. Some heuristic approaches, such as adaptive support (Mobasher et al., 2001) and sliding windows (Davidson et al., 2010), exist to overcome this problem.
  • Bayesian Classifiers: In Bayesian network classifiers, all features are considered as continuous or discrete random variables (Friedman, Geiger, & Goldszmidt, 1997). In particular, Bayes' theorem and conditional probabilities are used to classify the given data by maximizing the posterior probability of the items' class. In the recommender context, ratings can be considered as classes and the Bayesian classifier can be applied to the real-valued ratings. It is assumed that, given a class (e.g. a rating), users (or items) are independent, and the probability of the class is calculated on that basis. However, this assumption clearly does not hold in collaborative filtering, where users and/or items are assumed to be related. For this reason, Bayesian classifiers are mostly coupled with another algorithm in recommender systems. For example, Candillier, Meyer, and Boullé (2007) coupled a Bayesian model with a k-Nearest-Neighbors algorithm.
  • K-Nearest Neighbors: K-Nearest-Neighbor (KNN) based algorithms, also called memory-based algorithms, are widely used in the context of recommender systems (Su & Khoshgoftaar, 2009). They can be considered a generalization of association rules, as they go over all the items and/or users in the corpus. One of the serious limitations of KNN approaches is their lack of scalability. They can be time consuming in large-scale real-life applications because the time needed to build the model is quadratic, i.e. a function of the squared number of objects in the corpus.
  • Matrix Factorization: Matrix factorization techniques became very popular during the Netflix challenge (Bell & Koren, 2007). Matrix factorization algorithms are not only fast and accurate but also relatively easy to implement. However, they can be difficult to adapt for item-item recommendations. In general, such techniques transform a given matrix into, typically, three simpler matrices. In recommender systems, matrix factorization techniques have to deal with the missing values problem. The first matrix factorization approaches handled this problem by replacing the missing values of the rating matrix (Sarwar et al., 2000), which was not effective since it resulted in large dense matrices. A more effective approach is to fit the model parameters with regularization (Takács et al., 2008).
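
The regularized approach to matrix factorization mentioned in the last item can be sketched as follows. This is a minimal illustration in Python using stochastic gradient descent on the observed ratings only, with two factor matrices; it is not the procedure used in the thesis, and the toy ratings and all hyperparameters (`k`, `lr`, `reg`, `epochs`) are assumptions.

```python
import math
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.05, epochs=1000, seed=0):
    """Fit user factors P and item factors Q on observed (user, item, rating)
    triples using SGD with L2 regularization. Missing ratings are simply
    skipped instead of being filled in beforehand."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error plus the L2 penalty.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(P[u], Q[i]))

def rmse(P, Q, ratings):
    """Root-mean-square error over a list of observed ratings."""
    squared = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

# Invented toy data: 4 users, 3 hotels, ratings of 1 or 5; other pairs are unobserved.
ratings = [
    (0, 0, 5), (0, 1, 1), (1, 0, 5), (1, 2, 1),
    (2, 1, 5), (2, 2, 5), (3, 0, 1), (3, 2, 5),
]
P, Q = factorize(ratings, n_users=4, n_items=3)
```

After fitting, `predict(P, Q, u, i)` gives a score for any user-hotel pair, including pairs that were never observed, which is how the factor model fills the sparse matrix.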

Recommendation using Implicit Feedback: Implicit feedback captures the preferences of users who are not willing to rate items explicitly. It includes the time a user stayed on a webpage, the number of clicks on an item, and location. The importance of implicit feedback has been recognized recently, and it provides an opportunity to use the vast amount of implicit data that has already been collected over the years, such as activity logs.

Software Packages

Recommenderlab (https://cran.r-project.org/web/packages/recommenderlab/index.html) is a package for the R programming language. It provides a research infrastructure to test and develop recommender algorithms, including UBCF, IBCF, FunkSVD, and association rule-based algorithms. The software is free and open source under the GPL-2 license.

We suggest a technique that introduces the contents of items into item-based collaborative filtering to improve its prediction quality and solve the cold start problem. The detailed procedure of our approach is as follows:

Utility Matrix: The utility matrix gives, for each user-item pair, a value that represents the degree of preference of that user for that item. The user ID represents the user and the hotel cluster represents the item. When a hotel cluster is viewed by a user, a rating of 1 is given; when a hotel is booked, a rating of 5 is given.
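
The rating rule above can be sketched as a small Python function. This is an illustration only: the event format `(user_id, hotel_cluster, is_booking)` and the choice to keep the highest rating when a user both views and books the same cluster are assumptions, not details given in the source.

```python
VIEW_RATING = 1     # hotel cluster viewed by the user
BOOKING_RATING = 5  # hotel booked by the user

def build_utility_matrix(events):
    """Build a sparse utility matrix from interaction events.

    events: iterable of (user_id, hotel_cluster, is_booking) tuples
            (a hypothetical log format).
    Returns a dict keyed by (user_id, hotel_cluster); pairs that never
    appear in the log stay absent, i.e. unrated.
    """
    utility = {}
    for user_id, hotel_cluster, is_booking in events:
        rating = BOOKING_RATING if is_booking else VIEW_RATING
        key = (user_id, hotel_cluster)
        # Assumption: a booking outranks a view, so keep the highest rating seen.
        utility[key] = max(utility.get(key, 0), rating)
    return utility

# Invented example log.
events = [
    ("u1", 10, False),  # u1 views cluster 10
    ("u1", 10, True),   # u1 later books in cluster 10
    ("u2", 10, False),  # u2 only views cluster 10
]
utility = build_utility_matrix(events)
```

Storing only the observed pairs in a dictionary keeps the representation sparse, which matters when most user-hotel pairs have no interaction at all.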

Hierarchical Clustering: To address the scalability problem, hierarchical clustering is applied to group the large number of users into clusters. Using the K-Prototypes approach, an extension of K-Means, we can handle both categorical and numerical features. The number of clusters/prototypes (k) has to be given to the K-Prototypes algorithm. Finding the optimal number of clusters, the best k, was therefore crucial and could affect the performance of the system. We used the Gap statistic [14] to estimate the best k for the users' data. We found that the Gap statistic peaks at k = 4 with a value of ~1.006, which confirmed the existence of 4 clusters. In addition, the number of users in the data is massive, which makes it impossible to apply Matrix Factorization to the original utility matrix. Therefore, users are assigned to user clusters and the utility matrix is compressed accordingly.
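
The compression step at the end of this paragraph can be sketched as follows, assuming the cluster label of every user is already known. The source does not say how ratings within a cluster are aggregated; averaging the observed ratings is an assumption made for this Python illustration, as are the example data.

```python
from collections import defaultdict

def compress_by_cluster(utility, user_cluster):
    """Collapse a (user, hotel cluster) utility matrix into a
    (user cluster, hotel cluster) matrix.

    utility:      dict keyed by (user_id, hotel_cluster) -> rating
    user_cluster: dict mapping user_id -> cluster label
    Assumption: each cell is the mean of the observed ratings of the
    cluster's members; unrated pairs do not count towards the mean.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (user_id, hotel_cluster), rating in utility.items():
        key = (user_cluster[user_id], hotel_cluster)
        totals[key] += rating
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Invented example: three users assigned to two clusters.
utility = {("u1", 10): 5, ("u2", 10): 1, ("u3", 11): 5}
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
compressed = compress_by_cluster(utility, user_cluster)
```

With 4 user clusters, as found by the Gap statistic, the compressed matrix has only 4 rows regardless of how many users the original data contains.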

Singular Value Decomposition (SVD) method: After clustering the users based on their preferences in the utility matrix, the utility matrix might still be very sparse, because it is rare for a cluster of users to have rated most of the hotels. We would like to fill the unrated entries in the utility matrix with the smallest possible error. SVD is applied here to do that.
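
In standard notation (the symbols below are not taken from the source), let R be the compressed utility matrix with one row per user cluster c and one column per hotel cluster i. A rank-k truncated SVD keeps the k largest singular values and reads the predicted rating off the reconstruction:

```latex
R \approx R_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{r}_{ci} = \sum_{f=1}^{k} \sigma_f \, u_{cf} \, v_{if}
```

Here U_k and V_k hold the first k left and right singular vectors and Σ_k is the diagonal matrix of the singular values σ_1 ≥ … ≥ σ_k. By the Eckart-Young theorem, R_k is the best rank-k approximation of R in the Frobenius norm, which is the sense in which the reconstruction has the smallest error. Because SVD requires a complete matrix, the unrated entries must first be given provisional values (for example row means, which is an assumption here since the source does not state the choice).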

Decision Tree Classifier: A decision tree is used to predict the cluster label of a new user from the user's profile data. A decision tree gives a high-level overview of all the sample data; it can not only accurately identify all categories in the sample but also effectively identify the class of a new customer. To avoid overfitting, cross-validation is used to select the best decision tree.

Combining user preference with item properties: If a hotel is booked by the user it is rated 5, if it is clicked by the user it is rated 1, and otherwise it is unrated. Based on the booking history, a ranking matrix of hotel properties can be created from these scenarios.

The TripAdvisor website is used as the data source for hotels in this research. The website is free to use, and the company's business plan is based on support from advertising. The availability of the data, as well as the possibility of comparing the results with similar studies that used TripAdvisor as the input data source, were among the main reasons for selecting TripAdvisor.

Using the R programming language, we scraped data such as user reviews and hotel ratings from the TripAdvisor website using multiple sources. The data consist of ratings, value, location, and user reviews.

To achieve better accuracy and evaluation of results, the data need to be pre-processed, mainly because they were collected from multiple sources. Pre-processing tasks include noise removal, lowercase conversion, and removal of stop words and punctuation.
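
The pre-processing tasks listed above can be sketched for a single review as follows. This is an illustrative Python sketch, not the R code used in the study; treating leftover HTML tags as the "noise" and the short stop-word list are both assumptions.

```python
import re
import string

# A deliberately small, hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "is", "was", "were",
    "it", "of", "to", "in", "at", "this", "for", "on", "with",
}

def preprocess(review):
    """Clean one review and return its remaining tokens."""
    text = review.lower()                      # lowercase conversion
    text = re.sub(r"<[^>]+>", " ", text)       # noise removal (assumed: HTML tags)
    text = text.translate(                     # punctuation removal
        str.maketrans("", "", string.punctuation)
    )
    tokens = text.split()                      # split on any whitespace
    return [tok for tok in tokens if tok not in STOP_WORDS]  # stop-word removal

tokens = preprocess("The room was <b>GREAT</b>, and the staff is friendly!")
```

The tags are stripped before the punctuation because the punctuation step would otherwise delete the angle brackets and leave the tag names behind as spurious words.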

The leave-one-out cross-validation (LOOCV) approach was selected for validating the results. In LOOCV with n data points, one observation (data point) is used as the validation set in each run, while the remaining data points form the training set. The procedure is repeated n times, so that every data point serves as the validation set once. A set of decision-based error measures, i.e. accuracy, specificity, sensitivity, and informedness, as well as three prediction-based error metrics, i.e. mean absolute error (MAE), mean squared error (MSE), and root-mean-square error (RMSE), were calculated for evaluating the results.
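
The validation loop and the error measures named above can be sketched as follows. This is a minimal Python illustration: the mean-rating predictor stands in for the actual recommender, and the threshold that turns a rating into a recommend / do-not-recommend decision is an assumption.

```python
import math

def loocv(data, fit, predict):
    """Leave-one-out cross-validation over a list of (x, y) observations.
    Returns (actual, predicted) pairs, one per held-out observation."""
    pairs = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # all observations except the i-th
        model = fit(train)
        pairs.append((y, predict(model, x)))
    return pairs

def prediction_errors(pairs):
    """Prediction-based metrics: MAE, MSE and RMSE."""
    n = len(pairs)
    mae = sum(abs(a - p) for a, p in pairs) / n
    mse = sum((a - p) ** 2 for a, p in pairs) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

def decision_metrics(pairs, threshold):
    """Decision-based metrics; ratings >= threshold count as 'recommend'."""
    tp = fp = tn = fn = 0
    for actual, predicted in pairs:
        actual_pos, pred_pos = actual >= threshold, predicted >= threshold
        if actual_pos and pred_pos:
            tp += 1
        elif actual_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "informedness": sensitivity + specificity - 1,
    }

# Stand-in model (an assumption): always predict the mean training rating.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)

def predict_mean(model, x):
    return model

data = [("h1", 5), ("h2", 5), ("h3", 1), ("h4", 5)]  # invented ratings
pairs = loocv(data, fit_mean, predict_mean)
errors = prediction_errors(pairs)
decisions = decision_metrics(pairs, threshold=3)
```

Informedness is sensitivity plus specificity minus one, so a predictor that recommends everything, like the mean predictor on this toy data, scores zero even though its accuracy looks reasonable.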

The hybrid model achieves a prediction accuracy of 53.6% on the testing data, a 4% improvement over the content-based model. This result is consistent with our hypothesis that both user preference and hotel popularity are vital in a recommendation system.

Finally, in this thesis, a hybrid solution was proposed for predicting ratings for user-hotel pairs and making recommendations. Recommendation techniques that use explicit feedback such as ratings and implicit feedback such as clicks and page views are gaining popularity. We apply a clustering technique to the item content information to complement the user rating information, which improves the correctness of collaborative similarity and solves the cold start problem.

Our hybrid model can be further improved in the following aspects. A larger dataset will be applied to this model, so a density-based clustering method should be used instead of hierarchical clustering. More features, such as hotel country and hotel market, might be included to test their impact on prediction. Although a set of content features was used in this thesis for training the recommender systems, there might exist many other content features on users or hotels that are worth examining.
