Yelp online reviews strongly influence users and have an important impact on consumers’ decision-making. Users tend to search Yelp reviews for useful information, yet often fail to find relevant, reliable, and trustworthy ones. The reasons are that highly rated reviews tend to be outdated, the latest reviews tend to have few or no useful votes, and some businesses have no reviews voted as useful at all. This is problematic because users and businesses cannot tell which review is the most relevant when searching, have difficulty deciding which reviews to value the most, and consequently must look through many reviews to find the newest and most relevant ones. This project therefore aims to predict which reviews will be voted as useful, using the Naive Bayes classification algorithm, in order to solve the problems caused by the lack of usefully voted reviews.
● Python (Pandas, Matplotlib) ● RapidMiner
To explore the data of the three original datasets (yelp_review, yelp_business, and yelp_user):
● Original “yelp_review” dataset statistics Based on the core problem we aim to solve, “yelp_review” is the main dataset needed for identifying useful reviews, since it is the fundamental source for detecting them. The “yelp_review” dataset contains 5,261,608 rows and 9 columns, with records from July 22, 2004 to December 11, 2017. In addition, some of the text is not in English, and some values are missing.
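The basic profiling above can be sketched in pandas. The snippet below uses a tiny illustrative frame that mimics the yelp_review schema (the real run would read the full yelp_review CSV instead); the sample rows and values are invented for demonstration only.

```python
import pandas as pd

# Toy sample mimicking the 9-column yelp_review schema; values are illustrative only
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, None],                # a missing value, as seen in the real data
    "date": ["2004-07-22", "2017-12-11", "2010-05-01"],
    "text": ["Great food!", "Mediocre.", "Service très lente."],  # non-English text occurs
    "useful": [10, 0, 2],
    "funny": [1, 0, 0],
    "cool": [2, 0, 1],
})
reviews["date"] = pd.to_datetime(reviews["date"])

print(reviews.shape)                                  # rows x columns
print(reviews["date"].min(), reviews["date"].max())   # date range of the records
print(reviews.isna().sum().sum())                     # total missing values
```

On the full dataset the same three lines report the 5,261,608 x 9 shape and the 2004-07-22 to 2017-12-11 date range.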
● Original “yelp_business” dataset statistics The “yelp_business” dataset contains business location information, which may be a significant attribute when determining useful reviews. Moreover, the business table is needed to find the information associated with each review’s business_id. The “yelp_business” dataset contains 174,134 rows and 13 attributes.
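The lookup of business information by business_id described above is a left join, which can be sketched with pandas as follows; the mini-tables here are invented stand-ins for the real yelp_review and yelp_business files.

```python
import pandas as pd

# Invented mini-tables; the real data comes from yelp_review and yelp_business
reviews = pd.DataFrame({
    "review_id": ["r1", "r2"],
    "business_id": ["b1", "b2"],
    "useful": [10, 0],
})
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Cafe A", "Diner B"],
    "state": ["AZ", "NV"],
})

# Attach business/location info to every review via its business_id
merged = reviews.merge(business, on="business_id", how="left")
print(merged[["review_id", "state", "useful"]])
```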
● Original “yelp_user” dataset statistics The “yelp_user” dataset is needed to match its user_id to the review dataset’s user_id, to see which users have written useful reviews and whether there is any similarity among them. The screenshot below shows a portion of the summary statistics; it reports how many friends each user has and which types and how many compliments each user received from other users. The averages of the compliment attributes fall in the range of 0 to 3, while their minimum and maximum values differ widely.
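The compliment summary statistics can be reproduced with pandas’ describe(); the toy frame and the compliment_* column names below are assumptions modeled on the schema described, not the actual file.

```python
import pandas as pd

# Toy yelp_user slice; compliment_* column names are assumed from the schema described
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "friends_count": [10, 0, 250, 3],
    "compliment_cool": [0, 1, 8, 0],
    "compliment_funny": [0, 0, 10, 1],
})

# Averages sit in a small range while min and max differ widely,
# matching the pattern noted in the summary statistics
stats = users[["compliment_cool", "compliment_funny"]].describe()
print(stats.loc[["mean", "min", "max"]])
```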
Our project further explored the relationship between the useful attribute and location to decide whether to limit the analysis to a specific area. The figure indicates that every state shows roughly the same 50/50 split between reviews with zero useful votes and reviews with at least one. Therefore, there is no need to narrow down to a specific area.
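The per-state split can be computed with a groupby; the frame below is a tiny invented example of the review-plus-business join, showing the roughly even ratio the figure illustrates.

```python
import pandas as pd

# Invented mini-frame standing in for the review/business join on business_id
df = pd.DataFrame({
    "state": ["AZ", "AZ", "NV", "NV"],
    "useful": [0, 3, 5, 0],
})
df["has_useful"] = df["useful"] > 0

# Share of reviews with at least one useful vote, per state;
# values near 0.5 everywhere mean no state stands out
ratio = df.groupby("state")["has_useful"].mean()
print(ratio)
```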
To find the correlation between useful votes and other attributes, our team ran a regression model. The regression results show that the funny and cool variables are positively related to useful votes, but not strongly correlated with them. Therefore, our group concluded that the project should focus on the review text and the useful attribute in the review dataset.
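A quick pairwise check of those relationships can be done with pandas’ corr(); the numbers below are invented for illustration, while the real run used the full yelp_review table.

```python
import pandas as pd

# Invented vote counts; the real analysis ran on the full review dataset
df = pd.DataFrame({
    "useful": [0, 1, 2, 5, 10],
    "funny":  [0, 0, 1, 2, 3],
    "cool":   [0, 1, 1, 3, 4],
})

# Pearson correlation of 'useful' with the other vote counts:
# positive, but on the real data not strongly so
corr = df.corr()["useful"].drop("useful")
print(corr)
```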
This project used a Naive Bayes algorithm with Binary Term Occurrences. Naive Bayes is simple, fast, and very effective, especially during training; it works well regardless of the size of the data, so estimated probabilities for prediction can be obtained easily. At this point, we decided to use the ‘Weight by Information Gain Ratio’ operator with top k = 3000, because it calculates the weight of attributes with respect to the label attribute using the information gain ratio: the higher an attribute’s weight, the more relevant it is considered.
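The RapidMiner workflow above can be approximated in scikit-learn as a rough sketch: binary term occurrences via CountVectorizer(binary=True), a Bernoulli Naive Bayes classifier, and top-k feature selection. Note two stated assumptions: mutual information is used here as a stand-in for RapidMiner’s information gain ratio, and k is capped far below 3000 to suit the invented toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Invented labelled reviews (1 = voted useful); real labels come from the 'useful' column
texts = [
    "The BEST service I have ever had",
    "slower than the other place, worst waits",
    "ok food",
    "nothing special here",
    "greatest patio, heavier portions than expected",
    "meh",
]
labels = [1, 1, 0, 0, 1, 0]

pipe = Pipeline([
    # binary=True gives Binary Term Occurrences (presence/absence, not counts)
    ("vec", CountVectorizer(binary=True, lowercase=False)),
    # Stand-in for RapidMiner's 'Weight by Information Gain Ratio' with top k = 3000;
    # mutual information approximates it, and k is shrunk for this toy corpus
    ("select", SelectKBest(mutual_info_classif, k=10)),
    ("nb", BernoulliNB()),
])
pipe.fit(texts, labels)
pred = pipe.predict(["worst service, slower than ever"])
print(pred)
```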
Compared with the first phase of text pre-processing, the ‘Transform Case’ and ‘Stem’ operators are excluded, because for a model that predicts useful reviews, past tense, comparatives, superlatives, and capital letters do play a significant role. Stemming and case transformation can be useful for tasks such as identifying hot topics, where folding all cases to one type and normalizing words with the same meaning to one unified stem makes running the model more efficient. In our case, however, the reviews contain capital letters and varied adjective forms that present unique patterns in useful reviews, as follows: some of the classified useful reviews emphasized feelings with capital letters and described their experiences with the business in the past tense, and many used comparative or superlative adjectives such as ‘slower’, ‘smoother’, ‘lighter’, ‘heavier’, ‘greatest’, and ‘worst’. If those features were eliminated by the stemming or case-transformation operators, the accuracy of the model would decrease.
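The effect of skipping ‘Transform Case’ can be seen directly in the resulting vocabulary; the sketch below contrasts a case-preserving tokenizer (as in our final pipeline) with a lowercasing one, using an invented example review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented review emphasizing feelings with capital letters
review = ["The wait was SLOWER and the food was the WORST"]

# Keeping case (our choice): 'SLOWER' and 'WORST' survive as distinct tokens
keep_case = CountVectorizer(binary=True, lowercase=False)
keep_case.fit(review)

# Lowercasing (the excluded 'Transform Case' step) collapses the emphasis
fold_case = CountVectorizer(binary=True, lowercase=True)
fold_case.fit(review)

print(sorted(keep_case.vocabulary_))
print(sorted(fold_case.vocabulary_))
```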
To this end, we created a Yelp review mining model with the Naive Bayes algorithm that achieved 94.5% accuracy. Through this modeling we were able to present four categories that identify the features of useful reviews, based on the classified results. Although the information obtained is limited, the model revealed unique features of useful reviews, and these features could open the door to further algorithm development by setting more comprehensive classification criteria based on the project’s results.
● Defined 3 issues in the current Yelp review algorithm that create a bottleneck for improving customer experience, and identified the differences between useful and not-useful reviews to solve the business backlog.
● Created an NLP- and software-development-life-cycle-based online review mining model with the Naive Bayes algorithm, achieving 94.5% accuracy.