Finding the Most Influencial Yelpers Using Stochastic Optimization Methods

**Who are the most influencial Yelp reviewers?** This question is an example of the Influence Maximization Problem

$$I: G \times S_1 \to \mathbb{R}$$

$f:S_1 \to \mathbb{R} $

$$f(S_1) = \frac{1}{N} \sum^N I(G, S_1) $$

How do we find a network of users? One simple method is to query the user data to find "friends" of the users, and build out a friendship graph. However, use of the friendship network will not yield a directed graph; moreover, anecdotal evidence suggests that Yelp users do not necessarily follow the reviews of their friends either due to distance (living in different cities) or due to differences in taste.

We looked for a more innovative approach to find the network in the review data. Analyzing only the positive reviews (ie. four stars or above), we modeled the network of users.

What we found is a hidden network of users where each user is represented by node. A directed edge is formed from node(user)$i$ to node $j$, if node $j$ reviews a business positively within 30 days after node $i$ has reviewed the same business. We made the assumption, which is not a trivial one, that when user $j$ positively reviews a business, she read and was influenced by the positive review of user $i$. Given this assumption, we model the edge weight between node $i$ and node $j$ as a sample from the Beta distribution such that,

$$e_{i,j} \sim Beta(\alpha_{i,j}, \beta_{j}) $$

where $e_{i,j}$ represents probability of node(user) $i$ influencing node $j$.

We define the parameters as follows:

$\alpha_{i,j}:=$ Number reviews node $j$ made following node $i$

$\beta_{j}:=$ Total number of reviews by node $j$

The from the expectation of Beta distribution, we "expect" node $i$ will influence node $j$ by the following equation:

$$\mathbf{E}[e_{i,j}] = \frac{\alpha_{i,j}}{\alpha_{i,j} + \beta_{j}}$$

We present our graphs of three states below. *(Data from other states are too big for simple visualization here.)*

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas vel sodales libero, varius porttitor leo. Nullam pellentesque varius neque, at malesuada dolor. Suspendisse turpis nisl, consectetur sit amet maximus non, sollicitudin id risus. In tincidunt sapien neque, volutpat convallis nisl tempor quis. Vivamus lorem sem, imperdiet sed orci eget, scelerisque mollis dolor. Sed feugiat ut leo nec dignissim. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Read more

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas vel sodales libero, varius porttitor leo. Nullam pellentesque varius neque, at malesuada dolor. Suspendisse turpis nisl, consectetur sit amet maximus non, sollicitudin id risus. In tincidunt sapien neque, volutpat convallis nisl tempor quis. Vivamus lorem sem, imperdiet sed orci eget, scelerisque mollis dolor. Sed feugiat ut leo nec dignissim. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Read more

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas vel sodales libero, varius porttitor leo. Nullam pellentesque varius neque, at malesuada dolor. Suspendisse turpis nisl, consectetur sit amet maximus non, sollicitudin id risus. In tincidunt sapien neque, volutpat convallis nisl tempor quis. Vivamus lorem sem, imperdiet sed orci eget, scelerisque mollis dolor. Sed feugiat ut leo nec dignissim. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Read more

Before we perform the optimization using the three methods, we take a quick excursion to clarify a problem in the influence maximization problem.

$$\hat{f} \approx f + e$$ where $e \sim Normal(0,\sigma_f)$.

We ran several sample runs to find the $\sigma_f$ as a function of $N$. Below is a good example that demonstrates the asymptotic value of $\sigma_f$.

In conclusion, we define $\sigma_f$ as polynomial function of $N$ such that

$$\sigma_f = \frac{\lambda}{\sqrt{N}}$$

where $\lambda$ is unique to the starting set, $S$.

Thus, by boosting up $N$, we are able to reduce the noise of $\sigma_f$, and find $\hat{f}$ that is a close approximation to $f$.

We settled on $N=10000$, which we found to be both computationally feasible and sufficiently accurate in the estimation of $f$.

Using the three optimization methods described above, we solved the influence maximization problem on the North Carolina user network graph. We tested our optimization methods against the heuristic method of selecting $k$ nodes with the highest out-degree. We ran the test in $k=\{1,2,...20\}$.

For each solution, we evaluate its influence, $f$, and plot the results below. We find that all three algorithms outperforms the heuristic method for the most of $k$'s.

Unfortunately, as we can see from the chart above, simulated annealing noticeably failed to find the optimal solution. We believe that one of the weaknesses of stochastic optimization is inherent stochasticity and its vulnerability to noise in the optimizing function.

We end our exploration of the influence maximization problem here. Rather than the outcome of the influence maximization problem itself, we found our result demonstrated weaknesses in these stochastic methods in dealing with optimization of functions with noise. We believe exploration of dealing with functions with noise in optimization problem is an interesting future direction of research.

Stochastic programming is an optimization method that addresses these challenges in noisy functions. We may find the techniques found in stochastic programming applicable to simulated annealing and genetic algorithms.

The influence maximization problem should provide a good testing problem for the future optimization techniques.