Skip to article frontmatterSkip to article content

8.3 Page Rank and the Google Matrix

ranking websites

Dept. of Electrical and Systems Engineering
University of Pennsylvania

Binder

Lecture notes

1Reading

Material related to this page, as well as additional exercises, can be found in LAA 5th5^{th} edition 10.2.

2Learning Objectives

By the end of this page, you should know:

  • what is a random walk on a graph
  • how a simple internet is modeled as a random walk on a graph
  • how ranking webpages helps internet users
  • the primary modifications made by Google to address the non-regularity of transition matrices that model links between webpages
  • the magnanimity of Google’s transition matrix and how they handle them

3Random Walk on the Internet

An internet user’s behavior while surfing the web can be modeled as a Markov chain that captures a random walk on a graph. We start with a simple example of this, and then explain how it can be used to design a search engine.

Consider the following graph:

Graph

which has seven vertices interconnected by edges. Let’s pretend this graph models a very simple internet! each vertex, or node, is a webpage, and each edge is a hyperlink connecting pages to each other. (For now we assume that if page ii links to page jj, then page jj also links to page ii, but this isn’t necessary.) We assume that if a user is on page ii, they will click on one of the hyperlinks with equal probability. For example, in our simple internet, if a user is on page 5, they will visit page 2 next 50% of the time and page 6 next 50% of the time. Similarly, if a user is on page 3, they will visit page 1, 2, or 4 next 33% of the time each.

We can model this user behavior using a Markov chain, which is called a simple random walk on a graph. The transition matrix for this graph is given by:

12 34567P=[013140000120140120012130101300014000001300013000140120100000130]1234567\begin{align*} & \quad \quad \begin{matrix} 1 & 2 & \ 3 & 4 & 5 & 6 & 7 \end{matrix} \\ P &= \begin{bmatrix} 0 & \frac{1}{3} & \frac{1}{4} & 0 & 0 & 0 & 0 \\ \frac{1}{2} & 0 & \frac{1}{4} & 0 & \frac{1}{2} & 0 & 0 \\ \frac{1}{2} & \frac{1}{3} & 0 & 1 & 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{4} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{3} & 0 & 0 & 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{4} & 0 & \frac{1}{2} & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & \frac{1}{3} & 0 \end{bmatrix} \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7\end{matrix} \end{align*}

This allows us to answer questions such as the following: suppose 100 users start on page 6. After each user clicks on three hyperlinks, what % of users do we expect to find on each web page. The solution is given by setting our initial user distribution to:

x(0)=[0000010]\vv x(0) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}

and computing x(3)=P3x(0)=[.0833.0417.40280.27780.1944]\vv x(3) = P^3\vv x(0) = \begin{bmatrix} .0833 \\ .0417 \\ .4028 \\ 0 \\ .2778 \\ 0 \\ .1944 \end{bmatrix}

This tells, for example, that 40.28% of users starting on page 6 end up on page 3 after 3 hyperlink clicks.

4Page Rank

The founders of Google, Sergey Brin and Lawrence Page, reasoned that important pages had links coming from other “important” pages, and thus, a typical internet user would spend more time on more important pages, and less time on less important pages. This can be captured by the steady state distribution x\vv x^* of the Markov chain we are using to model the internet: in the long run, a typical user will spend xi\vv x^*_i % of their time on page ii. This is precisely the observation used to define the Page Rank algorithm, which Google uses to rank the importance of the webpages it catalogs.

Now, if a typical transition matrix PP describing the internet were regular, we would be done - we simply compute x=Px\vv x^* = P\vv x^* and use x\vv x^* to rank websites. Unfortunately, typical models of the web are directed graphs, which lead to non-regular transition matrices PP. To address this, Google makes two adjustments, which we illustrate with the following slight modification of our previous example:

Directed Graph
12 34567P=[01200000001301200100001300013100001200013000130120000000131]1234567\begin{align*} & \quad \quad \begin{matrix} 1 & 2 & \ 3 & 4 & 5 & 6 & 7 \end{matrix} \\ P &= \begin{bmatrix} 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{3} & 0 & \frac{1}{2} & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{3} & 1 & 0 & 0 & 0 \\ 0 & \frac{1}{2} & 0 & 0 & 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{3} & 0 & \frac{1}{2} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & \frac{1}{3} & 1 \end{bmatrix} \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7\end{matrix} \end{align*}

The first issue that arises here is that pages 4 and 7 are dangling nodes: if a user ever ends up on page 4 or 7, they never leave as there are no outgoing links. This means that columns 4 and 7 never change as we compute PkP^k, and hence PP cannot be regular. To handle dangling nodes, the following adjustment is made:

The above means that if state jj is an absorbing state, we replace column jj of PP with the vector [1n1n]\bm \frac{1}{n} & \ldots & \frac{1}{n} \em. For example, our modified transition matrix is now:

12 34567P=[0120170017001317120171001701317001317001701201701317001317120170001701317]1234567\begin{align*} & \quad \quad \begin{matrix}1 & 2 & \ 3 & 4 & 5 & 6 & 7 \end{matrix} \\ P_{*} &= \begin{bmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{7} & 0 & 0 & \frac{1}{7} \\ 0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\ 1 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\ 0 & 0 & \frac{1}{3} &\frac{1}{7} & 0 & 0 & \frac{1}{7} \\ 0 & \frac{1}{2} & 0 &\frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\ 0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\ 0 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \end{bmatrix} \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7\end{matrix} \end{align*}

While this helps eliminate dangling nodes, we may still not have a regular transition matrix, as there might still be cycles of pages. For example, if page ii only links to page jj, and page jj only links to page ii, a user entering either page is condemned to moving back and forth between the two pages. This means the corresponding columns of PkP_*^k will always have zeros in them, and hence PP_* would not be regular.

In terms of the modified transition matrix PP_*, the new transition matrix will be

G=pP+(1p)K,G = pP_* + (1-p)K,

where KRn×nK \in \mathbb{R}^{n \times n} and Kij=1nK_{ij} = \frac{1}{n} for i,j=1,,ni,j=1,\ldots,n. The matrix GG is called the Google matrix. GG is easily seen to be regular, as all entries of GG are positive.

Although any value of pp is valid, Google is thought to use p=0.85p = 0.85. For our example, the Google matrix is

G=.85[0120170017001317120171001701317001317001701201701317001317120170001701317]+.15[1717171717171717171717171717171717171717171717171717171717171717171717]=[.021429.446429.021429.142857.021429.021429.142857.021429.021429.304762.142857.446429.021429.142857.871429.021429.021429.142857.021429.304762.142857.021429.021429.304762.142857.021429.021429.142857.021429.446429.021429.142857.021429.304762.142857.021429.021429.304762.142857.446429.021429.142857.021429.021429.021429.142857.021429.304762.142857]\begin{align*} G &= .85\begin{bmatrix} 0 & \frac{1}{2} & 0 & \frac{1}{7} & 0 & 0 & \frac{1}{7} \\ 0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\ 1 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\ 0 & 0 & \frac{1}{3} &\frac{1}{7} & 0 & 0 & \frac{1}{7} \\ 0 & \frac{1}{2} & 0 &\frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \\ 0 & 0 & \frac{1}{3} & \frac{1}{7} & \frac{1}{2} & 0 & \frac{1}{7} \\ 0 & 0 & 0 & \frac{1}{7} & 0 & \frac{1}{3} & \frac{1}{7} \end{bmatrix} + .15\begin{bmatrix} \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \end{bmatrix} \\ &= \begin{bmatrix} .021429 & .446429 & .021429 & .142857 & .021429 & .021429 & .142857 \\ .021429 & .021429 & .304762 & .142857 & .446429 & .021429 & .142857 \\ .871429 & .021429 & .021429 & .142857 & .021429 & .304762 & .142857 \\ .021429 & .021429 & .304762 & .142857 & .021429 & .021429 & .142857 \\ .021429 & .446429 & .021429 & .142857 & .021429 & .304762 & .142857 \\ .021429 & .021429 & .304762 & .142857 & .446429 & .021429 & .142857 \\ .021429 & .021429 & .021429 & .142857 & .021429 & .304762 & .142857 \end{bmatrix} \end{align*}

We can now compute the steady state vector x=Gx\vv x^* = G\vv x^*, which is found to be:

x=[.116293.168567.191263.098844.164054.168567.092413]\vv x^* = \begin{bmatrix} .116293 \\ .168567 \\ .191263 \\ .098844 \\ .164054 \\ .168567 \\ .092413 \end{bmatrix}

Thus, we can rank the pages in terms of descending importance as: 3, 2, 6, 5, 1, 4, 7.

Binder