I wrote the predecessor to this post as an assignment for school. However, I found it an interesting introduction to linear algebra so I've updated it and published it here.
When I was five, I typed a word into Google for the first time and was greeted instantly by a page of relevant results. To this day, it still amazes me how Google manages to rank their results. They can take the entire internet, and, with no other information, figure out what matters and what doesn’t. For years, this whole system was a kind of “black box” to me, until I stumbled upon a mention of eigenvectors, which led to this paper.. The ranking algorithm, known as PageRank, has been a key part of their success over the years. Invented in 1998 by Larry Page and Sergey Brin, graduate students at Stanford, it was once a trade secret to Google’s rise. It is important to note that it does not deal with determining what results are relevant to a search, but rather, given some results, which are the most important. The goal of this paper is to explain this system and use it on a test case to gain a more intuitive understanding. The problem the system solves can be stated as: Given a set of pages with links to each other, how can we order them based on popularity?
Before proceeding, it will be important to define the problem in mathematical terms. The only information we will be given is all the relevant pages and their links to each other. From this, we can think of the web as a directed graph. Each web page will be a node, and each link will be an edge, or a connection between two nodes, pointing towards its destination. Below is a sample network of four web pages on which the process will be shown. Node 2, for example, has links to node 3 and 4 and is linked to by node 1.
We can also think about what we want popularity to mean in our context. We want a page with more links in-- say a popular blog-- to matter more than pages with few. We also want to consider not just how many links there are, but where they come from. A blog with several links from very important websites clearly gets more visitors than one with many links from unused websites, and should be ranked as more popular despite having fewer links.
To order the pages, we will need to assign them scores of how popular they are. We need, essentially, a list of numbers, and a vector is just that. With a vector with n elements, we can store the importance of n pages. In our case, n is 4. To begin, we fill this vector with ones as our starting values. This process initially treats all pages the same, regardless of any other information known about them.
The algorithm works as follows. Every page will start with a set importance. Each iteration, it will divide its importance evenly between every page it links to and add that to each of their values. This can be thought of as each node “giving” its value to all the nodes it links to, and “receiving” value from all the nodes that link to it.
We can represent this transformation with an n by n matrix, where n is our number of pages. Each column contains all of the links leaving a single page, and each row contains all the links pointing to a page. This may not initially feel intuitive, but it fits in perfectly with the definition of matrix and vector multiplication. When we multiply a vector by a matrix, for each element, we add up all the values in the corresponding rows by their corresponding other values in the vector. This returns another vector, as seen below.
For each link’s corresponding cell in our matrix, we set the value to 1/k, where k is the number of outgoing links from that page. For each page, with k links, all weighted 1/k, the combined weight from each page is k/k, or just 1. Below is our matrix for this graph.
We can think of applying this transformation as multiplying our starting vector by the matrix once. Since the diagonal, the “a component in a” and “b component in b,” is full of zeros, we can see that each step all of a page’s current value will be “lost,” at the same time as it is split among other pages. Page 1, represented in the first column, will contribute nothing to itself (the top zero) and a third of its value to each of pages 2, 3, and 4.
Since the weights of links from a page sum to one, each column in the matrix will sum to one. This has several implications for our matrix. It means that it is a Markov Matrix (also called a transition matrix or stochastic matrix) by definition. This will come into play later when finding the solution to our question of popularity.
This does, however, add some restrictions to our input. Each page must have at least one vector in, and one vector out, otherwise the matrix will not be a Markov matrix. The actual algorithm used by Google deals with cases with no inbound or outbound links by “fuzzing” the data, so none of the cells equal zero. For now, however, we will just skip over these cases. Websites with no links in or out are not terribly common, and the ones that are exist often not very important or interesting. While Google’s algorithm has expanded significantly on what is show here, this process serves well to explain the “core” of the process and why it works.
Once we have this matrix, represented as A in the figure to the left, we need to figure out what state the vector, and thus the probabilities, will end up in when transformed not just once, but many times. This is an application of another topic from linear algebra: the eigenvector.
An eigenvector of a matrix is a vector that, when multiplied by that matrix, changes only in scale, or magnitude. The change in scale is our corresponding eigenvalue. In other words, we are looking a vector that will point in the same direction when transformed, or multiplied, by the matrix. If A is a matrix, v is an eigenvector and lambda is a corresponding eigenvalue, we can write this definition as follows.
Usually, finding eigenvectors requires finding the eigenvalues first, and these require finding the roots of something called a characteristic polynomial, which can be tough. However, in our case, with a Markov matrix, we always have at least one eigenvalue: 1. This is intuitive- each node is only dividing its importance between what it links to, so net importance in the system, and thus the magnitude of the vector, should be constant. There may be other eigenvalues as well, but we can discard them as they are not relevant here.
In our case, the values of the eigenvector can be thought of as “equilibrium values,” as the system will approach them over many iterations. And indeed, this will be how to find them. For any matrix trying to represent the entire internet, any algebraic method will be unwieldy and take much too long. There could be billions, or even trillion of rows and columns! Instead, in our iterative method, we will apply our transformation (multiply by our matrix) repeatedly. As we do this, the starting vector will approach some other vector. To the left is a series of resulting vectors of the starting vector multiplied by our transformation matrix, A, different numbers of times, showing how they approach a value. This final value, approximated well enough with twenty transformations, gives the solution to our problem.
The elements in this vector represent the “popularity values” assigned to our pages. Page 1 has a value of roughly 1.550, page 2 has roughly 0.515, and so on. All that is left to do is sort the pages by their values in the vector and return them as a list, as seen below.
Final Popularity Rankings:
Below is this data, overlaid on our original network. Here you can see how the numbers of links affect the popularity values and rankings of each page.
At this point, the results of the algorithm would be returned to the user on the Google search page, where links and titles of each page are given in this order. This process is not performed for every search, rather, only to update the index. The values are stored in a vector, so we can just reuse that vector for every search. This is also combined with a sorting for relevance, which we will not discuss here
In this investigation, I attempted to learn about Google’s PageRank algorithm, and how it uses matrices to find out how popular web pages are. During my exploration, I learned about how we can represent a graph in a vector or matrix, how matrix transformations of vectors work, and the properties of one type of matrix, Markov matrix. The reality is that the system examined here is a much simplified model of the one that Google was built on. In fact, today Google has moved away from this ranking system entirely towards its own proprietary own. Still, the processes shown here are a critical part of the history of the internet: Google was one of the key factors that allowed the internet to grow into what it is today, and that was utilizing the very process explored in this paper.
The most important learning I will take away from this, however, does not have to do with matrices, but rather more generally with algorithms. Through my exploration, I learned that their are mathematical processes underpinning much of our society today, and that even something like typing a search into Google relies on a long process constructed with years of experience with math.
Radu, Raluca Tanase Remus. “Lecture #3.” PageRank Algorithm - The Mathematics of
Google Search, www.math.cornell.edu/~mec/Winter2009/RalucaRemus/
“PageRank.” Wikipedia, Wikimedia Foundation, 14 Dec. 2017,
K. Bryan. “The $25,000,000,000 eigenvector: The linear algebra behind Google”
SIAM Review, 48 (3) (2006), pp. 569-581
“Google matrix.” Wikipedia, Wikimedia Foundation, 28 Aug. 2017,
“How Google works: Markov chains and eigenvalues.” Klein Project Blog, 7 Feb. 2017,