- Home
- Search Engine Optimisation
- How Google Applies Science to Search (Part 1)
How Google Applies Science to Search (Part 1)
- By Kalena Jordan
- Published 09/8/2008
- Search Engine Optimisation
- Unrated
Kalena Jordan
Article by Kalena Jordan, one of the first search engine optimization experts in Australia, who is well known and respected in the industry, particularly in the U.S. As well as running a daily Search Engine Advice Column, Kalena manages Search Engine College - an online training institution offering instructor-led short courses and downloadable self-study courses in Search Engine Optimization and other Search Engine Marketing subjects.
View all articles by Kalena JordanA scientist at heart, Craig is probably best known as the developer of Froogle (recently re-named Google Product Search) and the founder of Google's software engineering center in New York City.
This article
is a summary of his presentation at Webstock 2008.
Google's Spelling Bee
Craig started
his presentation by talking about one of his first challenges in his job at
Google: the spelling correction tool. As the popularity of the search engine
grew, Google needed to be able to spell-correct lots of obscure words. So his
solution was to take a sampling of content from the entire web. Craig's team
came up with a algorithmic model and ran it over the web. He discovered that
there were several correct answers to the same question. For example, words like
"kofee" could mean either the searcher is seeking a cup of java or information
about Kofee/Kofi Anan.
To combat this, Craig came up with an interesting
solution: the "Did you mean?" alternative spelling option, based on predictive
examples of searcher spelling patterns. You can see this in action if you type
in "kofee anan" in Google. Above the search results is a line that reads: "Did
you mean: kofi annan" and links to the search results for this spelling
variation too.
But the research went even further. Craig's team worked
out how to take into account the context of the search query by studying the two
or three other keywords surrounding the query, for example "kofee cup" or "kofee
anan". The research used the science of bigrams and trigrams to better understand
how people search. Bigrams are groups of two written letters, two syllables, or
two words, very commonly used as the basis for simple statistical analysis of
text. So Craig and his team applied this knowledge to Google's spelling
correction system and now, Google's algorithm can determine the searcher's
intent with much more accuracy, based on the context of the search
query.
As an example of the spelling challenges that Google faces, Craig
showed the audience the huge number of ways "Britney Spears"
is misspelled on the web. He said it's encouraging to see that the most
popular spelling is also the most correct one. Scale is important!
Google Maps Lead to Apps
The Google
team wrote the code for Google Maps many years ago but the code was
actually built into your browser. When Google maps first launched, people took
the dense data-script and worked out how to reverse engineer it for their own
use. Google engineers decided to release an API key to make these mash-ups
easier after seeing so many people reverse engineer Google Maps without Google's
help. Now people can mash-up Google maps within minutes to create their own
applications.
To show how easy this is to do, Craig took the audience
through the steps to create an interactive application with Google Maps. In the
space of about two minutes, he signed up for an API key, grabbed the HTML code
and pasted it into his page. He then hacked the map to show Wellington Town Hall
(our location) and made the point how easy it is to create really useful tools
out of technology that is already available.
As an example, Craig showed
the audience Seattle Bus Monster. This site uses an API
key for Google Maps to make Seattle bus data and trackin
Craig referenced Rodney Brooks from MIT whose provocative paper
"Fast, Cheap and Out of Control" offered new logic and a completely different
view of machines. The idea is that there is no center of control among robots so
you should make lots of them; don't treat them so precious. Craig said
developers should use this logic to create lots of small apps that you can
replicate and tweak, rather than one big expensive app that can go horribly
wrong. Scale trumps smarts every time!
Experiments in Scale That Have Impacted
Google's Operations
Precision vs. Recall
Back
in the early 90's, information retrieval on the web was limited to things like
Lexus/Nexus. So at that stage, Google would take queries and apply it to the
broadest possible search. This was great recall at the cost of precision. But
Larry and Sergey wanted something better so they decided to use Boolean search.
At the time it was heresy because everything was focused on recall. But the
Google founders knew that things had to be super relevant so they developed an
algorithm - the core algorithm. It was very simple and relied on Boolean search
to determine relevancy.
Genomic
Sequencing
In the mid 90's a large project - the Human Genome
Project - was underway. The race was on to sequence the genome. Scientists
decided to feed this out to a bunch of different people. They chopped up the
genome for researchers everywhere and allowed it to replicate. The researchers
mapped each chunk with genetic markers and computed a tiling path of tiny
fragments.
Sequencing was very expensive, so the data was computed based
on a minute number of chunks - very labor intensive. The sequencing took forever
and reassembling was a long way off. But then a company came along that said
they could do it faster. Sequencing becomes cheaper by automating the job using
machines rather than individual people so this company used a clever computer
algorithm to conduct the sequencing. This reduced the cost and the researchers
were therefore able to reassemble more fragments and achieve a rough draft of
the genome in 2000. This sequencing approach was the shotgun approach, where
accuracy is lower, but the larger scale allowed the impossible to become
possible.
Web
Definitions
Google used to do a terrible job of defining
terms. Craig noticed people were searching for "definition of...", or "what is
a...." etc so he wanted the search engine to provide better results for these
searches. He found lots of web pages that contained glossaries and definitions,
so he hacked up a Perl script to get the glossary formats.
The first
recall results were only 50 percent accurate. He wanted to improve this rate, so
he did some experiments with the data. But he could never reach an accuracy
level he was happy with. It was later he realized that most of the questions
people actually needed answers to could be answered with his crappy little Perl
script. He concluded that 100 percent accuracy is not important, that scale is
much more important.
Now Google allows you to use the "definition:" query
and the question format to get definitions from around the web. Type in "what is
a blog?" and you'll get lots of results from Craig's original
script.
Protein
Sequencing
In biology, Craig says, you're constantly
producing proteins. The proteins fold up with particular sequencing. Within
computing, you can use this knowledge to do amazing things. You can conduct
computations with this type of data but it's time consuming. Somebody at
Stanford University noticed that proteins spend a lot of time moving about
before folding into an alpha helix. So it was suggested they start the
computations with lots of configurations. In this way you can parallelize the
data by scale and one will be magically close to a folded protein. So they
worked out a way to reduce the problem to a simple process based on mass scale.
This is why Google uses maximum scale to conduct algorithmic
computations.
Chess vs.
Go
You can now compute the value of any potential move in
chess. Based on that information, you can compute your projected probability of
winning the game from any move. Chess grand masters put a lot of time into this
knowledge. But the opposite is true for the game Go, because there is more
randomness to the game play.