Search Tool Data Analysis

by Tyler Hauck (thauckthauck in BIT330, Fall 2008)

Questions and queries

Web search engines

Text description of query submitted to web search engines: How fast is internet usage growing on the continent of Africa?

Queries submitted: internet usage Africa "growth rate"

Blog search engines

Text description of query submitted to blog search engines: How would a Lehman Brothers collapse affect financial markets?

Queries submitted: "Lehman Brothers" collapse "financial markets"

Data that I collected

Search engine overlap data

Web search    Live   Google   Yahoo Web
Live          60     30       25
Google               55       20
Yahoo Web                     65
All           10

Blog search   Technorati   Google Blog   Bloglines
Technorati    5            0             10
Google Blog                85            10
Bloglines                                40
All           0

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.
GY              Yahoo top 5   Yahoo top 10   Yahoo top 20
Google top 5    1             1              1
Google top 10   0             1              1
Google top 20   0             0              1
This table provides a measure of how much of Yahoo's responses are reproduced by Google.
YG             Google top 5   Google top 10   Google top 20
Yahoo top 5    1              1               1
Yahoo top 10   0              0               0
Yahoo top 20   1              1               2
This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
BG                 Google top 5   Google top 10   Google top 20
Bloglines top 5    0              1               2
Bloglines top 10   0              0               0
Bloglines top 20   0              0               0
This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
GB             Bloglines top 5   Bloglines top 10   Bloglines top 20
GBlog top 5    0                 0                  0
GBlog top 10   1                 1                  1
GBlog top 20   1                 1                  1


Web search

Precision and Overlap   Live     Google   Yahoo Web   Overlap L/G   Overlap L/Y   Overlap G/Y   Overlap All
Mean                    42.778   54.444   51.667      18.333        20            20.556        10
Median                  42.5     57.5     52.5        20            20            20            10
Standard Deviation      22.766   20.065   22.426      9.5486        11.376        7.8382        7.4755

This table summarizes the data collected by a class of 15 students. It shows the mean, median, and standard deviation of each search engine's precision, as well as how much the engines' results overlap. Both the mean and the median are included because any extremely high or low values in the collected data would pull the mean toward them and make it a far less meaningful measure of the overall tendency of the data. Here the means and medians are relatively similar, suggesting that the data are not skewed by a few extreme values. The mean precision of the three search engines fell within about 12 percentage points of one another. Google yielded the most relevant results, followed by Yahoo, and Live yielded the fewest. As far as overlap is concerned, Live/Yahoo, Live/Google, and Google/Yahoo all had a median overlap of 20%. Only 10% of relevant results were returned by all three search engines. That means that if our search had been limited to only one search tool, we would have missed a large percentage of these relevant results.
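The reasoning about mean versus median can be illustrated with a small sketch. The scores below are hypothetical, not the class data, and the use of the sample standard deviation is an assumption, since the page does not say which form the class used.

```python
from statistics import mean, median, stdev

# Hypothetical precision scores (percent relevant) for one engine; not class data.
scores = [40, 45, 50, 50, 55, 60]
# The same scores plus one student who had unusually good results.
with_outlier = scores + [100]

def summarize(values):
    """Return (mean, median, sample standard deviation) of the values."""
    return mean(values), median(values), stdev(values)

for label, values in (("no outlier", scores), ("with outlier", with_outlier)):
    m, med, sd = summarize(values)
    print(f"{label}: mean={m:.1f} median={med:.1f} stdev={sd:.1f}")
```

In this sketch the single high score raises the mean and the standard deviation while the median stays where it was, which is why a gap between mean and median is a useful warning sign.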

Ranking Overlap Google/Yahoo   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean                           1.05882   1.35294   1.64706   1.29412   2         2.64706   1.64706   2.47059   3.70588
Median                         1         1         2         1         2         3         1         3         4
Std Dev                        1.19742   1.32009   1.41161   1.21268   1.32288   1.72993   1.22174   1.54587   2.11438

Ranking Overlap Yahoo/Google   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean                           1.05882   1.17647   1.64706   1.47059   1.94118   2.47059   1.88235   2.64706   3.76471
Median                         1         1         1         1         2         3         2         3         4
Std Dev                        1.19742   1.28624   1.36662   1.23073   1.39062   1.58578   1.26897   1.72993   2.07754

The second and third tables show a more in-depth analysis of the overlap between Google and Yahoo. We compared results based on where they fell in the rankings, looking at the top 5, top 10, and top 20 groups. The overlap for the top five was very small, and it increased only a bit as we added more results to the comparison pool. The first of these two tables shows where Google results appeared in the Yahoo results, starting with the first five Google results compared to the first five Yahoo results and working up to all 20 Google results compared to all 20 Yahoo results.
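A ranking overlap count of this kind could be computed as in the sketch below. It assumes that o(i,j) means the number of results in the top i of the first engine that also appear in the top j of the second engine; the page does not define the notation, so that reading, the function name ranking_overlap, and the result lists are all hypothetical.

```python
def ranking_overlap(a, b, i, j):
    """Count results in the top i of list a that also appear in the top j of list b."""
    return len(set(a[:i]) & set(b[:j]))

# Hypothetical ranked result lists (URLs shortened to single letters); not class data.
google = list("ABCDEFGHIJKLMNOPQRST")
yahoo = list("CXAYZBWVUDabcdefghij")

# One row per depth in the second list, one column per depth in the first.
for j in (5, 10, 20):
    row = [ranking_overlap(google, yahoo, i, j) for i in (5, 10, 20)]
    print(f"Yahoo top {j}:", row)
```

Because each count only grows as either depth increases, a value such as o(20,20) can never be smaller than o(5,5) for the same pair of lists.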

Blog search

Precision and Overlap   Technorati   Google Blog   Bloglines   Overlap T/G   Overlap T/B   Overlap G/B   Overlap All
Mean                    33.056       52.5          44.444      3.6111        9.1667        6.9444        1.3889
Median                  30           42.5          47.5        0             7.5           5             0
Standard Deviation      21.153       22.179        14.337      7.0305        7.7174        6.4486        3.3456

This first blog table is similar to the first table for web searches. It compares the mean, median, and standard deviation of the class data on the precision of each individual blog search engine, as well as the overlap of results across the three search tools. This time the mean and median are quite different for Google, suggesting that the mean Google relevance sits at 52.5% because one or a few members of the class had great success with that blog search tool. If we look at the medians, Bloglines is the top performer, and it has a much lower standard deviation than Google, suggesting that the class's Bloglines results were much more consistent and closer to the mean. The mean overlaps are 9.2% for Technorati/Bloglines and 6.9% for Google/Bloglines, while Technorati/Google overlap by only 3.6% and only 1.4% of results were returned by all three.
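The comparison of means and medians can be checked directly from the summary values in the blog precision table. The sketch below copies those values; the helper name skew_gap is hypothetical and is only one rough way to flag a mean that may have been pulled up by a few high scores.

```python
# Class summary values copied from the blog precision table above.
summary = {
    "Technorati": {"mean": 33.056, "median": 30.0, "stdev": 21.153},
    "Google Blog": {"mean": 52.5, "median": 42.5, "stdev": 22.179},
    "Bloglines": {"mean": 44.444, "median": 47.5, "stdev": 14.337},
}

def skew_gap(stats):
    """Mean minus median; a large positive gap hints at a few high scores."""
    return stats["mean"] - stats["median"]

for engine, stats in summary.items():
    print(f"{engine}: mean - median = {skew_gap(stats):+.1f}, stdev = {stats['stdev']:.1f}")
```

Google Blog shows by far the largest gap between mean and median, while Bloglines has the smallest standard deviation, which matches the reading given in the paragraph above.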

Ranking Overlap Google/Bloglines   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean                               0.29412   0.35294   0.47059   0.41176   0.47059   0.82353   0.70588   0.76471   1.05882
Median                             0         0         0         0         0         0         0         0         1
Std Dev                            0.46967   0.60634   0.62426   0.61835   0.71743   1.0146    0.91956   1.09141   1.19742

Ranking Overlap Bloglines/Google   o(5,5)    o(10,5)   o(20,5)   o(5,10)   o(10,10)  o(20,10)  o(5,20)   o(10,20)  o(20,20)
Mean                               0.29412   0.35294   0.58824   0.41176   0.52941   0.82353   0.52941   0.88235   1.11765
Median                             0         0         0         0         0         1         0         1         1
Std Dev                            0.46967   0.60634   0.87026   0.61835   0.71743   1.07444   0.62426   0.99262   1.16632

These two tables are also similar to the Google/Yahoo in-depth overlap comparison we looked at in the web search section. From this data, it is clear that the overlap between these two search engines is very small. Comparing the top five results of each engine, o(5,5), the mean overlap is only 0.29 results and the median is 0.


Web search

Looking at the class results, we can see that the web search engine with the highest percentage of relevant results is Google. However, looking further at the data, something else becomes apparent: how low the percentage of relevant results that appeared on all three of the search engines is. Even more important is how low the overlap was between the top five results. Many search engine users are only going to look at the top results returned by a search engine. This is important to consider when evaluating the effectiveness of a web search tool, because it means evaluating the tool as people actually use it. We should definitely consider looking at how many of the top five results of each individual search engine were considered relevant.

Based on my experience, I would still probably recommend that a user start their search with Google, even though it did not retrieve the highest percentage of relevant websites when I used it. My personal opinion is that the relevant sites it did return were more relevant than the relevant sites returned by the other two search engines. As far as class data goes, Google did indeed retrieve the most relevant sites on average, so to me it is the clear winner. I also far preferred its design and simplicity over the other two search tools. Something I cannot be sure about, however, is whether or not I am just a creature of habit. Admittedly, I rarely venture onto Live or Yahoo when I am doing a web search for myself.

When the task at hand is retrieving as much relevant information as possible, the data on hand suggests that running the search on all three tools is important. Both the median and the mean overlap among all three are only 10%. If you were to limit your search to only one of the tools, you would be missing out on a lot of relevant data.
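How much a single tool misses can be sketched with sets. The URLs and counts below are hypothetical placeholders, not the class data, and coverage is a name chosen here for illustration.

```python
# Hypothetical sets of relevant URLs returned by each engine; not class data.
relevant = {
    "Live": {"u1", "u2", "u3", "u4"},
    "Google": {"u3", "u4", "u5", "u6", "u7"},
    "Yahoo": {"u1", "u4", "u8", "u9"},
}

# Every distinct relevant result found by at least one engine.
all_relevant = set().union(*relevant.values())
# Relevant results found by every engine.
in_all_three = set.intersection(*relevant.values())

def coverage(engine):
    """Fraction of all distinct relevant results that one engine found."""
    return len(relevant[engine]) / len(all_relevant)

for engine in relevant:
    print(f"{engine}: {coverage(engine):.0%} of {len(all_relevant)} relevant results")
print("Found by all three:", sorted(in_all_three))
```

With a small three-way overlap like this, even the best single engine in the sketch finds only a little over half of the distinct relevant results.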

I expected the average relevancy on all of the search tools to be a lot higher than ~50%. This low return prompted me to think about the average web search skill level of the members of our class at the time of the experiment, but then I realized that our skills were probably pretty representative of the average search tool user.

I formed my search around the question "How fast is internet usage growing on the continent of Africa?" Some of the results I received when I entered my search query prompted a few new questions. First of all, some of the pages returned had a lot of information about internet usage in specific African countries. Searches based on a question like "How fast is internet usage growing in Kenya?" would return information that is relevant to the original question, and would also allow me to go a bit more in depth with the topic. Also, reconsidering my original question, I have reached the conclusion that it was probably not the best question to ask, knowing what I know about Africa. North Africa is so fundamentally different from the rest of the continent that it is probably hard to come up with information about the growth of the internet on the continent as a whole. I believe that getting a lot more specific with my search would have yielded results that were a lot more relevant to the topic I was trying to understand.

Blog search

I should probably start off my analysis of the blog search tools by admitting that I had never personally used one before this experiment so this was definitely a new experience.

The nature of blog searching becomes very transparent just by looking at the data collected by the class. All of the blog search tools had at least some luck returning relevant results, but there was very little overlap between them. Blogs are constantly being written, edited, and deleted, so searching for them is inherently more difficult. I am not even sure that I would go as far as to say that one of these search tools actually performed better than the others. We need look no further than the data: the class means and medians were so different, and the standard deviations so high, that we clearly had a wide range of successes and failures with each of the tools.

I found the blog search to be much more difficult than the web search. I had to revise my query many times before it yielded enough results to give me useful data. Even then, I was very unsuccessful with Technorati and very successful with Google Blogs with regard to the relevant results that were returned to me. I am hesitant to say that this suggests anything about the quality of either tool; it is more likely a reflection of my inexperience with these search tools.

If I were going to make a recommendation to anyone about using blog search tools, it would be to try all of them, because they yield such different results that you will almost definitely find something you like on each one of the sites. Again, my personal preference is Google Blogs based on familiarity, but I do not necessarily believe it is the "best" tool.

As far as queries go, I found that being too specific often returned nothing really significant to me. I do, however, believe that my search query should have been something else. I used "Lehman Brothers" collapse "financial markets" in search of the answer to the question, "How would a Lehman Brothers collapse affect financial markets?" What I didn't consider was that my search string was also valid for a question like "How will the performance of the financial markets affect Lehman Brothers?" A very slight difference in wording makes an entirely new question, and one that people could also have been blogging about recently. I think if I were to rerun this search, I would probably use an entirely different phrase than "financial markets" in my search query. What also makes this a difficult search is the fact that this is such a timely topic that the results for it are changing almost by the second. Something important to take into account when doing a blog search is how current the topic you are searching for is at the moment you are searching for it.

Consequently, my recommendations to a blog search tool user are to explore all of the different blog search options, try again and again with different versions of your query, and keep in mind the timeliness of the subject you are researching. Sometimes I would hit back and refresh and the results I had were completely different. For this assignment that made things very difficult for me, because I would have to start all over again to make sure I had consistent data. For someone just doing a regular search who does not necessarily have to keep track of the results, however, this is actually a pretty cool feature, because it shows how the information on blogs is constantly changing.

If I were to investigate this topic further, I would change my search question to something that reflects the current state of Lehman Brothers in the economy. After my search, Lehman filed for bankruptcy, but part of it now lives on within Barclays Capital. I would be more interested now in looking for how Barclays Capital is going to fit into the new world economy. This is a good question for a blog search because it is a hot topic that is definitely open for discussion.
