Search Engine News

The Search Engine Spam Police
by Shari Thurow, Guest Writer

The major search engines and web directories consider spammers to be those who take extreme measures to get web pages ranked well. What types of pages are considered spam?

In a Search Engine Strategies session entitled "The Search Engine Spam Police," representatives from search engines Inktomi, Google, FAST Search, and web directories LookSmart and the Open Directory Project explored the issue of spamming and presented the audience with some general guidelines to follow.

In yesterday's issue of SearchDay we covered the advice and tips offered by the human-compiled web directories. Today we'll focus on the policies of the crawler-built search engines.

Bob Keating, Editor-in-Chief of the Open Directory Project (ODP), defined spam as the aggressive and continuous submission of identical sites to the same category or to multiple inappropriate categories, as well as the submission of sites that violate ODP's policies for inclusion.

Types of sites that ODP considers spam are:

(1) Affiliate sites with the same or similar content but different site designs.

(2) Mirror sites. Submitting mirror URLs to different categories is also considered spam. Multi-lingual sites are acceptable as long as the URL resolves to the appropriate language.

(3) Sites that use redirects or any type of bait-and-switch practice. Using frames to hide a real URL, commonly referred to as "poor man's cloaking," is also considered spam.

(4) Sites whose sole purpose is to drive traffic to affiliate links or sites that contain these types of links.

If an editor or a submitter is caught spamming, the editor is immediately removed from ODP without notice, and future submissions are either deleted or blocked. If the spam is particularly relentless, ODP might remove "listable" listings as well. If you suspect that an editor or submitter is spamming, report the spam abuse to staff@dmoz.org.


Tim Mayer, former Director of Web Search Product Management at Inktomi, stated that "Inktomi considers spam to be pages created deliberately to trick the search engine into offering inappropriate, redundant, or poor-quality search results." Spam is more about how and to what extent a technique is used, Mayer explained, than about whether a technique is used at all.

Some of the common practices that Inktomi considers spam are:

(1) Web pages that are built primarily for the search engines and not your target audience, especially machine-generated pages.

(2) Pages that contain hidden text and hidden links.

(3) "Great quantity and little value" pages.

(4) Link farming and link spamming, particularly free-for-all (FFA) links.

(5) Cloaking, a practice in which the search engine and the end user do not view the same page.

(6) Sites with numerous, unnecessary host names (e.g. poker.abc.com, blackjack.abc.com, etc.).

(7) Excessively cross-linking sites to artificially inflate a site's apparent popularity.

(8) Affiliate spam.

If a webmaster is caught spamming, Inktomi will either demote the offending web page/site from its index or completely ban it. 

Jen McGrath, Software Engineer at Google, advised webmasters to create sites with appropriate, relevant content and a straightforward design.  In other words, make a useful site that clearly benefits your end users.

McGrath also advised webmasters to submit their sites to web directories and to let other sites link to them. Your site does benefit from the sites that link to it. However, your site can be penalized for the sites that you link to. Spam penalties include demotion and removal from Google's index.

Some items that Google considers spam are:

(1) Cloaking.

(2) Automated queries to Google to check positioning. The goal of such queries is primarily to tweak a site for positioning purposes, not to create content that benefits end users.

(3) Hidden text or hidden links.

(4) Stuffing pages with irrelevant keywords.

(5) Doorway pages, domains, and subdomains with the same or similar content.

(6) "Sneaky" redirects.

Rolf Michelsen, Software Engineering Manager at FAST Search, defined spam as using techniques to artificially influence a search engine's precision or relevancy. Just as Mayer stated earlier, spam is based on effect rather than technique.

Michelsen presented the following guidelines: 

Do:

(1) Focus on content.
(2) Create a site that is easy to use in simple browsers.
(3) Link to other relevant sites.
(4) Submit the URL of your main site.

Don't:

(1) Cloak.
(2) Stuff irrelevant keywords into web pages using invisible text.
(3) Submit all URLs, every day, using the free submit.
(4) Participate in link farming or FFA links.
(5) Resort to "snake oil" search engine marketers.  In other words, don't fight spam with spam.

 

How Search Engines Look at Links
by Craig Fifield, Guest Writer

Link analysis is one of the most important techniques search engines use to determine relevance, and understanding how it works is crucial for successful search engine optimization.  Representatives from Google and Teoma explain how it's done.

If you have spent any time over the past few years studying search engine marketing, you are probably familiar with the linking craze going on in the industry. Everyone from experts to those new to the field tosses about terms like "link popularity" and "page rank," and it seems that all related discussion forums and web sites have entire sections devoted to linking. As the foundation of the web, links have always been important, but links themselves haven't changed much since the day they were created. So why all the renewed interest?

The reason is that the major search engines are utilizing links more and more to improve the relevance of their search results. However, the world of links and their use by search engines can get confusing quickly. To help sort through the more important elements of linking, the session "Looking at Links" was held at the Search Engine Strategies conference in San Jose, California. The search engines that utilize links the most, Google and Teoma, both sent representatives to explain why links are important to their engines and how best to utilize them on a web site.

Daniel Dulitz, Director of Technology for Google, started things off by stating one of the more important points of the session -- as search engine indexes grow larger it becomes almost impossible to determine a web page's relevancy based solely upon on-the-page factors (page text, metas, titles, etc.). It's this fact, combined with the reality that most on-the-page factors can't be trusted due to abuse, that prompted Google to begin looking at the link structure of the web to help determine a page's relevance to a query.

According to Dulitz, when determining the relevance of a web page to a search, Google uses its PageRank system to attempt to "model the behavior of web surfers" by analyzing the manner in which pages are linked to one another. He explained that Google views the interlinking of web pages as a way of "leveraging the democratic structure of the web," with links equating to votes.

Google essentially treats each link from one site to another as a vote for the site receiving the link (link popularity), but not all votes are created equal. Dulitz used a simple diagram to show that each page of a site has only one vote to give, so the more links to different sites a page contains, the smaller the share of that vote each link receives. He also stated that links from higher quality sites carry more weight than those from lesser quality sites (e.g. sites that use hidden links, take part in link farms, or have no incoming links). In addition, Google not only analyzes who is linking to whom, but also analyzes the text in and around the links to help determine the relevance of the pages receiving the links.
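
The vote-splitting idea Dulitz described can be illustrated with a short sketch. This is not Google's implementation. It follows the general form of the publicly described PageRank calculation, and the damping value of 0.85, the function name rank_pages, and the four-page example graph are all assumptions made for illustration. It also ignores the link text analysis mentioned in the session.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Estimate a vote-based score for every page.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions, not values
    given in the session.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # One vote per page, split evenly across its links, and
                # weighted by the linking page's own current score.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" splits its vote three ways,
# while "a" and "b" each give their whole vote to "c".
example_links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a", "b", "c"],
}
ranks = rank_pages(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "c" ends up with the highest score because it receives two undivided votes, and page "a" outranks page "b" because its one undivided vote comes from the highly ranked "c".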

Paul Gardi, Vice President of Search for Ask Jeeves/Teoma, began with comments similar to those made by Dulitz. Gardi stated that "due to statistical convergence" and the ease with which they can be abused, neither page text analysis nor standard link popularity can be relied upon when determining the relevancy of a web page. Specifically, he mentioned that standard link popularity is ineffective because it does not help determine the subject or the context of the site, and larger, more popular sites tend to overwhelm smaller sites that may actually be more relevant to a search.

To combat these issues, Teoma views the web as a global entity that contains many subject-based web site communities. It studies these subject communities and the manner in which they are interlinked within themselves and with each other to determine not only their link popularity, but also the subject and context of the sites involved. According to Gardi, Teoma is able to do this by using its unique method of ranking sites. He explained that rather than relying on general link popularity to determine results, the engine attempts to employ a "subject specific popularity" to locate the most popular sites within a specific subject community. This is done by first analyzing the web as a whole to identify subject communities.

Teoma then employs link popularity within those communities to determine which sites are the "authorities" on the subject of the query, and those are the sites returned as the results of a search. In addition, Gardi mentioned that by analyzing the links of the authority sites, the technology is also able to locate high-quality resource pages (links pages) that are related to the original query. Each of these components is then made available on the search results page as follows: "Results" are the authorities, "Refine" is the related subject communities, and "Resources" are the related links pages.
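
The difference between general link popularity and the "subject specific popularity" Gardi described can be shown with a small sketch. It is not Teoma's algorithm. It assumes the subject communities have already been identified and are supplied as a simple page-to-subject mapping, which is the hard part that the sketch does not attempt. The function names and the sample sites are hypothetical.

```python
def general_popularity(links):
    """Count incoming links from every page, regardless of subject."""
    counts = {}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    # Highest count first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def subject_popularity(links, subject_of, subject):
    """Count only links that start and end inside one subject community.

    subject_of is an assumed, pre-computed mapping of page -> subject.
    """
    members = {page for page, s in subject_of.items() if s == subject}
    counts = {page: 0 for page in members}
    for source, targets in links.items():
        if source not in members:
            continue  # votes from outside the community are ignored
        for target in set(targets):
            if target in members and target != source:
                counts[target] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph: a large portal and a small fishing community.
links = {
    "news1": ["portal"],
    "news2": ["portal"],
    "news3": ["portal"],
    "fishblog": ["flyshop", "portal"],
    "fishforum": ["flyshop", "fishblog"],
    "flyshop": ["fishforum"],
}
subject_of = {
    "news1": "news", "news2": "news", "news3": "news", "portal": "news",
    "fishblog": "fishing", "fishforum": "fishing", "flyshop": "fishing",
}

overall = general_popularity(links)
fishing = subject_popularity(links, subject_of, "fishing")
print("general:", overall)
print("fishing:", fishing)
```

With these sample links, the portal leads the general count, while the fly shop leads within the fishing community, which mirrors Gardi's point that large popular sites can overwhelm smaller sites that are more relevant to a query.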

Overall, the session was well received and very informative, especially for those new to the subject. Considering that most major search engines now utilize some method of link analysis, anyone who has a vested interest in being properly indexed by the search engines should consider attending in the future.
