Fair and Balanced

Thursday, August 19, 2004

[From topix.net: Headline reads: Prostitution Ring. Article begins: Employment Opportunity.]

I really like reading services like Google News and Topix, which generate pages by pulling news from thousands of news sites. The variety of sources is nice, and the format makes it easy to scan the headlines. Topix even offers RSS feeds of their pages.

But the real reason I like these sites is probably that the programs behind them seem to share my sense of humor. What can you say about this headline and excerpt? Hey, the software was just doing what it was told: pull the headline, and use the first few words on the page as an excerpt. Completely innocent, and highly amusing.
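I have no idea what Topix actually runs, but a rough Python sketch of that "headline plus first few words" rule shows how easily it goes wrong. Everything here is made up for illustration: the names PageText and naive_summary, the assumption that the headline lives in an h1, and the sample page itself.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects the <h1> text as the headline and every other word as body text."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headline = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headline.extend(data.split())
        else:
            # No notion of ads or navigation: every word on the page counts.
            self.words.extend(data.split())


def naive_summary(markup, excerpt_words=5):
    """Return (headline, excerpt), where the excerpt is the first words on the page."""
    parser = PageText()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.headline), " ".join(parser.words[:excerpt_words])


# Hypothetical page: the ad happens to come before the story in the markup.
page = (
    '<div class="ad">Employment Opportunity</div>'
    "<h1>Prostitution Ring</h1>"
    "<p>Made-up story text goes here.</p>"
)
print(naive_summary(page, excerpt_words=2))
```

The code never asks whether the first words belong to the story. If an ad sits at the top of the markup, the ad becomes the excerpt.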

This does, however, highlight a real problem for web designers. With all the search engine bots running around on the web, pulling data and presenting it however they see fit, how can you make sure your pages are presented in the proper context?

Brad Choate restricts what Google sees on each page using PHP. This keeps the search engine from indexing content on each page that doesn’t really apply to what the page is about: navigation, ads, etc. It’s a good idea but, much like the Force, it would be easy to misuse. Scammers could show one page to Google and a completely different page to browsers in order to get higher page rank and hijack visitors.
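Brad's setup is PHP and I haven't reproduced it here. As a rough sketch of the general idea only, this Python version leaves out the non-content parts of a page when the request's User-Agent header mentions Googlebot. The function name render_page, the (kind, html) section layout, and the sample user-agent strings are all my own assumptions.

```python
def render_page(sections, user_agent):
    """Build a page from (kind, html) pairs, where kind is 'content', 'nav' or 'ad'.

    Crawlers get only the 'content' sections; everyone else gets the whole page.
    """
    # Assumption: a plain substring check on the User-Agent header is good enough.
    is_crawler = "googlebot" in user_agent.lower()
    return "\n".join(
        html for kind, html in sections if kind == "content" or not is_crawler
    )


# Hypothetical page made of three sections.
sections = [
    ("nav", "<ul><li>Home</li><li>Archives</li></ul>"),
    ("content", "<p>The actual post.</p>"),
    ("ad", "<div>Buy something.</div>"),
]

crawler_view = render_page(sections, "Mozilla/5.0 (compatible; Googlebot/2.1)")
browser_view = render_page(sections, "Mozilla/5.0 (Windows; U; rv:1.7) Gecko")
```

The same switch that hides a navigation bar could just as easily swap in an entirely different page, which is exactly the misuse problem.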

As XML and XHTML become more prevalent, I would like to see an XHTML module for search indexing. This would allow you to put a standard set of clues in your markup to let search bots know what is important. Something as simple as a search-index attribute, allowed on every tag, would let you pick and choose what gets included in search engine results. You would still have to worry about misuse, but I’m sure Google would adapt, and other search engines would soon follow. News sites could mark all their advertisements as search-index="no", and not have to worry about Topix making them look silly.
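To be clear, no such attribute exists; search-index is just my wish. But here is a minimal Python sketch of how a bot might honor it, assuming the rule is "skip any element marked search-index="no" and everything nested inside it". The names IndexableText and indexable_text and the sample page are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class IndexableText(HTMLParser):
    """Collects page text, skipping anything inside search-index="no"."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested inside an excluded element
        elif dict(attrs).get("search-index") == "no":
            self.skip_depth = 1  # entering an excluded element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def indexable_text(markup):
    """Return the text a well-behaved bot would index."""
    parser = IndexableText()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical news page with its ad marked as off limits.
page = (
    '<div search-index="no"><p>Employment <b>Opportunity</b></p></div>'
    "<h1>Prostitution Ring</h1>"
    "<p>Made-up story text.</p>"
)
print(indexable_text(page))
```

This assumes well-formed markup, which XHTML is supposed to guarantee. With sloppy HTML and unclosed tags, the depth counter would drift.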

But then, what fun would that be?