If you are not registered or logged in, you may still use these forums but with limited features. Show recent topics
  [Search] Search   [Hottest Topics] Hottest Topics   [Members]  Member Listing   [FAQ]  FAQ 
[Register] Register / 
[Login] Login 
Robots.txt  XML
Forum Index » General Discussion
Author Message
JTD
Graduate

Joined: 08/05/2004 21:52:50
Messages: 529
Location: Arkansas
Offline

Here is a nice robots.txt file that will allow the good bots to enter and keep out the bad bots.

#
# robots.txt generated by www.1-hit.com's robot generator
# Please, we do NOT allow nonauthorized robots any longer.
#
User-agent: *
Disallow: /cgi-bin/

User-agent: googlebot

User-agent: Scooter

User-agent: Openbot

User-agent: fast

User-agent: ZyBorg

User-agent: Slurp

User-agent: Googlebot-Image

User-agent: msnbot


User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: CopyRightCheck
Disallow: /

User-agent: Crescent
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: ProWebWalker
Disallow: /

User-agent: CheeseBot
Disallow: /

User-agent: LNSpiderguy
Disallow: /

User-agent: Black Hole
Disallow: /

User-agent: Titan
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: Wget
Disallow: /

User-agent: mozilla/4
Disallow: /

User-agent: mozilla/5
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Disallow: /

User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98 )
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: MIIxpc
Disallow: /

User-agent: Telesoft
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: WebZip
Disallow: /

User-agent: moget/2.1
Disallow: /

User-agent: WebZip/4.0
Disallow: /

User-agent: WebSauger
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: Mister PiX
Disallow: /

User-agent: WebAuto
Disallow: /

User-agent: TheNomad
Disallow: /

User-agent: WWW-Collector-E
Disallow: /

User-agent: RMA
Disallow: /

User-agent: libWeb/clsHTTP
Disallow: /

User-agent: asterias
Disallow: /

User-agent: httplib
Disallow: /

User-agent: turingos
Disallow: /

User-agent: spanner
Disallow: /

User-agent: InfoNaviRobot
Disallow: /

User-agent: Harvest/1.5
Disallow: /

User-agent: Bullseye/1.0
Disallow: /

User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
Disallow: /

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow: /

User-agent: CherryPickerSE/1.0
Disallow: /

User-agent: CherryPickerElite/1.0
Disallow: /

User-agent: WebBandit/3.50
Disallow: /

User-agent: NICErsPRO
Disallow: /

User-agent: Microsoft URL Control - 5.01.4511
Disallow: /

User-agent: DittoSpyder
Disallow: /

User-agent: Foobot
Disallow: /

User-agent: WebmasterWorldForumBot
Disallow: /

User-agent: SpankBot
Disallow: /

User-agent: BotALot
Disallow: /

User-agent: lwp-trivial/1.34
Disallow: /

User-agent: lwp-trivial
Disallow: /

User-agent: Wget/1.6
Disallow: /

User-agent: BunnySlippers
Disallow: /

User-agent: Microsoft URL Control - 6.00.8169
Disallow: /

User-agent: URLy Warning
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: LinkWalker
Disallow: /

User-agent: cosmos
Disallow: /

User-agent: moget
Disallow: /

User-agent: hloader
Disallow: /

User-agent: humanlinks
Disallow: /

User-agent: LinkextractorPro
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Mata Hari
Disallow: /

User-agent: LexiBot
Disallow: /

User-agent: Web Image Collector
Disallow: /

User-agent: The Intraformant
Disallow: /

User-agent: True_Robot/1.0
Disallow: /

User-agent: True_Robot
Disallow: /

User-agent: BlowFish/1.0
Disallow: /

User-agent: JennyBot
Disallow: /

User-agent: MIIxpc/4.2
Disallow: /

User-agent: BuiltBotTough
Disallow: /

User-agent: ProPowerBot/2.14
Disallow: /

User-agent: BackDoorBot/1.0
Disallow: /

User-agent: toCrawl/UrlDispatcher
Disallow: /

User-agent: WebEnhancer
Disallow: /

User-agent: TightTwatBot
Disallow: /

User-agent: suzuran
Disallow: /

User-agent: VCI WebViewer VCI WebViewer Win32
Disallow: /

User-agent: VCI
Disallow: /

User-agent: Szukacz/1.4
Disallow: /

User-agent: QueryN Metasearch
Disallow: /

User-agent: Openfind data gathere
Disallow: /

User-agent: Openfind
Disallow: /

User-agent: Xenu's Link Sleuth 1.1c
Disallow: /

User-agent: Xenu's
Disallow: /

User-agent: Zeus
Disallow: /

User-agent: RepoMonkey Bait & Tackle/v1.01
Disallow: /

User-agent: RepoMonkey
Disallow: /

User-agent: Zeus 32297 Webster Pro V2.9 Win32
Disallow: /

User-agent: Webster Pro
Disallow: /

User-agent: EroCrawler
Disallow: /

User-agent: LinkScan/8.1a Unix
Disallow: /

User-agent: Keyword Density/0.9
Disallow: /

User-agent: Kenjin Spider
Disallow: /

User-agent: Cegbfeieh
Disallow: /


LINK-> Use Lazarus Guestbook
[WWW] [Yahoo!] aim icon [MSN]
Carbonize
Master
[Avatar]

Joined: 12/06/2003 19:26:08
Messages: 4292
Location: Bristol, UK
Offline

This serves little purpose. Email harvesters just ignore robot.txt files. I would also like to point out that Xenu is actually a rather nice program for checking that all the links on your site are valid.

Carbonize
I am not the maker of the Advanced Guestbook

get Lazarus
[Email] [WWW] [Yahoo!] aim icon [MSN] [ICQ]
amber222
Graduate

Joined: 07/05/2004 21:13:07
Messages: 586
Offline

Blocking the bad bots from .htaccess definitely works. Problem is there are so many, and not everyone has .htaccess.

Block bad bots in .htaccess like this:
  • SetEnvIfNoCase User-Agent "^NaverBot" bad_bot
    SetEnvIfNoCase User-Agent "^TurnitinBot" bad_bot
    SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
    SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
    SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
    SetEnvIfNoCase User-Agent "^webcollage" bad_bot
    SetEnvIfNoCase User-Agent "^Port_Huron_Labs" bad_bot

    <Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    </Limit>

  • Add new bots to the top. You don't have to use the bot's full name - first 3 letters will work. Get bot names from your server stats.

    Everybody should block that Naverbot!

    Another alternative is the robot trap. Bots that ignore the robots.txt file are caught in an endless loop.

    Info on building a robot trap:
  • How to keep bad robots, spiders and web crawlers away: http://www.fleiner.com/bots/#trap

    A Rogue Robot Trap: http://www.braemoor.co.uk/software/robottrap.shtml

    How to build a Bot Trap and keep bad bots away from a web site: http://www.kloth.net/internet/bottrap.php

  • This site has a good forum about robots and they share info on the function of various bots.
    Here's their PHP version of the robot trap:
  • http://www.webmasterworld.com/forum88/3104.htm
  • JTD
    Graduate

    Joined: 08/05/2004 21:52:50
    Messages: 529
    Location: Arkansas
    Offline

    Well I also use .htaccess along with the robots.txt file. Double protection I guess or maybe paranoid.

    LINK-> Use Lazarus Guestbook
    [WWW] [Yahoo!] aim icon [MSN]
     
    Forum Index » General Discussion
    Go to:   
    Based on the open source JForum