WWW Utilities

Documentation

The WWW Consortium is the authoritative source on the latest standards for markup (HTML/XHTML), style sheets (CSS/XSL), scalable vector graphics (SVG), synchronized multimedia (SMIL), protocols (HTTP), and others. Stefan Münz' Selfhtml is simply the best documentation of HTML and JavaScript I know of (in German).

WWW Server

Apache is all you need :-).

SGML, HTML, and XML Tools

PSGML
is an Emacs mode for editing SGML and XML documents. Given a DTD, it supports indentation, syntax highlighting, and insertion of only those elements allowed in the current context. Although it is not a validating parser, it helps you write syntactically correct documents.
XAE
is an XML Authoring Environment for Emacs.
W3C Validator
indexes
Free XML software

Webserver Log Analyzers

Analog
is written in C and said to be fast. It is very popular, versatile, and configurable, provides output in several languages, and can sort web pages hierarchically.
webalizer
This is the most "modern" of the freely available log analyzers (as of July 1999). It offers nice graphical displays, is configurable, and can also report referrers and user agents if the log format is combined/extended.
http-analyze
webalizer and http-analyze are very similar. (webalizer is patterned after http-analyze.) Version 2.01 is free for non-commercial use, but there is no source.
MKStats
commercial
Wusage ($25-2000)
another popular log analyzer gone commercial
wwwstat
one of the first open source log analyzers; the last version is from November 1996
checklog
is a small Perl script that produces ASCII output. It estimates the number of visitors (and how many pages a single visitor accesses) on the assumption that accesses from the same site within a short time interval come from the same user. This assumption may be wrong for servers with high traffic.
FTPWebLog
is part of the WWW Utilities by Benjamin Franz.
BotWatch
lets you watch which robots wander your site.
Indexes
serverwatch.internet.com
log analyzers @ hypernews
log analyzers @ Yahoo
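checklog's visitor-counting heuristic described above (accesses from the same host within a short time window count as one visit) can be sketched in a few lines of Python. This is an illustrative sketch, not checklog itself; the 30-minute gap and the sample hostnames are assumptions.

```python
import re
from datetime import datetime, timedelta

# Common Log Format, e.g.:
# host - - [10/Mar/2001:13:55:36 +0100] "GET /index.html HTTP/1.0" 200 2326
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def parse_line(line):
    """Extract (host, timestamp) from a Common Log Format line."""
    host, stamp = CLF.match(line).groups()
    # Ignore the timezone offset for simplicity.
    return host, datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")

def count_visits(lines, gap=timedelta(minutes=30)):
    """Count visits: accesses from the same host less than `gap` apart
    are attributed to the same visitor (checklog's assumption)."""
    last_seen = {}
    visits = 0
    for line in lines:
        host, when = parse_line(line)
        if host not in last_seen or when - last_seen[host] > gap:
            visits += 1
        last_seen[host] = when
    return visits
```

As noted above, the assumption breaks down behind proxies and on high-traffic servers, where one host may hide many users.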

Indexing and Searching

Searching large document collections by keyword is made practical by keyword indexes, i.e. databases recording which words appear in which documents. Index and search packages can be judged by the following characteristics:

Type of access in the process of gathering
Some solutions can access the documents only through the local file system, which severely limits what can be indexed. Others can index only what is available through public HTTP servers. The most flexible spiders/gatherers fetch through every major protocol (http/ftp/gopher/...), can access password-protected sites, and, for performance reasons, can bypass the server for files in the local file system (like ht://dig).
Supported file formats
Some solutions are geared towards HTML and can index essentially only HTML and very few other formats (like ht://dig). More flexible gatherers let you plug in external converters that decompress and extract plain text from any desired file format on the fly (like Harvest-NG).
How much information is extracted?
Simple-minded solutions just do a full-text extraction, which may be infeasible for huge document collections. More advanced solutions, though usually limited to HTML, understand HTML meta tags, titles, subtitles, links to a document, the first sentences of paragraphs, and so on, and can put more weight on the more important parts of a document (like Harvest).
What database technology is used for the index?
This determines the space/speed tradeoff. glimpse, for example, offers three kinds of databases with different space/speed tradeoffs. Simple solutions store everything in one file and recreate it on every run. Advanced databases allow distributed index files and incremental updates.
The search interface
How flexible and powerful is the query language? Does it support boolean or regular-expression queries? Can similar-sounding words, synonyms, or words with alternative endings also be found? How customizable is the WWW interface? Is there a command-line version of the search program?
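The central data structure behind all of these packages is an inverted index mapping words to the documents that contain them. A minimal full-text sketch in Python (the document names and texts are made up; real packages add summaries, weighting, and on-disk storage):

```python
import re
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: word -> set of document names."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Boolean AND query: return the documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Incremental updating, compression of the index, and fuzzy matching are exactly the points where the packages in the table below differ.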

The following table lists the features of some open source indexing solutions as of June 2000:

glimpse 4.1
  license: restricted (shareware) license since version 4.1, source available
  access methods: file system
  supported file formats: ASCII "text" (including HTML)
  compression formats: pluggable filters (read stdin, write stdout) allow customization
  full text/summary: full text, but filters can be used for summary
  URLs extracted from: none (the whole thing is file system based)
  index/document size ratio: "tiny" (2-3%), "small" (7-9%), or "medium" (20-30%)
  database: special
  distributed database: no; incremental indexing: yes, but may result in a less efficient index
  interface: command line, server
  boolean queries: yes; regex queries: some; fuzzy matching (soundex, endings): no

SWISH++ 4.6.1
  license: GPL
  access methods: local file system; http through wget (requires a temporary copy of remote files)
  supported file formats: ASCII, HTML, PDF, external converters, strings-like text extraction for binaries
  compression formats: customizable
  full text/summary: ?
  URLs extracted from: see wget
  index/document size ratio: ?
  database: memory-mapped images of STL maps
  distributed database: no; incremental indexing: yes
  interface: command line, server, WWW/CGI
  boolean queries: yes; regex queries: no; fuzzy matching (soundex, endings): yes (English only)

Harvest 1.6.1 (Feb 2000)
  license: GPL
  access methods: http, https, ftp, gopher, and nntp
  supported file formats: completely customizable through external decompressors, converters, and summarizers
  compression formats: gzip (customizable)
  full text/summary: full text and summary (completely customizable)
  URLs extracted from: HTML and PDF (completely customizable)
  index/document size ratio: depends on how the summaries are done; excessive if full-text indexing
  database: glimpse on SOIF files
  distributed database: no; incremental indexing: yes
  interface: WWW/CGI
  boolean queries: yes; regex queries: some; fuzzy matching (soundex, endings): no

Harvest-NG
  license: GPL
  access methods: http, https, ftp, gopher, and nntp
  supported file formats: completely customizable through external decompressors, converters, and summarizers
  compression formats: gzip (customizable)
  full text/summary: full text and summary (completely customizable)
  URLs extracted from: HTML and PDF (completely customizable)
  index/document size ratio: ?
  database: DBM-based database of SOIF objects
  distributed database: no; incremental indexing: yes
  interface: none (this is only a gatherer and must be combined with a search interface to make a full search solution)
  boolean queries: no; regex queries: no; fuzzy matching (soundex, endings): no

ht://dig 3.1.5
  license: GPL
  access methods: http (and file system)
  supported file formats: ASCII, HTML, PDF; other formats through external converters
  compression formats: customizable
  full text/summary: full text
  URLs extracted from: HTML
  index/document size ratio: ?
  database: Berkeley DB
  distributed database: no; incremental indexing: yes
  interface: WWW/CGI
  boolean queries: yes; regex queries: no; fuzzy matching (soundex, endings): yes
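Fuzzy matching by sound, as listed above, is classically done with the soundex code: every word is reduced to its first letter plus three digits describing the following consonants, so that similar-sounding words collide. A simplified sketch of American Soundex (not necessarily the exact variant any of the packages above implements):

```python
# Consonant classes of American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    """Simplified American Soundex: first letter plus three digits."""
    word = word.lower()
    result = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":             # h and w are transparent separators
            continue
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:  # skip vowels and repeated codes
            result += code
        prev = code
    return (result + "000")[:4]    # pad or truncate to four characters
```

Words with the same code, such as "Smith" and "Smyth" (both S530), are treated as matches.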

More comments:

ht://dig
is currently the best "plug-and-play" choice among the open source indexing and searching packages. (It is used by SuSE Linux as the local search engine.) As of htdig-3.1.5, ht://dig also allows on-the-fly decompression and conversion through external, configurable converters.
mnogosearch
not tested...
Harvest
is a collection of tools to gather, index, search, and cache documents in various formats. The original Harvest project ended in 1996 with version 1.4pl2 (under a BSD-style license). The 1.5 versions were maintained and distributed under the GPL. Harvest's document cache was further developed under the name Squid.
Harvest-NG
Harvest-NG is a partial reimplementation of the Harvest package in Perl.
SWISH-E
is an enhanced version of Kevin Hughes' SWISH.
Indexes
Suchfibel.de
SearchTools.com

Link Checker

MOMSpider
This was the first "big" link checker. It depends on Perl 4 and the old libwww-perl library and is a bit outdated.
Checkbot
Checkbot is a Perl program that depends on the "new" libwww-perl 5 library. It is easy to install and configure and does a reasonable job.
Checklinks
A Perl 5 link checker. Version 1.0.1 is a bug-fix release of March 26, 2000.
DLC
Perl 5, GPL; version 0.4 of December 7, 1999.
InSite
"Site management tool" (link check and statistics, parallel remote link checking), Perl5, (Version 2.20 of 2000-09-27).
Linbot
Linbot is an easy-to-use Python program, which is distributed with SuSE Linux. It is now discontinued due to legal problems.
Lambda LinkCheck
by Lars Marius Garshol is a multi-threaded link checker written in Python. (discontinued, offline?)
Webtester
Perl script, (Version 1.10; December 29, 1999)
LinkCheck
Yet another Perl link checker.
LinkChecker
Håkan Svensson, version 1.1.2 of September 21, 1998; GPL.
linkcheck
is an old (1994) Perl 4 script.
Webxref
another old (1995) Perl script
lvrfy
is an old shell script (1995).
Indexes
Software & Services for Checking Links & Creating Site Maps @elsop.com
index @ directory.google.com
index @ dmoz.org
index @ thinkmobile.com
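All of these checkers share the same core loop: fetch a page, extract the anchor targets, and probe each one. A sketch using nothing but the Python standard library (the URLs, the HEAD probe, and the timeout are illustrative; real checkers add recursion over a whole site, robots.txt handling, and parallel requests):

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def check_link(url, timeout=10):
    """Probe a single URL with a HEAD request; return (url, status or error)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return url, response.status
    except Exception as exc:
        return url, str(exc)
```

Feeding a fetched page to LinkExtractor and mapping check_link over its links yields the kind of broken-link report the tools above produce.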

Promotion

Indexes
planet.fef.com

General Indexes

References From Chapter 11 ``The HTML Sourcebook''
DLR WWW Software archive

last reviewed: March 13, 2001, Stefan Jaschke
Disclaimer