WWW Utilities


The WWW Consortium is the authoritative source on the latest standards in markup (HTML/XHTML), style sheets (CSS/XSL), Scalable Vector Graphics (SVG), Synchronized Multimedia (SMIL), protocols (HTTP), and others. Stefan Münz' SELFHTML is simply the best documentation of HTML and JavaScript I know of. (in German)

WWW Server

Apache is all you need :-).

SGML, HTML, and XML Tools

is an emacs mode for editing SGML and XML documents. Given a DTD, it supports indentation, syntax highlighting, and insertion of only the allowed elements in a given context. Although it is not a validating parser, it helps to write syntactically correct documents.
is an XML Authoring Environment for Emacs.
W3C Validator
Free XML software

Webserver Log Analyzers

is written in C and said to be fast. It is very popular, very versatile and configurable, provides output in several languages, and can sort web pages hierarchically.
This is the most "modern" of the freely available log analyzers (as of July 1999): it has a nice graphical display, is configurable, and can also report referrer and agent if the log format is combined/extended.
webalizer and http-analyze are very similar. (webalizer is patterned after http-analyze.) Version 2.01 is free for non-commercial use, but there is no source.
Wusage ($25-2000)
another popular log analyzer gone commercial
one of the first open source log analyzers, last version Nov 96
is a small Perl script that produces ASCII output. It estimates the "number of visitors" (and how many pages a single visitor accesses) on the assumption that accesses from the same site within a short time interval come from the same user. This assumption may be wrong for servers with high traffic.
is part of the WWW Utilities by Benjamin Franz.
lets you watch which robots wander your site.
log analyzers @ hypernews
log analyzers @ Yahoo
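The visitor-estimation heuristic mentioned above (accesses from the same host within a short interval count as one visitor) can be sketched as follows. This is only an illustration of the idea, not the logic of any particular analyzer; the 30-minute session window is an assumed, illustrative value.

```python
from typing import Iterable, Tuple

# Assumed session window: accesses from the same host within this many
# seconds are attributed to the same visitor. 30 minutes is illustrative.
SESSION_WINDOW = 30 * 60

def count_visitors(records: Iterable[Tuple[str, int]]) -> int:
    """records: (host, unix_timestamp) pairs, e.g. parsed from an
    access log. Returns the estimated number of visitors (sessions)."""
    last_seen = {}
    visitors = 0
    for host, ts in sorted(records, key=lambda r: r[1]):
        prev = last_seen.get(host)
        if prev is None or ts - prev > SESSION_WINDOW:
            visitors += 1  # first access, or gap too large: new session
        last_seen[host] = ts
    return visitors
```

Two hits from the same host ten minutes apart count as one visitor; the same host returning a day later counts again. As the text notes, proxies and NAT make this unreliable for high-traffic servers, since many users can share one host address.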

Indexing and Searching

Searching large document collections by keyword is facilitated by keyword indexes, i.e. databases recording which words appear in which documents. Index and search packages can be judged by the following characteristics:

Type of access in the process of gathering
Some solutions can access the documents only through the local file system, which severely limits what can be indexed. Others can index only what is available through public HTTP servers. The most flexible spiders/gatherers fetch through any major protocol (http/ftp/gopher/...), can access password-protected sites, and, for performance reasons, can bypass the server for files in the local file system (like ht://dig).
Supported file formats
Some solutions are geared towards HTML and can index essentially only HTML and very few other formats (like ht://dig). More flexible gatherers allow plugging in external converters that decompress and extract plain text from any desired file format on the fly (like Harvest-NG).
How much information is extracted?
Simple-minded solutions just do a full-text extraction, which may be infeasible for huge document collections. More advanced solutions, though also more limited to HTML, know about HTML meta tags, titles, subtitles, links to a document, first sentences of paragraphs, etc., and put more weight on the more important parts of a document (like Harvest).
What database technology is used for the index?
This determines the space/speed tradeoff. glimpse, for example, offers three kinds of databases with different space/speed tradeoffs. Simple solutions just store everything in one file and recreate it on every run. Advanced databases allow distributed index files and incremental updates.
The search interface
How flexible and powerful is the query language? Does it support boolean operators or regular expressions? Can similar-sounding words, synonyms, words with alternative endings, etc. also be found? How customizable is the WWW interface? Is there a command-line version of the search program?
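The data structure underlying all of these packages is the inverted index described above: a map from each word to the set of documents containing it. A minimal sketch, including a naive boolean AND query (none of the listed packages works exactly this way; this only illustrates the idea):

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: mapping of document name -> text.
    Returns an inverted index: word -> set of document names."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Boolean AND: documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for w in words[1:]:
        result &= index.get(w, set())
    return result
```

A real package differs mainly in the points listed above: what it stores per word (full text vs. weighted summaries), how the index is persisted (one flat file vs. an incremental database), and how rich the query language is.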

The following table lists the features of some open source indexing solutions as of June 2000:

glimpse 4.1
    license: restricted (shareware) license since version 4.1; source available
    access methods: file system
    supported file formats: ASCII "text" (including HTML)
    supported compression formats: pluggable filters (read stdin, write stdout) allow customization
    full text / summary: full text, but filters can be used for summary
    URLs extracted from which formats: none (the whole thing is file system based)
    index/document size ratio: "tiny" (2-3%), "small" (7-9%), "medium" (20-30%)
    database: special
    distributed database: no
    incremental indexing: yes, but may result in a less efficient index
    interface: command line, server
    boolean queries: yes
    regex queries: some
    fuzzy matching (soundex, endings): no

SWISH++ 4.6.1
    license: GPL
    access methods: local file system; http through wget (requires temporary copies of remote files)
    supported file formats: ASCII, HTML, PDF, external converters, strings-like text extraction for binaries
    supported compression formats: customizable
    full text / summary: ?
    URLs extracted from which formats: see wget
    index/document size ratio: ?
    database: memory-mapped images of STL maps
    distributed database: no
    incremental indexing: yes
    interface: command line, server, WWW/CGI
    boolean queries: yes
    regex queries: no
    fuzzy matching (soundex, endings): yes (English only)

Harvest 1.6.1 (Feb 2000)
    license: GPL
    access methods: http, https, ftp, gopher, and nntp
    supported file formats: completely customizable through external decompressors, converters, and summarizers
    supported compression formats: gzip (customizable)
    full text / summary: full text and summary (completely customizable)
    URLs extracted from which formats: HTML and PDF (completely customizable)
    index/document size ratio: depends on how the summaries are done; excessive if full-text indexing
    database: glimpse on SOIF files
    distributed database: no
    incremental indexing: yes
    interface: WWW/CGI
    boolean queries: yes
    regex queries: some
    fuzzy matching (soundex, endings): no

Harvest-NG
    license: GPL
    access methods: http, https, ftp, gopher, and nntp
    supported file formats: completely customizable through external decompressors, converters, and summarizers
    supported compression formats: gzip (customizable)
    full text / summary: full text and summary (completely customizable)
    URLs extracted from which formats: HTML and PDF (completely customizable)
    index/document size ratio: ?
    database: DBM-based database of SOIF objects
    distributed database: no
    incremental indexing: yes
    interface: none (this is only a gatherer and must be combined with a search interface to make a full search solution)
    boolean queries: no
    regex queries: no
    fuzzy matching (soundex, endings): no

ht://dig 3.1.5
    license: GPL
    access methods: http (and file system)
    supported file formats: ASCII, HTML, PDF, other formats through external converters
    supported compression formats: customizable
    full text / summary: full text
    URLs extracted from which formats: HTML
    index/document size ratio: ?
    database: Berkeley DB
    distributed database: no
    incremental indexing: yes
    interface: WWW/CGI
    boolean queries: yes
    regex queries: no
    fuzzy matching (soundex, endings): yes

More comments:

is currently the best "plug-n-play" of the open source indexing and searching packages. (It is used by SuSE Linux as the local search engine.) ht://dig now (as of htdig-3.1.5) also allows on-the-fly decompression and conversion through external/configurable converters.
not tested...
is a collection of tools to gather, index, search, and cache documents in various formats. The original Harvest project ended in 1996 with version 1.4pl2. (It had a BSD-style license.) The 1.5 versions were subsequently maintained and distributed under the GPL. Harvest's document cache was further developed under the name Squid.
Harvest-NG is a partial reimplementation of the Harvest package in Perl.
is an enhanced version of Kevin Hughes' SWISH.

Link Checker

This was the first "big" link checker. It depends on Perl 4 and the (old) libwww-perl library and is a bit outdated.
Checkbot is a Perl program that depends on the "new" libwww-perl 5 library. It is easy to install and configure and does a reasonable job.
A Perl 5 link checker. Version 1.0.1 is a bug-fix release of March 26, 2000.
Perl 5, GPL, version 0.4 of December 7, 1999.
"Site management tool" (link checking and statistics, parallel remote link checking), Perl 5 (version 2.20 of 2000-09-27).
Linbot is an easy-to-use Python program distributed with SuSE Linux. It is now discontinued due to legal problems.
Lambda LinkCheck
by Lars Marius Garshol is a multi-threaded link checker written in Python. (discontinued, offline?)
Perl script (version 1.10; December 29, 1999).
Yet another Perl link checker.
Håkan Svensson, version 1.1.2 of September 21, 1998; GPL.
is an old (1994) Perl 4 script.
another old (1995) Perl script.
is an old shell script (1995).
Software & Services for Checking Links & Creating Site Maps @elsop.com
index @ directory.google.com
index @ dmoz.org
index @ thinkmobile.com
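All of the link checkers above follow the same basic recipe: fetch a page, extract the href targets, and test each one with an HTTP request. A minimal sketch using only the Python standard library (modern Python, unlike the Perl 4/5 tools listed; the HEAD-request approach is one common choice, not what any particular tool above does):

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Returns the absolute URLs of all links found in an HTML page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def check_link(url, timeout=10):
    """Returns (url, status): an HTTP status code or an error string."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return url, resp.status
    except HTTPError as e:
        return url, e.code
    except URLError as e:
        return url, str(e.reason)
```

A real checker adds what the tools above advertise: recursion within the site, parallel requests, and a report grouped by status. Some servers mishandle HEAD, which is why several of the listed tools fall back to GET.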



General Indexes

References from Chapter 11 of "The HTML Sourcebook"
DLR WWW Software archive

last reviewed: March 13, 2001, Stefan Jaschke