The World Wide Web Consortium (W3C) is the authoritative source on the latest standards in markup (HTML/XHTML), style sheets (CSS/XSL), Scalable Vector Graphics (SVG), synchronized multimedia (SMIL), protocols (HTTP), and others. Stefan Münz's SELFHTML (in German) is simply the best documentation of HTML and JavaScript I know of.
Apache is all you need :-).
Searching large document collections by keyword is made much easier by keyword indexes, i.e. databases that record which words appear in which documents. Index and search packages can be judged by the following characteristics:
The following table lists the features of some open source indexing solutions as of June 2000:
| name | glimpse 4.1 | SWISH++ 4.6.1 | Harvest 1.6.1 (Feb 2000) | Harvest-NG | ht://dig 3.1.5 |
|---|---|---|---|---|---|
| license | restricted (shareware) license since version 4.1, source available | GPL | GPL | GPL | GPL |
| access methods | file system | local file system; http through wget (requires temporary copy of remote files) | http, https, ftp, gopher, and nntp | http, https, ftp, gopher, and nntp | http (and filesystem) |
| supported file formats | ASCII "text" (including HTML) | ASCII, HTML, PDF, external converters, strings-like text extraction for binaries | completely customizable through external decompressors, convertors, and summarizers | completely customizable through external decompressors, convertors, and summarizers | ASCII, HTML, PDF, other formats through external converters |
| supported compression formats | pluggable filters (read stdin, write stdout) allow customization | customizable | gzip (customizable) | gzip (customizable) | customizable |
| full text/ summary | full text, but filters can be used for summary | ? | full text and summary (completely customizable) | full text and summary (completely customizable) | full text |
| URLs extracted from which formats? | None. The whole thing is file system based. | see wget | HTML and PDF (completely customizable) | HTML and PDF (completely customizable) | HTML |
| index/ document size ratio | "tiny" (2-3%), "small" (7-9%), "medium" (20-30%) | ? | depending on how the summaries are done; excessive if full-text indexing | ? | ? |
| database | special | memory mapped images of STL-maps | glimpse on SOIF files | DBM-based database of SOIF objects | Berkeley DB |
| distributed database? | no | no | no | no | no |
| incremental indexing? | yes, but may result in a less efficient index | yes | yes | yes | yes |
| interface | command line, server | command line, server, WWW/CGI | WWW/CGI | None. This is only a gatherer and must be combined with a search interface to make a full search solution. | WWW/CGI |
| boolean queries? | yes | yes | yes | no | yes |
| regex queries? | some | no | some | no | no |
| fuzzy matching? (soundex, endings) | no | yes (English only) | no | no | yes |
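
To make the idea of a keyword index and the "boolean queries" row more concrete, here is a minimal sketch in Python. It is not how any of the packages in the table are implemented; it only illustrates the general principle of mapping words to the documents that contain them. All names (`tokenize`, `build_index`, `search_all`, `search_any`) and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build a keyword index: each word maps to the set of
    document names in which it appears."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search_all(index, words):
    """Boolean AND: documents that contain every query word."""
    hits = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*hits) if hits else set()


def search_any(index, words):
    """Boolean OR: documents that contain at least one query word."""
    hits = [index.get(w.lower(), set()) for w in words]
    return set.union(*hits) if hits else set()


# Hypothetical sample collection, for illustration only.
documents = {
    "a.html": "Apache is a web server",
    "b.html": "Harvest gathers web documents",
    "c.txt": "glimpse indexes files on the local file system",
}

index = build_index(documents)
print(sorted(search_all(index, ["web", "server"])))
print(sorted(search_any(index, ["apache", "glimpse"])))
```

A query only touches the sets stored for the query words, so the documents themselves are not read again at search time. The price is the extra space for the index, which is what the "index/ document size ratio" row is about. Adding a new document to such a structure only requires inserting its words, which is the basic idea behind incremental indexing.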
More comments: