Managing the Web at its birthplace

Prepared for the South African Annual Conference on Web applications (1999-09-09)

To be published in the South African Journal of Information Management

Dariusz J. Kogut (for the TORCH section)

Abstract

CERN, the European Laboratory for Particle Physics, is the place where the Web was born. It has the longest tradition in Web-related development; its main role, however, is to do basic physics research. This ambiguous combination of interests makes the task of managing the Web at CERN most challenging. The large number of Web servers to consolidate, the users' high expectations of performance and centrally supported tools, and the expression of individual creativity within some corporate image standards create a non-trivial Intranet operation.

In this paper the CERN central servers' configuration will be discussed. The choice of commercial and public domain software for the server, search engine, Web discussion forums and log statistics will be presented. The guidelines and recommendations on Web authoring and site management will be listed as well as the criteria that led us to these conclusions.

Introduction

Managing a very large Intranet, which also happens to be the oldest one, may seem at first glance to be a straightforward technical issue, but it is more than that. It touches upon very interesting human and social aspects, such as:

who can publish
what can be published
how to promote important information
what is important information
who is responsible for the information quality
what is creative/what is bad taste
how independent are the data structures from the content
how uniform the Intranet appears

Our Intranet is far from being the most attractive and concise one. However, the following statements are all true and explain or justify the situation:

our "lack of discipline" also has its advantages, as it led to achievements like the World Wide Web.
the main objective of the CERN authors and web server managers is to do High Energy Physics research, therefore they never focus on their sites' aesthetics.
physicists are used to building entire projects by themselves and solving problems across disciplines (physics, engineering, computing); they therefore dislike conforming to restrictions on corporate image that are accepted without problem in business environments.

Experience has shown that trying to over-centralize and control the presentation, location, platform or number of the CERN web pages/sites/servers is utopian and, in any case, contrary to the design and philosophy of the Web, the "universe of network-accessible information" (Tim Berners-Lee).

What we found realistic and valuable was to offer solid and open solutions that would prove themselves performant and simple to adopt or comply with. These solutions were made available on the central web server www.cern.ch for those who have no resources to set up a service of their own. They were also documented as tools, courses and guidelines for those who want to stay independent and run their own service, with the possibility of getting advice.

This work covered the period of 1997-98 and is summarized here. As people and strategies change there is no commitment that the policy (or URLs!) of this paper will remain valid in the distant future.

The choice of web server

CERN was, of course, using the home-made CERN httpd from the Web's birth until June 1998, when the central server www.cern.ch converted to Apache (some other servers on-site had done this earlier). The performance degradation observed during the last year of use of the CERN httpd was striking, and it was very hard for us to identify its cause. We upgraded the hardware, doubled the CPUs, added memory, increased the AFS cache and monitored the impact of CGI scripts running on the host, only to find in practice that the CERN httpd simply does not scale beyond a certain number of hits (~100K requests/day or on average 4 requests/sec). This was suggested by Jes Sorensen/CERN and proved to be true after we installed Apache and settled on the following values for the httpd.conf tuning parameters:

MaxKeepAliveRequests 100
MaxSpareServers 40
MaxClients 250
MaxRequestsPerChild 1000
MinSpareServers 20

Apache proved to be a very powerful Web server. What we appreciated most in Apache was the use of Virtual Web Servers, Server Side Includes and the handling of page protections.

The Virtual Web Servers allowed us to host those web sites that didn't have resources to maintain their own system any more but wanted to keep their identity (URLs) unchanged.

The Server Side Includes helped us create and maintain a uniform look on our service pages and every time site-wide changes were needed the update had to be done only in one place.
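
To illustrate why a site-wide change needs to be made in only one place, the following is a minimal Python sketch of what include expansion does. It is not Apache's implementation: the in-memory FILES table, the file names and their content are assumptions made for the example, and only the include virtual="..." form of the directive is handled.

```python
import re

# Hypothetical in-memory document tree; a real server reads files from disk.
FILES = {
    "/includes/header.html": "<h1>Example service header</h1>",
    "/includes/footer.html": "<address>Example footer</address>",
    "/services/index.html": (
        '<!--#include virtual="/includes/header.html" -->\n'
        "<p>Service description.</p>\n"
        '<!--#include virtual="/includes/footer.html" -->'
    ),
}

# Matches directives of the form <!--#include virtual="/some/path" -->
INCLUDE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')


def expand(path, files=FILES, depth=0):
    """Return the content of `path` with include directives replaced recursively."""
    if depth > 10:
        # Guard against files that include each other in a loop.
        raise RecursionError("include nesting too deep")

    def substitute(match):
        target = match.group(1)
        if target not in files:
            return "[an error occurred while processing this directive]"
        return expand(target, files, depth + 1)

    return INCLUDE.sub(substitute, files[path])


if __name__ == "__main__":
    print(expand("/services/index.html"))
```

Every page that includes the shared header picks up a change to that single file the next time it is served, which is the property we relied on.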

Most information owners want to know who visits their pages and ask for access to the Apache logs in order to extract the information that concerns them. We chose the analog package for this purpose: users are offered a list of fields and tailorable parameters from which they can select, and can produce statistics reports for the period, the criteria and the format that suit them.

Another concern of every system manager running a web server is the potential security risk of CGI scripts. To ensure some safety we decided to keep all scripts in one place. Their authors are not allowed to move them there by themselves but need to submit them to a programmer of the web support team, who checks them for potential security holes. Some documentation was made available recently on typical errors in Perl with shell commands and file I/O.
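
The class of error involved can be sketched as follows. The example is written in Python rather than Perl and is not taken from the documentation mentioned above; the function names and the allowed character set are assumptions chosen for illustration. It shows two common precautions: accepting only plain file names from the user, and passing untrusted input to an external program as a single argument so that no shell ever parses it.

```python
import re
import subprocess
import sys

# Assumed policy for this sketch: letters, digits, underscore, dot, hyphen.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_filename(name):
    """Accept only plain file names: no path separators, no shell metacharacters."""
    return SAFE_NAME.fullmatch(name) is not None and ".." not in name


def echo_argument(value):
    """Hand untrusted input to a child process as one argv element.

    Because the command is given as a list and no shell is involved,
    characters such as ';' or '|' in `value` are never interpreted.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    hostile = "report.txt; echo INJECTED"
    print(is_safe_filename(hostile))  # rejected by the file-name check
    print(echo_argument(hostile))     # printed literally, not executed
```

Building a command string from form input and handing it to a shell is the pattern a reviewer would look for first; the list form above avoids it.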

(Co*)authoring

(*)"L'enfer c'est les autres" J-P. Sartre

The challenges in this area depend:

on the site's size (one page, a few pages, hundreds of pages?)
on the documents' history (newly created, legacy papers)
on the documents' "mission" (to print or to link)
on the documents' complexity (images, dynamic content)
on the number of (co-)authors
on their working platform
on their preferred authoring tool (package to edit pages and manage web sites)

The above do not concern page/site style, fonts and content but only version consistency, document portability and write permissions.

We started by making an inventory of the tools available on the market for page editing, site management and image processing. It was immediately obvious that keeping such an inventory up-to-date, evaluating all of the packages or supporting more than a handful of them is a never-ending task.

Therefore, we evaluated a subset of these products against a list of criteria and concluded that:

legacy documents that traditionally existed on paper and should continue to be printable with a rigid format and page numbers should be published as PDF or PostScript.
users who have a very small number of pages to write should be left free to use their preferred editing tool and save them in a web-viewable format (e.g. HTML or PDF).
Unix users are usually happy to write plain HTML and should be left alone till their sites become relatively large (over 50 pages).
PC and Mac users enjoy the availability of a large number of editing/management tools and should choose the most standard and open ones, i.e. those creating portable code across platforms and server software.

The products we actually examined more carefully were:

Macromedia Dreamweaver (for Windows and Macs)
GoLive CyberStudio (excellent but Mac-only at the time)
MS FrontPage 98 (very good if one stays within Microsoft products)
NetObjects Fusion (good for very large sites, several hundred pages, expensive)
Adobe PageMill (limited functionality)

Their page editing facilities are very similar, i.e. link insertion and list, template and table creation are all quite easy.

The worries start when proprietary add-ons reduce the portability of the end-product.

We needed to come up with a recommendation in a finite amount of time and suggested Macromedia Dreamweaver for medium-sized sites, not for its perfection (there are bugs!) but for its open, straightforward features that comply with W3C standards. Some of them, in a list that is neither exhaustive nor exclusive, are:

HTML hiding
Clean HTML production
Easy link, table, image, font, metadata insertion/change
Easy form creation
Templates (local or remote) & Library elements
Page result preview in user's preferred browsers
Preferred external editor invocation
Link validity checking
Easy site definition and uploading to server
Easy maintenance of any site on remote Unix or Windows system
Site map automatically built
Safe editing by >1 authors
Help with scripting, JavaScript, DHTML
Server Side Includes (SSI) Support (also locally)
Common plug-ins' insertion
Cascading Style Sheets' (CSS) support
XML support / parser

From the page usability point of view one of the most serious problems we have due to the large number of authors (over 1,000) and the lack of coordination is the quality of the page content. Pages are written and forgotten, authors leave the organization, users don't know whom to contact to find out what is still valid. Some of our collaborators in the 20 member States of the laboratory or the rest of the world have modest network connectivity and can't use sites heavily loaded with images and animations. For these reasons we issued guidelines to authors explaining that on every page there should be:

a signature for the readers' information
a mail address for feedback
a date when the page might expire or need review
some concern for users with slow lines (no page over-load with pictures etc)
the appropriate metadata for promoting the pages in good ranks of search results
a robots.txt file at the level of the site's document root that prevents search engines from unauthorised indexing
good content of the TITLE tag that corresponds to the page's mission
ALT attributes of IMG elements that provide a text description of an image (vital for interoperability with speech-based and text only user agents).
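
Several of these guidelines can be checked mechanically. The following Python sketch, using only the standard html.parser module, flags a page with an empty TITLE, IMG elements without ALT text, no mailto link for feedback, or no description/keywords metadata. The class name, the function name and the choice of which metadata names to look for are assumptions for the example; it is not one of the tools we deployed.

```python
from html.parser import HTMLParser


class GuidelineChecker(HTMLParser):
    """Collects the few facts about a page needed for the checks below."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images_without_alt = 0
        self.has_mailto = False
        self.meta_names = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            # Assumption of this sketch: a missing or empty ALT is reported.
            self.images_without_alt += 1
        elif tag == "a" and (attrs.get("href") or "").lower().startswith("mailto:"):
            self.has_mailto = True
        elif tag == "meta" and attrs.get("name"):
            self.meta_names.add(attrs["name"].lower())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def check_page(html):
    """Return a list of guideline problems found in the HTML text."""
    checker = GuidelineChecker()
    checker.feed(html)
    checker.close()
    problems = []
    if not checker.title.strip():
        problems.append("missing or empty TITLE")
    if checker.images_without_alt:
        problems.append(f"{checker.images_without_alt} IMG element(s) without ALT text")
    if not checker.has_mailto:
        problems.append("no mail address for feedback")
    if not checker.meta_names & {"description", "keywords"}:
        problems.append("no description/keywords metadata")
    return problems


if __name__ == "__main__":
    page = "<html><head><title></title></head><body><img src='a.gif'></body></html>"
    for problem in check_page(page):
        print(problem)
```

Checks such as the signature or the expiry date depend on local conventions for how they are written on the page and are left out of the sketch.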

Some management structures were also put in place per functional unit in the laboratory to assign webmasters in each administrative group so that some homogeneity and information update can be achieved.

An unsolvable problem is the choice of URL names. Every time the administrative structure was reflected in the file path or the URL, we had trouble keeping it meaningful for more than a year. For this we have hardly anything better to offer than tombstones and Redirects to the pages' new location. In the process some links break. For such checks, and for web-based communication within working projects, we have selected or developed a few tools, explained below.

Tools

HyperNews : a freeware system to create web-based discussion forums. Its basic advantages: the fun user interface, the ease of message submission and the great support by the developers.
DoctorHTML : a one-time charge commercial product that checks the integrity of HTML links in a tree of documents. Its best feature: the author's fast replies to questions.
WhatsNew : home-made script that lists web pages updated in a given site during the last X days.
Pagesize : home-made script that gives a page's size in bytes (text, tags and images are measured separately).
metadata tool : home-made script that prompts the author to give values to metadata and applies them to the page.
HTML & links' validation tool : home-made tool that checks a given URL for spelling or HTML errors and broken links.
translate : Perl script that substitutes one string with another across an entire site.

The last 3 tools in this list are now obsolete if one uses a modern authoring tool.
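
To give an idea of how small such home-made scripts can be, the WhatsNew idea might be sketched in Python as below. This is not the original script: the function name, the restriction to .html/.htm files and the use of file modification times are assumptions made for the example.

```python
import os
import time


def whats_new(root, days, now=None):
    """List .html/.htm files under `root` modified in the last `days` days.

    Paths are returned relative to `root`, most recently modified first.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    recent = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            full = os.path.join(dirpath, name)
            mtime = os.path.getmtime(full)
            if mtime >= cutoff:
                recent.append((mtime, os.path.relpath(full, root)))
    return [path for _mtime, path in sorted(recent, reverse=True)]


if __name__ == "__main__":
    # Example: pages changed during the last 7 days in the current directory.
    for page in whats_new(".", 7):
        print(page)
```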

Searching

In 1997 we decided to buy a proper search engine because:

performant search engines are complex products and in-house development would not have the appropriate quality.
the size of our Intranet became important (it then contained 250,000 and now about 430,000 unique URLs).
although network bandwidth in Europe is expensive, our users had no choice but to search for local web information in American indexes.

In order to select a search engine we made a list of requirements like:

support of metadata
modest system resource consumption
possibility to make private index collections
friendly web-based administration interface
indexing document types other than HTML (e.g. PostScript, PDF)
customisable user interface and help
good customer support (access to the product developers)
possibility to index protected pages
ease of product installation
speed of query results
frequent/flexible index update without "harassing" the web servers on-site.
very good price!

We found all of the above requirements quite well implemented in the Ultraseek server by Infoseek, with its recent enhancement, the Content Classification Engine.

Indexing works by following links. Web documents not linked from anywhere will not be indexed, but can be added manually by the author if they are allowed by the configuration filters. This is meant to protect against unauthorised inclusion of foreign pages in the CERN index. As the pages in our index (>=300,000) are too numerous and their scope too vast, we developed a sub-search facility in Perl for authors who want to restrict searches to their own part of the directory tree. This was very much used and appreciated.
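
The behaviour described here (follow links, never index unlinked pages unless they are added by hand, and let a filter reject foreign pages in both cases) can be sketched in Python as follows. This is an illustration, not the search engine's code: the LINKS table stands in for real HTTP fetches, and the URLs, the function names and the host-suffix filter are assumptions.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: page URL -> URLs it links to.
LINKS = {
    "http://www.cern.ch/": ["http://www.cern.ch/a.html", "http://example.org/"],
    "http://www.cern.ch/a.html": ["http://www.cern.ch/b.html"],
    "http://www.cern.ch/b.html": [],
    "http://www.cern.ch/orphan.html": [],  # linked from nowhere
    "http://example.org/": ["http://example.org/x.html"],
}


def allowed(url, suffix="cern.ch"):
    """Configuration filter: accept only hosts inside the given domain."""
    host = urlparse(url).hostname or ""
    return host == suffix or host.endswith("." + suffix)


def build_index(seeds, links=LINKS, manual_additions=()):
    """Breadth-first traversal from the seeds plus manually added URLs.

    The filter applies to followed links and to manual additions alike.
    """
    queue = deque(u for u in list(seeds) + list(manual_additions) if allowed(u))
    indexed = set()
    while queue:
        url = queue.popleft()
        if url in indexed:
            continue
        indexed.add(url)
        for target in links.get(url, []):
            if allowed(target) and target not in indexed:
                queue.append(target)
    return indexed


if __name__ == "__main__":
    print(sorted(build_index(["http://www.cern.ch/"])))
```

In the sketch the orphan page enters the index only when passed in manual_additions, and a foreign URL is rejected even when added by hand.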

As administrators we were happy with the basic features of our search engine, but we found that our users had difficulty using the commands that narrow down searches to the best-matching answers. For this reason we developed an additional layer on top of the standard user interface that allows queries to be typed in plain English text. This application is explained below.

Natural language web searching at CERN

The ultimate goal of search engine programmers is to make querying a database or index so simple that anyone can access the information without difficulty. For this purpose TORCH (naTural language prOcessor for web seaRCHing) has been developed at CERN. Natural language processing appears to be one of the hardest problems of artificial intelligence due to the complexity, irregularity and diversity of human language. In natural language web searching, a user conducts a search by describing the desired information. TORCH does not require the searcher to use a mysterious syntax. Users type their query in natural English and need not care about boolean logic or groups of keywords. Advanced users may be interested in the set of synonyms and the page of coordinate terms that appear for each query as a supplement.

The diagram below describes TORCH main components.

TORCH has been written in the Python language (http://www.python.org) and implemented as a part of the Infoseek Search Engine. TORCH performs a syntactic (including fuzzy logic negotiation) and semantic analysis of a given query. It retrieves synonyms, antonyms, hypernyms and hyponyms from the WordNet Dictionary to make the searching more efficient. Afterwards it tries to find a definition of each keyword in the CERN Public Pages Dictionary, which extends TORCH's knowledge of advanced particle physics. If the definition is found, it is displayed on the user interface web page. The definitions in the CERN Public Pages Dictionary are stored as pieces of HTML code, so all the advantages of hypertext can be used. TORCH builds its own Infoseek Search Engine query during the semantic analysis process. In general, the query starts with proper nouns followed by a strategic keyword. Next, phrases, nouns and adjectives appear. CERN coordinate terms are stuck together and placed at the end of the query. Due to its open architecture, TORCH may be easily maintained and developed.
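
The ordering of terms in the generated query can be illustrated with a short Python sketch. It is not TORCH code: the tag names, the input format (terms already labelled by some earlier analysis) and the choice to join the coordinate terms into a single space-separated group are all assumptions made for the example.

```python
# Assumed tag names, in the order in which terms enter the query.
ORDER = ["proper_noun", "strategic", "phrase", "noun", "adjective"]


def build_query(tagged_terms):
    """Order (term, tag) pairs as described in the text.

    Proper nouns come first, then the strategic keyword, then phrases,
    nouns and adjectives. Terms tagged "coordinate" are joined into one
    group and placed last. Returns the query as a list of parts.
    """
    parts = []
    for tag in ORDER:
        parts.extend(term for term, t in tagged_terms if t == tag)
    coordinate = [term for term, t in tagged_terms if t == "coordinate"]
    if coordinate:
        parts.append(" ".join(coordinate))
    return parts


if __name__ == "__main__":
    # Hypothetical output of the syntactic and semantic analysis.
    terms = [
        ("superconducting", "adjective"),
        ("magnet", "strategic"),
        ("LHC", "proper_noun"),
        ("CMS", "coordinate"),
        ("temperature", "noun"),
        ("ATLAS", "coordinate"),
        ("dipole field", "phrase"),
    ]
    print(build_query(terms))
```

The analysis that assigns the tags, and the lookup of synonyms and definitions, are outside the scope of this sketch.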

Conclusions

History teaches us that whoever was once first in a field will not stay at the top for ever. However, a long tradition gives awareness and the ability to judge critically whether something is a gift or a trap. A service operation like web support is most successful when based on open products, minimum centralization, clear recommendations and well-founded guidelines that reflect the technology solutions of the present.

Acknowledgements

To Tim Berners-Lee for his creativity, modesty and integrity. To Robert Cailliau for reminding me often that things should be made "as simple as possible, but not simpler" and to Darek Kogut for his discreet devotion to the TORCH development.