Robert Inder (R.Inder@ed.ac.uk)
Division of Informatics
University of Edinburgh
Edinburgh, Scotland
This paper describes CAPE, a programming environment that combines Clips And Perl with Extensions. CLIPS is an efficient and expressive forward-chaining rule-based system with a flexible object system (supporting both message passing and generic functions). Perl is a popular procedural language with extremely powerful regular expression matching facilities, and a huge library of freely available software modules. CAPE closely integrates these two programming languages, and provides extensions to facilitate building systems with an intimate mixture of the two languages. These features make CAPE an excellent language for building knowledge-based systems to exploit the opportunities being presented by the Internet.
This paper describes the current version of CAPE and the facilities it offers programmers, including the demonstration systems and ``component applications'' that are distributed with it. The use of the system is then discussed with reference to an application for automatically generating graphs of remote web sites. Finally, planned developments of the system are indicated.
Rule-based Programming, Knowledge-Based System Tools, CLIPS, Perl
Conventional Knowledge Based Systems (KBSs) involve the controlled manipulation of symbolic descriptions of the world.
Extracting such symbolic descriptions from sensory data was (and still is) a major challenge in its own right, addressed by specific fields such as speech recognition and vision, and most work in KBSs tries to sidestep it.
At first, KBSs worked with very restricted amounts of information (e.g. chess positions), typically solving problems described in special-format data files. They began to achieve much wider use and acceptance when Expert Systems began to ask questions of users (``Has the patient got a rash?''), thereby using the user's perceptual abilities and common sense to analyse the world into the abstract terms that the KBSs required. Subsequently, KBSs began to access data, either statically (i.e. a KBS linked to a database) or by analysing real-time feeds (e.g. network monitoring). To a greater or lesser extent, these systems can determine what data is relevant, and when, but only within a fixed (or at least limited) range of data for which the format and semantics were known to the system builders (or maintainers).
Now, we have the Internet.
By providing ``universal'' connectivity between computers, the Internet has created a previously inconceivable range of possibilities for any program. Innumerable documents -- ranging from the ephemera of electronic ``news'' discussions through to published documents (including newspapers, literature and technical documents) -- can be retrieved from almost anywhere in the world within moments. Programs can instantly query a multitude of databases, covering everything from holiday prices to registered trade marks, and call on other computational services such as dynamic information feeds (e.g. stock prices), notification of changes and so forth. Finally, the Internet allows programs to easily interact with huge numbers of people, either by email or using Web-based forms.
The number of sources of information and services continues to grow at an incomprehensible pace. Crucially, though, those sources are outside the control of--indeed, since they are continually growing and evolving, outside the knowledge of--the system's builders.
Fully exploiting the opportunity presented by the Internet--e.g. by building systems which can analyse and filter information for a particular purpose--will in several ways require a new breed of KBSs. They will, at the very least, have to be easily configured to deal with new sources of information, and should, more importantly, be able to explore, discover and operate within a huge and ever-changing world of information.
By making it easier both to locate and to distribute software, the rise of the Internet has also greatly increased the number of readily accessible software packages. These include not only end-user applications, ``plugins'' and extensions for system tools and programming languages, but also library components for more-or-less standard tasks, such as parsing standard document formats and mark-ups.
In particular, there are efficient implementations of data-intensive algorithms such as neural nets, statistical analysis or ``data mining'' tools, and information retrieval and other techniques for processing free text based on word occurrence statistics. Such packages are potentially valuable new sources of information. It will therefore be increasingly important that software tools are able to work with them--either by incorporating libraries or by interfacing to stand-alone packages.
To best exploit the opportunities presented by the rise of Internet, a KBS programming tool should offer...
One way to achieve this is by building libraries for a language which has some of the desired properties. This is the approach taken by Jess, the Java Expert System Shell (see http://herzberg1.ca.sandia.gov/jess/), which implements a rule-based engine within Java.
The alternative is to combine two (or more) languages which each exhibit some of the desired properties into a single programming environment, in the manner of Poplog [1]. This is the approach adopted by CAPE.
CAPE (Clips And Perl with Extensions) is a programming environment which allows programs to be written in an intimate mixture of the CLIPS [2] (CLIPS: C Language Integrated Production System) rule-based and object-oriented language, and Perl [3], a procedural programming language.
CLIPS was chosen because it closely integrates a very fast forward-chaining rule-based system with a flexible object system that supports both message passing and generic functions. CLIPS was initially a partial re-implementation, in C, of Inference ART [6], which was arguably the most powerful of the Lisp-based ``knowledge representation toolkits'' that emerged during the mid/late 1980s. Its rule language features very powerful and efficient pattern matching facilities based on the RETE algorithm [7], including the ability to match against the state of objects, and a truth maintenance mechanism. There is tutorial material for CLIPS in [4], and CLIPS itself is accessible via http://www.ghgcorp.com/clips/ along with detailed manuals and pointers to related software and information.
Perl was chosen because of its extremely powerful regular expression matching facilities, and its huge library of freely available software modules. It also supports complex data structures and combinations of symbolic patterns. There are many books about Perl programming (e.g. [5]), and Perl itself is accessible via http://www.perl.com/, along with manuals and pointers to a huge quantity of related software and information. To the two separate language sub-systems, CAPE adds inter-calling and data transfer. It also provides mechanisms for synchronising the initialisation of data structures between the two programming systems. Finally, it uses CLIPS's run function facility to support monitoring Internet sockets, allowing the system to remain responsive to socket activity while reasoning.
Once a chunk has been read, it is pre-processed. CAPE allows textual substitutions to be defined using Perl regular expressions, and for arbitrary user-defined functions to be called. CAPE itself as yet makes only minimal use of this mechanism (to allow Perl scalar variables to be accessed from within CLIPS code), but the mechanism is provided primarily as a hook to support future language extensions.
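The substitution mechanism can be pictured as a table of regular-expression rewrites applied to each chunk before it reaches an interpreter. The following Python sketch illustrates the idea only; CAPE itself does this with Perl regular expressions, and the `perl-scalar` call in the example is an invented stand-in for whatever CAPE actually emits to fetch a Perl scalar from CLIPS code.

```python
import re

# Illustrative sketch of CAPE-style chunk pre-processing: a table of
# user-defined substitutions, each a regular expression paired with a
# replacement string (or a function of the match), applied in order.
_substitutions = []

def define_substitution(pattern, replacement):
    """Register a textual substitution to run on every chunk."""
    _substitutions.append((re.compile(pattern), replacement))

def preprocess(chunk):
    """Apply all registered substitutions to a chunk of code."""
    for regex, replacement in _substitutions:
        chunk = regex.sub(replacement, chunk)
    return chunk

# Example: rewrite occurrences of $name inside CLIPS code into a call
# that fetches the Perl scalar of that name (function name is invented).
define_substitution(r'\$(\w+)', r'(perl-scalar "\1")')
```

A hook of this shape is enough to support language extensions: a new surface form only needs a pattern and a rewrite into code one of the interpreters already understands.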
Finally, the pre-processed code is passed to one or other of the two language interpreters for execution.
CAPE also offers the user a read/eval/print loop, which prompts for a command, reads and pre-processes a chunk as described above, sends it for evaluation by either the Perl or the CLIPS interpreter, and ensures that the result is printed. The listener itself is written in Perl, and so can be redefined, and thus customised or extended, by the user.
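One trip round such a loop might look like the following Python sketch. The classifier here is a deliberate simplification (real CAPE classifies chunks by inspecting their syntax, as described above), and the interpreter table stands in for the embedded CLIPS and Perl engines.

```python
# A minimal sketch of a CAPE-style listener step: decide which
# interpreter a (already read and pre-processed) chunk belongs to,
# evaluate it there, and print the result.
def classify(chunk):
    """Crude stand-in classifier: treat '('-initial chunks as CLIPS."""
    return "clips" if chunk.lstrip().startswith("(") else "perl"

def listen_once(chunk, interpreters):
    """One read/eval/print step for a chunk that has already been read."""
    result = interpreters[classify(chunk)](chunk)
    print(result)
    return result
```

Because the loop is itself ordinary interpreted code, replacing `classify` or the printing behaviour is just a matter of redefining a function, which is the customisation route the real Perl listener offers.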
The most general bridge between CLIPS and Perl is through evaluation. Both languages provide an eval function which takes a string argument that it parses and executes. Both these functions are made available to the ``other'' language, along with a function which interprets its argument as a CAPE ``chunk'', which is classified as described above. Strings to be evaluated through this mechanism are pre-processed in the same way as code that CAPE has read.
In addition to this completely general evaluation mechanism, CAPE also provides functions for directly calling named functions, and for sending messages. When these functions are called, C code within CAPE maps their arguments directly from the C data structures underlying the calling language into the corresponding C data structures for the target language. Any result returned by the called function is similarly transformed into the form required by the calling language. Since both CLIPS and Perl employ dynamic memory allocation and garbage collection, the code for mapping arguments between the systems must itself allocate (and de-allocate) memory in order to avoid problems caused by the interaction of the two memory management systems.
CAPE itself only supports the mapping of simple data types (strings and numbers, and lists of these) between the component languages. This is felt to be a good compromise between power and simplicity of implementation (and thus comprehensibilty and reliability). It would be possible to provide more complex mappings (e.g. transforming a complete CLIPS object into a Perl hash or object). However, there is as yet no compelling case for providing such mappings in isolation, as opposed to as part of a comprehensive integration of the languages' object systems, which while desirable (see ``Future work'' below), clearly involves considerable design, programming and testing effort.
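The supported subset can be stated compactly: strings, numbers, and flat lists of these cross the boundary; everything else is rejected rather than silently mangled. This Python sketch merely makes that contract explicit (the real mapping is done in C on the interpreters' internal data structures).

```python
# Sketch of the value mapping CAPE performs between CLIPS and Perl:
# only strings, numbers, and lists of these are carried across.
def map_value(value):
    """Return the cross-language image of value, or raise TypeError."""
    if isinstance(value, (str, int, float)):
        return value            # simple types map directly
    if isinstance(value, list):
        return [map_value(v) for v in value]   # e.g. a CLIPS multifield
    raise TypeError(
        f"unsupported type for cross-language call: {type(value).__name__}")
```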
The facilities just described allow CLIPS programs to call specific Perl subroutines, evaluate strings or match Perl regular expressions, on either the left- or right-hand side of a rule--that is, in either the condition or the action part--or from within the body of a CLIPS function or message handler. In particular, rule firing can be made dependent on the successful matching of Perl regular expressions.
In the other direction, Perl programs can call functions defined in CLIPS--both normal functions and generic functions--and can send messages to CLIPS objects. There are also Perl subroutines defined for asserting facts into CLIPS working memory, and for having CLIPS evaluate an arbitrary string.
CAPE provides functions for initiating and configuring Internet socket handling, and for monitoring sockets while the CLIPS rule engine is running. These functions are available from both CLIPS and Perl. The current state of port monitoring and socket connection is used to continually update a collection of CLIPS objects which are accessible from both languages and available for pattern matching in CLIPS rules. This means that systems providing services via an Internet port can be built using only CAPE itself.
The lowest level handling of socket activity is done in C within the core of CAPE. CLIPS is able to call an arbitrary function after each rule firing, and CAPE uses this to check for and accept connections to any ports being monitored, and to aggregate any data received from any current connection. Then, whenever the buffer associated with a connection is full, or its record break character (currently new line) is received, the accumulated byte string is passed to a connection-specific Perl function for filtering. This function could simply respond directly to the input received. Typically, though, it will map it into CLIPS working memory, thereby allowing the full power of the pattern matching to be used to decide when and how to respond.
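The buffering-and-dispatch step just described can be sketched as follows. This is Python rather than the C of CAPE's core, and the filter function here is an ordinary callable standing in for the connection-specific Perl filter.

```python
# Sketch of per-connection record dispatch: bytes arriving on a
# connection are buffered until the record-break character (newline)
# appears, then each complete record is handed to that connection's
# filter function.
RECORD_BREAK = "\n"

class Connection:
    def __init__(self, filter_fn):
        self.buffer = ""
        self.filter_fn = filter_fn   # per-connection filter (Perl in CAPE)

    def receive(self, data):
        """Accumulate incoming data, dispatching every complete record."""
        self.buffer += data
        while RECORD_BREAK in self.buffer:
            record, self.buffer = self.buffer.split(RECORD_BREAK, 1)
            self.filter_fn(record)
```

In CAPE the filter would typically assert the record into CLIPS working memory rather than act on it directly, so that rules decide when and how to respond.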
The CLIPS rule engine has a very powerful mechanism for deciding which rule to fire at any point. This can be used to make extremely subtle and flexible decisions about the flow of control within the system--i.e. about what the CAPE application should do next. Doing this requires structuring the system as a whole as a rule-based system, so that firing rules triggers activities that are possibly implemented in Perl. A system implemented in this way can remain responsive to external events, provided that the code executed by any particular rule does not take too long to run. Perl is used to help map ``the world'' into symbols that CLIPS can then reason about, and then to interpret the symbolic results of that reasoning into actions in the world.
CAPE has been designed (and primarily tested and exercised) with this model of system operation in mind. However, there is no (known!) obstacle to building an ``inverted'' system -- i.e. one in which overall control resides in a Perl program (or an interface generating call-backs into Perl), within which some subroutines specify or initiate rule-based reasoning.
In addition to the demonstration and component applications distributed with the system (see below), CAPE is being used to develop DIME: the Distributed Information Manipulation Environment. DIME is a collection of components useful for building systems to retrieve and manipulate distributed data. These components include a highly flexible document cache (which is able to fetch documents when necessary, store them locally, and search them in various ways) and a framework for specifying and controlling interaction with remote search systems.
DIME is being used to develop two research systems. The first is K-DIME, or Kansei DIME [8], which is concerned with fetching and filtering images based on subjective criteria, or kansei. The second system being built using DIME is Maxwell, a ``smart'' agent intended both to query and to combine the results from a number of book vendors' databases on behalf of the user. An initial version of Maxwell was implemented in Emacs Lisp (see [9] and [10]), and a more powerful version is currently being implemented in CAPE using DIME.
The remainder of this section will illustrate the way various features of CAPE are used and interact by describing a system for generating graphical maps of web sites based on the structure of the site and the nature of the pages it contains.
Given an initial URL and a regular expression to delimit the area to be mapped, the system fetches the relevant pages and analyses them to extract the links that they contain. The eventual aim is to generate a graph and output it in the format required by a graph layout system. The example system is geared towards the layout system found in the ``thinking support tool'' D-Abductor [11].
Although one can trivially generate a graph description from a web site, such approaches do not scale well: even modest web sites give rise to graphs that contain too many nodes to be drawn effectively. Producing workable graphs, therefore, involves either generating or recognising structures that can be used as a basis for omitting or temporarily hiding parts of the graph.
Different graph layout systems will hit size limits at different points. Indeed, the layout system used in this system is not particularly good in this respect, since, because of the way it arranges nodes to reflect graph structure, certain site topologies cause it to make very inefficient use of screen space. However, that simply means that the limitations of drawing complete graphs become apparent sooner, so the scope for intelligently transforming the site description to overcome them can be explored while working with smaller web sites.
A system for mapping a remote web site must fetch the pages that are relevant to the map, identify their inter-connections, and their links to the outside world, and then decide how they can be clustered, and which links, pages, servers and clusters should be shown in the particular graph.
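The link-identification step is the kind of task the paper later notes can be done in a single line of Perl global matching. The following Python sketch shows the same idea; the pattern is a simplification (it only handles double-quoted href attributes) and the triple format mirrors the link facts described later, but neither is taken from CAPE's actual code.

```python
import re

# Rough sketch of link extraction from a fetched page: find anchor
# elements and record (document id, target URL, anchor text) triples,
# the raw material for the CLIPS link facts used by the mapping rules.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)

def extract_links(doc_id, html):
    """Return (doc_id, url, anchor_text) for each link found in html."""
    return [(doc_id, url, anchor) for url, anchor in LINK_RE.findall(html)]
```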
Our programming language should ideally:
The mapping system uses Perl for
However, the top-level of the system is written in CLIPS. In particular, the system's ontology is described by defining CLIPS objects for such concepts as pages, page sets, servers and directories. CLIPS rules and methods are then used for mapping pages into suitable graph descriptions. The current prototype contains just over 40 rules, doing such things as:
This approach is taken to ensure that, as far as possible, rules can be written at the domain level, without reference to the mechanics of fetching or searching documents. The figures give some idea of the extent to which this has been achieved.
(defrule decide-to-fetch-page
"If we know of a page relevant to our target,
note that we should fetch it"
(declare (salience -50))
(target ?query ?function)
(object (is-a page)
(name ?doc)
(type html|unknown)
(did ?*not-fetched*)
(url ?url&:(perl-test ?function ?url)))
=>
(assert (to-fetch ?query 1 ?url ?doc)))
Figure 1 shows the rule that is responsible for deciding to fetch a page. This is a perfectly ordinary CLIPS rule, matching against a fact and an object, and using a user-defined function to test the value of one of the object's slots (url). The only unusual feature is the fact that the actual test on the URL is carried out in Perl.
The simplest way to have done this would have been to match the URL directly against a Perl regular expression defining the pages of interest, using the CAPE function p-match-p, thus:

(url ?url&:(p-match-p ?url ?regexp))
In fact, though, the rule used is somewhat more sophisticated in its use of Perl facilities. At the time when the target for the query is defined, the regular expression is passed to a Perl function which defines a second Perl function to check a URL against the regular expression, and returns the name of that function. Because this function is only used to match against a single (constant) regular expression, Perl can be told to optimise the matching process. The specially-defined function is then called using the CAPE function perl-test, which calls the Perl function named by its first argument, and returns a CLIPS boolean. Other arguments to perl-test are converted to Perl primitives (in this case, the URL is converted to a Perl string) and passed in to the Perl function.
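The idiom is essentially a predicate factory: hand over the regular expression once, get back a dedicated test that reuses a pre-compiled pattern on every call. A Python analogue (returning the function itself rather than its name, as the Perl version does):

```python
import re

# Sketch of the predicate-factory idiom: compile the target regular
# expression once, and return a closure that tests URLs against it.
# This is the analogue of telling Perl to optimise a constant match.
def make_url_test(pattern):
    compiled = re.compile(pattern)      # compiled once, reused per call
    def url_in_target(url):
        return compiled.search(url) is not None
    return url_in_target
```

From CLIPS's point of view the closure's name is all that matters: it is stored in the target fact and invoked via perl-test each time the rule's pattern is checked.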
Figure 2 illustrates a second rule, which recognises a group of documents that are ``siblings''--i.e. that contain links to the same URL which have the same anchor text. The rule matches over facts describing the links that have been found in the various documents, finding three that refer to different document identifiers (?doc-id, ?doc-id2 and ?doc-id3) but are otherwise identical. Facts like these can be generated in CAPE in a single line of code, just by requesting the matches for an appropriate regular expression.
They could, of course, be generated in CLIPS, but it would require considerably more error-prone and tedious programming.
Note also the use of a second Perl function (same_doc) which compares two URLs to determine whether they refer to the same document. This involves more than just checking whether the URLs are identical, since it is also necessary to ignore any anchors within the document, and to disregard any trailing slash.
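A same_doc-style comparison might look like the following Python sketch, which does just the two normalisations the text mentions (the actual Perl implementation is not shown in the paper, so details such as case handling are omitted):

```python
# Sketch of a same_doc-style test: two URLs name the same document if
# they are equal once any fragment anchor and trailing slash are removed.
def same_doc(url_a, url_b):
    """True if two URLs plausibly refer to the same document."""
    def normalise(url):
        url = url.split("#", 1)[0]   # ignore anchors within the document
        return url.rstrip("/")       # disregard any trailing slash
    return normalise(url_a) == normalise(url_b)
```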
(defrule spot-siblings
"Find 'identical' links to the same place from
different sources"
(declare (salience 50))
(link ?doc-id ?source ?anchor ?dest $?rest)
(not
(object (is-a sibling-set)
(anchor ?anchor)
(destination ?dest)))
(link ?doc-id2&~?doc-id
~?source ?anchor ?dest $?rest)
(link ?doc-id3&~?doc-id&~?doc-id2
~?source ?anchor ?dest $?rest)
(object (is-a webitem)
(name ?obj)
(url ?u&:(perl-test same_doc ?dest ?u)))
=>
(bind ?ss (gensym "family-"))
(make-instance ?ss of sibling-set
(intarget TRUE)
(status possible)
(anchor ?anchor)
(destination-obj ?obj)
(destination ?dest)))
CAPE is a new tool which brings together two very different programming systems. It runs under a number of versions of UNIX, including Solaris and Linux. The core functionality of the current version of the system has been stable since the spring of 1998, and, as described above, the system itself has been used for the development of a number of research systems since the following summer.
The system was made available by FTP and announced on the Web in February 1999, and has been downloaded over one hundred and fifty times since then.
The core CAPE environment comprises 3000 lines of C code, plus 600 lines of Perl (and a small amount of CLIPS). The system comes with a thirty-one page manual ([12]) and 1500 lines of CAPE code in two standard CAPE ``component applications'', or ``Capplets'' (support for regression testing CAPE applications and a simple Web server).
The distribution currently also includes three demonstration applications:
The entire application was built from scratch in five days. It contains a total of 2800 lines of CAPE code (about one third Perl), although re-implementation using the webserve component application would reduce this substantially.
One of the main difficulties with CAPE is that CLIPS and Perl have different syntaxes, CLIPS being Lisp-like and Perl being generally (arguably ``vaguely'') C-like. This unfortunately means that fully exploiting CAPE requires reasonable proficiency in both languages, and in working with more than one language at a time. The possibility of producing a single ``unified'' syntax has considerable appeal. However, defining such a thing is not easy, since Perl syntax is already complex and very ``dense''--that is, a high proportion of characters and character combinations are already assigned meanings. Moreover, the appeal of a single syntax must be weighed against the advantages of continuing to support unchanged the syntax of each of the underlying languages: access to the substantial bodies of software, documentation and expertise for those languages.
There are a number of obvious extensions to CAPE, some of which will be added to the system in the near future:
CAPE is a powerful tool for building a new generation of KBSs. It combines the strengths of two well-established tools with very powerful but complementary pattern-matching mechanisms. Perl's ability to search text and match powerful regular expressions is unequaled, while CLIPS provides powerful mechanisms for finding patterns of combinations of symbolic information. The CAPE programmer can exploit the strengths of both, using Perl to analyse documents or query results, and CLIPS to recognise and react to the combinations of matches found.
CAPE provides powerful mechanisms to support a number of key activities:
Together, these features make CAPE a powerful tool for building KBSs that can exploit the opportunities offered by the Internet.
CAPE is freely available, with full source, from
http://www.hcrc.ed.ac.uk/~robert/CAPE
CAPE was developed while the author was supported by a NEDO Visiting Researcher fellowship at the Electrotechnical Laboratory (ETL) in Tsukuba, Japan. Matt Hurst helped greatly with bug hunting in the early stages, and [5] was invaluable.
Electrotechnical Laboratory, Tsukuba, Japan.