A Formal View to Classification and Retrieval Mechanism for Reusable Objects

Giancarlo Succi, Carl Uhrik and Tullio Vernazza^*

DISA - Università di Trento
^* DIST - Università di Genova

The essence of the problem is to deposit software objects into the repository according to descriptions of the objects expressed in a formal language concise enough to serve as a subsequent indexing scheme. Then at some future time, a potential reuser of the object may describe his concept of a desired object in the same indexing scheme language.

The indexing scheme should respect each of the two processes: the object insertion process, which is called classification, and the search query process, which is called retrieval. One or both processes generally involve human interpretations of intended informal meanings. Consequently, caution needs to be exercised to ensure that the formal language used to describe objects at the time of their insertion in the repository sufficiently matches the formal description that will be issued as a query by a potential reuser. This aspect of the reuse requires a cross-cultural fertilization of knowledge between indexers and eventual reusers which establishes a mutually beneficial understanding of the behavior patterns of the human elements in both processes.

Despite the inevitable informal aspect, the formal language used on each side of the classification-retrieval process can help channel the insertions and requests towards convergence when indeed there is a conceptual match. Such language engineering is in fact nothing new. Frakes and Gandel [8] give a survey of methods for representing software components for reuse which include traditional library science methods, knowledge based methods and hypertextual systems. Notably, completely formal approaches (predicate calculus, Horn clauses, or a declarative machine language) to represent search queries or classifications of inserted objects as formal specifications [4,5,6,10] and automatic programming using generative methods [2,3,7,9] have been excluded from that survey since these technics completely side step the classification/retrieval language compatibility issues and put other problems in their place. Although Artificial Intelligence and HyperText systems were included in the survey of Frakes and Gandel [8], these systems work on a fine-grained level of detail requiring massive investments of effort to establish the complete connectivity of the elements in the cases of complex software objects/systems. A commonly accepted alternative to formal approaches with a large-grained level of descriptive representation for software objects is to use the widely known classification and indexing schemes originally developed in the context of library and information sciences.

A taxonomy of the common classification methodologies

In traditional library and information science, indexing, or classification, is the process of creating a representation which is essentially a description of an object.

Many indexing languages have been developed for library and information retrieval applications; they define an item's physical location in a library and summarize what the item is about by describing its subject or content. What an information item considered to be 'about' may vary with the search situation which is part of the need for carefully engineering the classification and retrieval language so that it will indeed be useful. Information may be attached to an object manually, from user input, or automatically, i.e., extracted from information such as text found in the object itself (presuming the object contains text itself) or documentation accompanying the object (such as a microfilm or an executable binary program).

One can characterize such languages as either ``languages with controlled vocabulary'' or ``languages with uncontrolled vocabulary''. For controlled vocabularies, the essential idea is that by restricting indexers and searchers to a common set of terms, one a-priori encourages searches to match appropriate objects. Thus, the ``languages with controlled vocabulary'' were developed to help ensure that terms used by indexers and searchers would be the same. They can be separated into classed systems and keyword systems.

Classed Systems

In enumerated classification, a single subject matter class label must be assigned to an object. The system assumes that classes must be mutually exclusive, and classes cannot be combined to form new classes ex post facto (after the system has been devised) though classes may be later subdivided. This means that classes must have at least a partial hierarchical taxonomy. A simple example is the Dewey decimal system which has named classes assigned to decimal number where the highest level on classification is associated with the integer part and progressively smaller decimal fractions correspond to subdivisions of subject matter topics into subtopics.

A disadvantage of enumerated methodology is that all possible class labels describing a domain are listed in an initial ``classification scheme''. Hence, the changes in such a scheme cannot be made without totally restructuring the system. In software reuse, i.e. applications where domains and terminology are constantly changing, this is a serious limitation.

In faceted classification, one or more descriptive properties or attributes of an object may be assigned a value from a pre-agreed structured set of basic terms associated with each such attribute. These attributes are limited in number and are called facets. Each facet conveys information about a particular descriptive aspect or property of a class or set of objects. The facets collectively may be thought of as spanning an attribute vector space which will be useful in describing any particular object or in describing groups (i.e., classes or sets) of objects according to logical conditions expressed over the facets: e.g., $[Function = PopQueue] \wedge [ArgumentCount \gt 3 ]$ .

Unclassed or Keyword Based Systems

Keyword based approaches focus on attaching one or more keywords (acceptable natural-language words and phrases) to each object in order to describe its properties. Terms are typically arranged alphabetically.

Much like classed systems, two possible approaches exist, corresponding to whether the keywords regard subject matter or an open variety of properties. Indexing languages as subject-heading systems does not allow the synthesis of basic terms to express composite concepts because all composite terms are created before the system is used by indexers. However, it does not preclude multiple keywords in a search that describes different views on the subject matter.

Descriptor systems use keywords designed to allow searchers to synthesize terms and composite terms using Boolean operations. In this way, some of the syntactic finesse of faceted approaches is given to a searcher though not to an indexer.

Keywords systems are easier to create and to modify than classed systems. In fact they allow the addition of terms to cover new concepts without affecting existing terms.

Free Text Retrieval

In an uncontrolled vocabulary, no restriction is placed on which terms can be used to describe an item. Some subcategories exist in the methodology, but all searches involve the search for textual substrings within a textual description. The source of the text may be extracted from a separate description - e.g., an abstract or brief textual characterization associated with the item to be retrieved, or the text may be derived directly from the object in the repository/library itself. In the later case, the exact syntax or some reflection of it (e.g., maintaining word orders of the main terms in a phrase - also somewhat erroneously called ``keyword in context'').

Formalizing the Methodologies

For each mechanism, first the classification template is presented, then the query is formalized and the definition of the result is stated; finally some examples are presented.

Keywords

Moreover, the objects that we are interested in classifying and retrieving based on keywords are collected into a set called the Repository: ${\sl Repository\/} = \{ O_{i} \}$ .

For retrieval, a simple query is defined again (like a descriptor) as a non-empty set of keywords: $Q^{k}=\{ qk_{i} \}\subset K$ Usually, the keywords in a descriptor are considered as constituting elements of a disjunctive assertion (connected with an or operator), while the keywords in a query are considered as forming a conjunctive condition (a test constructed with and operators). I.e., $D^{k}(O)\equiv \vee dk_{i}$ and $Q^{k}(O)\equiv \wedge qk_{i}$ .

Subsequently, the result of a query against the Repository is $R^{k}(Q^{k})=\{O:O \in {\sl Repository\/}, \forall qk_{i} \in Q^{k}, qk_{i} \in D^{k}(O)\}$ .

This means that the result is the set of all objects with descriptors containing all the keywords of the query.

For example, if we were dealing with microcomputers, we might have keywords associated with various aspects of the CPU, ROM memory, RAM memory, disk, type of display and number of serial I/O ports. A descriptions of typical system might look like: $\{ 1Mb\_Ram, Color\_Display, SCSI\_disk, 68000\_CPU \}$ .

Note that not all relevant keywords have necessarily been specified. This is not a requirement of the methodology. A query might look like: $\{ 500k\_Ram, Color\_Display \}$ .

This would certainly include the above descriptor as a result if it were recorded in the repository, but it could also produce: $\{ 1Mb\_Ram, Color\_Display, IDE\_disk, 8080\_CPU \}$ . as an additional result.

It is also possible to have more complex queries that contain or operators: $Q^{ok}= Q^{k}_{1} \vee Q^{k}_{2} \vee \cdot \cdot \cdot \vee Q^{k}_{m}$ . where each subquery component Q^k_i is precisely as before, a list of keywords: $Q^{k}_{i} = \{ qk_{i,1}, qk_{i,2}, \ldots , qk_{i,n_i} \}$

The result of such an or-query is the union of all the results obtained from each group of keywords in the or-query. Formally, according to the previously defined conventions, an or-query is a sequence of expressions composed of keywords partly connected with and and partly with ors: $Q^{ok}= (qk_{1,1} \wedge qk_{1,2} \wedge \cdots qk_{1,n_1}) \vee (qk_{2,1} \... ...2,n_2}) \vee \cdots \vee (qk_{m,1} \wedge qk_{m,2} \wedge \cdots qk_{m,n_m})$ . and the result of the query can be written as: $R(Q^{ok})= R(Q^{k}_{1}) \cup R(Q^{k}_{2}) \cup \ldots R(Q^{k}_{m})$ .

Returning to the earlier microcomputer repository, we see that an or-query might be expressed as: $\{ 1Mb\_Ram, Color\_Display\} OR \{SCSI\_disk, 68000\_CPU \}$ . which could return as a result:

$\{ 1Mb\_Ram, Color\_Display, SCSI\_disk, 68000\_CPU \}$ , $\{ 2Mb\_Ram, B\&W\_Display, SCSI\_disk, 68000\_CPU \}$ , $\{ 1Mb\_Ram, Color\_Display, 8000\_CPU , Dinosaur \}$ , $\{ 1Mb\_Ram, Color\_Display, IDE\_disk, 8008\_CPU \}$ .

Weighted Keywords

Weights can be considered as either properties of keywords or relevance to the larger context of the description or query, depending on the object described; we assume the latter case, since it is more general. Although weighted keywords per se are not one of the basic methodologies, they correspond to one form of the thesaurus which is a standard fixture of the keyword methodology. Moreover, weights can also be assigned to each keyword in a query or a description of an object.

The descriptor of an object can be synthesized as follows: $D^{wk}(O)=\{(dk_{i},dwt_{i}) \}$ . that is, the descriptor is a set of couples formed by a keyword and a descriptor weight of such keyword. The query is again a non empty set of keywords, now a query weight is associated with each keyword: $Q^{wk}=\{ (qk_{j},qwt_{i}) \}$ .

The result of a query is a function of the way a weight contributes towards scoring the match of an individual query keyword against a descriptor keyword in an object (a matching function $\phi^{wk}$ ) and of how such results are combined to determine the overall result for all the keywords in a query considered against all the keywords in a particular object (a cumulator function $\Gamma^{wk}$ ); formally: $R^{wk}(Q^{wk},\Gamma^{wk},\phi^{wk})= \{O:O \in {\sl Repository\/}, not ( \exists O^{x} \in {\sl Repository\/} : v^{wk}(O^{x}) {\rm \gt} v^{wk}(O) ) \}$ where the function v^wk expresses distance or similarity between query and repository objects and is defined as:

v^wk(O) =	$\Gamma^{wk}(\{\phi^{wk}(qwt_{i},matchwt(qk_{i},D^{wk}(O)))\})$ ,
	where matchwt(qk_i,D^wk(O)) =	dwt_i if $qk_{i} \in D^{wk}(O) ;$
		NULL otherwise .

Nearly always this equation is presented in its simplified form, where $\Gamma^{wk}$ is $\Sigma$ (summation) and $\phi^{wk}$ is $\Pi$ (multiplication of weights); in such cases, the null value is 0. Note also that this kind of search does not require any semantic inference on the keywords, it is just an optimization problem.

Free Text

A possible simple strategy to assign relevance values is to delete the conjunctions, prepositions, articles, pronouns, and other low importance words, and then tabulate the relative occurrence frequency of the remaining words. For instance, taking the IEEE definition of a process: a sequence of steps performed for a given purpose, initially the descriptor is: $D^{ft}_{inf}({\it Process})=[{\it a,sequence,of,steps,performed,for,a,given,purpose}]$ . then the relevance is determined by eliminating the low importance words a, of, for and a, and then weighting each remaining word in terms of its relative number of occurrences. Here no object appears twice, so all are equally weighted and the final result is:
$D^{ft}({\it Process})=\{(sequence,0.2), (steps,0.2), (performed,0.2), (given,0.2), (result,0.2)\}$ .

More analysis can be applied to the words in order to eliminate further irrelevant terms (such as given) and possible synonyms. However such details are not analyzed here since the focus is on the general taxonomy rather than on details.

A query is a set of words, quite like the case of keywords: $Q^{ft}=\{ qw_{j} \}$ . The result is the set of objects which maximizes a correlation value expressed in terms of relevances of the descriptors and a function, $\Gamma^{ft}$ , composing the relevances of each term in the query relative to the corresponding terms in the descriptor of an object from the Repository: $R^{ft}(Q^{ft},\Gamma^{ft})= \{O:O \in {\sl Repository\/}, not ( \exists O^{x} \in {\sl Repository\/} : v^{ft}(O^{x}) {\rm \gt} v^{ft}(O) ) \}$ where the function v^ft is defined as:

v^ft(O) =	$\Gamma^{ft}( \{matchwt(qw_{j},D^{ft}(O))\})$ ,
	where matchwt(qw_j,D^ft(O)) =	r_{dw_<<211>>i} if $\exists dw_{i} , qw_{j} = dw_{i} \wedge (dw_{i},r_{dw_{i}}) \in D^{ft}(O)$ ,
		0 otherwise .

For instance, again the function $\Gamma^{ft}$ can be $\Sigma$ and other interpretations of the matching function may be possible.

This method can be further refined adding weights to each word in the query. In this case, the query would be: $Q^{wft}=\{ (qw_{i},r_{qw_{i}}) \}$ . Therefore the result of such a query can be modeled in terms of an integrating function $\Gamma^{wft}$ and a combining function $\phi^{wft}$ : $R^{wft}(Q^{wft},\Gamma^{wft},\phi^{wft})= \{O:O \in {\sl Repository\/}, not ... ...sts O^{x} \in {\sl Repository\/} : v^{wft}(O^{x}) {\rm \gt} v^{wft}(O) ) \}$ where the function v^wft is defined as: $v^{wft}(O) = \Gamma^{wft}(\{\phi^{wft}(r_{qw_{i}},r(qw_{i},D^{ft}(O)))\})$ .

However the focus of this paper is more on the general taxonomy rather than on the details of how the different methods can be combined in all the possible ways; so we shall not dwell on such details much further.

Simple Faceted

More formally, the attribute space is depicted as: $AS = \{v^{k}_{1}\} \times \{v^{k}_{2}\} \times \cdots \times \{v^{k}_{nf}\}$ , where $\{v^{k}_{i}\} = domain(Facet_{i})$ , and so the descriptor of an object can be written as a vector: $D^{sf}(O)= (dv^{j}_{1},dv^{j}_{2},\ldots,dv^{j}_{nf}) \in AS$ , such that $dv^{j}_{i} \in \{v^{k}_{i}\}$ , and the query correspondingly is a vector as well: $Q^{sf}= (qv_{1},qv_{2},\ldots,qv_{nf}) \in AS$ , such that $qv_{i} \in \{v^{k}_{i}\}$ .

The result of the query can be formally expressed as the set of the objects that are closest to the desired object, closest in terms of the values of the faces of their descriptors and in terms of two functions. One function, $\phi^{sf}$ , determines the proximity of the value of each single face of all the objects in the repository to the corresponding facet value of the desired query object. The other function, $\Gamma^{sf}$ ,integrates the proximities of the different faces to determine the global proximity of each object in the repository to the desired object. $R^{sf}(Q^{sf},\Gamma^{sf},\phi^{sf})= \{O:O \in {\sl Repository\/}, not ( \e... ...n {\sl Repository\/} : p^{sf}(O^{x},Q^{sf}) {\rm \gt} p^{sf}(O,Q^{sf}) ) \}$ . where: $p^{sf}(O,Q^{sf}) = \Gamma^{sf}(\{\phi^{sf}(qv_{i},dv^{O}_{i})\})$ .

In the forest of values for each facet, a value or possibly two values can be associated to each arc specifying the proximity of its two nodes. Allowing two values permits modeling the unusual (but sometimes useful) case in which the upward value (i.e., from child to parent) is not equal to the downward value (i.e. from parent to child). Such values are used by $\phi^{sf}$ to determine the proximity of the nodes corresponding to its two arguments - the facet value terms from a query and a repository object respectively. If they are neighbors, then the solution is straightforward since $\phi^{sf}$ takes the value of the connecting arc; otherwise, $\phi^{sf}$ searches for the shortest path between the two nodes, taking into account the general-to-specific taxonomy, i.e., computing the distance between the query-term node and the first ancestor common to both the query-term node and the repository-object-term node.

The function $\Gamma^{sf}$ is used to integrate all the values coming from the different faces to determine the global proximity of two objects - the hypothetical query object and any repository object.

Faceted with Semantic Connections

D^fsc(O)	=	$(dv^{j}_{1},dv^{j}_{2}, \ldots , dv^{j}_{nf})$
Q^fsc	=	$(qv_{1},qv_{2}, \ldots , qv_{nf})$
$R^{fsc}(Q^{sf},\Gamma^{fsc},\phi^{fsc})$	=	$\{O:O \in {\sl Repository\/},$
		$not ( \exists O^{x} \in {\sl Repository\/} : p^{fsc}(O^{x},Q^{sf}) {\rm \gt} p^{fsc}(O,Q^{fsc})) \}$
p^fsc(O,Q^fsc)	=	$\Gamma^{fsc}(\phi^{fsc}(qv_{i},dv^{O}_{i}))$

Conclusions

To some extent, this can not be solved solely with a formalism. There needs to be agreement on the part of indexers and searchers about the ``architectural style'' used by a particular community of software producers. Introducing indexers unfamiliar with the style of structuring programs used in that community will virtually guarantee lack of reusability based on the classification process. For example, consider for example the two cases depicted in below: in one case the assumption is that a piece will be monolithic and in the other, the searcher structures the search with the hope of finding subcomponents satisfying separately formulated requirements. In the first case, there may be parts of his total requirements satisfying his requirements partially and their union may appear to satisfy the total requirements superficially, but the individual searcher must have a strategy that is foreseen by the indexer - otherwise, he never states his query in a way that is compatible with what has been inserted in the repository.

In considering the special needs for extensions to the methodologies to support software reuse, a strategy based upon a hybridization of the simplest methodologies was proposed rather than using or developing more complex methodologies.

The authors conclude that some extensions oriented towards the inevitable informal human side of the process may be necessary based. This is the reason a domain analysis of the environment in which such methodologies must operate is tightly coupled to the design of the termspace for keywords and facets.

References

[1] Barstow, D., An Experiment in Knowledge-based Automatic Programming, Artificial Intelligence, Vol. 12, pp. 73-119, North-Holland, Amsterdam, 1979.

[2] Komiya, S., Automatic Programming by Composing Program Components and Its Realization Method, Future Generation Computer Systems, Vol. 5, North-Holland, 1989.

[3] Rockmore, J., Knowledge-based Software Turns Specifications into Efficient Programs, Electronic Design, July, 1985.

[4] Cheng, Betty and Jeng Jun-Jang, Formal Methods Applied to Reuse, Proceedings of the WISR '92 5th Annual Workshop on Software Reuse, Latour, L., Philbrick, S. and Stevens, M., eds., Palo Alto, CA, October 20-29, 1992.

[5] Bech, Brian, Declarative Programming for Component Interconnection, Proceedings of the WISR '92 5th Annual Workshop on Software Reuse, Latour, L., Philbrick, S. and Stevens, M., eds., Palo Alto, CA, October 20-29, 1992.

[7] Dershowitz, Nachum, Program Abstraction and Instantiation, ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, July 1985.

[8] Frakes, Wm. B., and Gandel, P. B., Representing Reusable Software, Information and Software Technology, Vol. 32, No. 10, pages 653-664, December 1990.

[10] Steigerwald, Robert, Reusable Component Retrieval with Formal Specifications, Proceedings of the WISR '92 5th Annual Workshop on Software Reuse, Latour, L., Philbrick, S. and Stevens, M., eds., Palo Alto, CA, October 20-29, 1992.

[11] Yu, Guohui, Automatic Retrieval of Formally Specified Real-Time Software Components, Proceedings of the WISR '92 5th Annual Workshop on Software Reuse, Latour, L., Philbrick, S. and Stevens, M., eds., Palo Alto, CA, October 20-29, 1992.

[12] Brachman, R.J., What's in a Concept: Structural foundations for Semantic Networks, Associative Networks: The representation and use of knowledge by computers, N.V. Findler ed., Academic Press, N.Y., pp. 3-50, 1979.

[13] Hendrix, G.G., Expanding the Utility of Semantic Networks through Partitioning, Artificial Intelligence, Vol. 7, pp. 21-49, 1976.

About this document ...

This document was generated using the LaTeX2HTML translator Version 97.1 (release) (July 13th, 1997)