An Approach to Integration of Web Information Source Search and Web Information Retrieval
Yuichi Iizuka
NTT Cyber Space Labs. (1), +81 468 59 2771, iizuka@dq.isl.ntt.co.jp
Mitsuaki Tsunakawa
NTT Software Corporation (2), +81 45 212 7525, tunakawa@po.ntts.co.jp
Shin-ichiro Seo
NTT Cyber Space Labs. (1), +81 468 59 3413, seo@dq.isl.ntt.co.jp
Tetsuo Ikeda
NTT Cyber Space Labs. (1)
(1) 1-1 Hikarinooka Yokosuka-shi Kanagawa 239-0847 Japan
(2) 209 Yamashita-cho Naka-ku Yokohama-shi Kanagawa 231-8551 Japan
ABSTRACT
As a result of the explosive spread of the WWW, many information sources are now accessed through open networks. The information on the WWW covers various fields, and the number of sources keeps increasing because more enterprises and individuals are becoming information sources. Users, on the other hand, want to obtain just the right information efficiently from the huge supply available. When we want to retrieve information from the WWW, it is useful to be able to search for the information sources and to retrieve information from these sources by using integrated operations. Various methods for searching for or retrieving information from the WWW, such as mediator methods and URL search engines, have been proposed. However, with those methods it is difficult to search for information sources and retrieve information from them in an integrated way. A mediator type method requires the sources to be specified explicitly. URL search engines (e.g., AltaVista, Lycos, Yahoo) return URLs, i.e., the locations of information sources, as the result, so the user must access each individual source through the listed URL to obtain the desired information.
This paper proposes an integrated method of searching for information sources and retrieving information from these sources. The proposed method adopts a universal relation as the user interface. Its other features are a template mechanism, information resource management, and a new application programming interface (API).
Keywords
WWW, universal relation, heterogeneous information sources, information resource management, mediator
As a result of the popularity of the WWW, many open web sources have been created, and a user can access all of them through any web browser. These sources include not only HTML documents that consist of text data and/or tables but also HTML forms that provide fill-form interfaces for retrieval on various subjects such as cars, PCs, restaurants, and so on. However, retrieval interface standards, for example a specified way of handling CGI, do not exist yet. When we want to retrieve information from the WWW, it is useful to be able to search for the information sources and to retrieve information from these sources by using integrated methods. For example, if the user wants to retrieve information about PC models and prices, it would be useful if one integrated function could achieve two operations: searching for the sources that hold such information, and retrieving the model and price information from those sources.
There are various approaches for searching for or retrieving information from the WWW, such as URL search engines and mediator methods[4]. URL search engines (e.g., AltaVista, Lycos, Yahoo) locate information sources. However, only URL lists are returned as the search result. This forces users to access individual sources and analyze each source to obtain the desired information. On the other hand, mediator type methods[9] are effective in retrieving information from information sources that have already been registered with the system. However, it is difficult for them to achieve an integrated method, i.e., the combination of source search and information retrieval, because the sources must be specified explicitly; they cannot search for information sources.
In this paper, we propose an integrated method that performs source identification and information retrieval by employing a universal relation[10] as the user interface. In the proposed method, the first operation identifies the candidate sources and selects the target sources from these candidates. The second operation achieves item-based information retrieval from the selected sources. Besides adopting a universal relation, the other features of this method are a template mechanism, information resource management, and a new application programming interface (API). The template mechanism extracts data values from HTML pages as if they were RDB tables. Information resource management is used to resolve the heterogeneity of information sources. By using information resource management and the new API, the proposed method returns retrieval candidates and then the retrieval results from the selected candidates.
Section 2 outlines the proposed integrated method. Section 3 explains the template mechanism. Section 4 describes information resource management, and Section 5 covers information retrieval using an information resource dictionary and templates. Section 6 compares the proposed method with related approaches. Lastly, concluding remarks are given in Section 7.
An outline of this method is shown in Figure 1. We briefly explain the two consecutive, integrated operations. When the user query is issued (Fig. 1 a), various types of heterogeneity, such as retrieval interface differences among information sources, are first resolved using an information resource dictionary (IRD), and retrieval candidates are generated and presented to the user (Fig. 1 b). The user chooses target sources from these candidates (Fig. 1 c). Second, information is retrieved from the target sources (Fig. 1 d), and the retrieval results are returned (Fig. 1 e). Retrieval results are a list of item values that satisfy the user's query. Item values are extracted by using templates. The new API described herein removes the necessity for the user to specify the information sources explicitly. The roles and functions of the templates, the IRD, and the API are explained briefly as follows.
[Templates] Templates enable users to treat HTML documents as if they were RDB tables. Users can issue SQL-like queries.
[Information resource management] Based on prior database research[8], we think that the information resource dictionary[6] is suitable for managing metadata. Information resource data such as the location of sources (URL), item names, item value representations, and so on, are managed in the IRD. Using the IRD, heterogeneity can be resolved and retrieval candidates can be dynamically generated. The IRD consists of the following dictionaries: D-1) schema dictionary, D-2) term dictionary, and D-3) domain dictionary.
[API] The new API is based on a universal relation and enables users to construct a query by specifying the retrieval items and the retrieval conditions. The API provides interfaces through which the user receives the candidate sources and selects the sources. The API also returns the list of values that satisfy the user's query.
Retrieval can be made more efficient if information can be retrieved from various sources as if they were relational databases. However, there is no common rule on the structure of HTML documents. We adopted a template mechanism through which one can treat HTML documents as if they were RDB tables. An example of a template is depicted in Figure 2. The template describes multiple data extraction patterns. The part subsequent to "HtmlTemplate" corresponds to one extraction pattern. In the extraction pattern, the parts enclosed in $$ correspond to the locations of the data values that are extracted. The character string enclosed in $$ becomes the name of the extracted data item. In this template, the item names are "url", "title", and "content". The character string ".." allows pattern matching with any character string. If a part that matches a pattern is found, starting with the highest-ranking pattern, the parts corresponding to the locations of the data items are extracted as the values of the items. There are various approaches for extracting information from WWW sources[1, 5]. Up to now, templates have been created manually by system administrators. The semiautomatic generation of templates is possible by detecting repetitions in tag patterns that extend over multiple lines. A template is then made by actions such as removing the patterns that are seldom used.
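The extraction step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation used in the system: the exact template syntax of Figure 2 is not reproduced here, so the pattern strings, the sample HTML, and the helper names `compile_pattern` and `extract_rows` are all assumptions. The sketch only assumes that `$$name$$` marks a data item, that ".." matches any string, and that patterns are tried in ranking order.

```python
import re

# Matches either a $$name$$ item marker or the ".." wildcard (assumed syntax).
_TOKEN = re.compile(r"(\$\$[A-Za-z_]\w*\$\$|\.\.)")
_ITEM = re.compile(r"\$\$([A-Za-z_]\w*)\$\$")


def compile_pattern(pattern):
    """Translate one extraction pattern into a regular expression."""
    parts = []
    for token in _TOKEN.split(pattern):
        if not token:
            continue
        item = _ITEM.fullmatch(token)
        if item:
            # A data item becomes a named, non-greedy capture group.
            parts.append(f"(?P<{item.group(1)}>.*?)")
        elif token == "..":
            # The wildcard matches any character string.
            parts.append(".*?")
        else:
            # Everything else is literal HTML text.
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.DOTALL)


def extract_rows(patterns, html):
    """Try patterns in ranking order; return rows from the first that matches."""
    for pattern in patterns:
        rows = [m.groupdict() for m in compile_pattern(pattern).finditer(html)]
        if rows:
            return rows
    return []


# Hypothetical patterns and page, for illustration only.
PATTERNS = [
    "<li>$$title$$</li>",
    '<a href="$$url$$">$$title$$</a>..<p>$$content$$</p>',
]
SAMPLE_HTML = (
    '<a href="http://example.com/a">Car A</a><br><p>Red sedan</p>\n'
    '<a href="http://example.com/b">Car B</a><br><p>Blue coupe</p>'
)

rows = extract_rows(PATTERNS, SAMPLE_HTML)
for row in rows:
    print(row["url"], row["title"], row["content"])
```

Each returned row is a dictionary keyed by item name, which is what lets the extracted values be handled like the rows of a relational table.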
In the proposed method, the IRD manages information resources to resolve heterogeneity among target sources. This section discusses the heterogeneity among sources and describes the IRD.
There are three main kinds of differences.
(a) Retrieval interface differences Most retrieval interfaces use either keyword retrieval, which retrieves the information related to the keywords set as CGI parameters, or category retrieval, which traces categorized and hyper-linked information. Two search engines for the same content may use different names and numbers of specifiable CGI parameters. For example, sources holding car information may accept just a single parameter such as "maker", "model", or "price", multiple parameters in combination, or more complex retrieval conditions.
(b) Naming heterogeneity Often the meaning of an item is not easily understood from its name. Information sources are usually developed individually. A name in one system does not necessarily have the same meaning in another system, and conversely the same item may carry different names. It is necessary to consider the item name in the query, the item name allowed by each source, and the returned item name. For example, among car information search pages that let users specify the "maker", various input parameters, i.e., items such as "make" and "car_maker", exist.
(c) Representation heterogeneity There are also differences in the value formats actually shown, because various expressions can exist for one object. We expect differences in the units used, the presence of encoding, the description rules, etc. Moreover, a hyperlink to the data substance can be considered a form of data. For example, various forms such as "10,000 yen", "¥10,000", and "$83.30" may be used to show an amount of money.
The IRD consists of three dictionaries. Fill-form interface differences are resolved by the schema dictionary, naming heterogeneity by the term dictionary, and representation heterogeneity by the domain dictionary.
D-1) Schema dictionary The specifiable CGI parameters and extractable data items are centrally managed as schema information in the schema dictionary. Concretely, it manages source names, access information, extractable data item names (which correspond to the item names in the templates) and their data types, the conditions that must be specified to retrieve from a web source, and the capability of each web source to process retrieval items/conditions. Access information includes URLs, CGI parameter names, the proxy host name, the proxy port number, and the template file name. The specifiable CGI parameters and the data items that can be retrieved from the HTML documents are individually managed as data items. This dictionary is used to identify the candidate web sources. The differences in the retrieval-form interface are resolved by managing the information structure of each data item. Using the capability specified for the retrieval items/conditions of the source, the system judges whether the retrieval condition is to be processed on the web source side or on the system side. Moreover, relationships between items are also managed in this dictionary. Responding to the user query may require joining the results from multiple web sources; this is realized by using the relationship information.
D-2) Term dictionary In order to make it easier for the user to retrieve information, we define three kinds of names. We think that there are three levels in the information retrieval system: the information source, the administrator, and the user. Each level uses a different kind of name. The information sources use the "definition name" of an item. The "definition name" is only used when accessing the information sources. The second level is the "column name"; universal relation attributes are constructed by the administrator using "column names". Each is linked to a "definition name". The third level is the "retrieval name"; the user uses this when specifying the information to be retrieved. To resolve naming heterogeneity, synonyms of retrieval names are managed in this dictionary.
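The three name levels can be pictured with a small lookup sketch in Python. The dictionary layout, the source names `source_a` and `source_b`, the synonym "manufacturer", and both helper functions are assumptions made for illustration; the paper does not specify how the term dictionary is stored.

```python
# Hypothetical term dictionary contents; every name below is illustrative.
# Column names are the universal-relation attributes set by the administrator.
SYNONYMS = {
    # column name -> retrieval names (synonyms) a user may type
    "maker": {"maker", "manufacturer"},
    "price": {"price", "cost"},
}

DEFINITION_NAMES = {
    # (source, column name) -> definition name used only when accessing the source
    ("source_a", "maker"): "make",
    ("source_b", "maker"): "car_maker",
    ("source_a", "price"): "price",
}


def to_column_name(retrieval_name):
    """Map a user's retrieval name to the administrator's column name."""
    for column, synonyms in SYNONYMS.items():
        if retrieval_name.lower() in synonyms:
            return column
    raise KeyError(f"unknown retrieval name: {retrieval_name}")


def to_definition_name(source, retrieval_name):
    """Map a retrieval name to the definition name a given source expects."""
    return DEFINITION_NAMES[(source, to_column_name(retrieval_name))]


print(to_definition_name("source_a", "manufacturer"))
print(to_definition_name("source_b", "maker"))
```

Under this layout a user never sees "make" or "car_maker"; the same retrieval name resolves to whichever parameter name each source expects.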
D-3) Domain dictionary To resolve representation heterogeneity, we introduce the notion of a domain, which specifies a representation of data. The domain in a web source is a "local domain"; the domain used by a user is a "user domain"; a group of domains having the same content is a "domain group"; and the domain used as the standard in a domain group is the "global domain". A value in one domain can be converted to a value in another domain, by way of the global domain, only if both domains belong to the same domain group. The domain dictionary manages domain groups, domains, global domains, user domains, and domain conversion functions. Representation heterogeneity is resolved by centrally managing the representation of the data items in queries and of the data items in the web sources. The user can create a query without considering the representation of the data.
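Conversion through a global domain can be sketched as below. This is an illustrative Python sketch under stated assumptions: the domain names, the choice of a plain yen amount as the global domain of the "money" group, and the fixed exchange rate `YEN_PER_USD` are all hypothetical and are not taken from the system.

```python
from dataclasses import dataclass
from typing import Callable

YEN_PER_USD = 120.0  # assumed fixed rate, for illustration only


@dataclass(frozen=True)
class Domain:
    group: str                           # domain group the domain belongs to
    to_global: Callable[[str], float]    # local/user value -> global value
    from_global: Callable[[float], str]  # global value -> local/user value


# Hypothetical domain dictionary; the global domain of "money" is yen as a number.
DOMAINS = {
    "yen_text": Domain(
        "money",
        lambda s: float(s.replace(",", "").replace(" yen", "")),
        lambda v: f"{v:,.0f} yen",
    ),
    "yen_symbol": Domain(
        "money",
        lambda s: float(s.lstrip("¥").replace(",", "")),
        lambda v: f"¥{v:,.0f}",
    ),
    "usd": Domain(
        "money",
        lambda s: float(s.lstrip("$").replace(",", "")) * YEN_PER_USD,
        lambda v: f"${v / YEN_PER_USD:,.2f}",
    ),
    "km": Domain("distance", float, lambda v: f"{v:g}"),
}


def convert(value, source, target):
    """Convert a value between two domains by way of the global domain."""
    src, dst = DOMAINS[source], DOMAINS[target]
    if src.group != dst.group:
        # Conversion is only defined inside one domain group.
        raise ValueError(f"{source} and {target} are in different domain groups")
    return dst.from_global(src.to_global(value))


print(convert("10,000 yen", "yen_text", "yen_symbol"))
print(convert("$100.00", "usd", "yen_text"))
```

Routing every conversion through the global domain means each domain needs only two functions, rather than one function per pair of domains.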
One benefit of this method is that it offers an original API which simplifies the development of application programs. In the WWW, web sources are frequently added and/or changed. Therefore, it is difficult to construct and maintain application programs that use a language, such as SQL, that specifies the location and the structure of the data. The original API is based on a universal relation[6], so there is no need to specify the information sources. Application development is independent of the system design, and we can maintain and operate any such application without concern for the number or structure of the web sources. The typical API is shown in Table 1.
The user's query is constructed by specifying items, which designate the retrieval items and the retrieval conditions. For example, the user issues the following query:
Select car_name, miles, doors, price Where price < "$10000".
As mentioned in Section 4.2, an item name in the query is assumed to be a retrieval name. In this example, the retrieval names are "car_name", "miles", "doors", and "price".
An example of candidate sources for the above query is shown in Figure 3. After the user query is issued, the retrieval names are compared with the synonyms of each item name in the term dictionary. The synonyms corresponding to the retrieval names are then searched for in the schema dictionary. Column names are compared with the synonyms, and sources that include all items are returned as the candidate sources. In Figure 3, the column name "Vehicle" of candidate 1 matches a synonym of the retrieval name "car_name". Candidate sources created from multiple sources are displayed together with their join relationship. Figure 3 shows that two sources are joined by "Vehicle" in candidate 1.
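Candidate generation of this kind might look like the following Python sketch. The dictionary contents, the source names, and the restriction to single sources and joined pairs are assumptions chosen to keep the example short; Figure 3 and the actual dictionary layout are not reproduced here.

```python
from itertools import combinations

# Hypothetical term dictionary: column name -> retrieval-name synonyms.
SYNONYMS = {
    "Vehicle": {"car_name", "vehicle"},
    "Miles": {"miles", "mileage"},
    "Doors": {"doors"},
    "Price": {"price"},
}

# Hypothetical schema dictionary: source -> extractable column names.
SOURCES = {
    "used_cars": {"Vehicle", "Miles", "Price"},
    "car_specs": {"Vehicle", "Doors"},
    "all_in_one": {"Vehicle", "Miles", "Doors", "Price"},
    "restaurants": {"Name", "Price"},
}

# Hypothetical relationship information: pair of sources -> join column.
JOINS = {frozenset({"used_cars", "car_specs"}): "Vehicle"}


def columns_for(retrieval_names):
    """Resolve each retrieval name to a column name via its synonyms."""
    columns = set()
    for name in retrieval_names:
        matches = [c for c, syns in SYNONYMS.items() if name in syns]
        if not matches:
            raise KeyError(f"no column matches retrieval name: {name}")
        columns.add(matches[0])
    return columns


def candidate_sources(retrieval_names):
    """Return (sources, join column) pairs that cover every requested item."""
    needed = columns_for(retrieval_names)
    candidates = []
    # Single sources that hold all requested items.
    for name, columns in SOURCES.items():
        if needed <= columns:
            candidates.append(((name,), None))
    # Pairs of sources that cover the items only when joined.
    for a, b in combinations(SOURCES, 2):
        join = JOINS.get(frozenset({a, b}))
        covered_alone = needed <= SOURCES[a] or needed <= SOURCES[b]
        if join and not covered_alone and needed <= SOURCES[a] | SOURCES[b]:
            candidates.append(((a, b), join))
    return candidates


for sources, join in candidate_sources(["car_name", "miles", "doors", "price"]):
    print(sources, "joined by", join)
```

The user would then pick target sources from the returned list, which is the selection step the API exposes.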
The information desired by the user is extracted from the sources chosen by the user. First, it is necessary to judge whether the retrieval condition is to be processed on the web source side or on the system side. This is decided by using the capability specified in the schema dictionary. If the retrieval condition can be processed on the source side, the item values are simply extracted from the returned HTML document by using the template. If the system has to process the condition, not only the values of the retrieval items but also the values of the items in the retrieval condition are extracted, and filtering is then executed on the system side.
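The decision between source-side and system-side processing can be sketched as follows in Python. The plan structure, the function names, and the representation of a condition as an `(item, operator, value)` tuple are assumptions for illustration; the capability information is reduced here to a set of items the source can filter on.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


def plan(retrieval_items, condition, source_side_items):
    """Decide where the condition runs and which items must be extracted.

    source_side_items stands in for the capability recorded in the schema
    dictionary: the items the web source itself can filter on.
    """
    item = condition[0]
    if item in source_side_items:
        # The source filters; only the requested items need extracting.
        return {
            "extract": list(retrieval_items),
            "send_to_source": condition,
            "filter_locally": None,
        }
    # The system filters, so the condition item must be extracted as well.
    extract = list(retrieval_items)
    if item not in extract:
        extract.append(item)
    return {"extract": extract, "send_to_source": None, "filter_locally": condition}


def filter_rows(rows, condition):
    """Apply a condition on the system side to rows extracted by a template."""
    item, op, value = condition
    return [row for row in rows if OPS[op](row[item], value)]


# Hypothetical query and extracted rows.
query_plan = plan(["car_name", "miles"], ("price", "<", 10000), {"maker"})
rows = [
    {"car_name": "Car A", "miles": 30000, "price": 8000},
    {"car_name": "Car B", "miles": 12000, "price": 15000},
]
kept = filter_rows(rows, query_plan["filter_locally"])
print(query_plan["extract"], kept)
```

In this example the source cannot filter on price, so price is added to the items to extract and the comparison runs after extraction.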
If retrieval is to be executed, representation heterogeneity needs to be resolved. This is achieved by using user domain and the local domain of each data item in the form dictionary. When the user domain and local domain are different, a conversion function is used to convert user domain to global domain, and global domain to local domain. The retrieval items/conditions are converted from user domain to local domain, and the retrieval results are converted from local domain to user domain. The data representation heterogeneity is resolved by this process. The user can specify the retrieval condition in the user domain, and obtain the retrieval results in the user domain.
Our goal is the same as that of TSIMMIS[2] and Information Manifold[7]. We compare these methods to our method. TSIMMIS uses the Mediator Specification Language (MSL) to describe the relationships between the schemas of information sources and user views (schemas for users in TSIMMIS)[3]. Information Manifold describes such mapping rules by using queries (which include a declaration of relations and comparison descriptions) against world views (the same meaning as user views in TSIMMIS). In both methods, a query against user views (or world views) and these mapping rules are interpreted together in the rule induction engine, and the final query plan is generated. It is basically assumed that only one final query plan is produced through rule interpretation. Several candidate query plans are evaluated by the optimizer, and only the best candidate is output as the final query plan. In our proposed method, however, several retrieval candidates are returned to the application programs or users, and they have to interpret these candidates and decide how to process them. This has the following advantage. Consider an information retrieval application for information sources that have similar data structures. For example, suppose there are three information sources on the subjects of books, videos, and compact discs, respectively. All sources have two data items, "price" and "title". To construct an information retrieval system using the rule based approach, a super class of the three information sources must be added to the rule base, and the user must issue queries against that super class. If a new information source is to be added, the rule base must be changed in the same manner. Our proposed method allows synonyms to be set for the items "price" and "title". This allows three retrieval candidates to be obtained automatically by addressing one query to the system. The addition of an information source does not require us to change the rules.
This shows that our method is more effective than rule based approaches when new similar information sources are frequently added.
Next, we compare our method to URL search engines such as AltaVista, Lycos, and Yahoo. The input of a URL search engine is a set of keywords, and the output is a list of URLs. Our method, on the other hand, can extract data from HTML documents given retrieval items and retrieval conditions. Both methods can obtain information from Web servers, but our system is more general, because it can use URL search engines as local information sources. In that case, we use "URL" as a retrieval item and "keyword = 'keyword given by user'" as a retrieval condition in the query against our system. Retrieval flexibility, such as semantic analysis of input keywords and full-text search of HTML documents, is a significant benefit of URL search engines; our system lacks such semantic analysis. As explained above, however, we can use URL search engines as information sources. This means that our method can make good use of URL search engines, and we can combine their results with other Web site information to obtain more complex information.
This paper proposed an integrated information retrieval method for the WWW. This method bases the user interface on a universal relation. Given the user's query, the method returns a preliminary list of candidate sources from which the user selects target sources. The information desired is extracted from the target sources. Besides adopting a universal relation, the method features a template mechanism, information resource management, and a new API.
The proposed method resolves heterogeneity among sources and generates and presents retrieval candidates based on the user's query. Even when there are many candidates, the user can choose the desired sources as retrieval targets. Thus, this method can treat independently controlled sources covering various subjects such as cars, PCs, restaurants, and so on. This method returns lists of item values within a uniform user domain as the retrieval results. The user therefore need not analyze HTML documents to obtain the desired information, and retrieval becomes easier.
Our future research plans are as follows.