An Approach to Integration of Web Information Source Search and Web Information Retrieval

Yuichi Iizuka
NTT Cyber Space Labs.¹
+81 468 59 2771
iizuka@dq.isl.ntt.co.jp

Mitsuaki Tsunakawa
NTT Software corporation²
+81 45 212 7525
tunakawa@po.ntts.co.jp

Shin-ichiro Seo
NTT Cyber Space Labs.¹
+81 468 59 3413
seo@dq.isl.ntt.co.jp

Tetsuo Ikeda
NTT Cyber Space Labs.¹
+81 468 59 3407
ikeda@dq.isl.ntt.co.jp

¹ 1-1 Hikarinooka Yokosuka-shi Kanagawa 239-0847 Japan
² 209 Yamashita-cho Naka-ku Yokohama-shi Kanagawa 231-8551 Japan

ABSTRACT

As a result of the explosive spread of the WWW, many information sources are now accessible through open networks. The information on the WWW covers various fields, and the number of sources keeps increasing as more enterprises and individuals become information sources. Users, on the other hand, want to obtain just the right information efficiently from the huge supply available. When retrieving information from the WWW, it is useful to be able to search for information sources and to retrieve information from those sources through integrated operations.
Various methods for searching for or retrieving information from the WWW, such as mediator methods and URL search engines, have been proposed. However, with those methods it is difficult to search for information sources and retrieve information from them in an integrated way. A mediator type method requires the sources to be specified explicitly. URL search engines (e.g., AltaVista, Lycos, Yahoo) return only URLs, i.e., the locations of information sources, so the user must access each listed source individually to obtain the desired information.

This paper proposes an integrated method of searching for information sources and retrieving information from these sources. The proposed method adopts a universal relation as a user interface. The other features of this method are as follows.

Template mechanism through which users can treat HTML pages as if they were tables of a relational database (RDB).
Information resource management that resolves the heterogeneity among information sources.
A new application programming interface (API) by means of which users can issue an inquiry and obtain result values.

Keywords

WWW, universal relation, heterogeneous information sources, information resource management, mediator

INTRODUCTION

As a result of the popularity of the WWW, many open web sources have been created, and a user can access all of them through any web browser. These sources include not only HTML documents that consist of text data and/or tables but also HTML forms that provide fill-in form interfaces for retrieving information on various subjects such as cars, PCs, restaurants, and so on. However, retrieval interface standards, for example a specified way of handling CGI, do not exist yet.
When we want to retrieve information from the WWW, it is useful to be able to search for the information sources and to retrieve information from these sources by using integrated methods. For example, if the user wants to retrieve information about PC models and prices, it would be useful if one integrated function could achieve the following two operations.

Searching for and selecting the information sources related to PCs, and
Retrieving information about models and prices from the selected sources as if they were relational databases.

There are various approaches to searching for or retrieving information from the WWW, such as URL search engines and mediator methods[4]. URL search engines (e.g., AltaVista, Lycos, Yahoo) locate information sources. However, only URL lists are returned as the search result. This forces users to access individual sources and analyze each source to obtain the desired information. On the other hand, mediator type methods[9] are effective in retrieving information from information sources that have already been registered with the system. However, because the sources must be specified explicitly, these methods cannot search for information sources, and it is difficult to achieve an integrated method, i.e., the combination of source search and information retrieval.

In this paper, we propose an integrated method that performs source identification and information retrieval by employing a universal relation[10] as the user interface. In the proposed method, the first operation identifies the candidate sources and selects the target sources from these candidates. The second operation achieves item-based information retrieval from the selected sources.
Besides adopting a universal relation, the other features of this method are a template mechanism, information resource management, and a new application programming interface (API). The template mechanism extracts data values from HTML pages as if they were RDB tables. Information resource management is used to resolve the heterogeneity of information sources. By using information resource management and the new API, the proposed method returns retrieval candidates and the retrieval results from the candidates chosen as targets.

Section 2 outlines the proposed integrated method. Section 3 explains the template mechanism. Section 4 describes information resource management, and Section 5 covers information retrieval using an information resource dictionary and templates. Section 6 compares the method with related work. Lastly, concluding remarks are given in Section 7.

OUTLINE OF THE PROPOSED INTEGRATED METHOD

An outline of this method is shown in Figure 1. We briefly explain the sequence of integrated operations. When the user query is issued (Fig. 1 a), first, various types of heterogeneity, such as retrieval interface differences among information sources, are resolved using an information resource dictionary (IRD), and retrieval candidates are generated and presented to the user (Fig. 1 b). The user chooses target sources from these candidates (Fig. 1 c). Second, information is retrieved from the target sources (Fig. 1 d), and the retrieval results are returned (Fig. 1 e). Retrieval results are a list of item values that satisfy the user’s query. Item values are extracted by using templates. The new API described herein removes the necessity for the user to specify the information sources explicitly. The roles and functions of the templates, the IRD, and the API are briefly explained as follows.

[Templates]
Templates enable users to treat HTML documents as if they were RDB tables. Users can issue SQL-like queries.

[Information resource management]
Based on prior database research[8], we think that the information resource dictionary[6] is suitable for managing metadata. Information resource data such as location of sources (URL), item names, item value representations, and so on, are managed in IRD. Using IRD, heterogeneity can be resolved and retrieval candidates can be dynamically generated. IRD consists of the following dictionaries.
D-1) schema dictionary
D-2) term dictionary
D-3) domain dictionary

[API]
The new API is based on a universal relation and enables users to construct a query by specifying the retrieval items and the retrieval conditions. The API provides interfaces through which the user receives the candidate sources and selects the target sources. The API also returns the list of values that satisfy the user’s query.

EXTRACTION OF ITEM VALUE FROM HTML DOCUMENT BY USING TEMPLATE

Retrieval can be made more efficient if information can be retrieved from various sources as if they were relational databases. However, there is no common rule on the structure of HTML documents. We adopted a template mechanism through which one could treat HTML documents as if they were RDB tables. An example of a template is depicted in Figure 2.
The template describes multiple data extraction patterns. The part subsequent to “HtmlTemplate” corresponds to one extraction pattern. In the extraction pattern, the parts enclosed in $$ correspond to the locations of the data values to be extracted. The character string enclosed in $$ becomes the name of the extracted data item. In this template, the item names are “url”, “title”, and “content”. The character string “..” allows pattern matching with any character string. Patterns are tried in order, starting with the highest-ranking one; when a part of the document matches a pattern, the parts corresponding to the locations of the data items are extracted as the values of those items.
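Since Figure 2 is not reproduced here, the following is only a minimal Python sketch of how such extraction patterns could be applied. The concrete pattern text, the sample HTML, and the function names `compile_pattern` and `extract` are assumptions made for illustration; only the `$$name$$` placeholders, the `..` wildcard, and the ranked order of patterns come from the description above.

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Translate one extraction pattern into a regular expression.

    $$name$$ becomes a named capture group and ".." matches any string.
    All other text must match literally.
    """
    parts = []
    # Split the pattern while keeping the placeholders and wildcards.
    for token in re.split(r"(\$\$[A-Za-z_]\w*\$\$|\.\.)", pattern):
        if token.startswith("$$") and token.endswith("$$") and len(token) > 4:
            parts.append(f"(?P<{token[2:-2]}>.*?)")
        elif token == "..":
            parts.append(".*?")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.DOTALL)

def extract(patterns, html):
    """Try the patterns in rank order; return the rows of the first that matches."""
    for pattern in patterns:
        rows = [m.groupdict() for m in compile_pattern(pattern).finditer(html)]
        if rows:
            return rows
    return []

# Hypothetical patterns, highest-ranking first.
PATTERNS = [
    '<li><a href="$$url$$">$$title$$</a>..<p>$$content$$</p>',
    '<a href="$$url$$">$$title$$</a>',
]

# Hypothetical HTML document.
HTML = (
    '<li><a href="http://example.com/a">Model A</a><br><p>Compact PC</p></li>\n'
    '<li><a href="http://example.com/b">Model B</a><br><p>Tower PC</p></li>\n'
)

rows = extract(PATTERNS, HTML)
for row in rows:
    print(row["url"], row["title"], row["content"])
```

Each returned row behaves like one tuple of a table whose columns are the item names, which is what allows the document to be queried as if it were an RDB table.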
There are various approaches to extracting information from WWW sources[1, 5]. Up to now, templates have been created manually by system administrators. The semiautomatic generation of templates is possible by detecting repetitions in tag patterns that extend over multiple lines. A template is then produced by refining the detected patterns, for example by removing those that are seldom used.
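One possible reading of the repetition detection mentioned above is sketched below in Python. It is not the generation procedure of this paper, which remains future work; the line-based tag signature, the window size `block_lines`, the threshold `min_count`, and the sample page are all assumptions.

```python
import re
from collections import Counter

TAG = re.compile(r"<\s*(/?\w+)[^>]*>")

def tag_signature(line: str) -> tuple:
    """Reduce one line of HTML to the sequence of tag names it contains."""
    return tuple(name.lower() for name in TAG.findall(line))

def repeated_blocks(html: str, block_lines: int, min_count: int) -> dict:
    """Count multi-line tag patterns and keep only those that repeat often."""
    lines = [ln for ln in html.splitlines() if ln.strip()]
    sigs = [tag_signature(ln) for ln in lines]
    counts = Counter(
        tuple(sigs[i:i + block_lines])
        for i in range(len(sigs) - block_lines + 1)
    )
    # Dropping seldom-used patterns leaves the candidates for a template.
    return {block: n for block, n in counts.items() if n >= min_count}

# Hypothetical page with a record that spans two lines and repeats three times.
PAGE = """<h1>PC list</h1>
<tr><td>Model A</td>
<td>$900</td></tr>
<tr><td>Model B</td>
<td>$1200</td></tr>
<tr><td>Model C</td>
<td>$700</td></tr>
<p>footer</p>
"""

frequent = repeated_blocks(PAGE, block_lines=2, min_count=3)
print(frequent)
```

An administrator could then turn a surviving tag pattern into an extraction pattern by marking where the item values appear.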

HETEROGENEITY AND INFORMATION RESOURCE MANAGEMENT

In the proposed method, the IRD manages information resources to resolve heterogeneity among target sources. This section discusses the heterogeneity among sources and describes the IRD.

Heterogeneity among WWW Sources

There are three main kinds of differences.

(a) Retrieval interface differences
Most retrieval interfaces use either keyword retrieval, which retrieves the information related to the keywords set as CGI parameters, or category retrieval, which traces categorized and hyperlinked information. Two search engines over the same content may differ in the names, the number, and other properties of their specifiable CGI parameters. For example, among sources holding car information, specifiable parameters such as “maker”, “model”, and “price” may be offered as a single parameter, as multiple parameters and their combinations, or as more complex retrieval conditions.

(b) Naming heterogeneity
Often the meaning of an item is not easily understood from its name. Information sources are usually developed individually, so a name in one system does not necessarily have the same meaning in another system, and the same name can have different meanings. It is necessary to consider the item name in the query, the item names allowed by each source, and the returned item names. For example, among car information search pages that let users specify “maker”, various input parameter names, i.e., item names such as “make” and “car_maker”, exist.

(c) Representation heterogeneity
There are also differences in the value formats actually shown, because various representations can exist for one object. We expect differences in the units used, the presence of encoding, the notation rules, etc. Moreover, a hyperlink to the actual data can be considered a form of data. For example, various forms such as "10,000 yen", "￥10,000", and "$83.30" may be used to show an amount of money.

Information Resource Management

The IRD consists of three dictionaries. Fill-form interface differences are resolved by the schema dictionary. Naming heterogeneity is solved by the term dictionary, and representation heterogeneity is resolved by employing the domain dictionary.

D-1) Schema dictionary
The specifiable CGI parameters and extractable data items are centrally managed as schema information in the schema dictionary. Concretely, it manages source names, access information, extractable data item names (which correspond to the item names in the templates) and their data types, the conditions that must be specified to retrieve from a web source, and the capability of each web source to accept retrieval items and conditions. Access information includes URLs, CGI parameter names, proxy host name, proxy port number, and the template file name. The specifiable CGI parameters and the data items that can be retrieved from the HTML documents are individually managed as data items.
This dictionary is used to identify the candidate web sources. The differences in the retrieval-form interface are resolved by managing the information structure of each data item. Using the capability specified in the retrieval item/conditions of the source, this system judges whether the retrieval condition is to be processed on the web source side or on the system side. Moreover, relationships between items are also managed in this dictionary. Responding to the user query may require joining the results from multiple web sources; this is realized by using the relationship information.
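To make the role of the schema dictionary more concrete, the sketch below models one entry per source and builds the request URL from the CGI parameter names recorded for that source. The class `SchemaEntry`, its field names, the example URLs, and the equality-only conditions are assumptions for illustration; proxy settings, data types, and join relationships are omitted.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

@dataclass
class SchemaEntry:
    """One hypothetical schema dictionary entry for a web source."""
    source_name: str
    url: str                 # access information: CGI endpoint
    template_file: str       # access information: template used for extraction
    cgi_params: dict         # column name -> CGI parameter name the source accepts
    extractable: frozenset   # column names that the template can extract

    def build_url(self, conditions: dict) -> str:
        """Map the conditions the source can accept onto its own parameter names."""
        query = {
            self.cgi_params[col]: val
            for col, val in conditions.items()
            if col in self.cgi_params
        }
        return f"{self.url}?{urlencode(query)}" if query else self.url

# Two hypothetical car sources whose fill-in forms differ.
site_a = SchemaEntry(
    "SiteA", "http://cars-a.example/search.cgi", "site_a.tpl",
    {"maker": "make"}, frozenset({"maker", "model", "price"}),
)
site_b = SchemaEntry(
    "SiteB", "http://cars-b.example/find", "site_b.tpl",
    {"maker": "car_maker", "price": "max_price"},
    frozenset({"maker", "model", "price"}),
)

conditions = {"maker": "Toyota", "price": "10000"}
print(site_a.build_url(conditions))
print(site_b.build_url(conditions))
```

The same conditions produce different requests for the two sources, and a condition that a source cannot accept is simply left out of the request so that it can be handled on the system side.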

D-2) Term dictionary
To make it easier for the user to retrieve information, we define three kinds of names. We consider that there are three levels in the information retrieval system: the information source, the administrator, and the user. Each level uses a different kind of name.
The first level is “definition name”; this is the item name used by the information source. The “definition name” is only used when accessing the information sources.
The second level is “column name”; universal relation attributes are constructed by the administrator using “column name”. It is linked to “definition name”.
The third level is “retrieval name”; the user uses this when specifying the information to be retrieved.
To resolve naming heterogeneity, synonyms of retrieval name are managed in this dictionary.
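The three levels of names can be illustrated with a small lookup structure. This is a sketch under assumptions: the class `TermDictionary`, its methods, the source name `SiteA`, and the definition name `vehicle_nm` are hypothetical, while the names “Vehicle” and “car_name” are borrowed from the example in Section 5.2.

```python
class TermDictionary:
    """Hypothetical term dictionary linking the three kinds of names."""

    def __init__(self):
        self._column_of = {}    # retrieval name or synonym -> column name
        self._definition = {}   # (source, column name) -> definition name

    def add_column(self, column, synonyms=()):
        """Register a column name together with the retrieval names that map to it."""
        for name in {column, *synonyms}:
            self._column_of[name.lower()] = column

    def link(self, source, column, definition_name):
        """Link a column name to the definition name used by one source."""
        self._definition[(source, column)] = definition_name

    def column_for(self, retrieval_name):
        return self._column_of.get(retrieval_name.lower())

    def definition_for(self, source, retrieval_name):
        return self._definition.get((source, self.column_for(retrieval_name)))

terms = TermDictionary()
terms.add_column("Vehicle", synonyms=["car_name", "model"])
terms.link("SiteA", "Vehicle", "vehicle_nm")

print(terms.column_for("car_name"))               # column name seen by the administrator
print(terms.definition_for("SiteA", "car_name"))  # name used when accessing the source
```

The user only ever writes retrieval names, so a source can rename its items without affecting existing queries as long as the link in the dictionary is updated.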

D-3) Domain dictionary
To resolve representation heterogeneity, we introduce a domain to specify a representation of data. The domain in a web source is a “local domain”; the domain used by a user is a “user domain”; a group of domains having the same content is a “domain group”; and the domain used as a standard in the domain group is a “global domain”. If, and only if, a pair of domains belongs to the same domain group, a value in one domain can be converted to a value in the other by way of the global domain. The domain dictionary manages domain groups, domains, global domains, user domains, and domain conversion functions.
Representation heterogeneity is resolved by centrally managing the representation of the data item in queries and the data item in the web sources. The user can create a query without considering the representation of the data.
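The conversion through a global domain can be sketched as follows. The domain names, the choice of a number of yen as the global domain for prices, and the fixed exchange rate are assumptions for illustration only; the paper does not specify its conversion functions.

```python
import re

YEN_PER_DOLLAR = 120.0  # assumed fixed rate, for illustration only

def _number(text: str) -> float:
    """Strip currency marks, unit words, and thousands separators."""
    return float(re.sub(r"[^\d.]", "", text))

# domain name -> (domain group, conversion to global, conversion from global)
# Assumed global domains: a price is a number of yen, a distance a number of miles.
DOMAINS = {
    "yen_text":   ("price", _number, lambda v: f"{v:,.0f} yen"),
    "yen_symbol": ("price", _number, lambda v: f"￥{v:,.0f}"),
    "dollar":     ("price",
                   lambda s: _number(s) * YEN_PER_DOLLAR,
                   lambda v: f"${v / YEN_PER_DOLLAR:,.2f}"),
    "miles_text": ("distance", _number, lambda v: f"{v:,.0f} miles"),
}

def convert(value: str, source_domain: str, target_domain: str) -> str:
    """Convert a value between two domains by way of the global domain."""
    src_group, to_global, _ = DOMAINS[source_domain]
    dst_group, _, from_global = DOMAINS[target_domain]
    if src_group != dst_group:
        raise ValueError("domains belong to different domain groups")
    return from_global(to_global(value))

print(convert("10,000 yen", "yen_text", "yen_symbol"))
print(convert("￥12,000", "yen_symbol", "dollar"))
```

Because every domain only needs conversions to and from the global domain, adding a new local domain requires one pair of functions rather than one per existing domain.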

INTEGRATION OF SOURCE IDENTIFICATION AND INFORMATION RETRIEVAL

API

One benefit of this method is that it offers an original API which simplifies the development of application programs.
On the WWW, web sources are frequently added and/or changed. Therefore, it is difficult to construct and maintain application programs written in a language that requires the location and the structure of the data to be specified, such as SQL. The original API is based on a universal relation[10], so there is no need to specify the information sources. Application development is independent of the system design, and applications can be maintained and operated without concern for the number or structure of the web sources. A typical API is shown in Table 1.

Source Identification

The user’s query is constructed by specifying items, which designate the retrieval items and the retrieval conditions.
For example, the user issues the following query sentence:

Select car_name, miles, doors, price
Where price < “$10000”.

As mentioned in Section 4.2, each item name in the query is assumed to be a retrieval name. In this example, the retrieval names are “car_name”, “miles”, “doors”, and “price”.
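As an illustration of how the query above separates into retrieval items and retrieval conditions, the following sketch parses the Select/Where form shown. The function `parse_query`, the supported comparison operators, and the use of “and” to combine several conditions are assumptions; the actual query grammar and API calls (Table 1) are not reproduced here.

```python
import re
from typing import NamedTuple

class Condition(NamedTuple):
    item: str
    op: str
    value: str

class Query(NamedTuple):
    items: list
    conditions: list

# item, comparison operator, quoted value (straight or curly quotes)
_COND = re.compile(r'\s*(\w+)\s*(<=|>=|<>|=|<|>)\s*["“]([^"”]*)["”]\s*')

def parse_query(text: str) -> Query:
    """Split a Select ... Where ... sentence into items and conditions."""
    text = text.strip().rstrip(".")
    m = re.fullmatch(
        r"select\s+(.+?)(?:\s+where\s+(.+))?", text, re.IGNORECASE | re.DOTALL
    )
    if not m:
        raise ValueError("not a Select ... Where ... query")
    items = [name.strip() for name in m.group(1).split(",")]
    conditions = []
    if m.group(2):
        for part in re.split(r"\s+and\s+", m.group(2), flags=re.IGNORECASE):
            c = _COND.fullmatch(part)
            if not c:
                raise ValueError(f"cannot parse condition: {part!r}")
            conditions.append(Condition(*c.groups()))
    return Query(items, conditions)

query = parse_query('Select car_name, miles, doors, price\nWhere price < “$10000”.')
print(query.items)
print(query.conditions)
```

Note that the query names no source: only retrieval names and conditions appear, which is what the universal relation interface is meant to allow.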

An example of candidate sources for the above query is shown in Figure 3. After the user query is issued, the retrieval names are compared with the synonyms of each item name in the term dictionary. The synonyms corresponding to the retrieval names are then searched for in the schema dictionary. Column names are compared with the synonyms, and sources that include all items are returned as the candidate sources. In Figure 3, the column name “Vehicle” of candidate 1 matches the synonym of the retrieval name “car_name”.
Candidate sources created from multiple sources are displayed together with their join relationship. Figure 3 shows that two sources are joined by “Vehicle” in candidate 1.
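The candidate generation described above can be sketched as a coverage test over the columns of each source. This is a simplified sketch: the source names and column sets are hypothetical, only single sources and pairs of sources are considered, and two sources are assumed to be joinable whenever they share a column name, whereas the paper keeps explicit relationship information in the schema dictionary.

```python
from itertools import combinations

def candidate_sources(retrieval_names, synonyms, sources):
    """Return candidates as (tuple of source names, join column or None).

    synonyms: retrieval name -> column name (from the term dictionary)
    sources:  source name -> set of column names (from the schema dictionary)
    """
    if any(name not in synonyms for name in retrieval_names):
        return []
    wanted = {synonyms[name] for name in retrieval_names}
    # A single source is a candidate when it includes all items.
    candidates = [((name,), None) for name, cols in sources.items() if wanted <= cols]
    # Two sources together are a candidate when they can be joined
    # and their combined columns include all items.
    for (a, cols_a), (b, cols_b) in combinations(sources.items(), 2):
        if wanted <= cols_a or wanted <= cols_b:
            continue  # already covered by a single-source candidate
        shared = cols_a & cols_b
        if shared and wanted <= (cols_a | cols_b):
            candidates.append(((a, b), sorted(shared)[0]))
    return candidates

SYNONYMS = {"car_name": "Vehicle", "miles": "Mileage", "doors": "Doors", "price": "Price"}
SOURCES = {
    "SpecSite":  {"Vehicle", "Doors"},
    "PriceSite": {"Vehicle", "Mileage", "Price"},
    "AllInOne":  {"Vehicle", "Mileage", "Doors", "Price", "Color"},
    "Weather":   {"City", "Temperature"},
}

candidates = candidate_sources(["car_name", "miles", "doors", "price"], SYNONYMS, SOURCES)
for names, join_column in candidates:
    print(names, join_column)
```

With these hypothetical sources, one candidate is a single source and the other is a pair of sources joined by “Vehicle”, similar in spirit to candidate 1 of Figure 3.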

Information Retrieval

The information desired by the user is extracted from the sources chosen by the user. First, it is necessary to judge whether the retrieval condition is to be processed on the web source side or on the system side. This is decided by using the capability specified in the schema dictionary. If the retrieval condition can be processed on the source side, the item values are simply extracted from the returned HTML document by using the template. If the condition must be processed on the system side, the values of the items in the retrieval condition are extracted in addition to the retrieval item values, and filtering is then executed.
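The decision between source-side and system-side processing can be sketched as below. The representation of a capability as a set of (item, operator) pairs, the function names, and the sample rows are assumptions; the rows are taken to be already extracted by a template and converted to numbers.

```python
import operator

OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge,
}

def split_conditions(conditions, capability):
    """Separate conditions the source can evaluate from those left to the system."""
    source_side, system_side = [], []
    for item, op, value in conditions:
        target = source_side if (item, op) in capability else system_side
        target.append((item, op, value))
    return source_side, system_side

def filter_rows(rows, system_side):
    """Apply the remaining conditions to the extracted rows on the system side."""
    return [
        row for row in rows
        if all(OPS[op](row[item], value) for item, op, value in system_side)
    ]

conditions = [("maker", "=", "Toyota"), ("price", "<", 10000)]
capability = {("maker", "=")}   # hypothetical source accepts only maker = ...

source_side, system_side = split_conditions(conditions, capability)

# Rows as they might come back from the source; "price" is extracted even if
# it is not a retrieval item, because it is needed for filtering.
rows = [
    {"car_name": "Car A", "price": 8000},
    {"car_name": "Car B", "price": 12000},
]
result = filter_rows(rows, system_side)
print(source_side, system_side)
print(result)
```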

When retrieval is executed, representation heterogeneity needs to be resolved. This is achieved by using the user domain and the local domain of each data item recorded in the domain dictionary. When the user domain and the local domain are different, a conversion function is used to convert from the user domain to the global domain, and from the global domain to the local domain. The retrieval items/conditions are converted from the user domain to the local domain, and the retrieval results are converted from the local domain to the user domain. The data representation heterogeneity is resolved by this process. The user can specify the retrieval condition in the user domain and obtain the retrieval results in the user domain.

RELATED WORKS

Our goal is the same as that of TSIMMIS[2] and Information Manifold[7]. We compare these methods to our method.
TSIMMIS uses the Mediator Specification Language (MSL) to describe the relationships between the schemas of information sources and user views (schemas for users in TSIMMIS)[3]. Information Manifold describes such mapping rules by using queries (which include a declaration of relations and comparison descriptions) against world views (the same meaning as user views in TSIMMIS). In both methods, a query against user views (or world views) and these mapping rules are interpreted together in the rule induction engine, and the final query plan is generated. It is basically assumed that only one final query plan is produced through rule interpretation. Several candidate query plans are evaluated by the optimizer, and only the best candidate is output as the final query plan. In our proposed method, however, several retrieval candidates are returned to the application programs or users, and they have to interpret these candidates and decide how to process them. This method has the following advantages.
Consider an information retrieval application for information sources that have similar data structures. For example, suppose there are three information sources for the subjects of books, videos, and compact discs, respectively. All sources have two data items, “price” and “title”. To construct an information retrieval system using the rule based approach, a super class of the three information sources must be added to the rule base, and the user must issue queries against that super class. If a new information source is to be added, the rule base must be changed in the same manner. Our proposed method allows synonyms to be set for the items “price” and “title”. This allows three retrieval candidates to be automatically obtained by addressing one query to the system. The addition of an information source does not require us to change the rules.
This shows that our method is more effective than rule based approaches when new similar information sources are frequently added.

Next, we compare our method to URL search engines such as AltaVista, Lycos, and Yahoo. The input to a URL search engine is keywords, and the output is a list of URLs. On the other hand, our method can extract data from HTML documents given retrieval items and retrieval conditions. Both methods can obtain information from Web servers, but our system is more general, because it can use URL search engines as local information sources. In that case, we use “URL” as a retrieval item, and “keyword =’keyword given by user’” as a retrieval condition in the query against our system.
In URL search engines, retrieval flexibility such as semantic analysis of input keywords and full-text search of HTML documents is a significant benefit; our system lacks such semantic analysis. As explained above, however, we can use URL search engines as information sources. This means that our method can make good use of URL search engines, and we can combine the results of URL search engines with other Web site information to obtain more complex information.

CONCLUSION

This paper proposed an integrated information retrieval method for the WWW. This method bases the user interface on a universal relation. Given the user’s query, the method returns a preliminary list of candidate sources from which the user selects target sources. The information desired is extracted from the target sources.
Besides adopting a universal relation, the other features of this method are as follows.

Template mechanism allows HTML pages to be treated as if they were relational database tables.
Information resource management resolves heterogeneity of information sources.
New application programming interface (API) allows users to construct a query by specifying items, which designate the retrieval items and the retrieval conditions.

The proposed method resolves heterogeneity among sources and generates and presents retrieval candidates based on the user’s query. Even when there are many candidates, the users can choose the sources desired as retrieval targets. Therefore, this method can treat independently controlled sources covering various subjects such as cars, PCs, restaurants, and so on. This method returns the lists of item values within a uniform user domain as the retrieval results. Thus the user need not analyze HTML documents to obtain the desired information, and retrieval becomes easier.

Our future research plans are as follows.

Create an automatic template generation method.
Expand the range of the proposed method to cover information sources such as text databases, XML data, and multimedia databases.

REFERENCES

[1] B.Adelberg, “NoDoSE - a Tool for Semi-automatically Extracting Structured Data from Text Documents”, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, June 1998.

[2] S.Chawathe, H.Garcia-Molina, J.Hammer, K.Ireland, Y.Papakonstantinou, J.D.Ullman, J.Widom, “The TSIMMIS project: Integration of heterogeneous information sources”, IPSJ Conference, 1994.

[3] L.Chen, R.Yerneni, V.Vassalos, H.Garcia-Molina, Y.Papakonstantinou, J.Ullman, M.Valiveti, “Capability based mediation in TSIMMIS”, SIGMOD Record, vol. 27, no. 2, June 1998.

[4] L.Gravano, Y.Papakonstantinou, “Mediating and Metasearching on the Internet”, IEEE Data Engineering Bulletin Special Issue on Databases and the World Wide Web, vol. 21, no. 2, June 1998.

[5] J.Hammer, H.Garcia-Molina, J.Cho, R.Aranha, A.Crespo, “Extracting Semistructured Information from the Web”, Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.

[6] “Information Resource Dictionary System (IRDS) Framework”, ISO/IEC 10027, 1990.

[7] A.Y.Levy, A.Rajaraman, J.J.Ordille, “Querying heterogeneous information sources using source descriptions”, International Conference on Very Large Data Bases (VLDB ’96), 1996.

[8] J.Sekine, H.Machihara, M.Kawashimo, M.Nakagawa, “A Methodology for the Data Standardization using a Word Dictionary”, 47th International Federation for Information and Documentation Conference and Congress, Saitama, Japan, Oct. 1994.

[9] J.Ullman, “Information Integration Using Logical Views”, Proceedings of the International Conference on Database Theory, Delphi, Greece, 1997.

[10] J.Ullman, “Principles of Database Systems, Second Edition”, 1985.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and or fee.
SAC 2000 March 19-21 Como, Italy
(c) 2000 ACM 1-58113-239-5/00/003>...>$5.00
