ecocast
Fetch Technologies

Figure: The Fetch Agent Platform architecture.

Fetch Web Agents

To facilitate rapid data access and to enable on-the-fly generation of Ecocasts in response to an event (e.g., a fire, flood, or frost), we are leveraging data management software developed by Fetch Technologies. Fetch Technologies provides solutions for integrating and accessing heterogeneous data sources. These solutions are built on top of the Fetch Agent Platform, a system for gathering information from the Internet and intranets.

There is a tremendous amount of information available on the World Wide Web, but applications cannot easily use data that is embedded in Web sites. To solve this problem, Fetch Technologies has developed agent-based technology for navigating through complex Web sites and extracting data from semistructured sources. A Fetch agent is a software program that enables online sources to be queried as if they were databases. The Fetch Agent Platform enables users to:

• Build agents by example.
• Ensure that agents accurately extract data across an entire collection of pages.
• Verify an agent’s integrity in order to avoid failures when a site changes.
• Semi-automatically repair agents in response to changes in Web site layout or format.

Our technology is particularly useful for processing “semistructured” sources, with no explicit structure or schema, but with an implicit underlying structure. Many HTML sources, such as online catalogs, have a very regular structure that can be used to extract data from them. However, even text sources, such as email messages, often have some underlying structure that can be exploited to extract fields such as the date, sender, addressee, subject, and body of the message.
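As a small illustration of exploiting the implicit structure of an email message, the sketch below uses Python's standard `email` module to pull out the fields listed above. It is not part of the Fetch Agent Platform; the sample message and the `extract_fields` helper are assumptions made for this example.

```python
from email import message_from_string

# Hypothetical sample message; the addresses and text are made up.
RAW_MESSAGE = """\
Date: Wed, 31 Mar 2004 10:15:00 -0800
From: alice@example.org
To: bob@example.org
Subject: Frost warning

Overnight lows are expected to fall below freezing.
"""


def extract_fields(raw: str) -> dict:
    """Return the date, sender, addressee, subject, and body of a message."""
    msg = message_from_string(raw)
    return {
        "date": msg["Date"],
        "sender": msg["From"],
        "addressee": msg["To"],
        "subject": msg["Subject"],
        # For a simple, non-multipart message the payload is the body text.
        "body": msg.get_payload().strip(),
    }


fields = extract_fields(RAW_MESSAGE)
print(fields["subject"], "/", fields["sender"])
```

The headers act as fixed markers in otherwise free text, which is the kind of underlying structure the paragraph above describes.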

The core technology embedded in the Fetch Agent Platform was originally developed by the company founders at the University of Southern California. Fetch has refined and extended the basic technology, building a set of tools and techniques to make it possible to rapidly provide efficient and accurate extraction from Web sources. Fetch’s agent technology is based on a powerful idea: the user provides examples of the data to extract, and the computer builds the agent automatically. Using this approach, we not only generate highly accurate extraction rules, but we also create rules that monitor the quality of the data being extracted. This machine learning approach can be used to regenerate agents automatically when the format of a Web site changes.
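One way to picture rules that monitor the quality of extracted data is to learn the typical "shape" of the training values and flag later values that do not match. The sketch below is an illustrative assumption, not the method used in the Fetch Agent Platform; the function names `shape`, `learn_shapes`, and `suspicious` are hypothetical.

```python
import re


def shape(value: str) -> str:
    """Reduce a value to a coarse pattern: letter runs -> 'A', digit runs -> '9'."""
    collapsed = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", collapsed)


def learn_shapes(samples):
    """Collect the set of shapes seen in known-good example values."""
    return {shape(v) for v in samples}


def suspicious(values, known_shapes):
    """Return extracted values whose shape was never seen during training."""
    return [v for v in values if shape(v) not in known_shapes]


# Ratings marked up by the user while building the agent (made-up data).
known = learn_shapes(["4.5", "3.0", "10.0"])

# Values extracted later; the second looks like leaked markup after a layout change.
flagged = suspicious(["4.0", "</b><p>", ""], known)
print(flagged)
```

A run of flagged values like this could serve as a signal that the site's format has changed and the agent needs to be repaired.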

The platform consists of two major components. AgentBuilder provides a completely visual environment that allows a user with minimal training to construct sophisticated agents easily and quickly. AgentRunner is an execution system that automatically performs the tasks specified by the agent, and produces structured data that is ready for use in any application. AgentRunner also includes a graphical user interface, called AgentAdministrator, which supports the maintenance of the agents.

One of the critical problems in building an agent is to generate a set of extraction rules that precisely define how to locate targeted information on the page. For any given item targeted for extraction, one needs an extraction rule to locate the beginning and end of that item. Each extraction rule consists of a sequence of “landmarks” that specify exactly how to locate the targeted item of information. The difficult part of this problem is to create a set of extraction rules that work for all pages of the same type. For example, if you wanted to extract the cuisine type and the rating for any given restaurant on a restaurant review site, you would want the extraction rules to work for any page on that site containing these items of information.
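To make the idea of landmark-based rules concrete, here is a minimal sketch for the restaurant example. It assumes a rule is a list of start landmarks consumed in order plus a single end landmark. The page markup, the rule contents, and the helper names `skip_to` and `apply_rule` are hypothetical and are not taken from the Fetch Agent Platform.

```python
from typing import List, Optional


def skip_to(page: str, landmarks: List[str], start: int = 0) -> Optional[int]:
    """Consume each landmark in order; return the index just past the last one."""
    pos = start
    for mark in landmarks:
        found = page.find(mark, pos)
        if found == -1:
            return None  # a landmark is missing, so the rule does not apply
        pos = found + len(mark)
    return pos


def apply_rule(page: str, start_landmarks: List[str], end_landmark: str) -> Optional[str]:
    """Extract the text between the start landmarks and the end landmark."""
    begin = skip_to(page, start_landmarks)
    if begin is None:
        return None
    end = page.find(end_landmark, begin)
    if end == -1:
        return None
    return page[begin:end].strip()


# Two made-up pages of the same type with slightly different layouts.
PAGE_A = "<h1>Luigi's</h1><p>Cuisine: <b>Italian</b></p><p>Rating: <b>4.5</b></p>"
PAGE_B = ("<h1>Sakura</h1><p>Open late</p><p>Cuisine: <b>Japanese</b></p>"
          "<p>Rating: <b>3.0</b></p>")

CUISINE_RULE = (["Cuisine:", "<b>"], "</b>")
RATING_RULE = (["Rating:", "<b>"], "</b>")

for page in (PAGE_A, PAGE_B):
    print(apply_rule(page, *CUISINE_RULE), apply_rule(page, *RATING_RULE))
```

Because the landmarks are matched relative to each other and not at fixed positions, the same two rules work on both pages even though the second page contains an extra paragraph.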

It is difficult for people to write robust extraction rules. Fetch has solved this problem by developing a machine-learning algorithm that learns extraction rules by example. Using AgentBuilder, users can mark up sample pages obtained from a site. The system then analyzes the examples to generate a set of extraction rules that accurately extract the desired information. Our approach uses a greedy-covering inductive learning algorithm, which incrementally builds the extraction rules from the examples. Our algorithm is able to generate extraction rules efficiently from a small number of examples. The algorithm also has the advantage of being able to extract data from pages containing lists, nested structures, and other complicated formatting layouts that are difficult for other approaches.
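The general flavor of greedy covering can be sketched as follows. This is a heavily simplified illustration, not Fetch's algorithm. It assumes each candidate rule is a single landmark string taken from the text just before a marked-up item, and it repeatedly keeps the candidate that correctly locates the most still-uncovered examples without mislocating any example. The names `learn_rules` and `find_start` and the sample pages are hypothetical.

```python
def locates(page, landmark, target_start):
    """True if the first occurrence of the landmark ends exactly at the target."""
    found = page.find(landmark)
    return found != -1 and found + len(landmark) == target_start


def misfires(page, landmark, target_start):
    """True if the landmark occurs but points somewhere other than the target."""
    found = page.find(landmark)
    return found != -1 and found + len(landmark) != target_start


def learn_rules(examples, max_len=12):
    """Greedy covering: add landmarks until every marked-up example is located."""
    marked = [(page, page.index(target)) for page, target in examples]
    uncovered, rules = list(marked), []
    while uncovered:
        # Candidates are the strings immediately preceding an uncovered target.
        candidates = {p[s - k:s] for p, s in uncovered
                      for k in range(1, min(max_len, s) + 1)}
        best, best_cover = None, []
        for cand in sorted(candidates, key=lambda c: (len(c), c)):
            if any(misfires(p, cand, s) for p, s in marked):
                continue  # would extract the wrong item on some page
            cover = [ex for ex in uncovered if locates(ex[0], cand, ex[1])]
            if len(cover) > len(best_cover):
                best, best_cover = cand, cover
        if best is None:
            raise ValueError("no consistent landmark within max_len characters")
        rules.append(best)
        uncovered = [ex for ex in uncovered if ex not in best_cover]
    return rules


def find_start(page, rules):
    """Apply the learned landmarks in order; return where the item begins."""
    for landmark in rules:
        found = page.find(landmark)
        if found != -1:
            return found + len(landmark)
    return None


# Three made-up pages, each paired with the item a user marked up on it.
EXAMPLES = [
    ("<p>Name: <b>Lamp</b></p><p>Price: <b>$20</b></p>", "$20"),
    ("<p>Name: <b>Desk</b></p><p>Price: <b>$150</b></p>", "$150"),
    ("<p>Name: <b>Sale Chair</b></p><p>Sale price: <i>$45</i></p>", "$45"),
]
rules = learn_rules(EXAMPLES)

NEW_PAGE = "<p>Name: <b>Shelf</b></p><p>Price: <b>$80</b></p>"
print(rules, NEW_PAGE[find_start(NEW_PAGE, rules):][:3])
```

In this toy setting one landmark covers the two regular pages and a second is added for the sale page, which mirrors how a covering learner builds up a rule set incrementally from a handful of examples.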

For more information, please visit http://www.fetch.com/.

Copyright © 2000-2003 Fetch Technologies, Inc. All rights reserved.

 

Contacts
More information on Fetch Technologies and the Web Agent Platform is available here.

Questions & Comments
updated 03/31/04

NASA Official: Rama Nemani
Curator: Forrest Melton