Fetch Web Agents
To facilitate rapid data access and to enable on-the-fly
generation of Ecocasts in response to an event (e.g., a
fire, flood, or frost), we are leveraging data management
software developed by Fetch Technologies. Fetch Technologies
provides innovative solutions for integrating and accessing
heterogeneous data sources. Solutions are built on top of
the Fetch Agent Platform, a system for gathering information
from the Internet and intranets.
There is a tremendous amount of information
available on the World Wide Web, but applications cannot
easily use data that is embedded in Web sites. To solve
this problem, Fetch Technologies has developed agent-based
technology for navigating through complex Web sites and
extracting data from semistructured sources. A Fetch agent
is a software program that enables online sources to be
queried as if they were databases. The Fetch Agent Platform
enables users to:
• Build agents by example.
• Ensure that agents accurately extract data across an entire
collection of pages.
• Verify an agent’s integrity in order to avoid failures
when a site changes.
• Semi-automatically repair agents in response to changes
in Web site layout or format.
Our technology is particularly useful
for processing “semistructured” sources: sources that have no
explicit structure or schema but do have an implicit underlying structure.
Many HTML sources, such as online catalogs, have a very
regular structure that can be used to extract data from
them. However, even text sources, such as email messages,
often have some underlying structure that can be exploited
to extract fields such as the date, sender, addressee, subject,
and body of the message.
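The email case can be illustrated with a minimal sketch. It uses Python's standard `email` module rather than any Fetch software, and the sample message, the addresses, and the `extract_fields` helper are assumptions made up for this illustration.

```python
from email import message_from_string

# Hypothetical sample message; headers and body are invented for illustration.
RAW_MESSAGE = """\
Date: Mon, 03 Mar 2003 09:15:00 -0800
From: alice@example.org
To: bob@example.org
Subject: Frost warning

Overnight lows are expected to drop below freezing.
"""

def extract_fields(raw: str) -> dict:
    """Use the implicit header/body structure of a message to pull out fields."""
    msg = message_from_string(raw)
    return {
        "date": msg["Date"],
        "sender": msg["From"],
        "addressee": msg["To"],
        "subject": msg["Subject"],
        # For a simple, non-multipart message the payload is the body text.
        "body": msg.get_payload().strip(),
    }

fields = extract_fields(RAW_MESSAGE)
```

The message has no declared schema, yet the header lines and the blank line before the body give it enough regularity to be read like a database record.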
The core technology embedded in the Fetch
Agent Platform was originally developed by the company founders
at the University of Southern California. Fetch has refined
and extended the basic technology, building a set of tools
and techniques to make it possible to rapidly provide efficient
and accurate extraction from Web sources. Fetch’s agent
technology is based on a powerful idea: the user provides
examples of the data to extract, and the computer builds
the agent automatically. Using this approach, we not only
generate highly accurate extraction rules, but we also create
rules that monitor the quality of the data being extracted.
This machine learning approach can be used to regenerate
agents automatically when the format of a Web site changes.
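As a rough illustration of how rules that monitor data quality might work, the sketch below learns coarse "shapes" from example values and flags extracted values whose shape was never seen. This is an assumption for illustration only, not a description of Fetch's actual verification method; the `shape`, `learn_shapes`, and `looks_valid` names are hypothetical.

```python
import re
from typing import Iterable, Set

def shape(value: str) -> str:
    """Collapse a value to a coarse pattern: digit runs -> 9, letter runs -> A."""
    pattern = re.sub(r"[0-9]+", "9", value)
    return re.sub(r"[A-Za-z]+", "A", pattern)

def learn_shapes(examples: Iterable[str]) -> Set[str]:
    """Record the shapes seen in values known to be correct."""
    return {shape(v) for v in examples}

def looks_valid(value: str, known_shapes: Set[str]) -> bool:
    """A value is suspicious if its shape was never seen during training."""
    return shape(value) in known_shapes

# Hypothetical training values for a restaurant rating field.
RATING_SHAPES = learn_shapes(["4.5", "3", "5.0"])
```

If a site redesign caused a rating rule to start returning fragments of markup, a check such as `looks_valid("<td>4.0", RATING_SHAPES)` would fail and could signal that the agent needs repair.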
The platform consists of two major components.
AgentBuilder provides a completely visual environment that
allows a user with minimal training to construct sophisticated
agents easily and quickly. AgentRunner is an execution system
that automatically performs the tasks specified by the agent,
and produces structured data that is ready for use in any
application. AgentRunner also includes a graphical user
interface, called AgentAdministrator, which supports the
maintenance of the agents.
One of the critical problems in building
an agent is to generate a set of extraction rules that precisely
define how to locate targeted information on the page. For
any given item targeted for extraction, one needs an extraction
rule to locate the beginning and end of that item. Each
extraction rule consists of a sequence of “landmarks” that
specify exactly how to locate the targeted item of information.
The difficult part of this problem is to create a set of
extraction rules that work for all pages of the same type.
For example, if you wanted to extract the cuisine type and
the rating for any given restaurant on a restaurant review
site, you would want the extraction rules to work for any
page on that site containing these items of information.
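A landmark-based rule for the restaurant example might look like the following sketch. The page snippets, the rule format (a list of start landmarks plus one end landmark), and the function names are all assumptions for illustration; the actual rule language of the Fetch Agent Platform is not described here.

```python
from typing import Optional, Sequence

def skip_to(page: str, landmarks: Sequence[str], start: int = 0) -> Optional[int]:
    """Consume each landmark in order; return the index just past the last one."""
    pos = start
    for landmark in landmarks:
        found = page.find(landmark, pos)
        if found == -1:
            return None  # the rule does not apply to this page
        pos = found + len(landmark)
    return pos

def extract(page: str, start_landmarks: Sequence[str], end_landmark: str) -> Optional[str]:
    """Locate the beginning and the end of an item and return the text between."""
    begin = skip_to(page, start_landmarks)
    if begin is None:
        return None
    end = page.find(end_landmark, begin)
    if end == -1:
        return None
    return page[begin:end].strip()

# Two hypothetical pages of the same type with slightly different layouts.
PAGE_A = "<h1>Luigi's</h1><p>Cuisine: <b>Italian</b></p><p>Rating: <b>4.5</b></p>"
PAGE_B = ("<h1>Sakura House</h1><p>Open late</p>"
          "<p>Cuisine: <b>Japanese</b></p><p>Rating: <b>3</b></p>")

CUISINE_RULE = (["Cuisine:", "<b>"], "</b>")
RATING_RULE = (["Rating:", "<b>"], "</b>")

cuisines = [extract(p, *CUISINE_RULE) for p in (PAGE_A, PAGE_B)]
ratings = [extract(p, *RATING_RULE) for p in (PAGE_A, PAGE_B)]
```

Because the rules are anchored to landmarks rather than to fixed character positions, the same rule still works on the second page even though it contains an extra paragraph.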
It is difficult for people to write robust, hand-coded
extraction rules. Fetch has solved this problem by developing
a machine-learning algorithm that learns extraction rules
by example. Using AgentBuilder, users can mark up sample
pages obtained from a site. The system then analyzes the
examples to generate a set of extraction rules that accurately
extract the desired information. Our approach uses a greedy-covering
inductive learning algorithm, which incrementally builds
the extraction rules from the examples. Our algorithm is
able to generate extraction rules efficiently from a small
number of examples. The algorithm also has the advantage
of being able to extract data from pages containing lists,
nested structures, and other complicated formatting layouts
that are difficult for other approaches to handle.
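The general idea of greedy covering can be sketched as follows. This is a much simplified, character-level toy and not Fetch's algorithm: each candidate rule is a single landmark string, a rule is accepted only if it never points at the wrong position on any training example, and the learner repeatedly picks the rule that covers the most still-uncovered examples. All names and sample pages are hypothetical.

```python
from typing import List, Optional, Tuple

Example = Tuple[str, int]  # (page text, index where the target item begins)

def locate(page: str, landmark: str) -> Optional[int]:
    """Return the index just past the first occurrence of landmark, if any."""
    found = page.find(landmark)
    return None if found == -1 else found + len(landmark)

def is_consistent(landmark: str, examples: List[Example]) -> bool:
    """A landmark may miss an example, but must never point at the wrong spot."""
    return all(locate(page, landmark) in (None, start) for page, start in examples)

def learn_start_rules(examples: List[Example]) -> List[str]:
    """Greedily build an ordered list of landmarks that covers every example."""
    uncovered = list(examples)
    rules: List[str] = []
    while uncovered:
        # Candidates: every suffix of the text preceding an uncovered target.
        candidates = {page[i:start] for page, start in uncovered for i in range(start)}
        best, best_covered = None, []
        # Shortest candidates first, so ties go to the most general landmark.
        for cand in sorted(candidates, key=lambda c: (len(c), c)):
            if not is_consistent(cand, examples):
                continue
            covered = [ex for ex in uncovered if locate(ex[0], cand) == ex[1]]
            if len(covered) > len(best_covered):
                best, best_covered = cand, covered
        if best is None:
            raise ValueError("no consistent landmark covers the remaining examples")
        rules.append(best)
        uncovered = [ex for ex in uncovered if ex not in best_covered]
    return rules

def apply_rules(page: str, rules: List[str]) -> Optional[int]:
    """Try each learned landmark in order; return the first start position found."""
    for landmark in rules:
        pos = locate(page, landmark)
        if pos is not None:
            return pos
    return None

# Hypothetical marked-up samples: the user points at the rating on each page.
PAGE_A = "<p>Cuisine: <b>Italian</b></p><p>Rating: <b>4.5</b></p>"
PAGE_B = "<p>Open late</p><p>Cuisine: <b>Japanese</b></p><p>Rating: <b>3</b></p>"
PAGE_C = "<p>Rating: <i>5</i></p>"
EXAMPLES = [(PAGE_A, PAGE_A.index("4.5")), (PAGE_B, PAGE_B.index("3")),
            (PAGE_C, PAGE_C.index("5"))]

rules = learn_start_rules(EXAMPLES)
```

With these three samples the learner needs two landmarks, because the third page uses a different layout that no single consistent landmark shares with the first two. A production learner would work on tokens, learn end rules as well, and handle lists and nested structures, none of which this sketch attempts.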
For more information, please visit http://www.fetch.com/.
Copyright © 2000-2003 Fetch Technologies, Inc. All rights
reserved.
Contacts
More information on Fetch Technologies
and the Fetch Agent Platform is available at http://www.fetch.com/.