A DISTRIBUTED OPERATIONS AUTOMATION TESTBED TO EVALUATE SYSTEM SUPPORT FOR AUTONOMY AND OPERATOR INTERACTION PROTOCOLS
University of Colorado, Dept. of Computer Science
Boulder, CO USA 80309-0520
Fax: 001 303 492-5456
University of Colorado, Space Grant College
Boulder, CO USA 80309-0520
Fax: 001 303 492-5456
ABSTRACT. Space systems must become considerably more autonomous to enable cost-effective commercial and scientific missions. Most automated space systems still require significant operator attention to monitor and complete tasks. Ideally, space systems would be automated to operate autonomously; however, if a mission is complicated and the operational environment is difficult to model, then full autonomy is inherently risky. The goal of the semi-autonomous system presented here is to provide a framework for evolving automation over the life of a mission. To support this evolutionary scheme, a multi-agent approach is taken in which event detection is linked to goal-oriented reaction using ground- and space-segment agents. A key feature of the system is that it enables migration of agent automation between segments, with protocols for operator-agent and inter-agent interaction so that the operator can develop "cooperative" automation. Through initial analysis, properties of this system have been identified. Problems that arise from the nature of multi-agent systems are described along with working solutions. To validate the system, it will be tested on a July 1997 manifested Space Shuttle payload being built at the University of Colorado.
Many space systems include automated functions, but these are usually limited to specific functions with well-defined physical properties and mathematical models. Typical automation is function-specific (e.g., attitude control) and static; hard-to-model functions such as subsystem health and status monitoring are not substantially automated. Modern space systems automation could be described as "telerobotic": the level of automated sensing and device control is low, and significant operator attention to monitor tasks and handle exceptions is still required. Ideally, space systems would operate fully autonomously, but full autonomy is difficult to achieve due to unpredictable variations in the operational environment. Given these facts, the most significant automation advances can be made by providing methods to automate traditionally manual tasks that are difficult to model and define in advance. The approach taken in this research is to define a framework for flexible, evolutionary automation. This allows an operator to "off-load" traditionally manual tasks during a mission as operational experience is gained, and to define cooperative automation for operations assistance. The multi-agent design considered incorporates situated agents (agents local to device sensors and actuators) and surrogate agents (remote from the device, but local to the operator interface). This type of automation has been applied to systems comparable to space operations systems. Situated agents link event detection to inferences about their environment and to goal-oriented reactions in order to achieve a specific mission or high-level task. Agent theory does not necessarily dictate how the linking is performed, and a number of automated intermediate steps may be taken, including: state observation, event detection, event classification, reaction selection, and reaction execution.
Reactions can handle classifiable events, leaving those that cannot be reliably detected or classified to be handled by the operator. This type of operator interaction with the cooperative system is characterized as "management by exception." While this approach has promise for increasing autonomy, inherent problems must first be solved for reliability.
Figure 1. DATA Space System Elements
The fundamental problem is observability of agent state and execution to enable effective interaction between the operator and agents, and between distributed agents themselves. In many traditional systems, variations from expected behavior are handled exclusively by human operators. With this multi-agent system, the variations must be handled jointly by the system and operators. The "DATA" (Distributed Automation Technology Advancement) Space Shuttle Hitchhiker payload (Figure 1) incorporates sophisticated methods to detect events, and a real-time, rule-based inferencing engine for event classification, reaction selection, and execution. The DATA system has properties hypothesized to be common to real-time, multi-agent systems in general, and thus should be applicable to a broad range of semi-autonomous space systems. The DATA system is described first, along with properties and problems derived from initial testbed-supported analysis.
Figure 2. DATA Distributed Multi-Agent Architecture
DATA SEMI-AUTONOMOUS SYSTEM DESCRIPTION
The DATA software architecture (Figure 2), is best described as a multi-agent system with agents distributed between the ground and space segments as well as within the ground segment. A key feature of system downlink is that the amount of data transported from the space-segment situated agent to the ground-segment surrogate agents is much smaller than the amount of data processed by the situated agent. Furthermore, the amount of data presented to the user is minimized to exception data rather than full state. This provides a way to deal with bandwidth limitations and a way to reduce downlink data rates to reduce operator monitoring requirements. Another key feature of the architecture is that it supports agent automation both within the ground and space segment, which enables migration of operational rules, scripts, constraints, and detection methods between segments.
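For illustration, the exception-based downlink reduction described above can be sketched in a few lines of Python. The parameter names and limits below are invented, not DATA telemetry definitions; the point is only that the downlinked report is the (usually small) set of out-of-limit parameters rather than the full state:

```python
def compress_status(full_state, limits):
    """Reduce a full state sample to exception data: only parameters
    outside their nominal limits are downlinked (hypothetical scheme)."""
    exceptions = {}
    for name, value in full_state.items():
        lo, hi = limits[name]
        if not (lo <= value <= hi):
            exceptions[name] = value
    return exceptions

state = {"bus_v": 28.1, "temp_c": 61.3, "rate_dps": 0.02}
limits = {"bus_v": (26.0, 30.0), "temp_c": (-10.0, 50.0), "rate_dps": (-0.1, 0.1)}
report = compress_status(state, limits)   # only the out-of-limit temperature
```

Here the three-parameter state compresses to a one-parameter exception report, which is the mechanism by which both downlink bandwidth and operator monitoring load are reduced.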
DATA DETAILED SYSTEM DESCRIPTION
The DATA multi-agent system has been built using an "off-the-shelf" forward-chaining, rule-based inferencing and task control system called SCL (Spacecraft Command Language). SCL provides multitasked and scheduled script execution, rule-based inferencing, and constraint checking. SCL runs as a Unix application and as an embedded RTEMS (Real-Time Executive for Multi-processor Systems) task. SCL has been integrated with the NASA JPL (Jet Propulsion Laboratory) developed SELMON (SELective MONitoring) application. SELMON provides event detection, and SCL provides the inferencing required to implement the DATA ground- and space-segment agents. Surrogate agents have two functions: shadowing the state of the remote situated agent and providing additional automation. The surrogate agent also enables migration of automation, which can be ground tested with operator concurrence until the automation has been statistically validated. Following testing, the "trusted" automation may then be uplinked to the remote situated agent. The system provides confidence statistics, based on operator feedback during the concurrence phase of migration, for detection, classification, and reaction selection methods. The operator interface is distributed between GSFC (Goddard Space Flight Center) and CU (University of Colorado) and allows multiple operators to interact with the system through distributed GUIs (Graphical User Interfaces). An interaction protocol allows operators to work with one of three DATA solar science instruments, or with the DATA engineering subsystem. This operator interface is a VE (Virtual Environment) since it supports interaction between operators and with system agents.
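The concurrence phase of migration can be illustrated with a minimal sketch. The class name, the 95% confidence threshold, and the minimum trial count are all assumptions for illustration; the DATA system's actual statistics are not specified here. The idea is simply that each operator review of an automated decision updates a running confidence figure, and only statistically validated automation becomes eligible for uplink:

```python
class MigrationCandidate:
    """Track operator concurrence for a detection/classification method
    during ground testing; threshold and trial count are illustrative."""
    def __init__(self, name, threshold=0.95, min_trials=20):
        self.name = name
        self.threshold = threshold
        self.min_trials = min_trials
        self.confirmed = 0
        self.trials = 0

    def record(self, operator_concurred):
        # one operator review of one automated decision
        self.trials += 1
        if operator_concurred:
            self.confirmed += 1

    def trusted(self):
        # eligible for uplink to the situated agent once validated
        return (self.trials >= self.min_trials and
                self.confirmed / self.trials >= self.threshold)
```

With these assumed parameters, a method the operator has confirmed 20 times without rejection becomes "trusted," while a method with too few trials remains ground-resident regardless of its success rate.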
Figure 3. DATA Situated Agent Embedded-Segment Detail
The DATA embedded system situated agent (Figure 3) and the remote ground-segment surrogate agent communicate through an RS232 interface in the DATA testbed and will communicate through the GSFC (Goddard Space Flight Center) "ACCESS" telemetry MRD/LRD (Medium-Rate Downlink / Low-Rate Downlink) and uplink system during the Shuttle flight. Commands will flow from the UPOCC (University Payload Operations Control Center) to CGSE (Customer Ground Support Equipment) at GSFC, or directly from CGSE for ACCESS uplink. The DATA embedded system situated agent will downlink both full state and compressed status to enable validation of the compressed status scheme. The ground-segment surrogate agent includes additional automation and data management of full status history, as well as a planner and scheduler developed at NASA JPL called "Plan-IT II". Plan-IT II provides more sophisticated goal-oriented capability to the agent system since it can schedule tasks and activate or deactivate event-triggered tasks according to a mission goal optimization and constraint model. Also, within the ground segment, a testbed-simulated situated agent is used when the real embedded situated agent is not available, and for "off-line" automation testing. From initial testing, a basic set of observability and interaction properties (and related problems) has been identified and is further refined in this paper.
CLASSES OF SEMI-AUTONOMOUS SYSTEMS OPERATIONS PROPERTIES
Given the semi-autonomous operational scheme, the ability of operators to observe agent status and interact with the agent system effectively is critical. However, perhaps even more critical are the inter-agent observability and interaction properties. In this system, status is considered to include not only state information from sensor sampling, but also agent execution status, classification of detected events, and reaction selection. In general, an agent will make successive state observations to detect behavioral changes in the object being sensed. When behavioral changes are detected, called "events" in this paper, the agent must then use an inferencing scheme to "identify" or "classify" the event. The inferencing could be as simple as a table lookup, a more complex probabilistic inference, rule-based inference incorporating context, or some combination of methods. Methods of inferencing are not considered here, but the properties of this process are analyzed. The final result of inferencing is to select a reaction sequence for the agent to execute, which may be as simple as generating an operator message, or as complex as "safing" the system, adjusting control parameters, etc. Since agent automation involves linking perception to reaction, the two major classes of properties analyzed relate directly to this functionality of the agent.
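The observation-detection-classification-reaction chain above can be sketched end-to-end. The detection rule (a simple step-change threshold), the event table, and the reaction names below are all invented for illustration; real agents would substitute SELMON detection and SCL rules. Note how an event with no table entry becomes an operator exception, which is the "management by exception" behavior:

```python
# Minimal sketch of the perception-to-reaction chain; detection rule,
# event table, and reaction names are hypothetical.
def detect(history, threshold=5.0):
    # event = step change between successive observations
    if len(history) >= 2 and abs(history[-1] - history[-2]) > threshold:
        return "step_change"
    return None

EVENT_TABLE = {"step_change": "safe_mode"}   # classification -> reaction (table lookup)

def agent_step(history, log):
    event = detect(history)                   # event detection
    if event is None:
        return                                # nominal: nothing reported
    reaction = EVENT_TABLE.get(event)         # classification / reaction selection
    if reaction is None:
        log.append(("exception", event))      # unclassifiable -> operator handles it
    else:
        log.append(("react", reaction))       # reaction execution
```

A nominal observation series produces no output at all, which is exactly the property that lets the operator attend only to exceptions.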
Observability properties, first identified in prior work, are fundamental to operator knowledge of agent status as well as inter-agent status knowledge.
Distributed State Observation: Given an object, X, that can be sensed by several different agents, including AGi as Si(X); if the object X resides in a particular segment (ground or space for example) for which sensor-based state information is available to an agent in segment i or k, differences in observation Si(X) from Sk(X) will be due to latency and synchronization factors resulting from the cost of observing X by AGi and AGk from their respective operational segments.
Event Detection: Given a series of state observations Si,1(X); Si,2(X); ... ;Si,n(X), which constitute a time series of observations S, an agent may use a detection method to determine if changes in state indicate behavioral changes in the object X being sensed. Detection may be expressed Ei,1 = Di,1(S), where the event Ei,1 is detected by applying method Di,1 to time series S. Bayes' rule provides a method to quantify detection performance in terms of false alarm probability P(A|E~), probability of a real event P(E), probability of correct detection P(A|E), and probability of a real event given an alarm P(E|A), where E is an occurrence of the event to be detected and A is an alarm, such that:
P(E|A) = [P(A|E)*P(E)] / [P(A|E)*P(E) + P(A|E~)*P(E~)]
This mathematical relationship guarantees that given "imperfect detection", a tradeoff will always exist between raising false alarms and missing detection of real events.
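The tradeoff can be made concrete by evaluating the rule directly. The detector figures below are illustrative assumptions, not DATA flight values: even a detector with 99% sensitivity and a 1% false-alarm rate yields P(E|A) of only about 0.09 when the event prior is 0.001, so most alarms are false despite the "good" detector:

```python
def p_event_given_alarm(p_detect, p_false_alarm, p_event):
    """Bayes' rule from the text:
    P(E|A) = P(A|E)P(E) / (P(A|E)P(E) + P(A|E~)P(E~))."""
    num = p_detect * p_event
    return num / (num + p_false_alarm * (1.0 - p_event))

# Illustrative figures (assumed, not DATA values): a 99%-sensitive detector
# with a 1% false-alarm rate, watching for an event with prior 0.001.
p_event_given_alarm(0.99, 0.01, 0.001)   # roughly 0.09
```

This is why the system tracks confidence statistics: a low P(E|A) detector may still be useful for low-impact reactions, but should not trigger high-impact ones.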
Perception: When an event E is detected, it is meaningless until it can be "classified" with some confidence so that an appropriate reaction may be selected by the agent in order to automate tasks and deal with exceptions. While detection simply involves "noting" significant behavioral changes, perception requires assignment of meaning to an event given a context (state, state history, event history, and detection confidence). Perception depends upon context and is composed of the current state Si,1(X), historical state observations S, historical events E, and confidence in the current event detection P(E|A). Perception is simply expressed PEi(X) in terms of the sensed object alone. Even though perception is applied directly to an object, it is a composite function expressed PEi(Si,1(X), S, E, P(E|A)), where S = (Si,1(X), Si,2(X), ...) and E = (Di,1(S), Di,2(S), ...).
Reaction Observability (Intermediate Execution Status): An agent may react with an autonomously generated command gi(X) given it has high enough confidence in the related perception. In order for agent AGi to verify that gi(X) executed successfully, it must be able to recognize a resultant time series signature S, caused by gi(X), from observations of the object X. If AGi has a reaction sequence linked to a particular perception PEi(X) -> gi,1(X); gi,2(X); ... ;gi,n(X), then it may be necessary to observe intermediate completion of this transaction with X. Otherwise, the agent will not be able to verify that the intended reaction executed successfully. This verification is a perception problem.
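As a sketch of reaction observability, the expected signature of a command can be checked against the observed time series of X. The signature model here, a simple level shift of known magnitude with a fixed tolerance, is an assumption chosen for brevity; a real verifier would recognize richer signatures:

```python
def verify_reaction(observations, expected_delta, tol=0.5):
    """After issuing g_i(X), verify the commanded change appears in the
    observed time series of X. Level-shift signature model is assumed."""
    if len(observations) < 2:
        return False                       # nothing to compare yet
    delta = observations[-1] - observations[0]
    return abs(delta - expected_delta) <= tol

verify_reaction([12.0, 12.4, 14.1], 2.0)   # commanded +2.0 shift observed
```

If the signature is absent, the agent cannot claim the reaction executed; verification is itself a perception, as the text notes.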
Interaction properties, first identified in prior work, build upon observability properties since reliable perception enables reliable reaction. Interaction includes operator-agent and inter-agent interaction.
Reaction Order: If perception PEi(X) by agent AGi is linked to a reaction, expressed PEi(X) -> gi(X), then there is a race condition between that reaction and a possible reaction PEk(X) -> gk(X) by another agent AGk.
Reaction Reliability: Since detection is imperfect, classifying the effect of reactions is important. If a reaction is triggered by a false positive detection, then the decision to react must be based on confidence and impact of an incorrect reaction or impact of no reaction to be "reliable."
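One way to code this reliability criterion is an expected-cost comparison: react only when the expected cost of acting on a possible false alarm is below the expected cost of ignoring a possibly real event. The cost model below is an illustrative assumption, not the DATA decision-boundary logic:

```python
def should_react(p_event_given_alarm, cost_false_reaction, cost_missed_event):
    """Decide whether to react given detection confidence P(E|A) and the
    impacts of an incorrect reaction vs. no reaction (costs assumed)."""
    expected_cost_react = (1.0 - p_event_given_alarm) * cost_false_reaction
    expected_cost_ignore = p_event_given_alarm * cost_missed_event
    return expected_cost_react < expected_cost_ignore
```

Under this rule, a low-confidence detection can still trigger a cheap reaction (e.g. logging), while a high-impact reaction such as safing requires much higher confidence, which matches the levels-of-conservatism idea developed later in the paper.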
Reaction Incompleteness: If AGi is a complex agent such that it has an automated reaction linked to a particular perception PEi(X) -> gi,1(X); gi,2(X); ... ;gi,n(X), then if one of the intermediate reactions does not execute successfully, the complex reaction is incomplete. This relates directly to perception and the ability to verify intermediate execution of simple reactions that compose a complex reaction.
Reaction Dispatch Latency and Preemptability: Dispatch latency for a reaction must be predictable, which for a multi-tasking system requires preemptability, task priority inversion handling, elimination of hidden scheduling, and a programmer's interface for specifying real-time tasks.
Reaction Time: The response time of a triggered reaction must be able to meet real-time deadlines. This requires predictability in execution and scheduling. Reactions will often have completion time constraints, and a "late" reaction may be detrimental. Even if the property of "time to complete a reaction" is not predictable, the "time to execute a reaction before aborting" must be predictable.
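The requirement that "time to execute a reaction before aborting" be predictable can be sketched with a deadline wrapper. This is a ground-segment-style illustration using Python threads; a flight implementation would use the RTOS timeout primitives instead, and the function name is invented:

```python
import threading

def run_with_deadline(reaction, deadline_s):
    """Bound the time to execute a reaction: run it in a worker thread and
    report failure if the deadline passes first (illustrative sketch)."""
    done = threading.Event()

    def worker():
        reaction()          # the reaction body itself
        done.set()          # signal completion

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout=deadline_s)   # True iff completed in time

run_with_deadline(lambda: None, 0.5)       # a fast reaction meets its deadline
```

Note this only bounds how long the agent waits before declaring the reaction late; actually aborting the in-flight work requires the task abort protocols that, as discussed below, neither Solaris nor RTEMS provides directly.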
Agent Localization: An agent AGi that can observe X with the least cost will always have the best observability of X, but may not be the best agent to control X based on its specific PEi(X) -> gi(X) perception to reaction linking knowledge base and ability to process raw data from X. In this case, migration of agents or detection and reaction methods is desirable. Such migration is also desirable for verifying automation and evolving the system to higher levels of autonomy.
PROBLEMS RELATED TO OBSERVABILITY PROPERTIES
Most of the problems associated with observability are due to bandwidth limitations and latency which are common in space systems. In addition, there is the problem of observing execution status. Observing and verifying successful execution is not simple since it is in fact the perception problem.
DISTRIBUTED PERCEPTION PROBLEM
As already noted, semi-autonomous operation is complicated by unpredictability of the environment and of the system itself. Deviations from expectation (anomalies or faults) are manifested by event detection. It is interesting to note that not all anomalies are negative events; some may in fact represent opportunities. Either way, classification of events requires state observation of local and remote sensors, which is a well-known distributed systems problem. However, observability often requires more than state observation; it often requires successive state observations and detection of behavioral changes in a time series. In general, detection performance depends upon sensor reliability, observation latency, frequency, and behavior references. The best detection figure of merit is P(E|A), since this quantifies the reliability of detection methods in terms of how likely the actual presence of an event is given that the detector indicates it is present. Probability figures can be refined "on-line" so that confidence can be adjusted based on reviews of detection correctness. Detection performance is quantified using Bayes' rule. The latency problem is dealt with by placing time-critical reactions in the segment where the cost of observation is lowest.
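The "on-line" refinement of P(E|A) can be sketched as a running estimate over operator reviews of past alarms. The Laplace (+1/+2) smoothing used for small samples is an assumption of this sketch, not a stated DATA mechanism:

```python
class DetectionStats:
    """Refine P(E|A) on-line: each alarm is later reviewed as real or
    false, and the running fraction estimates P(E|A). The +1/+2
    smoothing for small samples is an illustrative assumption."""
    def __init__(self):
        self.alarms = 0
        self.real = 0

    def review(self, alarm_was_real):
        # one operator review of a past alarm
        self.alarms += 1
        if alarm_was_real:
            self.real += 1

    def p_event_given_alarm(self):
        return (self.real + 1) / (self.alarms + 2)
```

As reviews accumulate, the estimate converges toward the true fraction of real events among alarms, so reaction decision boundaries can be retuned during the mission.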
In general, a reaction may be observed to be logically complete as well as verifying cause and effect state change (where sensor values change as would be expected when a device is actuated). Reactions may however be complex sequences of atomic operations. Observability of the execution status of reactions must therefore include a series of verifications, and may be further generalized to include status for commands sent between segments such as reaction sent, reaction received, reaction accepted, reaction executed, and reaction verified. The DATA system includes the capability to verify execution with rule-based "postchecks" which detect successful execution signatures of commands.
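The cross-segment status ladder above (sent, received, accepted, executed, verified) can be sketched as a small state machine in which the final "verified" step requires a postcheck. The function names are invented; the postcheck predicate stands in for the rule-based execution-signature check:

```python
# Sketch of the cross-segment command status ladder described above.
STAGES = ["sent", "received", "accepted", "executed", "verified"]

def advance(status, stage, postcheck=None):
    """Advance a command through the ladder strictly in order; 'verified'
    additionally requires the postcheck to see the execution signature."""
    expected = STAGES[len(status)]
    if stage != expected:
        raise ValueError(f"out-of-order status: {stage}, expected {expected}")
    if stage == "verified" and not postcheck():
        raise RuntimeError("postcheck failed: execution signature not seen")
    status.append(stage)
    return status
```

A command that reaches "executed" but whose postcheck never fires is thus distinguishable from one that was lost in transit, which is exactly the observability the text calls for.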
PROBLEMS RELATED TO INTERACTION PROPERTIES
Most problems associated with interaction are due to how quickly reactions can be dispatched by the agent host system, the amount of processing resources the reaction is given, and what action is taken if a reaction can't complete by a timing deadline, or if there is any execution failure.
DISPATCH LATENCY AND PREEMPTABILITY
The DATA operating systems being used include kernel features to minimize the latency of dispatching a task which executes a reaction. For DATA, the commercially available Solaris 2.x monolithic kernel operating system is being used within the ground segment, and the RTEMS operating system in the space segment; both include real-time task control and scheduling features. Ideally, the DATA system could be improved to use a common operating system kernel in both the ground and space segments, such as the RT-Mach microkernel or the Open Software Foundation MK6.1 microkernel. Both systems provide "soft" real-time performance such that the time to dispatch can be statistically estimated. For DATA ground-segment automation, Solaris 2.4 provides 2-millisecond dispatching in most situations. The dispatch latency is directly related to the preemptability of the system itself when it is running "kernel mode" code. Well-known scheduling problems such as priority inversion will also affect dispatch latency. Both RTEMS and Solaris 2.x deal with these problems; however, there are many differences in policy and implementation, and therefore segment inconsistencies exist.
The ability of the operating system kernel to meet real-time deadlines for reactions to detected events requires predictability in execution and scheduling. Reactions will often have time constraints for completion, and a "late" reaction may be useless or detrimental. It would be useful to predict whether there is time to react given the current system load, but this is exceedingly difficult to do, especially if the reaction code is not deterministic (i.e., it has state-dependent code branching). Solaris supports reaction deadline time control with a real-time scheduling class and a non-real-time class, while RTEMS only supports real-time tasks; neither has specific task abort protocols.
REACTION ORDER AND RELIABILITY
Reaction order is controlled using reaction priorities. Determination of priorities to achieve a specific order and interaction is not trivial, since reactions may be added and priorities changed during operations, and needs to be researched further. The reliability of a reaction is determined and coded according to failure modes and effects analysis. For example, the confidence in detection can be considered in conjunction with the impact of reacting to a false alarm, the impact of not reacting to a true alarm, and context incorporated into the decision-boundary logic for the reaction. This capability enables various levels of conservatism with automation so that, for example, reactions with high impact can be disabled. Finally, constraints on reactions may be defined and activated or deactivated in order to protect resources and reduce system-level conflicts. This prevents a reaction from executing when conditions have changed since command issue in a way that makes execution of the command undesirable at the time of receipt.
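Priority-controlled reaction order can be sketched with a priority queue in which ties are broken by arrival order, so the race between competing reactions resolves deterministically. Class, priorities, and reaction names are all illustrative:

```python
import heapq
import itertools

class ReactionQueue:
    """Dispatch competing reactions by (priority, arrival order); lower
    priority value dispatches first. Names and values are illustrative."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival order breaks priority ties

    def trigger(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def dispatch(self):
        return heapq.heappop(self._heap)[2]
```

Note this fixes dispatch order but not the harder problem the text identifies: choosing priority values that produce the *intended* order as reactions are added and changed during operations.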
The remaining problem is what to do when a complex reaction sequence encounters a problem part way through its execution. This can be handled by treating a complex reaction as an atomic transaction which can first be executed logically within the local segment (to verify it does not violate constraints or terminate due to state changes). Following logical verification, the command can then be committed and actually executed, or otherwise aborted and not committed with all logical changes "undone." In the DATA system all reactions from remote segments are atomic or are atomic executions of local complex reactions. Local complex reactions may include intermediate verification of commands, but currently when an error is encountered, there is no transactional abort.
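The two-phase scheme above, logical execution then commit or abort, can be sketched by running all steps against a copy of the state and committing only if no constraint is violated. The state variables and constraint below are invented for illustration:

```python
def execute_atomic(reaction_steps, state, constraints):
    """Atomic complex reaction: execute all steps on a copy of the state
    (logical verification); commit only if every constraint holds after
    every step, otherwise abort with the real state untouched."""
    shadow = dict(state)                # logical execution on a copy
    for step in reaction_steps:
        step(shadow)
        if not all(check(shadow) for check in constraints):
            return False                # abort: original state unchanged
    state.update(shadow)                # commit: all steps take effect at once
    return True
```

This is what the DATA system currently lacks for local complex reactions: intermediate verification exists, but an error mid-sequence leaves the partial effects in place rather than rolling them back.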
In order to localize agents according to performance, such that reactions can be moved to segments with the best observability of what they control, reactions are encoded so that they can be migrated between agents. This enables reactions which may initially be triggered in a remote segment to be migrated to a segment where they can be executed locally. Detection parameters and perception inferencing are likewise migratable, allowing complete localization and load balancing within the system. In the DATA system, rules, constraints, and scripts may be added to any agent in a given segment of operations by explicit command. Agents themselves cannot be migrated, but with controlled updates to existing agents, they can effectively be metamorphosed as desired. Currently, rules, constraints, and scripts cannot be deleted, but rules and constraints can be deactivated. The ability to migrate agents as applications, or as dynamically-loadable modules, is being considered for future work.
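The rule add/activate/deactivate mechanism can be sketched as follows. Method names, the rule representation, and the overtemperature example are invented; the point is that rules travel as data, are added by explicit command, and are deactivated rather than deleted:

```python
class Agent:
    """Rule-migration sketch: rules are data added by explicit command;
    they can be deactivated but never deleted (names illustrative)."""
    def __init__(self):
        self.rules = {}

    def uplink_rule(self, name, condition, reaction):
        # add a rule to this agent's knowledge base by explicit command
        self.rules[name] = {"condition": condition, "reaction": reaction,
                            "active": True}

    def deactivate(self, name):
        self.rules[name]["active"] = False   # retained, but inert

    def step(self, state):
        # fire the reactions of every active rule whose condition holds
        return [r["reaction"] for r in self.rules.values()
                if r["active"] and r["condition"](state)]
```

Migrating a ground-tested rule to the space segment then amounts to calling the equivalent of `uplink_rule` on the situated agent with the same rule data, effectively metamorphosing the agent without moving it.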
FUTURE RESEARCH PLANS
Future research plans include an improved implementation of the software, called AIM (Automation Interaction Manager), compared to the current DATA operations software implementation, for which technology has been frozen due to the imminent launch date. Furthermore, due to the limited opportunity for experimentation with a real space system such as DATA, the author is working on an automation testbed with characteristics very similar to those of space systems.
The proposed architecture provides a generic distributed situated agent system for reliable semi-autonomous management of distributed devices and objects at the task level. The flexible design should allow for use of this system in a wide variety of space system operations environments.
 Bharat, K. and Cardelli, L., "Migratory Applications," Digital Equipment Corporation, SRC Research Report 138, February 15, 1996.
 Buckley, B. and Wheatcraft, L., "Spacecraft Command Language - A Smart Control System," Interface and Control Systems, Melbourne, Florida, March 1991.
 Doyle, R., "Determining the Loci of Anomalies Using Minimal Causal Models," International Joint Conference on Artificial Intelligence, Montreal, Canada, August, 1995.
 Maes, P., "Situated Agents Can Have Goals," Robotics and Autonomous Systems, Vol. 6, 1990.
 Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, Inc., San Mateo, California, 1988.
 Siewert, S. and Nutt G., "A Space Systems Testbed for Situated Agent Observability and Interaction", The ASCE 2nd Conference, Exposition and Demonstration on Robotics for Challenging Environments, Albuquerque, N.M., 1996.
 Singhal, M. and Shivaratri, N., Advanced Concepts in Operating Systems, McGraw-Hill, Inc., New York, 1994.
 Tokuda, H. and Mercer, C., "ARTS: A Distributed Real-Time Kernel", ACM Operating Systems Review, Vol. 23, No. 3, July 1989.
 Vahalia, U., Unix Internals: The New Frontiers, Prentice-Hall, Inc., Upper Saddle River, N.J., 1996.
 Tokuda, H., Nakajima, T., and Rao, P., "Real-Time Mach: Towards a Predictable Real-Time System", Carnegie Mellon University, Pittsburgh, Pennsylvania, 1995.
 Wells, D., "A Trusted, Scalable, Real-Time Operating System Environment," Open Software Foundation Research Institute, Cambridge, Massachusetts, 1994.
 Wittig, T., ed., ARCHON: An Architecture for Multi-Agent Systems, Ellis Horwood Limited, Chichester, West Sussex, England, 1992.