A tale of two explanations: Enhancing human trust by explaining robot behavior

Embodied haptic model details

The embodied haptic model leverages low-level haptic signals obtained from the robot's manipulator to make action predictions, based on the human poses and forces collected with a tactile glove. This embodied haptic sensing allows the robot to reason about (i) its own haptic feedback by imagining itself as a human demonstrator and (ii) what a human would have done under similar poses and forces. The critical challenge is to learn a mapping between equivalent robot and human states, which is difficult because of the different embodiments. From the perspective of generalization, manually designed embodiment mappings are undesirable. To learn from human demonstrations on arbitrary robot embodiments, we propose an embodied haptic model general enough to learn a mapping between an arbitrary robot embodiment and a human demonstrator.

The embodied haptic model consists of three major components: (i) an autoencoder to encode the human demonstration in a low-dimensional subspace (we refer to the reduced embedding as the human embedding); (ii) an embodiment mapping that maps robot states onto a corresponding human embedding, providing the robot with the ability to imagine itself as a human demonstrator; and (iii) an action predictor that takes a human embedding and the currently executing action as input and predicts the next action to execute, trained using the action labels from human demonstrations. Figure 2B shows the embodied haptic network architecture. Using this network architecture, the robot infers what action a human would likely execute on the basis of the inferred human state. This embodied action prediction model picks the next action according to

$a_{t+1} \sim p(a_{t+1} \mid f_t, a_t)$ (1)

where $a_{t+1}$ is the next action, $f_t$ is the robot's current haptic sensing, and $a_t$ is the current action.

The autoencoder network takes as input an 80-dimensional vector from the human demonstration (26 dimensions for the force sensors and 54 for the poses of the links in the human hand), using the post-condition vector of each action in the demonstration, i.e., the average of the last N frames (we chose N = 2 to minimize the variance) (see the autoencoder portion of Fig. 2B). This input is reduced to an eight-dimensional human embedding. Thus, given a human demonstration, the autoencoder provides a dimensionality reduction to an eight-dimensional representation.

The embodiment mapping maps the robot's four-dimensional post-condition vector, i.e., the average of the last N frames (we chose N = 10, which differs from the human post-condition because of the faster sampling rate of the robot gripper compared with the tactile glove), to an imagined human embedding (see the embodiment mapping portion of Fig. 2B). This mapping allows the robot to imagine its current haptic state as an equivalent low-dimensional human embedding. The robot's four-dimensional post-condition vector consists of the gripper position (one dimension) and the forces applied by the gripper (three dimensions). The embodiment mapping network uses a 256-dimensional latent representation, and this latent representation is then mapped to the eight-dimensional human embedding.

To train the embodiment mapping network, the robot first executes a series of supervised actions; if an action produces the correct final state, the robot's post-condition vector is saved as a training input for the network. Next, human demonstrations of equivalent actions are fed through the autoencoder to produce a set of human embeddings. These human embeddings are treated as the ground-truth target outputs for the embodiment mapping network, regardless of the current reconstruction accuracy of the autoencoder network. Then, the robot execution data are fed into the embodiment mapping network, producing an imagined human embedding. The embodiment mapping network is optimized to reduce the loss between its output, computed from the robot post-condition input, and the target output.

For the action predictor, the 8-dimensional human embedding and the 10-dimensional current action are mapped to a 128-dimensional latent representation, and the latent representation is then mapped to a final 10-dimensional action probability vector (i.e., the next action) (see the action prediction portion of Fig. 2B). This network is trained using human demonstration data: a demonstration is fed through the autoencoder to produce a human embedding, and that human embedding and the one-hot vector of the currently executing action are fed as input to the prediction network; the ground truth is the next action executed in the human demonstration.
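To make the architecture concrete, the sketch below implements the three components with the dimensions stated above (80-dimensional human post-condition, 8-dimensional embedding, 4-dimensional robot post-condition, 256- and 128-dimensional latent layers, 10 actions). It is a minimal PyTorch sketch, not the authors' implementation; any width the text does not specify (e.g., the autoencoder's intermediate layer) is an assumption.

```python
import torch
import torch.nn as nn

class HapticAutoencoder(nn.Module):
    """80-D human post-condition (26 forces + 54 hand-link poses) -> 8-D human embedding -> reconstruction."""
    def __init__(self, in_dim=80, hidden=64, embed_dim=8):  # hidden=64 is an assumption
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, s_h):
        z_h = self.encoder(s_h)
        return z_h, self.decoder(z_h)

class EmbodimentMapping(nn.Module):
    """4-D robot post-condition (1 gripper position + 3 forces) -> 256-D latent -> 8-D imagined human embedding."""
    def __init__(self, robot_dim=4, latent=256, embed_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(robot_dim, latent), nn.ReLU(),
                                 nn.Linear(latent, embed_dim))

    def forward(self, s_r):
        return self.net(s_r)

class ActionPredictor(nn.Module):
    """(8-D embedding + 10-D one-hot current action) -> 128-D latent -> 10-D next-action distribution."""
    def __init__(self, embed_dim=8, n_actions=10, latent=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim + n_actions, latent), nn.ReLU(),
                                 nn.Linear(latent, n_actions))

    def forward(self, z, a_onehot):
        return torch.softmax(self.net(torch.cat([z, a_onehot], dim=-1)), dim=-1)
```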

The network in Fig. 2B is trained in an end-to-end fashion with three different loss functions in a two-step process: (i) a forward pass through the autoencoder to update the human embedding $z_h$. After computing the error $L_{\mathrm{reconstruct}}$ between the reconstruction $\hat{s}_h$ and the ground-truth human data $s_h$, we back-propagate the gradient and optimize the autoencoder:

$L_{\mathrm{reconstruct}}(s_h, \hat{s}_h) = \frac{1}{2}(s_h - \hat{s}_h)^2$ (2)

(ii) A forward pass through the embodiment mapping and the action prediction network. The embodiment mapping is trained by minimizing the difference $L_{\mathrm{mapping}}$ between the embodied robot embedding $z_r$ and the target human embedding $z_h$; the target human embedding $z_h$ is acquired through a forward pass through the autoencoder using a human demonstration post-condition $s_h$ of the same action label. We compute the cross-entropy loss $L_{\mathrm{prediction}}$ between the predicted action label $\hat{a}$ and the ground-truth action label $a$ to optimize this forward pass:

$L_{\mathrm{planning}}(a, \hat{a}) = L_{\mathrm{mapping}} + \lambda L_{\mathrm{prediction}}$
$L_{\mathrm{mapping}} = \frac{1}{2}(z_r - z_h)^2$
$L_{\mathrm{prediction}} = H(p(\hat{a}), q(a))$ (3)

where $H$ is the cross entropy, $p$ is the model prediction distribution, $q$ is the ground-truth distribution, and $\lambda$ is the balancing parameter between the two losses (see text S2.2 for detailed parameters and network architecture).
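A minimal sketch of this two-step training pass, reusing the modules from the previous sketch. The optimizer, learning rate, and value of the balancing parameter $\lambda$ are assumptions, not the paper's settings (see text S2.2 for those).

```python
autoencoder, mapping, predictor = HapticAutoencoder(), EmbodimentMapping(), ActionPredictor()
params = list(autoencoder.parameters()) + list(mapping.parameters()) + list(predictor.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # optimizer and learning rate are assumptions
lam = 1.0                                       # balancing parameter lambda (assumed value)

def training_step(s_h, s_r, a_cur_onehot, a_next_idx):
    # Step (i): forward pass through the autoencoder, optimize L_reconstruct (Eq. 2).
    z_h, s_h_hat = autoencoder(s_h)
    loss_reconstruct = 0.5 * ((s_h - s_h_hat) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss_reconstruct.backward()
    optimizer.step()

    # Step (ii): forward pass through the embodiment mapping and action predictor,
    # optimize L_planning = L_mapping + lam * L_prediction (Eq. 3).
    with torch.no_grad():
        z_h, _ = autoencoder(s_h)            # target human embedding, treated as ground truth
    z_r = mapping(s_r)                       # imagined human embedding from the robot post-condition
    loss_mapping = 0.5 * ((z_r - z_h) ** 2).sum(dim=-1).mean()
    p_next = predictor(z_h, a_cur_onehot)    # predicted next-action distribution
    loss_prediction = nn.functional.nll_loss(torch.log(p_next + 1e-8), a_next_idx)
    loss_planning = loss_mapping + lam * loss_prediction
    optimizer.zero_grad()
    loss_planning.backward()
    optimizer.step()
    return loss_reconstruct.item(), loss_planning.item()
```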

A similar embodied haptic model was presented in (23) but with two separate loss functions, which is more difficult to train compared with the single loss function presented here. A clear limitation of the haptic model is the lack of long-term action planning. To address this problem, we discuss the symbolic task planner below and then discuss how we integrated the haptic model with the symbolic planner to jointly find the optimal action.

To encode the long-term temporal structure of the task, we endow the robot with a symbolic action planner that encodes semantic knowledge of the task execution sequence. The symbolic planner uses stochastic context-free grammars to represent tasks, where the terminal nodes (words) are actions and sentences are action sequences. Given an action grammar, the planner finds the optimal action to execute next on the basis of the action history, analogous to predicting the next word given a partial sentence.

The action grammar is induced from labeled human demonstrations, and we assume that the robot has an equivalent action for each human action. Each demonstration forms a sentence, $x_i$, and the collection of sentences forms a corpus, $x_i \in X$. The segmented demonstrations are used to induce a stochastic context-free grammar using the method presented in (21). This method pursues T-AOG fragments to maximize the likelihood of the grammar producing the given corpus. The objective function is the posterior probability of the grammar given the training data $X$:

$p(G \mid X) \propto p(G)\, p(X \mid G) = \frac{1}{Z} e^{-\alpha \|G\|} \prod_{x_i \in X} p(x_i \mid G)$ (4)

where $G$ is the grammar, $x_i = (a_1, a_2, \ldots, a_m) \in X$ represents a valid sequence of actions of length $m$ from the demonstrator, $\alpha$ is a constant, $\|G\|$ is the size of the grammar, and $Z$ is the normalizing factor. Figure 3 shows examples of induced grammars of actions.
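As a small illustration of the objective in Eq. 4, the sketch below scores a candidate grammar by its unnormalized log posterior. The helpers `sentence_probability` and `grammar_size` are hypothetical stand-ins for an SCFG parser's likelihood routine and the grammar-size measure $\|G\|$; the actual induction procedure follows (21).

```python
import math

def log_grammar_posterior(grammar, corpus, alpha, sentence_probability, grammar_size):
    """Unnormalized log posterior of Eq. 4: -alpha * ||G|| + sum_i log p(x_i | G).
    sentence_probability(grammar, x) and grammar_size(grammar) are assumed helpers."""
    log_prior = -alpha * grammar_size(grammar)
    log_likelihood = sum(math.log(sentence_probability(grammar, x)) for x in corpus)
    return log_prior + log_likelihood
```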

During the symbolic planning process, this grammar is used to compute which action is most likely to open the bottle, based on the action sequence executed thus far and the space of possible future actions. A pure symbolic planner picks the optimal action based on the grammar prior:

$a_{t+1}^* = \arg\max_{a_{t+1}} p(a_{t+1} \mid a_{0:t}, G)$ (5)

where $a_{t+1}$ is the next action and $a_{0:t}$ is the action sequence executed thus far. This grammar prior can be obtained as the ratio of two grammar prefix probabilities: $p(a_{t+1} \mid a_{0:t}, G) = \frac{p(a_{0:t+1} \mid G)}{p(a_{0:t} \mid G)}$, where the grammar prefix probability $p(a_{0:t} \mid G)$ measures the probability that $a_{0:t}$ occurs as a prefix of an action sequence generated by the action grammar $G$. On the basis of a classic parsing algorithm, the Earley parser (31), and dynamic programming, the grammar prefix probability can be computed efficiently by the Earley-Stolcke parsing algorithm (32). An example of pure symbolic planning is shown in fig. S4.
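A sketch of Eq. 5 under the assumption that a prefix-probability routine (e.g., an Earley-Stolcke style parser) is available as a callable `prefix_probability(grammar, sequence)`; this interface is hypothetical, not the paper's API.

```python
def grammar_prior(prefix_probability, grammar, history, candidate_actions):
    """p(a_{t+1} | a_{0:t}, G) for each candidate action, as a ratio of prefix probabilities (Eq. 5)."""
    denom = prefix_probability(grammar, history)
    return {a: prefix_probability(grammar, history + [a]) / denom
            for a in candidate_actions}

def next_action_argmax(prefix_probability, grammar, history, candidate_actions):
    # Pure symbolic planning: pick the action with the highest grammar prior.
    prior = grammar_prior(prefix_probability, grammar, history, candidate_actions)
    return max(prior, key=prior.get)
```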

However, because of the fixed structure and probabilities encoded in the grammar, always choosing the action sequence with the highest grammar prior is problematic because it provides no flexibility. An alternative pure symbolic planner picks the next action to execute by sampling from the grammar prior:

$a_{t+1} \sim p(a_{t+1} \mid a_{0:t}, G)$ (6)

In this way, the symbolic planner samples different grammatically correct action sequences, which increases its adaptability. In the experiments, we chose to sample action sequences from the grammar prior.
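Sampling from the grammar prior (Eq. 6) is a small change to the sketch above; `grammar_prior` is the hypothetical helper defined in the previous sketch.

```python
import random

def next_action_sample(prefix_probability, grammar, history, candidate_actions):
    """Sample the next action from the grammar prior (Eq. 6) instead of taking the argmax."""
    prior = grammar_prior(prefix_probability, grammar, history, candidate_actions)
    actions, weights = zip(*prior.items())
    return random.choices(actions, weights=weights, k=1)[0]
```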

In contrast to the haptic model, this symbolic planner cannot adapt to real-time sensor data. However, it encodes long-term temporal constraints that are missing from the haptic model, because only grammatically correct sentences have nonzero probabilities. The GEP adopted in this paper naturally combines the benefits of both the haptic model and the symbolic planner (see the next section).

The robot imitates the human demonstrator by combining the symbolic planner and the haptic model. The integrated model finds the next optimal action considering both the action grammar $G$ and the haptic input $f_t$:

$a_{t+1}^* = \arg\max_{a_{t+1}} p(a_{t+1} \mid a_{0:t}, f_t, G)$ (7)

Conceptually, this can be thought of as a posterior probability that considers both the grammar prior and the haptic signal likelihood. The next optimal action is computed by an improved GEP (22); GEP is an extension of the classic Earley parser (31). In the present work, we further extend the original GEP to make it applicable to multisensory inputs and to provide explanations in real time for robot systems, instead of offline video processing (see details in text S4.1.3).
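Conceptually, Eq. 7 weighs the grammar prior by the haptic model's next-action likelihood. The sketch below shows this greedy, single-step view under the assumption of a `haptic_model(f_t, a_t)` callable returning a next-action distribution; the actual GEP interleaves the two terms throughout parsing rather than combining them only at the final step.

```python
def next_action_joint(prefix_probability, grammar, history, haptic_model, f_t, a_t, actions):
    """Eq. 7 sketch: combine the grammar prior with the haptic next-action distribution, take the argmax."""
    prior = grammar_prior(prefix_probability, grammar, history, actions)
    haptic = haptic_model(f_t, a_t)                 # dict: action -> p(action | f_t, a_t), assumed interface
    posterior = {a: prior[a] * haptic[a] for a in actions}
    return max(posterior, key=posterior.get)
```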

The computational process of GEP finds the optimal label sentence according to both a grammar and a classifier's output of label probabilities at each time step. In our case, the labels are actions, and the classifier output is given by the haptic model. Optimality here means maximizing the joint probability of the action sequence under the grammar prior and the haptic model output while remaining grammatically correct.

The core idea of the algorithm is to directly and efficiently search for the optimal label sentence in the language defined by the grammar. The grammar constrains the search space to ensure that the sentence is always grammatically correct. Specifically, a heuristic search is performed on the prefix tree expanded according to the grammar, where the path from the root to a node represents a partial sentence (prefix of an action sequence).

GEP is a grammar parser capable of combining the symbolic planner with low-level sensory input (haptic signals in this paper). The search process in GEP starts from the root node of the prefix tree, which is an empty terminal symbol indicating that no terminals have been parsed. The search terminates when it reaches a leaf node. In the prefix tree, all leaf nodes are parsing terminals e that represent the end of a parse, and all non-leaf nodes represent terminal symbols (i.e., actions). The probability of expanding a non-leaf node is the prefix probability, i.e., how likely the current path is to be a prefix of the action sequence. The probability of reaching a leaf node (parsing terminal e) is the parsing probability, i.e., how likely the path up to the last non-leaf node is to be exactly the executed actions followed by the next action. In other words, the parsing probability measures the probability that the last non-leaf node in the path will be the next action to execute. It is important to note that this prefix probability is computed on the basis of both the grammar prior and the haptic prediction; in contrast, in the pure symbolic planner, the prefix probability is computed on the basis of the grammar prior alone. An example of the computed prefix and parsing probabilities and the output of GEP is given in Fig. 8, and the search process is illustrated in fig. S5. For an algorithmic description of this process, see algorithm S1.
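The sketch below illustrates the best-first search over the grammar prefix tree (cf. algorithm S1). The callables `prefix_prob` and `parse_prob` are assumed to return the combined grammar-and-haptic prefix and parsing probabilities for a given action sequence; this interface, and the depth cap, are simplifications rather than the actual GEP implementation.

```python
import heapq

def gep_search(prefix_prob, parse_prob, actions, max_depth=20):
    """Best-first search over the grammar prefix tree.
    prefix_prob(seq): probability that seq is a prefix of the action sequence (grammar x haptic).
    parse_prob(seq): probability that seq is the executed actions plus the next action."""
    heap = [(-1.0, [], False)]                    # root: empty prefix with probability 1
    while heap:
        neg_p, seq, is_leaf = heapq.heappop(heap)
        if is_leaf:
            return seq, -neg_p                    # first leaf popped is the most probable sequence
        if seq:                                   # the parsing terminal e closes this prefix as a plan...
            heapq.heappush(heap, (-parse_prob(seq), seq, True))
        if len(seq) < max_depth:                  # ...and each grammatical action extends the prefix
            for a in actions:
                p = prefix_prob(seq + [a])
                if p > 0.0:                       # grammatically incorrect extensions have zero probability
                    heapq.heappush(heap, (-p, seq + [a], False))
    return None, 0.0
```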

Fig. 8. (A) A classifier is applied to a six-frame signal and outputs a probability matrix as the input. (B) Table of the cached probabilities of the algorithm. For all expanded action sequences, it records the parsing probabilities at each time step and the prefix probabilities. (C) Grammar prefix tree with the classifier likelihood. The GEP expands a grammar prefix tree and searches in this tree. It finds the best action sequence when it hits the parsing terminal e. It finally outputs the best label "grasp, pinch, pull" with a probability of 0.033. The probabilities of child nodes do not sum to 1 because grammatically incorrect nodes are eliminated from the search and the probabilities are not renormalized (22).

The original GEP is designed for offline video processing. Here, we made modifications to enable online planning for a robotic task. The major difference between parsing and planning lies in the uncertainty about past actions: During parsing, there is uncertainty about the observed actions. During planning, however, there is no uncertainty about executed actions: the robot directly chooses which actions to execute, thereby removing any ambiguity regarding which action was executed at a previous time step. Hence, we need to prune the impossible parsing results after executing each action; after each executed action, we change the probability vector of that action's time step to a one-hot vector. This modification effectively prunes the action sequences that are inconsistent with the actions executed thus far by the robot.
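A minimal sketch of this pruning step: after the robot executes an action, the classifier probability matrix (time steps by actions, as in Fig. 8A) is overwritten with a one-hot row for that time step. The function name and matrix layout are illustrative assumptions.

```python
import numpy as np

def commit_executed_action(prob_matrix, t, executed_action):
    """Replace the probability vector at time step t with a one-hot vector for the executed action,
    so that parses inconsistent with the executed history receive zero probability."""
    prob_matrix[t, :] = 0.0
    prob_matrix[t, executed_action] = 1.0
    return prob_matrix

# Example with a 6-step, 10-action probability matrix (placeholder values):
# probs = np.full((6, 10), 0.1)
# probs = commit_executed_action(probs, t=0, executed_action=3)
```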

Human participants were recruited from the University of California, Los Angeles (UCLA) Department of Psychology subject pool and were compensated with course credit for their participation. A total of 163 students were recruited, each randomly assigned to one of the five experimental groups. Thirteen participants were removed from the analysis for failing a recognition task probing their understanding of the haptic display panel. Hence, the analysis included 150 participants (mean age of 20.7 years). The symbolic and haptic explanation panels were generated as described in the Explanation generation section. The text explanation was written by the authors based on the robot's action plan to provide an alternative text summary of robot behavior. Although such text descriptions were not directly produced by the model, they could be generated by modern natural language generation methods.

The human experiment included two phases: familiarization and prediction. In the familiarization phase, participants viewed two videos showing a robot interacting with a medicine bottle: one successful attempt in which the bottle was opened and one failed attempt in which it was not. In addition to the RGB videos showing the robot's executions, different groups viewed different forms of explanation panels. At the end of familiarization, participants were asked to assess how much they trusted/believed that the robot had the ability to open the medicine bottle (see text S2.5 and fig. S7 for an illustration of the trust rating question).

Next, the prediction phase presented all groups with only RGB videos of a successful robot execution; no group had access to any explanatory panels. Specifically, participants viewed videos segmented by the robot's actions; for segment i, the video runs from the beginning of the robot execution up to the ith action. For each segment, participants were asked to predict what action the robot would execute next (see text S2.5 and fig. S8 for an illustration of the action prediction question).

Regardless of group assignment, the RGB videos were identical across all groups; i.e., we showed the same RGB video to every group while varying only the explanation panels. This experimental design isolates the potential effects of execution variations among the different robot execution models presented in the Robot learning section; because we sought only to evaluate how well explanation panels foster qualitative trust and enhance prediction accuracy, we kept the robot execution performance constant across groups to remove a potential confound.

For both qualitative trust and prediction accuracy, the null hypothesis is that the explanation panels foster equivalent levels of trust and yield the same prediction accuracy across the different groups, so that no difference in trust or prediction accuracy would be observed. Because we used a between-subjects design, the test is a two-tailed independent-samples t test comparing the performance of two groups of participants, with the commonly used significance level α = 0.05; assuming a t distribution, the rejection region is P < 0.05.
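For reference, a minimal sketch of such a comparison with SciPy; the ratings below are made-up placeholders, not the study's data.

```python
from scipy import stats

# Placeholder trust ratings for two explanation groups (illustrative values only).
group_a = [5, 4, 6, 7, 5, 6, 4, 5]
group_b = [3, 4, 2, 5, 3, 4, 3, 2]

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # two-tailed independent-samples t test
reject_null = p_value < 0.05                           # significance level alpha = 0.05
```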
