Human Cognition Models to Inspire AVs in Interaction Scenes
Last Updated: 01/22/2025
Zhang, Z., Elahi, M., Domeyer, J., and Tian, R., “Driver Temporal Segmentation of Pedestrian Crossing Intentions during Negotiations,” in IEEE Transactions on Intelligent Vehicles, (Under Review)
Issues of Pedestrian Behavior Prediction Models
- Limited generalizability under inherent behavioral uncertainty and contingency
- Lower accuracy in predicting sudden behavior changes
- Reduced performance for longer prediction horizons (2-6 seconds)
- AV-pedestrian negotiation imposes higher requirements than basic safety functionalities
Possible Solutions
- Generative models for pedestrian trajectories
- Generating multiple trajectories or trajectory heatmap
- Rethinking how human drivers negotiate with pedestrians
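The first two solutions can be sketched as follows: a minimal generative baseline (here, a constant-velocity prior with Gaussian acceleration noise, a placeholder for a learned model) samples multiple future pedestrian trajectories and rasterizes them into an occupancy heatmap. All function names and parameters are illustrative assumptions, not part of the cited work.

```python
import numpy as np

def sample_trajectories(pos, vel, horizon=12, k=50, noise=0.3, rng=None):
    """Sample k plausible future trajectories with a constant-velocity
    prior plus Gaussian acceleration noise (placeholder generative model)."""
    rng = rng or np.random.default_rng(0)
    trajs = np.zeros((k, horizon, 2))
    for i in range(k):
        p, v = np.array(pos, float), np.array(vel, float)
        for t in range(horizon):
            v = v + rng.normal(0.0, noise, 2)   # random acceleration
            p = p + v                            # integrate position
            trajs[i, t] = p
    return trajs

def trajectory_heatmap(trajs, grid=20, extent=40.0):
    """Rasterize sampled trajectories into a normalized occupancy heatmap."""
    heat = np.zeros((grid, grid))
    cells = np.clip(((trajs + extent / 2) / extent * grid).astype(int), 0, grid - 1)
    for x, y in cells.reshape(-1, 2):
        heat[y, x] += 1
    return heat / heat.sum()
```

A real system would replace the constant-velocity prior with a learned generative model, but the multi-sample-then-rasterize structure stays the same.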
Driver Scene Understanding Model
We propose the event-segmentation-based scene understanding model based on the Theory of Mind to explain driver cognition during pedestrian interactions.
Main Assumption: driver and pedestrian negotiate crossing intentions
- Intention is a commitment to certain actions within a time boundary
- Pedestrians have present-oriented (low-level) and future-oriented (high-level) intentions
- Pedestrian Situated Intent (PSI) is the pedestrian’s intention to cross the conflicting area before the ego-vehicle in dynamically changing situations involving the car, pedestrian, and contextual environment.

- Step 1: A driver automatically segments perceptual inputs at a coarse level (pedestrian intention).
- Step 2: Within each segment, drivers can predict fine-level events (i.e., pedestrian actions) more accurately by comparing working memory with long-term memory.
- Step 3: Coarse-level segmentation boundaries are identified when the prediction of fine-level events is no longer accurate, meaning estimated pedestrian intention changes.
- Step 4: Working memory is updated to rebuild the coarse-level segment (pedestrian intention) boundaries, and the process loops back to Step 1.
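The four steps above can be sketched as a prediction-error-driven segmentation loop: working memory predicts the next fine-level observation, and when prediction fails a coarse-level (intent) boundary is declared. The mean-predictor and the error threshold are simplifying assumptions for illustration, not the model's actual mechanism.

```python
import numpy as np

def segment_stream(observations, threshold=1.0):
    """Event-segmentation sketch of Steps 1-4: predict the next fine-level
    observation from the current segment's working memory; a large
    prediction error marks a coarse-level (intent) boundary."""
    boundaries = []
    working_memory = [observations[0]]
    for t in range(1, len(observations)):
        predicted = np.mean(working_memory, axis=0)  # Step 2: fine-level prediction
        error = np.linalg.norm(observations[t] - predicted)
        if error > threshold:                        # Step 3: prediction fails
            boundaries.append(t)                     # boundary = intent change
            working_memory = [observations[t]]       # Step 4: rebuild working memory
        else:
            working_memory.append(observations[t])   # within-segment update
    return boundaries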
Video Experiment Process
- Ask a group of representative human drivers to estimate changes in pedestrian situated intent while watching prerecorded pedestrian-encounter videos from the driver's view.
- From the first frame to the last frame during the pedestrian encounter
- Each human driver needs to estimate the pedestrian’s intent to cross in front of the car
- Provide descriptions about the reasoning process when the intent estimation changes
- Provide driving decisions
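The per-frame judgments collected above can be represented with a simple record type; intent-estimate changes across consecutive frames then give the segment boundaries. The label set and field names here are assumptions made for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass

INTENT_LABELS = {"cross", "not_cross", "not_sure"}  # assumed label set

@dataclass
class IntentAnnotation:
    """One driver's frame-level judgment during a pedestrian encounter."""
    frame: int
    estimation: str            # one of INTENT_LABELS
    reasoning: str = ""        # free-text explanation when the estimate changes
    driving_decision: str = "" # e.g. "maintain speed", "slow down"

    def __post_init__(self):
        if self.estimation not in INTENT_LABELS:
            raise ValueError(f"unknown intent label: {self.estimation}")

def change_points(annotations):
    """Frames where a driver's intent estimate changes (segment boundaries)."""
    return [b.frame for a, b in zip(annotations, annotations[1:])
            if a.estimation != b.estimation]
```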

Estimation: Not sure
The pedestrian is standing between the two lanes and obeying traffic. The car ahead is slowing down. It is a busy road with fast-moving traffic.

Estimation: Not cross
The pedestrian looks like a child. He is still standing between the two lanes and obeying traffic. He has been looking back toward the other side, with his body facing diagonally and his feet pointed in the same direction. The car ahead has already stopped, but cars are still passing in his path.

Estimation: Cross
The pedestrian is still standing between the two lanes and making back-and-forth movements. He may be looking for an opportunity to cross. Someone in the car ahead might be calling him as well. Cars have slowed down, so he may jump to the side.

Estimation: Not cross
The pedestrian has been looking to cross to the other side. Now that the opposite lanes are empty, he has started to run to the other side and will not pass in front of this car, even though it is closer to this side.
Experiment and Data Analysis Process

- Elahi, M.F., Luo, X. and Tian, R., 2020, July. A framework for modeling knowledge graphs via processing natural descriptions of vehicle-pedestrian interactions. In International Conference on Human-Computer Interaction (pp. 40-50). Cham: Springer International Publishing.
- Elahi, M.F., Sreeram, J.G., Luo, X. and Tian, R., 2021, September. A Novel Adaptation of Information Extraction Algorithm to Process Natural Text Descriptions of Pedestrian Encounters. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) (pp. 1906-1912). IEEE.
- Sreeram, J.G., Luo, X. and Tian, R., 2021. Contextual and Behavior Factors Extraction from Pedestrian Encounter Scenes Using Deep Language Models. In Big Data Analytics and Knowledge Discovery: 23rd International Conference, DaWaK 2021, Virtual Event, September 27–30, 2021, Proceedings 23 (pp. 131-136). Springer International Publishing.
- Elahi, M., Tian, R., and Luo, X., 2022. Flexible and Scalable Annotation Tool to Develop Scene Understanding Datasets. Workshop on Human-in-the-Loop Data Analytics (HILDA 2022), ACM SIGMOD/PODS Conference, June 12-17, Philadelphia, PA.
- Elahi, M., Jing, T., Ding, Z., and Tian, R., MinDReaD: Mining Decision-Making Reasoning Data at Micro Level, International Journal of Human-Computer Interaction, (Under Revision).
Demo of Experiment Results
Benchmark Dataset
- Pedestrian Situated Intent (PSI) Benchmark Dataset (http://situated-intent.net/pedestrian_dataset/)
- 210 videos randomly sampled from a naturalistic driving dataset
- 75 subjects
- Age ranges from 19 to 77
- Personality and driving styles are recorded for all the subjects
- Each subject completed 1.5 hours of training and 15 hours of the video annotation experiment
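With 75 subjects annotating the same videos, per-frame labels can be aggregated into a consensus intent with an agreement ratio. This is a hedged sketch: the PSI dataset's actual file format and aggregation protocol are not specified here, so the input structure (a frame-to-labels mapping) is an assumption.

```python
from collections import Counter

def majority_intent(frame_labels):
    """Consensus intent per frame across annotators, with agreement ratio.
    `frame_labels` maps frame index -> list of labels from different subjects
    (an assumed structure, not the dataset's actual format)."""
    consensus = {}
    for frame, labels in frame_labels.items():
        label, count = Counter(labels).most_common(1)[0]  # modal label
        consensus[frame] = (label, count / len(labels))   # agreement in [0, 1]
    return consensus
```

Frames with low agreement ratios are natural candidates for the "Not sure" cases illustrated in the demo above.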
