
Recognizing Gesture: A Vital Feature in Multimodal Communication

Hannah VanderHoeven is a Ph.D. student at Colorado State University (CSU) who holds an MS in Computer Science from CSU. As part of iSAT, Hannah works with Dr. Krishnaswamy on automatic gesture recognition, multimodal processing, and system development to help extract context from group work and process it in an effective and meaningful manner.

Nikhil Krishnaswamy is an Assistant Professor of Computer Science at Colorado State University. His expertise lies in natural language processing, including multimodal methods, human-AI collaboration, and embodied cognition. He works to research and develop AI systems that complement and support human expertise and effort.

When working together to achieve a goal, especially in collaborative problem solving (CPS) tasks, students are likely to use many different forms of communication. These include not only language but also actions, gestures, and other nonverbal cues. Gestures, in particular, are rich in meaning and often complement spoken language. Certain concepts, such as pointing to something in physical space, are usually expressed through gesture alone. Because of this, recognizing gestures accurately and in real time is crucial for multimodal language understanding. This is especially important when designing well-rounded AI systems that can interpret the state of the collaboration for a given task.

When combined with speech, nonverbal behaviors such as gestures provide crucial contextual information, helping an AI interpret how a person interacts with others and with the surrounding physical space. Deictic gestures, such as pointing, allow participants to indicate locations in their workspace, helping speakers specify a referent or target of interest in their environment. These targets of interest can align directly with demonstrative pronouns in speech (e.g., "this one" or "those") and can add clarity to otherwise vague statements. By incorporating this information, an AI agent can gain a more complete understanding of a scene than it would from speech alone.

Hand detection tools like MediaPipe, an open-source library developed by Google, aid in gesture recognition. MediaPipe provides joint locations (landmarks) of detected hands on a frame-by-frame basis. These landmarks can be used to train gesture recognition models for a variety of applications. By leveraging this data, we can more efficiently train lightweight classifiers to detect a wide range of gestures.
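To make this concrete, here is a minimal sketch of frame-by-frame hand landmark extraction using MediaPipe's Hands solution together with OpenCV. The webcam capture and the idea of feeding the resulting features into a downstream gesture classifier are illustrative assumptions, not a description of any particular iSAT pipeline.

```python
# Minimal sketch: per-frame hand landmark extraction with MediaPipe Hands.
# The feature layout (21 landmarks x 3 coordinates per detected hand) follows
# MediaPipe's Hands solution; the downstream classifier is assumed/omitted.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_hand_landmarks(frame_bgr, hands):
    """Return a list of (x, y, z) landmark tuples for each detected hand."""
    # MediaPipe expects RGB input; OpenCV captures frames in BGR order.
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    results = hands.process(rgb)
    features = []
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # 21 landmarks per hand, each with normalized x, y, z coordinates.
            features.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
    return features

cap = cv2.VideoCapture(0)  # webcam; swap in a video file path as needed
with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        landmarks = extract_hand_landmarks(frame, hands)
        # landmarks can now be passed to a lightweight gesture classifier.
cap.release()
```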

Most vision-based gesture recognition projects focus on detecting hand poses in a single video frame, which is useful for human-computer interaction, such as recognizing static commands to control a user interface. However, less attention has been paid to recognizing the distinct "phases" of more complex gestures that unfold over the course of interactions between humans. As participants work through small-group tasks, there is likely to be more emphasis on complex, multiframe gestures, such as pointing, that can accompany multiple spoken words.
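As a rough illustration of the multiframe idea, the sketch below buffers per-frame features (for example, the landmarks extracted above) into a sliding window so that a classifier can label gesture phases rather than single poses. The `classify_phase` function, the window length, and the phase labels mentioned in the comments are hypothetical placeholders, not a specific published model.

```python
# A hedged sketch of phase-level recognition over a sliding window of frames.
# classify_phase is a hypothetical model; phase labels would follow a common
# gesture-phase annotation scheme (e.g., rest, preparation, stroke, retraction).
from collections import deque

WINDOW_SIZE = 30  # roughly one second of video at 30 fps (assumption)

def phase_stream(frame_features, classify_phase):
    """Yield a predicted gesture phase for each incoming frame's features."""
    window = deque(maxlen=WINDOW_SIZE)
    for features in frame_features:
        window.append(features)
        if len(window) == WINDOW_SIZE:
            # The classifier sees the whole window rather than a single frame,
            # so it can separate, e.g., the stroke of a point from its retraction.
            yield classify_phase(list(window))
```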

Unifying gesture and language can result in enriched natural language, providing a text-only large language model (LLM) with a much more complete depiction of a scene. Imagine a scene where someone points to a green block and says, "I think that one weighs twenty grams." From the speech transcription alone, an AI agent powered by an LLM wouldn't know what "that one" refers to. However, by integrating gesture recognition and object detection, these two channels can be unified into a clear, unambiguous description: "I think [the green block] weighs twenty grams." In this way, gesture recognition and other forms of multimodal interpretation serve as a means of giving "blind" systems "eyes."
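The sketch below shows, in toy form, how such a unification step might rewrite a transcript once a deictic target has been resolved. The `enrich_utterance` function and the regular expression over demonstratives are illustrative assumptions; a real system would substitute the output of gesture recognition and object detection rather than a hard-coded label.

```python
# Toy sketch: "unifying" a transcript with a gesture-resolved referent.
# The deictic target label stands in for the combined output of gesture
# recognition plus object detection.
import re

DEMONSTRATIVES = re.compile(r"\b(this one|that one|these|those)\b", re.IGNORECASE)

def enrich_utterance(transcript, deictic_target):
    """Replace a demonstrative reference with the object the speaker pointed at."""
    if deictic_target is None:
        return transcript
    return DEMONSTRATIVES.sub(f"[{deictic_target}]", transcript, count=1)

# Example: pointing at the green block while speaking.
print(enrich_utterance("I think that one weighs twenty grams", "the green block"))
# -> "I think [the green block] weighs twenty grams"
```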

Below are some additional resources for you to check out!

References

Kendon, A. (1997). Gesture. Annual Review of Anthropology, 26(1), 109-128.

Lascarides, A., & Stone, M. (2009). A formal semantic analysis of gesture. Journal of Semantics, 26(4), 393-449.

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019, June). MediaPipe: A framework for perceiving and processing reality. In Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) (Vol. 2019).

McNeill, D. (2008). Gesture and Thought. University of Chicago Press.

VanderHoeven, H., Blanchard, N., & Krishnaswamy, N. (2023, July). Robust motion recognition using gesture phase annotation. In International Conference on Human-Computer Interaction (pp. 592-608). Cham: Springer Nature Switzerland.