Utilizing Multimodal Large Language Models for Video Analysis of Posture in Studying Collaborative Learning
A Case Study
DOI: https://doi.org/10.18608/jla.2025.8595

Keywords: multimodal learning analytics, generative artificial intelligence, multimodal large language models (MLLMs), pyramid collaborative learning flow pattern, research paper

Abstract
Incorporating non-verbal data streams is essential to understanding the dynamics of interaction within collaborative learning environments, where a variety of verbal and non-verbal modes of communication intersect. However, the complexity of non-verbal data, especially when gathered in the wild from collaborative learning contexts, demands efficient and effective analysis. Methodological advancements are necessary to handle this complexity, enabling researchers to derive meaningful insights from these data streams. The advancement of Generative Artificial Intelligence (GenAI) has significantly broadened its accessibility, making it available to a diverse array of users and demonstrating its utility in aiding data analytics. However, the application of GenAI in multimodal learning analytics (MMLA), particularly for feature extraction when studying collaborative learning interactions, remains largely unexplored. This study explores how multimodal large language models (MLLMs) can be utilized as part of the MMLA process, focusing on the extraction of postural behaviour. An illustrative case study involving 52 pre-service teachers engaged in a physics-based collaborative learning task demonstrates how MLLMs can be used for feature extraction. The integration of GenAI techniques in learning research promises a new horizon for understanding and enhancing collaborative learning interactions.
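To make the feature-extraction step concrete, the following is a minimal sketch of how an MLLM might be prompted to label posture from sampled video frames. It assumes access to the OpenAI Python SDK and OpenCV; the model name, the sampling interval, the posture categories (leaning forward, upright, leaning back), and the prompt wording are illustrative assumptions, not the study's actual coding scheme or configuration.

```python
# Minimal sketch: posture labelling of video frames with a multimodal LLM.
# Assumptions: OpenAI Python SDK and OpenCV installed, OPENAI_API_KEY set;
# the model name, sampling interval, and posture categories below are
# illustrative, not the study's actual configuration.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

POSTURE_PROMPT = (
    "For each person visible in this frame of a collaborative learning "
    "session, classify their posture as one of: leaning forward, upright, "
    "or leaning back. Answer as a comma-separated list, one label per person."
)

def sample_frames(video_path: str, every_n_seconds: float = 30.0):
    """Yield (timestamp, JPEG bytes) pairs sampled at a fixed interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            ok_enc, buffer = cv2.imencode(".jpg", frame)
            if ok_enc:
                yield index / fps, buffer.tobytes()
        index += 1
    cap.release()

def label_posture(jpeg_bytes: bytes) -> str:
    """Send one frame to the MLLM and return its posture labels."""
    b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; an assumption here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": POSTURE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

for timestamp, jpeg in sample_frames("group_session.mp4"):
    print(f"{timestamp:7.1f}s  {label_posture(jpeg)}")
```

Because MLLM outputs can vary across runs, a pipeline of this kind would still require validating a human-coded subsample (for example, via inter-rater agreement between the model and human coders) before the extracted posture features are used in analysis.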
Copyright (c) 2024 Journal of Learning Analytics

This work is licensed under a Creative Commons Attribution 4.0 International License.