Utilizing Multimodal Large Language Models for Video Analysis of Posture in Studying Collaborative Learning
A Case Study
DOI: https://doi.org/10.18608/jla.2025.8595

Keywords: multimodal learning analytics, generative artificial intelligence, multimodal large language models (MLLMs), pyramid collaborative learning flow pattern, research paper

Abstract
Incorporating non-verbal data streams is essential to understanding the dynamics of interaction within collaborative learning environments, where a variety of verbal and non-verbal modes of communication intersect. However, the complexity of non-verbal data, especially when gathered in the wild from collaborative learning contexts, demands efficient and effective analysis. Methodological advancements are necessary to handle this complexity, enabling researchers to derive meaningful insights from these data streams. The advancement of Generative Artificial Intelligence (GenAI) has significantly broadened its accessibility, making it available to a diverse array of users and demonstrating its utility in aiding data analytics. However, the application of GenAI in multimodal learning analytics (MMLA), particularly for feature extraction when studying collaborative learning interactions, remains largely unexplored. This study explores how multimodal large language models (MLLMs) can be utilized as part of the MMLA process, focusing on the extraction of postural behaviour. An illustrative case study involving 52 pre-service teachers engaged in a physics-based collaborative learning task demonstrates how MLLMs can be used for feature extraction. The integration of GenAI techniques in learning research promises a new horizon for understanding and enhancing collaborative learning interactions.
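To make the feature-extraction step concrete, the following is a minimal sketch of how an MLLM might be prompted to label posture from sampled video frames. It assumes access to the OpenAI Python SDK and OpenCV; the model name, the sampling interval, the posture categories (leaning forward, upright, leaning back), and the prompt wording are illustrative assumptions, not the study's actual coding scheme or configuration.

```python
# Minimal sketch: posture labelling of video frames with a multimodal LLM.
# Assumptions: OpenAI Python SDK and OpenCV installed, OPENAI_API_KEY set;
# the model name, sampling interval, and posture categories below are
# illustrative, not the study's actual configuration.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

POSTURE_PROMPT = (
    "For each person visible in this frame of a collaborative learning "
    "session, classify their posture as one of: leaning forward, upright, "
    "or leaning back. Answer as a comma-separated list, one label per person."
)

def sample_frames(video_path: str, every_n_seconds: float = 30.0):
    """Yield (timestamp, JPEG bytes) pairs sampled at a fixed interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            ok_enc, buffer = cv2.imencode(".jpg", frame)
            if ok_enc:
                yield index / fps, buffer.tobytes()
        index += 1
    cap.release()

def label_posture(jpeg_bytes: bytes) -> str:
    """Send one frame to the MLLM and return its posture labels."""
    b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; an assumption here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": POSTURE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

for timestamp, jpeg in sample_frames("group_session.mp4"):
    print(f"{timestamp:7.1f}s  {label_posture(jpeg)}")
```

Because MLLM outputs can vary across runs, a pipeline of this kind would still require validating a human-coded subsample (for example, via inter-rater agreement between the model and human coders) before the extracted posture features are used in analysis.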
Copyright (c) 2024 Journal of Learning Analytics

This work is licensed under a Creative Commons Attribution 4.0 International License.