Automated Classification of Elementary Instructional Activities

Analyzing the Consistency of Human Annotations

Authors

Foster, J. K., Youngs, P., van Aswegen, R., Singh, S., Watson, G. S., & Acton, S. T.

DOI:

https://doi.org/10.18608/jla.2024.8323

Keywords:

video annotation, temporal analysis, elementary instruction, validation

Abstract

Despite a tremendous increase in the use of video for conducting classroom research and for preparing and evaluating teachers, notable challenges remain to using classroom videos at scale, including time and financial costs. Recent advances in artificial intelligence, including natural language processing, automated speech recognition, and deep neural networks, could make the process of analyzing, scoring, and cataloguing videos more efficient. To train artificial intelligence to accurately classify activities in classroom videos, humans must first annotate a set of videos in a consistent way. This paper describes our investigation of inter-annotator reliability in identifying activities and their durations among annotators with and without experience analyzing classroom videos. The validity of human annotations is crucial for temporal analysis in classroom video research. The study reported here represents an important step towards applying methods developed in other fields to validate temporal analytics in learning analytics research for classifying time- and event-based activities in classroom videos.
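
For readers unfamiliar with how inter-annotator reliability is quantified, the minimal sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, over two annotators' per-second activity labels. This illustrates the general technique only, not the paper's actual procedure: the activity categories, segment durations, and label sequences are invented for the example.

```python
from collections import Counter

# Hypothetical per-second activity labels from two annotators for the same
# classroom video segment (one label per second; all values are invented).
annotator_a = ["whole_class"] * 120 + ["small_group"] * 300 + ["transition"] * 30
annotator_b = ["whole_class"] * 110 + ["small_group"] * 305 + ["transition"] * 35

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement (Cohen, 1960) for two nominal label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Proportion of seconds on which the two annotators agree outright.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's marginal frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.3f}")  # ~0.93 for this data
```

Note that slicing video into fixed time units treats each second as an independent judgment; event-based annotation schemes, where annotators also segment the timeline, typically call for segmentation-aware agreement measures instead.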

Published

2024-09-29

How to Cite

Foster, J. K., Youngs, P., van Aswegen, R., Singh, S., Watson, G. S., & Acton, S. T. (2024). Automated Classification of Elementary Instructional Activities: Analyzing the Consistency of Human Annotations. Journal of Learning Analytics, 11(3), 142–159. https://doi.org/10.18608/jla.2024.8323

Issue

Vol. 11 No. 3 (2024)

Section

Research Papers