AI-powered content analysis: Using ChatGPT to measure media and communication content

Methods tutorial #28835, module (political) communication research methods, Winter term 2023/2024

Instructor

Prof. Dr. Marko Bachl

Time

Wednesday, 16:15

Location
Garystr. 55, Room 105 (seminar room)

Last updated on 2024-01-17 at 15:15

Overview

Large language models (LLMs; starting with Google’s BERT) and particularly their implementations as generative or conversational AI tools (e.g., OpenAI’s ChatGPT) are increasingly used to measure or classify media and communication content. The idea is simple yet intriguing: Instead of training and employing humans for annotation tasks, researchers describe the concept of interest to a model such as ChatGPT, present the coding unit, and ask for a classification. The first tests of the utility of ChatGPT and similar tools for content analysis were positive to enthusiastic (Gilardi et al., 2023; Rathje et al., 2023). However, others pointed out the need for more thorough validation and reliability tests (Pangakis et al., 2023; Reiss, 2023). Easy-to-use tools and user-friendly tutorials have made these methods accessible to the average social scientist (Kjell et al., 2023; Törnberg, 2023b). Yet (closed-source, commercial) large language models are not entirely understood even by their developers, and their uncritical use has been criticized on ethical grounds (Bender et al., 2021; Spirling, 2023).

In this seminar, we will engage practically with this cutting-edge methodological research. We start with a quick refresher on the basics of quantitative content analysis (both human and computational), focusing on quality criteria and evaluation (validity, reliability, reproducibility, robustness, replicability). We will then attempt an overview of the rapidly developing literature on LLMs’ utility for content analysis. The central part of the seminar will be dedicated to small evaluation studies by student teams. Questions can range from understanding a tool’s parameters (e.g., What’s the effect of a model’s “temperature” on reliability and validity?) to practical optimization (e.g., Which prompts work best for a given task?) to critical questions (e.g., Does the classification show gender, racial, or other biases?).
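
To make this workflow concrete, here is a minimal sketch in R of a single zero-shot classification request. It is an illustration under assumptions, not part of the course materials: it uses the httr2 package and the OpenAI chat completions endpoint, expects an API key in the OPENAI_API_KEY environment variable, and the model name, prompt wording, category set, and temperature value are placeholders that an evaluation study would systematically vary.

    # Minimal sketch: zero-shot classification of one coding unit via the
    # OpenAI chat completions API (assumes the httr2 package and an API key
    # in the OPENAI_API_KEY environment variable).
    library(httr2)

    coding_unit <- "The chancellor's speech was met with loud protests."

    answer <- request("https://api.openai.com/v1/chat/completions") |>
      req_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))) |>
      req_body_json(list(
        model = "gpt-3.5-turbo",  # placeholder; any chat model can be used
        temperature = 0,          # one of the parameters worth evaluating
        messages = list(
          list(role = "system",
               content = "You are a coder in a content analysis. Classify the tone of the text as 'positive', 'negative', or 'neutral'. Answer with the label only."),
          list(role = "user", content = coding_unit)
        )
      )) |>
      req_perform() |>
      resp_body_json()

    # The classification returned by the model
    answer$choices[[1]]$message$content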

Requirements

  • Some prior exposure to (standardized, quantitative) content analysis will be helpful. However, qualitative methods also have their place in evaluating content analysis methods. If you have little experience with the former but can contribute with the latter, make sure to team up with students whose skill set complements yours.
  • Prior knowledge in R or Python, applied data analysis, and interacting with application programming interfaces (API) will be helpful but are not required. Again, make sure that the teams overall have a balanced skill set.
  • You will use your computer to conduct your evaluation study. Credit for commercial APIs (e.g., OpenAI) will be provided within sensible limits.
  • This is not a programming class. Programming skills are neither required, nor will you acquire them in a systematic way. I primarily work with R and sometimes copy, paste, and adapt some Python code, so my examples will be mainly in R. However, you are free to use whichever software you like.

Session plan

(1) 18. 10.: Hello

Class content: Introduction, demo, and organization

Organization: Find a partner for the state-of-the-art presentation. The goal is to find a partner who complements your skill set. Select or find an additional text. Register your presentation in the Blackboard Wiki.

Homework: Listen to this podcast episode with Petter Törnberg: LLMs in Social Science

(2) 25. 10.: Refresher: Traditional content analysis (human and computational)

Class content: Quick refresher on the basics of quantitative content analysis (both human and computational), focusing on quality criteria and evaluation (validity, reliability, reproducibility, robustness, replicability).

Texts (if needed): Krippendorff (2019) (but not the parts on computational content analysis), Van Atteveldt et al. (2022), Kroon et al. (2023).

State of the art: Overview

Class content: Short presentations on current work about LLM-based zero-shot classification

  • Short presentations (15 minutes)
  • One paper presented by two participants

Texts: Some recommendations include Burnham (2023), Gilardi et al. (2023), Hoes et al. (2023), Kjell et al. (2022), Kuzman et al. (2023), Laurer et al. (2023), McCoy et al. (2023), Ornstein et al. (2023), Pangakis et al. (2023), Qin et al. (2023), Rathje et al. (2023), Reiss (2023), Törnberg (2023a), Yang & Menczer (2023), Zhong et al. (2023). You are free to use other texts (check the references within these texts, and the works citing them, to find more). Text assignment will be managed via Blackboard.

(3) 01. 11.: State of the art I

  • Joscha N.: Gilardi et al. (2023)

  • Anne K., Suse K.: Törnberg (2023a)

(4) 08. 11.: State of the art II

  • Helena O., Nina G., Tobias v.d.B.: Yang & Menczer (2023)

  • Jiaqi W., Charlie v.V.: Rathje et al. (2023)

(5) 15. 11.: State of the art III

  • Julian P., Nika A.: Hoes et al. (2023)

  • Olaf P., Jan S. A.: Zhong et al. (2023)

  • Katarina K: Kuzman et al. (2023)

(6) 22. 11.: Work on first ideas

Class content: Support in class and office hours

Organization: Now, at the latest, form teams for the evaluation study. The goal is to create teams with diverse skill sets. In my experience, three to five people is a good team size, but your preferences might differ.

(7) 29. 11.: Present first ideas

Class content: Presentations and feedback

(8) 06. 12.: Work on design of evaluation study

Class content: Support in class and office hours

(9) 13. 12.: Present design of evaluation study

Class content: Presentations and feedback

(10) 20. 12.: Organize evaluation study

Class content: Progress report and support in class and office hours


Winter break


(11) 10. 01.: Conduct evaluation study

Class content: Live coding: How to talk to OpenAI models using the API

(12) 17. 01.: Conduct evaluation study

Class content: Live coding: How to set up the evaluation study
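
As a preview of this session, the skeleton below shows one way such an evaluation setup could look in R. It is only a sketch under assumptions: the human-coded gold standard is a hypothetical file coded_sample.csv with columns id, text, and human_label, and classify_llm() stands in for the API call sketched in the Overview (here stubbed with random labels so the skeleton runs without an API key).

    # Sketch: run the classifier over a human-coded sample and store what the
    # later evaluation needs (text, gold-standard label, model label).
    # classify_llm() is a stand-in for the API call sketched in the Overview;
    # here it draws random labels so the skeleton runs without an API key.
    classify_llm <- function(text, temperature = 0) {
      sample(c("positive", "negative", "neutral"), 1)
    }

    # Hypothetical gold standard: columns id, text, human_label
    gold <- read.csv("coded_sample.csv", stringsAsFactors = FALSE)

    set.seed(42)  # reproducible sample of coding units
    gold <- gold[sample(nrow(gold), min(200, nrow(gold))), ]

    gold$llm_label <- vapply(gold$text, classify_llm, character(1),
                             temperature = 0)

    write.csv(gold, "evaluation_results.csv", row.names = FALSE)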

(13) 24. 01.: Conduct evaluation study

Class content: Class evaluation 1; Help desk: Collect data for evaluation study

(14) 31. 01.: Conduct evaluation study

Class content: Class evaluation 2; Live coding: Quantitative evaluation
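
For the quantitative part, a possible starting point is sketched below. It assumes the hypothetical evaluation_results.csv from the setup sketch above (columns human_label and llm_label) and uses the irr package for Krippendorff’s alpha; which agreement and validity measures are appropriate is itself part of your evaluation design.

    # Sketch: compare the model's labels with the human gold standard
    # (assumes the hypothetical evaluation_results.csv written above and
    # the irr package).
    library(irr)

    results <- read.csv("evaluation_results.csv", stringsAsFactors = FALSE)

    # Percentage agreement between model and human coding
    mean(results$llm_label == results$human_label)

    # Confusion table: where do model and humans disagree?
    table(human = results$human_label, llm = results$llm_label)

    # Krippendorff's alpha (nominal): recode labels to integers and treat
    # the human coding and the model as two coders (rows of the matrix)
    labels  <- union(results$human_label, results$llm_label)
    ratings <- rbind(match(results$human_label, labels),
                     match(results$llm_label, labels))
    kripp.alpha(ratings, method = "nominal")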

(15) 07. 02.: Conduct evaluation study

Class content: Help desk: Qualitative and quantitative evaluation

(16) 14. 02.: Final presentations

Class content: Presentations (15 minutes per group) and feedback

Aims

The primary aims of a methods tutorial are twofold: firstly, to equip participants with the essential knowledge and skills required to effectively utilize Large Language Models (LLMs) for content analysis, enabling them to extract valuable insights and meaning from textual data. Secondly, the tutorial seeks to provide a comprehensive understanding of the methodologies involved in conducting an evaluation study of a new method. Through this, participants can gain proficiency in assessing the performance and effectiveness of novel approaches, fostering innovation and informed decision-making within the realm of natural language processing and data analysis (wordy phraseology according to ChatGPT).

Tasks

  • 5 ECTS ≈ 125-150 hours workload
  • Active participation, not graded
  • Participation in class: read texts, ask questions, discuss, give feedback to other students
  • Short presentation of a published evaluation study report (in pairs)
    • Not a detailed description, but a summary for the class. The audience should learn a) what kind of questions and studies might be interesting and b) which texts might be worth reading once they have decided on a study idea.
  • Plan and conduct an evaluation study (in groups)
  • Present the results of your own evaluation study (in groups)

Contact information

Prof. Dr. Marko Bachl

Division Digital Research Methods

Email: marko.bachl@fu-berlin.de

Phone: +49-30-838-61565

Webex: Personal Meeting Room

Office: Garystr. 55, Room 274

Office hours: Tuesday, 11:00-13:00, please make an appointment via email.

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? 🩜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/gh677h
Burnham, M. (2023). Stance detection with supervised, zero-shot, and few-shot applications. arXiv. https://doi.org/10.48550/arXiv.2305.01723
Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120. https://doi.org/10.1073/pnas.2305016120
Hoes, E., Altay, S., & Bermeo, J. (2023). Leveraging ChatGPT for efficient fact-checking. PsyArXiv. https://doi.org/10.31234/osf.io/qnjkf
Kjell, O., Giorgi, S., & Schwartz, H. A. (2023). The text-package: An R-package for analyzing and visualizing human language using natural language processing and transformers. Psychological Methods. https://doi.org/gsmcq8
Kjell, O., Sikström, S., Kjell, K., & Schwartz, H. A. (2022). Natural language analyzed with AI-based transformers predict traditional subjective well-being measures approaching the theoretical upper limits in accuracy. Scientific Reports, 12(1), 3918. https://doi.org/gppxhs
Krippendorff, K. (2019). Content analysis: An introduction to its methodology (4th ed.). SAGE Publications, Inc. https://doi.org/10.4135/9781071878781
Kroon, A., Welbers, K., Trilling, D., & Atteveldt, W. van. (2023). Advancing automated content analysis for a new era of media effects research: The key role of transfer learning. Communication Methods and Measures, 1–21. https://doi.org/gsv44t
Kuzman, T., Mozetič, I., & Ljubeơić, N. (2023). ChatGPT: Beginning of an end of manual linguistic data annotation? Use case of automatic genre identification. arXiv. https://doi.org/10.48550/arXiv.2303.03953
Laurer, M., Atteveldt, W. van, Casas, A., & Welbers, K. (2023). Less annotating, more classifying: Addressing the data scarcity issue of supervised machine learning with deep transfer learning and BERT-NLI. Political Analysis, 1–17. https://doi.org/10.1017/pan.2023.20
McCoy, R. T., Yao, S., Friedman, D., Hardy, M., & Griffiths, T. L. (2023). Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv. https://doi.org/10.48550/arXiv.2309.13638
Ornstein, J. T., Blasingame, E. N., & Truscott, J. S. (2023). How to train your stochastic parrot: Large language models for political texts. https://joeornstein.github.io/publications/ornstein-blasingame-truscott.pdf
Pangakis, N., Wolken, S., & Fasching, N. (2023). Automated annotation with generative AI requires validation. arXiv. https://doi.org/10.48550/arXiv.2306.00176
Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? arXiv. https://doi.org/10.48550/arXiv.2302.06476
Rathje, S., Mirea, D.-M., Sucholutsky, I., Marjieh, R., Robertson, C., & Bavel, J. J. V. (2023). GPT is an effective tool for multilingual psychological text analysis. PsyArXiv. https://doi.org/10.31234/osf.io/sekf5
Reiss, M. V. (2023). Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv. https://doi.org/10.48550/arXiv.2304.11085
Spirling, A. (2023). Why open-source generative AI models are an ethical way forward for science. Nature, 616(7957), 413–413. https://doi.org/gsqx6v
Törnberg, P. (2023a). ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. arXiv. https://doi.org/10.48550/arXiv.2304.06588
Törnberg, P. (2023b). How to use LLMs for text analysis. arXiv. https://doi.org/10.48550/arXiv.2307.13106
Van Atteveldt, W., Trilling, D., & CalderĂłn, C. A. (2022). Computational analysis of communication. Wiley Blackwell. https://cssbook.net/
Yang, K.-C., & Menczer, F. (2023). Large language models can rate news outlet credibility. arXiv. https://doi.org/10.48550/arXiv.2304.00228
Zhong, Q., Ding, L., Liu, J., Du, B., & Tao, D. (2023). Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv. https://doi.org/10.48550/arXiv.2302.10198