AI-powered content analysis: Using generative AI to measure media and communication content

Methods tutorial #28834(a), module (political) communication research methods, winter term 2024/2025

Instructor

Prof. Dr. Marko Bachl

Date

Monday, 14:15

Location

Garystr.55/302a Seminarraum

Last updated on 2025-02-03 at 14:03

Important links

Seminar in Blackboard🔒

Seminar in Course Catalog

Room Garystraße 55-302a

Digital Research Methods Division

Overview

Large language models (LLM; starting with Google’s BERT) and particularly their implementations as generative or conversational AI tools (e.g., OpenAI’s ChatGPT) are increasingly used to measure or classify media and communication content. The idea is simple yet intriguing: Instead of training and employing humans for annotation tasks, researchers describe the concept of interest to a model such as ChatGPT, present the coding unit, and ask for a classification. The first tests of the utility of ChatGPT and similar tools for content analysis were positive to enthusiastic (Gilardi et al., 2023; Heseltine & Clemm von Hohenberg, 2024; Rathje et al., 2024). However, others pointed out the need for more thorough validation and reliability tests (Pangakis et al., 2023; Reiss, 2023). Easy-to-use tools and user-friendly tutorials have proliferated the methods to the average social scientist (Kjell et al., 2023; Törnberg, 2023, 2024b). Yet (closed-source, commercial) large language models are not entirely understood even by their developers, and their uncritical use has been criticized on ethical grounds (Bender et al., 2021; Spirling, 2023).

In this seminar, we will engage practically with this cutting-edge methodological research. We start with a quick refresher on the basics of quantitative content analysis (both human and computational), focusing on quality criteria and evaluation (validity, reliability, reproducibility, robustness, replicability). We will then attempt an overview of the rapidly developing literature on LLMs’ utility for content analysis. The central part of the seminar will be dedicated to small evaluation studies by student teams. Questions can range from understanding a tool’s parameters (e.g., What’s the effect of a model’s “temperature” on reliability and validity?) to practical optimization (e.g., Which prompts work best for a given task?) to critical questions (e.g., Does the classification show gender, racial, …, biases?).

Requirements

Some prior exposure to (standardized, quantitative) content analysis will be helpful. However, qualitative methods also have their place in evaluating content analysis methods. If you have little experience with the former but can contribute with the latter, make sure to team up with students whose skill set complements yours.
Prior knowledge in R or Python, applied data analysis, and interacting with application programming interfaces (API) will be helpful but are not required. Again, make sure that the teams overall have a balanced skill set.
You will use your computer to conduct your evaluation study. Credit for commercial APIs (e.g., OpenAI) will be provided within sensible limits.
This is not a programming class. Neither are programming skills required nor will you acquire such skills in a systematic way. We will learn the basics of interacting with an API using R. Code examples will be provided and discussed.
Here are some ressources to get started with R:
- Check out R Primers by Andrew Heiss: Browser-based, no installation required.
- Get code snippets for typical tasks: Posit Recipes
- Read (and work through) R for Data Science by Hadley Wickham; make sure to get 2nd edition.

Aims

After the seminar, you should be able to:

critically evaluate and improve the performance of a classifier in a (computational) content analysis.
use zero-shot content content analysis with generative AI tools in your own research project.

Tasks

5 ECTS ≈ 125-150 hours workload
Active participation, not graded
Participation in class: read texts, ask questions, discuss, give feedback to other students
Short presentation of a published evaluation study report (in teams)
- Not a detailed description, but a summary for the class. The audience should learn a) what kind of questions and studies might be interesting and b) which texts might be worth reading once they have decided on a study idea.
Plan and conduct an evaluation study (in teams)
Present the results of your own evaluation study (in teams)

Session plan

Please note that the session plan is subject to change.

(1) 14. 10.: Hello

Class content: Introduction, demo, and organization

Organization: Find a team for the state-of-the-art presentation. The goal is to find a team with a complementary skill set. Select or find an additional text.

Homework:

Listen to this podcast episode with Petter Törnberg: LLMs in Social Science
Register your presentation in the Blackboard Wiki.
Preparing your computer: If you want to actively participate in the computational part of the seminar using the prepared R code, please prepare your laptop with either an up-to-date installation of R (at least version 4.2) and RStudio or create an account at Posit Cloud.

(2) 21. 10.: Refresher: Traditional content analysis (human and computational)

Class content:

Quick refresher on the basics of quantitative content analysis (both human and computational), focusing on quality criteria and evaluation (validity, reliability, reproducibility, robustness, replicability).

Texts (if needed):

Manual content analysis: Krippendorff (2019), Neuendorf (2017) (but not the parts on computational content analysis)
Computational content analysis: Bachl & Scharkow (2024), Van Atteveldt et al. (2022), Kroon et al. (2024)

State of the art: Overview

Class content: Short presentations on current work about LLM-based zero-shot classification

Short presentations (10-15 Minutes)
One paper presented by two to three participants

Texts: Some recommendations include Alizadeh et al. (2023), Brown et al. (2020), Burnham (2024), Chae & Davidson (2024), Egami et al. (2023), Gilardi et al. (2023), Gupta et al. (2024), He et al. (2023), Heseltine & Clemm von Hohenberg (2024), Hoes et al. (2023), Huang et al. (2023), Kathirgamalingam et al. (2024), Kojima et al. (2023), Kuzman et al. (2023), Lai et al. (2023), Matter et al. (2024), Møller et al. (2024), Ollion et al. (2024), Ornstein et al. (2023), Pangakis et al. (2023), Plaza-del-Arco et al. (2023), Qin et al. (2023), Rathje et al. (2024), Reiss (2023), Schulhoff et al. (2024), Tam et al. (2024), Thalken et al. (2023), Törnberg (2024a), Weber & Reichardt (2023), Yang & Menczer (2023), Zhu et al. (2023), Ziems et al. (2024). You are free to use other texts (check citations in and to these texts to find more). Text assignment will be managed via Blackboard.

(3) 28. 10.: State of the art I

Keti, Luisa & Svenja: Reiss (2023)
Nick, Daniela, & Otto: Heseltine & Clemm von Hohenberg (2024)

(4) 04. 11.: State of the art II

Magarete, Maggie, & Marie-Luise: Matter et al. (2024)
Dang, Ziyang, & Yiquan: Huang et al. (2023)
Nina, Alina & Julian: Zhu et al. (2023)

(5) 11. 11.: State of the art III

Anna, Pascal, Tanja, & Phoebe: Hoes et al. (2023)
MB: Törnberg (2024b)
MB: First intro to confusion matrix, metrics

Empirical evaluation study

(6) 18. 11.: Introduction to the evaluation study

The central part of the seminar will be dedicated to small evaluation studies by student teams. Questions can range from understanding a tool’s parameters (e.g., What’s the effect of a model’s “temperature” on reliability and validity?) to practical optimization (e.g., Which prompts work best for a given task?) to critical questions (e.g., Does the classification show gender, racial, …, biases?).

Class content:

Decision about type of evaluation study

Competition: Multiple groups work on the same task using a training set; at the end, classifiers are tested against a new test set); The competition would be held on the this task: Explainable Detection of Online Sexism (EDOS)
Free topics: Each groups works on their own idea and data sets evaluation study of classifier for their own task
- 1. Find a data set and a task (e.g., Hugging Face, SemEval, GermEval)
- 1. Use or create an own data set.

Tools and computers

Introduction of tools to play with the API
Quick computer check

Organization: Form teams for the evaluation study. The goal is to create teams with diverse skill sets. In my experience, three to five persons are a good team size, but your preferences might differ.

Homework:

Register your team for the evaluation study in the Blackboard Wiki.
One group member: Send me an e-mail and receive an API key for OpenAI.
Start thinking about a task and a data set for your evaluation study.

(7) 25. 11.: Design evaluation study

Class content: Support in class and office hours

(8) 02. 12.: Study idea presentations

Class content: Presentations and feedback

10 Minute pitch
- Topic, domain
- Relevance
- Specific research question
- Design (first ideas)
- Data set(s) (if already chosen)

Homework:

Upload pitch presentations to Blackboard discussion board.
Prepare a first prototype of a prompt. See Törnberg (2024b) for some suggestions.
Test it using the tools in Blackboard.
Prepare two txt files for next session.
- Prompt
- Example content (1-3 units)
- Make them available on all computers that will be used next session.

(9) 09. 12.: Design evaluation study

Class content: Understanding (and using) R Code for interaction with the OpenAI API

Bring your laptop with R/RStudio or Posit Cloud account.
Share OpenAI key within research group.

Homework:

Continue working on your evaluation study.
- Definition(s) of measured constructs
- Prompt(s)
- Data set, annotation
- Design (what is compared to what on which data set(s)?)

(10) 16. 12.: Design evaluation study

Class content: Understanding (and using) R Code for interaction with the OpenAI API

Bring your laptop with R/RStudio or Posit Cloud account.
Share OpenAI key within research group.

Homework:

Continue working on your evaluation study.
- Definition(s) of measured constructs
- Prompt(s)
- Data set, annotation
- Design (what is compared to what on which data set(s)?)

Winter break

(11) 06. 01.: Design evaluation study

Class content: Understanding (and using) R Code for interaction with the OpenAI API

Bring your laptop with R/RStudio or Posit Cloud account.
Share OpenAI key within research group.

Homework:

Prepare final prompt(s) and data sets
- Prompt(s) in separate txt files
- Structured response template(s) in R file
- Data set either .csv file (min. req. columns: text and gold) or texts as txt and gold as .csv, linked via filename and ID

(12) 13. 01.: Design evaluation study

Class content:

Update from project groups - what is missing to finish:
- Prompt(s)
- Structured response template(s)
- Data set, annotation
Practical help with conducting data collection via OpenAI API

Homework:

Finish prompt(s) and data sets
- Prompt(s) in separate txt files
- Structured response template(s) in R file
- Data set either .csv file (min. req. columns: text and gold) or texts as txt and gold as .csv, linked via filename and ID
Upload everything to Blackboard discussion board.

(13) 20. 01.: Conduct evaluation study

Class content: Help desk: Conducting data collection via OpenAI API

(14) 27. 01.: Conduct evaluation study

Class content: Understanding (and using) R Code for quantitative evaluation of AI classifier

(15) 03. 02.: Conduct evaluation study

Class content: Help desk: Qualitative and quantitative evaluation

(16) 10. 02.: Final presentations

Class content: Presentations and feedback (20 Minutes per team)

Research question
Short overview on measure: Definition, prompt
Research design (comparisons)
Results: Quantitative, qualitative
Discussion:
- What have we learned?
- What’s next?

Teamwork

Teamwork is a crucial part of the methods tutorial. Working in a team is also an important soft skill that you will need beyond this seminar and your academic education. Working in a team is more fun, creative, and productive. However, group work can also lead to conflicts. If you have problems in your group, please address them early within the group and/or to me.

Here are some recommendations:

Division of labor: Distribute tasks and responsibilities early and evenly. But also know that you will learn most about the tasks in which you actively participated.
Communication: Clarify early on how you want to communicate with each other. Agree on fixed dates for meetings and stick to them. If possible, plan a regular in-person work meeting. Use digital tools such as messengers, e-mail, or video conferences to coordinate.
Infrastructure & tools:
- Webex can be used for video calls and team chats.
- The university library offers group work spaces that you can book for group meetings on campus.
- Box.FU is a cloud storage solution. You can collaborate on documents and share files.
- Here is a list of software available to all FU students free of charge.
- Of course, you can and should use other tools that make collaboration easier. Please make sure that all group members have access to the tools.

Use of AI tools and plagiarism

You are likely aware of AI tools like ChatGPT that can assist you with various academic tasks. If you haven’t explored these tools yet, I encourage you to do so, as they are expected to become integral in both academic and professional settings. Being familiar with these tools and understanding their strengths and weaknesses is crucial. However, some ways of using them are more beneficial for your learning and academic success than others.

Before tackling an assignment with the help of an AI tool, consider what you might miss out on. Assignments are designed to help you practice certain skills (repeatedly), allowing you to improve and deepen your abilities over time. This improvement only occurs if you engage with the tasks independently. Relying on AI tools too early in the process will hinder your skill development. Conversely, not using AI tools at all means missing out on learning about a useful tool.

I recommend approaching each task initially without AI support to practice the necessary skills. Afterwards, compare your work with suggestions from an AI tool. These can be used to enhance your work. However, you will often find that the AI’s suggestions are incorrect or less suitable than your own. By comparing different tools and methods (e.g., prompting strategies), you can discover how to maximize the benefits of AI tools.

[All linked sources in this paragraph are in German, sorry] When submitting academic work, particularly essays or theses, please refer to IfPuK’s guidelines on using AI-based tools and plagiarism in the Guide to Academic Writing. You can also take a look at my provisional guideline for using AI in thesis work. It is crucial to document and transparently disclose the use of AI tools and information sources. You alone are responsible for your submitted work, including verifying its correctness and adherence to academic integrity standards. Plagiarism created by an AI tool remains plagiarism, even if you document the AI usage or are unaware of the plagiarized source.

In this class, using AI tools is allowed for the following purposes:

Assisting in understanding concepts or studies
Helping gather ideas or create outlines
Supporting specific steps in the research process (e.g., suggestions for questions or categories, selecting appropriate statistical tests)
Identifying and correcting grammar, spelling, and punctuation errors
Working with programming languages (e.g., R or Python)

The following uses of AI tools are not permitted in this class:

Using primarily AI-generated text (verbatim or edited) in presentations or written assignments without proper citation
Completing entire tasks, assignments, or papers using AI tools

This seminar is (for most students) not graded. You have the opportunity to engage practically in empirical work and receive feedback and suggestions for improvement. There is no benefit in using dishonest means here—so please don’t.

Diversity, equity, and inclusion

My goal is for all students to feel welcome and able to actively participate in this class. I strive to ensure that no one is discriminated against or excluded through course planning and my language. Likewise, I expect all participants to behave respectfully and appreciatively, acknowledging the opinions and experiences of other students. At the same time, it is clear that neither I nor the students will always fully meet this expectation. Therefore, I ask you to inform me or your peers if you feel uncomfortable or observe discriminatory behavior. If you prefer not to do this yourself, you can also appoint a trusted person to do so.

Mental health

Attending university is demanding and, as a time of transition, brings many challenges, both within and outside of your academic work. If you feel overwhelmed, please make use of support services such as the Mental Wellbeing support.point or the Psychological Counseling Service. Feel free to contact me directly or through a trusted person if your situation conflicts with the course requirements.

Contact information

Prof. Dr. Marko Bachl

Division Digital Research Methods

Email: marko.bachl@fu-berlin.de

Phone: +49-30-838-61565

Webex: Personal Meeting Room

Office: Garystr. 55, Room 274

Student office hours: Tuesday, 11:00-13:00, please book an appointment.

References

Alizadeh, M., Kubli, M., Samei, Z., Dehghani, S., Bermeo, J. D., Korobeynikova, M., & Gilardi, F. (2023). Open-source large language models outperform crowd workers and approach ChatGPT in text-annotation tasks. arXiv. https://doi.org/10.48550/arXiv.2307.02179

Bachl, M., & Scharkow, M. (2024). Computational text analysis. OSF. https://doi.org/10.31219/osf.io/3yhu8

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? 🦜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/gh677h

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165

Burnham, M. (2024). Stance detection: A practical guide to classifying political beliefs in text. Political Science Research and Methods, 1–18. https://doi.org/10.1017/psrm.2024.35

Chae, Y. (YJ)., & Davidson, T. (2024). Large language models for text classification: From zero-shot learning to fine-tuning. https://doi.org/gth4nm

Egami, N., Jacobs-Harukawa, M., Stewart, B. M., & Wei, H. (2023). Using large language model annotations for valid downstream statistical inference in social science: Design-based semi-supervised learning. arXiv. https://doi.org/10.48550/arXiv.2306.04746

Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30), e2305016120. https://doi.org/gsqx5m

Gupta, S., Shrivastava, V., Deshpande, A., Kalyan, A., Clark, P., Sabharwal, A., & Khot, T. (2024). Bias runs deep: Implicit reasoning biases in persona-assigned LLMs. arXiv. https://doi.org/10.48550/arXiv.2311.04892

He, X., Lin, Z., Gong, Y., Jin, A.-L., Zhang, H., Lin, C., Jiao, J., Yiu, S. M., Duan, N., & Chen, W. (2023). AnnoLLM: Making large language models to be better crowdsourced annotators. arXiv. https://doi.org/10.48550/arXiv.2303.16854

Heseltine, M., & Clemm von Hohenberg, B. (2024). Large language models as a substitute for human experts in annotating political text. Research & Politics, 11(1), 20531680241236239. https://doi.org/gtkhqr

Hoes, E., Altay, S., & Bermeo, J. (2023). Leveraging ChatGPT for efficient fact-checking. PsyArXiv. https://doi.org/10.31234/osf.io/qnjkf

Huang, F., Kwak, H., & An, J. (2023). Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. Companion Proceedings of the ACM Web Conference 2023, 294–297. https://doi.org/gth4nt

Kathirgamalingam, A., Lind, F., Bernhard, J., & Boomgaarden, H. G. (2024). Agree to disagree? Human and LLM coder bias for constructs of marginalization. OSF. https://doi.org/10.31235/osf.io/agpyr

Kjell, O., Giorgi, S., & Schwartz, H. A. (2023). The text-package: An R-package for analyzing and visualizing human language using natural language processing and transformers. Psychological Methods, 28(6), 1478–1498. https://doi.org/gsmcq8

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2023). Large language models are zero-shot reasoners. arXiv. https://doi.org/10.48550/arXiv.2205.11916

Krippendorff, K. (2019). Content analysis: An introduction to its methodology (4th ed.). SAGE Publications, Inc. https://doi.org/mmsp

Kroon, A., Welbers, K., Trilling, D., & Atteveldt, W. van. (2024). Advancing automated content analysis for a new era of media effects research: The key role of transfer learning. Communication Methods and Measures, 18(2), 142–162. https://doi.org/gsv44t

Kuzman, T., Mozetič, I., & Ljubešić, N. (2023). ChatGPT: Beginning of an end of manual linguistic data annotation? Use case of automatic genre identification. arXiv. https://doi.org/10.48550/arXiv.2303.03953

Lai, V., Ngo, N., Pouran Ben Veyseh, A., Man, H., Dernoncourt, F., Bui, T., & Nguyen, T. (2023). ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. In H. Bouamor, J. Pino, & K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 13171–13189). Association for Computational Linguistics. https://doi.org/gth4mp

Matter, D., Schirmer, M., Grinberg, N., & Pfeffer, J. (2024). Close to human-level agreement: Tracing journeys of violent speech in incel posts with GPT-4-enhanced annotations. arXiv. https://doi.org/10.48550/arXiv.2401.02001

Møller, A. G., Dalsgaard, J. A., Pera, A., & Aiello, L. M. (2024). The parrot dilemma: Human-labeled vs. LLM-augmented data in classification tasks. arXiv. https://doi.org/10.48550/arXiv.2304.13861

Neuendorf, K. A. (2017). The content analysis guidebook. SAGE Publications, Inc. https://doi.org/dz7p

Ollion, E., Shen, R., Macanovic, A., & Chatelain, A. (2024). ChatGPT for text annotation? Mind the hype! https://doi.org/gth4nw

Ornstein, J. T., Blasingame, E. N., & Truscott, J. S. (2023). How to train your stochastic parrot: Large language models for political texts. https://joeornstein.github.io/publications/ornstein-blasingame-truscott.pdf

Pangakis, N., Wolken, S., & Fasching, N. (2023). Automated annotation with generative AI requires validation. arXiv. https://doi.org/10.48550/arXiv.2306.00176

Plaza-del-Arco, F. M., Nozza, D., & Hovy, D. (2023). Leveraging label variation in large language models for zero-shot text classification. arXiv. https://doi.org/10.48550/arXiv.2307.12973

Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? arXiv. https://doi.org/10.48550/arXiv.2302.06476

Rathje, S., Mirea, D.-M., Sucholutsky, I., Marjieh, R., Robertson, C. E., & Van Bavel, J. J. (2024). GPT is an effective tool for multilingual psychological text analysis. Proceedings of the National Academy of Sciences, 121(34), e2308950121. https://doi.org/gt7hrw

Reiss, M. V. (2023). Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv. https://doi.org/10.48550/arXiv.2304.11085

Schulhoff, S., Ilie, M., Balepur, N., Kahadze, K., Liu, A., Si, C., Li, Y., Gupta, A., Han, H., Schulhoff, S., Dulepet, P. S., Vidyadhara, S., Ki, D., Agrawal, S., Pham, C., Kroiz, G., Li, F., Tao, H., Srivastava, A., … Resnik, P. (2024). The prompt report: A systematic survey of prompting techniques. arXiv. https://doi.org/10.48550/arXiv.2406.06608

Spirling, A. (2023). Why open-source generative AI models are an ethical way forward for science. Nature, 616(7957), 413–413. https://doi.org/gsqx6v

Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv. https://doi.org/10.48550/arXiv.2408.02442

Thalken, R., Stiglitz, E. H., Mimno, D., & Wilkens, M. (2023). Modeling legal reasoning: LM annotation at the edge of human agreement. arXiv. https://doi.org/10.48550/arXiv.2310.18440

Törnberg, P. (2023). How to use LLMs for text analysis. arXiv. https://doi.org/mqx9

Törnberg, P. (2024a). Large language models outperform expert coders and supervised classifiers at annotating political social media messages. Social Science Computer Review, 08944393241286471. https://doi.org/g8nnfx

Törnberg, P. (2024b). Best practices for text annotation with large language models. arXiv. https://doi.org/gtn9qf

Van Atteveldt, W., Trilling, D., & Arcila Calderón, C. (2022). Computational analysis of communication. Wiley Blackwell. https://v2.cssbook.net/

Weber, M., & Reichardt, M. (2023). Evaluation is all you need. Prompting generative large language models for annotation tasks in the social sciences. A primer using open models. arXiv. https://doi.org/10.48550/arXiv.2401.00284

Yang, K.-C., & Menczer, F. (2023). Large language models can rate news outlet credibility. arXiv. https://doi.org/10.48550/arXiv.2304.00228

Zhu, Y., Zhang, P., Haq, E.-U., Hui, P., & Tyson, G. (2023). Can ChatGPT reproduce human-generated labels? A study of social computing tasks. arXiv. https://doi.org/10.48550/arXiv.2304.10145

Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., & Yang, D. (2024). Can large language models transform computational social science? Computational Linguistics, 50(1), 237–291. https://doi.org/10.1162/coli_a_00502