Lecturers: Ivana Kajić, Philipp Wicke
Fields: Cognitive Science, Linguistics, Artificial Intelligence

Content

This course examines the relationship between language, perception, and intelligence, using recent developments in generative AI as a central case study. Moving from cognitive linguistics to multimodal machine learning systems, the course investigates how AI systems transition from text-based representations to models that increasingly integrate perception and action. The four lectures progress from theoretical foundations to technical architectures and finally to societal and industrial implications.

Lecture 1 introduces the conceptual foundations of the course. We explore the hypothesis that human thinking is deeply structured by language, examining linguistic universals, linguistic relativity, and the role of metaphor and conceptual framing. Language is presented not merely as a communicative tool but as a generative system that structures world models. This session establishes the idea that if human cognition is scaffolded by language, then language-trained AI systems offer a particularly revealing lens through which to rethink intelligence.
Lecture 2 shifts the focus to embodiment and perceptual grounding. We examine theories of embodied cognition and consider how bodily experience shapes conceptual systems. The lecture discusses how abstract thought is rooted in sensorimotor experience and presents language as an interface between pre-linguistic cognition and articulated reasoning. By contrasting embodied human cognition with predominantly text-trained AI systems, this session sharpens the central question of the course: can intelligence emerge from language alone, or does meaningful understanding require grounding in perception and action?
Lecture 3 explores the technical foundations of modern generative AI, moving from large language models (LLMs) to multimodal architectures. After reviewing the core principles of transformer-based language models, the lecture expands to vision–language models, multimodal training paradigms, and large-scale deployment techniques such as retrieval-augmented generation and in-context learning. The session highlights how these systems are developed in practice, the role of human data and alignment, and current challenges including interpretability and safety. By examining how AI systems increasingly integrate text and perception, we assess both their capabilities and structural limitations.
Lecture 4 turns to real-world applications and broader impact. Rather than focusing exclusively on speculative AGI narratives, this session highlights how AI is already shaping scientific research, industrial processes, and economic infrastructures. We examine examples from scientific discovery, energy optimization, manufacturing, and operations research, alongside ongoing debates around trust, labor, and human–AI interaction. Designed as an interactive and discussion-based session, this lecture also critically evaluates the gap between technological hype and practical implementation, offering a forward-looking yet grounded perspective on the future of multimodal and agentic systems.

Literature

Kajić, Ivana, et al. “Evaluating numerical reasoning in text-to-image models.” Advances in Neural Information Processing Systems 37 (2024): 42211-42224.
Kajić, Ivana, and Aida Nematzadeh. “Evaluating Visual Number Discrimination in Deep Neural Networks.” Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 45. No. 45. 2023.
Albuquerque, I., I. Ktena, O. Wiles, I. Kajić, A. Rannen-Triki, C. Vasconcelos, and A. Nematzadeh. “Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation.” arXiv preprint arXiv:2511.10547 (2025).
Evans, Vyvyan, and Melanie Green. Cognitive linguistics: An introduction. Routledge, 2018.
Wei, Jason, et al. “Emergent abilities of large language models.” arXiv preprint arXiv:2206.07682 (2022).
Boroditsky, Lera. “Does language shape thought?: Mandarin and English speakers’ conceptions of time.” Cognitive psychology 43.1 (2001): 1-22.
Wicke, Philipp, and Lennart Wachowiak. “Exploring Spatial Schema Intuitions in Large Language and Vision Models.” Findings of the Association for Computational Linguistics: ACL 2024 (2024).
Wicke, Philipp, and Marianna Bolognesi. “Emoji-based semantic representations for abstract and concrete concepts.” Cognitive processing 21.4 (2020): 615-635.

Lecturers

Ivana Kajić is a Senior Research Scientist at Google DeepMind in Montréal, Canada. Her research applies methods and techniques from cognitive science to analyze and characterize the behavior of machine learning models. Specifically, this includes designing evaluation protocols, benchmarks, and metrics to comprehensively understand the capabilities and limitations of large vision-language models, which in recent years have demonstrated strong performance on a variety of tasks. She completed her PhD thesis, titled “Computational Mechanisms of Language Understanding and Use in the Brain and Behaviour”, in 2020 at the University of Waterloo in Canada.

Affiliation: Google DeepMind
Homepage: www.ivanakajic.me

Philipp Wicke studied Cognitive Science (B.Sc.) at the University of Osnabrück. During his studies he interned at the Dauwels Lab at NTU Singapore in the field of neuroinformatics and at the Creative Language Systems Lab at UCD Dublin, where he later wrote his dissertation on “Computational Storytelling as an Embodied Robot Performance with Gesture and Spatial Metaphor”. He was an assistant professor at the Center for Language and Information Processing (CIS) at LMU Munich and a Researcher in Residency at the Center for Advanced Studies (CAS). He conducts research in Natural Language Processing and teaches Artificial Intelligence at BTU Cottbus. He is also the Lead AI Engineer at AURYAL, a Europe-based neuro-tech startup funded by the German Federal Agency for Disruptive Innovation (SPRIND).

Affiliation: BTU Cottbus, AURYAL GmbH
Homepage: www.phil-wicke.com