AI-assisted research across multilingual corpora

14 April, 9:00 – 12:00

Toruń, ISA, A.3.16.

Instructor: Dr Jakub Sypiański (Ca’ Foscari University of Venice, ERC SSE1K)


This workshop shows how large language models (LLMs) can genuinely accelerate and enrich research – with no programming experience required. We focus on concrete applications: analysing and translating historical sources, automatically extracting structured data from large corpora, and intelligently searching literature.

Instructor

Jakub Sypiański – historian and Arabist, PhD in History (Sorbonne, 2024). Postdoctoral research fellow in the ERC SSE1K project at Ca’ Foscari University of Venice. His work examines intellectual exchanges between Byzantium and the Islamic world (7th–11th c.), the environmental history of the medieval Mediterranean, and the application of AI to humanities research.

sypian.ski

What you will learn

– how to write effective prompts so that AI gives you useful answers
– how to analyse and translate historical sources (Latin, modern languages) with language models
– how to extract structured data from hundreds of documents – automatically, with quality control
– what RAG (retrieval-augmented generation) is and how to use it with your own notes or literature

Harvesting the Corpus

The central case study of the workshop is the Harvesting the Corpus project – an attempt to test Andrew Watson’s thesis on the Arab Agricultural Revolution using LLMs. Watson argued that 18 crop species (rice, cotton, sugarcane, aubergine, spinach, citrus fruits, and others) spread from East Africa and India through the Islamic world into Europe between the 9th and 15th centuries. The project tests this claim by mass-extracting crop mentions from medieval texts in eight corpora: classical Arabic literature (taken from OpenITI), the agronomic Filāḥa treatises (added manually), the Arabic Papyri Database, the Duke Databank of Documentary Papyri (Greek and Coptic papyri), the Digital Syriac Corpus, and three manually added sets of agricultural authors (Latin, Greek and Arabic). In total, the project has processed over 20,000 medieval documents – approximately 15 million words. It demonstrates how AI makes it possible to revisit historical questions that were previously out of reach because of the sheer scale of the sources.
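For readers curious what "quality control" on extracted data can look like, here is a minimal sketch in Python. It is not the project's actual pipeline. It assumes, purely for illustration, that a model returns a JSON list of records with hypothetical fields `crop` and `quote`, and it accepts a record only if the crop is in a controlled list and the quote appears verbatim in the source text. The function name, the field names, and the sample text are all invented for this example.

```python
import json

# Controlled vocabulary: a subset of the crops named in the description above.
CROPS = {"rice", "cotton", "sugarcane", "aubergine", "spinach"}


def validate_mentions(raw_json, source_text):
    """Split model output into (accepted, rejected) records.

    A record is accepted only if its crop is in CROPS and its quote
    occurs verbatim in source_text. Field names are hypothetical.
    """
    try:
        records = json.loads(raw_json)
    except json.JSONDecodeError:
        return [], [{"error": "invalid JSON"}]
    if not isinstance(records, list):
        return [], [{"error": "expected a JSON list"}]

    accepted, rejected = [], []
    for rec in records:
        if not isinstance(rec, dict):
            rejected.append({"error": "record is not an object"})
            continue
        crop = str(rec.get("crop", "")).lower()
        quote = str(rec.get("quote", ""))
        if crop not in CROPS:
            rejected.append({**rec, "error": "unknown crop"})
        elif not quote or quote not in source_text:
            rejected.append({**rec, "error": "quote not found in source"})
        else:
            accepted.append({"crop": crop, "quote": quote})
    return accepted, rejected


# Invented sample text and invented model output, for illustration only.
sample_text = "They planted rice near the river and sold cotton in the market."
sample_output = json.dumps([
    {"crop": "Rice", "quote": "planted rice near the river"},
    {"crop": "cotton", "quote": "cotton was taxed"},  # quote not in the text
    {"crop": "wheat", "quote": "sold"},               # crop not in the list
])
accepted, rejected = validate_mentions(sample_output, sample_text)
```

Requiring a verbatim quote is one simple way to catch mentions that a model has invented. Rejected records can then be set aside for manual review.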

Programme

Introduction and LLMs in practice – how models work, how they differ, how to write good prompts; exercise: analysing a historical source or a linguistics article.

AI in the humanities and linguistics – OCR and historical manuscripts, named-entity recognition, corpus processing, laboratory data analysis.

RAG and knowledge management – retrieval-augmented generation lets the model answer questions drawing exclusively from a closed corpus of sources you select yourself; tools: NotebookLM, Obsidian with local AI, LightRAG.
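The retrieval step behind RAG can be sketched in a few lines of Python. This is a simplified illustration, not how NotebookLM, Obsidian, or LightRAG work internally: it ranks passages by plain word overlap (cosine similarity over word counts) where real tools typically use learned embeddings, and it stops at building the prompt without calling any model. The function names and the placeholder notes are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return up to k passages that share vocabulary with the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question, passages, k=2):
    """Build a prompt that restricts the model to the retrieved passages."""
    hits = retrieve(question, passages, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(hits))
    return (
        "Answer using only the numbered sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Invented placeholder notes standing in for your own closed corpus.
NOTES = [
    "The treatise describes irrigation of rice fields.",
    "The letter concerns a shipment of glassware.",
    "The papyrus lists a tax on cotton.",
]
prompt = build_prompt("Where is rice mentioned?", NOTES)
```

The prompt built this way contains only the passages that matched the question, which is how a model can be steered to answer from a corpus you select yourself.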

Exercise and discussion – extracting structured data from your own sources; the role of AI in academic teaching.

Workshop outline

A more detailed outline, bibliography, and links to materials will be available at sypian.ski/ai4umk

An event co-organised by the Institute of Advanced Studies at Nicolaus Copernicus University in Toruń.