Project Study or Master Thesis with yathos.
Key facts

- When: Start anytime. Applications are open!
- How to apply: Send an e-mail to the address at the end of this page with your CV and a grade report.
Background
This IDP explores how to access and analyze German Wikipedia articles with a focus on structured and time-based content extraction. The project aims to identify the most efficient methods (such as API usage or web scraping) to retrieve key components of an article (e.g., title, introduction text, publication date) as well as its version history over time.
A specific focus lies on comparing different extraction strategies in terms of performance, completeness, and sustainability. A key task is to strategically sample from various types of Wikipedia content, such as "Exzellente Artikel", and to analyze how extraction quality varies across sampling methods and article types. The project will contribute to a broader understanding of how encyclopedic content evolves and how it can be programmatically accessed for research purposes.
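As a rough illustration, one of the candidate methods (the MediaWiki Action API, here with the TextExtracts extension) can be sketched as below. The helper names and the single-page response parsing are illustrative assumptions, not project requirements:

```python
from urllib.parse import urlencode

API_URL = "https://de.wikipedia.org/w/api.php"

def build_intro_query(title: str) -> str:
    """Build an Action API query URL for an article's intro extract.

    Uses prop=extracts (TextExtracts extension) with exintro so that
    only the lead section is returned.
    """
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "exintro": 1,      # lead section only
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # resolve redirects to the canonical title
    }
    return f"{API_URL}?{urlencode(params)}"

def parse_extract(response: dict) -> str:
    """Pull the plain-text extract out of a decoded API response."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))  # one title queried -> one page
    return page.get("extract", "")

url = build_intro_query("Albert Einstein")
```

Fetching `url` with any HTTP client and passing the decoded JSON to `parse_extract` would yield the article's introduction text; the same pattern extends to other `prop` modules.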
Who We Are
yathos is a software development and consulting company with a focus on tailor-made software for research and business. We aim to provide reliable, low-maintenance software products that ensure the future success of our customers. We provide full service, from consulting and project management to implementation and operation of software.
The Chair for Strategy and Organization is focused on research with impact. This means we do not want to repeat old ideas and base our research solely on work done 10 years ago. Instead, we research topics that will shape the future: Agile Organisations and Digital Disruption, Blockchain Technology, Creativity and Innovation, Digital Transformation and Business Model Innovation, Diversity, Education Technology and Performance Management, HRTech, Leadership, and Teams. We are always early in noticing trends, technologies, strategies, and organisations that shape the future, which has its ups and downs.
Goals
- Implement an efficient and reproducible workflow for extracting structured content from German Wikipedia articles (e.g., title, full article text, publication date, version history, article history)
- Implement and test multiple extraction methods, such as:
  - MediaWiki API
  - Wikipedia dumps
  - Web scraping
- Compare and evaluate these methods with respect to:
  - Data completeness and accuracy
  - Ease of access and automation
  - Long-term maintainability and scalability
- Develop effective sampling strategies to select representative sets of articles (e.g., "Exzellente Artikel" vs. average articles)
- Provide structured, research-ready output datasets (e.g., JSON or CSV format) or direct export to a SQL database
- Ensure reproducibility and extensibility of the workflow for integration into future research projects (e.g., media bias analysis)
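The version-history and export goals above can be sketched against the shape of an Action API `prop=revisions` response. The sample response below is a hand-written stand-in for a real API reply, and the flattening into CSV rows is just one possible output format:

```python
import csv
import io

def revisions_to_rows(response: dict) -> list:
    """Flatten a prop=revisions API response into one row per revision."""
    rows = []
    for page in response["query"]["pages"].values():
        for rev in page.get("revisions", []):
            rows.append({
                "title": page["title"],
                "revid": rev["revid"],
                "timestamp": rev["timestamp"],
                "user": rev.get("user", ""),  # may be hidden for some revisions
            })
    return rows

# Hand-written sample mimicking the API's response shape (an assumption,
# not real data) -- a live query would return this structure as JSON.
sample = {
    "query": {"pages": {"123": {"title": "Beispiel", "revisions": [
        {"revid": 1, "timestamp": "2020-01-01T00:00:00Z", "user": "A"},
        {"revid": 2, "timestamp": "2021-06-15T12:00:00Z", "user": "B"},
    ]}}}
}

rows = revisions_to_rows(sample)

# Write the flattened rows as CSV, one of the research-ready formats.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "revid", "timestamp", "user"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

Keeping the flattening step separate from the fetching step makes it easy to swap in a different source (dumps, scraping) or a different sink (JSON, SQL) when comparing methods.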
Profile
Required Skills
- Python programming experience
- Working with RESTful APIs (e.g., MediaWiki API)
- Handling JSON/XML data
- Basic data processing experience
Optional / Bonus Skills
- Java/JavaEE
- SQL or lightweight database usage (e.g., SQLite)
- Experience with Wikipedia Dumps or MediaWiki revision history
- Experience with sampling strategies for data analysis
- Familiarity with Wikipedia markup
Deliverables
- Fully functional and well-documented source code for data extraction
- Structured output files (e.g., JSON, CSV) that include title, intro, full text, publication date, version info, and category
- Code comments and a short summary document describing the methodology and comparison of approaches
- Optional: visualization or benchmark of method performance
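One possible shape for a single record in the structured output files is sketched below; the field names follow the deliverables list, but the exact schema is an assumption to be settled during the project:

```json
{
  "title": "Beispielartikel",
  "intro": "Erster Absatz des Artikels ...",
  "full_text": "Vollständiger Artikeltext ...",
  "publication_date": "2004-03-17T09:12:00Z",
  "category": "Exzellente Artikel",
  "versions": [
    {"revid": 1234567, "timestamp": "2004-03-17T09:12:00Z", "user": "A"}
  ]
}
```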
How to Apply
If you are interested, please contact Joe Yu (joe.yu@tum.de) by submitting the following documents in one PDF:
- Grade report
- Short overview of your software development experience
- List of programming languages, tools, and any relevant experience in working with APIs, scraping, or structured data extraction
Contact
Joe Yu (joe.yu@tum.de)