Project Study or Master Thesis with yathos.
Key facts

- When: Start anytime. Applications are open!
- How to apply: Send an e-mail to the address at the end of this page with your CV and a grade report.
Background
This IDP explores how to access and analyze German Wikipedia articles with a focus on structured and time-based content extraction. The project aims to identify the most efficient methods (such as API usage or web scraping) to retrieve key components of an article (e.g., title, introduction text, publication date) as well as its version history over time.
A specific focus lies on comparing different extraction strategies in terms of performance, completeness, and sustainability. A key task is to strategically sample from various types of Wikipedia content, such as "Exzellente Artikel", and to analyze how extraction quality varies across sampling methods and article types. The project will contribute to a broader understanding of how encyclopedic content evolves and how it can be programmatically accessed for research purposes.
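As a rough illustration, one of the candidate methods (the MediaWiki Action API, here with the TextExtracts extension) can be sketched as below. The helper names and the single-page response parsing are illustrative assumptions, not project requirements:

```python
from urllib.parse import urlencode

API_URL = "https://de.wikipedia.org/w/api.php"

def build_intro_query(title: str) -> str:
    """Build an Action API query URL for an article's intro extract.

    Uses prop=extracts (TextExtracts extension) with exintro so that
    only the lead section is returned.
    """
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "exintro": 1,      # lead section only
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # resolve redirects to the canonical title
    }
    return f"{API_URL}?{urlencode(params)}"

def parse_extract(response: dict) -> str:
    """Pull the plain-text extract out of a decoded API response."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))  # one title queried -> one page
    return page.get("extract", "")

url = build_intro_query("Albert Einstein")
```

Fetching `url` with any HTTP client and passing the decoded JSON to `parse_extract` would yield the article's introduction text; the same pattern extends to other `prop` modules.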
Who We Are
yathos is a software development and consulting company with a focus on tailor-made software for research and business. We aim to provide reliable, low-maintenance software products that ensure the future success of our customers. We provide full service, from consulting and project management to implementation and operation of software.
The Chair for Strategy and Organization is focused on research with impact. This means we do not want to repeat old ideas and base our research solely on work done 10 years ago. Instead, we research topics that will shape the future: Agile Organisations and Digital Disruption, Blockchain Technology, Creativity and Innovation, Digital Transformation and Business Model Innovation, Diversity, Education Technology and Performance Management, HRTech, Leadership, and Teams. We are always early in noticing trends, technologies, strategies, and organisations that shape the future, which has its ups and downs.
Goals
- Implement an efficient and reproducible workflow for extracting structured content from German Wikipedia articles (e.g., title, full article text, publication date, version history, article history)
- Implement and test multiple extraction methods, such as:
  - MediaWiki API
  - Wikipedia dumps
  - Web scraping
- Compare and evaluate these methods with respect to:
  - Data completeness and accuracy
  - Ease of access and automation
  - Long-term maintainability and scalability
- Develop effective sampling strategies to select representative sets of articles (e.g., "Exzellente Artikel" vs. average articles)
- Provide structured, research-ready output datasets (e.g., JSON or CSV format) or direct export to a SQL database
- Ensure reproducibility and extensibility of the workflow for integration into future research projects (e.g., media bias analysis)
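The version-history and export goals above can be sketched against the shape of an Action API `prop=revisions` response. The sample response below is a hand-written stand-in for a real API reply, and the flattening into CSV rows is just one possible output format:

```python
import csv
import io

def revisions_to_rows(response: dict) -> list:
    """Flatten a prop=revisions API response into one row per revision."""
    rows = []
    for page in response["query"]["pages"].values():
        for rev in page.get("revisions", []):
            rows.append({
                "title": page["title"],
                "revid": rev["revid"],
                "timestamp": rev["timestamp"],
                "user": rev.get("user", ""),  # may be hidden for some revisions
            })
    return rows

# Hand-written sample mimicking the API's response shape (an assumption,
# not real data) -- a live query would return this structure as JSON.
sample = {
    "query": {"pages": {"123": {"title": "Beispiel", "revisions": [
        {"revid": 1, "timestamp": "2020-01-01T00:00:00Z", "user": "A"},
        {"revid": 2, "timestamp": "2021-06-15T12:00:00Z", "user": "B"},
    ]}}}
}

rows = revisions_to_rows(sample)

# Write the flattened rows as CSV, one of the research-ready formats.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "revid", "timestamp", "user"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

Keeping the flattening step separate from the fetching step makes it easy to swap in a different source (dumps, scraping) or a different sink (JSON, SQL) when comparing methods.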
Profile
Required Skills
- Python programming experience
- Working with RESTful APIs (e.g., MediaWiki API)
- Handling JSON/XML data
- Basic data processing experience
Optional / Bonus Skills
- Java/JavaEE
- SQL or lightweight database usage (e.g., SQLite)
- Experience with Wikipedia Dumps or MediaWiki revision history
- Experience with sampling strategies for data analysis
- Familiarity with Wikipedia markup
Deliverables
- Fully functional and well-documented source code for data extraction
- Structured output files (e.g., JSON, CSV) that include title, intro, full text, publication date, version info, and category
- Code comments and a short summary document describing the methodology and comparison of approaches
- Optional: visualization or benchmark of method performance
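One possible shape for a single record in the structured output files is sketched below; the field names follow the deliverables list, but the exact schema is an assumption to be settled during the project:

```json
{
  "title": "Beispielartikel",
  "intro": "Erster Absatz des Artikels ...",
  "full_text": "Vollständiger Artikeltext ...",
  "publication_date": "2004-03-17T09:12:00Z",
  "category": "Exzellente Artikel",
  "versions": [
    {"revid": 1234567, "timestamp": "2004-03-17T09:12:00Z", "user": "A"}
  ]
}
```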
How to Apply
If you are interested, please contact Joe Yu (joe.yu@tum.de) by submitting the following documents in one PDF:
- Grade report
- Short overview of your software development experience
- List of programming languages, tools, and any relevant experience in working with APIs, scraping, or structured data extraction
Contact
Joe Yu (joe.yu@tum.de)