The project

Evidence of authentic language use is fundamental for language learning. One way to develop authentic language learning materials is to use examples from corpora, i.e., large collections of texts produced in natural contexts and stored in electronic form. However, these corpora may include sensitive content or offensive language, in addition to exhibiting structural problems. Although such language use is unquestionably authentic, we recommend that corpora be carefully reviewed and inappropriate content flagged before they are used in education, leaving the decision to use particular examples to teachers and developers of teaching materials, according to their needs and context. In other words, from our perspective, pedagogical corpora should be labelled for potentially problematic content rather than cleaned of it. To streamline the verification of sentences for the creation of problem-labelled pedagogical corpora, we decided to ask the crowd for help. It was in this context that the Crowdsourcing Corpus Filtering for Pedagogical Purposes project was created.

The Crowdsourcing Corpus Filtering for Pedagogical Purposes project aims to create pedagogical corpora of Dutch, Estonian, Slovene and Portuguese by applying crowdsourcing techniques. These pedagogical corpora can be used to develop auxiliary language learning resources, such as Sketch Engine for Language Learning (SKELL; Baisa & Suchomel, 2014), dictionaries and teaching materials. Within Natural Language Processing, they can also be used to create datasets for training machine learning algorithms that compile larger pedagogical corpora.

In phase 1, we carried out an experiment on the use of crowdsourcing for corpus filtering in which we asked the crowd to identify offensive sentences for pedagogical purposes. The experiment was implemented on the Pybossa platform.

In phase 2, we are developing the Crowdsourcing for Language Learning game (CrowLL). CrowLL is a multilevel, multilanguage, platform-responsive game in which players identify problematic sentences, classify them, and indicate the problematic excerpts. The types of problems to be labelled are: vulgar, offensive, sensitive content, grammar/spelling problems, and incomprehensible/lack of context. Data preparation for the game involved manual annotation of automatically extracted sentences in all four languages. This task was funded by the CLARIN Resource Families Project Funding, and the annotated corpora are available at the PORTULAN CLARIN repository.

Publications

  1. Blog post (02 November 2023): Manually Annotated Corpora for Teaching and Learning Purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene – the CrowLL project
  2. Zingano Kuhn, Tanara; Kosem, Iztok; Arhar Holdt, Špela; Tiberius, Carole; Koppel, Kristina; Zviel-Girshin, Rina. (2023). Usando crowdsourcing e gamification em lexicografia: o jogo CrowLL [Using crowdsourcing and gamification in lexicography: the CrowLL game] [oral communication]. AmericaLex-S Inaugural Conference. 20 – 25 October 2023, University of São Paulo, São Paulo, Brazil.
  3. Zingano Kuhn, T., Tiberius, C., Arhar Holdt, Š., Koppel, K., Kosem, I., Zviel-Girshin, R., Luís, A. R. (2023). Developing Manually Annotated Corpora for Teaching and Learning Purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene (the CrowLL Project). In: Lindén, K., Niemi, J., and Kontino, T. (eds.) CLARIN Annual Conference Proceedings, 2023. ISSN 2773-2177 (online). 16-18 October 2023, Leuven, Belgium, pp. 173-177.
  4. Manually annotated corpora for teaching and learning purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene. https://hdl.handle.net/21.11129/0000-0010-05DA-3. Released 31 August 2023.
  5. Zingano Kuhn, T., Koppel, K., Arhar Holdt, Š., Tiberius, C., Zviel-Girshin, R., Kosem, I. (2023). Annotating corpora for language learning and lexicography with the Crowdsourcing for Language Learning (CrowLL) game. In: Medveď, M., Měchura, M., Tiberius, C., Kosem, I., Kallas, J., Jakubíček, M. & Krek, S. (eds.) (2023). Electronic lexicography in the 21st century (eLex 2023): Invisible Lexicography. Book of abstracts. Brno, 27–29 June 2023. Brno: Lexical Computing CZ s.r.o., pp. 13-14.
  6. Kuhn, Tanara Zingano; Zviel-Girshin, Rina. (2023). Webinar Crowdsourcing corpus filtering for pedagogical purpose project: A fruitful partnership between computer science and linguistics promoted by the EuroCALL CorpusCALL SIG, 26 January 2023.
  7. Zingano Kuhn, T., Arhar Holdt, Š., Kosem, I., Tiberius, C., Koppel, K., Zviel-Girshin, R. (2022). Data preparation in crowdsourcing for pedagogical purposes: the case of the CrowLL game. Slovenščina 2.0, 10(2): 62–100.
  8. Kuhn, Tanara Zingano; Tiberius, Carole; Arhar Holdt, Špela; Kosem, Iztok; Koppel, Kristina; Zviel-Girshin, Rina; Luís, Ana R. (2022) The CrowLL project - Manually-annotated corpora for teaching and learning purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene. Poster presented at the CLARIN Bazaar 2022. CLARIN Annual Conference 2022, 10-12 October 2022, Prague, Czech Republic.
  9. Kuhn, Tanara Zingano; Arhar Holdt, Špela; Zviel-Girshin, Rina; Luís, Ana R.; Tiberius, Carole; Koppel, Kristina; Todorović, Branislava S.; Kosem, Iztok (2022). Introducing Crowll – the Crowdsourcing for Language Learning game. In Book of Abstracts of the XX Euralex International Congress, 12-16 July 2022, Mannheim, Germany.
  10. Kuhn, Tanara Zingano (2021). Closing plenary ‘O projeto Crowdsourcing Corpus Filtering for Pedagogical Purposes’ [The Crowdsourcing Corpus Filtering for Pedagogical Purposes project] at the Brazilian School for Computational Linguistics (EBRALC) 2021. 23 November 2021.
  11. Zviel-Girshin, Rina; Kuhn, Tanara Zingano; Luís, Ana R.; Koppel, Kristina; Šandrih, Branislava; Arhar Holdt, Špela; Tiberius, Carole; Kosem, Iztok. (2021). Developing pedagogically appropriate language corpora through crowdsourcing and gamification. In: Zoghlami, Naouel; Brudermann, Cédric; Sarré, Cedric; Grosbois, Muriel; Bradley, Linda; Thouësny, Sylvie. CALL and professionalisation: short papers from EUROCALL 2021. Research-publishing.net
  12. Kuhn, Tanara Zingano; Zviel-Girshin, Rina; Arhar Holdt, Špela; Šandrih, Branislava; Tiberius, Carole; Luís, Ana R; Jokić, Danka; Koppel, Kristina; Kosem, Iztok. (2021) Gamifying the path to corpus-based pedagogical dictionaries. Electronic lexicography in the 21st century (eLex 2021): post-editing lexicography Book of Abstracts, eLex 2021.
  13. Zingano Kuhn, Tanara; Todorović, Branislava Šandrih; Arhar Holdt, Špela; Zviel-Girshin, Rina; Koppel, Kristina; Luís, Ana R.; Kosem, Iztok (2021). Crowdsourcing pedagogical corpora for lexicographical purposes. Euralex 2020 Congress, Alexandroupolis, Greece (online), 07-09 September 2021. Book of Abstracts EURALEX XIX Congress, p. 193.
  14. Zingano Kuhn, Tanara; Todorović, Branislava Šandrih; Arhar Holdt, Špela; Zviel-Girshin, Rina; Koppel, Kristina; Luís, Ana R.; Kosem, Iztok (2021). Crowdsourcing pedagogical corpora for lexicographical purposes. In Proceedings of the EURALEX XIX Congress, volume 2.
  15. Kuhn, Tanara Zingano; Šandrih, Branislava; Zviel-Girshin, Rina; Arhar-Holdt, Špela; Schoonheim, Tanneke; Dekker, Peter (2019). Using crowdsourcing for corpus filtering. Part 2: preliminary results. WG1 workshop of the European Network for Combining Language Learning with Crowdsourcing Techniques. University of Coimbra, Coimbra, Portugal, 6 December 2019.
  16. Kuhn, Tanara Zingano; Šandrih, Branislava; Zviel-Girshin, Rina; Arhar-Holdt, Špela; Schoonheim, Tanneke; Dekker, Peter (2019). Using crowdsourcing for corpus filtering (part 1). WG1 workshop of the European Network for Combining Language Learning with Crowdsourcing Techniques, University of Coimbra, Coimbra, Portugal, 6 December 2019.
  17. Dekker, Peter; Kuhn, Tanara Zingano; Šandrih, Branislava; Zviel-Girshin, Rina, Arhar-Holdt, Špela, Schoonheim, Tanneke. (2019). Corpus Filtering via Crowdsourcing for Developing a Learner’s Dictionary. In: Kosem, Iztok, Kuhn, Tanara Zingano (eds) Book of abstracts of the Electronic lexicography in the 21st century (eLex 2019): Smart Lexicography, Sintra, Portugal, 01-03 October 2019. Brno, Czech Republic: Lexical Computing CZ s.r.o., p. 84-85.
  18. Kuhn, Tanara Zingano; Dekker, Peter; Šandrih, Branislava; Zviel-Girshin, Rina, Arhar-Holdt, Špela, Schoonheim, Tanneke. (2019). Corpus cleaning for language learning resource development. EUROCALL Conference 2019, Louvain-la-Neuve, Belgium, 28-31 August 2019. Book of Abstracts, p. 159.
  19. Kuhn, Tanara Zingano; Dekker, Peter; Šandrih, Branislava; Zviel-Girshin, Rina. (2019). Crowdsourcing corpus cleaning for language learning: an approach proposal. Poster presentation at the Working Groups 1&3 workshop. 3rd enetCollect Annual Meeting, Lisbon, Portugal, 13-14 March 2019.
  20. Kuhn, Tanara Zingano; Dekker, Peter. (2019). Report from Crowdfest: Crowdsourcing corpus cleaning for language learning. 3rd enetCollect Annual Meeting, Lisbon, Portugal, 13-14 March 2019.

Team members

Tanara Zingano Kuhn

Project leader

Centre for the Studies of General and Applied Linguistics at University of Coimbra (CELGA-ILTEC)
Brazil/Portugal

Ana Luís

Centre for the Studies of General and Applied Linguistics at University of Coimbra (CELGA-ILTEC)/Faculty of Arts and Humanities at University of Coimbra
Portugal

Carole Tiberius

Dutch Language Institute
Netherlands

Iztok Kosem

Centre for Language Resources and Technologies at the University of Ljubljana (CJVT UL)
Slovenia

Kristina Koppel

Institute of the Estonian Language
Estonia

Rina Zviel-Girshin

Ruppin Academic Center
Israel

Špela Arhar-Holdt

Centre for Language Resources and Technologies at the University of Ljubljana (CJVT UL)
Slovenia

Andressa Rodrigues Gomide

Former member

Centre for the Studies of General and Applied Linguistics at University of Coimbra (CELGA-ILTEC)
Brazil/Portugal

Branislava Šandrih Todorović

Former member

University of Belgrade, Faculty of Philology
Serbia

Danka Jokić

Former member

Serbia

Peter Dekker

Former member

Dutch Language Institute & AI Lab, Vrije Universiteit Brussel
Netherlands/Belgium

Ranka Stanković

Former member

Serbia

Tanneke Schoonheim

Former member

Dutch Language Institute
Netherlands