
Maëlick Claes - Portfolio

A portfolio of various data/ML/NLP software I produced.


This page highlights some of the work I produced while conducting research at the University of Oulu.


textpype is a work-in-progress Python library for building pipelines that evaluate combinations of ML classification algorithms, NLP preprocessing steps, and data sampling techniques on different datasets. It is partly based on a paper presented at MSR 2021.
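To give a sense of what "evaluating combinations" involves, here is a minimal sketch in plain Python that enumerates every combination of classifier, preprocessing, sampling technique, and dataset. It does not use textpype itself: the option names and the `enumerate_experiments` helper are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical option names, for illustration only;
# they are not textpype's actual configuration keys.
CLASSIFIERS = ["logistic_regression", "naive_bayes"]
PREPROCESSING = ["lowercase", "lowercase+stemming"]
SAMPLING = ["none", "oversample_minority"]
DATASETS = ["dataset_a", "dataset_b"]


def enumerate_experiments(classifiers, preprocessing, sampling, datasets):
    """Return one configuration dict per combination of the four dimensions."""
    return [
        {"classifier": c, "preprocessing": p, "sampling": s, "dataset": d}
        for c, p, s, d in product(classifiers, preprocessing, sampling, datasets)
    ]


experiments = enumerate_experiments(CLASSIFIERS, PREPROCESSING, SAMPLING, DATASETS)
print(len(experiments))  # 2 * 2 * 2 * 2 = 16 configurations
```

The number of configurations grows multiplicatively with each dimension, which is the main reason to automate such evaluations in a pipeline rather than run them by hand.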


FinnishSentiment is a package for conducting sentiment analysis of Finnish text using logistic regression. The model is trained on almost 2000 manually classified tweets about COVID-19, and the paper behind it is currently under review. There is also a second repository containing additional scripts for replicating all the results of the paper.


20-MAD is a dataset that was shared and presented at MSR 2020. The data itself is hosted at OSF. The main code for data extraction and processing is hosted in a GitHub repository as an R package, while a second repository glues together the different packages, documents the data, and contains a few additional scripts. A Docker image is also available for replication purposes.

Natural Language or Not (NLoN)

NLoN is an R package to automatically detect whether a line of text is natural language or not. This work is the result of a paper presented at MSR 2018.

RSentiStrength and RSenti4SD

RSentiStrength and RSenti4SD are two simple R packages for running two sentiment analysis tools: SentiStrength and Senti4SD.


TextFeatures is an R package for generating features to feed to machine learning models for text classification.


EmoticonFindeR is a package for detecting emoticons and emojis in text data.
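As a rough illustration of the task, the sketch below finds a few Western-style emoticons and some emojis with regular expressions. It is written in Python for consistency with the other example on this page and is not EmoticonFindeR's implementation or API: the patterns and the `find_emoticons` function are simplified assumptions that cover only a small subset of real-world cases.

```python
import re

# Simplified, assumed pattern: eyes (: ; =), an optional nose (- or '),
# and a mouth. Real emoticon inventories are far larger than this.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPpOo|\]\[]")

# Emojis from a few common Unicode ranges only (an assumption, not a
# complete definition of what counts as an emoji).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def find_emoticons(text):
    """Return the emoticons found in text, followed by the emojis found."""
    return EMOTICON_RE.findall(text) + EMOJI_RE.findall(text)


print(find_emoticons("Great job :) see you ;-) \U0001F600"))
```

Note that the result lists emoticons first and emojis second rather than in order of appearance; a version that preserved positions would use `re.finditer` and sort matches by their start offset.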