Portfolio
This page highlights some of the work I produced while conducting research at the University of Oulu.
textpype
textpype is a work-in-progress Python library for building pipelines that evaluate combinations of ML classification algorithms, NLP preprocessing steps, and data sampling techniques on different datasets. It is partly based on a paper presented at MSR 2021.
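To give a rough idea of what evaluating combinations means here, the sketch below enumerates every pairing of dataset, preprocessing step, sampling technique, and classifier using only the Python standard library. It is not textpype's API: the function `enumerate_configurations` and all component names are hypothetical placeholders chosen for illustration.

```python
from itertools import product

# Hypothetical component names; none of these come from textpype itself.
DATASETS = ["dataset_a", "dataset_b"]
PREPROCESSING = ["lowercase", "stemming"]
SAMPLING = ["none", "oversample"]
CLASSIFIERS = ["logistic_regression", "naive_bayes"]


def enumerate_configurations(datasets, preprocessing, sampling, classifiers):
    """Return one dict per combination of dataset, preprocessing, sampler and classifier."""
    return [
        {"dataset": d, "preprocessing": p, "sampling": s, "classifier": c}
        for d, p, s, c in product(datasets, preprocessing, sampling, classifiers)
    ]


# Each configuration would then be trained and scored by the pipeline.
configurations = enumerate_configurations(DATASETS, PREPROCESSING, SAMPLING, CLASSIFIERS)
```

The number of configurations grows multiplicatively with each added option, which is why a dedicated pipeline for running and comparing them is useful.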
FinnishSentiment
FinnishSentiment is a package for conducting sentiment analysis of Finnish text using logistic regression. The model is trained on almost 2000 manually classified tweets about COVID-19, and the paper behind it is currently under review. There is also a second repository containing additional scripts for replicating all the results of the paper.
20-MAD
20-MAD is a dataset that was shared and presented at MSR 2020. The data itself is hosted at OSF. The main code for data extraction and processing is hosted in a GitHub repository as an R package, while a second repository glues together the different packages, documents the data, and contains a few additional scripts. A Docker image is also available for replication purposes.
Natural Language or Not (NLoN)
NLoN is an R package to automatically detect whether a line of text is natural language or not. This work is the result of a paper presented at MSR 2018.
RSentiStrength and RSenti4SD
RSentiStrength and RSenti4SD are two simple R packages for running two sentiment analysis tools: SentiStrength and Senti4SD.
TextFeatures
TextFeatures is an R package for generating features to feed to machine learning models for text classification.
EmoticonFindeR
EmoticonFindeR is a package for detecting emoticons and emojis in text data.
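As an illustration of the general idea only, the following Python sketch finds a few common Western-style emoticons and a range of emoji code points with regular expressions. The patterns and the function `find_emoticons` are assumptions made for this example; they are deliberately small and are not the rules EmoticonFindeR uses.

```python
import re

# Assumption: a deliberately small pattern covering emoticons such as :) ;-) =D :P
# (eyes, optional nose, mouth). Not the pattern used by EmoticonFindeR.
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\](\[dDpP/\\|]")

# Assumption: two Unicode ranges that contain many, but not all, emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def find_emoticons(text):
    """Return the emoticons found in text, followed by the emoji found in text."""
    return EMOTICON_RE.findall(text) + EMOJI_RE.findall(text)
```

A pattern this simple can produce false positives in text containing times, URLs, or code, so a real detector typically needs a larger and more careful rule set.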