On Scraping Mastodon

Jan 27, 2020 · 8 min read

Mastodon was scraped, again. It was not the first time it had happened, and it probably won’t be the last. This time it was for research, not just the archiving we had encountered in the past. The actual scraping happened in 2018, but the research was only recently published, which is why we’re talking about it now.

Background:

The research article, “Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform”, was written by authors from the Computer Science Department at the University of Milan. The same group had previously published another research article related to Mastodon, “The Footprints of a “Mastodon”: How a Decentralized Architecture Influences Online Social Relationships”. That earlier paper also showed many misunderstandings of both the technology and the culture of Mastodon.

While it is tempting to do a complete analysis of the research, in this post I will point out a few issues with it, both from a technical perspective and an ethical one. In doing so I will reference and quote a few sections, but this will not be a full analysis of the paper.

They wrote that they hashed the usernames, but included the URI of the posts in their database, which has the username in it. — Screenshot from Mastodon
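To make the problem in that screenshot concrete, here is a minimal sketch of why hashing a username does little when the post URI is stored next to it. The record layout, the field names, the use of SHA-256, and the `example.social` instance are all assumptions made for illustration, not the paper's actual schema or method. The only thing it relies on is that Mastodon post addresses typically take the form `https://instance/users/name/statuses/id` or `https://instance/@name/id`.

```python
import hashlib
import re
from typing import Optional

# Hypothetical hashing step: SHA-256 is an assumption, the paper's
# actual hashing method is not described here.
def hash_username(username: str) -> str:
    return hashlib.sha256(username.encode("utf-8")).hexdigest()

# Matches the two common shapes of a Mastodon post address:
#   https://<host>/users/<name>/statuses/<id>
#   https://<host>/@<name>/<id>
URI_PATTERN = re.compile(r"^https://(?P<host>[^/]+)/(?:@|users/)(?P<user>[^/]+)/")

def recover_account(post_uri: str) -> Optional[str]:
    """Return 'user@host' if the URI exposes the account, else None."""
    match = URI_PATTERN.match(post_uri)
    if match is None:
        return None
    return f"{match.group('user')}@{match.group('host')}"

# Hypothetical "anonymized" record: the username is hashed,
# but the post URI is stored in full.
record = {
    "user_hash": hash_username("alice"),
    "uri": "https://example.social/users/alice/statuses/1",
}

# The hash hides nothing, because the URI names the account outright.
recovered = recover_account(record["uri"])
```

With a record like this, anyone holding the dataset can read the account name straight out of the `uri` field without ever having to reverse the hash.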

Both research papers included datasets. The first focused on metadata, and the dataset from this latest paper could be matched against the previous one, even though it was “anonymized”. However, it was brought to my…

Written by maloki