On Scraping Mastodon

Jan 27, 2020

Mastodon was scraped, again. It was not the first time, and it probably won’t be the last. This time it was for research, rather than the archiving we had encountered in the past. The actual scraping happened in 2018, but the research was only recently published, which is why we’re talking about it now.


The research article, “Mastodon Content Warnings: Inappropriate Contents in a Microblogging Platform”, was written by authors from the Computer Science Department at the University of Milan. The same group has previously published another research article related to Mastodon, “The Footprints of a ‘Mastodon’: How a Decentralized Architecture Influences Online Social Relationships”. That earlier paper also showed a lot of misunderstandings of both the technology and the culture of Mastodon.

While it is tempting to do a complete analysis of the research, in this post I will only point out a few issues with it, from both a technical and an ethical perspective, referencing and quoting a few sections along the way.

They wrote that they hashed the usernames, but included the URI of the posts in their database, which has the username in it.
Screenshot from Mastodon

Both research papers came with datasets. The first focused on metadata, and the latest one’s dataset could be matched against the previous one, even though it was “anonymized”. It was also brought to my attention that their anonymization was pointless, because the username was still in the URI.
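To make the URI problem concrete, here is a minimal sketch in Python. It is not the researchers’ code. The record layout, the use of SHA-256, the example.social host and the two URL shapes are all my own assumptions for illustration. The point is only that if a post URI of this kind sits next to a hashed username, anyone can read the username straight back out of the URI.

```python
import hashlib
import re
from urllib.parse import urlparse

# Two path shapes commonly seen for Mastodon posts. These patterns are
# assumptions for illustration, not taken from the paper or its dataset.
_PATH_PATTERNS = (
    re.compile(r"^/@(?P<user>[^/]+)/\d+$"),
    re.compile(r"^/users/(?P<user>[^/]+)/statuses/\d+$"),
)


def hash_username(username: str) -> str:
    """Hypothetical 'anonymization' step: an unsalted SHA-256 of the username."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


def username_from_uri(uri: str):
    """Recover user@host from a post URI, or None if the path shape is unknown."""
    parsed = urlparse(uri)
    for pattern in _PATH_PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            return f"{match.group('user')}@{parsed.netloc}"
    return None


# A made-up record: the author field is hashed, but the URI is kept as-is.
record = {
    "author": hash_username("alice"),
    "uri": "https://example.social/users/alice/statuses/101",
}

# The hash does nothing here, because the URI still names the account.
recovered = username_from_uri(record["uri"])
print(recovered)
```

In other words, hashing one column does not anonymize a row if another column still carries the same information in plain text.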

The second dataset, the one for the latest research paper, has been removed from online access with the comment:

“Deaccessioned Reason: Legal issue or Data Usage Agreement Many entries in the datasets do not fulfill the law about personal data release since they allow identification of personal information.”

Does this mean that they did not take any of these things into account when they wrote the paper to begin with? If we look at their ethical and legal considerations, we can see that they half-considered it and, I would argue, missed the mark. From the way most people were talking about it, it did not seem like they had made any ethical or legal considerations in their research at all. Reading them, I realized that they probably would’ve been better off if they had written the legal consideration first, and then had that inform the…



