VOSON Lab Code Blog: Mastodon Conversation Networks

Bryan Gertzel

Introduction

Mastodon is a microblogging platform that offers an alternative to traditional social media giants by embracing an open and distributed network model. Unlike other social media platforms, which are centralised in the sense that they are owned and governed by a single business entity, Mastodon servers (also called instances) are typically owned and operated by their users. The server instances collectively form the Mastodon network, which is part of the federated universe or fediverse and uses an open network protocol called ActivityPub for communication between services and with user software clients.

Pitched as a free and open alternative to Twitter, Mastodon allows users to sign up to an instance of their choosing, joining autonomous local communities with their own moderation and policies. Server instances are typically established around a community or area of interest, for example historians.social, an instance created for academic historians, and federated.press, a server centred around journalism. Users can interact with the community on their server, or more widely with users on other servers. Mastodon has its own flavour of common features: users can follow others, make and reply to posts (toots), boost (reblog) and favourite posts (similar to retweeting and liking), and use hashtags. Users can find global content through the federated timeline, which is a stream of messages emanating from the server instances that their community server is federated with, or local content through their own community server's local timeline.

Much like Twitter, Mastodon has a REST API with data endpoints and OAuth authentication to support third-party applications. Unlike Twitter, however, authentication and data collection occur at the level of the individual server instance, and support for public or private access to data can vary depending on the server provider's configuration and policies. Because of the open and transparent nature of the servers and networks, data can be collected from multiple, alternative access points. The fediverse presents a significantly lower barrier to collecting public data than more restrictive commercial social media platforms.
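
As a rough illustration of the per-instance nature of the API, the sketch below builds the endpoint URLs for a single post and its thread. The paths follow the public Mastodon REST API (`/api/v1/statuses/:id` and `/api/v1/statuses/:id/context`). The helper name `status_endpoints` is hypothetical and not part of any package, and no request is sent.

```r
# Hypothetical helper: build the REST endpoint URLs for a post and its
# thread ("context") on a given server instance. No request is sent.
status_endpoints <- function(instance, status_id) {
  base <- sprintf("https://%s/api/v1/statuses/%s", instance, status_id)
  c(status = base, context = paste0(base, "/context"))
}

# endpoints for the Ars Technica post collected later in this post
status_endpoints("mastodon.social", "110781427821159121")
```

The instance hostname is part of every endpoint, which is why both the server and the post identifier are needed for collection.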

This blog post will introduce a process for simple data collection and network creation for Mastodon conversation threads in R.

The rtoot package

The rtoot R package, developed by Schoch & Chan (2023), allows R code to interact with Mastodon API endpoints and perform various functions. The package allows the creation of either a public or private bearer token used to authenticate requests to the instance server API. A public bearer token is suitable for reading public data, whereas a private token would be used to perform user-scope operations such as posting toots. These tokens can be used by both apps and scripts, and no credentials (or even a user account) are required to create a public-scope token, making it easy for researchers to collect network data. In fact, the OAuth process is not even required to access public Mastodon data, but it is presented here for reference as it is standard practice for accessing social media API endpoints.

The following code snippet creates an OAuth bearer token with public access scope for the mastodon.social instance. At time of writing tokens generally do not expire, so this step only needs to be performed once if you save the token.

library(rtoot)

# create a public access bearer token, name and save it for future use
auth_setup(
  instance = "mastodon.social",
  type = "public",
  name = "mast_soc_public",
  path = "~/.rtoot_auth")

# Token of type "public" for instance mastodon.social is valid
# <mastodon bearer token> for instance: mastodon.social of type: public

# read saved bearer token
rtoot_token <- readRDS("~/.rtoot_auth/mast_soc_public.rds")

# bearer token value
rtoot_token$bearer
# [1] "fLb..............................spY"

token <- rtoot_token$bearer

Collecting data

This section will look at collecting data for a Mastodon conversation thread that will enable the construction of different types of networks. The following public post was made on July 27, 2023 by the Ars Technica account promoting a story on their blog site concerning the rebranding of Twitter as ‘X’.

Figure 1: Ars Technica news story toot on mastodon.social.

To collect the relevant posts related to the thread, we need to provide the desired server instance and toot identifier as parameters to the collection functions. These components can be extracted from a post URL as follows.

library(httr)

# thread url
toot_url <- "https://mastodon.social/@arstechnica/110781427821159121"

# extract toot id and server instance from url
toot_server <- parse_url(toot_url)$hostname   # mastodon.social
toot_id <- basename(parse_url(toot_url)$path) # 110781427821159121
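
The same extraction can be wrapped in a small reusable function that also rejects URLs that do not look like posts. This is a minimal sketch in base R, assuming post URLs of the form `https://<instance>/@<user>/<id>`; the name `parse_toot_url` is hypothetical.

```r
# Hypothetical helper: split a Mastodon post URL of the form
# https://<instance>/@<user>/<id> into its instance and post id.
parse_toot_url <- function(url) {
  pattern <- "^https?://([^/]+)/@[^/]+/([0-9]+)/?$"
  if (!grepl(pattern, url)) {
    stop("Not a recognised post URL: ", url, call. = FALSE)
  }
  list(
    instance = sub(pattern, "\\1", url),
    # ids stay as character: they are too large for R's integer type
    # and would lose precision if converted to double
    id = sub(pattern, "\\2", url)
  )
}

parts <- parse_toot_url("https://mastodon.social/@arstechnica/110781427821159121")
parts$instance  # "mastodon.social"
parts$id        # "110781427821159121"
```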

The rtoot function get_status can then be used to collect toots, and the get_context function to collect threads (a thread is referred to as a context by the API). The original post is not returned by get_context, so it must be collected separately using get_status.

# get original post and its thread
toot <- get_status(toot_id, toot_server, token)
thread <- get_context(toot_id, toot_server, token)

# original post
toot

# A tibble: 1 × 29
  id            uri   created_at          content visibility sensitive
  <chr>         <chr> <dttm>              <chr>   <chr>      <lgl>    
1 110781427821… http… 2023-07-26 16:53:59 "<p>Tw… public     FALSE    
# ℹ 23 more variables: spoiler_text <chr>, reblogs_count <int>,
#   favourites_count <int>, replies_count <int>, url <chr>,
#   in_reply_to_id <lgl>, in_reply_to_account_id <lgl>,
#   language <chr>, text <lgl>, application <I<list>>,
#   poll <I<list>>, card <I<list>>, account <list>, reblog <I<list>>,
#   media_attachments <list>, mentions <list>, tags <I<list>>,
#   emojis <I<list>>, favourited <lgl>, reblogged <lgl>, …

# thread object
names(thread)

[1] "ancestors"   "descendants"

# number of toots above the original post
nrow(thread$ancestors)

[1] 0

# number of toots below the original post
nrow(thread$descendants)

[1] 59

# rate-limit
attr(thread, "headers")

# A tibble: 1 × 3
  rate_limit rate_remaining rate_reset         
  <chr>      <chr>          <dttm>             
1 300        205            2023-07-26 21:25:00

Here we can see the structure of the original post and thread objects. The original post is a dataframe with 29 columns of post metadata, some containing nested list data, and only 1 observation. The thread object has two dataframes: ancestors, for posts above the specified post, which will always have 0 observations when the specified post is the original post of a thread, and descendants, which contains all of the reply posts below it. The API rate-limit is also included as an attribute of data objects returned by rtoot functions and can be used to monitor and manage the collection of larger data sets.
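
One way the rate-limit attribute might be used is to decide how long to pause between requests. The sketch below is not part of rtoot: `rate_limit_wait` and its `min_remaining` threshold are hypothetical, and the header table is a toy copy with the same column names and types as the output shown above.

```r
# Hypothetical helper: given a rate-limit header table shaped like the one
# above, return how many seconds to pause before the next request.
rate_limit_wait <- function(headers, min_remaining = 10, now = Sys.time()) {
  remaining <- as.integer(headers$rate_remaining[1])  # stored as character
  if (remaining > min_remaining) {
    return(0)
  }
  wait <- as.numeric(difftime(headers$rate_reset[1], now, units = "secs"))
  max(0, wait)
}

# toy header table; values are illustrative only
headers <- data.frame(
  rate_limit = "300",
  rate_remaining = "5",
  rate_reset = as.POSIXct("2023-07-26 21:25:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

now <- as.POSIXct("2023-07-26 21:24:00", tz = "UTC")
rate_limit_wait(headers, now = now)  # 60 seconds until the window resets
```

In a collection loop the result could be passed to `Sys.sleep()` after each request, for example `Sys.sleep(rate_limit_wait(attr(thread, "headers")))`.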

All of the post and user profile metadata (as nested data) is returned with the above function calls, so no further collection steps are needed to construct rich networks with many options for node and edge attributes.

Data wrangling

The collected data can now be combined into dataframes and cleaned up to make it easier to work with. This can be done in a number of ways that will largely depend on the types of analysis that are to be performed. In this blog post the dataframes will be organised to assist in generating common social network types with metadata.

The following steps combine all of the posts into a single dataframe, and then unnest and extract the user metadata. Users are the authors of the posts that constitute the collected thread.

Text analysis is often useful in social network analysis, and it should be noted that the Mastodon API only returns text content in HTML format, which will usually require some cleaning. In this case the HTML contains all of the style tags for displaying the data in a web page as well as the text, and this is often undesirable for token-based text analysis as the tags will also be processed. A simple technique is provided for removing unnecessary tags and replacing others to improve the text quality using the html_text2 function from the rvest package. It removes HTML tags but also has the advantage of retaining punctuation, emojis and some user-intended whitespace formatting.

The user profile description (present in the note field of user metadata) and post text (content field of post metadata) will be cleaned up using this method.
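
To show the effect of this cleaning on its own, here is a minimal sketch applied to an invented HTML fragment. The wrapper name `html_to_text` is hypothetical, and the sketch assumes every non-empty input contains markup, as Mastodon content normally does, so that read_html treats it as literal HTML rather than a file path.

```r
library(rvest)

# Hypothetical helper: convert a character vector of HTML fragments to text.
# read_html() takes one document at a time, so vapply() loops over the input;
# empty or missing values become empty strings because there is nothing to parse.
html_to_text <- function(x) {
  vapply(x, function(html) {
    if (is.na(html) || !nzchar(html)) return("")
    html_text2(read_html(html))
  }, character(1), USE.NAMES = FALSE)
}

# toy post content resembling the markup returned by the API (invented)
content <- c(
  "<p>Hello <span class=\"h-card\"><a href=\"https://example.org/@user\">@<span>user</span></a></span><br>second line</p>",
  ""
)

html_to_text(content)
```

The tags are dropped, the mention text is kept, and the `<br>` is kept as a line break in the returned string.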

library(dplyr)
library(rvest)
library(tidyr)

# combine collected toot data
toots <- toot |> bind_rows(thread$descendants, thread$ancestors)

# extract user metadata from nested account column
users <- toots |> 
  select(account) |> 
  unnest_wider(account) |> 
  distinct(id, .keep_all = TRUE)

# remove html tags from the user note field
# rowwise is used here because the html functions are not vectorized
users <- users |> 
  mutate(note = ifelse(nchar(note) < 1, "<span></span>", note)) |>
  rowwise() |>
  mutate(note = html_text2(read_html(note))) |>
  ungroup()

# number of users
nrow(users)

[1] 56

# user metadata
colnames(users)

 [1] "id"              "username"        "acct"           
 [4] "display_name"    "locked"          "bot"            
 [7] "discoverable"    "group"           "created_at"     
[10] "note"            "url"             "avatar"         
[13] "avatar_static"   "header"          "header_static"  
[16] "followers_count" "following_count" "statuses_count" 
[19] "last_status_at"  "fields"          "emojis"

Here we can see some of the metadata for the original poster @arstechnica in our user data.

library(reactable)

tbl_data <- users |>
  slice_head(n = 1) |>
  mutate(avatar = paste0("<img src='", avatar_static,"' height = '60px'>")) |>
  relocate(avatar) |>
  select(avatar, id, username, display_name, created_at, note,
         starts_with(c("follow", "status")))

reactable(tbl_data,
  columns = list(avatar = colDef(name = "avatar", html = TRUE)),
  bordered = TRUE, striped = TRUE, resizable = TRUE,
  wrap = TRUE, searchable = FALSE, compact = TRUE,
  pagination = FALSE, height = 200,
  style = list(fontFamily = "Arial", fontSize = "0.875rem"))

Figure 2: Ars Technica user metadata.

# convert html content to formatted text
toots <- toots |>
  rowwise() |>
  mutate(text = html_text2(read_html(content))) |>
  ungroup()

# available post metadata
colnames(toots)

 [1] "id"                     "uri"                   
 [3] "created_at"             "content"               
 [5] "visibility"             "sensitive"             
 [7] "spoiler_text"           "reblogs_count"         
 [9] "favourites_count"       "replies_count"         
[11] "url"                    "in_reply_to_id"        
[13] "in_reply_to_account_id" "language"              
[15] "text"                   "application"           
[17] "poll"                   "card"                  
[19] "account"                "reblog"                
[21] "media_attachments"      "mentions"              
[23] "tags"                   "emojis"                
[25] "favourited"             "reblogged"             
[27] "muted"                  "bookmarked"            
[29] "pinned"

We now have two dataframes, one for thread posts and their metadata called toots and one for users and their metadata called users that can be cross-referenced by user ID.
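
As a small illustration of that cross-referencing, the sketch below joins author metadata onto posts. The `toots_toy` and `users_toy` dataframes and all of their values are invented stand-ins for the collected data.

```r
library(dplyr)
library(tidyr)

# toy stand-ins for the toots and users dataframes; all values are invented
toots_toy <- tibble(
  id = c("p1", "p2", "p3"),
  account = list(list(id = "u1"), list(id = "u2"), list(id = "u1"))
)
users_toy <- tibble(
  id = c("u1", "u2"),
  username = c("alice", "bob"),
  followers_count = c(120L, 8L)
)

# lift the author id out of the nested account column, then join user metadata
toots_with_authors <- toots_toy |>
  hoist(account, account_id = "id") |>
  left_join(users_toy, by = c("account_id" = "id"))

toots_with_authors |> select(id, account_id, username, followers_count)
```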

Create networks

This section is about generating networks from the collected conversation thread data. To do this we need to build edge lists using the relational data fields present in the collection. These are the post and user ID fields: id (post id), account_id (author or user id), and the relational fields: in_reply_to_id (parent post id) and in_reply_to_account_id (parent post author user id).

Three network types will be created: a thread activity network, a user actor network and a community affiliation style network.

1. Activity network

The activity network represents the post-reply tree structure of the conversation thread. Nodes are the Mastodon posts that make up the thread and the edges signify directed replies; in this case the relationship between posts in a conversation is of only one type: reply. It should be noted that in a Mastodon conversation network a post can only be a reply to one post, meaning it can only ever have zero or one outbound reply edge, which creates branching tree structures.

In the following snippet the edge list is created in the first three lines of code and is sufficient to create a network graph. The remaining code adds some node metadata as attributes and styles a network graph plot.

# activity network edge list
activity_edges <- toots |>
  select(from = id, to = in_reply_to_id) |>
  filter(!is.na(to))

# set edge types
activity_edges$edge_type <- "reply"

# activity network nodes and metadata
activity_nodes <- toots |>
  hoist(account, account_id = "id") |>
  hoist(tags, tag = list("name"), tag_url = list("url")) |>
  hoist(mentions, mention_id = list("id"), mention_name = list("username"))

library(ggraph)
library(tidygraph)
library(stringr)

# create labels from post ids
activity_nodes$label <- paste0(str_sub(activity_nodes$id, 1, 2), 
                               "..",
                               str_sub(activity_nodes$id, -3, -1))

# set node types
activity_nodes <- activity_nodes |>
  mutate(node_type = if_else(id == toot_id, "op", "post"))

# create the network graph
g <- tbl_graph(nodes = activity_nodes, edges = activity_edges, directed = TRUE)

ggraph(g, layout = "kk") +
  geom_edge_arc(aes(edge_colour = factor(edge_type)),
                strength = 0,
                arrow = grid::arrow(length = unit(0.1, "inches")),
                start_cap = circle(1.5, 'mm'),
                end_cap = circle(1.5, 'mm')) +
  scale_edge_colour_manual(name = "edges", values = c(reply = "darkgray")) +
  geom_node_point(aes(colour = factor(node_type))) +
  scale_color_manual(name = "nodes",
                     values = c(post = "magenta", op = "blue")) +
  geom_node_text(nudge_y = -0.4, aes(label = label), size = 2.4, alpha = 0.25)

Figure 3: Conversation thread activity network. Post IDs as labels.

The network graph represents the structure of the posts in the thread or the posting activity. It reveals many single replies surrounding the original post or op node, and 11 slightly longer reply chains with some minimal branching indicating where reply posts have multiple replies.
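
Reply chain lengths like those described above can also be measured from the edge list rather than read off the plot. The following is a minimal base R sketch on an invented edge list; `reply_depth` is a hypothetical helper that relies on every reply having exactly one parent.

```r
# Hypothetical helper: number of reply steps from each post back to the
# original post, following the single parent link that every reply has.
reply_depth <- function(edges, root) {
  parent <- setNames(edges$to, edges$from)
  depth_of <- function(id) {
    depth <- 0L
    while (!identical(id, root)) {
      if (!id %in% names(parent)) return(NA_integer_)  # parent not collected
      id <- parent[[id]]
      depth <- depth + 1L
    }
    depth
  }
  ids <- c(root, edges$from)
  vapply(ids, depth_of, integer(1))
}

# toy edge list in the same from (reply) -> to (parent) form; ids are invented
edges_toy <- data.frame(
  from = c("b", "c", "d", "e"),
  to   = c("a", "a", "b", "d"),
  stringsAsFactors = FALSE
)

# each reply appears once as a source, so a thread of n posts has n - 1 edges
stopifnot(anyDuplicated(edges_toy$from) == 0)

depths <- reply_depth(edges_toy, root = "a")
depths         # a = 0, b = 1, c = 1, d = 2, e = 3
table(depths)  # posts per depth; depth 1 holds the direct replies
```

Applied to the collected data, the same function could take `activity_edges` and `toot_id` as its inputs.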

2. Actor network

The actor network represents the relationships between users in the conversation thread. Nodes are the users who authored the posts that make up the thread, and the edges again signify a reply, but this time an abstracted reply relationship to a user instead of a post. In contrast to the activity network, users can have many inbound and outbound ties between one another; more ties between two users potentially indicate that they have conversed more.

In the following snippet the edge list is again created in the first code block and is sufficient to create a network graph. User metadata is assigned through the node list which is used to add further meaning to the nodes in the graph. The remaining code styles the actor network graph plot.

# actor network edge list
actor_edges <- toots |>
  hoist(account, account_id = "id", account_name = "username") |>
  select(from = account_id, to = in_reply_to_account_id) |>
  filter(!is.na(to))

actor_nodes <- users

# set edge and node types
actor_nodes$node_type <- "user"
actor_edges$edge_type <- "reply"

# create labels from usernames
actor_nodes$label <- paste0("@..", str_sub(actor_nodes$username, -4, -1))

# create the network graph
g2 <- tbl_graph(nodes = actor_nodes, edges = actor_edges, directed = TRUE)

library(igraph)

actor_nodes <- actor_nodes |>
  mutate(node_type = if_else(acct == "arstechnica", "op", node_type)) |>
  mutate(label = if_else(acct == "arstechnica", "@arstechnica", label))

g2 <- graph_from_data_frame(actor_edges, vertices = actor_nodes, directed = TRUE)

E(g2)$weight <- 1
g2 <- simplify(g2, edge.attr.comb = list(weight = "sum"))

E(g2)$edge_type <- "reply"

g2 <- as_tbl_graph(g2)

ggraph(g2, layout = "dh") +
  geom_edge_link(aes(width = weight, colour = factor(edge_type)), alpha = 0.8,
                 arrow = grid::arrow(length = unit(0.1, "inches")),
                 start_cap = circle(1.5, 'mm'),
                 end_cap = circle(1.5, 'mm')) +
  scale_edge_width_continuous(range = c(0.8, 2), breaks = 1:5) +
  scale_edge_colour_manual(name = "edges", values = c(reply = "gray")) +
  geom_node_point(aes(colour = factor(node_type), size = degree(g2, V(g2)))) +
  scale_size(name = "degree", range = c(1, 5), breaks = c(0:3,40)) +
  scale_color_manual(name = "nodes", values = c(user = "purple", op = "blue")) +
  geom_node_text(nudge_y = -1, aes(label = label), size = 3, alpha = 0.35)

Figure 4: Conversation thread actor network. Truncated usernames as labels.

The actor network graph shows that there is little back and forth interaction between users in this conversation thread with only one user replying to another user more than once and no visible reciprocated edges.

g2a <- graph_from_data_frame(actor_edges, vertices = actor_nodes, directed = TRUE)

op_id <- actor_nodes |> filter(acct == "arstechnica") |> pull(id)
g2a <- delete_vertices(g2a, op_id)
g2a <- induced_subgraph(g2a, V(g2a)[which(degree(g2a, V(g2a)) >= 1)])

E(g2a)$edge_type <- "reply"

g2b <- as_tbl_graph(g2a)

ggraph(g2b, layout = "kk") +
  geom_edge_arc(strength = 0.2, aes(colour = factor(edge_type)), alpha = 1,
                arrow = grid::arrow(length = unit(0.1, "inches")),
                start_cap = circle(1.5, 'mm'),
                end_cap = circle(1.5, 'mm')) +
  scale_edge_colour_manual(name = "edges", values = c(reply = "gray")) +
  geom_node_point(aes(colour = factor(node_type))) +
  scale_color_manual(name = "nodes", values = c(user = "purple", op = "blue")) +
  geom_node_text(nudge_y = -0.3, aes(label = label), size = 3, alpha = 0.45)

Figure 5: Conversation thread actor network. Truncated usernames as labels. OP user and isolates removed.

Removing the original post author reveals that two users replied twice (to different users), two users received two replies and one received six replies. However, there is clearly no reciprocation (edges in both directions between a pair of users) or back-and-forth discussion between users in this thread.
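
Repeated and reciprocated replies can also be checked directly from an edge list instead of by inspecting the plots. The sketch below uses dplyr on an invented actor edge list; the object names are illustrative and it is an alternative to the igraph `simplify` approach used for the plot, not a replacement for it.

```r
library(dplyr)

# toy actor edge list (reply author -> parent author); ids are invented
actor_edges_toy <- tibble(
  from = c("u2", "u3", "u3", "u4", "u1"),
  to   = c("u1", "u1", "u1", "u2", "u2")
)

# collapse repeated replies between the same ordered pair into a weight
weighted <- actor_edges_toy |> count(from, to, name = "weight")

# pairs where one user replied to the other more than once
repeated <- weighted |> filter(weight > 1)

# reciprocated pairs: an edge exists in both directions;
# from < to keeps each pair once and drops self-replies
reciprocated <- weighted |>
  inner_join(weighted, by = c("from" = "to", "to" = "from"),
             suffix = c("", "_back")) |>
  filter(from < to)

repeated
reciprocated
```

In this toy data u3 replied to u1 twice, and u1 and u2 replied to each other once each.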

3. Affiliation network

A variation on the actor network, an affiliation network can be created to represent Mastodon server community ties within the conversation thread. Nodes in this network are the server instances or communities that users belong to. Once again the edges signify a reply but one that could be conceptualised more as a response from a particular community to another community via their affiliated members. This kind of network is possible to generate from Mastodon data because of the community affiliation created by the user when they choose a fediverse server instance to register their account to.

# parse and create a column for user server instances
domain_nodes <- users |>
  select(id, url) |>
  rowwise() |>
  mutate(acct_instance = parse_url(url)$hostname) |>
  select(-url)

# create the server edge list from the actor edge list
domain_edges <- actor_edges |>
  left_join(domain_nodes |>
              rename(from_instance = acct_instance), by = c("from" = "id")) |>
  left_join(domain_nodes |>
              rename(to_instance = acct_instance), by = c("to" = "id")) |>
  select(from = from_instance, to = to_instance)

# create a column for the number of users from each instance
domain_nodes <- domain_nodes |>
  count(acct_instance) |> ungroup() |> rename(id = acct_instance)

# output the number of server instances or nodes
domain_nodes |> nrow()

[1] 30

# create the network graph
g3 <- tbl_graph(nodes = domain_nodes, edges = domain_edges, directed = TRUE)

The servers with more than one user that posted in the conversation thread can be seen in the following output. There were only 5 instances represented with 2 or more users out of 30.

domain_nodes |> arrange(desc(n)) |> filter(n > 1) |> select(id, n)

# A tibble: 5 × 2
  id                   n
  <chr>            <int>
1 mastodon.social     19
2 mastodon.online      4
3 infosec.exchange     3
4 mstdn.ca             3
5 mstdn.social         2

g3 <- graph_from_data_frame(domain_edges, vertices = domain_nodes, directed = TRUE)

g3 <- as_tbl_graph(g3)

n_sizes <- sort(unique(V(g3)$n))
ggraph(g3, layout = "fr") +
  geom_edge_arc(strength = 0.2, alpha = 0.45, width = 0.6,
                arrow = grid::arrow(length = unit(0.1, "inches"), type = "closed"),
                start_cap = circle(2, 'mm'),
                end_cap = circle(3, 'mm'), color = "maroon") +
  geom_node_point(size = 1, alpha = 0) +
  geom_node_text(aes(label = name, size = n), alpha = 0.5, color = "black") +
  scale_size_continuous(name = "n users", range = c(3, 8), breaks = n_sizes)

Figure 6: Conversation thread actor affiliation network. Server instance names as nodes.

This graph presents an interesting view of server relations within the conversation thread, however it also shows again that there was somewhat limited interaction between users, this time as grouped by affiliated server. Two instances of reciprocated edges can be seen in this network but further analysis would be needed to determine any group relationships.
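
One simple piece of further analysis is to separate replies that stay within a single server from replies that cross between servers. The sketch below does this in base R on invented data; the hostnames, ids and object names are all hypothetical, and the hostname extraction assumes profile URLs of the form `https://<instance>/@<user>`.

```r
# toy users and actor edges; all ids and hostnames are invented
users_toy <- data.frame(
  id  = c("u1", "u2", "u3", "u4"),
  url = c("https://alpha.example/@ann", "https://alpha.example/@bo",
          "https://beta.example/@cy", "https://gamma.example/@di"),
  stringsAsFactors = FALSE
)
edges_toy <- data.frame(
  from = c("u2", "u3", "u4", "u1"),
  to   = c("u1", "u1", "u3", "u3"),
  stringsAsFactors = FALSE
)

# vectorised hostname extraction, an alternative to parsing row by row
instance <- setNames(
  sub("^https?://([^/]+)/.*$", "\\1", users_toy$url),
  users_toy$id
)

# map each user-level reply onto the instances of the two users
instance_edges <- data.frame(
  from = unname(instance[edges_toy$from]),
  to   = unname(instance[edges_toy$to]),
  stringsAsFactors = FALSE
)

# replies that stay inside one instance appear as self-loops
instance_edges$within <- instance_edges$from == instance_edges$to
table(instance_edges$within)
```

Here one of the four replies is between two users on the same toy instance, while the other three cross instance boundaries.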

This article is intended as a simple introduction to using the rtoot package, collecting Mastodon public conversation threads and generating networks using R. Mastodon makes it much easier to collect social network data compared with other social media platforms. With increased interest and usage due to disruption on other microblogging services, it could become a very useful and accessible resource for researchers interested in Social Network Analysis and studying interaction, communities, and information diffusion online.

References

Schoch, D. & Chan, C. (2023). Software presentation: Rtoot: Collecting and analyzing mastodon data. Mobile Media & Communication, 20501579231176678. https://doi.org/10.1177/20501579231176678
