Creates a semantic network from tweets returned by a twitter search query. Semantic networks describe the semantic relationships between concepts. In this network the concepts are significant words and hashtags extracted from the tweet text. Network edges are weighted and represent the co-occurrence of words and hashtags in the same tweets.

# S3 method for semantic.twitter
Create(
  datasource,
  type,
  removeRetweets = TRUE,
  removeTermsOrHashtags = NULL,
  stopwords = TRUE,
  stopwordsLang = "en",
  stopwordsSrc = "smart",
  removeNumbers = TRUE,
  removeUrls = TRUE,
  termFreq = 5,
  hashtagFreq = 50,
  assoc = "limited",
  verbose = TRUE,
  ...
)

Arguments

datasource

Collected social media data with "datasource" and "twitter" class names.

type

Character string. Type of network to be created, set to "semantic".

removeRetweets

Logical. Removes detected retweets from the tweet data. Default is TRUE.

removeTermsOrHashtags

Character vector. Words or hashtags to remove from the semantic network. For example, this parameter could be used to remove the search term or hashtag that was used to collect the data, by removing any nodes with a matching name. Default is NULL to remove none.

stopwords

Logical. Removes stopwords from the tweet data. Default is TRUE.

stopwordsLang

Character string. Language of stopwords to use. Refer to the stopwords package for further information on supported languages. Default is "en".

stopwordsSrc

Character string. Source of stopwords list. Refer to the stopwords package for further information on supported sources. Default is "smart".

removeNumbers

Logical. Removes whole numerical tokens from the tweet text. For example, a year value such as 2020 will be removed but not mixed values such as G20. Default is TRUE.

removeUrls

Logical. Removes twitter shortened URL tokens from the tweet text. Default is TRUE.

termFreq

Numeric integer. Specifies the percentage of most frequent words to include. For example, termFreq = 20 means that the 20 percent most frequently occurring words will be included in the semantic network as nodes. A larger percentage will increase the number of nodes and therefore the size of the graph. The default value is 5, meaning the top 5 percent most frequent words are used.

hashtagFreq

Numeric integer. Specifies the percentage of most frequent hashtags to include. For example, hashtagFreq = 20 means that the 20 percent most frequently occurring hashtags will be included in the semantic network as nodes. The default value is 50.

assoc

Character string. Association of nodes. A value of "limited" includes only edges between most frequently occurring hashtags and terms. A value of "full" includes ties between most frequently occurring hashtags and terms, hashtags and hashtags, and terms and terms. Default is "limited".

verbose

Logical. Output additional information about the network creation. Default is TRUE.

...

Additional parameters passed to the function. Not used in this method.
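
The difference between the two assoc values described above can be pictured with a small base R sketch. This is only an illustration under assumptions, not the package's implementation: the edge list, its column names (from, to, weight), and the helper is_hashtag are all hypothetical, and "limited" is read here as keeping only ties that join a hashtag to a term.

```r
# Hypothetical co-occurrence edges; the column names are assumptions made
# for this sketch and are not taken from the package output.
edges <- data.frame(
  from = c("#climate", "#climate", "energy", "#budget"),
  to = c("energy", "#energy", "policy", "tax"),
  weight = c(2, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat any token starting with "#" as a hashtag.
is_hashtag <- function(x) startsWith(x, "#")

# assoc = "limited" (as interpreted here): keep only ties where exactly one
# end is a hashtag, i.e. hashtag-to-term ties.
limited <- edges[xor(is_hashtag(edges$from), is_hashtag(edges$to)), ]

# assoc = "full": hashtag-term, hashtag-hashtag and term-term ties are kept.
full <- edges

print(limited)
print(full)
```

In this toy data the hashtag-to-hashtag tie and the term-to-term tie are dropped under the "limited" reading, while "full" retains all four ties.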

Value

Network as a named list of two dataframes containing $nodes and $edges.

Note

The words and hashtags passed to the function in the removeTermsOrHashtags parameter are removed before word frequencies are calculated. They are therefore excluded from the top percentage of most frequent terms completely, rather than simply filtered out of the final network.

The top percentages of frequently occurring hashtags (hashtagFreq) and words (termFreq) are each converted to a minimum frequency, and all terms with a frequency equal to or greater than that minimum are included in the network as nodes. For example, given unique hashtags of varying frequencies in a dataset, the top 50% of total frequency, or most common hashtags, may work out to be the first 20 hashtags. The frequency of the 20th hashtag is then used as the minimum, and all hashtags of equal or greater frequency are included as part of the top 50% most frequently occurring hashtags. The number of top hashtags may therefore end up greater than 20 if more than one hashtag has a frequency matching the minimum. The exception is when the minimum frequency is 1 and hashtagFreq is set to less than 100; in that case only the first 20 hashtags are included.
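
The minimum-frequency rule in this note can be sketched in base R as follows. This is a minimal sketch and not the package's code: the function top_terms and the sample frequencies are hypothetical, and the cut-off rank is assumed to be the given percentage of the number of unique terms, rounded up.

```r
# Hypothetical helper, not part of the package.
# freq: named numeric vector of term frequencies; pct: percentage (0-100].
top_terms <- function(freq, pct) {
  # Sort by descending frequency; order() keeps tied terms in input order.
  freq <- freq[order(-freq)]
  # Assumption: the cut-off rank is pct percent of the unique terms, rounded up.
  k <- max(1, ceiling(length(freq) * pct / 100))
  # Frequency of the term at the cut-off rank becomes the minimum.
  min_freq <- freq[[k]]
  if (min_freq == 1 && pct < 100) {
    # Exception from the note: no tie expansion when the minimum is 1.
    return(names(freq)[seq_len(k)])
  }
  # Otherwise include every term at or above the minimum frequency.
  names(freq)[freq >= min_freq]
}

# Hypothetical hashtag counts.
freq <- c("#a" = 10, "#b" = 7, "#c" = 7, "#d" = 3, "#e" = 1, "#f" = 1)

top_30 <- top_terms(freq, 30)   # cut-off rank 2, minimum 7, tie pulls in "#c"
top_80 <- top_terms(freq, 80)   # cut-off rank 5, minimum 1, exception applies
top_100 <- top_terms(freq, 100) # everything is kept
```

With these counts, the 30 percent cut-off lands on a frequency shared by two hashtags, so three hashtags are returned rather than two. At 80 percent the minimum frequency is 1, so only the first five hashtags are kept.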

Hashtags and words in the top percentages are included in the network as isolates if there are no instances of them occurring in tweet text with other top percentage frequency terms.
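
As an illustration of how weighted co-occurrence ties and isolates can arise, the base R sketch below builds an edge list from a few tokenised tweets. The tweets, the vector of top terms, and the column names are hypothetical, and the approach is an assumption for demonstration rather than the method used by the package.

```r
# Hypothetical tokenised tweets and a hypothetical set of top-percentage terms.
tweets <- list(
  c("#climate", "policy", "energy"),
  c("#climate", "energy"),
  c("#budget"),
  c("policy", "energy")
)
top_terms <- c("#climate", "#budget", "policy", "energy")

# Collect every unordered pair of top terms that share a tweet.
pairs <- do.call(rbind, lapply(tweets, function(tokens) {
  # Radix sort gives a locale-independent order, so pairs are always
  # written the same way round.
  terms <- sort(unique(intersect(tokens, top_terms)), method = "radix")
  if (length(terms) < 2) return(NULL)
  t(combn(terms, 2))
}))

# Count how often each pair occurs; the count is the edge weight.
edges <- as.data.frame(pairs, stringsAsFactors = FALSE)
names(edges) <- c("from", "to")
edges$weight <- 1
edges <- aggregate(weight ~ from + to, data = edges, FUN = sum)

# Top terms that never co-occur with another top term remain as isolates.
isolates <- setdiff(top_terms, c(edges$from, edges$to))

print(edges)
print(isolates)
```

Here "#budget" only ever appears on its own, so it would remain a node with no ties.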

Examples

if (FALSE) {
# twitter semantic network creation additionally requires the stopwords
# package for working with text data
# install.packages("stopwords")

# create a twitter semantic network graph removing the hashtag "#auspol"
# and using the top 2% most frequently occurring words and 10% most
# frequently occurring hashtags as nodes
# (collect_tw is assumed to hold previously collected twitter data)
net_semantic <- collect_tw |>
  Create("semantic", removeTermsOrHashtags = c("#auspol"),
         termFreq = 2, hashtagFreq = 10, verbose = TRUE)

# network
# net_semantic$nodes
# net_semantic$edges
}