Creates a semantic network from tweets returned by a Twitter search query. Semantic networks describe the semantic relationships between concepts. In this network the concepts are significant words and hashtags extracted from the tweet text. Network edges are weighted and represent the co-occurrence of words and hashtags in the same tweets.
# S3 method for semantic.twitter
Create(
  datasource,
  type,
  removeRetweets = TRUE,
  removeTermsOrHashtags = NULL,
  stopwords = TRUE,
  stopwordsLang = "en",
  stopwordsSrc = "smart",
  removeNumbers = TRUE,
  removeUrls = TRUE,
  termFreq = 5,
  hashtagFreq = 50,
  assoc = "limited",
  verbose = TRUE,
  ...
)
datasource: Collected social media data with "datasource" and "twitter" class names.
type: Character string. Type of network to be created, set to "semantic".
removeRetweets: Logical. Removes detected retweets from the tweet data. Default is TRUE.
removeTermsOrHashtags: Character vector. Words or hashtags to remove from the semantic network. For example, this parameter could be used to remove the search term or hashtag that was used to collect the data, by removing any nodes with a matching name. Default is NULL, which removes none.
stopwords: Logical. Removes stopwords from the tweet data. Default is TRUE.
stopwordsLang: Character string. Language of stopwords to use. Refer to the stopwords package for further information on supported languages. Default is "en".
stopwordsSrc: Character string. Source of stopwords list. Refer to the stopwords package for further information on supported sources. Default is "smart".
removeNumbers: Logical. Removes whole numerical tokens from the tweet text. For example, a year value such as 2020 will be removed, but not mixed values such as G20. Default is TRUE.
removeUrls: Logical. Removes twitter shortened URL tokens from the tweet text. Default is TRUE.
termFreq: Numeric integer. Specifies the percentage of most frequent words to include. For example, termFreq = 20 means that the 20 percent most frequently occurring words will be included in the semantic network as nodes. A larger percentage will increase the number of nodes and therefore the size of the graph. The default value is 5, meaning the top 5 percent most frequent words are used.
hashtagFreq: Numeric integer. Specifies the percentage of most frequent hashtags to include. For example, hashtagFreq = 20 means that the 20 percent most frequently occurring hashtags will be included in the semantic network as nodes. The default value is 50.
assoc: Character string. Association of nodes. A value of "limited" includes only edges between the most frequently occurring hashtags and terms. A value of "full" includes ties between the most frequently occurring hashtags and terms, hashtags and hashtags, and terms and terms. Default is "limited".
verbose: Logical. Output additional information about the network creation. Default is TRUE.
...: Additional parameters passed to the function. Not used in this method.
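As an illustration of the whole-number rule described for removeNumbers, the following base R sketch drops tokens made up entirely of digits and keeps mixed tokens such as G20. The token vector and the regular expression are assumptions chosen for illustration; the package's own tokenisation and matching may differ.

```r
# Hypothetical tokens; not taken from the package or from any real dataset.
tokens <- c("2020", "G20", "budget", "42", "covid19")

# A token is dropped only when it consists entirely of digits
# (assumed rule, based on the description of removeNumbers).
is_whole_number <- grepl("^[0-9]+$", tokens)
kept <- tokens[!is_whole_number]

print(kept)
```

Anchoring the pattern at both ends is what keeps mixed values: "G20" contains digits but is not entirely numeric, so it is retained.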
Network as a named list of two dataframes containing $nodes and $edges.
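To make the shape of the result more concrete, here is a minimal base R sketch of how weighted co-occurrence edges and the two assoc styles could be derived from tokenised tweets. It is not the package's implementation: the sample tweets, the helper names (pair_tokens, is_hashtag) and the column names (from, to, weight, name, type) are all assumptions made for illustration.

```r
# Hypothetical tokenised tweets, assumed to be already cleaned and
# restricted to the top terms and hashtags.
tweets <- list(
  c("#auspol", "budget", "tax"),
  c("#auspol", "budget"),
  c("#climate", "#auspol", "tax")
)

# Build every unordered pair of distinct tokens within one tweet.
pair_tokens <- function(tokens) {
  tokens <- sort(unique(tokens))
  if (length(tokens) < 2) {
    return(data.frame(from = character(0), to = character(0),
                      stringsAsFactors = FALSE))
  }
  pairs <- t(combn(tokens, 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], stringsAsFactors = FALSE)
}

all_pairs <- do.call(rbind, lapply(tweets, pair_tokens))

# Edge weight = number of tweets in which the pair appears together.
all_pairs$weight <- 1
edges_full <- aggregate(weight ~ from + to, data = all_pairs, FUN = sum)

# "limited" style: keep only edges joining a hashtag and a term.
is_hashtag <- function(x) startsWith(x, "#")
mixed <- is_hashtag(edges_full$from) != is_hashtag(edges_full$to)
edges_limited <- edges_full[mixed, ]

# Node table with an assumed type column.
nodes <- data.frame(name = sort(unique(unlist(tweets))),
                    stringsAsFactors = FALSE)
nodes$type <- ifelse(is_hashtag(nodes$name), "hashtag", "term")

net <- list(nodes = nodes, edges = edges_limited)
```

In this sketch edges_full corresponds to the "full" association (hashtag to term, hashtag to hashtag, and term to term), while edges_limited keeps only the hashtag to term ties.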
The words and hashtags passed to the function in the removeTermsOrHashtags parameter are removed before word frequencies are calculated. They are therefore excluded from the top percentage of most frequent terms completely, rather than simply filtered out of the final network.
The top percentages of frequently occurring hashtags (hashtagFreq) and words (termFreq) are each converted to a minimum frequency, and all terms with a frequency equal to or greater than that minimum are included in the network as nodes. For example, among the unique hashtags of varying frequencies in a dataset, the top 50% of total frequency, or most common hashtags, may calculate to the first 20 hashtags. The frequency of the 20th hashtag is then used as the minimum, and all hashtags of equal or greater frequency are included as part of the top 50% most frequently occurring hashtags. The number of top hashtags may therefore end up greater than 20 if more than one hashtag has a frequency matching the minimum. The exception is when the minimum frequency is 1 and hashtagFreq is set to less than 100; in this case only the first 20 hashtags are included.
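The minimum-frequency rule above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the function name top_terms is hypothetical, the counts are made up, and the way the percentage is turned into a count (a simple share of the unique items, rounded up) is an assumption, since the text does not specify it exactly.

```r
# Hypothetical hashtag counts, sorted from most to least frequent.
freq <- c("#a" = 9, "#b" = 7, "#c" = 4, "#d" = 4, "#e" = 2, "#f" = 1)
freq <- sort(freq, decreasing = TRUE)

top_terms <- function(freq, pct) {
  # Number of items the percentage works out to (assumed rule).
  n <- max(1, ceiling(length(freq) * pct / 100))
  min_freq <- freq[[n]]

  if (min_freq == 1 && pct < 100) {
    # Exception: do not pull in every frequency-1 item,
    # keep only the first n items.
    return(names(freq)[seq_len(n)])
  }

  # Otherwise keep everything tied with or above the minimum frequency.
  names(freq)[freq >= min_freq]
}

top_terms(freq, 50)
```

With these counts, 50 percent works out to three hashtags, the third has a frequency of 4, and the tied fourth hashtag is included as well, so four names are returned rather than three.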
Hashtags and words in the top percentages are included in the network as isolates if they never occur in the same tweet text as other top percentage terms.
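Isolates can be identified from a network of this shape by looking for nodes that never appear in the edge list. The sketch below uses made-up data, and the column names (name, from, to, weight) are assumptions rather than the documented layout of $nodes and $edges.

```r
# Hypothetical node and edge tables in the general shape of the result.
nodes <- data.frame(
  name = c("#auspol", "budget", "tax", "#climate"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("#auspol", "#auspol"),
  to = c("budget", "tax"),
  weight = c(2, 1),
  stringsAsFactors = FALSE
)

# A node is an isolate when it is not an endpoint of any edge.
isolates <- setdiff(nodes$name, union(edges$from, edges$to))

print(isolates)
```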
if (FALSE) {
  # twitter semantic network creation additionally requires the stopwords
  # package for working with text data
  # install.packages("stopwords")

  # create a twitter semantic network graph removing the hashtag "#auspol"
  # and using the top 2% frequently occurring words and 10% most frequently
  # occurring hashtags as nodes
  # (collect_tw is assumed to hold previously collected twitter data)
  net_semantic <- collect_tw |>
    Create("semantic", removeTermsOrHashtags = c("#auspol"),
           termFreq = 2, hashtagFreq = 10, verbose = TRUE)

  # network
  # net_semantic$nodes
  # net_semantic$edges
}