
Collects hyperlinks from web pages and structures the data into a dataframe with the class names "datasource" and "web".

Usage

# S3 method for class 'web'
Collect(credential, pages = NULL, ..., writeToFile = FALSE, verbose = TRUE)

collect_web_hyperlinks(pages = NULL, writeToFile = FALSE, verbose = TRUE, ...)

Arguments

credential

A credential object generated from Authenticate with class name "web".

pages

Dataframe. Web pages to crawl. The dataframe must have the columns page (character), type (character) and max_depth (integer). Each row is a seed web page to crawl, with the page value being the page URL. The type value sets the type of crawl: either "int", "ext" or "all", directing the crawler to follow only internal links, only external links (on a different domain to the seed page), or all links. The max_depth value determines how many levels of hyperlinks to follow from the seed site. An illustrative pages dataframe is sketched after this arguments list.

...

Additional parameters passed to the function. Not used in this method.

writeToFile

Logical. Write collected data to file. Default is FALSE.

verbose

Logical. Output additional information. Default is TRUE.
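
For illustration, a pages dataframe mixing the three crawl types might look like the sketch below (the second URL and the depth values are placeholders):

pages <- tibble::tibble(page = c("http://vosonlab.net",
                                 "https://www.example.org",
                                 "https://rsss.cass.anu.edu.au"),
                        type = c("int", "ext", "all"),
                        max_depth = c(2, 1, 2))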

Value

A tibble object with class names "datasource" and "web".
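
These classes can be checked on the returned object, for example (webData stands for the result of a Collect call, as in the examples below):

inherits(webData, "datasource") # TRUE
inherits(webData, "web")        # TRUE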

Examples

if (FALSE) { # \dontrun{
# a "web" credential generated from Authenticate, as described
# under the credential argument above
webAuth <- Authenticate("web")

pages <- tibble::tibble(page = c("http://vosonlab.net",
                                 "https://rsss.cass.anu.edu.au"),
                        type = c("int", "all"),
                        max_depth = c(2, 2))

webData <- webAuth |>
  Collect(pages, writeToFile = TRUE)
} # }
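
The collect_web_hyperlinks function from the Usage section performs the same collection without a credential object; a minimal sketch reusing the pages dataframe above:

if (FALSE) { # \dontrun{
webData <- collect_web_hyperlinks(pages, writeToFile = TRUE)
} # }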