# Extracting and Cleaning Bibliometric Data with R (1)

Exploring Scopus


After the roadmap for learning to programme in R, where I listed useful tutorials and packages, I propose here to get to the heart of the matter and learn how to extract and clean bibliometric data. This tutorial keeps in mind the issues that historians of economics can encounter, but I hope it will also be useful to other social scientists.

It is not easy to navigate between the different existing bibliometric databases (Web of Science, Scopus, Dimensions, Jstor, Econlit, RePEc…), and to know which information they gather and how much data we can extract. I will thus offer here a series of tutorials on how to find and extract such data (if possible, via R scripts), but also on how to clean them. Indeed, depending on the database one uses, the data come in different formats, and one often has to do some cleaning (notably for the bibliographic references cited by the corpus one has extracted). In this tutorial, we will focus on Scopus data, for which you will need institutional access. The next post focuses on Dimensions data and the one after will deal with the Constellate application of JSTOR. Both Dimensions and Constellate are accessible without institutional access.

At the end of this tutorial, you will know how to get the data needed for building bibliographic networks, and I will give an example of a co-citation network using the biblionetwork package (Goutsmedt, Claveau, and Truc 2021). All you need for this tutorial is a basic understanding of some functions of the Tidyverse packages (Wickham 2021; Wickham et al. 2019), mainly dplyr and stringr, as well as a (less) basic understanding of “regex” (Regular Expressions).1 At some points I dig into more complicated and tortuous methods, when using Elsevier/Scopus APIs with rscopus or when cleaning references with the machine learning software anystyle from the command line. These are explained in separate sections that beginners can skip.

Let’s first load all the packages we need and set the path to the directory where I put the data extracted from Scopus:

# packages
package_list <- c(
"here", # use for paths creation
"tidyverse",
"bib2df", # for cleaning .bib data
"janitor", # useful functions for cleaning imported data
"rscopus", # using Scopus API
"biblionetwork", # creating edges
"tidygraph", # for creating networks
"ggraph" # plotting networks
)
for (p in package_list) {
  if (p %in% installed.packages() == FALSE) {
    install.packages(p, dependencies = TRUE)
  }
  library(p, character.only = TRUE)
}

github_list <- c(
"agoutsmedt/networkflow", # manipulating network
"ParkerICI/vite" # needed for the spatialisation of the network
)
for (p in github_list) {
  if (gsub(".*/", "", p) %in% installed.packages() == FALSE) {
    devtools::install_github(p)
  }
  library(gsub(".*/", "", p), character.only = TRUE)
}

# paths
data_path <- here(path.expand("~"),
"data",
"tuto_biblio_dsge")
scopus_path <- here(data_path,
"scopus")


## Extracting data from Scopus

### Using Scopus website

The subject I focus on here is not very historical, as it is a very recent one: Dynamic Stochastic General Equilibrium (DSGE) models. This type of model emerged in the late 1990s and has become standard in academic publications as well as in policymaking institutions, notably in central banks.2

If you don’t have one, you will need to create an account on Scopus using your institutional email address. Once you are on the “Documents” query page, you can search for “DSGE” or “Dynamic Stochastic General Equilibrium” in document titles, abstracts and keywords (see Figure 1). In January 2022, I got 2633 results. You can select “All” the documents and then click on the down arrow on the right of “CSV export” (see Figure 2). You then have to choose all the information you want to download in a .csv. Make sure to tick the “Include References” box, as we need the references for the co-citation analysis later (see Figure 3).3

The .csv you have exported gathers metadata on authors, title, journal, abstract, keywords, affiliations and references for documents (mainly articles, but also conference papers, book chapters, and reviews) mentioning DSGE in their title, abstract or keywords. This method for extracting data has two limitations:

• the number of documents you can export is limited to 2000, which is not much;
• part of the metadata, like affiliations and references, is relatively raw, which requires some cleaning.

I will show you how to clean these raw data in the next section. But first, the following sub-section explains how to use Scopus APIs directly in R, to query different data more easily and extract more items.

### Alternative method: Using Scopus APIs and rscopus

You first need to create an “API Key” on the Scopus website (see more info here), associated with your account. Using the APIs allows you to extract larger sets of data (see the available APIs and corresponding quotas here).

Let’s use rscopus and set the API key:

api_key <- "your_api_key"
set_api_key(api_key)


Most of the time (and it was the case for me) your institutional access is linked to an IP address, and you cannot use the APIs if you are not connected to your institution’s network. If you want to work remotely, you need to ask for “token-based” authentication, that is for an “Institutional Token” or “Insttoken” (see explanations here). You just have to write to Elsevier to explain which kind of research you are doing and ask them for an “insttoken”. That is also a good occasion, if necessary, to ask them for higher quotas.

Let’s set the institutional token:

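insttoken <- "your_institutional_token"
insttoken <- inst_token_header(insttoken) # rscopus helper wrapping the token as an HTTP header, so it can be passed to `headers =` below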


We first run the query using the “Scopus Search API” via rscopus. That is the API corresponding to the search we have done above on the Scopus website. We can now extract as many as 20000 items per week. We get raw data with a lot of information in it (the data but also information on the query, etc.). Using the rscopus gen_entries_to_df function, we convert these raw data into data.frames.

dsge_query <- rscopus::scopus_search("TITLE-ABS-KEY(DSGE) OR TITLE-ABS-KEY(\"Dynamic Stochastic General Equilibrium\")",
                                     view = "COMPLETE",
                                     headers = insttoken) # same token header as in the reference loop further down
dsge_data_raw <- gen_entries_to_df(dsge_query$entries)

We can finally separate the different data.frames. We get three tables with different types of information:

• A table with all the articles and their metadata;
• A table with the list of all authors of the articles;
• A table with the list of affiliations.

dsge_papers <- dsge_data_raw$df
dsge_affiliations <- dsge_data_raw$affiliation
dsge_authors <- dsge_data_raw$author


Now that we have the articles, we have to extract the references using the Scopus “Abstract Retrieval API”. We use the articles’ internal identifiers to find references. But we cannot query references with multiple identifiers, so we need a loop to extract references article by article. We create a list where we put the references of each article (we have as many data.frames as articles in our list) and we bind the data.frames together, associating each of them with the identifier of the corresponding citing article.4

citing_articles <- dsge_papers$`dc:identifier` # extracting the IDs of our articles on dsge
citation_list <- list()

for(i in seq_along(citing_articles)){
  citations_query <- abstract_retrieval(citing_articles[i],
                                        identifier = "scopus_id",
                                        view = "REF",
                                        headers = insttoken)
  citations <- gen_entries_to_df(citations_query$content$`abstracts-retrieval-response`$references$reference)

  message(i)
  if(length(citations$df) > 0){
    message(paste0(citing_articles[i], " is not empty."))
    citations <- citations$df %>%
      as_tibble(.name_repair = "unique") %>%
      select_if(~!all(is.na(.))) # dropping the columns that are entirely empty

    citation_list[[citing_articles[i]]] <- citations
  }
}

dsge_references <- bind_rows(citation_list, .id = "citing_art")

Once you have these four data.frames (articles metadata, authors, affiliations and references), you can proceed to the bibliometric analysis. The only difficulty now is to navigate between the numerous columns of each data.frame and to understand what the different pieces of information are.

To link the articles metadata data.frame with the reference one, you can use the Scopus identifier that we have put in the reference table (in the column citing_art). If you want to join the articles metadata data.frame with the authors and affiliations ones, you have to use the entry_number column that exists in the three data.frames. This number is created by Scopus after your query, meaning that articles, affiliations and authors are not linked by a permanent identifier. Consequently, the identifiers will be regenerated, and thus different, if you change your query.

In addition to saving you the laborious task of cleaning references and affiliations, the APIs return references that already have an identifier, given by Scopus. It means that if two articles cite the same reference, you will know it, because this reference will have the same identifier in the citations of both articles. Below, when manipulating the data extracted from the Scopus website, we will need to find by ourselves which citations correspond to the same reference.5

If you are not too afraid of Scopus categories and language, or of querying a website through an R script, that is perhaps the quickest method to get data (and a method that allows you to get more data, with more complicated queries). However, in the next section, I show you how to clean the data extracted from the Scopus website above.
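
To make the joins described above more concrete, here is a minimal sketch on invented toy tables. The column names `scopus_id`, `authname` and `ref_id` are hypothetical stand-ins chosen for the illustration (the real tables returned by the API have many more columns, with Scopus-specific names); only `entry_number` and `citing_art` come from the text above.

```r
library(dplyr)

# Toy stand-ins for the tables returned by the API (all values are invented)
papers <- tibble(
  entry_number = c(1, 2),
  scopus_id = c("SCOPUS_ID:111", "SCOPUS_ID:222"), # hypothetical column name
  title = c("Paper one", "Paper two")
)
authors <- tibble(
  entry_number = c(1, 1, 2),
  authname = c("Author A.", "Author B.", "Author C.") # hypothetical column name
)
references <- tibble(
  citing_art = c("SCOPUS_ID:111", "SCOPUS_ID:222", "SCOPUS_ID:222"),
  ref_id = c("r1", "r1", "r2") # hypothetical column name
)

# Articles <-> authors: joined on the query-specific entry_number
papers_authors <- papers %>%
  left_join(authors, by = "entry_number")

# Articles <-> references: joined on the identifier stored in citing_art
papers_refs <- papers %>%
  left_join(references, by = c("scopus_id" = "citing_art"))

# A reference cited by several articles keeps the same identifier
shared_refs <- references %>%
  count(ref_id, name = "nb_cit") %>%
  filter(nb_cit > 1)
```

Because `entry_number` is regenerated with each query, a join like the first one is only safe between tables coming from the same query.
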
## Cleaning Scopus data

So let’s come back to the data we have downloaded on the Scopus website:

#' # Cleaning scopus data from website search
#'

scopus_search_1 <- read_csv(here(scopus_path, "scopus_search_1998-2013.csv"))
scopus_search_2 <- read_csv(here(scopus_path, "scopus_search_2014-2021.csv"))
scopus_search <- rbind(scopus_search_1, scopus_search_2) %>%
  mutate(citing_id = paste0("A", 1:n())) %>% # We create a unique ID for each doc of our corpus
  clean_names() # janitor function to clean column names

scopus_search

    ## # A tibble: 2,608 x 23
    ##    authors      title  year source_title volume issue art_no page_start page_end
    ##    <chr>        <chr> <dbl> <chr>        <chr>  <chr> <chr>  <chr>      <chr>
    ##  1 Ha J., So I. Infl~  2013 Global Econ~ 42     4     <NA>   396        424
    ##  2 Botero J., ~ Exog~  2013 Revista de ~ 16     1     <NA>   1          24
    ##  3 Kirsanova T~ Comm~  2013 Internation~ 9      4     <NA>   99         151
    ##  4 Ashimov A.A~ Para~  2013 Economic De~ <NA>   <NA>  <NA>   95         188
    ##  5 Khan A., Th~ Cred~  2013 Journal of ~ 121    6     <NA>   1055       1107
    ##  6 Jerábek T.,~ Pred~  2013 Acta Univer~ 61     7     <NA>   2229       2238
    ##  7 Hu X., Xu B~ Infl~  2013 Journal of ~ 5      12    <NA>   636        641
    ##  8 Cha H.       Taki~  2013 Internation~ 7      2     <NA>   280        296
    ##  9 da Silva M.~ The ~  2013 North Ameri~ 26     <NA>  <NA>   266        281
    ## 10 Sandri D., ~ Fina~  2013 Journal of ~ 45     SUPP~ <NA>   59         86
    ## # ... with 2,598 more rows, and 14 more variables: page_count <dbl>,
    ## #   cited_by <dbl>, doi <chr>, affiliations <chr>,
    ## #   authors_with_affiliations <chr>, abstract <chr>, author_keywords <chr>,
    ## #   index_keywords <chr>, references <chr>, correspondence_address <chr>,
    ## #   language_of_original_document <chr>, document_type <chr>, source <chr>,
    ## #   citing_id <chr>

There are several things to clean:

• We have several authors per document. For some analyses, for instance co-authorship networks, it is better to have an “author table” which associates each author with a list of papers;
• We have several affiliations per article as well as several authors_with_affiliations. The latter allows us to connect authors with their affiliations, but here again we need to separate it into as many lines as authors (if I am not mistaken, it seems there is only one affiliation per author in this set of data);
• The references, which are all stored in a single column and have to be separated and parsed;
• Possibly author_keywords and index_keywords, which have to be separated if you want to use them.

In what follows, we clean scopus_search in order to produce 3 additional data.frames:

• one data.frame which associates each article with a list of authors and their corresponding affiliation (see below);
• one data.frame which associates each article with the list of references it cites (one article has as many lines as the number of cited references): this is a “direct citation” table (see here);
• a list of all the references cited, which implies finding which references are the same in the direct citation table (see this sub-section).

### Extracting affiliations and authors

We have two columns for affiliations:

• one column, affiliations, with affiliations alone;
• one column, authors_with_affiliations, with both authors and affiliations.
affiliations_raw <- scopus_search %>%
  select(citing_id, authors, affiliations, authors_with_affiliations)

knitr::kable(head(affiliations_raw, n = 2))

| citing_id | authors | affiliations | authors_with_affiliations |
|---|---|---|---|
| A1 | Ha J., So I. | Department of Economics, Cornell University, Ithaca, NY, United States; Department of Economics, University of Washington, Seattle, WA, United States | Ha, J., Department of Economics, Cornell University, Ithaca, NY, United States; So, I., Department of Economics, University of Washington, Seattle, WA, United States |
| A2 | Botero J., Franco H., Hurtado Á., Mesa M. | Departamento de Economía, Universidad EAFIT, Medellín, Colombia; EAFIT University, Colombia | Botero, J., Departamento de Economía, Universidad EAFIT, Medellín, Colombia; Franco, H., Departamento de Economía, Universidad EAFIT, Medellín, Colombia; Hurtado, Á., Departamento de Economía, Universidad EAFIT, Medellín, Colombia; Mesa, M., EAFIT University, Colombia |

For a more secure cleaning, we opt for the second column, as it allows us to associate the author with his/her own affiliation as described in the column authors_with_affiliations. The strategy here is to separate each author from the authors column, then to separate the different authors and affiliations in the authors_with_affiliations column.
Finally, we keep only the lines where the author from authors_with_affiliations is the same as in authors.

scopus_affiliations <- affiliations_raw %>%
  separate_rows(authors, sep = ", ") %>%
  separate_rows(contains("with"), sep = "; ") %>%
  mutate(authors_from_affiliation = str_extract(authors_with_affiliations, "^(.+?)\\.(?=,)"),
         authors_from_affiliation = str_remove(authors_from_affiliation, ","),
         affiliations = str_remove(authors_with_affiliations, "^(.+?)\\., "),
         country = str_remove(affiliations, ".*, ")) %>% # Country is after the last comma
  filter(authors == authors_from_affiliation) %>%
  select(citing_id, authors, affiliations, country)

knitr::kable(head(scopus_affiliations))

| citing_id | authors | affiliations | country |
|---|---|---|---|
| A1 | Ha J. | Department of Economics, Cornell University, Ithaca, NY, United States | United States |
| A1 | So I. | Department of Economics, University of Washington, Seattle, WA, United States | United States |
| A2 | Botero J. | Departamento de Economía, Universidad EAFIT, Medellín, Colombia | Colombia |
| A2 | Franco H. | Departamento de Economía, Universidad EAFIT, Medellín, Colombia | Colombia |
| A2 | Hurtado Á. | Departamento de Economía, Universidad EAFIT, Medellín, Colombia | Colombia |
| A2 | Mesa M. | EAFIT University, Colombia | Colombia |

### Clean references

Let’s first extract the references column while keeping the identifier of the citing article. We put each reference cited by an article on a separate line, using the fact that the references are separated by a semi-colon. We create an identifier for each reference.

#' ## Extracting and cleaning references

references_extract <- scopus_search %>%
  filter(! is.na(references)) %>%
  select(citing_id, references) %>%
  separate_rows(references, sep = "; ") %>%
  mutate(id_ref = 1:n()) %>%
  as_tibble

knitr::kable(head(references_extract))

| citing_id | references | id_ref |
|---|---|---|
| A1 | Bernanke, B., Gertler, M., Gilchrist, S., Financial accelerator in a quantitative business cycle framework, NBER Working Paper No. W6455 (1998), NBER | 1 |
| A1 | Calvo, G., Staggered prices in a utility maximizing framework (1983) Journal of Monetary Economics, 12, pp. 383-398 | 2 |
| A1 | Christensen, I., Dib, A., Monetary policy in an estimated DSGE model with a financial accelerator (2005) Computing in Economics and Finance, p. 314 | 3 |
| A1 | Dixit, A., Stiglitz, J., Monopolistic competition and optimum product diversity (1977) American Economic Review, 67 (3), pp. 297-308 | 4 |
| A1 | Gerali, A., Neri, S., Sessa, L., Signoretti, F., Credit and banking in a DSGE model (2010), 42 (6), pp. 107-141. , Working paper, Banka D’Italia, Rome | 5 |
| A1 | Goodfriend, M., McCallum, B., Banking and interest rates in monetary policy analysis: A quantitative exploration (2007), NBER Working Paper Series No. 13207, NBER | 6 |

We now have one reference per line, and this is where complications begin. We need to find ways to extract the different pieces of information for each reference: authors, year, pages, volume & issue information, journal, title, etc. Here are some regex that match some of this information.

extract_authors <- ".*[:upper:][:alpha:]+( Jr(.)?)?, ([A-Z]\\.[ -]?)?([A-Z]\\.[ -]?)?([A-Z]\\.)?[A-Z]\\."
extract_year_brackets <- "(?<=\\()\\d{4}(?=\\))"
extract_pages <- "(?<= (p)?p\\. )([A-Z])?\\d+(-([A-Z])?\\d+)?"
extract_volume_and_number <- "(?<=( |^)?)\\d+ \\(\\d+(-\\d+)?\\)"

We can now extract authors and year. We create a new column, remaining_ref, which keeps the information from the references column minus the authors. For easier cleaning, we separate references depending on the position of the year of publication in the reference. We use the variable is_article to determine where the year is, and thus whether the title is before the year or not.

cleaning_references <- references_extract %>%
  mutate(authors = str_extract(references, paste0(extract_authors, "(?=, )")),
         remaining_ref = str_remove(references, paste0(extract_authors, ", ")), # cleaning from authors
         is_article = ! str_detect(remaining_ref, "^\\([:digit:]{4}"),
         year = str_extract(references, extract_year_brackets) %>% as.integer)

I cannot detail all the regex below, but the goal is to extract as much relevant metadata as possible, first for references with the year of publication after the title (is_article == TRUE), which are most of the time journal articles. This code was written in January 2022 and I will try to improve it later, notably because this use of the is_article variable makes the code unnecessarily long.

cleaning_articles <- cleaning_references %>%
  filter(is_article == TRUE) %>%
  mutate(title = str_extract(remaining_ref, ".*(?=\\(\\d{4})"), # pre date extraction
         journal_to_clean = str_extract(remaining_ref, "(?<=\\d{4}\\)).*"), # post date extraction
         journal_to_clean = str_remove(journal_to_clean, "^,") %>% str_trim("both"), # cleaning a bit the journal info column
         pages = str_extract(journal_to_clean, extract_pages), # extracting pages
         volume_and_number = str_extract(journal_to_clean, extract_volume_and_number), # extracting standard volume and number: X (X)
         journal_to_clean = str_remove(journal_to_clean, " (p)?p\\. ([A-Z])?\\d+(-([A-Z])?\\d+)?"), # clean from extracted pages
         journal_to_clean = str_remove(journal_to_clean, "( |^)?\\d+ \\(\\d+(-\\d+)?\\)"), # clean from extracted volume and number
         volume_and_number = ifelse(is.na(volume_and_number), str_extract(journal_to_clean, "(?<= )([A-Z])?\\d+(-\\d+)?"), volume_and_number), # extract remaining numbers
         journal_to_clean = str_remove(journal_to_clean, " ([A-Z])?\\d+(-\\d+)?"), # clean from remaining numbers
         journal = str_remove_all(journal_to_clean, "^[:punct:]+( )?[:punct:]+( )?|(?<=,( )?)[:punct:]+( )?([:punct:])?|[:punct:]( )?[:punct:]+( )?$"), # extract journal info by removing inappropriate punctuations
         first_page = str_extract(pages, "\\d+"),
         volume = str_extract(volume_and_number, "\\d+"),
         issue = str_extract(volume_and_number, "(?<=\\()\\d+(?=\\))"),
         publisher = ifelse(is.na(first_page) & is.na(volume) & is.na(issue) & ! str_detect(journal, "(W|w)orking (P|p)?aper"), journal, NA),
         book_title = ifelse(str_detect(journal, " (E|e)d(s)?\\.| (E|e)dite(d|urs)? "), journal, NA), # Incollection article: Title of the book here
         book_title = str_extract(book_title, "[A-z ]+(?=,)"), # keeping only the title of the book
         publisher = ifelse(!is.na(book_title), NA, publisher), # if we have an incollection article, that's not a book, so no publisher
         journal = ifelse(!is.na(book_title) | ! is.na(publisher), NA, journal), # removing journal as what we have is a book
         publisher = ifelse(is.na(publisher) & str_detect(journal, "(W|w)orking (P|p)?aper"), journal, publisher), # adding working paper publisher information in publisher column
         journal = ifelse(str_detect(journal, "(W|w)orking (P|p)?aper"), "Working Paper", journal))

cleaned_articles <- cleaning_articles %>%
select(citing_id, id_ref, authors, year, title, journal, volume, issue, pages, first_page, book_title, publisher, references)
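
To see what these patterns do, it can help to try them on a single reference. The sketch below takes one reference string from the table above and applies simplified variants of the regex (they are not the exact patterns of the tutorial, and they only cover this standard "volume (issue), pp. x-y" layout).

```r
library(stringr)

# One reference taken from the table of extracted references above
ref <- "Dixit, A., Stiglitz, J., Monopolistic competition and optimum product diversity (1977) American Economic Review, 67 (3), pp. 297-308"

# Year: four digits enclosed in brackets
year <- str_extract(ref, "(?<=\\()\\d{4}(?=\\))")

# Pages: what follows " p. " or " pp. "
pages <- str_extract(ref, "(?<= pp?\\. )\\d+(-\\d+)?")

# Volume and issue: a number followed by a number in brackets, e.g. "67 (3)"
volume_and_number <- str_extract(ref, "\\d+ \\(\\d+(-\\d+)?\\)")
volume <- str_extract(volume_and_number, "\\d+")
issue <- str_extract(volume_and_number, "(?<=\\()\\d+(?=\\))")
```

Here `year` is "1977", `pages` is "297-308", `volume` is "67" and `issue` is "3". References that depart from this layout (working papers, books, missing issue) are the ones that require the extra steps of the code above.
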


We do the same now with the remaining references that are less numerous but that are also less easy to clean, due to the fact that the title is not clearly separated from other information (journal or publisher).

cleaning_non_articles <- cleaning_references %>%
  filter(is_article == FALSE) %>%
  mutate(remaining_ref = str_remove(remaining_ref, "\\(\\d{4}\\)(,)? "),
         title = str_extract(remaining_ref, ".*(?=, ,)"),
         pages = str_extract(remaining_ref, "(?<= (p)?p\\. )([A-Z])?\\d+(-([A-Z])?\\d+)?"), # extracting pages
         volume_and_number = str_extract(remaining_ref, "(?<=( |^)?)\\d+ \\(\\d+(-\\d+)?\\)"), # extracting standard volume and number: X (X)
         remaining_ref = str_remove(remaining_ref, " (p)?p\\. ([A-Z])?\\d+(-([A-Z])?\\d+)?"), # clean from extracted pages
         remaining_ref = str_remove_all(remaining_ref, ".*, ,"), # clean dates and already extracted titles
         remaining_ref = str_remove(remaining_ref, "( |^)?\\d+ \\(\\d+(-\\d+)?\\)"), # clean from extracted volume and number
         volume_and_number = ifelse(is.na(volume_and_number), str_extract(remaining_ref, "(?<= )([A-Z])?\\d+(-\\d+)?"), volume_and_number), # extract remaining numbers
         remaining_ref = str_remove(remaining_ref, " ([A-Z])?\\d+(-\\d+)?"), # clean from remaining numbers
         journal = ifelse(str_detect(remaining_ref, "(W|w)orking (P|p)aper"), "Working Paper", NA),
         journal = ifelse(str_detect(remaining_ref, "(M|m)anuscript"), "Manuscript", journal),
         journal = ifelse(str_detect(remaining_ref, "(M|m)imeo"), "Mimeo", journal),
         publisher = ifelse(is.na(journal), remaining_ref, NA) %>% str_trim("both"),
         first_page = str_extract(pages, "\\d+"),
         volume = str_extract(volume_and_number, "\\d+"),
         issue = str_extract(volume_and_number, "(?<=\\()\\d+(?=\\))"),
         book_title = NA) # to be symmetric with "cleaned_articles"

cleaned_non_articles <- cleaning_non_articles %>%
select(citing_id, id_ref, authors, year, title, journal, volume, issue, pages, first_page, book_title, publisher, references)
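
As a small illustration of why the position of the year matters, here is a sketch on two invented reference strings (authors already removed). The strings, titles and publisher are made up for the example; the year-first layout with a ", ," separator between title and publisher is an assumption about what these references look like, taken from the patterns used above.

```r
library(stringr)

# Two invented reference strings, with the authors already removed
remaining_ref <- c(
  "An invented article title (1999) Imaginary Journal, 12 (3), pp. 10-20",
  "(2003) An Invented Book Title, , Imaginary University Press"
)

# TRUE when the year comes first, i.e. is_article would be FALSE
year_first <- str_detect(remaining_ref, "^\\(\\d{4}\\)")

# For year-first references, the title sits between the year and ", ,"
book <- str_remove(remaining_ref[year_first], "^\\(\\d{4}\\)(,)? ")
title <- str_extract(book, ".*(?=, ,)")
publisher <- str_trim(str_remove(book, ".*, ,"))
```

With these inputs, only the second string is detected as year-first, `title` is "An Invented Book Title" and `publisher` is "Imaginary University Press".
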


We merge the two data.frames and:

• we normalize and clean authors’ names to facilitate the matching of references later;
• we extract useful information like DOI and PII.
# merging the two files.
cleaned_ref <- rbind(cleaned_articles, cleaned_non_articles)

#' Now we have all the references, we can do a bit of cleaning on the authors name,
#' and extract useful information, like DOI, for matching later.

cleaned_ref <- cleaned_ref %>%
  mutate(authors = str_remove(authors, " Jr\\."), # standardising authors name to favour matching later
         authors = str_remove(authors, "^\\(\\d{4}\\)(\\.)?( )?"),
         authors = str_remove(authors, "^, "),
         authors = ifelse(is.na(authors), str_extract(references, ".*[:upper:]\\.(?= \\d{4})"), authors), # specific case
         journal = str_remove(journal, "[:punct:]$"), # remove unnecessary punctuations at the end
         doi = str_extract(references, "(?<=DOI(:)? ).*|(?<=\\/doi\\.org\\/).*"),
         pii = str_extract(doi, "(?<=PII ).*"),
         doi = str_remove(doi, ",.*"), # cleaning doi
         pii = str_remove(pii, ",.*")) # cleaning pii

knitr::kable(head(cleaned_ref, n = 3))

| citing_id | id_ref | authors | year | title | journal | volume | issue | pages | first_page | book_title | publisher | references | doi | pii |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 1 | Bernanke, B., Gertler, M., Gilchrist, S. | 1998 | Financial accelerator in a quantitative business cycle framework, NBER Working Paper No. W6455 | NA | NA | NA | NA | NA | NA | NBER | Bernanke, B., Gertler, M., Gilchrist, S., Financial accelerator in a quantitative business cycle framework, NBER Working Paper No. W6455 (1998), NBER | NA | NA |
| A1 | 2 | Calvo, G. | 1983 | Staggered prices in a utility maximizing framework | Journal of Monetary Economics | 12 | NA | 383-398 | 383 | NA | NA | Calvo, G., Staggered prices in a utility maximizing framework (1983) Journal of Monetary Economics, 12, pp. 383-398 | NA | NA |
| A1 | 3 | Christensen, I., Dib, A. | 2005 | Monetary policy in an estimated DSGE model with a financial accelerator | Computing in Economics and Finance | NA | NA | 314 | 314 | NA | NA | Christensen, I., Dib, A., Monetary policy in an estimated DSGE model with a financial accelerator (2005) Computing in Economics and Finance, p. 314 | NA | NA |

Practically speaking, if you are targeting serious quantitative work, more cleaning would be needed to remove small errors. As often in this kind of automated cleaning, 95% of the cleaning can be done with a few lines of code, and the rest involves much more work. I do not go further here as this is just a tutorial.

### Matching references

What we need to do now is to find which references are the same, to give them a unique ID. The trade-off is to match as many true positives as possible (references that are the same) while avoiding false positives, that is references that have some information in common but that are actually not the same references.
For instance, matching only by the authors’ names and the year of publication is too broad, as these authors can have published several articles during the same year. Here are several ways to identify a common reference that bear very little risk of matching together different references:

• same first author surname or authors, year, volume and page (these are the most secure ones): let’s call them fayvp & ayvp;
• same journal, volume, issue and first page: jvip;
• same first author surname, year and title: fayt;
• same title, year and first page: typ;
• same DOI or PII.6

We extract the first author’s surname to favour matching, as with several authors there are more possibilities of small differences that would prevent us from matching similar references.7

cleaned_ref <- cleaned_ref %>%
  mutate(first_author = str_extract(authors, "^[[:alpha:]+[']?[ -]?]+, ([A-Z]\\.[ -]?)?([A-Z]\\.[ -]?)?([A-Z]\\.)?[A-Z]\\.(?=(,|$))"),
         first_author_surname = str_extract(first_author, ".*(?=,)"),
         across(.cols = c("authors", "first_author", "journal", "title"), ~toupper(.)))
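
A quick sketch shows why matching on the first author’s surname is more forgiving than matching on the full author string. It uses three author strings that appear in the tables of this tutorial and a deliberately simplified pattern (everything before the first comma), which is an assumption of this example and not the regex used above.

```r
library(stringr)

# Author strings as they appear in different citing articles
authors <- c("Calvo, G.", "Calvo, G.A.", "Bernanke, B., Gertler, M., Gilchrist, S.")

# Simplified pattern: the surname is whatever comes before the first comma
surname <- str_extract(toupper(authors), "^[^,]+")

# The two Calvo strings differ as full author strings...
same_authors <- toupper(authors[1]) == toupper(authors[2])
# ...but they share the same first author surname
same_surname <- surname[1] == surname[2]
```

Here `same_authors` is FALSE while `same_surname` is TRUE, which is what allows the two ways of citing Calvo (1983) to be grouped together later, provided the other matching keys agree.
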


For each type of matching, we give a new id to the matched references, namely the id_ref of the first reference matched. At the end, we compare all the new ids created by the different matching methods, and we take the smallest one.

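matching_ref <- function(data, id_ref, ..., col_name){
  # references sharing the same values for the columns in `...` get the smallest id_ref of their group
  match <- data %>%
    group_by(...) %>%
    mutate(new_id = min({{id_ref}})) %>%
    drop_na(...) %>% # references with a missing matching key are not matched
    ungroup() %>%
    select({{id_ref}}, new_id) %>%
    rename_with(~ paste0(col_name, "_new_id"), .cols = new_id)

  data <- data %>%
    left_join(match)
}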

identifying_ref <- cleaned_ref %>%
matching_ref(id_ref, first_author_surname, year, title, col_name = "fayt") %>%
matching_ref(id_ref, journal, volume, issue, first_page, col_name = "jvip") %>%
matching_ref(id_ref, authors, year, volume, first_page, col_name = "ayvp") %>%
matching_ref(id_ref, first_author_surname, year, volume, first_page, col_name = "fayvp") %>%
matching_ref(id_ref, title, year, first_page, col_name = "typ") %>%
matching_ref(id_ref, pii, col_name = "pii") %>%
matching_ref(id_ref, doi, col_name = "doi")
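
To make the logic of one matching step visible, here is a minimal sketch of the fayvp matching on an invented four-row table (values loosely inspired by the references above). It reproduces the group-then-take-the-minimum idea by hand rather than calling matching_ref.

```r
library(dplyr)
library(tidyr)

# Invented toy table: rows 1 and 2 are the same reference, row 3 lacks a volume
toy_refs <- tibble(
  id_ref = 1:4,
  first_author_surname = c("Calvo", "Calvo", "Calvo", "Dixit"),
  year = c(1983L, 1983L, 1983L, 1977L),
  volume = c("12", "12", NA, "67"),
  first_page = c("383", "383", "383", "297")
)

fayvp <- toy_refs %>%
  group_by(first_author_surname, year, volume, first_page) %>%
  mutate(fayvp_new_id = min(id_ref)) %>% # smallest id_ref of the group
  ungroup() %>%
  drop_na(first_author_surname, year, volume, first_page) # no match on missing keys
```

Rows 1 and 2 both receive the new id 1, row 4 keeps its own id, and row 3 is dropped from this matching because its volume is missing: it can still be matched through another method, such as title, year and first page.
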


Now we have our direct citation table connecting the citing articles to the references. We have as many lines as the number of citations by citing articles.

direct_citation <- identifying_ref %>%
  mutate(new_id_ref = select(., ends_with("new_id")) %>% reduce(pmin, na.rm = TRUE),
         new_id_ref = ifelse(is.na(new_id_ref), id_ref, new_id_ref)) %>%
  relocate(new_id_ref, .after = citing_id) %>%
  select(-id_ref & ! ends_with("new_id"))

knitr::kable(head(filter(direct_citation, new_id_ref == 2), n = 4))

| citing_id | new_id_ref | authors | year | title | journal | volume | issue | pages | first_page | book_title | publisher | references | doi | pii | first_author | first_author_surname |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 2 | CALVO, G. | 1983 | STAGGERED PRICES IN A UTILITY MAXIMIZING FRAMEWORK | JOURNAL OF MONETARY ECONOMICS | 12 | NA | 383-398 | 383 | NA | NA | Calvo, G., Staggered prices in a utility maximizing framework (1983) Journal of Monetary Economics, 12, pp. 383-398 | NA | NA | CALVO, G. | Calvo |
| A3 | 2 | CALVO, G. | 1983 | STAGGERED PRICES IN A UTILITY-MAXIMIZING FRAMEWORK | JOURNAL OF MONETARY ECONOMICS | 12 | 3 | 383-398 | 383 | NA | NA | Calvo, G., Staggered Prices in a Utility-Maximizing Framework (1983) Journal of Monetary Economics, 12 (3), pp. 383-398 | NA | NA | CALVO, G. | Calvo |
| A18 | 2 | CALVO, G. | 1983 | STAGGERED PRICES IN A UTILITY MAXIMIZING FRAMEWORK | JOURNAL OF MONETARY ECONOMICS | 12 | NA | 383-398 | 383 | NA | NA | Calvo, G., Staggered prices in a utility maximizing framework (1983) Journal of Monetary Economics, 12, pp. 383-398 | NA | NA | CALVO, G. | Calvo |
| A20 | 2 | CALVO, G.A. | 1983 | STAGGERED PRICES IN A UTILITY-MAXIMIZING FRAMEWORK | JOURNAL OF MONETARY ECONOMICS | 12 | 3 | 383-398 | 383 | NA | NA | Calvo, G.A., Staggered prices in a utility-maximizing framework (1983) Journal of Monetary Economics, 12 (3), pp. 383-398 | NA | NA | CALVO, G.A. | Calvo |
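
The step that combines the ids produced by the different matching methods can be sketched on an invented table with two candidate columns. The column names mimic those created above, the values are made up, and `coalesce()` is used here instead of the `ifelse()` of the tutorial code, which should be equivalent for this purpose.

```r
library(dplyr)
library(purrr)

# Invented candidate ids: NA means "not matched by this method"
candidates <- tibble(
  id_ref       = c(1L, 2L, 3L),
  fayvp_new_id = c(1L, 1L, NA),
  doi_new_id   = c(NA, 2L, NA)
)

# Row-wise minimum across all the "*_new_id" columns, ignoring missing values
row_min <- candidates %>%
  select(ends_with("new_id")) %>%
  reduce(pmin, na.rm = TRUE)

# A reference matched by no method keeps its original id_ref
new_id_ref <- coalesce(row_min, candidates$id_ref)
```

The second reference is matched to the first one by one method and only to itself by the other; taking the minimum keeps the smaller id (1). The third reference is never matched and falls back on its own id (3).
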

We can now extract the list of all the references cited. We have as many lines as distinct references cited by the citing articles (i.e. a reference cited multiple times is present only once in the table). As matched references come with different information (because the same reference is cited differently depending on the citing article), we keep, for each reference, the line where the information seems to be the most complete.

important_info <- c("authors",
"year",
"title",
"journal",
"volume",
"issue",
"pages",
"book_title",
"publisher")
references <- direct_citation %>%
mutate(nb_na = rowSums(!is.na(select(., all_of(important_info))))) %>%
group_by(new_id_ref) %>%
slice_max(order_by = nb_na, n = 1, with_ties = FALSE) %>%
select(-citing_id) %>%
unique

knitr::kable(head(references, n = 4))

| new_id_ref | authors | year | title | journal | volume | issue | pages | first_page | book_title | publisher | references | doi | pii | first_author | first_author_surname | nb_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BERNANKE, B., GERTLER, M., GILCHRIST, S. | 1998 | FINANCIAL ACCELERATOR IN A QUANTITATIVE BUSINESS CYCLE FRAMEWORK, NBER WORKING PAPER NO. W6455 | NA | NA | NA | NA | NA | NA | NBER | Bernanke, B., Gertler, M., Gilchrist, S., Financial accelerator in a quantitative business cycle framework, NBER Working Paper No. W6455 (1998), NBER | NA | NA | BERNANKE, B. | Bernanke | 4 |
| 2 | CALVO, G. | 1983 | STAGGERED PRICES IN A UTILITY-MAXIMIZING FRAMEWORK | JOURNAL OF MONETARY ECONOMICS | 12 | 3 | 383-398 | 383 | NA | NA | Calvo, G., Staggered Prices in a Utility-Maximizing Framework (1983) Journal of Monetary Economics, 12 (3), pp. 383-398 | NA | NA | CALVO, G. | Calvo | 7 |
| 3 | CHRISTENSEN, I., DIB, A. | 2005 | MONETARY POLICY IN AN ESTIMATED DSGE MODEL WITH A FINANCIAL ACCELERATOR | COMPUTING IN ECONOMICS AND FINANCE | NA | NA | 314 | 314 | NA | NA | Christensen, I., Dib, A., Monetary policy in an estimated DSGE model with a financial accelerator (2005) Computing in Economics and Finance, p. 314 | NA | NA | CHRISTENSEN, I. | Christensen | 5 |
| 4 | DIXIT, A., STIGLITZ, J. | 1977 | MONOPOLISTIC COMPETITION AND OPTIMUM PRODUCT DIVERSITY | AMERICAN ECONOMIC REVIEW | 67 | 3 | 297-308 | 297 | NA | NA | Dixit, A., Stiglitz, J., Monopolistic competition and optimum product diversity (1977) American Economic Review, 67 (3), pp. 297-308 | NA | NA | DIXIT, A. | Dixit | 7 |
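
The selection of the most complete line can be illustrated on an invented three-row table. The counting column is called `nb_filled` in this sketch (a name chosen for the example) because it counts the fields that are not missing, which is also what the nb_na column shown in the table above contains.

```r
library(dplyr)

# Invented toy table: reference 2 is cited twice, once without issue number
toy <- tibble(
  new_id_ref = c(2L, 2L, 4L),
  title = c("STAGGERED PRICES", "STAGGERED PRICES", "MONOPOLISTIC COMPETITION"),
  volume = c("12", "12", "67"),
  issue = c(NA, "3", "3")
)
info_cols <- c("title", "volume", "issue")

# Number of non-missing fields per line
toy$nb_filled <- rowSums(!is.na(toy[info_cols]))

# Keep, for each reference, the line with the most filled fields
best <- toy %>%
  group_by(new_id_ref) %>%
  slice_max(order_by = nb_filled, n = 1, with_ties = FALSE) %>%
  ungroup()
```

For reference 2, the line that includes the issue number is kept. `with_ties = FALSE` guarantees a single line per reference even when several lines are equally complete.
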

From the initial 105313 citations, we get 52577 different references, with 10631 cited at least twice.
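
Counts of this kind can be obtained with a few dplyr verbs. The sketch below uses an invented five-row direct citation table, so the numbers it produces have nothing to do with the figures above.

```r
library(dplyr)

# Invented direct citation table: one line per (citing article, reference) pair
toy_citations <- tibble(
  citing_id = c("A1", "A3", "A18", "A1", "A3"),
  new_id_ref = c(2L, 2L, 2L, 4L, 7L)
)

n_citations <- nrow(toy_citations)                   # total number of citations
n_references <- n_distinct(toy_citations$new_id_ref) # number of distinct references
n_cited_twice <- toy_citations %>%
  count(new_id_ref) %>%
  filter(n >= 2) %>%
  nrow()                                             # references cited at least twice
```
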

We also remove the references column of the initial data.frame as references cited are now gathered in direct_citation.

corpus <- scopus_search %>%
select(-references)


I imagine that after so many steps, you can’t wait to look at these data on DSGE models. But you will have to wait, as the following subsection presents another method to clean the references (without doing the cleaning yourself). Or you can go directly to the last section, which explores the cleaned data.

### Alternative method: anystyle

anystyle is a great reference parser relying on machine learning heuristics. It has an online version where you can paste the text in which you want to identify references. However, we will use the command line in order to parse more references than the online version allows (10 000 while we have 105313 raw references).

Anystyle can be installed as a RubyGem. You thus need to install Ruby (here for Windows) and then install anystyle using the command line: gem install anystyle (see more information here). As you need to use the command line interface, you also need to install anystyle-cli: gem install anystyle-cli.
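
Before moving to the terminal, it can be handy to check from R that the installation worked. The following is a small sketch using base R only; it assumes that the gem installs an executable called anystyle (the name used in the commands of this tutorial) and that this executable is on your PATH.

```r
# Sys.which() returns an empty string when the executable is not found on the PATH
anystyle_path <- Sys.which("anystyle")

if (nzchar(anystyle_path)) {
  message("anystyle found at: ", anystyle_path)
} else {
  message("anystyle not found: check that Ruby is installed and that ",
          "'gem install anystyle-cli' ran without error.")
}
```

If the executable is not found right after installation, restarting the R session (so that the updated PATH is picked up) is probably the first thing to try.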

Some frightened reader: “What, what? wait! Command line you said?”

Oh, yes. And that is a good occasion to refer you to these great tutorials on the great Programming Historian website: one for the bash command line and one for the Windows PowerShell command line. It will also be the occasion to make some use of the RStudio Terminal to enter the commands.

Once you have installed anystyle, you need to save all the references (with one reference per line) in a .txt file.

ref_text <- paste0(references_extract$references, collapse = "\n") # one reference per line
name_file <- "ref_in_text.txt"
write_file(ref_text, here(scopus_path, name_file))

To create the anystyle command, you need to name the directory where anystyle will write the .bib created from your .txt:

destination_anystyle <- "anystyle_cleaned"
directory_command <- paste0("cd ", scopus_path)
anystyle_command <- paste0("anystyle -f bib parse ", name_file, " ", destination_anystyle)

To use anystyle, you have to use the command line of the terminal. You first have to move to the directory where the .txt is (here, scopus_path): cd the_path_where_is_the_.txt. Then you copy and paste the anystyle command in the terminal, which here is: anystyle -f bib parse ref_in_text.txt anystyle_cleaned. Hopefully it will work and you will just have to wait for the creation of the .bib (it took something like 10 minutes on my laptop).8

…waiting…

Once we have our .bib, we transform it into a data frame thanks to the bib2df package.

options(encoding = "UTF-8")
bib_ref <- bib2df(here(scopus_path, destination_anystyle, "ref_in_text.bib"))
bib_ref <- bib_ref %>%
janitor::clean_names() %>%
select_if(~!all(is.na(.))) %>% # removing all empty columns
mutate(id_ref = 1:n()) %>%
select(-c(translator, citation_number, arxiv, director, source))

knitr::kable(head(bib_ref))

| category | bibtexkey | address | author | booktitle | edition | editor | institution | journal | note | number | pages | publisher | school | series | title | type | volume | date | issue | url | isbn | doi | id_ref |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARTICLE | bernanke1998a | NA | Bernanke, B. , Gertler, M. , Gilchrist, S. | NA | NA | NA | NA | NBER Working Paper | NA | W6455 | NA | NA | NA | NA | Financial accelerator in a quantitative business cycle framework | NA |  | 1998 | NA | NA | NA | NA | 1 |
| ARTICLE | calvo-a | NA | Calvo, G. | NA | NA | NA | NA | Journal of Monetary Economics | NA | NA | 383–398 | NA | NA | NA | Staggered prices in a utility maximizing framework (1983 | NA | 12 | NA | NA | NA | NA | NA | 2 |
| ARTICLE | christensen-a | NA | Christensen, I., Dib, A. | NA | NA | NA | NA | Computing in Economics and Finance | NA | NA | 314 | NA | NA | NA | Monetary policy in an estimated DSGE model with a financial accelerator (2005 | NA | NA | NA | NA | NA | NA | NA | 3 |
| ARTICLE | dixit-a | NA | Dixit, A. , Stiglitz, J. | NA | NA | NA | NA | American Economic Review | NA | 3 | 297–308 | NA | NA | NA | Monopolistic competition and optimum product diversity (1977 | NA | 67 | NA | NA | NA | NA | NA | 4 |
| BOOK | gerali2010a | Banka D’Italia, Rome | Gerali, A. , Neri, S. , Sessa, L. , Signoretti, F. | NA | NA | NA | NA | NA | NA | NA | 107–141 | , Working paper | NA | NA | Credit and banking in a DSGE model | NA | 42 | 2010 | 6 | NA | NA | NA | 5 |
| INCOLLECTION | goodfriend-a | NA | Goodfriend, M., McCallum, B. | NBER Working Paper Series No. 13207, NBER\ | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Banking and interest rates in monetary policy analysis: A quantitative exploration (2007 | NA | NA | NA | NA | NA | NA | NA | 6 |

For now, there is one major limitation to this method (which is most likely linked to my lack of mastery of anystyle and Ruby): the result is a list of unique references. In other words, anystyle merges together references that are similar. It means that I have to find a way to build a link between the original references data.frame and the data.frame built from the .bib.9

Ideally, you would clean the result a bit. Anystyle is pretty good (and clearly better than I am) at identifying the types of references, and thus at extracting book titles (for chapters in books) and editors. It is quite efficient at extracting authors, and also titles, even if I saw many mistakes (but relatively easy to clean, as most of the time it is the year that has been put with the title). However, I also saw many mistakes for journals (incomplete names) that are not so easy to correct.

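
The most frequent title error mentioned above, a year left at the end of the title as in the Calvo row of the output, can be handled with a regular expression. Here is a minimal base R sketch on a toy data frame: toy_bib is invented for the example (its two titles are copied from the output above), and the pattern assumes that the stray year always appears as an opening parenthesis followed by four digits at the very end of the title.

```r
# Toy data frame mimicking two rows of bib_ref (hypothetical object for illustration)
toy_bib <- data.frame(
  title = c("Staggered prices in a utility maximizing framework (1983",
            "Credit and banking in a DSGE model"),
  date  = c(NA, "2010"),
  stringsAsFactors = FALSE
)

# Optional spaces, an opening parenthesis and four digits at the end of the string
year_pattern <- "\\s*\\((\\d{4})$"

has_year   <- grepl(year_pattern, toy_bib$title, perl = TRUE)
stray_year <- sub(paste0(".*", year_pattern), "\\1", toy_bib$title, perl = TRUE)

# Fill the date only where it is missing, then remove the year from the title
fill <- has_year & is.na(toy_bib$date)
toy_bib$date[fill] <- stray_year[fill]
toy_bib$title[has_year] <- sub(year_pattern, "", toy_bib$title[has_year], perl = TRUE)

toy_bib
```

The existing date is deliberately left untouched when it is already filled, so the title is never used to overwrite a value that anystyle identified by itself.
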
If you want to clean your references data as much as possible, perhaps the best thing to do is to mix the code-based cleaning approach used above with the anystyle method, and to fill in the information missing from one method with the other.

## Exploring the DSGE literature

Now that we have our bibliographic data, the first thing we can look at is the most cited references in our corpus.

direct_citation %>%
add_count(new_id_ref) %>%
select(new_id_ref, n) %>%
unique() %>%
slice_max(n, n = 10) %>%
left_join(select(references, new_id_ref, references)) %>%
select(references, n) %>%
knitr::kable()

| references | n |
|---|---|
| Smets, F., Wouters, R., Shocks and frictions in US business cycles. a Bayesian DSGE approach (2007) American Economic Review, 97 (3), pp. 586-606 | 780 |
| Calvo, G., Staggered Prices in a Utility-Maximizing Framework (1983) Journal of Monetary Economics, 12 (3), pp. 383-398 | 702 |
| Christiano, L.J., Eichenbaum, M., Evans, C.L., Nominal rigidities and the dynamic effects of a shock to monetary policy (2005) Journal of Political Economy, 113 (1), pp. 1-45. , http://ideas.repec.org/a/ucp/jpolec/v113y2005i1p1-45.html | 640 |
| Smets, F., Wouters, R., An Estimated Dynamic Stochastic General Equilibrium Model of the Euro Area (2003) Journal of the European Economic Association, 1 (5), pp. 1123-1175 | 610 |
| Bernanke, B., Gertler, M., Gilchrist, S., The financial accelerator in a quantitative business cycle framework (1999) NBER Working Papers Series, 1 (3), pp. 1341-1393. , Elsevier Science B.V, (chap. 21), Handbook of Macroeconomics | 331 |
| An, S., Schorfheide, F., Bayesian Analysis of DSGE Models (2007) Econometric Reviews, 26 (4), pp. 113-172 | 311 |
| Taylor, J., Discretion versus policy rules in practice (1993) Carnegie-Rochester Conference Series on Public Policy, 39 (0), pp. 195-214 | 269 |
| Iacoviello, M., House prices, borrowing constraints, and monetary policy in the business cycle (2005) American Economic Review, 95 (3), pp. 739-764 | 243 |
| Kydland, F.E., Prescott, E.C., Time to build and aggregate fluctuations (1982) Econometrica, 50 (6), pp. 1345-1370 | 234 |
| Schmitt-Grohe, S., Uribe, M., Closing small open economy models (2003) Journal of International Economics, 61 (1), pp. 163-185 | 228 |

As we have the affiliations, we can try to see which are the top references for economists based in different countries:

direct_citation %>%
left_join(select(scopus_affiliations, citing_id, country)) %>%
unique() %>%
group_by(country) %>%
mutate(nb_article = n()) %>%
filter(nb_article > 5000) %>% # we keep only countries with more than 5000 rows, i.e. citations
add_count(new_id_ref) %>%
select(new_id_ref, n) %>%
unique() %>%
slice_max(n, n = 8) %>%
left_join(select(references, new_id_ref, references)) %>%
select(references, n) %>%
mutate(label = str_extract(references, ".*\\(\\d{4}\\)") %>% str_wrap(30), # authors and title up to "(year)"
label = tidytext::reorder_within(label, n, country)) %>%
ggplot(aes(n, label, fill = country)) +
geom_col(show.legend = FALSE) +
facet_wrap(~country, ncol = 3, scales = "free") +
tidytext::scale_y_reordered() +
labs(x = "Number of citations", y = NULL) +
theme_classic(base_size = 10)

By using affiliations we can observe a regional preference pattern: in European countries, economists tend to cite Smets and Wouters (2003) (the source of the European Central Bank DSGE model) more than Christiano, Eichenbaum, and Evans (2005), while the opposite is true in the United States (Smets and Wouters 2007 is on US data). We can also notice that Kydland and Prescott (1982) is less popular in continental Europe.

### Bibliographic co-citation analysis

To conclude this (long) tutorial, we can build a co-citation network: the references we have matched are the nodes of the network, and they are linked together depending on the number of times they are cited together (or in other words, the number of times they appear together in a bibliography). We use the biblio_cocitation function of the biblionetwork package.
The weight of the edge between two nodes also depends on the total number of times each reference has been cited in the whole corpus (see here for more details).

citations <- direct_citation %>%
add_count(new_id_ref) %>%
select(new_id_ref, n) %>%
unique

references_filtered <- references %>%
left_join(citations) %>%
filter(n >= 5)

edges <- biblionetwork::biblio_cocitation(filter(direct_citation, new_id_ref %in% references_filtered$new_id_ref),
"citing_id",
"new_id_ref",
weight_threshold = 3)
edges

##         from    to    weight Source Target
##     1:     2     4 0.2354379      2      4
##     2:     2     5 0.1244966      2      5
##     3:     2     6 0.0518193      2      6
##     4:     2     7 0.1623352      2      7
##     5:     2    14 0.1349191      2     14
##    ---
## 42657: 64761 64841 0.4045199  64761  64841
## 42658: 64841 64936 0.3481553  64841  64936
## 42659: 66416 68931 0.7453560  66416  68931
## 42660: 66416 68935 0.5039526  66416  68935
## 42661: 68931 68935 0.6761234  68931  68935
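
To make the weighting more concrete, here is a small base R sketch that computes co-citation edges by hand on invented data. It assumes a cosine-type normalisation (the number of co-citations divided by the square root of the product of the two references' citation counts), which is my reading of the biblionetwork documentation. The names toy_citation and cocit_threshold are hypothetical, and the thresholding rule is a simplification rather than the package's implementation.

```r
# Invented toy data: one row per (citing article, cited reference) pair
toy_citation <- data.frame(
  citing_id  = c("a1", "a1", "a1", "a2", "a2", "a3", "a3", "a3", "a4", "a4"),
  new_id_ref = c("r1", "r2", "r3", "r1", "r2", "r1", "r2", "r3", "r1", "r3"),
  stringsAsFactors = FALSE
)
cocit_threshold <- 3  # hypothetical minimum number of co-citations to keep an edge

# Article-by-reference incidence matrix (1 if the article cites the reference)
incidence <- unclass(table(toy_citation$citing_id, toy_citation$new_id_ref))
incidence <- (incidence > 0) * 1

nb_cit <- colSums(incidence)    # total citations of each reference
cocit  <- crossprod(incidence)  # cocit[i, j] = number of articles citing both i and j

# Keep each pair once (upper triangle) when it reaches the threshold
pairs <- which(upper.tri(cocit) & cocit >= cocit_threshold, arr.ind = TRUE)
edges <- data.frame(
  from        = colnames(cocit)[pairs[, "row"]],
  to          = colnames(cocit)[pairs[, "col"]],
  cocitations = cocit[pairs],
  stringsAsFactors = FALSE
)

# Cosine-type normalisation by the citation counts of both references
edges$weight <- edges$cocitations /
  sqrt(unname(nb_cit[edges$from]) * unname(nb_cit[edges$to]))
edges
```

In this toy example, r1 is cited four times and r2 and r3 three times each, so a pair that is co-cited three times gets a weight below 1: the more a reference is cited overall, the less each co-citation counts.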


We can then take our corpus and these edges to create a network/graph thanks to tidygraph (Pedersen 2020) and networkflow. I don’t go into the details here, as that is not the purpose of this tutorial and the different steps are explained on the networkflow website.

graph <- tbl_main_component(nodes = references_filtered,
edges = edges,
directed = FALSE)
graph

## # A tbl_graph: 2836 nodes and 42661 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 2,836 x 18 (active)
##      Id authors  year title journal volume issue pages first_page book_title
##   <int> <chr>   <int> <chr> <chr>   <chr>  <chr> <chr> <chr>      <chr>
## 1     2 CALVO,~  1983 "STA~ JOURNA~ 12     3     383-~ 383        <NA>
## 2     4 DIXIT,~  1977 "MON~ AMERIC~ 67     3     297-~ 297        <NA>
## 3     5 GERALI~  2010 "CRE~ WORKIN~ 42     6     107-~ 107        <NA>
## 4     6 GOODFR~  2007 "BAN~ JOURNA~ 54     5     1480~ 1480       <NA>
## 5     7 IACOVI~  2005 "HOU~ AMERIC~ 95     3     739-~ 739        <NA>
## 6    14 ROTEMB~  1982 "MON~ REVIEW~ 49     4     517-~ 517        <NA>
## # ... with 2,830 more rows, and 8 more variables: publisher <chr>,
## #   references <chr>, doi <chr>, pii <chr>, first_author <chr>,
## #   first_author_surname <chr>, nb_na <dbl>, n <int>
## #
## # Edge Data: 42,661 x 5
##    from    to weight Source Target
##   <int> <int>  <dbl>  <int>  <int>
## 1     1     2 0.235       2      4
## 2     1     3 0.124       2      5
## 3     1     4 0.0518      2      6
## # ... with 42,658 more rows

set.seed(1234)
graph <- leiden_workflow(graph) # identifying clusters of nodes

nb_communities <- graph %>%
activate(nodes) %>%
as_tibble %>%
select(Com_ID) %>%
unique %>%
nrow
palette <- scico::scico(n = nb_communities, palette = "hawaii") %>% # creating a color palette
sample()

graph <- community_colors(graph, palette, community_column = "Com_ID")

graph <- graph %>%
activate(nodes) %>%
mutate(size = n,# will be used for size of nodes
label = paste0(first_author_surname, "-", year))

graph <- community_names(graph,
ordering_column = "size",
naming = "label",
community_column = "Com_ID")

graph <- vite::complete_forceatlas2(graph,
first.iter = 10000)

top_nodes  <- top_nodes(graph,
ordering_column = "size",
top_n = 15,
top_n_per_com = 2,
biggest_community = TRUE,
community_threshold = 0.02)
community_labels <- community_labels(graph,
community_name_column = "Community_name",
community_size_column = "Size_com",
biggest_community = TRUE,
community_threshold = 0.02)


A co-citation network allows us to observe the main influences on a field of research. At the center of the network, we find the most cited references. On the borders of the graph, there are specific communities that influence different parts of the literature on DSGE. Here, the size of a node depends on the number of times it is cited in our corpus.

graph <- graph %>%
activate(edges) %>%
filter(weight > 0.05)

ggraph(graph, "manual", x = x, y = y) +
geom_edge_arc0(aes(color = color_edges, width = weight), alpha = 0.4, strength = 0.2, show.legend = FALSE) +
scale_edge_width_continuous(range = c(0.1,2)) +
scale_edge_colour_identity() +
geom_node_point(aes(x=x, y=y, size = size, fill = color), pch = 21, alpha = 0.7, show.legend = FALSE) +
scale_size_continuous(range = c(0.3,16)) +
scale_fill_identity() +
ggnewscale::new_scale("size") +
ggrepel::geom_text_repel(data = top_nodes, aes(x=x, y=y, label = Label), size = 2, fontface="bold", alpha = 1, point.padding=NA, show.legend = FALSE) +
ggrepel::geom_label_repel(data = community_labels, aes(x=x, y=y, label = Community_name, fill = color), size = 6, fontface="bold", alpha = 0.9, point.padding=NA, show.legend = FALSE) +
scale_size_continuous(range = c(0.5,5)) +
theme_void()


Let’s conclude by observing the most cited nodes in each community. We see that community 04 deals with international issues, while community 07 is linked to fiscal policy issues.

ragg::agg_png(here("content", "en", "post", "2022-01-31-extracting-biblio-data-1", "top-ref-country-1.png"),
width = 35,
height = 30,
units = "cm",
res = 200)
top_nodes(graph,
ordering_column = "size",
top_n_per_com = 6,
biggest_community = TRUE,
community_threshold = 0.04) %>%
select(Community_name, Label, title, n, color) %>%
mutate(label = paste0(Label, "-", title) %>%
str_wrap(34),
label = tidytext::reorder_within(label, n, Community_name)) %>%
ggplot(aes(n, label, fill = color)) +
geom_col(show.legend = FALSE) +
scale_fill_identity() +
facet_wrap(~Community_name, ncol = 3, scales = "free") +
tidytext::scale_y_reordered() +
labs(x = "Number of citations", y = NULL) +
theme_classic(base_size = 11)
invisible(dev.off())


## References

Christiano, Lawrence J., Martin Eichenbaum, and Charles L. Evans. 2005. “Nominal Rigidities and the Dynamic Effects of a Shock to Monetary Policy.” Journal of Political Economy 113 (1): 1–45.

De Vroey, Michel. 2016. A History of Macroeconomics from Keynes to Lucas and Beyond. Cambridge: Cambridge University Press.

Goutsmedt, Aurélien, François Claveau, and Alexandre Truc. 2021. Biblionetwork: Create Different Types of Bibliometric Networks.

Kydland, Finn E., and Edward C. Prescott. 1982. “Time to Build and Aggregate Fluctuations.” Econometrica: Journal of the Econometric Society 50 (6): 1345–70.

Pedersen, Thomas Lin. 2020. Tidygraph: A Tidy API for Graph Manipulation. https://CRAN.R-project.org/package=tidygraph.

Sergi, Francesco. 2020. “The Standard Narrative about DSGE Models in Central Banks’ Technical Reports.” The European Journal of the History of Economic Thought 27 (2): 163–93.

Smets, Frank, and Raf Wouters. 2003. “An Estimated Dynamic Stochastic General Equilibrium Model of the Euro Area.” Journal of the European Economic Association 1 (5): 1123–75.

Smets, Frank, and Rafael Wouters. 2007. “Shocks and Frictions in US Business Cycles: A Bayesian DSGE Approach.” American Economic Review 97 (3): 586–606.

Vines, David, and Samuel Wills. 2018. “The Rebuilding Macroeconomic Theory Project: An Analytical Assessment.” Oxford Review of Economic Policy 34 (1-2): 1–42. https://doi.org/10.1093/oxrep/grx062.

Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

1. Don’t worry if you don’t understand regex at the beginning: this is a good occasion to learn and to practice. You can find simpler examples here and learn stringr at the same time. ↩︎

2. For reflexive and historical discussions of DSGE models, see De Vroey (2016, chap. 20), Vines and Wills (2018) and Sergi (2020). ↩︎

3. Normally, you won’t be able to download all the data in one extraction as there is a 2000 items limit, and thus you will need to do it in two steps. The easiest is to filter by year. ↩︎

4. We remove all citations data.frames that are empty (i.e. when an article cites nothing). ↩︎

5. I will try to check that for another tutorial on bibliometric data, but I observed that I got fewer citations with the API method than with the Scopus website method. It perhaps means that if Scopus has not been able to give an identifier to a reference (perhaps because the metadata are not sufficiently clean), the citation of this reference is removed from the data. Consequently, if our cleaning method is good, we may be able to keep more citations, and thus to keep references that would be excluded in the API data extraction. ↩︎

6. I have perhaps forgotten some useful combinations. ↩︎

7. Most of the differences are due to the authors’ initials: some references have only one initial while others have two for some authors. ↩︎

8. In case you want to know more on the different commands of anystyle, see the API documentation. ↩︎

9. As I find anystyle a very interesting tool, I will try to work on that issue in the following months and find a solution. An easy way to do it would perhaps be to save one .txt per citing article and to run anystyle for all the .txt. We will have as many .bib as articles and we will just have to bind the resulting data.frames. ↩︎

##### Aurélien Goutsmedt
###### FNRS Postdoctoral Researcher (Chargé de Recherche FNRS)

I work on the history of macroeconomics and economic expertise.