Adds R Notebook about Authors data for Issue #42
Dave Gerrard committed Jul 27, 2018
1 parent 3a2a582 commit 95890f0
Showing 5 changed files with 1,159 additions and 0 deletions.
196 changes: 196 additions & 0 deletions altmetric_data_analysis/Notebook_Authors.Rmd
@@ -0,0 +1,196 @@
---
title: "Altmetric Authors"
date: 2018-07-27
output:
  html_notebook: default
  github_document: default
---


```{r setup, include = FALSE}
library(tidyverse)
authors <- read_csv('../files_out/20180411_1510_authors.csv')
mentions <- read_csv('../files_out/20180411_1510_mentions.csv')
articles <- read_csv('../files_out/20180411_1510_master.csv')
articles_with_mentions <- left_join(articles, mentions)
mentions_with_authors <- left_join(mentions, authors)
articles_with_mentions_and_authors <- left_join(articles_with_mentions, authors)
```

A dataset of all the Authors that have ever Mentioned any of the Articles in the set. **Note:** this is Altmetric's definition of the word Author, which means someone who Mentioned an Article somewhere (Altmetric may also use the term 'poster' of a 'post', but they're called Authors in the JSON from the API). This definition is distinct from the more academic one, i.e. 'author of the article itself': the [Dimensions](https://app.dimensions.ai/discover/publication) database might be a better source for finding out more about that type of author.

Join Authors onto the Mentions dataset and you can see which authors discussed the set the most:

```{r Most prolific Authors related to the set of Articles}
arrange(
  summarise(
    group_by(
      filter(
        mentions_with_authors,
        !is.na(author_name)),
      author_name),
    total_mentions = n()
  ),
  desc(total_mentions)
)
```
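
The nested call above can also be written as a pipeline, which reads in the order the steps run. The following is a minimal sketch on invented data: `toy_mentions_with_authors` and its values are hypothetical stand-ins for `mentions_with_authors`, and `count()` is used here as shorthand for `group_by()` plus `summarise(n())`.

```r
library(dplyr)

# Toy stand-in for mentions_with_authors; the values are invented.
toy_mentions_with_authors <- data.frame(
  author_name = c("Ann", "Bob", "Ann", NA, "Ann", "Bob"),
  source      = c("twitter", "twitter", "blogs", "news", "twitter", "news")
)

# Same logic as the nested call: drop unnamed Authors, count rows per name,
# then sort with the most prolific Author first.
prolific <- toy_mentions_with_authors %>%
  filter(!is.na(author_name)) %>%
  count(author_name, name = "total_mentions") %>%
  arrange(desc(total_mentions))

print(prolific)
```

Either style gives the same result; the pipeline form is simply easier to extend with extra filters.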

Join authors onto Mentions and the master Articles file and you can query how many times the Authors Mentioned a specific journal:

```{r Most prolific authors about a specific journal}
arrange(
  summarise(
    group_by(
      filter(articles_with_mentions_and_authors,
             journal_title == "Journal of Cancer Policy",
             !is.na(author_name)),
      author_name
    ),
    total_mentions = n()
  ),
  desc(total_mentions)
)
```


## How authors are managed

The authors dataset is generated by the Altmetric Client using logic based upon how Altmetric themselves handle data about authors. Unsurprisingly, this logic varies according to the source of the Mention the author wrote. For instance, Twitter Authors are uniquely identified using the 'author_id_on_source' field, while blogging Authors are identified using the blog's web address in the 'author_url' field.

The Altmetric Client therefore parses each Mention, and then uses the source of the Mention to decide which field uniquely identifies the Author, using rules in the [AuthorManager Python class](https://github.com/CamLib/AltmetricClient/blob/master/altmetric_client/author_manager.py). Authors are then added to the Authors dataset with an id, generated by the Altmetric Client, that is unique to the dataset overall. This unique id is then posted back into the Mentions set, enabling Mentions to be joined back to their Authors. This in turn enables Authors that have posted multiple Mentions to be recorded more easily; essentially all of Altmetric's logic regarding 'which field identifies an Author for a given source' is handled as simply as possible in the AuthorManager. Hence this logic doesn't have to be added to the R used to analyse the dataset.
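
To make the idea of a per-source identifying field concrete, here is a rough sketch in R. It is not the AuthorManager's actual code (that is Python and may work differently): the function names `identifying_field()` and `author_key()`, the source labels, the fallback rule and the example URL are all assumptions made for illustration, pieced together from the field descriptions in this notebook.

```r
# Hypothetical sketch only: the real rules live in the Python AuthorManager
# class and may differ from this mapping.
identifying_field <- function(source) {
  switch(source,
    twitter   = "author_id_on_source",
    blogs     = ,
    news      = ,
    wikipedia = "author_url",
    "author_id_on_source"  # assumed fallback for any other source
  )
}

# Build a key that is unique per Author *and* source.
author_key <- function(mention) {
  field <- identifying_field(mention$source)
  paste(mention$source, mention[[field]], sep = ":")
}

# Invented example Mentions.
tweet <- list(source = "twitter", author_id_on_source = "EpiphanyLboro",
              author_url = NA)
post  <- list(source = "blogs", author_id_on_source = NA,
              author_url = "https://example.org/blog")

author_key(tweet)
author_key(post)
```

Because the source is part of the key, the same person posting on two platforms ends up with two keys, which matches the limitation described in this section.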

There is, however, no way of identifying the same author mentioning an Article across various sources (e.g. a journalist mentioning an Article in a news story, and then tweeting about it). This is par for the course with social media data, though: there are plenty of people on platforms such as Twitter who aren't who they say they are, so you can never really be *sure* that an id from one platform corresponds to an id from another.

## Authors Data Dictionary

The Authors dataset contains the following fields:

### author_description

**Data type: character**

The author_description field is the description the Author has provided about themselves. In the test dataset at least, these only seem to be provided when the source is Twitter or blogs.

```{r Filters those Authors that have a description}
select(
  filter(authors,
         !is.na(author_description)),
  author_name,
  author_source,
  author_description
)
```

These descriptions might nonetheless provide a decent set of free text for Natural Language Processing. (My PhD research indicates that 'Twitter Biography' data potentially contains more useful information than tweets.)

### author_follower_count

**Data type: int**

Altmetric record the number of followers each Author has, but only from Twitter and Reddit. This clearly has some potential for assessing the reach of a specific article or journal on either of those platforms (see **Notebook_Mentions.Rmd** for the code to do that).

```{r Totals of all followers by source}
summarise(
  group_by(
    authors,
    author_source
  ),
  # na.rm = TRUE so that one missing count does not blank a source's total
  total_of_all_followers = sum(author_follower_count, na.rm = TRUE)
)
```
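
Since follower counts are only recorded for some sources, it is worth being explicit about how missing values affect a total. The sketch below uses invented data (`toy_authors` is a hypothetical stand-in for `authors`) to show how `sum()` behaves with and without `na.rm = TRUE`, alongside a count of how many values were missing.

```r
library(dplyr)

# Toy stand-in for the authors dataset; the values are invented.
toy_authors <- data.frame(
  author_source         = c("twitter", "twitter", "reddit", "news"),
  author_follower_count = c(120, NA, 30, NA)
)

follower_totals <- toy_authors %>%
  group_by(author_source) %>%
  summarise(
    strict_total  = sum(author_follower_count),                # NA if any count is missing
    lenient_total = sum(author_follower_count, na.rm = TRUE),  # ignores missing counts
    missing       = sum(is.na(author_follower_count))          # how many were ignored
  )

print(follower_totals)
```

Reporting the `missing` column next to the total makes it clear when a total of zero really means "no data" rather than "no followers".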



### author_id

**Data type: character**

An id generated by the Altmetric Client itself, which is unique to each analytical dataset, and is used to join each Author to the mentions they have written. The author_id is unique by author **and source** - the same actual Author cannot be identified across multiple sources. (See *How authors are managed* above).

### author_id_on_source

**Data type: character**

The information Altmetric have retrieved about the Author's id *on the platform from which their Mentions have been found*. For instance, my Twitter author_id_on_source (for my largely unused Twitter account) would be EpiphanyLboro. This is the most-commonly-used unique identifier for Authors in sources, but isn't the only field that can be used as an id. Some sources - e.g. blogs or pages about specific journalists in newspapers - potentially don't have a specific Author id.

### author_image_url

**Data type: character**

This is a very odd field that mostly seems to take data from Altmetric's own Amazon cloud web service. However, it does seem to uniquely identify policy 'Authors' - indeed the code chunk below:

```{r Authors of multiple policy documents}
arrange(
  summarise(
    group_by(
      filter(mentions_with_authors, source == "policy"),
      author_image_url
    ),
    total = n()
  ),
  desc(total)
)
```

... outputs a list of policy 'Authors' that are in fact [thumbnails of the front covers of policy documents](https://s3.amazonaws.com/cache.altmetric.com/policy/thumbnails/thumbnail-a9fd0dcbe86e77e6106c316a34a90235266d3c52b58b94abddb55f9f81c5ffd4.jpg) stored on Altmetric's server, no doubt for display on their own website. From this we can surmise that the policy's 'Author' is in fact the **policy document** itself.

### author_name

**Data type: character**

The given name of the author, if known. Quite often Authors fail to name themselves, and it seems as if finding the names of the Authors of policy Mentions in particular might be problematic, as shown below.

```{r Missing Author names by Mention source}
ggplot(data = filter(authors, is.na(author_name))) +
  geom_bar(mapping = aes(x = author_source)) +
  coord_flip()
```

### author_source

**Data type: character**

The name of the source from which this Author's Mentions were found. This information can also be found in the Mentions set so it's a bit redundant here, though it means you can look for which sources all the Authors come from without having to join to Mentions and group by author:

```{r Sources Authors come from}
ggplot(data = authors) +
  geom_bar(mapping = aes(x = author_source)) +
  coord_flip()
```

### author_url

**Data type: character**

Any URL that Altmetric have seen fit to attach to an author. These are used by Altmetric to identify blog, news, and Wikipedia Authors, e.g.:

```{r List of Wikipedia Author URLs}
select(
  filter(authors, author_source == "wikipedia"),
  author_name,
  author_url
)
```



496 changes: 496 additions & 0 deletions altmetric_data_analysis/Notebook_Authors.nb.html

Large diffs are not rendered by default.

110 changes: 110 additions & 0 deletions altmetric_data_analysis/Notebook_Mentions.Rmd
@@ -0,0 +1,110 @@
---
title: "Altmetric Mentions"
date: 2018-07-26
output:
  html_notebook: default
  github_document: default
---

```{r setup, include = FALSE}
library(tidyverse)
mentions <- read_csv('../files_out/20180411_1510_mentions.csv')
articles <- read_csv('../files_out/20180411_1510_master.csv')
authors <- read_csv('../files_out/20180411_1510_authors.csv')
articles_with_mentions <- left_join(articles, mentions)
articles_with_mentions_and_authors <- left_join(articles_with_mentions, authors)
```

A Mention (sometimes also called a *post* in Altmetric parlance) is a piece of content in which a specific article is mentioned. Collecting these mentions is pretty much the core piece of value that Altmetric add.

As ever, the most fun with Mentions can be had by joining them to the master Articles dataset. However, the Altmetric client also extracts Authors information from each Mention and adds it to a third dataset (see the Authors notebook for more about this set).

```{r Shows the journals that are most mentioned in news articles}
arrange(
  summarise(
    group_by(
      filter(articles_with_mentions,
             source == "news"),
      journal_title
    ),
    total = n()
  ),
  desc(total)
)
```

## Mentions Data Dictionary

The following fields are included in the Mentions dataset:

### author_id

**Data type: character**

Used to join Mentions to the Authors dataset. Author information is extracted for each Mention and added to a separate set, which enables analysis of the engagement particular authors have with the Articles in the master dataset. For example, which papers have been mentioned by the Twitter users with the highest numbers of followers?

```{r Articles ordered by the total number of followers of those that tweeted}
arrange(
  summarise(
    group_by(filter(articles_with_mentions_and_authors, source == "twitter"),
             article_title),
    # na.rm = TRUE so that one missing count does not blank an article's total
    total_followers = sum(author_follower_count, na.rm = TRUE)
  ),
  desc(total_followers)
)
```

The query above counts two tweets about one article by a Twitter user with n followers as reaching 2n followers. Given that not all followers of a Twitter user see every tweet that user posts, this seems a reasonable assumption.
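
A small worked example may help show what this counting rule does in practice. The sketch below uses invented data (`toy` and its values are hypothetical) and compares the per-tweet total with an alternative that counts each Author only once per article; which is more appropriate depends on the question being asked.

```r
library(dplyr)

# Invented data: author a1 (100 followers) tweets Paper A twice,
# author a2 (50 followers) tweets it once, and a1 tweets Paper B once.
toy <- data.frame(
  article_title         = c("Paper A", "Paper A", "Paper A", "Paper B"),
  author_id             = c("a1", "a1", "a2", "a1"),
  author_follower_count = c(100, 100, 50, 100)
)

# Per-tweet counting: a1's two tweets about Paper A are counted twice.
per_tweet <- toy %>%
  group_by(article_title) %>%
  summarise(total_followers = sum(author_follower_count))

# Alternative: count each Author once per article.
per_author <- toy %>%
  distinct(article_title, author_id, author_follower_count) %>%
  group_by(article_title) %>%
  summarise(total_followers = sum(author_follower_count))

print(per_tweet)
print(per_author)
```

For Paper A the per-tweet total is 100 + 100 + 50 = 250, while the once-per-Author total is 100 + 50 = 150.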

### date_posted

**Data type: POSIX Calendar Time**

The date upon which the Mention was posted. This is incredibly useful for trending the timeline of the 'buzz' around a specific Article (in the case below, the most mentioned one in the test set - 10.1920/bn.ifs.2017.bn0211).

```{r Trend of the buzz around a specific article}
article_mentions <- filter(mentions, doi == "10.1920/bn.ifs.2017.bn0211")
ggplot(data = article_mentions, mapping = aes(x = date_posted)) +
  # date_posted is a date-time, so binwidth is in seconds: 86400 = one day
  geom_freqpoly(binwidth = 86400)
```
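
The `binwidth` of 86400 is the number of seconds in a day, so the chart is effectively a count of Mentions per day. As a rough sketch, the same daily counts can be produced as a table; the data below is invented, and `toy_mentions` and `daily_counts` are hypothetical names.

```r
library(dplyr)

# Invented Mentions with POSIXct timestamps in UTC.
toy_mentions <- data.frame(
  date_posted = as.POSIXct(
    c("2017-07-01 09:00:00", "2017-07-01 17:30:00", "2017-07-02 08:15:00"),
    tz = "UTC"
  )
)

seconds_per_day <- 24 * 60 * 60  # the binwidth used in the chart

# Truncate each timestamp to its calendar day, then count Mentions per day.
daily_counts <- toy_mentions %>%
  mutate(day = as.Date(date_posted, tz = "UTC")) %>%
  count(day, name = "mentions")

print(daily_counts)
```

A table like this is handy for checking the chart, or for finding the exact day on which the buzz peaked.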

### doi

**Data type: character**

The key used to link a Mention back to the Article it mentioned in the master Articles set.
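
The setup chunk lets `left_join()` pick the join columns automatically. Naming the key makes the link explicit, as in the sketch below. It assumes the master Articles file also calls its key column `doi`, which is not confirmed here; the DOIs, titles and `toy_` names are invented.

```r
library(dplyr)

# Invented stand-ins for the master Articles and Mentions datasets.
toy_articles <- data.frame(
  doi           = c("10.1000/a", "10.1000/b"),
  article_title = c("Paper A", "Paper B")
)
toy_mentions <- data.frame(
  doi    = c("10.1000/a", "10.1000/a"),
  source = c("twitter", "news")
)

# Explicit key: every Article is kept, with one row per Mention.
joined <- left_join(toy_articles, toy_mentions, by = "doi")

# Articles that have no Mentions at all.
unmentioned <- anti_join(toy_articles, toy_mentions, by = "doi")

print(joined)
print(unmentioned)
```

Stating `by = "doi"` also guards against an accidental join on any other column the two files happen to share.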

### source

**Data type: character**

The source of the Mention. These can be charted using the query below (Twitter usually swamps all the others at time of writing).

```{r Charts a count of mentions by source}
ggplot(data = mentions) +
  geom_bar(mapping = aes(x = source)) +
  coord_flip()
```


### url

**Data type: character**

The URL of the Mention (i.e. the place on the internet that Altmetric found it).



345 changes: 345 additions & 0 deletions altmetric_data_analysis/Notebook_Mentions.nb.html

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions altmetric_data_analysis/author_summaries.R
@@ -73,4 +73,16 @@ ggplot(data = followers_and_news) +
  geom_point(mapping = aes(x = total_followers, y = n))


arrange(
  summarise(
    group_by(
      filter(mentions_with_authors, source == "policy"),
      author_image_url
    ),
    total = n()
  ),
  desc(total)
) %>% write.csv("../files_out/policy_author_urls.csv")


