update full_join method #92

Merged: 1 commit, Mar 2, 2023

xiangpin
Member

Description

Update the full_join() method for treedata objects.

Related Issue

full_join() on a treedata object does not work with the standard dplyr form by=c('columnX'='columnY'). The related issue is YuLab-SMU/tidytree#32.

In addition, the original full_join() generates errors if the external data.frame contains labels that are not present in the tree.

If da contains duplicated node rows, the original phylo tree structure is damaged. Both problems are shown below.

> library(treeio)
> tr <- rtree(4)
> da <- data.frame(label=c('t1', 't2', 't8'), values=c(10, 20, 80))
> tr %>% full_join(da, by='label') %>% ggtree::ggtree()
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 4 tips and 4 internal nodes.

Tip labels:
  t2, t3, t1, t4
Node labels:
  NA, NA, NA, t8

Rooted; includes branch lengths.

with the following features available:
  'values'.

# The associated data tibble abstraction: 8 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
   node label isTip values
  <int> <chr> <lgl>  <dbl>
1     1 t2    TRUE      20
2     2 t3    TRUE      NA
3     3 t1    TRUE      10
4     4 t4    TRUE      NA
5     5 NA    FALSE     NA
6     6 NA    FALSE     NA
7     7 NA    FALSE     NA
8     8 t8    FALSE     NA
> tr <- rtree(4)
> da <- data.frame(label=c('t1', 't2', 't3', 't3'), values=c(10, 20, 80, 90))
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 5 tips and 3 internal nodes.

Tip labels:
  t2, t1, t4, t3, t3

Rooted; includes branch lengths.

with the following features available:
  'values'.

# The associated data tibble abstraction: 11 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip values
   <int> <chr> <lgl>  <dbl>
 1     1 t2    TRUE      20
 2     2 t1    TRUE      10
 3     3 t4    TRUE      NA
 4     4 t3    TRUE      80
 5     4 t3    TRUE      90
 6     4 t3    TRUE      80
 7     4 t3    TRUE      90
 8     5 t3    FALSE     NA
 9     6 NA    FALSE     NA
10     7 NA    FALSE     NA
# … with 1 more row
# ℹ Use `print(n = ...)` to see more rows

In the first example above, t8 comes from da but does not exist in the phylo tree. I think it is better to remove it when da is joined, so that full_join() behaves like left_join() on the treedata or phylo class. It is difficult to add a new node or tip to a phylo tree without other information such as edge.length.

This update makes the following changes:

  • the by argument supports by=c('columnX'='columnY').
  • nodes or labels that do not exist in the phylo tree are removed after joining, so the result is like left_join().
  • duplicated node rows are nested automatically.
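
The three behaviours listed above can be sketched on plain tibbles with dplyr and tidyr. This is only an illustration of the intended semantics, not the code in this PR: tree_tbl is a hypothetical stand-in for the node table of a 4-tip tree, the taxa column name is made up, and the nested column is called data here rather than values.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the node table of a 4-tip tree
# (an assumption for illustration; not a real phylo object).
tree_tbl <- tibble(
  node  = 1:7,
  label = c("t2", "t1", "t4", "t3", NA, NA, NA),
  isTip = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# External data: 't3' is duplicated and 't8' is not in the tree.
da <- tibble(
  taxa   = c("t1", "t2", "t3", "t3", "t8"),
  values = c(10, 20, 80, 90, 100)
)

# Named 'by' maps the tree column to the data column; a left join
# keeps every tree row and drops 't8', which has no match in the tree.
joined <- tree_tbl %>%
  left_join(da, by = c("label" = "taxa"))

# Nest the joined values so each node occupies exactly one row;
# the duplicated 't3' rows collapse into a single 2-row tibble.
nested <- joined %>%
  nest(data = values)

print(nested)
```

Keeping one row per node is what protects the tree structure: the row count of the nested table matches the number of nodes in the tree, however many rows da supplies for a label.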

Example

> tr <- rtree(4)
> da <- data.frame(label=c('t1', 't2', 't8'), values=c(10, 20, 80))
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 4 tips and 3 internal nodes.

Tip labels:
  t2, t1, t4, t3

Rooted; includes branch lengths.

with the following features available:
  '', 'values'.

# The associated data tibble abstraction: 7 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
   node label isTip values
  <int> <chr> <lgl>  <dbl>
1     1 t2    TRUE      20
2     2 t1    TRUE      10
3     3 t4    TRUE      NA
4     4 t3    TRUE      NA
5     5 NA    FALSE     NA
6     6 NA    FALSE     NA
7     7 NA    FALSE     NA
> da <- data.frame(label=c('t1', 't2', 't3', 't3'), values=c(10, 20, 80, 90))
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 4 tips and 3 internal nodes.

Tip labels:
  t2, t1, t4, t3

Rooted; includes branch lengths.

with the following features available:
  '', 'values'.

# The associated data tibble abstraction: 7 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
   node label isTip values
  <int> <chr> <lgl> <list>
1     1 t2    TRUE  <tibble [1 × 1]>
2     2 t1    TRUE  <tibble [1 × 1]>
3     3 t4    TRUE  <tibble [1 × 1]>
4     4 t3    TRUE  <tibble [2 × 1]>
5     5 NA    FALSE <tibble [1 × 1]>
6     6 NA    FALSE <tibble [1 × 1]>
7     7 NA    FALSE <tibble [1 × 1]>
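
For the named by form, a minimal usage sketch follows. It assumes a treeio build that includes this PR; the taxa column name is made up for illustration, and the tip labels depend on the random tree.

```r
library(treeio)

tr <- rtree(4)

# The key column in the external data is 'taxa' (a hypothetical name),
# while the tree side uses 'label'.
da <- data.frame(taxa = c('t1', 't2', 't8'), values = c(10, 20, 80))

# Map the tree column to the data column with a named 'by'.
# 't8' is not a tip of the tree, so it is expected to be dropped.
tr2 <- tr %>% full_join(da, by = c('label' = 'taxa'))
tr2
```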

GuangchuangYu merged commit 2195cc8 into YuLab-SMU:master on Mar 2, 2023.