Reproducible science: Module 3

# Reproducible science: Module 3
## Dealing with data: Tidyverse
### <a href="dossa@xtbg.org.cn">Gbadamassi G.O. Dossa</a>
### Xishuangbanna Tropical Botanical Garden, XTBG-CAS
### 2021/09/25 (updated: 2022-06-27)

---

class: center
# Acknowledgements
The content of this module are based on materials from:

.pull-right[
![olivier gimenez](img/olivier gimenez.png)
]
.pull-right[
[olivier gimenez's materials](https://oliviergimenez.github.io/)
]

---
class: center

# What is tidyverse and advantages?
.left[
"A framework for managing data that aims at making the cleaning and preparing steps [muuuuuuuch] easier" (Julien Barnier).
Main characteristics of a tidy dataset:

- the dataset is [tibble](https://tibble.tidyverse.org/);

- measured varaiable as a column;

- an observation represents a row with each value is in a different cell.

[**tidyverse**](https://www.tidyverse.org/) consits of a compilation of r packages for data analysis.
]
![tidyverse tibbles](img/Variable observation and measurement.png)
<img src="img/tidyverse.png" width="10%" style="display: block; margin: auto 0 auto auto;" />

---
class: center
# Recognizing a tidy dataset
![Tibble 1](img/Tibble 1.png)

---
class: center
# Recognizing a tidy dataset
![Tibble 2](img/Tibble 2.png)

---
class: center
# Recognizing a tidy dataset
![Tibble 3](img/Tibble 3.png)

---
class: center
# Recognizing a tidy dataset
![Tibble 4](img/Tibble 4.png)

---
class: center

# Tidyverse: Multiple r packages well compiled

Makes data manipulation pretty natural

- [ggplot2](https://ggplot2-book.org/) - visualizing stuff;

- [dplyr](https://dplyr.tidyverse.org/), [tidyr](https://tidyr.tidyverse.org/) - data manipulation;

- [purrr](https://purrr.tidyverse.org/) - advanced programming;

- [readr](https://readr.tidyverse.org/) - import data;

- [tibble](https://tibble.tidyverse.org/) - improved data.frame format;

- [forcats](https://forcats.tidyverse.org/) - working with factors;

- [stringr](https://stringr.tidyverse.org/) - working with chain of characters.
]
---
class: center
# Simplified flowchart of data science?
.left[Any data analysis follows this typical flow:
1. Import data;
2. Clean data;
3. Exploratory analysis. A cycle between:
    - Visualization;
    - modeling;
    - Transformation
4. Communicate]
<small>If these steps happen at multiple software then errors are highly inevitable.</small>

<div class="figure" style="text-align: center">
<img src="img/workflow in data analysis.png" alt="Reproducibilit equals effecient use of time" width="60%" />
<p class="caption">Reproducibilit equals effecient use of time</p>
</div>

---
class: center
# Tidyverse saves: same flowchart in tidyverse
<div class="figure" style="text-align: center">
<img src="img/workflow data analysis within tidyverse.png" alt="Reproducibilit equals effecient use of time" width="80%" />
<p class="caption">Reproducibilit equals effecient use of time</p>
</div>

---
class: center

# Practice in tidyverse “Use twitter to predict citation rate”
![Twitter exercice front page](img/Twitter exercice front page.png)
.left[
We will use an existing data supporting the [above publication](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166570) to learn some functions within **tidyverse**.
]

---

- creates tibbles instead of data.frame;

- no names to rows;

- allows column names with special characters (see next slide);

- more clever on screen display than w/ data.frames (see next slide);

- no partial matching on column names;

- warning if attempt to access unexisting column;

- is incredibly fast.
]

---
class: center

# Import data
.left[

```r
# Set the url from where to download the data
url<-"https://doi.org/10.1371/journal.pone.0166570.s001"
# name the file to be downloaded and save as destfile object
destfile <- "twitter_cit_data.csv"
# Apply download.file function in R to download from url
download.file(url, destfile)
library(tidyverse)
```

```
## Warning: package 'ggplot2' was built under R version 4.1.1
```

```
## Warning: package 'readr' was built under R version 4.1.1
```

```r
# Read the data file with read_csv() and save with name "citations_raw"
*citations_raw<-read_csv(file="twitter_cit_data.csv")
head(citations_raw)
```
]
---
class: center

# Import data

```r
citations_raw
```

```
## # A tibble: 1,599 x 12
##    `Journal identity` `5-year journal im~ `Year published` Volume Issue Authors 
##    <chr>                            <dbl>            <dbl>  <dbl> <chr> <chr>   
##  1 Ecology Letters                   16.7             2014     17 12    Morin e~
##  2 Ecology Letters                   16.7             2014     17 12    Jucker ~
##  3 Ecology Letters                   16.7             2014     17 12    Calcagn~
##  4 Ecology Letters                   16.7             2014     17 11    Segre e~
##  5 Ecology Letters                   16.7             2014     17 11    Kaufman~
##  6 Ecology Letters                   16.7             2014     17 10    Nasto e~
##  7 Ecology Letters                   16.7             2014     17 10    Tschirr~
##  8 Ecology Letters                   16.7             2014     17 9     Barnech~
##  9 Ecology Letters                   16.7             2014     17 9     Pinto-S~
## 10 Ecology Letters                   16.7             2014     17 9     Clough ~
## # ... with 1,589 more rows, and 6 more variables: Collection date <chr>,
## #   Publication date <chr>, Number of tweets <dbl>, Number of users <dbl>,
## #   Twitter reach <dbl>, Number of Web of Science citations <dbl>
```
]

---
class: center

# Tidy/transform: Rename columns

```r
citations_temp <- rename(citations_raw,
       journal = 'Journal identity',
       impactfactor = '5-year journal impact factor',
       pubyear = 'Year published',
       colldate = 'Collection date',
       pubdate = 'Publication date',
       nbtweets = 'Number of tweets',
       woscitations = 'Number of Web of Science citations')
head(citations_temp,5,6)
```

```
## # A tibble: 5 x 12
##   journal   impactfactor pubyear Volume Issue Authors  colldate pubdate nbtweets
##   <chr>            <dbl>   <dbl>  <dbl> <chr> <chr>    <chr>    <chr>      <dbl>
## 1 Ecology ~         16.7    2014     17 12    Morin e~ 2/1/2016 9/16/2~       18
## 2 Ecology ~         16.7    2014     17 12    Jucker ~ 2/1/2016 10/13/~       15
## 3 Ecology ~         16.7    2014     17 12    Calcagn~ 2/1/2016 10/21/~        5
## 4 Ecology ~         16.7    2014     17 11    Segre e~ 2/1/2016 8/28/2~        9
## 5 Ecology ~         16.7    2014     17 11    Kaufman~ 2/1/2016 8/28/2~        3
## # ... with 3 more variables: Number of users <dbl>, Twitter reach <dbl>,
## #   woscitations <dbl>
```
]

---
class: center

# Tidy: Clean up column names

.left[
To clean columns, use function *clean_names()* from the package janitor from it will fill space in column names by "_".

```r
janitor::clean_names(citations_raw)
```

```
## # A tibble: 1,599 x 12
##    journal_identity x5_year_journal_impa~ year_published volume issue authors   
##    <chr>                            <dbl>          <dbl>  <dbl> <chr> <chr>     
##  1 Ecology Letters                   16.7           2014     17 12    Morin et ~
##  2 Ecology Letters                   16.7           2014     17 12    Jucker et~
##  3 Ecology Letters                   16.7           2014     17 12    Calcagno ~
##  4 Ecology Letters                   16.7           2014     17 11    Segre et ~
##  5 Ecology Letters                   16.7           2014     17 11    Kaufman e~
##  6 Ecology Letters                   16.7           2014     17 10    Nasto et ~
##  7 Ecology Letters                   16.7           2014     17 10    Tschirren~
##  8 Ecology Letters                   16.7           2014     17 9     Barnechi ~
##  9 Ecology Letters                   16.7           2014     17 9     Pinto-San~
## 10 Ecology Letters                   16.7           2014     17 9     Clough et~
## # ... with 1,589 more rows, and 6 more variables: collection_date <chr>,
## #   publication_date <chr>, number_of_tweets <dbl>, number_of_users <dbl>,
## #   twitter_reach <dbl>, number_of_web_of_science_citations <dbl>
```
]

---
class: center

# Tidy: Create and modify columns
.left[
The well known function to create and modify columns is  *mutate()*, This function takes first the tibble names, the new_name= what you want to do to old column.

```r
citations <- mutate(citations_temp, journal = as.factor(journal)) 
#Pay attention that I store in "citations"
citations
```

```
## # A tibble: 1,599 x 12
*##    journal  impactfactor pubyear Volume Issue Authors  colldate pubdate nbtweets
*##    <fct>           <dbl>   <dbl>  <dbl> <chr> <chr>    <chr>    <chr>      <dbl>
##  1 Ecology~         16.7    2014     17 12    Morin e~ 2/1/2016 9/16/2~       18
##  2 Ecology~         16.7    2014     17 12    Jucker ~ 2/1/2016 10/13/~       15
##  3 Ecology~         16.7    2014     17 12    Calcagn~ 2/1/2016 10/21/~        5
##  4 Ecology~         16.7    2014     17 11    Segre e~ 2/1/2016 8/28/2~        9
##  5 Ecology~         16.7    2014     17 11    Kaufman~ 2/1/2016 8/28/2~        3
##  6 Ecology~         16.7    2014     17 10    Nasto e~ 2/2/2016 7/28/2~       27
##  7 Ecology~         16.7    2014     17 10    Tschirr~ 2/2/2016 8/6/20~        6
##  8 Ecology~         16.7    2014     17 9     Barnech~ 2/2/2016 6/17/2~       19
##  9 Ecology~         16.7    2014     17 9     Pinto-S~ 2/2/2016 6/12/2~       26
## 10 Ecology~         16.7    2014     17 9     Clough ~ 2/2/2016 7/17/2~       44
## # ... with 1,589 more rows, and 3 more variables: Number of users <dbl>,
## #   Twitter reach <dbl>, woscitations <dbl>
```
]

---
# Tidy: Create and modify columns
.left[
Check now the levels of journal variable

```r
levels(citations$journal) 
```

```
##  [1] "Animal Conservation"              "Conservation Letters"            
##  [3] "Diversity and Distributions"      "Ecological Applications"         
##  [5] "Ecology"                          "Ecology Letters"                 
##  [7] "Evolution"                        "Evolutionary Applications"       
##  [9] "Fish and Fisheries"               "Functional Ecology"              
## [11] "Global Change Biology"            "Global Ecology and Biogeography" 
## [13] "Journal of Animal Ecology"        "Journal of Applied Ecology"      
## [15] "Journal of Biogeography"          "Limnology and Oceanography"      
## [17] "Mammal Review"                    "Methods in Ecology and Evolution"
## [19] "Molecular Ecology Resources"      "New Phytologist"
```
]
---
class: center

# Piping: Make your manipulations easier

Piping was borrowed from other languages, got incorporated into R after a question in [Pipe question](https://stackoverflow.com/questions/8896820/how-to-implement-fs-forward-pipe-operator-in-r) in 2012.
Pipe which is the bar "|" on your keyboard.

---
class: center

# Omelette: Base r approach
You need to do complicated programming: create multiple intermediate objects; embed, needs some understanding of coding and is prone to errors.![omelette in r base](img/Omlete succesive command.png)

---
class: center
# Omelette: Piping approach
.left[Simpler programming using piping. Piping consists of: taking results from previous function as a starting point of a new function;less prone to errors and consume less memory.
]![omelette in tidyverse](img/omelete with pipe commands.png)

---
class: center
# Example of piping
.left[
Take the tibble "citations_raw" **then** rename some columns then the new tibble containing the renamed tibble and *then* convert the column "journal" from current class ("character") to factor.

```r
citations_raw %>%
* rename(journal = 'Journal identity',
       impactfactor = '5-year journal impact factor',
       pubyear = 'Year published',
       colldate = 'Collection date',
       pubdate = 'Publication date',
       nbtweets = 'Number of tweets',
       woscitations = 'Number of Web of Science citations') %>%
* mutate(journal = as.factor(journal))
```
]
.left[
Please notice every time I say **"then"** this is equal to **"%>%"**. 
]
---
class: center

# Naming final object of pipe
.left[

```r
*citations <- citations_raw %>%
  rename(journal = 'Journal identity',
       impactfactor = '5-year journal impact factor',
       pubyear = 'Year published',
       colldate = 'Collection date',
       pubdate = 'Publication date',
       nbtweets = 'Number of tweets',
       woscitations = 'Number of Web of Science citations') %>%
  mutate(journal = as.factor(journal))
head(citations)
```

```
## # A tibble: 6 x 12
##   journal   impactfactor pubyear Volume Issue Authors  colldate pubdate nbtweets
##   <fct>            <dbl>   <dbl>  <dbl> <chr> <chr>    <chr>    <chr>      <dbl>
## 1 Ecology ~         16.7    2014     17 12    Morin e~ 2/1/2016 9/16/2~       18
## 2 Ecology ~         16.7    2014     17 12    Jucker ~ 2/1/2016 10/13/~       15
## 3 Ecology ~         16.7    2014     17 12    Calcagn~ 2/1/2016 10/21/~        5
## 4 Ecology ~         16.7    2014     17 11    Segre e~ 2/1/2016 8/28/2~        9
## 5 Ecology ~         16.7    2014     17 11    Kaufman~ 2/1/2016 8/28/2~        3
## 6 Ecology ~         16.7    2014     17 10    Nasto e~ 2/2/2016 7/28/2~       27
## # ... with 3 more variables: Number of users <dbl>, Twitter reach <dbl>,
## #   woscitations <dbl>
```
]

---
class: center

# Naming final object of pipe 2
.left[

```r
citations_raw %>% 
  rename(journal = 'Journal identity',
       impactfactor = '5-year journal impact factor',
       pubyear = 'Year published',
       colldate = 'Collection date',
       pubdate = 'Publication date',
       nbtweets = 'Number of tweets',
       woscitations = 'Number of Web of Science citations') %>%
* mutate(journal = as.factor(journal))-> citations2
head(citations2) 
```

---
class: center

# Pipe synthax
.left[
- *Verb(Subject,Complement)* replaced by *Subject %>% Verb(Complement)*;

- No need to name unimportant intermediate variables;

- Clear syntax (readability).

]
.left[

If you want you can first write what you want to accomplished in a text with "then" as step wise, then code it by replace "then" by the pipe with its operator "%>%" of course.

]

---
class: center, middle, inverse
# Other functions in Tidyverse
---
class: center
# Select columns
.left[*select()* is the function one uses to select different variables i a tibble. You just need to remember that it  follows a pipe operator (%>%), and it takes the name of columns one desires to select.

```r
citations %>%
  select(journal, impactfactor, nbtweets)
```

```
## # A tibble: 1,599 x 3
##    journal         impactfactor nbtweets
##    <fct>                  <dbl>    <dbl>
##  1 Ecology Letters         16.7       18
##  2 Ecology Letters         16.7       15
##  3 Ecology Letters         16.7        5
##  4 Ecology Letters         16.7        9
##  5 Ecology Letters         16.7        3
##  6 Ecology Letters         16.7       27
##  7 Ecology Letters         16.7        6
##  8 Ecology Letters         16.7       19
##  9 Ecology Letters         16.7       26
## 10 Ecology Letters         16.7       44
## # ... with 1,589 more rows
```
]

---
class: center
# Drop columns or deselect variables

.left[
The opposite of selecting, which is deselecting. One just need to be more logical in the writing. 
Would you like to guess?
]
--

```r
citations %>%
* select(-Volume, -Issue, -Authors)
```

```
## # A tibble: 1,599 x 9
##    journal     impactfactor pubyear colldate pubdate   nbtweets `Number of user~
##    <fct>              <dbl>   <dbl> <chr>    <chr>        <dbl>            <dbl>
##  1 Ecology Le~         16.7    2014 2/1/2016 9/16/2014       18               16
##  2 Ecology Le~         16.7    2014 2/1/2016 10/13/20~       15               12
##  3 Ecology Le~         16.7    2014 2/1/2016 10/21/20~        5                4
##  4 Ecology Le~         16.7    2014 2/1/2016 8/28/2014        9                8
##  5 Ecology Le~         16.7    2014 2/1/2016 8/28/2014        3                3
##  6 Ecology Le~         16.7    2014 2/2/2016 7/28/2014       27               23
##  7 Ecology Le~         16.7    2014 2/2/2016 8/6/2014         6                6
##  8 Ecology Le~         16.7    2014 2/2/2016 6/17/2014       19               18
##  9 Ecology Le~         16.7    2014 2/2/2016 6/12/2014       26               23
## 10 Ecology Le~         16.7    2014 2/2/2016 7/17/2014       44               42
## # ... with 1,589 more rows, and 2 more variables: Twitter reach <dbl>,
## #   woscitations <dbl>
```
]

---
class: center
# Split a column in several columns

.left[separate is the function used to split a column into several of course you need to indicate what symbol is the separator (e.g., space, -, /, etc.).

```r
head(citations$pubdate)
```

```
## [1] "9/16/2014"  "10/13/2014" "10/21/2014" "8/28/2014"  "8/28/2014" 
## [6] "7/28/2014"
```

```r
citations %>%
  select(journal, impactfactor, nbtweets, pubdate)%>%
  separate(pubdate,c('month','day','year'),'/')
```

```
## # A tibble: 1,599 x 6
##    journal         impactfactor nbtweets month day   year 
##    <fct>                  <dbl>    <dbl> <chr> <chr> <chr>
##  1 Ecology Letters         16.7       18 9     16    2014 
##  2 Ecology Letters         16.7       15 10    13    2014 
##  3 Ecology Letters         16.7        5 10    21    2014 
##  4 Ecology Letters         16.7        9 8     28    2014 
##  5 Ecology Letters         16.7        3 8     28    2014 
##  6 Ecology Letters         16.7       27 7     28    2014 
##  7 Ecology Letters         16.7        6 8     6     2014 
##  8 Ecology Letters         16.7       19 6     17    2014 
##  9 Ecology Letters         16.7       26 6     12    2014 
## 10 Ecology Letters         16.7       44 7     17    2014 
## # ... with 1,589 more rows
```
]
---
class: center

# Transform column in date format
.left[Many of us work with ecological data that record date, and we find it hard to keep these on readable format in R. Within, tidyverse there is a package that specially deals with date formatting variables/columns.
The package is called [**lubridate**](https://lubridate.tidyverse.org/).

```r
library(lubridate)
citations %>%
  mutate(pubdate = mdy(pubdate),
         colldate = mdy(colldate))%>%
  select(journal,impactfactor, nbtweets, pubdate, colldate)
```

```
## # A tibble: 1,599 x 5
##    journal         impactfactor nbtweets pubdate    colldate  
##    <fct>                  <dbl>    <dbl> <date>     <date>    
##  1 Ecology Letters         16.7       18 2014-09-16 2016-02-01
##  2 Ecology Letters         16.7       15 2014-10-13 2016-02-01
##  3 Ecology Letters         16.7        5 2014-10-21 2016-02-01
##  4 Ecology Letters         16.7        9 2014-08-28 2016-02-01
##  5 Ecology Letters         16.7        3 2014-08-28 2016-02-01
##  6 Ecology Letters         16.7       27 2014-07-28 2016-02-02
##  7 Ecology Letters         16.7        6 2014-08-06 2016-02-02
##  8 Ecology Letters         16.7       19 2014-06-17 2016-02-02
##  9 Ecology Letters         16.7       26 2014-06-12 2016-02-02
## 10 Ecology Letters         16.7       44 2014-07-17 2016-02-02
## # ... with 1,589 more rows
```
]
---
class: center
# For easy date format manipulation
.left[Check out ?lubridate::lubridate for more functions

```r
library(lubridate)
citations %>%
  mutate(pubdate = mdy(pubdate),
         pubyear2 = year(pubdate))%>%
  select(journal,impactfactor, pubdate, colldate, pubyear2)
```

```
## # A tibble: 1,599 x 5
##    journal         impactfactor pubdate    colldate pubyear2
##    <fct>                  <dbl> <date>     <chr>       <dbl>
##  1 Ecology Letters         16.7 2014-09-16 2/1/2016     2014
##  2 Ecology Letters         16.7 2014-10-13 2/1/2016     2014
##  3 Ecology Letters         16.7 2014-10-21 2/1/2016     2014
##  4 Ecology Letters         16.7 2014-08-28 2/1/2016     2014
##  5 Ecology Letters         16.7 2014-08-28 2/1/2016     2014
##  6 Ecology Letters         16.7 2014-07-28 2/2/2016     2014
##  7 Ecology Letters         16.7 2014-08-06 2/2/2016     2014
##  8 Ecology Letters         16.7 2014-06-17 2/2/2016     2014
##  9 Ecology Letters         16.7 2014-06-12 2/2/2016     2014
## 10 Ecology Letters         16.7 2014-07-17 2/2/2016     2014
## # ... with 1,589 more rows
```
]
---

# Join tables together

---
class: center

# Join two tables
.left[
Joining tables are the correspondents of merge function in base R.
There is a great tutorial to all sort of joining in tidyverse made available by [Garrick Aden-Buie](https://www.garrickadenbuie.com/project/tidyexplain/).
The joining of tables can be categorized into several types. However, we will only study the following:
- Inner join;

- Left join;

- Right join;

- Semi join;

- Union join;

- Anti join.
]

---
class: center

# Inner join

![inner join](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/inner-join.gif)

---
class: center

# Left join

![left-join](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/left-join.gif)

---
class: center

# Right join

![right-join](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/right-join.gif)
---

# Semi join

![semi-join](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/semi-join.gif)

---
class: center

# Union join

![union-join](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/union.gif)

---
class: center

# Anti join
![anti-join](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/anti-join.gif)
---
class: center, middle, inverse
# Character manipulation
---
class: center

# Select rows of papers with > 3 authors
.left[

```r
citations %>%
*#str_detect() detect characters in a given column
    filter(str_detect(Authors,'et al')) 
```

```
## # A tibble: 1,280 x 12
##    journal  impactfactor pubyear Volume Issue Authors  colldate pubdate nbtweets
##    <fct>           <dbl>   <dbl>  <dbl> <chr> <chr>    <chr>    <chr>      <dbl>
##  1 Ecology~         16.7    2014     17 12    Morin e~ 2/1/2016 9/16/2~       18
##  2 Ecology~         16.7    2014     17 12    Jucker ~ 2/1/2016 10/13/~       15
##  3 Ecology~         16.7    2014     17 12    Calcagn~ 2/1/2016 10/21/~        5
##  4 Ecology~         16.7    2014     17 11    Segre e~ 2/1/2016 8/28/2~        9
##  5 Ecology~         16.7    2014     17 11    Kaufman~ 2/1/2016 8/28/2~        3
##  6 Ecology~         16.7    2014     17 10    Nasto e~ 2/2/2016 7/28/2~       27
##  7 Ecology~         16.7    2014     17 10    Tschirr~ 2/2/2016 8/6/20~        6
##  8 Ecology~         16.7    2014     17 9     Barnech~ 2/2/2016 6/17/2~       19
##  9 Ecology~         16.7    2014     17 9     Pinto-S~ 2/2/2016 6/12/2~       26
## 10 Ecology~         16.7    2014     17 9     Clough ~ 2/2/2016 7/17/2~       44
## # ... with 1,270 more rows, and 3 more variables: Number of users <dbl>,
## #   Twitter reach <dbl>, woscitations <dbl>
```
]
---
class: center
# Select columns with rows of papers with > 3 authors
.left[

```r
citations %>%
  filter(str_detect(Authors,'et al')) %>%
  select(Authors)
```

```
## # A tibble: 1,280 x 1
##    Authors            
##    <chr>              
##  1 Morin et al        
##  2 Jucker et al       
##  3 Calcagno et al     
##  4 Segre et al        
##  5 Kaufman et al      
##  6 Nasto et al        
##  7 Tschirren et al    
##  8 Barnechi et al     
##  9 Pinto-Sanchez et al
## 10 Clough et al       
## # ... with 1,270 more rows
```
]
---
class: center
# Select columns with rows of papers with < 3 authors
.left[

```r
citations %>%
  filter(!str_detect(Authors,'et al')) %>% ##！ for saying "not".
  select(Authors)
```

```
## # A tibble: 319 x 1
##    Authors           
##    <chr>             
##  1 Neutle and Thorne 
##  2 Kellner and Asner 
##  3 Griffin and Willi 
##  4 Gremer and Venable
##  5 Cavieres          
##  6 Haegman and Loreau
##  7 Kearney           
##  8 Locey and White   
##  9 Quintero and Weins
## 10 Lesser and Jackson
## # ... with 309 more rows
```
]
---
class: center
# Select authors of columns with rows of papers with < 3 authors
.left[

```r
citations %>%
  filter(!str_detect(Authors,'et al')) %>% ##！ for saying "not".
pull(Authors) %>%
  head(10)
```

```
##  [1] "Neutle and Thorne"  "Kellner and Asner"  "Griffin and Willi" 
##  [4] "Gremer and Venable" "Cavieres"           "Haegman and Loreau"
##  [7] "Kearney"            "Locey and White"    "Quintero and Weins"
## [10] "Lesser and Jackson"
```
]

---
class: center

# Rows of papers with less than 3 authors in journal with IF < 5
.left[

```r
citations %>%
  filter(!str_detect(Authors,'et al'), impactfactor < 5)
```

```
## # A tibble: 77 x 12
##    journal  impactfactor pubyear Volume Issue  Authors colldate pubdate nbtweets
##    <fct>           <dbl>   <dbl>  <dbl> <chr>  <chr>   <chr>    <chr>      <dbl>
##  1 Molecul~         4.9     2014     14 6      Gautier 2/27/20~ 5/14/2~        2
##  2 Molecul~         4.9     2014     14 5      Gambel~ 2/27/20~ 3/7/20~        7
##  3 Molecul~         4.9     2014     14 4      Kekkon~ 2/27/20~ 3/10/2~        4
##  4 Molecul~         4.9     2014     14 3      Bhatta~ 2/27/20~ 12/8/2~        0
##  5 Molecul~         4.9     2014     14 1      Christ~ 2/28/20~ 10/25/~        0
##  6 Molecul~         4.9     2013     13 4      Villar~ 2/28/20~ 5/2/20~        0
##  7 Molecul~         4.9     2013     13 4      Wang    2/28/20~ 4/25/2~        0
##  8 Molecul~         4.9     2012     12 1      Joly    2/28/20~ 9/7/20~        3
##  9 Animal ~         3.21    2014     17 6      Plavsic 2/9/2016 4/17/2~        9
## 10 Animal ~         3.21    2014     17 Suppl~ Knox a~ 2/11/20~ 11/13/~        1
## # ... with 67 more rows, and 3 more variables: Number of users <dbl>,
## #   Twitter reach <dbl>, woscitations <dbl>
```
]
---
class: center

# Convert words to lowercase
.left[

```r
citations %>%
* mutate(authors_lowercase = str_to_lower(Authors)) %>%
  select(authors_lowercase)
```

```
## # A tibble: 1,599 x 1
##    authors_lowercase  
##    <chr>              
##  1 morin et al        
##  2 jucker et al       
##  3 calcagno et al     
##  4 segre et al        
##  5 kaufman et al      
##  6 nasto et al        
##  7 tschirren et al    
##  8 barnechi et al     
##  9 pinto-sanchez et al
## 10 clough et al       
## # ... with 1,589 more rows
```
]
---
class: center

# Remove all spaces in variable names
.left[

```r
citations%>%
  mutate(journal = str_remove_all(journal," ")) %>%
  select(journal) %>%
  unique() %>%
  head(5)
```

```
## # A tibble: 5 x 1
##   journal                     
##   <chr>                       
## 1 EcologyLetters              
## 2 GlobalChangeBiology         
## 3 GlobalEcologyandBiogeography
## 4 MolecularEcologyResources   
## 5 DiversityandDistributions
```
]
---
class: center, middle, inverse

# Basic exploratory data analysis
---
class: center

# Count ()

This helps to count the number of occurrences.
.left[

```r
citations %>% 
  count(journal, sort = TRUE) ## Embedded sorting within count()
```

```
## # A tibble: 20 x 2
##    journal                              n
##    <fct>                            <int>
##  1 New Phytologist                    144
##  2 Ecology                            108
##  3 Evolution                          108
##  4 Global Change Biology              108
##  5 Global Ecology and Biogeography    108
##  6 Journal of Biogeography            108
##  7 Ecology Letters                    106
##  8 Diversity and Distributions        105
##  9 Animal Conservation                102
## 10 Methods in Ecology and Evolution    90
## 11 Evolutionary Applications           74
## 12 Functional Ecology                  54
## 13 Journal of Animal Ecology           54
## 14 Journal of Applied Ecology          54
## 15 Limnology and Oceanography          54
## 16 Molecular Ecology Resources         54
## 17 Conservation Letters                53
## 18 Ecological Applications             48
## 19 Fish and Fisheries                  36
## 20 Mammal Review                       31
```
]
---
class: center
# Count() for multiple variables
.left[

```r
citations %>%
  count(journal, pubyear)
```

```
## # A tibble: 59 x 3
##    journal                     pubyear     n
##    <fct>                         <dbl> <int>
##  1 Animal Conservation            2012    18
##  2 Animal Conservation            2013    18
##  3 Animal Conservation            2014    66
##  4 Conservation Letters           2012    17
##  5 Conservation Letters           2013    18
##  6 Conservation Letters           2014    18
##  7 Diversity and Distributions    2012    36
##  8 Diversity and Distributions    2013    33
##  9 Diversity and Distributions    2014    36
## 10 Ecological Applications        2012    24
## # ... with 49 more rows
```
]
---
class: center
# Count sum of tweets per journal
.left[

```r
citations %>%
  count(journal, wt = nbtweets, sort = TRUE)
```

```
## # A tibble: 20 x 2
##    journal                              n
##    <fct>                            <dbl>
##  1 Ecology Letters                   1538
##  2 Animal Conservation               1268
##  3 Journal of Applied Ecology        1012
##  4 Methods in Ecology and Evolution   699
##  5 Global Change Biology              613
##  6 Conservation Letters               542
##  7 New Phytologist                    509
##  8 Global Ecology and Biogeography    379
##  9 Ecology                            335
## 10 Evolution                          335
## 11 Journal of Animal Ecology          323
## 12 Fish and Fisheries                 261
## 13 Evolutionary Applications          238
## 14 Journal of Biogeography            209
## 15 Diversity and Distributions        200
## 16 Mammal Review                      166
## 17 Functional Ecology                 155
## 18 Molecular Ecology Resources        139
## 19 Ecological Applications            125
## 20 Limnology and Oceanography           0
```
]
---
class: center

# Group variables to compute stats [summarise()]
.left[

```r
citations %>%
  group_by(journal) %>%
  summarise(avg_tweets = mean(nbtweets))
```

```
## # A tibble: 20 x 2
##    journal                          avg_tweets
##    <fct>                                 <dbl>
##  1 Animal Conservation                   12.4 
##  2 Conservation Letters                  10.2 
##  3 Diversity and Distributions            1.90
##  4 Ecological Applications                2.60
##  5 Ecology                                3.10
##  6 Ecology Letters                       14.5 
##  7 Evolution                              3.10
##  8 Evolutionary Applications              3.22
##  9 Fish and Fisheries                     7.25
## 10 Functional Ecology                     2.87
## 11 Global Change Biology                  5.68
## 12 Global Ecology and Biogeography        3.51
## 13 Journal of Animal Ecology              5.98
## 14 Journal of Applied Ecology            18.7 
## 15 Journal of Biogeography                1.94
## 16 Limnology and Oceanography             0   
## 17 Mammal Review                          5.35
## 18 Methods in Ecology and Evolution       7.77
## 19 Molecular Ecology Resources            2.57
## 20 New Phytologist                        3.53
```
]
---
class: center
# Order stuff [arrange()]
.left[

```r
citations %>%
  group_by(journal) %>%
  summarise(avg_tweets = mean(nbtweets)) %>%
* # decreasing order but (without desc for increasing)
  arrange(desc(avg_tweets))-> arrangedat 
head(arrangedat, 10)
```

```
## # A tibble: 10 x 2
##    journal                          avg_tweets
##    <fct>                                 <dbl>
##  1 Journal of Applied Ecology            18.7 
##  2 Ecology Letters                       14.5 
##  3 Animal Conservation                   12.4 
##  4 Conservation Letters                  10.2 
##  5 Methods in Ecology and Evolution       7.77
##  6 Fish and Fisheries                     7.25
##  7 Journal of Animal Ecology              5.98
##  8 Global Change Biology                  5.68
##  9 Mammal Review                          5.35
## 10 New Phytologist                        3.53
```
]

---
class: center
# Work on several columns [dplyr:::across()]

![Across graphics](img/Across graphics.png)
---
class: center

# Compute mean across multiple variables
.left[

```r
citations %>%
  group_by(journal) %>%
* summarize(across(where(is.numeric), mean))
```

```
## # A tibble: 20 x 8
##    journal impactfactor pubyear Volume nbtweets `Number of user~ `Twitter reach`
##    <fct>          <dbl>   <dbl>  <dbl>    <dbl>            <dbl>           <dbl>
##  1 Animal~         3.21   2013.  16.5     12.4              9.71          28345.
##  2 Conser~         6.4    2013.   6.02    10.2              8.85          23234.
##  3 Divers~         5.4    2013   19        1.90             1.77           2350.
##  4 Ecolog~         5.06   2013   23        2.60             2.5            5727.
##  5 Ecology         6.16   2013   94        3.10             2.87           6176.
##  6 Ecolog~        16.7    2013.  16.0     14.5             14.0           44748.
##  7 Evolut~         5.25   2013   67        3.10             2.93           7762.
##  8 Evolut~         4.6    2013.   6.05     3.22             3.07          13185.
##  9 Fish a~         8.1    2013   14        7.25             6.19          12097.
## 10 Functi~         5.28   2013   27        2.87             2.74           3809.
## 11 Global~         8.7    2013   19        5.68             4.94           9652.
## 12 Global~         7.18   2013   22        3.51             3.15           8995.
## 13 Journa~         5.32   2013.  81.9      5.98             5.59          36112.
## 14 Journa~         5.93   2013   50       18.7             15.8           43839.
## 15 Journa~         4.59   2013   40        1.94             1.86          11632.
## 16 Limnol~         4.4    2013   58        0                0                 0 
## 17 Mammal~         4.3    2013.  43.0      5.35             4.39          10485.
## 18 Method~         7.39   2013.   4.2      7.77             7.14          15867.
## 19 Molecu~         4.9    2013   13        2.57             2.24          19844.
## 20 New Ph~         7.8    2013  198.       3.53             3.15           6340.
## # ... with 1 more variable: woscitations <dbl>
```
]

---
class: center

# Tidying tibbles [wide(), long()]
![tidyr wide and long](https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/tidyr-spread-gather.gif)
---
class: center

# Data manipulation with tidyverse: in depth study
.left[
Learn the tidyverse: books, workshops and online courses
Selection of books:
- [R for Data Science](https://www.tidyverse.org/learn/) and [Advanced R](http://adv-r.had.co.nz/);

- [Tidy Tuesdays videos](https://www.youtube.com/user/safe4democracy/videos) by D. Robinson;

- Material of the [stat545](https://stat545.com/) course on Data wrangling, exploration, and analysis with R at the University of British Columbia;

- List of best R packages (with their description) on data import, [wrangling and visualization](https://www.computerworld.com/article/2921176/great-r-packages-for-data-import-wrangling-visualization.html).
]
---
class: center, middle

# Thank you for listening!
Any questions now or email me at [**dossa@xtbg.org.cn**](http://people.ucas.edu.cn/~Dossa?language=en)

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).

The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr/), and [R Markdown](https://rmarkdown.rstudio.com).