Data Visualization
Borrowing from the #TidyTuesday initiative, we’re analyzing data from Pet Cats UK!
In this edition of the Programming Cafe, we decided to try a coding challenge focused on data visualization that was accessible to all levels of experience.
Slides
Exercises
Once you have gotten hold of the data, you can attempt to reproduce some of the example visualizations (beginner, intermediate, advanced) we’ve gathered below. You can use whatever programming language and packages/libraries you like and we encourage you to work in pairs/groups - so you can learn together!
Resources
Inspiration
R Packages
- ggplot2
- leaflet
- move
- moveVis
Python Libraries
- matplotlib
- geopandas
- movingpandas
- (pystac + rioxarray) (to read and visualize sentinel satellite data)
Pet Cats UK
source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-01-31/readme.md
The data comes from the Movebank for Animal Tracking Data via Data is Plural.
Between 2013 and 2017, Roland Kays et al. convinced hundreds of volunteers in the U.S., U.K., Australia, and New Zealand to strap GPS sensors on their pet cats. The aforelinked datasets include each cat’s characteristics (such as age, sex, neuter status, hunting habits) and time-stamped GPS pings.
When using this dataset, please cite the original article.
Kays R, Dunn RR, Parsons AW, Mcdonald B, Perkins T, Powers S, Shell L, McDonald JL, Cole H, Kikillus H, Woods L, Tindle H, Roetman P (2020) The small home ranges and large local ecological impacts of pet cats. Animal Conservation. doi:10.1111/acv.12563
Additionally, please cite the Movebank data package:
McDonald JL, Cole H (2020) Data from: The small home ranges and large local ecological impacts of pet cats [United Kingdom]. Movebank Data Repository. doi:10.5441/001/1.pf315732
Additional datasets for the US, Australia, and New Zealand are also available for download, but they are not included.
Get the data here
R
library(readr)
<- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv')
cats_uk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk_reference.csv') cats_uk_reference
Python
import pandas as pd
= pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv')
cats_uk = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk_reference.csv') cats_uk_reference
Data Dictionary
Full dictionaries are available on Movebank
cats_uk.csv
variable | class | description |
---|---|---|
tag_id | character | A unique identifier for the tag, provided by the data owner. If the data owner does not provide a tag ID, an internal Movebank tag identifier may sometimes be shown. |
event_id | double | An identifier for the set of values associated with each event, i.e. sensor measurement. A unique event ID is assigned to every time-location or other time-measurement record in Movebank. If multiple measurements are included within a single row of a data file, they will share an event ID. If users import the same sensor measurement to Movebank multiple times, a separate event ID will be assigned to each. |
visible | logical | Determines whether an event is visible on the Movebank map. Values are calculated automatically, with TRUE indicating the event has not been flagged as an outlier by algorithm_marked_outlier , import_marked_outlier or manually_marked_outlier , or that the user has overridden the results of these outlier attributes using manually_marked_valid = TRUE. Allowed values are TRUE or FALSE. |
timestamp | double | The date and time corresponding to a sensor measurement or an estimate derived from sensor measurements. |
location_long | double | The geographic longitude of the location as estimated by the sensor. Positive values are east of the Greenwich Meridian, negative values are west of it. |
location_lat | double | The geographic longitude of the location as estimated by the sensor. Positive values are east of the Greenwich Meridian, negative values are west of it. |
ground_speed | double | The estimated ground speed provided by the sensor or calculated between consecutive locations. Units are reportedly m/s, which indicates that there is likely a problem with this data (either the units were reported erroneously or their is an issue with the sensor data). |
height_above_ellipsoid | double | The estimated height above the ellipsoid, typically estimated by the tag. Units: meters |
algorithm_marked_outlier | logical | Identifies events marked as outliers using a user-selected filter algorithm in Movebank. Outliers have the value TRUE. |
manually_marked_outlier | logical | Identifies events flagged manually as outliers, typically using the Event Editor in Movebank, and may also include outliers identified using other methods. Outliers have the value TRUE. |
study_name | character | The name of the study in Movebank. |
cats_uk_reference.csv
variable | class | description |
---|---|---|
tag_id | character | A unique identifier for the tag, provided by the data owner. If the data owner does not provide a tag ID, an internal Movebank tag identifier may sometimes be shown. |
animal_id | character | An individual identifier for the animal, provided by the data owner. If the data owner does not provide an Animal ID, an internal Movebank animal identifier is sometimes shown. |
animal_taxon | character | The scientific name of the species on which the tag was deployed, as defined by the Integrated Taxonomic Information System (ITIS, www.itis.gov). If the species name can not be provided, this should be the lowest level taxonomic rank that can be determined and that is used in the ITIS taxonomy. |
deploy_on_date | double | The timestamp when the tag deployment started. |
deploy_off_date | double | The timestamp when the tag deployment ended. |
hunt | logical | Whether the animal was allowed to hunt. |
prey_p_month | double | Approximate number of prey caught by the animal per month. |
animal_reproductive_condition | character | The reproductive condition of the animal at the beginning of the deployment. |
animal_sex | character | The sex of the animal, as “m” or “f”. |
hrs_indoors | double | The average number of hours which the animal was indoors per day. |
n_cats | double | The number of cats in the house. |
food_dry | logical | Whether the cat was fed dry food. |
food_wet | logical | Whether the cat was fed wet food. |
food_other | logical | Whether the cat was fed other food. |
study_site | character | A location such as the deployment site or colony, or a location-related group such as the herd or pack name. |
age_years | double | The age of the animal at the beginning of the deployment, in years. “0” indicates that the animal was < 1 year old. |
Example Data Visualizations
Beginner
Beginners can try:
Reading in the data
Plotting the number of cats per household
- Plotting the number of prey caught per year
- Plotting the number of hours spent indoors
- Plotting the gender of the cats in the dataset
- Plotting the age of the cats
The above plots are referenced from: https://github.com/MattHondrakis/TidyTuesday/blob/main/2023/01-31-23/UK-Cats.md. The author has code snippets in R showing how they obtained the plots.
You can get tips on building bar plots in both R & Python on the following page: https://www.data-to-viz.com/graph/histogram.html
Intermediate
Intermediate users have two options:
Creating a treemap
of the data.
The above plot is referenced from: https://github.com/poncest/tidytuesday/tree/main/2023/Week_05. The author has a script in R showing how they obtained the plot.
You can get tips on building treemaps in both R & Python on the following page: https://www.data-to-viz.com/graph/treemap.html
Here is a human-readable breakdown of the steps, provided by ChatGPT:
- Load Packages & Setup:
- Load necessary software packages using the pacman library.
- Set up the figure size and resolution for plotting.
- Read in the Data:
- Load data for the TidyTuesday challenge for 2023, week 05.
- Clean and organize the data.
- Examine the Data:
- Use glimpse to see the structure of the cats_uk and cats_uk_reference datasets.
- List unique values in the “tag_id” column of both datasets.
- Combine the data using a left join operation and save memory.
- Tidy the Data:
- Remove outliers from the data.
- Convert timestamp data to date format.
- Group and summarize the data based on specific columns.
- Clean and prepare the data for visualization.
- Create Visualization:
- Set up plot aesthetics including colors, titles, and captions.
- Define fonts for text elements.
- Build the main plot using the ggplot2 package with a treemap representation.
- Customize plot labels, colors, and themes to make it visually appealing.
Map
Try to create a map (e.g. leaflet (R) or geopandas (python)) and plot a gps track on top of it
Advanced
Advanced users can try plotting the data onto a map and adding an animation element.
The above plot is referenced from: https://github.com/fblpalmeira/cats_uk/blob/main/README.md. The author has a script in R showing how they obtained the plot.
You can get tips on building maps in both R & Python on the following page: https://www.data-to-viz.com/graph/map.html
Here is a human-readable breakdown of the steps, provided by ChatGPT:
Read in the Data: Begin by reading two CSV files from the internet. You can do this using software like R. The files you need are named cats_uk.csv and cats_uk_reference.csv.
View the Data: After reading the data, use a tool or function (like the View function in R) to visually inspect the contents of both data files. This step helps you understand what’s in the data.
Load Libraries: Make sure you have certain software libraries installed and loaded. These libraries include dplyr, move, moveVis, maptools, magrittr, and ggplot2. They provide tools for data manipulation and visualization.
Combine and Filter Data: Use the dplyr library to combine the data from the two CSV files based on a common column called “tag_id.” After combining, filter the data to select only specific cat names like “Bear,” “Coco,” etc.
Convert Timestamps: Convert the timestamps in the filtered data from a text format to a date-time format. This makes it easier to work with time-based data.
Create a Move Object: Using the filtered and timestamp-converted data, create a “move object” that represents the movement of cats. This object specifies the columns for longitude, latitude, time, and track ID.
Align Data: Adjust the timestamps in the move object to align them at a specific resolution, such as daily intervals. This step aggregates the data for each day.
Generate Map Frames: Create map frames to visualize cat movements. You can customize these frames by adding things like labels, a north arrow, and a scale bar.
Add Text: Include text on the map frames, such as a Twitter handle, at a specific location.
Preview a Frame: Optionally, you can preview one of the map frames to see how it looks.
Generate an Animation: Finally, use a function or tool (like animate_frames) to turn the map frames into an animated GIF. This GIF will show the movements of the selected cats over time.
References
Will be updated to be more neat, the authors of the data and code and resources retain all their copyright - they are awesome!