Title | Sports Analytics in Practice with R |
---|---|
Author | Ted Kwartler |
Genre | Medicine |
Series | |
Publisher | Медицина |
Year of publication | 0 |
ISBN | 9781119598091 |
`x` occurs within the curly brackets. In this case, a simple operation `x + 3` overwrites the internal value of `x` and the new value is returned. The function will be an object in the environment and can accept any numeric or integer value. Here, the function is created and then applied to a value of 2. The output is itself assigned to an object, `exampleThree`.
plus3 <- function(x){
  x <- x + 3
  return(x)
}
exampleThree <- plus3(2)
exampleThree
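To see that the reassignment of `x` happens only inside the function, a minimal sketch follows. The names `startValue` and `shifted` are illustrative and do not come from the original script.

```r
# Same function as in the text, written across several lines
plus3 <- function(x){
  x <- x + 3   # changes only the function's local copy of x
  return(x)
}

startValue <- 2                # hypothetical name, for illustration only
shifted <- plus3(startValue)   # the function returns the new value

startValue                     # the caller's object is left untouched
shifted
plus3(c(1, 2.5, 10))           # arithmetic is vectorized, so a vector also works
```

Because the function works on its own copy, the only way to keep the result is to assign the returned value to an object, as the text does with `exampleThree`.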
Of course, functions can be more complex. As an example, the following function is made more dynamic by adding a new parameter called `value`. Now both `x` and `value` are required for the function to operate. The `x` value is divided by the `value` input parameter that is passed into the function. Additionally, before the result is returned from the function, the `round` function is applied, further adjusting the result of the preceding division. In the end, for example, the custom function `divideVal` will accept the number 5, divide it by 2, and then round the result so that it returns the value 2.
divideVal <- function(x, value){
  x <- x / value
  x <- round(x)
  return(x)
}
exampleValue <- divideVal(5,2)
exampleValue
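A result of 2 rather than 3 may be surprising. R's `round` follows the IEC 60559 convention of rounding a trailing 5 to the even digit, so 2.5 becomes 2 while 3.5 becomes 4. The sketch below checks this and adds an optional `digits` argument. The function name `divideValDigits` and its default are assumptions made for illustration, not part of the original code.

```r
divideVal <- function(x, value){
  x <- x / value
  x <- round(x)
  return(x)
}

divideVal(5, 2)   # 2.5 is rounded to the even digit
divideVal(7, 2)   # 3.5 is rounded to the even digit

# Hypothetical variant: let the caller choose the number of decimal places
divideValDigits <- function(x, value, digits = 0){
  round(x / value, digits = digits)
}

divideValDigits(10, 4, digits = 1)
divideValDigits(1, 3, digits = 2)
```

Passing `digits` through to `round` keeps the default behavior identical to `divideVal` while allowing finer control when whole numbers are too coarse.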
Applying R Basics to Real Data
Let’s reward your laborious work through foundational R coding with something of actual interest using sports data. Like many scripts in this book, let’s begin by loading packages. For each of these, you need to first run `install.packages` and, assuming that executes without error, the following library calls will specialize R for the task at hand. In this example script using real sports data, our only goal is to obtain the data, manipulate it, and finally plot it.
To begin, call `library(RCurl)`, which is a general network interface client. Functions within this library allow R to make a network connection to download the data. One could have data locally in a file, connect to an API or database, or even web scrape the data. However, in the upcoming code, the data are downloaded directly from an online repository. Next, `library(ggplot2)` loads the grammar of graphics namespace with excellent visualization capabilities. The `library(ggthemes)` call loads a convenience library accompanying `ggplot2` for quick, predefined aesthetics. Lastly, the `library(tidyr)` functions are used for tidying data, which is a style of data organization that is efficient if not intuitive. Here, the basic raw data will be rearranged before plotting.
library(RCurl)
library(ggplot2)
library(ggthemes)
library(tidyr)
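Since each `library` call fails if the package has not been installed, it can help to check first. The following is a minimal sketch, assuming the four packages named in the text. The object names `pkgs`, `isInstalled`, and `missingPkgs` are illustrative.

```r
# Packages named in the text
pkgs <- c("RCurl", "ggplot2", "ggthemes", "tidyr")

# requireNamespace() returns TRUE when a package is installed,
# without attaching it to the search path
isInstalled <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missingPkgs <- pkgs[!isInstalled]

if (length(missingPkgs) > 0) {
  message("Not yet installed: ", paste(missingPkgs, collapse = ", "))
  # install.packages(missingPkgs)   # uncomment to install from CRAN
} else {
  message("All packages are installed.")
}
```

The installation line is left commented out so that the check itself never changes the system; running `install.packages` remains an explicit, one-time step.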
Next, before establishing a connection between R and the data repository, a character object called `c1Data` is created. The character string is the web URL of the raw comma-separated values (CSV) file. If you open this web address in a typical browser, you will see the raw text-based statistics for the Dallas NBA team in the 2019–2020 regular season. However, the following code does not open a browser and instead downloads this simple file before loading it as an R object.
c1Data <- 'https://raw.githubusercontent.com/kwartler/Practical_Sports_Analytics/main/C1_Data/2019-2020%20Dallas%20Player%20Stats.csv'
Now, to execute a network connection, employ the `getURL` function, which lies within the `RCurl` package. This function simply accepts the string URL address previously defined. Be sure to have the address exactly correct to avoid any errors.
nbaFile <- getURL(c1Data)
Finally, the base-R function `read.csv` is used with the downloaded data. The `read.csv` function is widely used because CSV files are ubiquitous. Further, the function can accept a local file path leading to a hard disk rather than the file downloaded here, but the path must be exactly correct. Spaces, capitalization, and misspellings will result in cryptic and frustrating file-not-found errors. Assuming the web address was correct and the `getURL` function executed without error, the result of this code is a new object called `nbaData`. It is automatically read in as a `data.frame` object.
nbaData <- read.csv(text = nbaFile)
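The `text` argument is what lets `read.csv` parse a character string rather than a file on disk, which is the form in which `getURL` returns the download. A minimal offline sketch follows, using a made-up CSV string. Apart from `TEAM`, the column names and all values are placeholders, not the real Dallas data.

```r
# Hypothetical stand-in for the downloaded text; "\n" separates the rows
toyCsv <- "PLAYER,TEAM,PTS\nPlayer A,Dal,10.5\nPlayer B,Dal,7.2"

# Parse the string exactly as the text does with nbaFile
toyData <- read.csv(text = toyCsv)

class(toyData)   # the result is a data frame
str(toyData)     # column types are inferred from the contents
```

The first argument of `read.csv` can also be a file path or a complete URL, so the string-based route shown here is one of several ways to arrive at the same data frame.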
Unlike a spreadsheet program where you can scroll to any area of the sheet to look at the contents, R holds the data frame as an object which is an abstraction. As a result, it can be difficult to comprehend the loaded data. Thus, it is a best practice to explore the data to learn about its characteristics. In fact, exploratory data analysis, EDA, in itself is a robust field within analytics. The code below only scratches the surface of what is possible.
To begin this basic EDA, determine the dimensions of the data using the `dim` function applied to the `nbaData` data frame. This will print the total rows and columns for the data frame. Similar to the indexing code, the first number represents the rows and the second the columns.
dim(nbaData)
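The row-then-column order can be checked without a download. This sketch assumes R's built-in `mtcars` data frame as a stand-in for `nbaData`; the object name `dims` is illustrative.

```r
# mtcars ships with base R, so no network connection is needed
dims <- dim(mtcars)
dims            # rows first, then columns

dims[1]         # the row count, also available as nrow(mtcars)
dims[2]         # the column count, also available as ncol(mtcars)

nrow(mtcars)
ncol(mtcars)
```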
Since data frames have named columns, you may want to know what the column headers are. The base-R function `names` accepts a few types of objects and in this case will print the column names of the basketball data.
names(nbaData)
At this point you know the column names and the size of the data loaded in the environment. Another popular way to get familiar with the data is to glimpse a portion of it. This is preferred to calling the entire object in your console. Data frames can often be tens of thousands of rows or more, plus hundreds of columns. If you call a large object directly in the console, your system may lag trying to print that much data as output. Thus, the popular `head` function accepts a data object along with an integer parameter representing the number of records to print. Since this function call is not being assigned to an object, the result is printed to the console for review. The default behavior selects six rows, though this can be adjusted for more or fewer observations. When called, the `head` function will print the first `n` rows of the data frame. This is in contrast to the `tail` function, which will print the last `n` rows.
head(nbaData, n = 6)
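A short sketch of how `n` changes what `head` and `tail` return, again assuming the built-in `mtcars` data frame as a stand-in. The object names are illustrative.

```r
firstRows <- head(mtcars, n = 3)   # first three rows
lastRows  <- tail(mtcars, n = 3)   # last three rows
defaultRows <- head(mtcars)        # n defaults to six

nrow(firstRows)
nrow(lastRows)
nrow(defaultRows)

# Assigning the result, as above, suppresses printing;
# calling head() on its own prints to the console
head(mtcars, n = 2)
```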
You should notice that the column `TEAM` shows “Dal” for all rows in the `head` output. To ensure this data set only contains players from the Dallas team, you can employ the `table` function, specifying the `TEAM` column either by name or by index position. The `table` function merely tallies the levels or values of a column. After running the next code chunk, you see that “Dal” appears 19 times in this data set. Had there been another value in this column, additional tallied information would be presented.
table(nbaData$TEAM)
table(nbaData[,2])
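To illustrate what the tally looks like when a second value is present, here is a minimal sketch with a made-up data frame. The players, the extra team code "Bos", and the object names are all hypothetical.

```r
# Hypothetical data: four rows for one team and one stray row for another
toy <- data.frame(
  PLAYER = paste("Player", 1:5),
  TEAM   = c("Dal", "Dal", "Dal", "Bos", "Dal")
)

byName  <- table(toy$TEAM)    # column selected by name
byIndex <- table(toy[, 2])    # same column selected by position

byName
byIndex

# A single distinct value would confirm a one-team data set
length(unique(toy$TEAM)) == 1
```

Selecting by name is usually the safer choice, since the tally then stays correct even if the column order of the file changes.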
Lastly, another basic EDA function is `summary`. The `summary` function can be applied to any object and will return some information determined by the type of object it receives. In the case of a data frame, the `summary` function will examine each column individually. It will denote character columns and, when a column is declared as a factor, will tally the different factor levels. Perhaps most important is how `summary` treats numeric columns. For each numeric column, the minimum, first quartile, median, mean, third quartile, and maximum are returned. If missing values are stored as “NA” in a particular column, the function will also tally those. This allows the practitioner to understand each column’s range, distribution, and average, and how much of the column contains NA values.
summary(nbaData)
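The different treatment of character, factor, and numeric columns can be seen on a small example. The data frame below is made up for illustration; its column names and values are placeholders rather than the real player statistics.

```r
# Hypothetical data with one column of each kind and one missing value
toy <- data.frame(
  NAME = c("a", "b", "c", "d"),                # character column
  POS  = factor(c("G", "F", "G", "C")),        # factor column
  MIN  = c(30.1, 12.4, NA, 25.0)               # numeric column with an NA
)

summary(toy)               # one block of output per column

minSummary <- summary(toy$MIN)
names(minSummary)          # includes an "NA's" entry because a value is missing

summary(toy$POS)           # factor levels are tallied
```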
Now that you have a rudimentary understanding of the player-level Dallas basketball data set, you can visualize aspects of it. For example, one would expect that the more minutes a player averages per game, the more points the player averages per game. To confirm this assumption, a simple scatter plot may help identify the correlation. Of course, you can calculate correlation with the `cor` function, but often visualizing data can be a powerful tool in a sports analyst’s toolkit. The `ggplot2` library contains a convenience function called `qplot` for quick plotting. This function accepts the name of a column for the x-axis, followed by another column name to plot on the y-axis. The last parameter is the data itself. The `data` parameter requires a data frame so that the specific