Title: | The durhamSLR package |
---|---|
Description: | Data for Statistical Learning modules at Durham University. |
Authors: | Sarah Heaps |
Maintainer: | Sarah.Heaps <[email protected]> |
License: | GPL-3 | GPL-2 |
Version: | 0.2.0 |
Built: | 2025-03-06 03:32:16 UTC |
Source: | https://github.com/nseg4/durhamSLR |
Data on applicants to the Masters of Business Administration (MBA) programme of a US business graduate school.
data(admission)
A data frame with 85 rows and 3 variables. The data frame contains the following columns:
The applicant's grade point average (GPA) on a 0.0 - 4.0 scale.
The applicant's graduate management admission test (GMAT) score on a 200 - 800 scale.
A factor with three levels, admit, border and notadmit, which refer to the category to which the student was assigned by admissions tutors (admit, borderline or do not admit).
The data were taken from Johnson and Wichern (2008).
Johnson, R.A. and Wichern, D.W. (2008) Applied Multivariate Statistical Analysis, Sixth Edition. Pearson.
data(admission)
head(admission)
A data set comprising observations from 80 US cities for the year 1960 on 11 variables. These include a number of measures of air pollution, specifically concentrations of sulphate and suspended particulate, as well as a number of demographic variables.
data(airpollution)
A data frame with 80 rows and 11 variables. The data frame contains the following columns:
Smallest biweekly sulphate reading in micrograms per cubic metre (x 10).
Arithmetic mean of biweekly sulphate reading in micrograms per cubic metre (x 10).
Largest biweekly sulphate reading in micrograms per cubic metre (x 10).
Smallest biweekly suspended particulate reading in micrograms per cubic metre (x 10).
Arithmetic mean of biweekly suspended particulate reading in micrograms per cubic metre (x 10).
Largest biweekly suspended particulate reading in micrograms per cubic metre (x 10).
Population density per square mile (x 0.1).
Percent of population who are white.
Percent of families with income above the poverty level.
Percent of population who are at least 65 (x 10).
Logarithm (base 10) of population (x 10).
The complete data set is described in Gibbons et al. (1987).
Gibbons, D.I., McDonald, G.C. and Gunst, R.F. (1987) The complementary use of regression diagnostics and robust estimators. Naval Research Logistics, 34, 109–131.
data(airpollution)
head(airpollution)
Data extracted from images taken from genuine and forged banknotes, digitized into 400 x 400 arrays of pixels, and then summarised into four continuously valued summary statistics. For each banknote, the data set records whether the banknote was genuine or forged, along with the four numerical summaries of the image.
data(banknotes)
A data frame with 1372 rows and 5 variables. The data frame contains the following columns:
Variance of Wavelet Transformed image.
Skewness of Wavelet Transformed image.
Kurtosis of Wavelet Transformed image.
Entropy of image.
A factor with two levels, 0 and 1, which refer to whether the banknote was a forgery or real.
The data were taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication.
data(banknotes)
head(banknotes)
The Boston data frame has 506 rows and 14 columns.
data(Boston, package="durhamSLR")
This data frame contains the following columns:
Natural logarithm of the per capita crime rate by town.
Proportion of residential land zoned for lots over 25,000 sq.ft.
Proportion of non-retail business acres per town.
Charles River dummy variable (=1 if tract bounds river; =0 otherwise).
Nitrogen oxides concentration (parts per 10 million).
Average number of rooms per dwelling.
Proportion of owner-occupied units built prior to 1940.
A numerical vector representing an ordered categorical variable with four levels depending on the weighted mean of the distances to five Boston employment centres (=1 if distance < 2.5, =2 if 2.5 <= distance < 5, =3 if 5 <= distance < 7.5, =4 if distance >= 7.5).
Index of accessibility to radial highways.
Full-value property-tax rate per $10,000.
Pupil-teacher ratio by town.
1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
Lower status of the population (percent).
Median value of owner-occupied homes in $1000s.
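The four-level distance category in the column list above can be illustrated with base R's `cut()`. This is a minimal sketch under stated assumptions, not the code used to build the package data: the `distance` values are made up for illustration, and the names `distance` and `dis_cat` are hypothetical.

```r
# Made-up weighted mean distances (illustration only, not package data).
distance <- c(1.2, 2.5, 4.9, 5.0, 7.4, 7.5, 12.1)

# right = FALSE gives left-closed intervals [a, b), matching the rule
# 1: distance < 2.5, 2: 2.5 <= distance < 5,
# 3: 5 <= distance < 7.5, 4: distance >= 7.5.
# labels = FALSE returns the integer codes 1 to 4 rather than a factor.
dis_cat <- cut(distance,
               breaks = c(-Inf, 2.5, 5, 7.5, Inf),
               right = FALSE,
               labels = FALSE)

print(dis_cat)
```

Values that fall exactly on a boundary (2.5, 5 or 7.5) go into the higher category, which is what the `<=` signs in the description require.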
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
data(Boston, package="durhamSLR")
head(Boston)
A data set containing the counts of Lithobius forficatus, more commonly known as the brown or stone centipede, at each of 30 sites in microhabitats of rotting wood. For each site, a number of soil and habitat variables are recorded in addition to their altitude and geographical coordinates.
data(centipedes)
A data frame with 30 rows and 10 variables. The data frame contains the following columns:
The abbreviated site name.
The number of centipedes found at the site.
The area sampled at the site in square metres.
A factor with two levels, Synanthropic and Deciduous, which refer to the habitat in which the site was located; either deciduous woods or “synanthropic” areas associated with human habitation, e.g. parks and gardens.
The natural logarithm of the percentage of organic matter in the soil.
The natural logarithm of the altitude of the site in metres.
The air temperature in degrees Celsius.
The soil temperature in degrees Celsius.
The Easting of the site in tenths of a kilometre.
The Northing of the site in tenths of a kilometre.
The complete data set, which involved more species of centipede and more microhabitats, is described in Blackburn et al. (2002).
Blackburn, J., Farrow, M. and Arthur, W. (2002) Factors influencing the distribution, abundance and diversity of geophilomorph and lithobiomorph centipedes. Journal of Zoology, 256, 221–232.
data(centipedes)
head(centipedes)
Data from a study on heart disease by Dr. John M. Chapman in the mid-twentieth century. The data were taken from the Los Angeles Heart Study and comprise measurements from 200 male patients.
data(chapman)
A data frame with 200 rows and 7 variables. The data frame contains the following columns:
Patient's age; a numeric vector.
Patient's systolic blood pressure; a numeric vector.
Patient's diastolic blood pressure; a numeric vector.
Patient's cholesterol; a numeric vector.
Patient's height; a numeric vector.
Patient's weight; a numeric vector.
A binary numeric vector which takes the value 1 if the patient experienced a coronary incident in the preceding 10 years and 0 otherwise.
The data were taken from the StatLib Datasets Archive at Carnegie Mellon University: https://lib.stat.cmu.edu/datasets/christensen-llm.
data(chapman)
head(chapman)
Data collected in a study concerning patients with diabetes. The response variable of interest was disease progression one year after taking baseline measurements on various clinical variables. For each of n=442 patients, the data comprise a quantitative measure of disease progression (dis) and measurements on p=10 baseline (explanatory) variables: age (age), sex (sex), body mass index (bmi), average blood pressure (map) and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). The explanatory variables have been transformed to have mean 0, with sum of squares equal to 1.
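As an illustration of the transformation described above (mean 0 and sum of squares 1 for each explanatory variable), the following sketch applies it to simulated values. It is not the code used to prepare the package data; the matrix `x` and the names `centred` and `scaled` are hypothetical.

```r
# Simulated stand-in for a matrix of explanatory variables
# (illustration only, not the diabetes data).
set.seed(1)
x <- matrix(rnorm(50 * 3, mean = 5, sd = 2), nrow = 50,
            dimnames = list(NULL, c("a", "b", "c")))

# Step 1: subtract each column's mean so every column has mean 0.
centred <- sweep(x, 2, colMeans(x))

# Step 2: divide each column by the square root of its sum of squares
# so every column has sum of squares equal to 1.
scaled <- sweep(centred, 2, sqrt(colSums(centred^2)), "/")

print(round(colMeans(scaled), 10))
print(colSums(scaled^2))
```

This differs from `scale()` with its defaults, which divides by the sample standard deviation and so gives each column a sum of squares of n - 1 rather than 1.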
data(diabetes)
A data frame with 442 rows and 11 variables. The data frame contains the following columns:
Age.
Gender.
Body mass index.
Average blood pressure.
Blood serum measurement 1.
Blood serum measurement 2.
Blood serum measurement 3.
Blood serum measurement 4.
Blood serum measurement 5.
Blood serum measurement 6.
Quantitative measure of disease progression.
http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.ps.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), Least Angle Regression (with discussion). Annals of Statistics, 32, 407–499.
data(diabetes)
head(diabetes)
This function generates graphical diagnostics for an array of MCMC output. For every parameter, a row of three plots is generated: a trace plot, an ACF plot and a kernel density plot. If there is output from more than one chain, the default behaviour is to overlay the plots for different chains in different colours.
diagnostics(mcmc, rows = 3, lag.max = 50, pool = FALSE, colours = NULL)
mcmc | A matrix with dimensions: iterations, parameters; or a three-dimensional array with dimensions: iterations, chains, parameters. The final (i.e. parameter) component of the |
rows | A number indicating the number of parameters to plot per page on the graphics device. |
lag.max | A number indicating the maximum lag for the ACF plots. |
pool | A logical. If |
colours | A vector indicating the colours to use to represent each chain. Colours can be specified using any of the three kinds of R colour specifications, i.e. a colour name (as listed by |
NULL
srs = array(rnorm(8000), c(1000, 2, 4)) # Example for illustration only!
dimnames(srs) = list(NULL, NULL, paste("theta[",1:4,"]",sep=""))
diagnostics(srs)
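When each chain's output is held in its own iterations-by-parameters matrix, the chains can be stacked into the iterations, chains, parameters layout that the mcmc argument describes. The sketch below is illustrative only: `chain1`, `chain2` and `draws` are hypothetical names, the draws are simulated noise rather than real MCMC output, and the call to `diagnostics()` is guarded so that it runs only if durhamSLR is installed.

```r
# Simulated stand-ins for two chains, each 1000 iterations by 4 parameters.
set.seed(1)
chain1 <- matrix(rnorm(1000 * 4), nrow = 1000)
chain2 <- matrix(rnorm(1000 * 4), nrow = 1000)

# Empty array in the layout: iterations, chains, parameters.
# The third set of dimnames carries the parameter names.
draws <- array(NA_real_, dim = c(1000, 2, 4),
               dimnames = list(NULL, NULL, paste0("theta[", 1:4, "]")))

# Slot each chain's matrix into its position along the second dimension.
draws[, 1, ] <- chain1
draws[, 2, ] <- chain2

# Plot only if the package is available.
if (requireNamespace("durhamSLR", quietly = TRUE)) {
  durhamSLR::diagnostics(draws)
}
```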
Results for the heptathlon at the 2012 Olympic Games in London for the 29 athletes who completed all events and were not disqualified.
data(heptathlon)
A data frame with 29 rows and 7 variables. The data frame contains the following columns:
100 metres hurdles (seconds).
High jump (metres).
Shot put (metres).
200 metres (seconds).
Long jump (metres).
Javelin throw (metres).
800 metres (seconds).
The data were taken from Wikipedia at https://en.wikipedia.org/wiki/Athletics_at_the_2012_Summer_Olympics_%E2%80%93_Women%27s_heptathlon.
data(heptathlon)
head(heptathlon)
Final scores for the heptathlon at the 2012 Olympic Games in London for the 29 athletes who completed all events and were not disqualified.
data(heptathlon_points)
A data frame with 29 rows and 1 variable in the column Points.
The data were taken from Wikipedia at https://en.wikipedia.org/wiki/Athletics_at_the_2012_Summer_Olympics_%E2%80%93_Women%27s_heptathlon.
data(heptathlon_points)
head(heptathlon_points)
The erythrocyte sedimentation rate (ESR) and measurements of two plasma proteins (fibrinogen and globulin).
data(plasma)
The erythrocyte sedimentation rate (ESR) is the rate at which red blood cells (erythrocytes) settle out of suspension in blood plasma, when measured under standard conditions. If the ESR increases when the level of certain proteins in the blood plasma rises in association with conditions such as rheumatic diseases, chronic infections and malignant diseases, its determination might be useful in screening blood samples taken from people suspected of suffering from one of the conditions mentioned. The absolute value of the ESR is not of great importance; rather, what matters is whether it is less than 20 mm/hr, since lower values indicate a healthy individual.
The question of interest is whether there is any association between the probability of an ESR reading greater than 20 mm/hr and the levels of the two plasma proteins. If there is not, then the determination of ESR would not be useful for diagnostic purposes.
A data frame with 32 observations on the following 3 variables:
The fibrinogen level in the blood.
The globulin level in the blood.
A factor with two levels representing the erythrocyte sedimentation rate, either less than or greater than 20 mm/hr.
Collett, D. and Jemain, A.A. (1985) Residuals, outliers and influential observations in regression analysis. Sains Malaysiana, 4, 493–511.
data(plasma)
layout(matrix(1:2, ncol = 2))
boxplot(fibrinogen ~ ESR, data = plasma, varwidth = TRUE)
boxplot(globulin ~ ESR, data = plasma, varwidth = TRUE)
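One possible way to explore the question of interest for this data set is a logistic regression of ESR on the two protein levels. The sketch below is an assumption about how such an analysis might look, not a method prescribed by the package. The helper `fit_esr` is hypothetical, and the column names are those used in the boxplot example above. With a factor response, `glm()` treats the first level as failure, so which ESR category is modelled depends on the level ordering in the data.

```r
# Hypothetical helper: logistic regression of the ESR category on the
# fibrinogen and globulin levels.
fit_esr <- function(d) {
  glm(ESR ~ fibrinogen + globulin, data = d, family = binomial)
}

# Fit to the package data only if durhamSLR is installed.
if (requireNamespace("durhamSLR", quietly = TRUE)) {
  data(plasma, package = "durhamSLR")
  print(levels(plasma$ESR))               # check which level counts as "success"
  print(coef(summary(fit_esr(plasma))))   # estimates, standard errors, z tests
}
```

Coefficients that are clearly different from zero would suggest an association between the protein level and the probability of being in the second ESR category.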
This data set contains statistics, in arrests per 100,000 residents, for assault, murder and rape in each of the 50 US states in 1973. Also given are the percentage of the population living in urban areas and the Census Bureau-designated region.
data(USArrests, package="durhamSLR")
A data frame with 50 observations on 5 variables:
Murder arrests (per 100,000); a numeric vector.
Assault arrests (per 100,000); a numeric vector.
Percent urban population; a numeric vector.
Rape arrests (per 100,000); a numeric vector.
A factor with four levels indicating the Census Bureau-designated region.
USArrests contains the data as in McNeil's monograph. For the UrbanPop percentages, a review of the table (No. 21) in the Statistical Abstracts 1975 reveals a transcription error for Maryland (and that McNeil used the same “round to even” rule that R's round() uses), as found by Daniel S Coven (Arizona).
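The “round to even” rule mentioned above can be seen directly in base R: a value exactly halfway between two integers is rounded to the nearest even integer. The vector below is made up for illustration and is not taken from the data; `halves` and `rounded` are hypothetical names.

```r
# Values lying exactly halfway between two integers.
halves <- c(0.5, 1.5, 2.5, 3.5)

# round() sends exact halves to the nearest even integer,
# so 0.5 -> 0, 1.5 -> 2, 2.5 -> 2 and 3.5 -> 4.
rounded <- round(halves)

print(rounded)
```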
World Almanac and Book of Facts 1975. (Crime rates).
Statistical Abstracts of the United States 1975, p.20, (Urban rates), possibly available as https://books.google.ch/books?id=zl9qAAAAMAAJ&pg=PA20.
McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.
data(USArrests, package="durhamSLR")
head(USArrests)