# The project can be based on either nonparametric density estimation or nonparametric regression

The goal of the course project is to find a dataset (ideally, one that is of interest to

you) and analyze it using one (or more) of the nonparametric methods we have

discussed in the course.

Project Topic

The project can be based on either nonparametric density estimation or

nonparametric regression. Below I give some examples of the types of analysis that

would make a good project.

However, it is important that the methods used are appropriate for the data; in

particular, you do not necessarily need to include all of the components I describe.

1. Nonparametric density estimation

Projects based on density estimation typically fall into one of three categories:

• estimation of the density of a random variable together with investigation of

the properties of the estimate and/or a comparison to a “standard” density for

the data

• a comparison of the density functions of a random variable for two groups

• the use of density estimation for classification.

Note that some projects might combine features of more than one of these

categories, if appropriate for the data. In all cases, the distribution of the “response”

variable should be continuous, in the sense that it is appropriate to model it as a

continuous random variable with an unknown density function.

Further details are given below.

Estimation of a density

Your data should consist of independent observations of a random variable Y such

that the density function p(·) of Y is of interest. The analysis might include the

following components.

• An estimate of p(·) with a discussion of the selection of the smoothing

parameter

• A summary of the properties of p̂(·), possibly including quantities obtained by

numerical integration

• A comparison of p̂(·) to a standard density based on a test of the hypothesis

p = p0 for a given density p0. The standard density should be one that is

interesting in the context of the data, e.g., a normal distribution for

measurement data or an exponential distribution for

failure data.
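
As a rough illustration of the first two components, here is a minimal sketch rather than a prescribed method. It assumes a Gaussian kernel and the normal-reference (rule-of-thumb) bandwidth, both of which are illustrative choices and not requirements of the project, and it uses simulated data as a stand-in for a real dataset. It is written in Python with only the standard library; the function names (gaussian_kde, rule_of_thumb_bandwidth, integrate) are hypothetical. Numerical integration of the estimate then gives summaries such as its mean and a tail probability.

```python
import math
import random
import statistics


def gaussian_kde(data, h):
    """Return the function y -> p_hat(y) for a Gaussian kernel with bandwidth h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)

    def p_hat(y):
        return sum(math.exp(-0.5 * ((y - yi) / h) ** 2) for yi in data) / norm

    return p_hat


def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth 1.06 * s * n^(-1/5) (an assumed, simple default)."""
    return 1.06 * statistics.stdev(data) * len(data) ** (-0.2)


def integrate(f, a, b, steps=2000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    inner = sum(f(a + i * width) for i in range(1, steps))
    return (0.5 * (f(a) + f(b)) + inner) * width


# Simulated measurements; replace with the observed values of Y.
random.seed(0)
sample = [random.gauss(10.0, 2.0) for _ in range(200)]

h = rule_of_thumb_bandwidth(sample)
p_hat = gaussian_kde(sample, h)

# Integrate over a range wide enough to capture essentially all of the mass.
lo, hi = min(sample) - 5 * h, max(sample) + 5 * h
total_mass = integrate(p_hat, lo, hi)
mean_from_density = integrate(lambda y: y * p_hat(y), lo, hi)
prob_above_12 = integrate(p_hat, 12.0, hi)

print(f"bandwidth h = {h:.3f}")
print(f"integral of p_hat = {total_mass:.4f}")
print(f"mean implied by p_hat = {mean_from_density:.3f}")
print(f"estimated P(Y > 12) = {prob_above_12:.3f}")
```

In a real analysis the bandwidth would be chosen and justified for the data at hand, for instance by comparing estimates obtained with several values of h.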

Comparison of two groups

The goal for this option is to compare the densities of a response variable for two

groups.

Your data should consist of a response variable Y for a set of “subjects”, each of

which belongs to one of two groups, which can be assumed to be known, in the sense

that there is no uncertainty regarding the group to which a subject belongs. The

analysis might include the following components.

• Estimate the density function of Y for each of the two groups (call them group

1 and group 2)

• Summarize informally the differences between the two estimates

• Test the hypothesis that the true density functions for the two groups are

identical.

• If the densities appear to be different, test the hypothesis that the shapes of

the true density functions for the two groups are identical (if appropriate)

• Summarize the comparison of the two densities in the context of the data.
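
One possible way to test the hypothesis that the two densities are identical is a permutation test, sketched below. This is an illustration under stated assumptions and not necessarily the test used in the course: the statistic is the integrated squared difference between two Gaussian kernel estimates that share a rule-of-thumb bandwidth computed from the pooled data, and the reference distribution comes from randomly reassigning the group labels. The function name permutation_test_equal_densities and all of its parameters are hypothetical, and the data are simulated.

```python
import math
import random
import statistics


def permutation_test_equal_densities(y1, y2, n_perm=500, grid_size=80, seed=0):
    """Permutation test of H0: p1 = p2 using the integrated squared difference."""
    pooled = list(y1) + list(y2)
    n1, n = len(y1), len(pooled)
    # Common bandwidth from the pooled sample (an assumed, simple choice).
    h = 1.06 * statistics.stdev(pooled) * n ** (-0.2)
    lo, hi = min(pooled) - 3 * h, max(pooled) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + j * step for j in range(grid_size)]
    c = 1.0 / (h * math.sqrt(2.0 * math.pi))
    # kernel[j][i] = K_h(grid[j] - pooled[i]); computed once, reused for every permutation.
    kernel = [[c * math.exp(-0.5 * ((g - yi) / h) ** 2) for yi in pooled] for g in grid]

    def statistic(idx1, idx2):
        total = 0.0
        for row in kernel:
            p1 = sum(row[i] for i in idx1) / len(idx1)
            p2 = sum(row[i] for i in idx2) / len(idx2)
            total += (p1 - p2) ** 2
        return total * step  # Riemann-sum approximation of the integral

    indices = list(range(n))
    observed = statistic(indices[:n1], indices[n1:])

    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(indices)  # random reassignment of group labels
        if statistic(indices[:n1], indices[n1:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated groups whose densities differ by a location shift.
rng = random.Random(1)
group1 = [rng.gauss(0.0, 1.0) for _ in range(40)]
group2 = [rng.gauss(1.5, 1.0) for _ in range(40)]
stat, p_value = permutation_test_equal_densities(group1, group2)
print(f"statistic = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

A test for equality of shapes could follow the same pattern after each group has been centered and scaled, although that variant is not shown here.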

Classification

The goal for this option is to compare the densities of a response variable for two

groups and to use the value of the response to predict the group to which the

response belongs. The data can be either univariate or bivariate.

• Estimate the density function of the response for each of the two groups (call

them group 1 and group 2).

• Compare the estimates and, for univariate data, test the hypothesis that the

true density functions for the two groups are identical.

• Suppose that, on the basis of an observation Y = y, we wish to predict the

group to which the observation belongs. Estimate the conditional probability

that an observation is from group 1 given that Y = y, as a function of y and

plot the results.

• Interpret the results in the context of the data. One way to do this is to give

the estimates of this conditional probability for a few values of y.
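
The conditional probability described above follows from Bayes' rule: P(group 1 | Y = y) = π1 p1(y) / (π1 p1(y) + π2 p2(y)), where π1 and π2 are the group proportions. The sketch below plugs kernel density estimates into that formula. It assumes Gaussian kernels with rule-of-thumb bandwidths and estimates π1 by the sample proportion, which is only reasonable if the sample sizes reflect the population proportions. The names kde and posterior_group1 are hypothetical and the data are simulated.

```python
import math
import random
import statistics


def kde(data):
    """Gaussian kernel density estimate with a rule-of-thumb bandwidth (assumed choice)."""
    h = 1.06 * statistics.stdev(data) * len(data) ** (-0.2)
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return lambda y: sum(math.exp(-0.5 * ((y - yi) / h) ** 2) for yi in data) / norm


def posterior_group1(y1, y2):
    """Return y -> estimated P(group 1 | Y = y) via Bayes' rule."""
    p1, p2 = kde(y1), kde(y2)
    prior1 = len(y1) / (len(y1) + len(y2))  # assumes sampling proportional to group size

    def prob(y):
        a = prior1 * p1(y)
        b = (1.0 - prior1) * p2(y)
        # Far from all of the data both estimates can underflow to zero.
        return a / (a + b) if a + b > 0 else float("nan")

    return prob


# Simulated groups; replace with the observed responses for each group.
rng = random.Random(2)
group1 = [rng.gauss(0.0, 1.0) for _ in range(60)]
group2 = [rng.gauss(2.0, 1.0) for _ in range(60)]

prob = posterior_group1(group1, group2)
for y in (-1.0, 1.0, 3.0):
    print(f"y = {y:4.1f}: estimated P(group 1 | Y = y) = {prob(y):.3f}")
```

Evaluating prob on a fine grid of y values gives the curve to plot.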

2. Nonparametric regression

Projects based on nonparametric regression typically fall into one of two

categories:

• Estimate a nonparametric regression function relating a response variable to a predictor

variable and use that estimate to summarize the relationship between the

variables. Both variables should be continuous.

• Estimate the parameters of a semiparametric regression model relating a

continuous response variable to a continuous “nonparametric” predictor and

a “parametric” predictor, which does not need to be continuous. We will

discuss semiparametric regression models in Week 8 of the course.

Further details follow.

Estimation of a nonparametric regression function

Your data should consist of observations on a response variable Y and a

predictor variable X; of interest is the regression function m(·) given by m(x) =

E(Y | X = x). The analysis might include the following components.

o Use kernel estimation to estimate m(·); include a discussion of the

selection of the smoothing parameter

o Summarize the properties of m̂(·) in the context of the data; this

might include providing estimates of m(x) for meaningful values of x

o Find the degrees-of-freedom corresponding to the estimate m̂(·) and

estimate σ, the error standard deviation.

o Test the hypothesis that m(·) is constant (i.e., there is “no effect”) or

that m(x) is linear in x (or both), if appropriate for the data.

o Compare the estimate m̂(·) to the estimate that would be obtained

using a polynomial regression model, if a low-degree polynomial

regression model might be appropriate for the data.

o Summarize any conclusions regarding the relationship between Y and

X that result from the nonparametric regression analysis. Any such

conclusions should be discussed in the context of the data.
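
To illustrate the kernel estimate, its degrees of freedom, and the estimate of σ, here is a minimal sketch under several assumptions. It uses a local linear estimator with a Gaussian kernel, takes the degrees of freedom to be the trace of the smoother matrix, and estimates σ² by RSS / (n − df), which is one common definition among several. The bandwidth h = 0.08 is fixed by hand purely for illustration, the function names are hypothetical, and the data are simulated.

```python
import math
import random


def local_linear_weights(x0, x, h):
    """Weights l_i(x0) with m_hat(x0) = sum_i l_i(x0) * y_i (Gaussian kernel)."""
    k = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in x]
    s0 = sum(k)
    s1 = sum(ki * (xi - x0) for ki, xi in zip(k, x))
    s2 = sum(ki * (xi - x0) ** 2 for ki, xi in zip(k, x))
    denom = s0 * s2 - s1 ** 2
    return [ki * (s2 - (xi - x0) * s1) / denom for ki, xi in zip(k, x)]


def fit_local_linear(x, y, h):
    """Return fitted values, degrees of freedom (trace of S), and sigma_hat."""
    n = len(x)
    fitted, trace = [], 0.0
    for i in range(n):
        w = local_linear_weights(x[i], x, h)  # i-th row of the smoother matrix S
        fitted.append(sum(wj * yj for wj, yj in zip(w, y)))
        trace += w[i]  # diagonal element S_ii
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma_hat = math.sqrt(rss / (n - trace))  # assumed definition: RSS / (n - df)
    return fitted, trace, sigma_hat


# Simulated data with a nonlinear regression function and error SD 0.3.
rng = random.Random(3)
x = [rng.random() for _ in range(100)]
y = [math.sin(2.0 * math.pi * xi) + rng.gauss(0.0, 0.3) for xi in x]

fitted, df, sigma_hat = fit_local_linear(x, y, h=0.08)
print(f"degrees of freedom = {df:.2f}")
print(f"sigma_hat = {sigma_hat:.3f}")

# Estimate m(x) at a few meaningful values of x.
for x0 in (0.25, 0.50, 0.75):
    m_hat = sum(w * yi for w, yi in zip(local_linear_weights(x0, x, 0.08), y))
    print(f"m_hat({x0:.2f}) = {m_hat:.3f}")
```

In a real analysis h would be selected from the data, for example by cross-validation, and the choice discussed in the report.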

Semiparametric regression

The goal for this option is to analyze data using a semiparametric regression

model.

Your data should consist of a continuous response variable Y along with

two predictors: Z, which is continuous, and X, which can be continuous,

discrete, or categorical. The relationship between Y and X will be modeled

parametrically; hence, this relationship should be approximately linear (unless

X is categorical). The relationship between Y and Z will be modeled

nonparametrically.

o Estimate the regression function m(z) = E(Y | Z = z) using a local linear

kernel estimate.

o Find the degrees-of-freedom of your estimate and estimate the error

variance.

o Test the hypothesis that m(·) is a linear function.

o Estimate the parameters of the semiparametric regression model

Y = βX + m(Z) + ε. Provide an estimate of β along with its standard error.

o Interpret the results in the context of the data.
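
One way to estimate β in the model Y = βX + m(Z) + ε is to remove the smooth dependence on Z from both Y and X and then regress the residuals on each other. The sketch below does this with a local linear smoother. It is an illustration under stated assumptions and may differ from the method covered in Week 8: the bandwidth is fixed by hand, the standard error is a rough approximation that ignores the degrees of freedom used by the smoother, and the names smooth and partially_linear_fit are hypothetical. The data are simulated with a binary X and a true β of 1.5.

```python
import math
import random


def smooth(z, v, h):
    """Local linear smooth of v on z (Gaussian kernel), evaluated at the observed z."""
    out = []
    for z0 in z:
        k = [math.exp(-0.5 * ((zi - z0) / h) ** 2) for zi in z]
        s0 = sum(k)
        s1 = sum(ki * (zi - z0) for ki, zi in zip(k, z))
        s2 = sum(ki * (zi - z0) ** 2 for ki, zi in zip(k, z))
        denom = s0 * s2 - s1 ** 2
        num = sum(ki * (s2 - (zi - z0) * s1) * vi for ki, zi, vi in zip(k, z, v))
        out.append(num / denom)
    return out


def partially_linear_fit(y, x, z, h):
    """Estimate beta and m in Y = beta * X + m(Z) + error by partial residuals."""
    y_res = [a - b for a, b in zip(y, smooth(z, y, h))]  # Y with its smooth in Z removed
    x_res = [a - b for a, b in zip(x, smooth(z, x, h))]  # X with its smooth in Z removed
    sxx = sum(v * v for v in x_res)
    beta_hat = sum(a * b for a, b in zip(x_res, y_res)) / sxx
    # Estimate m by smoothing what is left after removing the parametric part.
    m_hat = smooth(z, [a - beta_hat * b for a, b in zip(y, x)], h)
    resid = [a - beta_hat * b for a, b in zip(y_res, x_res)]
    # Rough variance estimate: ignores the degrees of freedom used by the smoother.
    sigma2 = sum(r * r for r in resid) / (len(y) - 1)
    se_beta = math.sqrt(sigma2 / sxx)
    return beta_hat, se_beta, m_hat


# Simulated data: binary X (e.g., a treatment indicator), continuous Z.
rng = random.Random(4)
n = 150
z = [rng.random() for _ in range(n)]
x = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
y = [1.5 * xi + math.sin(2.0 * math.pi * zi) + rng.gauss(0.0, 0.3)
     for xi, zi in zip(x, z)]

beta_hat, se_beta, m_hat = partially_linear_fit(y, x, z, h=0.08)
print(f"beta_hat = {beta_hat:.3f} (approximate SE {se_beta:.3f})")
```

Here beta_hat is interpreted as the estimated difference in mean response between the two levels of X for subjects with the same value of Z.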

Data

You will need to find the data to use in your analysis. If possible, your data

should address a question that you find interesting.

The only requirement is that the data include a sufficient number of

observations for nonparametric estimation to be reasonably accurate. For

the density estimation option, there should be at least 25 observations in each

group; for the regression option, there should be at least 25 observations. In

both cases, larger sample sizes are preferable.

For the density estimation option, most datasets that have been collected for

the purpose of comparing two groups will be appropriate.

For the regression option, many datasets used to measure the effect of one

predictor variable, while controlling for another (e.g., analysis of covariance),

will be appropriate. Nonparametric methods are most useful when the

relationship between Y and Z is nonlinear; however, it is fine if for your data

the relationship is approximately linear.

Your data can be data that you have collected, perhaps for some other

purpose or as part of another analysis or another course; alternatively, there

are a number of sources where you can look for data that are of interest to

you. These include

• https://vincentarelbundock.github.io/Rdatasets/datasets.html contains a list of

the datasets that are available in a wide range of R packages.

• https://archive.ics.uci.edu/ml/datasets.html, the University of California, Irvine

Machine Learning Repository.

• http://lib.stat.cmu.edu/datasets/, Statlib.

• For those interested in sports, the websites https://www.baseball-reference.com/, https://www.pro-football-reference.com/,

https://www.basketball-reference.com/, and https://www.hockey-reference.com/ contain extensive data on the players and teams of the respective sports.

Your Report

Summarize your results in a brief report. It should include

• A description of the data you used, along with its source; if you have collected

the data yourself, include a description of how the data were obtained

• The goals of the analysis, stated in the context of the data. For instance, if

you are estimating a nonparametric regression function relating blood

pressure to heart rate, you should state the goal in terms of

what you hope to learn about blood pressure and heart rate.

• A summary of the results and a brief description of your analysis. It’s fine to

present some R output to support a statement in that description; however,

you should not submit all of the unedited output from the R functions used. In

particular, it is important to interpret any numerical results in the context of the

data.

• Any plots and figures needed to understand your results. For instance,

when estimating a density or regression function, include a plot of the

function estimate.

Be as concise as possible; the goal is to provide a well-written and informative

summary of your analysis.

Length?

It’s difficult to give a recommended length that is appropriate in all cases. It depends

on the topic, the nature of the analysis, the number of plots, etc. The last time I

taught this course the lengths (in pages) of the best projects were

3, 5, 5, 5, 6, 7, 8, 8, 10, 11, 14.