Student Project Presentations (ST 740)
Students in ST 740 are required to present a study of their choice
during the final week of the class. Here is
the list of titles and abstracts for the projects:
- Fall semester, 2007
- Team member: Melinda Thielbar
- Presentation time: Dec. 3, 10:15-10:30
- Title: Intrinsic Loss for Predictive Ordinal Logistic Regression
- Abstract: Categorical outcomes are a popular choice for many marketing analyses.
Whether a customer buys a product, whether he or she prefers one
product over another, and a customer's satisfaction rating on an
ordinal scale are all categorical outcomes. Because predictions from
categorical models are based on predicted probabilities, an intrinsic loss
(a loss function based on the distance between probability functions)
seems a natural choice.
In most cases, however, these models are estimated with maximum
likelihood methods instead. This project uses an intrinsic loss
function and the Netflix data to create a model that predicts customer
ratings for the movie American Beauty. The data set is a 1000-row sample of
customers who have non-missing ratings for American Beauty and a set
of fifteen other movie ratings chosen by cluster analysis in SAS.
- Data Source: NetflixData.txt
- Description: The first column is the
rating for American Beauty. The rest of the columns are the predictor
movies.
- Code used: WinBUGS code
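To make the intrinsic-loss idea above concrete, here is a minimal R sketch of one such loss, a symmetrised Kullback-Leibler discrepancy between two rating distributions over the five Netflix rating levels. The two probability vectors are invented for illustration; this is not the project's WinBUGS code.
  kl_div <- function(p, q) sum(p * log(p / q))                        # Kullback-Leibler divergence
  intrinsic_loss <- function(p, q) min(kl_div(p, q), kl_div(q, p))    # symmetrised, Bernardo-style
  p_model <- c(0.05, 0.10, 0.20, 0.40, 0.25)   # hypothetical predictive probabilities, ratings 1-5
  p_ref   <- c(0.02, 0.08, 0.25, 0.45, 0.20)   # hypothetical reference distribution
  intrinsic_loss(p_model, p_ref)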
- Team member: Ani Eloyan
- Presentation time: Dec. 3, 10:35-10:50
- Title: Bayesian Approach to Independent Component Analysis
- Abstract: Independent Component Analysis (ICA) is a data analysis technique that
helps researchers working in areas such as statistics,
feature extraction, and neural networks. Since researchers in these
fields often work with very large datasets, the question arises of how to
represent these data in a form that is easier to understand and analyze.
Similar problems are also encountered in signal processing, where
researchers have to recover source signals using observed mixtures
of these sources. There are a number of different approaches to this
problem, such as Principal Component Analysis, Projection Pursuit, Factor
Analysis, and ICA. The latter is a more recent method for data
representation. It has been suggested that a Bayesian approach to
solving ICA problems can be more efficient than the conventional
methods. In this project we propose a Bayesian model for solving the noisy
ICA problem. We show an application of this method in brain imaging by
applying it to separate independent components from an MEG dataset.
- Data Source: 3258tonetest-export.txt
- Description: TBA
- Code used: WinBUGS code
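For readers unfamiliar with the noisy ICA setup described above, the short R sketch below simulates the data model x = A s + noise with two independent sources. The dimensions, source distributions and noise level are assumptions chosen for illustration and are unrelated to the MEG dataset.
  # Simulate the noisy ICA data model: observed mixtures x = A %*% s + noise.
  set.seed(1)
  n <- 500
  s <- rbind(sign(runif(n) - 0.5) * rexp(n),   # heavy-tailed (non-Gaussian) source 1
             runif(n, -1, 1))                  # uniform source 2
  A <- matrix(c(0.8, 0.3, 0.4, 0.9), 2, 2)     # unknown mixing matrix (to be estimated)
  x <- A %*% s + matrix(rnorm(2 * n, sd = 0.1), 2, n)   # observed noisy mixtures
  # A Bayesian noisy-ICA model places priors on A, the sources and the noise
  # variance, and recovers the independent components from their posterior.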
- Team members: Eric Reyes and Jamie Reyes
- Presentation time: Dec. 3, 10:55-11:10
- Title: Interval Estimation for a Ratio of Marginal Proportions from Paired Data
- Abstract: Designs involving matched binomial samples have many useful applications. In some settings,
the ratio of marginal proportions may be of interest. In constructing interval estimates
for a ratio, several asymptotic confidence intervals have been proposed. The use of these
intervals in small samples, however, may be inappropriate. A Bayesian approach allows for
the construction of intervals under small sample sizes. A drawback of the highest posterior
density region (HPD) is its lack of invariance under reparameterization. Using a lowest
posterior loss region (LPL) in conjunction with an intrinsic loss function, as proposed by
Bernardo (2005), an invariant credible region is constructed for the ratio. The LPL is
compared to the HPD, a Bayesian tail-interval, and three frequentist methods through a small
simulation study. These methods are then applied to several examples for illustration.
- Data Source: ReyesData.txt
- Description: See the above data file.
- Code used: R code
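As a point of reference for the intervals being compared above, here is a minimal R sketch of posterior simulation for the ratio of marginal proportions from a paired 2x2 table, using a Dirichlet prior on the four cell probabilities and reporting an equal-tailed interval. The counts are invented, and the LPL region with an intrinsic loss is more involved than this simple tail interval.
  # Dirichlet draws via independent gammas (base R only)
  rdirichlet <- function(n, a) {
    g <- matrix(rgamma(n * length(a), shape = a), nrow = n, byrow = TRUE)
    g / rowSums(g)
  }
  counts <- c(n11 = 12, n10 = 5, n01 = 2, n00 = 31)           # hypothetical paired binary data
  post <- rdirichlet(10000, counts + 0.5)                     # posterior of (p11, p10, p01, p00)
  ratio <- (post[, 1] + post[, 2]) / (post[, 1] + post[, 3])  # (p11 + p10) / (p11 + p01)
  quantile(ratio, c(0.025, 0.975))                            # 95% equal-tailed credible interval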
- Team member: Zhi Wen
- Presentation time: Dec. 5, 10:15-10:30
- Title: Estimating Migration Rates Using Tag-Recovery Data
- Abstract: Tag-recovery data are used to estimate migration rates among a set of strata. The model
formulation is a simple matrix extension of the tag-recovery experiment discussed by
Brownie et al. (see Schwarz et al.). In their paper, Schwarz et al. used maximum likelihood to estimate the
migration rates and tag-recovery rates of Pacific herring among spawning grounds off the west coast of
Canada. In some of their estimates, however, the MLEs occurred on the boundary of the parameter space
(e.g. migration/survival estimates of 0 or 1), in which case large-sample theory does not hold and
consequently no estimated standard errors are reported (Schwarz et al.). So I plan to use Bayesian
methods to estimate the parameters. I also use SURVIV to obtain the MLEs and then compare the two sets of
estimates, maximum likelihood and Bayesian. We assume that for each year and each stratum
the tagged herring follow a multinomial distribution, so the sample likelihood
is the product of six multinomial distributions. For my Bayesian estimates, I use non-informative priors
such as the uniform distribution and Beta(0.5, 0.5).
- Data Source: ZhiWenData.txt
- Description: The data were extracted from Table 4 of the following paper:
Carl J. Schwarz, Jake F. Schweigert, and A. Neil Arnason (1993). Estimating migration rates using tag-recovery
data. Biometrics, Vol. 49, No. 1, pp. 177-193.
- Code used: WinBUGS code
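To make the prior choice above concrete, here is a minimal conjugate sketch for a single recovery rate under the Beta(0.5, 0.5) prior. The counts are invented, and the full model is a product of six multinomials rather than a single binomial.
  # Beta-binomial update for one recovery rate under a Beta(0.5, 0.5) prior.
  released  <- 500    # hypothetical number of tags released in one stratum and year
  recovered <- 53     # hypothetical number of those tags recovered
  draws <- rbeta(5000, 0.5 + recovered, 0.5 + released - recovered)   # posterior draws
  quantile(draws, c(0.025, 0.5, 0.975))   # posterior summary of the recovery rate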
- Team member: Jason Riddle
- Presentation time: Dec. 5, 10:35-10:50
- Title: Performance of Bayesian Capture-Recapture Models for
Northern Bobwhite Quail (Colinus virginianus) Covey Count Data
- Abstract: Previously, I compared Northern Bobwhite covey detection probability estimates from
a logistic regression method to that of a frequentist capture-recapture method with 4 sampling periods.
Both methods produced similar detection estimates and each had unique advantages and disadvantages.
The "best" capture-recapture model (model with largest AICc weight) resulted in a more efficient estimate
of detection probability than the logistic regression method. Here, I create Bayesian analogs to two of
the frequentist capture-recapture models I previously evaluated: Mt and M0. For the Mt model, the likelihood
for the collected data is freq[1:15] ~ dmulti(q[], freqsum), where freqsum ~ dbin(psum, N) is the number of
observed coveys, q[] is the vector of detection probabilities for the 1:15 observable detection histories,
q[j] <- p[j]/psum for j in 1:15, and psum <- min(sum(p[]), 1). N is defined as freqsum+M-1 with a prior
distribution of M ~ dcat(prob[]) and Mmax = 200. alpha[i] is the probability of being detected in period i,
with alpha[i] ~ dbeta(0.5, 0.5) for i in 1:4. The M0 model is just a special case of Mt where
alpha[1] = alpha[2] = alpha[3] = alpha[4]. Advantages and disadvantages of the Bayesian approach will be discussed in
light of the logistic regression and frequentist capture-recapture methods.
- Data Source: Capture History Data
- Description: This analysis is based on models presented in the following paper:
Riddle, J. D., Moorman, C. E. and Pollock, K. H. (2007). A Comparison of the Time-of-detection and an Empirical Logistic
Regression Method for Estimating Northern Bobwhite Covey Abundance (under revision).
See also the presentation Slides
- Code used: WinBUGS code
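To show how the Mt likelihood above is built, the R sketch below computes the multinomial cell probabilities q[] for the 15 observable detection histories from a made-up vector of period-specific detection probabilities alpha. It is an illustration of the cell-probability calculation only, not the WinBUGS model itself.
  # Cell probabilities for the 15 observable detection histories under Mt.
  alpha <- c(0.30, 0.25, 0.40, 0.20)                    # assumed detection probs for the 4 periods
  hist4 <- as.matrix(expand.grid(rep(list(0:1), 4)))    # all 16 possible detection histories
  hist4 <- hist4[rowSums(hist4) > 0, ]                  # drop the unobservable all-zero history
  p <- apply(hist4, 1, function(h) prod(alpha^h * (1 - alpha)^(1 - h)))
  psum <- sum(p)       # probability of being detected at least once (= 1 - prod(1 - alpha))
  q <- p / psum        # multinomial cell probabilities conditional on being observed
  round(q, 3)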
- Team member: Chris Franck
- Presentation time: Dec. 5, 10:55-11:10
- Title: Modeling Microarray p-values with a Mixture Distribution
- Abstract: Consider the set of N tests from a microarray experiment, each of which
generates a p-value between 0 and 1. It is seldom known for which of
these tests the null hypothesis is true, and for which tests the
alternative hypothesis is true. Under the null hypothesis, these p-values
arise from a uniform distribution, and under the alternative hypothesis
these p-values can be modeled with a beta distribution. Implementation of
a two-component mixture model is natural in this situation. In this paper
I will fit a mixture model for a set of N=3169 p-values. The natural
choice for this mixture is a combination of a uniform component between 0
and 1 and a beta component. Other mixtures may be considered in the case
where a logit transformation is applied to the p-values to prevent
computational difficulties. The parameter of interest is the mixing
weight for the beta density, which can be interpreted as the probability
that a p-value arises from the beta component. Estimation of this value
is of interest because tests that generate p-values from the beta
component are tests for which the alternative hypothesis is true.
- Data Source: pvalues.txt
- Description:
- Code used: WinBUGS code
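For comparison with the Bayesian fit described above, the sketch below fits the same uniform-plus-beta mixture by maximum likelihood to simulated p-values. The simulated sample, starting values and parameterisation are illustrative assumptions, not the project's data or WinBUGS model.
  # Uniform + beta mixture for p-values, fitted by maximum likelihood.
  set.seed(1)
  p <- c(runif(2500), rbeta(669, 0.5, 4))   # simulated nulls (uniform) and alternatives (beta)
  negloglik <- function(par) {
    w <- plogis(par[1]); a <- exp(par[2]); b <- exp(par[3])   # keep weight in (0,1), shapes > 0
    -sum(log((1 - w) + w * dbeta(p, a, b)))                   # mixture density: (1-w)*1 + w*beta
  }
  fit <- optim(c(0, 0, 0), negloglik)
  c(weight = plogis(fit$par[1]), shape1 = exp(fit$par[2]), shape2 = exp(fit$par[3]))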
- Team member: Ying Zhu
- Presentation time: Dec. 5, 4:15-4:30
- Title: Modeling Dependence in Crop Yield Risk: A Bayesian Perspective
- Abstract: The development of risk management has motivated research investigating the multivariate
risk factors in yield insurance for different crops and their interaction. In the crop insurance literature,
yield risk is usually modeled with a Weibull distribution. The objective of this study is to use a
Bayesian modeling approach to estimate the dependence structure in the joint yield risk
of corn and soybean yield insurance. A copula approach is a natural choice for developing such a
joint distribution model. Using a one-parameter Farlie-Gumbel-Morgenstern (FGM) copula
and two Weibull marginals for corn and soybean yields, we have a total of five parameters to estimate:
two for each Weibull marginal and one dependence parameter in the FGM copula function. We will determine
whether a Bayesian method for estimating the copula parameters is an improvement over maximum likelihood
estimation.
- Data Source: YingZhuData.txt
- Description: Annual state and county crop yield data from the
National Agricultural Statistics Service (NASS)
- Code used: WinBUGS code
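For concreteness, the sketch below writes out the five-parameter joint density implied by an FGM copula with Weibull marginals, as described in the abstract above. The parameter values are placeholders, not estimates from the NASS yield data.
  # Joint density of (corn, soybean) yields: Weibull marginals tied by an FGM copula.
  fgm_density <- function(x, y, shape1, scale1, shape2, scale2, theta) {
    u <- pweibull(x, shape1, scale1)
    v <- pweibull(y, shape2, scale2)
    c_uv <- 1 + theta * (1 - 2 * u) * (1 - 2 * v)       # FGM copula density
    dweibull(x, shape1, scale1) * dweibull(y, shape2, scale2) * c_uv
  }
  # evaluate at a hypothetical yield pair with placeholder parameter values
  fgm_density(x = 150, y = 45, shape1 = 6, scale1 = 160, shape2 = 5, scale2 = 48, theta = 0.4)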
- Team member: Hernan Tejeda
- Presentation time: Dec. 5, 4:35-4:50
- Title: Bayesian Calculation of a Mixture Model for Corn Prices
- Abstract: The Black-Scholes valuation formula is used for calculating option prices
on financial instruments, including commodities. An option gives the
holder the right, but not the obligation, to buy (call) or sell (put), at
some time in the future, a specific amount of the instrument at a
guaranteed (strike) price. One assumption underlying the Black-Scholes formula
is that the prices of these instruments are log-normally distributed; more
specifically, these instrument prices are assumed to follow a geometric
Brownian motion.
The log-normal assumption implies that the variability (implied
volatility) of these instrument prices should remain relatively constant
across different strike prices at the maturity date. Empirical evidence
shows that this is not the case for commodity prices.
This study analyzes the result of introducing a mixture of log-normal
distributions to represent the distribution of commodity prices. The
mixture is made up of one log-normal component and a second log-normal
component that incorporates an autoregressive process of order one (AR(1)) for the
mean. The data are daily observed prices at the Chicago Board
of Trade for options and futures prices of corn with delivery in December
2006. The parameter estimates and fit of the new distribution will
be compared to the single log-normal case.
- Data Source: ProjectDataWb2000a.txt
- Description: TBA
- Code used: WinBUGS code
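The sketch below simulates the kind of price process described above: a two-component log-normal mixture in which the second component's log-mean follows an AR(1) process. Every parameter value is an assumption chosen for illustration, not an estimate from the CBOT data.
  # Simulate daily prices from a two-component log-normal mixture with an AR(1) log-mean.
  set.seed(1)
  n <- 250                                    # roughly one year of trading days
  mu1 <- log(250)                             # log-mean of the first log-normal component
  mu2 <- numeric(n); mu2[1] <- log(260)
  for (t in 2:n)                              # AR(1) process for the second component's log-mean
    mu2[t] <- log(250) + 0.9 * (mu2[t - 1] - log(250)) + rnorm(1, sd = 0.02)
  w <- 0.7                                    # mixture weight on the first component
  from_first <- rbinom(n, 1, w) == 1
  price <- ifelse(from_first, rlnorm(n, mu1, 0.05), rlnorm(n, mu2, 0.10))
  plot(price, type = "l", xlab = "day", ylab = "simulated price")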
- Team member: Ying-Erh Chen
- Presentation time: Dec. 5, 4:55-5:10
- Title: Combo Weather-Index Insurance Based on a Copula Model
- Abstract: Crop yields are affected by weather conditions: precipitation, temperature, soil,
humidity, etc. The World Bank provides weather-index insurance to protect farmers based on one
of these weather factors (precipitation, temperature, soil, humidity, etc.). Under the coverage of
weather-index insurance, farmers can obtain an indemnity if the weather condition is above or below normal.
However, farmers may not have enough protection if they purchase only one kind of weather
insurance while their crops are affected by multiple weather factors. To account for the fact that crop
yields are affected by both temperature and precipitation, a combo weather insurance is proposed here
to provide more protection for farmers. To investigate whether this combo weather insurance protects farmers
better, with a lower premium and a larger indemnity when loss occurs, than purchasing either rainfall-index insurance
or temperature-index insurance alone, a case study will be conducted, and the premium, premium rate,
expected loss, indemnity, and liability for this combo weather insurance plan will be estimated based
on a copula model for the joint distribution of temperature and precipitation.
- Data Source: Data.txt
- Description: See NASS-USDA and
NCDC websites for background information.
- Code used: WinBUGS code
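To illustrate what a copula buys in this setting, the sketch below simulates temperature and precipitation from a Gaussian copula with assumed marginals and estimates how often at least one index crosses its payout trigger. The correlation, marginals and trigger levels are all invented; the project estimates these quantities from the NASS-USDA and NCDC data.
  # Gaussian-copula simulation of (temperature, precipitation) and a combo payout probability.
  set.seed(1)
  n <- 100000
  rho <- -0.3                                                  # assumed temperature-rainfall correlation
  z <- matrix(rnorm(2 * n), n, 2) %*% chol(matrix(c(1, rho, rho, 1), 2, 2))
  u <- pnorm(z)                                                # correlated uniforms (the copula)
  temp <- qnorm(u[, 1], mean = 25, sd = 2)                     # assumed temperature marginal
  rain <- qgamma(u[, 2], shape = 4, scale = 120)               # assumed precipitation marginal
  payout <- (temp > 28) | (rain < 250)                         # combo contract: either trigger pays
  mean(payout)                                                 # estimated indemnity probability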
- Team members: Luciano Silva and Wook Hwang
- Presentation time: Dec. 7, 10:15-10:30
- Title: Bayesian simple interval QTL mapping of mouse body weight
- Abstract: One of many interests in genetic studies is to understand how
phenotypes are genetically controlled. Recently, with advances in
molecular biology, biologists have been able to obtain large amounts of molecular
data for genetic studies. The availability of such molecular techniques
and experimental populations (e.g., backcross) allows the construction of
genetic linkage maps, in which the markers on the chromosomes are linearly
ordered according to the genetic recombination between the markers. The
measurement of phenotypes (e.g., body weight) in these experimental
populations, together with the marker genotypes of the individuals, allows
searching along the genetic map for quantitative trait loci (QTL), which
are genes of small or intermediate effect on the expression of the
phenotype. In this study we will use Bayesian methods to search for QTL
controlling mouse body weight in a backcross population of 103 mice
with 14 molecular markers.
- Data Source: Mouse BCW data
- Description: The first 14 columns are markers and the 15th column is the phenotype. The
data were taken from the following source:
Zhao-Bang Zeng (2001) Course Notes: Statistical Methods for Mapping Quantitative Trait Loci.
North Carolina State University.
- Code used: WinBUGS code
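As a simplified picture of the search described above, the sketch below runs a single-marker scan on a simulated backcross of 103 mice with 14 markers, one of which is given a true effect. The real analysis uses the Mouse BCW data and a Bayesian interval-mapping model rather than per-marker regressions.
  # Single-marker scan on a simulated backcross: regress phenotype on each marker.
  set.seed(1)
  n <- 103
  markers <- matrix(rbinom(n * 14, 1, 0.5), n, 14,
                    dimnames = list(NULL, paste0("m", 1:14)))  # backcross genotypes coded 0/1
  weight <- 20 + 2 * markers[, 7] + rnorm(n, sd = 2)           # marker 7 is near a simulated QTL
  scan <- apply(markers, 2, function(g) summary(lm(weight ~ g))$coefficients[2, 4])
  sort(scan)[1:3]   # markers with the smallest single-marker p-values
  # Bayesian interval mapping instead places priors on the QTL position and effect
  # and also evaluates positions between adjacent markers.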
- Team members: Matt Ritter and Stephen Stanislav
- Presentation time: Dec. 7, 10:35-10:50
- Title: Bayesian Methods Applied to Aural Detection in Avian Point Counts
- Abstract: Ornithologists are often concerned with keeping track of the
population counts of various bird species in a given area. The three
most important statistical aspects of this study are: estimates of
the total population (N), estimates of the probability of detecting a
given bird, and estimates of the probability that the bird is
available for detection. Research by Pollock, Alldredge and
Simons (to appear) gives a frequentist MLE approach to estimating these
(and related) parameters using a multiple-observer approach to
detecting the birds. The goal of this project is to compare Bayesian
estimates of those same three parameters to those found in Pollock et
al. and possibly find a more efficient and accurate analysis using
non-informative priors for the parameters N and p of a multinomial
model. The analysis will be based on the data used in Pollock, Alldredge and
Simons on populations of hooded warblers and ovenbirds. The
data include binomial detection data from two observers for each of 100
birds covering four 2-minute intervals.
- Data Source: RitterData.txt
- Description: TBA
- Code used: R code and
WinBUGS code
- Team member: Osman Gulseven
- Presentation time: Dec. 7, 10:55-11:10
- Title: Teaching Economics with Games: Results From a Classroom Experiment.
- Abstract: A one-semester economics course is an essential introduction to both microeconomic and
macroeconomic concepts. It is a challenge to make fundamental economics courses interesting and engaging
for students with diverse backgrounds. A pop-up review quiz was given to 70 students to test their
understanding of microeconomic concepts. After the test, a related economics taboo game was played
in class. Soon after the game, students were given the option to retake the quiz. Unlike previous
performance-analysis studies, student performance here is based on the probability of getting the right
answer to a typical question. The model is therefore estimated using binary logit/probit regressions
and also Bayesian methods. The dataset includes information both on the questions, such as question type,
difficulty, and discrimination index, and on the students, such as previous calculus experience, class
attendance, gender, age, department, and cumulative grade up to the time of the experiment. Since I am modeling
the probability of getting the right answer, the total sample size is 70 students x 40 questions
x 2 (pre/post experiment) = 5600, which is sufficient for a reliable econometric analysis.
- Data Source: GulsevenData.txt
- Description: See the report
- Code used: WinBUGS code
- Team member: John White
- Presentation time: Dec. 7, 4:15-4:30
- Title: A Bayesian Multiscale Model for Image Denoising
- Abstract: Astronomers use different types of images to analyze many phenomena that
occur in space. There are many multiscale methods to denoise these images
when the noise is Gaussian. However, in X-ray images and other images
taken in space, the information is recorded as photon counts across a
2-dimensional viewing area. The signal therefore carries Poisson noise,
for which many of the frequentist methods perform poorly. In this paper,
we use a Bayesian multiscale model to estimate the underlying intensity
of an image of the Kepler supernova taken by the Chandra X-ray
Observatory. The multiscale approach iteratively combines the data into
groups of four, starting with the finest scale (the original data) and
moving to the coarsest scale (the overall photon count), splitting the
likelihood into conditional distributions of the data across scales and an
overall intensity. From this, estimates of the underlying intensity are
obtained, giving a smoothed image of the Kepler supernova.
- Data Source: kepler.txt
- Description: TBA
- Code used: WinBUGS code
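The "groups of four" recursion described above can be seen in the short sketch below, which repeatedly sums 2x2 blocks of a small simulated Poisson count image from the finest scale up to the total count. The image size and intensity are assumptions, and the sketch shows only the aggregation step, not the full Bayesian model.
  # Multiscale aggregation: sum non-overlapping 2x2 blocks from fine to coarse.
  set.seed(1)
  counts <- matrix(rpois(16 * 16, lambda = 3), 16, 16)   # simulated finest-scale photon counts
  aggregate2x2 <- function(m) {
    nr <- nrow(m) / 2; nc <- ncol(m) / 2
    out <- matrix(0, nr, nc)
    for (i in 1:nr) for (j in 1:nc)
      out[i, j] <- sum(m[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])
    out
  }
  scales <- list(counts)
  while (nrow(scales[[length(scales)]]) > 1)
    scales[[length(scales) + 1]] <- aggregate2x2(scales[[length(scales)]])
  sapply(scales, sum)   # the total photon count is preserved at every scale
  # In the Bayesian model the total count is Poisson and, conditionally, each
  # block's four children are a multinomial split of their parent count.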
- Team member: David Robinson
- Presentation time: Dec. 7, 4:35-4:50
- Title: Bayesian Model for Opening Hand Distribution and Exploitive Play
in "Texas Hold Em"
- Abstract: In the 1980s, the group that dominated the stock market
changed from an older generation who relied on feeling and experience to
guide their trades to younger statisticians, making the old way of
analyzing the stock market obsolete. Today poker has become a fad that is
analyzed much like the stock market used to be. There are a few equations
and formulas in use; however, many will say that experience and intuition
are the main tools necessary. This project will take a Bayesian approach
to determining an opponent's opening hand distribution and then use a
variant of Bayes' Theorem to exploit the decision-making process of our
opponent.
- Data Source: Game1.txt
- Description: Hand history of a single table game played on Pokerstars
- Code used: WinBUGS code
Last updated November 19, 2007.