Eucalyptus cloeziana seed count data: a comparative analysis of statistical models

2019 ◽  
Vol 43 ◽  
Author(s):  
Thomas Bruno Michelon ◽  
Cesar Augusto Taconeli ◽  
Elisa Serra Negra Vieira ◽  
Maristela Panobianco

ABSTRACT Generalized linear models (GLMs) are an extension of the linear model and include the normal, Poisson, and negative binomial distributions. Although GLMs were introduced in 1972, most seed technology studies, especially those involving count data, such as germination tests of seeds from the genus Eucalyptus, still use analysis of variance without assessing the fit of other models. Thus, this study aimed to identify the most appropriate model in the GLM class for seed count data of Eucalyptus cloeziana. Data were obtained from a germination test using seeds from three lots of E. cloeziana. Each lot was separated by sieving into three material fractions based on size: small (<0.84 mm), medium (1.00 to 1.18 mm), and large (>1.18 mm). The data were analysed with GLMs fitted using the normal, Poisson, and negative binomial distributions, and the models were evaluated by the Akaike and Schwarz Bayesian information criteria, Cook's distance, and half-normal diagnostic plots. The fit with the normal distribution differed from the other fits in the grouping of means under the Tukey test, and although the data met all normality assumptions, the fit with the Poisson distribution was the most suitable for the count data from a germination test of E. cloeziana seeds.
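
The model comparison described in this abstract can be illustrated with a minimal sketch. It assumes intercept-only models and made-up counts (the values in `hypothetical_counts` are not data from the study), and the function names are illustrative only. It computes the maximized log-likelihood for a Poisson and a normal model and scores each with AIC = 2k − 2ℓ and BIC = k·ln(n) − 2ℓ, where k is the number of estimated parameters.

```python
import math

def poisson_loglik(counts):
    """Maximized Poisson log-likelihood for an intercept-only model (needs mean > 0)."""
    mu = sum(counts) / len(counts)  # the MLE of the Poisson mean is the sample mean
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)

def normal_loglik(counts):
    """Maximized normal log-likelihood for an intercept-only model."""
    n = len(counts)
    mu = sum(counts) / n
    var = sum((y - mu) ** 2 for y in counts) / n  # ML variance estimate (divides by n)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def aic(loglik, k):
    """Akaike information criterion; k is the number of estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Schwarz Bayesian information criterion for n observations."""
    return k * math.log(n) - 2 * loglik

# Hypothetical germination counts, for illustration only.
hypothetical_counts = [12, 15, 9, 14, 11, 13]
n = len(hypothetical_counts)
ll_pois = poisson_loglik(hypothetical_counts)
ll_norm = normal_loglik(hypothetical_counts)
print("Poisson AIC/BIC:", aic(ll_pois, 1), bic(ll_pois, 1, n))
print("Normal  AIC/BIC:", aic(ll_norm, 2), bic(ll_norm, 2, n))
```

Lower values indicate the preferred model under either criterion. A full analysis would include the lot and size-fraction factors as covariates, which this sketch leaves out.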

2017 ◽  
Vol 18 (1) ◽  
pp. 24-49 ◽  
Author(s):  
Wagner H. Bonat ◽  
Bent Jørgensen ◽  
Célestin C. Kokonendji ◽  
John Hinde ◽  
Clarice G. B. Demétrio

We propose a new class of discrete generalized linear models based on the class of Poisson–Tweedie factorial dispersion models with variance of the form $\mu + \phi\mu^{p}$, where $\mu$ is the mean and $\phi$ and $p$ are the dispersion and Tweedie power parameters, respectively. The models are fitted by using an estimating function approach obtained by combining the quasi-score and Pearson estimating functions for the estimation of the regression and dispersion parameters, respectively. This provides a flexible and efficient regression methodology for a comprehensive family of count models including Hermite, Neyman Type A, Pólya–Aeppli, negative binomial and Poisson-inverse Gaussian. The estimating function approach allows us to extend the Poisson–Tweedie distributions to deal with underdispersed count data by allowing negative values for the dispersion parameter $\phi$. Furthermore, the Poisson–Tweedie family can automatically adapt to highly skewed count data with excessive zeros, without the need to introduce zero-inflated or hurdle components, by the simple estimation of the power parameter. Thus, the proposed models offer a unified framework to deal with under-, equi-, overdispersed, zero-inflated and heavy-tailed count data. The computational implementation of the proposed models is fast, relying only on a simple Newton scoring algorithm. Simulation studies showed that the estimating function approach provides unbiased and consistent estimators for both regression and dispersion parameters. We highlight the ability of the Poisson–Tweedie distributions to deal with count data through a consideration of dispersion, zero-inflated and heavy tail indices, and illustrate its application with four data analyses. We provide an R implementation and the datasets as supplementary materials.
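
As a rough illustration of how the dispersion and power parameters shape the variance, the sketch below assumes the variance function Var(Y) = μ + φ·μ^p, with μ the mean, φ the dispersion parameter and p the Tweedie power parameter. The function names are hypothetical and are not taken from the authors' R implementation.

```python
def pt_variance(mu, phi, p):
    """Assumed Poisson-Tweedie variance function: mu + phi * mu**p."""
    return mu + phi * mu ** p

def dispersion_index(mu, phi, p):
    """Variance-to-mean ratio: above 1 is overdispersed, below 1 is underdispersed."""
    return pt_variance(mu, phi, p) / mu

# phi = 0 recovers the Poisson case (variance equals the mean).
print(dispersion_index(2.0, 0.0, 2.0))
# phi > 0 with p = 2 gives the quadratic variance of the negative binomial.
print(dispersion_index(2.0, 0.5, 2.0))
# A negative phi gives underdispersion, as long as the variance stays positive.
print(dispersion_index(2.0, -0.1, 2.0))
```

Under this form the dispersion index equals 1 + φ·μ^(p−1), so the sign of φ alone decides whether the model is under-, equi- or overdispersed.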


Author(s):  
Chenangnon Frédéric Tovissodé ◽  
Romain Glele Kakai

It is quite easy to stochastically distort an original count variable to obtain a new count variable with relatively more variability than the original variable. Many popular overdispersion models (variance greater than mean) can indeed be obtained by mixtures, compounding or randomly stopped sums. There is no analogous stochastic mechanism for the construction of underdispersed count variables (variance less than mean), starting from an original count distribution of interest. This work proposes a generic method to stochastically distort an original count variable to obtain a new count variable with relatively less variability than the original variable. The proposed mechanism, termed condensation, attracts probability masses from the quantiles in the tails of the original distribution and redirects them toward quantiles around the expected value. If the original distribution can be simulated, then the simulation of variates from a condensed distribution is straightforward. Moreover, condensed distributions have a simple mean-parametrization, a characteristic useful in a count regression context. An application to the negative binomial distribution resulted in a distribution allowing under-, equi- and overdispersion. In addition to graphical insights, fields of application of special cases of condensed Poisson and condensed negative binomial distributions were pointed out as an indication of the potential of condensation for a flexible analysis of count data.
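
The claim that mixtures produce overdispersion can be checked with a short derivation. It is a standard textbook argument rather than part of the condensation method, and it assumes a Poisson count $Y$ whose rate $\Lambda$ is itself random (the symbols are introduced here for illustration):

```latex
\begin{align*}
\mathrm{E}(Y) &= \mathrm{E}\{\mathrm{E}(Y \mid \Lambda)\} = \mathrm{E}(\Lambda), \\
\mathrm{Var}(Y) &= \mathrm{E}\{\mathrm{Var}(Y \mid \Lambda)\} + \mathrm{Var}\{\mathrm{E}(Y \mid \Lambda)\} \\
                &= \mathrm{E}(\Lambda) + \mathrm{Var}(\Lambda) \;\ge\; \mathrm{E}(Y).
\end{align*}
```

Equality holds only when $\Lambda$ is constant, so mixing over the rate can add variability but never remove it. This is why a different mechanism is needed to reach underdispersion.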


Author(s):  
Nandi O Leslie ◽  
Richard E Harang ◽  
Lawrence P Knachel ◽  
Alexander Kott

We propose several generalized linear models (GLMs) to predict the number of successful cyber intrusions (or “intrusions”) into an organization’s computer network, where the rate at which intrusions occur is a function of the following observable characteristics of the organization: (i) domain name system (DNS) traffic classified by top-level domain (TLD); (ii) the number of network security policy violations; and (iii) a set of predictors that we collectively call the “cyber footprint”, comprising the number of hosts on the organization’s network, the organization’s similarity to educational institution behavior, and its number of records on scholar.google.com. In addition, we evaluate the number of intrusions to determine whether these events follow a Poisson or negative binomial (NB) probability distribution. We reveal that the NB GLM provides the best-fitting model for the observed count data, the number of intrusions per organization, because the NB model allows the variance of the count data to exceed the mean. We also show that there are restricted and simpler NB regression models that omit selected predictors and improve the goodness-of-fit of the NB GLM for the observed data. With our model simulations, we identify certain TLDs in the DNS traffic as having a significant impact on the number of intrusions. In addition, we use the models and regression results to conclude that the number of network security policy violations is consistently predictive of the number of intrusions.
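
To make concrete why the NB model lets the variance exceed the mean, the sketch below assumes the common mean/size ("NB2") parametrization, in which the variance is μ + μ²/θ. The abstract does not state which parametrization the authors used, and the parameter values here are arbitrary. The code evaluates the probability mass function and recovers the mean and variance numerically.

```python
import math

def nb_logpmf(y, mu, theta):
    """Log-pmf of the negative binomial with mean mu and size (shape) theta."""
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))

def nb_moments(mu, theta, upper=2000):
    """Mean and variance obtained by summing the pmf over 0..upper-1."""
    probs = [math.exp(nb_logpmf(y, mu, theta)) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Arbitrary illustrative values, not estimates from the study.
mean, var = nb_moments(mu=4.0, theta=2.0)
print(mean, var)  # variance should be close to mu + mu**2 / theta
```

As θ grows, the extra term μ²/θ shrinks and the NB distribution approaches the Poisson, so the Poisson model is the limiting special case.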


1995 ◽  
Vol 124 (1) ◽  
pp. 61-70 ◽  
Author(s):  
J. A. Woolliams ◽  
Z. W. Luo ◽  
B. Villanueva ◽  
D. Waddington ◽  
P. J. Broadbent ◽  
...  

SUMMARY Data on ovulation rate and numbers of ova and transferable embryos recovered from superovulated cattle and sheep were analysed using generalized linear models, quasi-likelihood, restricted maximum likelihood (REML) and generalized linear mixed models (GLMMs). The data pertained to the operation of nucleus breeding schemes in cattle and the commercial application of embryo transfer in sheep. Results of the analyses showed that generalized linear models involving Poisson and binomial distributions were inappropriate because of over-dispersion, and that analyses using quasi-likelihood to model negative binomial and β-binomial distributions were more suitable. Factors identified as important in determining the results in cattle were the number of previous superovulations (a higher proportion of transferable embryos were obtained in the initial flush compared to subsequent recoveries in two out of three sets of data), the donor (significant in all analyses with repeated recoveries) and its mate (significant in some analyses). In sheep, the use of pFSH or hMG for superovulation increased embryo yields above those obtained with PMSG + GnRH. Analyses of a further data set for sheep showed the effect of breed was ambiguous. The effects of donors and their mates were treated as random effects in analyses involving REML and GLMMs. Results showed that the repeatability of the number of transferable embryos produced per donor ranged between 0·13 and 0·23 in three sets of data and was significant in all cases. In these analyses the variance among mates was not significantly different from zero. The results of analyses were used to develop a random generator to simulate the numbers of ova and embryos recovered from a cow following superovulation. By sampling from negative binomial distributions where the scale factor used for each cow was a normally distributed deviate, distributions were obtained which had the same mean, variance and repeatability as those observed.


Author(s):  
Cindy Xin Feng

Abstract Count data with excess zeros are frequently encountered in practice. For example, the number of health services visits often includes many zeros representing patients with no utilization during a follow-up period. A common feature of this type of data is that the count measure tends to have more zeros than a common count distribution, such as the Poisson or negative binomial, can accommodate. Zero-inflated (ZI) or hurdle models are often used to fit such data. Despite the increasing popularity of ZI and hurdle models, there is still a lack of investigation of the fundamental differences between these two types of models. In this article, we review the zero-inflated and hurdle models and highlight their differences in terms of their data-generating processes. We also conduct simulation studies to evaluate the performance of both types of models. The final choice of regression model should be made after a careful assessment of goodness of fit and should be tailored to the particular data in question.
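
The difference in data-generating processes can be sketched with the two probability mass functions. This is a minimal illustration using a Poisson count component; the function and parameter names (`pi`, `p_zero`, `lam`) are chosen here and do not come from the article.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of observing count y with rate lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson: zeros come from two sources, a structural-zero
    state (probability pi) and the Poisson component itself."""
    base = (1 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base

def hurdle_pmf(y, p_zero, lam):
    """Hurdle Poisson: all zeros come from one binary process (probability
    p_zero); positive counts follow a zero-truncated Poisson."""
    if y == 0:
        return p_zero
    return (1 - p_zero) * poisson_pmf(y, lam) / (1 - math.exp(-lam))

# Illustrative parameter values only.
print(zip_pmf(0, pi=0.3, lam=2.0))         # always at least the Poisson zero probability
print(hurdle_pmf(0, p_zero=0.1, lam=2.0))  # can fall below exp(-2), the Poisson value
```

Because the hurdle model sets the zero probability directly, it can represent fewer zeros than the Poisson would predict as well as more, whereas the zero-inflated form with a mixing probability between 0 and 1 can only add zeros.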


2020 ◽  
Vol 68 (6) ◽  
pp. 1196-1198
Author(s):  
Christina G Bracamontes ◽  
Thelma Carrillo ◽  
Jane Montealegre ◽  
Leonid Fradkin ◽  
Michele Follen ◽  
...  

Women with an abnormal Pap smear are often referred to colposcopy, a procedure during which endocervical curettage (ECC) may be performed. ECC is a scraping of the endocervical canal lining. Our goal was to compare the performance of a naïve Poisson (NP) regression model with that of a zero-inflated Poisson (ZIP) model when identifying predictors of the number of distress/pain vocalizations made by women undergoing ECC. Data on women seen in the colposcopy clinic at a medical school in El Paso, Texas, were analyzed. The outcome was the number of pain vocalizations made by the patient during ECC. Six dichotomous predictors were evaluated. Initially, NP regression was used to model the data. A high proportion of patients did not make any vocalizations, and hence a ZIP model was also fit and relative rates (RRs) and 95% CIs were calculated. AIC was used to identify the best model (NP or ZIP). Of the 210 women, 154 (73.3%) had a value of 0 for the number of ECC vocalizations. NP identified three statistically significant predictors (language preference of the subject, sexual abuse history and length of the colposcopy), while ZIP identified one: history of sexual abuse (yes vs no; adjusted RR=2.70, 95% CI 1.47 to 4.97). ZIP was preferred over NP. ZIP performed better than NP regression. Clinicians and epidemiologists should consider using the ZIP model (or the zero-inflated negative binomial model) for zero-inflated count data.
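
The reported figures can be checked with a little arithmetic. The sketch below assumes the 95% CI was built as a Wald interval on the log scale, which is the usual construction for rate ratios but is not stated in the abstract. It backs out the implied standard error of log(RR) and rebuilds the interval from the point estimate.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def se_from_ci(lower, upper, z=Z_95):
    """Standard error of log(RR) implied by a log-scale Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def ci_from_rr(rr, se, z=Z_95):
    """Wald interval for a rate ratio, computed on the log scale."""
    return rr * math.exp(-z * se), rr * math.exp(z * se)

# Values reported in the abstract for history of sexual abuse (ZIP model).
se = se_from_ci(1.47, 4.97)
lower, upper = ci_from_rr(2.70, se)
print(round(se, 3), round(lower, 2), round(upper, 2))

# Share of women with zero vocalizations, as reported: 154 of 210.
print(round(100 * 154 / 210, 1))
```

The rebuilt interval agrees with the published one to within rounding, which suggests the log-scale assumption is reasonable here.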


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Ahmed Nabil Shaaban ◽  
Bárbara Peleteiro ◽  
Maria Rosario O. Martins

Abstract Background This study offers a comprehensive approach to analysing the complex distribution of length of stay among HIV admissions in Portugal. Objective To illustrate statistical techniques for analysing count data using longitudinal predictors of length of stay among HIV hospitalizations in Portugal. Method A total of 26,505 registered discharges from Portuguese National Health Service (NHS) facilities between January 2009 and December 2017, classified under the Major Diagnostic Category (MDC) created for patients with HIV infection, with HIV/AIDS as a main or secondary cause of admission, were used to predict length of stay among HIV hospitalizations in Portugal. Several strategies were applied to select the best-fitting count model from among the Poisson regression model, the zero-inflated Poisson model, the negative binomial regression model, and the zero-inflated negative binomial regression model. A random hospital effects term was incorporated into the negative binomial model to examine the dependence between observations within the same hospital. A multivariable analysis was performed to assess the effect of covariates on length of stay. Results The median length of stay in our study was 11 days (interquartile range: 6–22). Statistical comparisons among the count models revealed that the random-effects negative binomial models provided the best fit to the observed data. Admissions among males and admissions associated with TB infection, pneumocystis, cytomegalovirus, candidiasis, toxoplasmosis, or mycobacterium disease exhibited a highly significant increase in length of stay. Clear trends were observed in which a higher number of diagnoses or procedures led to a significantly longer length of stay. The random-effects term included in our model, which refers to unexplained factors specific to each hospital, revealed obvious differences in quality among the hospitals included in our study.
Conclusions This study provides a comprehensive approach to address unique problems associated with the prediction of length of stay among HIV patients in Portugal.

