Patterns of probabilistic segment deletion/reduction in English and Japanese

Abstract Probabilistic phonetic reduction is widely attested in a variety of languages, acoustic domains, and interpretations of predictability. Less well-studied is the categorical effect of probabilistic segment deletion, which in principle is subject to similar pressures. This paper presents the results of an exploratory study into patterns of segment deletion in corpora of spontaneous speech in English and Japanese. Analysis at the word level reveals that words with more phonemes and higher-frequency words tend to have more of their segments deleted. Analysis at the phoneme level reveals that high-probability phonemes are more likely to be deleted than low-probability phonemes. For Japanese only, this analysis also shows effects of word length, frequency, and neighborhood density on deletion probability. Taken together, these results suggest that several large-scale patterns of probabilistic segment deletion mirror the processes of phonetic reduction and apply to both languages. Some patterns, though, appear to be language-specific, and it is not clear to what extent languages can and do differ in this regard. These findings are discussed in terms of our understanding of the universality of proposed predictability effects, and in terms of probabilistic reduction more broadly.

Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech

Journal of Memory and Language ◽

10.1016/j.jml.2011.11.006 ◽

2012 ◽

Vol 66 (4) ◽

pp. 789-806 ◽

Cited By ~ 96

Author(s):

Susanne Gahl ◽

Yao Yao ◽

Keith Johnson

Keyword(s):

Spontaneous Speech ◽

Neighborhood Density ◽

Phonological Neighborhood Density ◽

Phonetic Reduction ◽

Phonological Neighborhood

Determinants of early lexical acquisition: Effects of word- and child-level factors on Dutch children's acquisition of words

Journal of Child Language ◽

10.1017/s0305000921000635 ◽

2021 ◽

pp. 1-21

Author(s):

Josje VERHAGEN ◽

Mees VAN STIPHOUT ◽

Elma BLOM

Keyword(s):

Word Length ◽

Short Form ◽

Mixed Effects ◽

Neighborhood Density ◽

Vocabulary Knowledge ◽

Receptive Vocabulary ◽

Lexical Acquisition ◽

Word Class ◽

Word Level ◽

Dutch Children

Abstract Previous research on the effects of word-level factors on lexical acquisition has shown that frequency and concreteness are most important. Here, we investigate CDI data from 1,030 Dutch children, collected with the short form of the Dutch CDI, to address (i) how word-level factors predict lexical acquisition, once child-level factors are controlled, (ii) whether effects of these word-level factors vary with word class and age, and (iii) whether any interactions with age are due to differences in receptive vocabulary. Mixed-effects regressions yielded effects of frequency and concreteness, but not of word class and phonological factors (e.g., word length, neighborhood density). The effect of frequency was stronger for nouns than predicates. The effects of frequency and concreteness decreased with age, and were not explained by differences in vocabulary knowledge. These findings extend earlier results to Dutch, and indicate that effects of age are not due to increases in vocabulary knowledge.

Spatial suppression due to statistical regularities is driven by distractor suppression not by target activation

10.31234/osf.io/zbcjv ◽

2018 ◽

Author(s):

Michel Failing ◽

Benchi Wang ◽

Jan Theeuwes

Keyword(s):

High Probability ◽

Target Location ◽

Target Selection ◽

Visual Space ◽

Alternative Interpretation ◽

Distractor Location ◽

Target Presentation ◽

Statistical Regularities ◽

Low Probability ◽

Presentation Experiment

Where and what we attend to is not only determined by what we are currently looking for but also by what we have encountered in the past. Recent studies suggest that biasing the probability by which distractors appear at locations in visual space may lead to attentional suppression of high probability distractor locations which effectively reduces capture by a distractor but also impairs target selection at this location. However, in many of these studies introducing a high probability distractor location was tantamount to increasing the probability of the target appearing in any of the other locations (i.e. the low probability distractor locations). Here, we investigate an alternative interpretation of previous findings according to which attentional selection at high probability distractor locations is not suppressed. Instead, selection at low probability distractor locations is facilitated. In two visual search tasks, we found no evidence for this hypothesis: neither when there was only a bias in target presentation but no bias in distractor presentation (Experiment 1), nor when there was only a bias in distractor presentation but no bias in target presentation (Experiment 2). We conclude that recurrent presentation of a distractor in a specific location leads to attentional suppression of that location through a mechanism that is unaffected by any regularities regarding the target location.

Validation of a machine learned model to predict the diagnosis of myocardial infarction

European Heart Journal ◽

10.1093/ehjci/ehaa946.1669 ◽

2020 ◽

Vol 41 (Supplement_2) ◽

Author(s):

D Doudesis ◽

J Yang ◽

A Tsanas ◽

C Stables ◽

A Shah ◽

...

Keyword(s):

Myocardial Infarction ◽

Acute Coronary Syndrome ◽

Predictive Value ◽

High Probability ◽

Cardiovascular Death ◽

Gradient Boosting ◽

Funding Source ◽

Coronary Syndrome ◽

Low Probability ◽

One Year

Abstract Introduction: The myocardial-ischemic-injury-index (MI3) is a promising machine-learned algorithm that predicts the likelihood of myocardial infarction in patients with suspected acute coronary syndrome. Whether this algorithm performs well in unselected patients or predicts recurrent events is unknown. Methods: In an observational analysis from a multi-centre randomised trial, we included all patients with suspected acute coronary syndrome and serial high-sensitivity cardiac troponin I measurements without ST-segment elevation myocardial infarction. Using gradient boosting, MI3 incorporates age, sex, and two troponin measurements to compute a value (0–100) reflecting an individual's likelihood of myocardial infarction, and estimates the negative predictive value (NPV) and positive predictive value (PPV). Model performance for an index diagnosis of myocardial infarction, and for subsequent myocardial infarction or cardiovascular death at one year, was determined using previously defined low- and high-probability thresholds (1.6 and 49.7, respectively). Results: In total, 20,761 of 48,282 (43%) patients (64±16 years, 46% women) were eligible, of whom 3,278 (15.8%) had myocardial infarction. MI3 discriminated well, with an area under the receiver-operating-characteristic curve of 0.949 (95% confidence interval 0.946–0.952), identifying 12,983 (62.5%) patients as low-probability (sensitivity 99.3% [99.0–99.6%], NPV 99.8% [99.8–99.9%]) and 2,961 (14.3%) as high-probability (specificity 95.0% [94.7–95.3%], PPV 70.4% [69–71.9%]). At one year, subsequent myocardial infarction or cardiovascular death occurred more often in high-probability than in low-probability patients (17.6% [520/2,961] versus 1.5% [197/12,983], P<0.001). Conclusions: In unselected consecutive patients with suspected acute coronary syndrome, the MI3 algorithm accurately estimates the likelihood of myocardial infarction and predicts the probability of subsequent adverse cardiovascular events.
Figure: Performance of MI3 at example thresholds

Funding Acknowledgement: Type of funding source: Foundation. Main funding source(s): Medical Research Council
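The percentages reported in the Results can be recomputed from the counts given in the abstract. The following is a minimal arithmetic check, not part of the MI3 model; the helper name `pct` and the variable names are hypothetical, and only counts stated in the abstract are used.

```python
def pct(numerator: int, denominator: int, digits: int = 1) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)

# Counts taken directly from the abstract.
screened = 48_282
eligible = 20_761
index_mi = 3_278
low_prob = 12_983
high_prob = 2_961
events_high = 520   # MI or cardiovascular death at one year, high-probability group
events_low = 197    # same outcome, low-probability group

print(pct(eligible, screened))      # share of screened patients who were eligible
print(pct(index_mi, eligible))      # prevalence of index myocardial infarction
print(pct(low_prob, eligible))      # share classified as low-probability
print(pct(high_prob, eligible))     # share classified as high-probability
print(pct(events_high, high_prob))  # one-year event rate, high-probability group
print(pct(events_low, low_prob))    # one-year event rate, low-probability group
```

Each printed value agrees with the corresponding rounded figure in the abstract.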

Tiered Sampling

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3441299 ◽

2021 ◽

Vol 15 (5) ◽

pp. 1-52

Author(s):

Lorenzo De Stefani ◽

Erisa Terolli ◽

Eli Upfal

Keyword(s):

Large Scale ◽

Analysis Of Algorithms ◽

Base Layer ◽

Single Edge ◽

Real World Data ◽

High Quality ◽

Large Graphs ◽

Massive Graphs ◽

Variance Estimate ◽

Low Probability

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete theoretical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4- and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large-scale graphs.
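The base layer mentioned in the abstract is a standard reservoir sample of edges. The sketch below illustrates only that one building block (the classic single-pass reservoir algorithm), not the tiered estimator itself; the function name `reservoir_sample`, the `Edge` type, and the toy stream are assumptions made for illustration.

```python
import random
from typing import Hashable, Iterable, List, Optional, Tuple

Edge = Tuple[Hashable, Hashable]

def reservoir_sample(stream: Iterable[Edge], M: int,
                     rng: Optional[random.Random] = None) -> List[Edge]:
    """Keep a uniform random sample of at most M edges in one pass over the stream."""
    rng = rng or random.Random()
    sample: List[Edge] = []
    for t, edge in enumerate(stream, start=1):
        if t <= M:
            # The first M edges fill the reservoir directly.
            sample.append(edge)
        else:
            # Edge t is kept with probability M/t and, if kept,
            # replaces a uniformly chosen slot.
            j = rng.randrange(t)
            if j < M:
                sample[j] = edge
    return sample

# Toy stream of 1,000 path edges, memory budget M = 10.
edges = ((i, i + 1) for i in range(1000))
print(len(reservoir_sample(edges, M=10, rng=random.Random(0))))
```

Memory use stays bounded by M regardless of stream length, which is the property the abstract relies on when it partitions a fixed memory budget into several such samples.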

The Challenges of Large‐Scale, Web‐Based Language Datasets: Word Length and Predictability Revisited

Cognitive Science ◽

10.1111/cogs.12983 ◽

2021 ◽

Vol 45 (6) ◽

Author(s):

Stephan C. Meylan ◽

Thomas L. Griffiths

Keyword(s):

Word Length ◽

Large Scale ◽

Web Based

CNN-Based Classifier as an Offline Trigger for the CREDO Experiment

Sensors ◽

10.3390/s21144804 ◽

2021 ◽

Vol 21 (14) ◽

pp. 4804

Author(s):

Marcin Piekarczyk ◽

Olaf Bar ◽

Łukasz Bibrzycki ◽

Michał Niedźwiecki ◽

Krzysztof Rzecki ◽

...

Keyword(s):

Wavelet Transforms ◽

Exploratory Study ◽

Large Scale ◽

Morphological Difference ◽

Cosmic Ray ◽

Input Image ◽

Image Features ◽

The Earth ◽

Cmos Sensor ◽

Competition Process

Gamification is known to enhance users’ participation in education and research projects that follow the citizen science paradigm. The Cosmic Ray Extremely Distributed Observatory (CREDO) experiment is designed for the large-scale study of various radiation forms that continuously reach the Earth from space, collectively known as cosmic rays. The CREDO Detector app relies on a network of involved users and is now working worldwide across phones and other CMOS sensor-equipped devices. To broaden the user base and activate current users, CREDO extensively uses gamification solutions such as the periodical Particle Hunters Competition. However, the adverse effect of gamification is that the number of artefacts, i.e., signals unrelated to cosmic ray detection or openly related to cheating, substantially increases. To tag the artefacts appearing in the CREDO database, we propose a method based on machine learning. The approach involves training a Convolutional Neural Network (CNN) to recognise the morphological difference between signals and artefacts. As a result, we obtain a CNN-based trigger which is able to mimic the signal vs. artefact assignments of human annotators as closely as possible. To enhance the method, the input image signal is adaptively thresholded and then transformed using Daubechies wavelets. In this exploratory study, we use wavelet transforms to amplify distinctive image features. As a result, we obtain a very good recognition ratio of almost 99% for both signals and artefacts. The proposed solution allows eliminating the manual supervision of the competition process.

Prediction of Patients with Acute Cholecystitis Requiring Emergent Cholecystectomy: A Simple Score

Gastroenterology Research and Practice ◽

10.1155/2010/901739 ◽

2010 ◽

Vol 2010 ◽

pp. 1-5 ◽

Cited By ~ 33

Author(s):

Wael N. Yacoub ◽

Mikael Petrosyan ◽

Indu Sehgal ◽

Yanling Ma ◽

Parakrama Chandrasoma ◽

...

Keyword(s):

Heart Rate ◽

Logistic Regression ◽

Acute Cholecystitis ◽

High Probability ◽

Acute Inflammation ◽

Gallbladder Wall ◽

Cutoff Score ◽

Gangrenous Cholecystitis ◽

Pathological Review ◽

Low Probability

The objective was to develop a score to stratify patients with acute cholecystitis into high, intermediate, or low probability of gangrenous cholecystitis. The probability of gangrenous cholecystitis (score) was derived from a logistic regression of a clinical and pathological review of 245 patients undergoing urgent cholecystectomy. Sixty-eight patients had gangrenous inflammation, 132 acute, and 45 no inflammation. The score comprised: age > 45 years (1 point), heart rate > 90 beats/min (1 point), male sex (2 points), leucocytosis > 13,000/mm³ (1.5 points), and ultrasound gallbladder wall thickness > 4.5 mm (1 point). The prevalence of gangrenous cholecystitis was 13% in the low-probability (0–2 points), 33% in the intermediate-probability (2–4.5 points), and 87% in the high-probability category (>4.5 points). A cutoff score of 2 identified 31 (69%) patients with no acute inflammation (PPV 90%). This scoring system can prioritize patients for emergent cholecystectomy based on their expected pathology.
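The point assignments listed in the abstract can be written as a small function. This is an illustrative sketch only, not a clinical tool: the names `gangrenous_score` and `probability_category` are hypothetical, and because the abstract gives overlapping ranges (0–2 and 2–4.5), the code assumes that a score of exactly 2 falls in the low-probability category.

```python
def gangrenous_score(age_years: float, heart_rate_bpm: float, male: bool,
                     wbc_per_mm3: float, wall_thickness_mm: float) -> float:
    """Sum the points listed in the abstract for each criterion that is met."""
    score = 0.0
    if age_years > 45:
        score += 1.0
    if heart_rate_bpm > 90:
        score += 1.0
    if male:
        score += 2.0
    if wbc_per_mm3 > 13_000:
        score += 1.5
    if wall_thickness_mm > 4.5:
        score += 1.0
    return score

def probability_category(score: float) -> str:
    """Map a score to a category; the handling of exactly 2 points is an assumption."""
    if score > 4.5:
        return "high"
    if score > 2:
        return "intermediate"
    return "low"

# Hypothetical patient meeting every criterion: the maximum score is 6.5.
s = gangrenous_score(age_years=60, heart_rate_bpm=100, male=True,
                     wbc_per_mm3=15_000, wall_thickness_mm=5.0)
print(s, probability_category(s))
```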

Frequency of enforcement is more important than the severity of punishment in reducing violation behaviors

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2108507118 ◽

2021 ◽

Vol 118 (42) ◽

pp. e2108507118

Author(s):

Kinneret Teodorescu ◽

Ori Plonsky ◽

Shahar Ayal ◽

Rachel Barkan

Keyword(s):

High Probability ◽

Real Life ◽

Expected Value ◽

Relative Importance ◽

Decisions From Experience ◽

High Severity ◽

Low Probability ◽

Violation Rate ◽

Severity Of Punishment ◽

Enforcement Policies

External enforcement policies aimed to reduce violations differ on two key components: the probability of inspection and the severity of the punishment. Different lines of research offer different insights regarding the relative importance of each component. In four studies, students and Prolific crowdsourcing participants (N_total = 816) repeatedly faced temptations to commit violations under two enforcement policies. Controlling for expected value, we found that a policy combining a high probability of inspection with a low severity of fines (HILS) was more effective than an economically equivalent policy that combined a low probability of inspection with a high severity of fines (LIHS). The advantage of prioritizing inspection frequency over punishment severity (HILS over LIHS) was greater for participants who, in the absence of enforcement, started out with a higher violation rate. Consistent with studies of decisions from experience, frequent enforcement with small fines was more effective than rare severe fines even when we announced the severity of the fine in advance to boost deterrence. In addition, in line with the phenomenon of underweighting of rare events, the effect was stronger when the probability of inspection was rarer (as in most real-life inspection probabilities) and was eliminated under moderate inspection probabilities. We thus recommend that policymakers looking to effectively reduce recurring violations among noncriminal populations should consider increasing inspection rates rather than punishment severity.
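"Economically equivalent" here means that the two policies impose the same expected fine per violation, i.e. probability of inspection times fine. A minimal sketch of that equivalence follows; the probabilities and fine amounts are hypothetical values chosen for illustration and are not taken from the studies, and the function name `expected_fine` is likewise an assumption.

```python
from fractions import Fraction

def expected_fine(p_inspection: Fraction, fine: int) -> Fraction:
    """Expected cost of one violation: probability of inspection times the fine."""
    return p_inspection * fine

# Hypothetical policies with equal expected value.
hils = expected_fine(Fraction(9, 10), 10)   # high inspection probability, low severity
lihs = expected_fine(Fraction(1, 10), 90)   # low inspection probability, high severity

print(hils, lihs, hils == lihs)
```

Exact fractions are used so that the equality check is not affected by floating-point rounding.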

Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00032 ◽

2018 ◽

Vol 6 ◽

pp. 451-465 ◽

Cited By ~ 5

Author(s):

Daniela Gerz ◽

Ivan Vulić ◽

Edoardo Ponti ◽

Jason Naradowsky ◽

Roi Reichart ◽

...

Keyword(s):

Large Scale ◽

Language Modeling ◽

Language Models ◽

Data Sets ◽

High Type ◽

Word Level ◽

Level Information ◽

Character Sequences ◽

Novel Method ◽

Morphologically Rich Languages

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.
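The type-to-token ratio mentioned in the abstract is the number of distinct word forms divided by the total number of word tokens. A minimal sketch, assuming simple lowercased whitespace tokenization (a simplification; the abstract does not state the authors' preprocessing) and a toy sentence; the function name `type_token_ratio` is hypothetical.

```python
def type_token_ratio(text: str) -> float:
    """Distinct word forms (types) divided by total word count (tokens)."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text contains no tokens")
    return len(set(tokens)) / len(tokens)

# Toy example: 6 tokens, 5 types ("the" occurs twice).
print(type_token_ratio("the cat sat on the mat"))
```

A language with richer morphology tends to produce more distinct surface forms for the same amount of text, which pushes this ratio up and increases the number of infrequent words a word-level model has to handle.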
