synthetic data
Recently Published Documents


TOTAL DOCUMENTS

3157
(FIVE YEARS 1562)

H-INDEX

78
(FIVE YEARS 13)

2022 ◽  
Vol 0 (0) ◽  
Author(s):  
Jian Cao ◽  
Seo-young Silvia Kim ◽  
R. Michael Alvarez

Abstract How do we ensure a statewide voter registration database’s accuracy and integrity, especially when the database depends on aggregating decentralized, sub-state data with different list maintenance practices? We develop a Bayesian multivariate multilevel model to account for correlated patterns of change over time in multiple response variables, and label statewide anomalies using deviations from model predictions. We apply our model to California’s 22 million registered voters, using 25 snapshots from the 2020 presidential election. We estimate countywide change rates for multiple response variables, such as changes in voters’ partisan affiliation, and jointly model these changes. The model outperforms a simple interquartile range (IQR) detection when tested with synthetic data. This is a proof-of-concept that demonstrates the utility of the Bayesian methodology, as despite the heterogeneity in list maintenance practices, a principled, statistical approach is useful. At the county level, the total numbers of anomalies are positively correlated with the average election cost per registered voter between 2017 and 2019. Given the recent efforts to modernize and secure voter list maintenance procedures in the For the People Act of 2021, we argue that checking whether counties or municipalities are behaving similarly at the state level is also an essential step in ensuring electoral integrity.
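
For readers unfamiliar with the IQR baseline mentioned in the abstract above, the following is a minimal sketch of a generic interquartile-range outlier rule. It is not the authors' implementation: the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample rates are all assumptions chosen for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical change rates for eight units (made-up numbers, not from the paper).
rates = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.095, 0.011]
flagged = iqr_outliers(rates)
print(flagged)
```

A rule like this treats each variable in isolation, which is presumably why a joint model of several correlated response variables can do better on synthetic tests.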


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Mehrdad Mansouri ◽  
Sahand Khakabimamaghani ◽  
Leonid Chindelevitch ◽  
Martin Ester

Abstract Background There has been a simultaneous increase in demand and accessibility across genomics, transcriptomics, proteomics and metabolomics data, known as omics data. This has encouraged widespread application of omics data in life sciences, from personalized medicine to the discovery of underlying pathophysiology of diseases. Causal analysis of omics data may provide important insight into the underlying biological mechanisms. Existing causal analysis methods yield promising results when identifying potential general causes of an observed outcome based on omics data. However, they may fail to discover the causes specific to a particular stratum of individuals and missing from others. Methods To fill this gap, we introduce the problem of stratified causal discovery and propose a method, Aristotle, for solving it. Aristotle addresses the two challenges intrinsic to omics data: high dimensionality and hidden stratification. It employs existing biological knowledge and a state-of-the-art patient stratification method to tackle the above challenges and applies a quasi-experimental design method to each stratum to find stratum-specific potential causes. Results Evaluation based on synthetic data shows better performance for Aristotle in discovering true causes under different conditions compared to existing causal discovery methods. Experiments on a real dataset on Anthracycline Cardiotoxicity indicate that Aristotle’s predictions are consistent with the existing literature. Moreover, Aristotle makes additional predictions that suggest further investigations.


2022 ◽  
Author(s):  
Kuan-Jung Chiang ◽  
Chi Man Wong ◽  
Feng Wan ◽  
Tzyy-Ping Jung ◽  
Masaki Nakanishi

Numerical simulations with synthetic data were conducted.


2022 ◽  
Vol 40 (1) ◽  
pp. 11-22
Author(s):  
Shin'ya Nakano ◽  
Ryuho Kataoka

Abstract. The properties of the auroral electrojets are examined on the basis of a trained machine-learning model. The relationships between solar-wind parameters and the AU and AL indices are modeled with an echo state network (ESN), a kind of recurrent neural network. We can consider this trained ESN model to represent nonlinear effects of the solar-wind inputs on the auroral electrojets. To identify the properties of auroral electrojets, we obtain various synthetic AU and AL data by using various artificial inputs with the trained ESN. The analyses of various synthetic data show that the AU and AL indices are mainly controlled by the solar-wind speed in addition to Bz of the interplanetary magnetic field (IMF) as suggested by the literature. The results also indicate that the solar-wind density effect is emphasized when solar-wind speed is high and when IMF Bz is near zero. This suggests some nonlinear effects of the solar-wind density.


Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 213
Author(s):  
Ghada Abdelmoumin ◽  
Jessica Whitaker ◽  
Danda B. Rawat ◽  
Abdul Rahman

An effective anomaly-based intelligent IDS (AN-Intel-IDS) must detect both known and unknown attacks. Hence, there is a need to train AN-Intel-IDS using dynamically generated, real-time data in an adversarial setting. Unfortunately, the public datasets available to train AN-Intel-IDS are ineluctably static, unrealistic, and prone to obsolescence. Further, the need to protect private data and conceal sensitive data features has limited data sharing, thus encouraging the use of synthetic data for training predictive and intrusion detection models. However, synthetic data can be unrealistic and potentially biased. On the other hand, real-time data are realistic and current; however, they are inherently imbalanced due to the uneven distribution of anomalous and non-anomalous examples. In general, non-anomalous or normal examples are more frequent than anomalous or attack examples, thus leading to a skewed distribution. While imbalanced data are commonly predominant in intrusion detection applications, they can lead to inaccurate predictions and degraded performance. Furthermore, the lack of real-time data produces potentially biased models that are less effective in predicting unknown attacks. Therefore, training AN-Intel-IDS using imbalanced and adversarial learning is instrumental to their efficacy and high performance. This paper investigates imbalanced learning and adversarial learning for training AN-Intel-IDS using a qualitative study. It surveys and synthesizes generative-based data augmentation techniques for addressing the uneven data distribution and generative-based adversarial techniques for generating synthetic yet realistic data in an adversarial setting using rapid review, structured reporting, and subgroup analysis.
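
To make the class-imbalance problem in the abstract above concrete, here is a small sketch of random oversampling, the simplest rebalancing baseline. It is deliberately not one of the generative augmentation techniques the paper surveys; it only duplicates existing minority examples. The function name `random_oversample` and the toy records are hypothetical.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # Every class is padded up to the size of the largest class.
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical toy records: 8 "normal" examples and 2 "attack" examples.
samples = [[0.1 * i, 1.0] for i in range(8)] + [[5.0, 9.0], [6.0, 8.5]]
labels = ["normal"] * 8 + ["attack"] * 2
balanced_x, balanced_y = random_oversample(samples, labels)
print(balanced_y.count("normal"), balanced_y.count("attack"))
```

Because the added attack rows are exact copies, this baseline adds no new information about unknown attacks, which is the gap that generative approaches aim to fill.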


2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Yang Wu ◽  
Ellora Hui Zhen Chua ◽  
Alvin Wei Tian Ng ◽  
Arnoud Boot ◽  
Steven G. Rozen

Abstract Mutational signatures are characteristic patterns of mutations generated by exogenous mutagens or by endogenous mutational processes. Mutational signatures are important for research into DNA damage and repair, aging, cancer biology, genetic toxicology, and epidemiology. Unsupervised learning can infer mutational signatures from the somatic mutations in large numbers of tumors, and separating correlated signatures is a notable challenge for this task. To investigate which methods can best meet this challenge, we assessed 18 computational methods for inferring mutational signatures on 20 synthetic data sets that incorporated varying degrees of correlated activity of two common mutational signatures. Performance varied widely, and four methods noticeably outperformed the others: hdp (based on hierarchical Dirichlet processes), SigProExtractor (based on multiple non-negative matrix factorizations over resampled data), TCSM (based on an approach used in document topic analysis), and mutSpec.NMF (also based on non-negative matrix factorization). The results underscored the complexities of mutational signature extraction, including the importance and difficulty of determining the correct number of signatures and the importance of hyperparameters. Our findings indicate directions for improvement of the software and show a need for care when interpreting results from any of these methods, including the need for assessing sensitivity of the results to input parameters.


Algorithms ◽  
2022 ◽  
Vol 15 (1) ◽  
pp. 20
Author(s):  
Yinan Chen ◽  
Chuanpeng Wang ◽  
Dong Li

Complex networks usually consist of densely connected cliques, which are defined as communities. A community structure reflects the local characteristics of the network topology, which makes community detection an important research field for revealing the internal structural characteristics of networks. In this article, an information-based community detection approach, MINC-NRL, is proposed, which can be applied to both overlapping and non-overlapping community detection. MINC-NRL introduces network representation learning (NRL) to represent the target network as vectors, then generates a community evolution process based on these vectors to reduce the search space, and finally finds the best community partition in this process using mutual information between network and communities (MINC). Experiments on real-world and synthetic data sets verify the effectiveness of the approach in community detection, on both non-overlapping and overlapping tasks.


2022 ◽  
Author(s):  
Omar Alfarisi ◽  
Zeyar Aung ◽  
Mohamed Sassi

Choosing the optimal machine learning algorithm is not an easy decision. To help future researchers, we describe in this paper which of the leading algorithms is optimal. We built a synthetic data set and performed supervised machine learning runs for five different algorithms. For heterogeneous rock fabric, we identified Random Forest, among others, to be the appropriate algorithm.


2022 ◽  
Vol 15 (1) ◽  
pp. 149-164
Author(s):  
Alberto Sorrentino ◽  
Alessia Sannino ◽  
Nicola Spinelli ◽  
Michele Piana ◽  
Antonella Boselli ◽  
...  

Abstract. We consider the problem of reconstructing the number size distribution (or particle size distribution) in the atmosphere from lidar measurements of the extinction and backscattering coefficients. We assume that the number size distribution can be modeled as a superposition of log-normal distributions, each one defined by three parameters: mode, width and height. We use a Bayesian model and a Monte Carlo algorithm to estimate these parameters. We test the developed method on synthetic data generated by distributions containing one or two modes and perturbed by Gaussian noise as well as on three datasets obtained from AERONET. We show that the proposed algorithm provides good results when the right number of modes is selected. In general, an overestimate of the number of modes provides better results than an underestimate. In all cases, the PM1, PM2.5 and PM10 concentrations are reconstructed with tolerable deviations.
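
The forward model in the abstract above, a size distribution built as a superposition of log-normal modes, can be sketched as follows. The parameterization is an assumption, since the abstract only names "mode, width and height": here each mode is written in dN/d ln r form with a median radius, a geometric standard deviation, and a total number concentration. The function names, parameter values, and 5% noise level are hypothetical, and the sketch evaluates the distribution itself rather than the lidar extinction and backscattering coefficients used in the paper.

```python
import math
import random

def lognormal_mode(r, median, sigma_g, height):
    """One log-normal mode in dN/d ln r form (assumed parameterization)."""
    s = math.log(sigma_g)
    z = math.log(r / median) / s
    return height * math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * s)

def size_distribution(r, modes):
    """Superposition of modes; each mode is a (median, sigma_g, height) tuple."""
    return sum(lognormal_mode(r, *m) for m in modes)

# Hypothetical bimodal distribution: a fine mode and a coarse mode.
modes = [(0.1, 1.6, 100.0), (1.0, 2.0, 10.0)]

# Radii on a logarithmic grid from 1e-3 to 1e2 (arbitrary units).
n_steps = 2000
ln_lo, ln_hi = math.log(1e-3), math.log(1e2)
step = (ln_hi - ln_lo) / n_steps
radii = [math.exp(ln_lo + i * step) for i in range(n_steps + 1)]
clean = [size_distribution(r, modes) for r in radii]

# Synthetic "measurement": perturb with Gaussian noise (5% relative, assumed).
rng = random.Random(0)
noisy = [v + rng.gauss(0.0, 0.05 * v) for v in clean]

# Trapezoidal integral over ln r recovers the summed heights of the modes.
total = step * (sum(clean) - 0.5 * (clean[0] + clean[-1]))
print(round(total, 3))
```

Integrating the noise-free curve over ln r returns the sum of the mode heights, which is a quick sanity check on the parameterization before attempting any inversion.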

