dataset size
Recently Published Documents


TOTAL DOCUMENTS

157
(FIVE YEARS 79)

H-INDEX

13
(FIVE YEARS 2)

2022 ◽  
Author(s):  
Abdul Muqtadir Khan ◽  
Abdullah BinZiad ◽  
Abdullah Al Subaii ◽  
Turki Alqarni ◽  
Mohamed Yassine Jelassi ◽  
...  

Diagnostic pumping techniques are used routinely in proppant fracturing design. The pumping process can be time consuming; however, it yields technical confidence in treatment and productivity optimization. Recent developments in data analytics and machine learning can aid in shortening operational workflows and enhance project economics. Supervised learning was applied to an existing database to streamline the process and affect the design framework. Five classification algorithms were used for this study. The database was constructed through heterogeneous reservoir plays from the injection/falloff outputs. The algorithms used were support vector machine, decision tree, random forest, multinomial, and XGBoost. The number of classes was sensitized to establish a balance between model accuracy and prediction granularity. Fifteen cases were developed for a comprehensive comparison. A complete machine learning framework was constructed to work through each case set along with hyperparameter tuning to maximize accuracy. After the model was finalized, an extensive field validation workflow was deployed. The target outputs selected for the model were crosslinked fluid efficiency, total proppant mass, and maximum proppant concentration. The unsupervised clustering technique with the t-SNE algorithm, which was tried first, lacked accuracy. Supervised classification models showed better predictions. Cross-validation techniques showed an increasing trend of prediction accuracy. Feature selection was done using one-variable-at-a-time (OVAT) and a simple feature correlation study. Because the number of features and the dataset size were small, no features were eliminated from the final model building. Accuracy and F1 scores calculated from the confusion matrix were used for evaluation. XGBoost showed excellent results with an accuracy of 74 to 95% for the output parameters. Fluid efficiency was categorized into three classes and yielded an accuracy of 96%. Proppant concentration and proppant mass predictions showed 77% and 86% accuracy, respectively, for the six-class case. The combination of high accuracy and fine granularity confirmed the potential application of machine learning models. The ratio of training to testing (holdout) across all cases ranged from 80:20 to 70:30. Model validations were done through an inverse problem of predicting and matching the fracture geometry and treatment pressures from the machine learning model design and the actual net pressure match. The validations were conducted using advanced multiphysics simulations. The advantages of this innovative design approach showed four areas of improvement: reduction in polymer consumption by 30%, reduction of the flowback time by 25%, reduction of water usage by 30%, and enhanced operational efficiency by 60 to 65%.
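
The abstract above reports accuracy and F1 scores derived from a confusion matrix. As a minimal sketch of how those two metrics follow from such a matrix, the Python snippet below computes accuracy and a macro-averaged F1 score. The three-class matrix is made-up illustrative data, not the study's results, and the choice of macro averaging is an assumption, since the abstract does not say how F1 was aggregated across classes.

```python
def accuracy_and_macro_f1(cm):
    """Compute accuracy and macro-averaged F1 from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    accuracy = correct / total

    f1_scores = []
    for k in range(n_classes):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n_classes)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                               # true k, predicted class differs
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    macro_f1 = sum(f1_scores) / n_classes
    return accuracy, macro_f1


# Hypothetical three-class confusion matrix (illustrative numbers only).
example_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

acc, f1 = accuracy_and_macro_f1(example_cm)
print(f"accuracy={acc:.3f}, macro F1={f1:.3f}")
```

Accuracy alone can look favourable when classes are imbalanced, which is one reason to report it alongside an F1 score.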


2022 ◽  
Author(s):  
Alexander Pomberger ◽  
Antonio Pedrina McCarthy ◽  
Ahmad Khan ◽  
Simon Sung ◽  
Connor Taylor ◽  
...  

Multivariate chemical reaction optimization involving catalytic systems is a non-trivial task due to the high number of tuneable parameters and discrete choices. Closed-loop optimization featuring active Machine Learning (ML) represents a powerful strategy for automating reaction optimization. However, the translation of chemical reaction conditions into a machine-readable format comes with the challenge of finding highly informative features which accurately capture the factors for reaction success and allow the model to learn efficiently. Herein, we compare the efficacy of different calculated chemical descriptors for a high-throughput generated dataset to determine the impact on a supervised ML model when predicting reaction yield. Then, the effect of featurization and size of the initial dataset within a closed-loop reaction optimization was examined. Finally, the balance between descriptor complexity and dataset size was considered. Ultimately, tailored descriptors did not outperform simple generic representations; however, a larger initial dataset accelerated reaction optimization.


2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Vegard Skirbekk ◽  
Éric Bonsang ◽  
Bo Engdahl

There is a lack of studies assessing how hearing impairment relates to reproductive outcomes. We examined whether childhood hearing impairment (HI) affects reproductive patterns based on longitudinal Norwegian population-level data for birth cohorts 1940–1980. We used Poisson regression to estimate the association between the number of children ever born and HI. The association with childlessness was estimated with a logit model. As a robustness check, we also estimated family fixed effects Poisson and logit models. Hearing was assessed at ages 7, 10 and 13, and reproduction was observed at adult ages until 2014. Air conduction hearing threshold levels were obtained by pure-tone audiometry at eight frequencies from 0.25 to 8 kHz. Fertility data were collected from Norwegian administrative registers. The combined dataset size was N = 50,022. Our analyses reveal that HI in childhood is associated with lower fertility in adulthood, especially for men. The proportion of childless individuals among those with childhood HI was almost twice as large as that of individuals with normal childhood hearing (20.8% vs. 10.7%). The negative association is robust to the inclusion of family fixed effects, which control for unobserved heterogeneity shared between siblings, including factors related to upbringing and parental characteristics. Less family support in later life could add to the health challenges faced by those with HI. More attention should be given to how fertility relates to HI.
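
The "almost twice as large" comparison above can be checked directly from the two reported proportions. The Python sketch below computes the crude ratio of proportions and, for comparison, the crude odds ratio. Both are unadjusted quantities derived only from the 20.8% and 10.7% figures; they are not the coefficients of the study's logit or fixed-effects models, and the function names are illustrative.

```python
def risk_ratio(p_exposed, p_reference):
    """Ratio of two proportions (crude relative risk)."""
    return p_exposed / p_reference


def odds_ratio(p_exposed, p_reference):
    """Ratio of the odds implied by two proportions (crude odds ratio)."""
    odds_exposed = p_exposed / (1 - p_exposed)
    odds_reference = p_reference / (1 - p_reference)
    return odds_exposed / odds_reference


# Proportions of childless individuals reported in the abstract.
p_childless_hi = 0.208      # childhood hearing impairment
p_childless_normal = 0.107  # normal childhood hearing

rr = risk_ratio(p_childless_hi, p_childless_normal)
crude_or = odds_ratio(p_childless_hi, p_childless_normal)
print(f"risk ratio={rr:.2f}, crude odds ratio={crude_or:.2f}")
```

The odds ratio is the larger of the two because odds grow faster than proportions, which matters when reading logit-model output against raw percentages.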


2021 ◽  
Vol 216 ◽  
pp. 108048
Author(s):  
Vijay Mohan Nagulapati ◽  
Hyunjun Lee ◽  
DaWoon Jung ◽  
Boris Brigljevic ◽  
Yunseok Choi ◽  
...  

2021 ◽  
Vol 24 (68) ◽  
pp. 72-88
Author(s):  
Mohammad Alshayeb ◽  
Mashaan A. Alshammari

The ongoing development of computer systems requires massive software projects. Running the components of these huge projects for testing purposes might be a costly process; therefore, parameter estimation can be used instead. Software defect prediction models are crucial for software quality assurance. This study investigates the impact of dataset size and feature selection algorithms on software defect prediction models. We use two approaches to build software defect prediction models: a statistical approach and a machine learning approach with support vector machines (SVMs). The fault prediction model was built based on four datasets of different sizes. Additionally, four feature selection algorithms were used. We found that applying the SVM defect prediction model to datasets with a reduced number of measures as features may enhance the accuracy of the fault prediction model. It also directs the test effort toward maintaining the most influential set of metrics. We also found that the running time of the SVM fault prediction model is not consistent with dataset size; having fewer metrics therefore does not guarantee a shorter execution time. From the experiments, we found that dataset size has a direct influence on the SVM fault prediction model. However, reduced datasets performed the same as or slightly worse than the original datasets.


2021 ◽  
Author(s):  
Thiago Peixoto Leal ◽  
Vinicius C Furlan ◽  
Mateus Henrique Gouveia ◽  
Julia Maria Saraiva Duarte ◽  
Pablo AS Fonseca ◽  
...  

Genetic and omics analyses frequently require independent observations, which are not guaranteed in real datasets. When relatedness cannot be accounted for, solutions involve removing related individuals (or observations) and, consequently, reducing the available data. We developed a network-based relatedness-pruning method that minimizes dataset reduction while removing unwanted relationships in a dataset. It uses the node degree centrality metric to identify highly connected nodes (or individuals) and implements heuristics that approximate the minimal reduction of a dataset to allow its application to large datasets. NAToRA outperformed two popular methodologies (implemented in the software packages PLINK and KING), showing the best combination of effective relatedness pruning: it removed all relatives while keeping the largest possible number of individuals in all datasets tested, with a similar or smaller reduction in genetic diversity. NAToRA is freely available, both as a standalone tool that can be easily incorporated as part of a pipeline, and as a graphical web tool that allows visualization of the relatedness networks. NAToRA also accepts a variety of relationship metrics as input, which facilitates its use. We also present a genealogy simulator used for different tests performed in the manuscript.
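
To illustrate the general idea of degree-based relatedness pruning described above, here is a minimal Python sketch of a greedy heuristic: repeatedly drop the individual with the most remaining relatives until no related pair is left. This is not NAToRA's actual algorithm, which the abstract does not specify in detail. The function name `prune_related`, the pair-list input format, and the tie-breaking rule are all assumptions made for the example.

```python
def prune_related(pairs):
    """Greedy degree-based pruning of a relatedness network.

    pairs: iterable of (a, b) tuples, each marking two related individuals.
    Returns the list of individuals removed, in removal order.
    """
    # Build an adjacency map: individual -> set of relatives.
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    removed = []
    while True:
        # Highest-degree individual; ties are broken by identifier so the
        # result is deterministic (an arbitrary choice for this sketch).
        node = max(neighbours, key=lambda n: (len(neighbours[n]), n), default=None)
        if node is None or not neighbours[node]:
            break  # no related pairs remain
        for other in neighbours.pop(node):
            neighbours[other].discard(node)
        removed.append(node)
    return removed


# Hypothetical relatedness pairs: A is related to B, C and D; E is related to F.
example_pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("E", "F")]
removed = prune_related(example_pairs)
kept = {x for pair in example_pairs for x in pair} - set(removed)
print("removed:", removed, "kept:", sorted(kept))
```

Removing the hub A clears three relationships at once, whereas removing B, C and D individually would cost three individuals. A greedy rule like this is only an approximation of the minimal reduction, which is why the abstract speaks of heuristics.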


Diagnostics ◽  
2021 ◽  
Vol 11 (11) ◽  
pp. 1972
Author(s):  
Abul Bashar ◽  
Ghazanfar Latif ◽  
Ghassen Ben Brahim ◽  
Nazeeruddin Mohammad ◽  
Jaafar Alghazo

It has become apparent that mankind has to learn to live with and adapt to COVID-19, especially because the vaccines developed thus far do not prevent infection but rather just reduce the severity of the symptoms. The manual classification and diagnosis of COVID-19 pneumonia requires specialized personnel and is time consuming and very costly. On the other hand, automatic diagnosis would allow for real-time diagnosis without human intervention, resulting in reduced costs. Therefore, the objective of this research is to propose a novel optimized Deep Learning (DL) approach for the automatic classification and diagnosis of COVID-19 pneumonia using X-ray images. For this purpose, a publicly available dataset of chest X-rays on Kaggle was used in this study. The dataset was developed over three stages in a quest to have a unified COVID-19 entities dataset available for researchers. The dataset consists of 21,165 anterior-to-posterior and posterior-to-anterior chest X-ray images classified as: Normal (48%), COVID-19 (17%), Lung Opacity (28%) and Viral Pneumonia (6%). Data Augmentation was also applied to increase the dataset size and to enhance the reliability of results by preventing overfitting. An optimized DL approach is implemented in which chest X-ray images go through a three-stage process. Image Enhancement is performed in the first stage, followed by a Data Augmentation stage, and in the final stage the results are fed to the Transfer Learning algorithms (AlexNet, GoogleNet, VGG16, VGG19, and DenseNet), where the images are classified and diagnosed. Extensive experiments were performed under various scenarios, which led to a highest classification accuracy of 95.63%, achieved by applying the VGG16 transfer learning algorithm to the augmented enhanced dataset with frozen weights. This accuracy was found to be better than the results reported by other methods in the recent literature. Thus, the proposed approach proved superior in performance to other similar approaches in the extant literature, and it made a valuable contribution to the body of knowledge. Although the results achieved so far are promising, further work is planned to correlate the results of the proposed approach with clinical observations to further enhance the efficiency and accuracy of COVID-19 diagnosis.


Author(s):  
Erhan Sezerer ◽  
Samet Tenekeci ◽  
Ali Acar ◽  
Bora Baloğlu ◽  
Selma Tekir

In the field of software engineering, practitioners’ share in the constructed knowledge cannot be underestimated and is mostly in the form of grey literature (GL). GL is a valuable resource, though it is subjective and lacks an objective quality assurance methodology. In this paper, a quality assessment scheme is proposed for question and answer (Q&A) sites. In particular, we target Stack Overflow (SO) and Stack Exchange (SE) sites. We model the problem of author reputation measurement as a classification task on the author-provided answers. The authors’ mean, median, and total answer scores are used as inputs for class labeling. State-of-the-art language models (BERT and DistilBERT) with a softmax layer on top are utilized as classifiers and compared to SVM and random baselines. Our best model achieves [Formula: see text] accuracy in binary classification in the SO design patterns tag and [Formula: see text] accuracy in the SE software engineering category. The superior performance in the SE software engineering category can be explained by its larger dataset size. In addition to quantitative evaluation, we provide qualitative evidence that the system’s predicted reputation labels match the quality of the provided answers.


2021 ◽  
Vol 79 (10) ◽  
pp. e6-e7
Author(s):  
N. Mehandru ◽  
W.L. Hicks ◽  
A.K. Singh ◽  
L. Hsu ◽  
M.R. Markiewicz ◽  
...  

Author(s):  
Alexandre Bailly ◽  
Corentin Blanc ◽  
Élie Francis ◽  
Thierry Guillotin ◽  
Fadi Jamal ◽  
...  
