Collecting a Large Scale Dataset for Classifying Fake News Tweets Using Weak Supervision

Stefan Helmstetter; Heiko Paulheim

doi:10.3390/fi13050114

Future Internet ◽

10.3390/fi13050114 ◽

2021 ◽

Vol 13 (5) ◽

pp. 114

Author(s):

Stefan Helmstetter ◽

Heiko Paulheim

Keyword(s):

Large Scale ◽

Binary Classification ◽

Classification Problem ◽

Training Dataset ◽

Fake News ◽

Weak Supervision ◽

Alternative Approach ◽

Large Scale Dataset ◽

Tweet Classification

The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor, and recent approaches utilizing distributional semantics require large training corpora. In this paper, we introduce an alternative approach for creating a large-scale dataset for tweet classification with minimal user intervention. The approach relies on weak supervision and automatically collects a large-scale, but very noisy, training dataset comprising hundreds of thousands of tweets. As a weak supervision signal, we label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that despite this unclean, inaccurate dataset, the results are comparable to those achieved using a manually labeled set of tweets. Moreover, we show that combining the large-scale noisy dataset with a human-labeled one yields better results than either of the two alone.
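The source-based labeling idea in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the account names, the toy tweets, and the choice of a small naive Bayes classifier are hypothetical and are not taken from the paper, which does not state its source lists or model in the abstract.

```python
from collections import Counter
import math

# Hypothetical source lists; the paper's actual lists are not given in the abstract.
TRUSTED = {"trusted_news"}
UNTRUSTED = {"dubious_daily"}

def weak_label(account):
    """Return 1 for an untrustworthy source, 0 for a trustworthy one, None otherwise."""
    if account in UNTRUSTED:
        return 1
    if account in TRUSTED:
        return 0
    return None

def train_naive_bayes(texts, labels):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy tweets as (account, text); labels come from the source, not from the content.
tweets = [
    ("trusted_news", "parliament passes budget after long debate"),
    ("trusted_news", "central bank keeps interest rates unchanged"),
    ("dubious_daily", "shocking secret cure doctors hide from you"),
    ("dubious_daily", "shocking proof the moon landing was staged"),
    ("unknown_user", "nice weather today"),
]
labeled = [(text, weak_label(acc)) for acc, text in tweets if weak_label(acc) is not None]
model = train_naive_bayes([t for t, _ in labeled], [y for _, y in labeled])
```

The classifier is trained on the source label but would then be applied to the different target of fake versus non-fake tweets, which is why the training labels are only approximately correct for that target.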

Building Damage Detection Using U-Net with Attention Mechanism from Pre- and Post-Disaster Remote Sensing Datasets

Remote Sensing ◽

10.3390/rs13050905 ◽

2021 ◽

Vol 13 (5) ◽

pp. 905

Author(s):

Chuyi Wu ◽

Feng Zhang ◽

Junshi Xia ◽

Yichen Xu ◽

Guoqing Li ◽

...

Keyword(s):

Damage Assessment ◽

Large Scale ◽

Binary Classification ◽

Open Data ◽

Building Damage ◽

Attention Mechanism ◽

Large Scale Dataset ◽

Data Program ◽

The Impact ◽

Post Disaster

Building damage status is vital for planning rescue and reconstruction after a disaster, but it is hard to detect and its level is hard to judge. Most existing studies focus on binary classification, and the attention of the model is distracted. In this study, we propose a Siamese neural network that can localize and classify damaged buildings at the same time. The main parts of this network are a variety of attention U-Nets using different backbones. The attention mechanism enables the network to pay more attention to the effective features and channels, so as to reduce the impact of useless features. We train them using the xBD dataset, a large-scale dataset for the advancement of building damage assessment, and compare their balanced F (F1) scores. The scores show that SEresNeXt with an attention mechanism performs best, with an F1 score of 0.787. To improve the accuracy, we fused the results and obtained the best overall F1 score of 0.792. To verify the transferability and robustness of the model, we selected data from the Maxar Open Data Program for two recent disasters to investigate the performance. By visual comparison, the results show that our model is robust and transferable.
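For readers unfamiliar with the balanced F (F1) score used for comparison above, the following is a minimal sketch of the plain binary F1 computation. The abstract does not say how scores are aggregated across damage classes or between localization and classification, so this shows only the per-class building block, on made-up labels.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up per-pixel labels: 1 = damaged building, 0 = everything else.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.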

Selected Robust Logistic Regression Specification for Classification of Multi‑dimensional Functional Data in Presence of Outlier

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.334.04 ◽

2018 ◽

Vol 2 (334) ◽

Author(s):

Mirosław Krzyśko ◽

Łukasz Smaga

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Functional Data ◽

Logistic Regression Model ◽

Binary Classification ◽

Classification Problem ◽

Classification Rule ◽

Unknown Parameters ◽

Explanatory Variables

In this paper, the binary classification problem of multi-dimensional functional data is considered. To solve this problem, a regression technique based on a functional logistic regression model is used. This model is re-expressed as a particular logistic regression model by using the basis expansions of functional coefficients and explanatory variables. Based on the re-expressed model, a classification rule is proposed. To handle outlying observations, robust methods of estimation of unknown parameters are also considered. Numerical experiments suggest that the proposed methods may behave satisfactorily in practice.
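The re-expression step mentioned above can be written out as follows. This is a sketch of the standard basis-expansion argument, not the paper's own derivation: the notation (p functional dimensions on an interval I, basis vectors φ_j, coefficient vectors c_j and b_j) is assumed here for illustration.

```latex
% Assumed notation: X = (X_1, ..., X_p) is the functional predictor on an interval I,
% Y in {0, 1} is the class label, and \varphi_j is a vector of basis functions
% used for both X_j and the functional coefficient \beta_j.
\begin{align}
\log\frac{P(Y=1\mid X)}{1-P(Y=1\mid X)}
  &= \beta_0 + \sum_{j=1}^{p}\int_I \beta_j(t)\,X_j(t)\,dt, \\
X_j(t) = \mathbf{c}_j^{\top}\boldsymbol{\varphi}_j(t), \quad
\beta_j(t) &= \mathbf{b}_j^{\top}\boldsymbol{\varphi}_j(t), \\
\int_I \beta_j(t)\,X_j(t)\,dt
  &= \mathbf{c}_j^{\top}\mathbf{J}_j\,\mathbf{b}_j, \qquad
  \mathbf{J}_j = \int_I \boldsymbol{\varphi}_j(t)\,\boldsymbol{\varphi}_j(t)^{\top}\,dt, \\
\log\frac{P(Y=1\mid X)}{1-P(Y=1\mid X)}
  &= \beta_0 + \sum_{j=1}^{p}\left(\mathbf{J}_j\mathbf{c}_j\right)^{\top}\mathbf{b}_j .
\end{align}
```

The last line is an ordinary logistic regression with covariates J_j c_j and unknown parameters β_0 and b_j. A natural classification rule under these assumptions assigns an observation to class 1 when the estimated linear predictor is positive.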

IFND: a benchmark dataset for fake news detection

Complex & Intelligent Systems ◽

10.1007/s40747-021-00552-1 ◽

2021 ◽

Author(s):

Dilip Kumar Sharma ◽

Sonal Garg

Keyword(s):

Large Scale ◽

Latent Dirichlet Allocation ◽

Prediction Models ◽

Benchmark Dataset ◽

Fake News ◽

Text And Image ◽

People Detection ◽

Digital Platforms ◽

Augmentation Algorithm ◽

Large Scale Dataset

Spotting fake news is a critical problem nowadays. Social media are responsible for propagating fake news. Fake news propagated over digital platforms generates confusion and induces biased perspectives in people. Detection of misinformation on digital platforms is essential to mitigate its adverse impact. Many approaches have been implemented in recent years. Despite this productive work, fake news identification poses many challenges due to the lack of a comprehensive, publicly available benchmark dataset. There is no large-scale dataset that consists of Indian news only. This paper therefore presents IFND, the Indian fake news dataset. The dataset consists of both text and images. The majority of the content in the dataset is about events from 2013 to 2021. The dataset content is scraped using the Parsehub tool. To increase the amount of fake news in the dataset, an intelligent augmentation algorithm is used, which generates meaningful fake news statements. The latent Dirichlet allocation (LDA) technique is employed for topic modelling to assign categories to news statements. Various machine learning and deep learning classifiers are applied to the text and image modalities to evaluate performance on the proposed IFND dataset. A multi-modal approach is also proposed, which considers both textual and visual features for fake news detection. The proposed IFND dataset achieved satisfactory results. This study affirms that the availability of such a large dataset can stimulate research on this laborious problem and lead to better prediction models.

A large scale dataset for classification of vehicles in urban traffic scenes

Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing - ICVGIP '16 ◽

10.1145/3009977.3010040 ◽

2016 ◽

Cited By ~ 3

Author(s):

Harish S Bharadwaj ◽

Soma Biswas ◽

K R Ramakrishnan

Keyword(s):

Large Scale ◽

Urban Traffic ◽

Large Scale Dataset

MULTILEVEL KOHONEN NETWORK LEARNING FOR CLUSTERING PROBLEMS

Journal of Information and Communication Technology ◽

10.32890/jict.7.2008.8075 ◽

2008 ◽

Author(s):

Siti Mariyam Shamsuddin ◽

Anazida Zainal ◽

Norfadzila Mohd Yusof

Keyword(s):

Large Scale ◽

Pattern Separation ◽

Distance Measures ◽

Classification Rate ◽

Network Learning ◽

Large Scale Dataset ◽

Self Organising Map ◽

Clustering Problems

Clustering is the procedure of recognising classes of patterns that occur in the environment and assigning each pattern to its relevant class. Unlike classical statistical methods, the self-organising map (SOM) does not require any prior knowledge about the statistical distribution of the patterns in the environment. In this study, an alternative classification of self-organising neural networks, known as multilevel learning, was proposed to solve the task of pattern separation. The performance of standard SOM and multilevel SOM was evaluated with different distance or dissimilarity measures in retrieving similarity between patterns. The purpose of this analysis was to evaluate the quality of the map produced by SOM learning using different distance measures in representing a given dataset. Based on the results obtained from both SOM methods, predictions can be made for the unknown samples. The results showed that multilevel SOM learning gives a better classification rate for small and medium scale datasets, but not for large scale datasets.
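To make the role of the distance measure concrete, here is a minimal sketch of a standard one-dimensional SOM in which the distance function is a swappable parameter. It does not implement the multilevel variant, which the abstract does not describe; the learning-rate schedule, neighbourhood rule, and toy data are all assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def best_matching_unit(weights, sample, distance):
    """Index of the map unit whose weight vector is closest to the sample."""
    return min(range(len(weights)), key=lambda i: distance(weights[i], sample))

def train_som(samples, weights, distance, epochs=20, lr=0.5, radius=1):
    """Train a 1-D SOM; units within `radius` of the winner are also updated."""
    weights = [list(w) for w in weights]  # copy so the caller's list is untouched
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        for sample in samples:
            bmu = best_matching_unit(weights, sample, distance)
            for i, w in enumerate(weights):
                if abs(i - bmu) <= radius:
                    influence = 1.0 if i == bmu else 0.5  # neighbours move half as far
                    for d in range(len(w)):
                        w[d] += rate * influence * (sample[d] - w[d])
    return weights

# Two toy clusters, near (0, 0) and near (1, 1), and a three-unit map.
samples = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
initial = [[0.2, 0.2], [0.5, 0.5], [0.8, 0.8]]

for name, dist in (("euclidean", euclidean), ("manhattan", manhattan)):
    trained = train_som(samples, initial, dist)
    print(name, [best_matching_unit(trained, s, dist) for s in samples])
```

Passing the distance as a function keeps the training loop identical across measures, so any difference in the resulting map can be attributed to the measure itself.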

Understanding Medical Conversations with Scattered Keyword Attention and Weak Supervision from Responses

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6412 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8838-8845

Author(s):

Xiaoming Shi ◽

Haifeng Hu ◽

Wanxiang Che ◽

Zhongqian Sun ◽

Ting Liu ◽

...

Keyword(s):

Large Scale ◽

Classification Problem ◽

Unlabeled Data ◽

Classification Models ◽

Weak Supervision ◽

Structured Representations ◽

Slot Filling ◽

Filling Problem

In this work, we consider the medical slot filling problem, i.e., the challenging task of converting medical queries into structured representations. We analyze the effectiveness of two points: scattered keywords in user utterances and weak supervision from responses. We approach medical slot filling as a multi-label classification problem, using a label-embedding attentive model to pay more attention to scattered medical keywords and learning the classification models by weak supervision from responses. To evaluate the approaches, we annotate a medical slot filling dataset and collect large-scale unlabeled data. The experiments demonstrate that these two points are promising for improving the task.
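One common way to realise label-embedding attention for multi-label classification is sketched below. This is a generic illustration and not the authors' architecture: the dot-product scoring, the sigmoid output, the slot names, and the two-dimensional toy vectors are all assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def label_attention_scores(token_vectors, label_vectors):
    """For each label: attend over the tokens, pool them, and score with a sigmoid."""
    results = {}
    for label, lvec in label_vectors.items():
        # Attention weights: how relevant each token is to this label.
        weights = softmax([dot(lvec, tvec) for tvec in token_vectors])
        # Label-specific utterance representation: weighted sum of token vectors.
        pooled = [sum(w * tvec[d] for w, tvec in zip(weights, token_vectors))
                  for d in range(len(lvec))]
        # Independent probability per label, as usual in multi-label classification.
        results[label] = 1 / (1 + math.exp(-dot(lvec, pooled)))
    return results

# Hypothetical toy embeddings for a three-token utterance and two slot labels.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
labels = {"symptom": [2.0, 0.0], "medication": [0.0, -2.0]}
scores = label_attention_scores(tokens, labels)
```

Because each label attends to the tokens separately, keywords scattered across the utterance can each contribute to the label they are relevant to.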

A MACHINE LEARNING PIPELINE ARTICULATING SATELLITE IMAGERY AND OPENSTREETMAP FOR ROAD DETECTION

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-4-w14-255-2019 ◽

2019 ◽

Vol XLII-4/W14 ◽

pp. 255-260

Author(s):

M. A. Zurbaran ◽

P. Wightman ◽

M. A. Brovelli

Keyword(s):

Machine Learning ◽

Satellite Imagery ◽

Binary Classification ◽

Ground Truth ◽

Classification Problem ◽

Training Dataset ◽

Road Detection ◽

Area Of Interest ◽

Desirable Outcome ◽

Public Datasets

Satellite imagery from earth observation missions enables processing big data to gather information about the world. Automating the creation of maps that reflect ground truth is a desirable outcome that would help decision makers take adequate actions in alignment with the United Nations Sustainable Development Goals. In order to harness the power that the availability of the new generation of satellites enables, it is necessary to implement techniques capable of handling annotations for the massive volume and variability of high spatial resolution imagery for further processing. However, the availability of public datasets for training machine learning models for image segmentation plays an important role for scalability. This work focuses on bridging remote sensing and computer vision by providing an open-source pipeline for generating machine learning training datasets for road detection in an area of interest. The proposed pipeline addresses road detection as a binary classification problem using road annotations existing in OpenStreetMap for creating masks. For this case study, Planet images of 3 m resolution are used for creating a training dataset for road detection in Kenya.
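The mask-creation step can be sketched as rasterising road polylines onto a pixel grid. This is an illustration under simplifying assumptions, not the paper's pipeline: the roads are assumed to be already converted from geographic coordinates to integer pixel coordinates, each road is drawn one pixel wide, and the function names are hypothetical.

```python
def rasterize_line(mask, start, end):
    """Mark grid cells along a segment using Bresenham's line algorithm."""
    (x0, y0), (x1, y1) = start, end
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(mask) and 0 <= x0 < len(mask[0]):
            mask[y0][x0] = 1  # road pixel; cells outside the image are skipped
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def roads_to_mask(roads, width, height):
    """Build a binary mask (1 = road, 0 = background) from pixel-space polylines."""
    mask = [[0] * width for _ in range(height)]
    for polyline in roads:
        for start, end in zip(polyline, polyline[1:]):
            rasterize_line(mask, start, end)
    return mask

# Two toy roads as lists of (x, y) pixel coordinates on a 4x4 image.
roads = [[(0, 0), (3, 0)], [(1, 1), (3, 3)]]
mask = roads_to_mask(roads, width=4, height=4)
```

Each image tile paired with such a mask forms one training example for the binary road versus background classification described above.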

Multi layered Stacked Ensemble Method with Feature Reduction Technique for Multi-Label Classification

Journal of Physics Conference Series ◽

10.1088/1742-6596/2161/1/012074 ◽

2022 ◽

Vol 2161 (1) ◽

pp. 012074

Author(s):

Hemavati ◽

V Susheela Devi ◽

R Aparna

Keyword(s):

Ensemble Learning ◽

Principal Component ◽

Initial Step ◽

Classification Problem ◽

Feature Reduction ◽

Training Dataset ◽

Class Label ◽

Label Information ◽

Class Information

Nowadays, multi-label classification can be considered one of the important challenges among classification problems. In this setting, instances are assigned more than one class label. Ensemble learning is a process of supervised learning where several classifiers are trained to get a better solution for a given problem. Feature reduction can be used to improve the classification accuracy by considering the class label information with principal component analysis (PCA). In this paper, a stacked ensemble learning method with augmented class information PCA (CA PCA) is proposed for classification of multi-label data (SEMML). In the initial step, dimensionality reduction is applied; then the number of classifiers to apply to the original training dataset is chosen; then the stacking method is applied. The experimental results show that the proposed method performs better than the existing methods.

Model-free classification of multivariate time-series based on epsilon-complexity theory.

Transaction Kola Science Centre ◽

10.37614/2307-5252.2020.8.11.023 ◽

2020 ◽

Vol 11 (8-2020) ◽

pp. 176-178

Author(s):

B.S. Darkhovsky ◽

Y.A. Dubnov ◽

A.Y. Popkov ◽

...

Keyword(s):

Time Series ◽

Binary Classification ◽

Multivariate Time Series ◽

Feature Space ◽

Classification Problem ◽

Eeg Signals ◽

Real Numbers ◽

Model Free ◽

Free Classification

This work is devoted to a new model-free approach to the problem of binary classification of multivariate time series. The approach is based on the original theory of epsilon-complexity, which allows almost every mapping that satisfies the Hölder condition to be characterized by a pair of real numbers, the complexity coefficients. Thus we can form a feature space in which a classification problem can be formulated and solved. We provide an example of classification of real EEG signals.

Graph Convolutional Networks by Architecture Search for PolSAR Image Classification

Remote Sensing ◽

10.3390/rs13071404 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1404

Author(s):

Hongying Liu ◽

Derong Xu ◽

Tianwen Zhu ◽

Fanhua Shang ◽

Yuanyuan Liu ◽

...

Keyword(s):

Neural Networks ◽

Large Scale ◽

Spatial Relations ◽

Classification Problem ◽

Search Method ◽

Sample Distribution ◽

Convolutional Networks ◽

Training Samples ◽

Graph Neural Networks

Classification of polarimetric synthetic aperture radar (PolSAR) images has achieved good results due to the excellent fitting ability of neural networks with a large number of training samples. However, the performance of most convolutional neural networks (CNNs) degrades dramatically when only a few labeled training samples are available. As one well-known class of semi-supervised learning methods, graph convolutional networks (GCNs) have recently gained much attention for addressing the classification problem with only a few labeled samples. As the number of layers in the network grows, the number of parameters increases dramatically, and it is challenging to determine an optimal architecture manually. In this paper, we propose a GCN based on neural architecture search (ASGCN) for the classification of PolSAR images. We construct a novel graph whose nodes combine both the physical features and spatial relations between pixels or samples to represent the image. Then we build a new search space whose components are empirically selected from some graph neural networks for architecture search and develop the differentiable architecture search method to construct our ASGCN. Moreover, to address the training of large-scale images, we present a new weighted mini-batch algorithm to reduce memory consumption and ensure the balance of sample distribution, and we analyze and compare it with other similar training strategies. Experiments on several real-world PolSAR datasets show that our method improves the overall accuracy by as much as 3.76% over state-of-the-art methods.
