Recommender Systems in Light of Big Data

Author(s):  
Khadija A. Almohsen ◽  
Huda Al-Jobori

The growth in the usage of the web, especially e-commerce websites, has led to the development of recommender systems (RSs), which aim to personalize web content for each user and reduce the cognitive load of information on the user. However, as the world enters the Big Data era and lives through the contemporary data explosion, the main goal of an RS becomes to provide millions of high-quality recommendations in a few seconds for an increasing number of users and items. One of the successful techniques of RSs is collaborative filtering (CF), which makes recommendations for users based on what other like-minded users have preferred. Despite its success, CF faces several challenges posed by Big Data, such as scalability, sparsity, and cold start. As a consequence, new CF approaches that overcome these problems have been studied, such as singular value decomposition (SVD). This paper surveys the literature on RSs and reviews their current state along with the main concerns surrounding them due to Big Data. Furthermore, it investigates SVD thoroughly, as one of the promising approaches expected to perform well in tackling Big Data challenges, and provides an implementation of it using some of the successful Big Data tools (i.e., Apache Hadoop and Spark). This implementation is intended to validate the applicability of existing contributions to the field of SVD-based RSs, as well as the effectiveness of Hadoop and Spark in developing large-scale systems. The implementation was evaluated empirically by measuring the mean absolute error, which gave results comparable to those of experiments conducted previously by other researchers on a relatively smaller data set in a non-distributed environment. This demonstrates the scalability of SVD-based RSs and their applicability to Big Data.
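
For readers who want to see how such a pipeline might look in practice, below is a minimal sketch (not the authors' exact implementation) of an SVD-based recommender on Spark, using MLlib's RowMatrix.computeSVD. The input path ratings.csv, the rank K, and the assumption of contiguous integer user/item IDs starting at 0 are all illustrative.

```python
# A sketch only: assumes Spark's MLlib (pyspark.mllib) and a "ratings.csv" file of
# "userId,itemId,rating" lines with small, contiguous integer IDs.
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext(appName="svd-recommender-sketch")
K = 20  # number of latent factors to keep (illustrative)

# Parse ratings into (user, (item, rating)) pairs.
ratings = (sc.textFile("ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda p: (int(p[0]), (int(p[1]), float(p[2])))))

n_items = ratings.map(lambda r: r[1][0]).max() + 1

# One sparse row per user of the user-item rating matrix R.
rows = (ratings.groupByKey()
               .sortByKey()
               .map(lambda kv: Vectors.sparse(n_items, sorted(kv[1]))))

# Truncated SVD: R ~ U * diag(s) * V^T. A predicted rating for user u and item i
# is the dot product of row u of U*diag(s) with row i of V.
svd = RowMatrix(rows).computeSVD(K, computeU=True)
U, s, V = svd.U, svd.s, svd.V
```

Predictions reconstructed this way can then be compared against held-out ratings to compute the mean absolute error mentioned in the abstract.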

Author(s):  
Khadija Ateya Almohsen ◽  
Huda Kadhim Al-Jobori

The increasing usage of e-commerce websites has led to the emergence of recommender systems (RSs), which aim to personalize web content for each user. One of the successful techniques of RSs is collaborative filtering (CF), which makes recommendations for users based on what other like-minded users have preferred. However, as the world enters the Big Data era, CF faces challenges such as scalability, sparsity, and cold start. Thus, new approaches that overcome these problems have been studied, such as singular value decomposition (SVD). This chapter surveys the literature on RSs, reviews their current state along with the main concerns surrounding them due to Big Data, investigates SVD thoroughly, and provides an implementation of it using Apache Hadoop and Spark. This is intended to validate the applicability of existing contributions to the field of SVD-based RSs, as well as the effectiveness of Hadoop and Spark in developing large-scale systems. The results demonstrate the scalability of SVD-based RSs and their applicability to Big Data.


Author(s):  
Zongmin Ma ◽  
Li Yan

The Resource Description Framework (RDF) is a model for representing information resources on the Web. With the widespread acceptance of RDF as the de facto standard recommended by the W3C (World Wide Web Consortium) for the representation and exchange of information on the Web, a huge amount of RDF data is proliferating and becoming available. RDF data management is therefore of increasing importance and has attracted attention in both the database community and the Semantic Web community. Currently, much work has been devoted to proposing different solutions for storing large-scale RDF data efficiently. In order to manage massive RDF data, NoSQL ("not only SQL") databases have been used as scalable RDF data stores. This chapter focuses on using various NoSQL databases to store massive RDF data. An up-to-date overview of the current state of the art in RDF data storage in NoSQL databases is provided, and the chapter closes with suggestions for future research.
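
As an illustration of how triples can map onto a key-value or wide-column NoSQL layout, the sketch below shows the common SPO/POS/OSP permutation-index scheme, in which each triple is written under three keys so any single-bound triple pattern can be answered by a prefix scan. It is a generic plain-Python example, not the design of any particular system surveyed in the chapter.

```python
# A minimal sketch of permutation indexing for RDF triples in a key-value store.
SEP = "\x00"  # separator assumed never to occur inside a term

def index_keys(s, p, o):
    """Return the three row keys under which the triple (s, p, o) is stored."""
    return [
        "spo" + SEP + SEP.join((s, p, o)),
        "pos" + SEP + SEP.join((p, o, s)),
        "osp" + SEP + SEP.join((o, s, p)),
    ]

def insert(store, s, p, o):
    for key in index_keys(s, p, o):
        store[key] = b""  # the key itself carries the triple

def match(store, s=None, p=None, o=None):
    """Answer a triple pattern by scanning the index whose first position is bound."""
    if s is not None:
        prefix, order = "spo" + SEP + s + SEP, ("s", "p", "o")
    elif p is not None:
        prefix, order = "pos" + SEP + p + SEP, ("p", "o", "s")
    elif o is not None:
        prefix, order = "osp" + SEP + o + SEP, ("o", "s", "p")
    else:
        prefix, order = "spo" + SEP, ("s", "p", "o")
    for key in store:  # a real NoSQL store would issue a range/prefix scan instead
        if key.startswith(prefix):
            parts = dict(zip(order, key.split(SEP)[1:]))
            yield parts["s"], parts["p"], parts["o"]

store = {}
insert(store, "ex:alice", "foaf:knows", "ex:bob")
print(list(match(store, p="foaf:knows")))  # [('ex:alice', 'foaf:knows', 'ex:bob')]
```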


Author(s):  
Zongmin Ma ◽  
Li Yan

The Resource Description Framework (RDF) is a model for representing information resources on the Web. With the widespread acceptance of RDF as the de facto standard recommended by the W3C (World Wide Web Consortium) for the representation and exchange of information on the Web, a huge amount of RDF data is proliferating and becoming available. RDF data management is therefore of increasing importance and has attracted attention in both the database community and the Semantic Web community. Currently, much work has been devoted to proposing different solutions for storing large-scale RDF data efficiently. In order to manage massive RDF data, NoSQL ("not only SQL") databases have been used as scalable RDF data stores. This chapter focuses on using various NoSQL databases to store massive RDF data. An up-to-date overview of the current state of the art in RDF data storage in NoSQL databases is provided, and the chapter closes with suggestions for future research.


2021 ◽  
Vol 2 (1) ◽  
pp. 1-19
Author(s):  
Harshdeep Singh ◽  
Robert West ◽  
Giovanni Colavizza

Wikipedia's content is based on reliable and published sources. To this date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. We extracted 29.3 million citations from 6.1 million English Wikipedia articles as of May 2020 and classified them as books, journal articles, or Web content. We were thus able to extract 4.0 million citations to scholarly publications with known identifiers (including DOI, PMC, PMID, and ISBN) and to further equip an extra 261 thousand citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to build on our work and update the data set in the future.
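
The Crossref enrichment step described above can be approximated with the public Crossref REST API. The sketch below is a hedged example, not the authors' code: it matches a free-text citation string to a DOI using the works endpoint and the requests library, with an illustrative relevance-score threshold.

```python
# A sketch only: assumes the public Crossref REST API (api.crossref.org) and the
# `requests` library; the score threshold below is illustrative, not the paper's.
import requests

def lookup_doi(citation_text, min_score=60.0):
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation_text, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if items and items[0].get("score", 0.0) >= min_score:
        return items[0]["DOI"]
    return None  # no sufficiently confident match

print(lookup_doi("Dijkstra 1968 Go To Statement Considered Harmful"))
```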


2021 ◽  
Vol 292 ◽  
pp. 03037
Author(s):  
Sun Jiaji ◽  
Li Yuezhong

Since the reform and opening up, China's economy has developed rapidly and material living standards have improved continuously, so people have begun to pay more attention to mental and physical health. As a result, more and more people take part in a variety of sports activities to exercise and watch large-scale sporting events to cultivate a sporting spirit. These changes have boosted the development of China's sports industry, reflected in the continuous expansion of the industry's scale, the deepening of industrial segmentation, and the continuous innovation of development concepts. This paper studies the feasibility, supported by data analysis based on big data, of the industrialization and informatization of traditional sports. Using literature review, observation, field investigation, and logical analysis, it studies the industrialization of traditional sports in depth and reviews it systematically. From the perspective of informatization, the paper analyzes the current situation, existing problems, the importance of industrialization, and the advantages of industrializing traditional sports, and it explores in detail the opportunities that informatization brings to the traditional sports industry and the strategies for its future industrialization.


2021 ◽  
Author(s):  
Haimonti Dutta

In the era of big data, an important weapon in a machine learning researcher's arsenal is a scalable support vector machine (SVM) algorithm. Traditional algorithms for learning SVMs scale superlinearly with the training set size, which quickly becomes infeasible for large data sets. In recent years, scalable algorithms have been designed that study the primal or dual formulations of the problem; these often suggest a way to decompose the problem and facilitate the development of distributed algorithms. In this paper, we present a distributed algorithm for learning linear SVMs in the primal form for binary classification, called the gossip-based subgradient (GADGET) SVM. The algorithm is designed so that it can be executed locally on the sites of a distributed system. Each site processes its local, homogeneously partitioned data and learns a primal SVM model; it then gossips with random neighbors about the classifier learnt and uses this information to update the model. To learn the model, the SVM optimization problem is solved using several techniques, including a gradient estimation procedure, the stochastic gradient descent method, and several variants, including minibatches of varying sizes. Our theoretical results indicate that the rate at which the GADGET SVM algorithm converges to the global optimum at each site is dominated by an [Formula: see text] term, where λ measures the degree of convexity of the function at the site. Empirical results suggest that this anytime algorithm, in which the quality of results improves gradually as computation time increases, has performance comparable to its centralized, pseudodistributed, and other state-of-the-art gossip-based SVM solvers. It is at least 1.5 times (often several orders of magnitude) faster than other gossip-based SVM solvers known in the literature and has a message complexity of O(d) per iteration, where d represents the number of features of the data set. Finally, a large-scale case study is presented in which the consensus-based SVM algorithm is used to predict failures of advanced mechanical components in a chocolate manufacturing process using more than one million data points. This paper was accepted by J. George Shanthikumar, big data analytics.
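
To make the two main ingredients concrete, the toy sketch below combines a Pegasos-style stochastic subgradient step on the primal hinge-loss objective at each site with a gossip averaging round between randomly chosen sites. It is an illustrative reconstruction under simplifying assumptions, not the authors' GADGET implementation; the synthetic data, λ, batch size, and iteration count are placeholders.

```python
# A toy reconstruction under simplifying assumptions; not the authors' code.
import numpy as np

def local_sgd_step(w, X, y, lam, t, batch=16):
    """One Pegasos-style subgradient step on lam/2 * ||w||^2 + mean hinge loss."""
    rng = np.random.default_rng(t)
    idx = rng.choice(len(y), size=min(batch, len(y)), replace=False)
    Xb, yb = X[idx], y[idx]
    viol = yb * (Xb @ w) < 1.0                        # points violating the margin
    grad = lam * w
    if viol.any():
        grad = grad - (yb[viol, None] * Xb[viol]).sum(axis=0) / len(yb)
    eta = 1.0 / (lam * (t + 1))                       # Pegasos step size
    return w - eta * grad

def gossip_round(weights, rng):
    """Average the models of two randomly chosen sites (one gossip exchange)."""
    i, j = rng.choice(len(weights), size=2, replace=False)
    weights[i] = weights[j] = 0.5 * (weights[i] + weights[j])

# Two sites holding homogeneously partitioned synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.where(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.0]) > 0, 1.0, -1.0)
sites = [(X[:200], y[:200]), (X[200:], y[200:])]
weights = [np.zeros(5) for _ in sites]
lam = 0.01
for t in range(1, 201):
    weights = [local_sgd_step(w, Xs, ys, lam, t) for w, (Xs, ys) in zip(weights, sites)]
    gossip_round(weights, rng)
```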


Big Data ◽  
2016 ◽  
pp. 85-104
Author(s):  
Zongmin Ma ◽  
Li Yan

The Resource Description Framework (RDF) is a model for representing information resources on the Web. With the widespread acceptance of RDF as the de facto standard recommended by the W3C (World Wide Web Consortium) for the representation and exchange of information on the Web, a huge amount of RDF data is proliferating and becoming available. RDF data management is therefore of increasing importance and has attracted attention in both the database community and the Semantic Web community. Currently, much work has been devoted to proposing different solutions for storing large-scale RDF data efficiently. In order to manage massive RDF data, NoSQL ("not only SQL") databases have been used as scalable RDF data stores. This chapter focuses on using various NoSQL databases to store massive RDF data. An up-to-date overview of the current state of the art in RDF data storage in NoSQL databases is provided, and the chapter closes with suggestions for future research.


Author(s):  
Emanuele Fumeo ◽  
Luca Oneto ◽  
Giorgio Clerico ◽  
Renzo Canepa ◽  
Federico Papa ◽  
...  

Current Train Delay Prediction Systems (TDPSs) do not take advantage of state-of-the-art tools and techniques for extracting useful insights from the large amounts of historical data collected by railway information systems. Instead, these systems rely on static rules built by experts of the railway infrastructure, based on classical univariate statistics. The purpose of this book chapter is to build a data-driven TDPS for large-scale railway networks that exploits the most recent big data technologies, learning algorithms, and statistical tools. In particular, we propose a fast learning algorithm for Shallow and Deep Extreme Learning Machines that fully exploits recent in-memory large-scale data processing technologies for predicting train delays. The proposal has been compared with current state-of-the-art TDPSs. Results on real-world data from the Italian railway network show that our proposal improves on the current state-of-the-art TDPSs.
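
As a rough illustration of the model family involved, the sketch below implements a shallow Extreme Learning Machine regressor in NumPy: hidden-layer weights are drawn at random and only the output weights are fit, by ridge-regularized least squares. It is a generic example, not the chapter's algorithm or its in-memory distributed implementation; the feature matrix and delay targets are synthetic placeholders.

```python
# A minimal shallow ELM sketch; hyperparameters and data are illustrative only.
import numpy as np

class ShallowELM:
    def __init__(self, n_hidden=200, reg=1e-3, seed=0):
        self.n_hidden, self.reg, self.seed = n_hidden, reg, seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.W = rng.normal(size=(X.shape[1], self.n_hidden))  # random, never trained
        self.b = rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                        # hidden activations
        # Output weights: solve (H^T H + reg*I) beta = H^T y.
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta

# Toy usage with synthetic "delay" data.
X = np.random.default_rng(1).normal(size=(500, 8))
y = 3.0 * X[:, 0] + np.sin(X[:, 1])
model = ShallowELM().fit(X[:400], y[:400])
mae = np.abs(model.predict(X[400:]) - y[400:]).mean()
```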


2016 ◽  
Vol 311 (4) ◽  
pp. F787-F792 ◽  
Author(s):  
Yue Zhao ◽  
Chin-Rang Yang ◽  
Viswanathan Raghuram ◽  
Jaya Parulekar ◽  
Mark A. Knepper

Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/.


2013 ◽  
Vol 347-350 ◽  
pp. 2575-2579
Author(s):  
Wen Tao Liu

Web data collection is the process of collecting semi-structured, large-scale, and redundant data from the web, including web content, web structure, and web usage, by means of a crawler; it is often used for information extraction, information retrieval, search engines, and web data mining. In this paper, the principle of web data collection is introduced and some related topics are discussed, such as page download, encoding problems, update strategies, and static versus dynamic pages. Multithreading is described, and a multithreaded mode for web data collection is proposed. Web data collection with multithreading achieves better resource utilization, better average response time, and better overall performance.
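
A minimal sketch of the multithreaded download mode described above is given below: a thread pool fetches pages concurrently so that slow responses do not block the whole collection run, and the response encoding is re-detected to guard against wrongly declared charsets. It assumes the requests library; the URLs are placeholders.

```python
# A sketch only: assumes the `requests` library; URLs and worker count are illustrative.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url, timeout=10):
    resp = requests.get(url, timeout=timeout)
    resp.encoding = resp.apparent_encoding   # guard against mis-declared encodings
    return url, resp.status_code, resp.text

urls = ["https://example.org/page1", "https://example.org/page2"]
results = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for fut in as_completed(futures):
        try:
            url, status, html = fut.result()
            results[url] = (status, html)
        except requests.RequestException as exc:
            print("download failed:", exc)
```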

