Addressing big data variety using an automated approach for data characterization

2022
Vol 9 (1)
Author(s):
Georgios Vranopoulos
Nathan Clarke
Shirley Atkinson

Abstract The creation of new knowledge from manipulating and analysing existing knowledge is one of the primary objectives of any cognitive system. Most of the effort in Big Data research has been focussed upon Volume and Velocity, while Variety, "the ugly duckling" of Big Data, is often neglected and difficult to solve. A principal challenge with Variety is being able to understand and comprehend the data. This paper proposes and evaluates an automated approach for metadata identification and enrichment in describing Big Data. The paper focuses on the use of self-learning systems that enable automatic compliance of data against regulatory requirements, along with the capability of generating valuable and readily usable metadata for data classification. Two experiments, on data confidentiality and data identification, were conducted to evaluate the feasibility of the approach. The experiments aimed to confirm that repetitive manual tasks can be automated, reducing the time a Data Scientist spends on data identification and freeing more attention for the extraction and analysis of the data itself. The datasets used originated from private/business and public/governmental sources and exhibited diverse characteristics in the number and size of their files. The experimental work confirmed that: (a) the use of algorithmic techniques substantially decreased false positives in the identification of confidential information; and (b) a fraction of a data set, combined with statistical analysis and supervised learning, is sufficient to identify the structure of the information within it. With this approach, the difficulty of understanding the nature of the data can be mitigated, enabling a greater focus on meaningful interpretation of the heterogeneous data.
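As a rough illustration of the two findings, the sketch below profiles a small random sample of a CSV file, infers per-column structure, and layers a checksum test on top of a pattern match for card numbers; pairing a regex with a validation algorithm is the kind of technique the authors credit with cutting false positives. All names, thresholds, and heuristics here are assumptions for illustration, not the authors' implementation.

```python
import csv
import random
import re

# Naive pattern for card-like numbers; alone it over-matches badly.
CARD_RE = re.compile(r"^\d{13,19}$")

def luhn_valid(number: str) -> bool:
    """Checksum test that weeds out regex false positives for card numbers."""
    digits = [int(d) for d in number[::-1]]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def profile_column(values):
    """Crude per-column profile: inferred type plus a confidentiality flag."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return {"inferred_type": "unknown", "confidential": False}
    numeric = sum(v.replace(".", "", 1).isdigit() for v in non_empty)
    card_like = sum(1 for v in non_empty if CARD_RE.match(v) and luhn_valid(v))
    return {
        "inferred_type": "numeric" if numeric > 0.9 * len(non_empty) else "text",
        "confidential": card_like > 0.5 * len(non_empty),
    }

def profile_sample(path, sample_rate=0.01):
    """Profile a random fraction of the rows instead of the full file."""
    with open(path, newline="") as fh:
        rows = [r for r in csv.DictReader(fh) if random.random() < sample_rate]
    return {col: profile_column([r[col] for r in rows]) for col in rows[0]}
```

In this toy version, a column of 16-digit strings is only flagged as confidential when most values also pass the Luhn check, which is the basic mechanism by which an algorithmic test reduces the false-positive rate of a bare pattern match.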

2018
Vol 5 (3)
pp. 132-149
Author(s):  
Lennart Hammerström

Abstract Although many would argue that the most important factor in the success of a big data project is the analysis of the data itself, it is even more important to staff, structure, and organize the participants involved, to ensure efficient collaboration within the team and effective use of the tool sets, the relevant applications, and a customized flow of information. The main challenges of big data projects originate from the number of people who must collaborate, the need for advanced and specialized training, the approach chosen to solve an analytical problem that is in many cases ill-defined, the data set itself (structured or unstructured), and the required hardware and software (such as analysis software or self-learning algorithms). Today, neither an organizational framework nor overarching guidelines are available for creating a high-performance analytics team and integrating it into the organization. This paper builds upon (a) the organizational design of a team for a big data project, (b) the relevant roles and competencies (such as programming or communication skills) of the team members, and (c) the form in which they are connected and managed.


Author(s):  
Saifuzzafar Jaweed Ahmed

Big Data has become a very important part of all industries and organizational sectors. Sectors such as energy, banking, retail, hardware, and networking all generate huge amounts of unstructured data, which must be processed and analyzed accurately into a structured form before it can reveal useful information for business growth. Big Data helps extract useful information from unstructured or heterogeneous data by analyzing it. Big data was initially defined by the volume of a data set. Big data sets are generally huge, measuring tens of terabytes and sometimes crossing the threshold of petabytes, and the size of big data is growing at a fast pace, from terabytes to exabytes. Today, big data falls under three categories: structured, unstructured, and semi-structured. Big data also requires techniques that help integrate and process huge amounts of heterogeneous data. Data analysis, a core big data process, has applications in areas such as business processing, disease prevention, and cybersecurity. Big data has three major issues: data storage, data management, and information retrieval. Big data processing requires a particular setup of hardware and virtual machines to derive results, and the processing is performed in parallel to obtain results as quickly as possible. Current big data processing techniques include text mining and sentiment analysis. Text analytics is a very large field comprising several techniques, models, and methods for the automatic and quantitative analysis of textual data. The purpose of this paper is to show how text analysis and sentiment analysis process unstructured data, and how these techniques extract meaningful information and thus make it available to various data mining (statistical and machine learning) algorithms.
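To make the text-analytics step concrete, here is a minimal, hedged sketch (not the paper's implementation) that converts unstructured review text into a structured TF-IDF matrix and trains a sentiment classifier on it; the tiny corpus and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; real work would use a labeled corpus.
texts = ["great product, works perfectly", "terrible support, very slow",
         "love the new update", "waste of money, broke quickly"]
labels = ["pos", "neg", "pos", "neg"]

# TF-IDF converts raw text into a structured numeric matrix, which any
# downstream statistical or machine learning algorithm can consume.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["the update is great", "support was terrible"]))
```

The vectorizer is the bridge the abstract describes: once unstructured text is mapped into a numeric feature space, the choice of downstream mining algorithm becomes interchangeable.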


2019
Vol 53 (2)
pp. 217-229
Author(s):
Xiaomei Wei
Yaliang Zhang
Yu Huang
Yaping Fang

Purpose: The traditional drug development process is costly, time-consuming, and risky. Using computational methods to discover drug repositioning opportunities is a promising and efficient strategy in the era of big data. The explosive growth of large-scale genomic and phenotypic data and all kinds of "omics" data brings opportunities for developing new computational drug repositioning methods based on big data. The paper aims to discuss this issue.
Design/methodology/approach: Here, a new computational strategy is proposed for inferring drug–disease associations from rich biomedical resources toward drug repositioning. First, the network embedding (NE) algorithm is adopted to learn the latent feature representation of drugs from multiple biomedical resources. Then, on the basis of the latent drug vectors from the NE module, a binary support vector machine classifier is trained to divide unknown drug–disease pairs into positive and negative instances. Finally, the model is validated on a well-established drug–disease association data set with tenfold cross-validation.
Findings: The model achieves an area under the receiver operating characteristic curve of 90.3 percent, which is comparable to those of similar systems. The authors also analyze the performance of the model and validate its effect on predicting new indications of old drugs.
Originality/value: This study shows that the authors' method is predictive, identifying novel drug–disease interactions for drug discovery. The new feature learning methods also contribute positively to heterogeneous data integration.
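The classification stage maps naturally onto a short sketch. The following assumes the network-embedding step has already produced latent vectors for drug–disease pairs (synthetic here) and reproduces only the SVM-plus-tenfold-AUC evaluation pattern described above; it is a sketch of the protocol, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # latent vectors for drug-disease pairs (synthetic)
y = rng.integers(0, 2, size=200)  # 1 = known association, 0 = negative instance

# Binary SVM scored with tenfold cross-validated AUC, mirroring the
# evaluation protocol described in the abstract.
clf = SVC(kernel="rbf")
auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC: {auc.mean():.3f}")
```

With random features the AUC hovers around 0.5; the reported 90.3 percent is what good embeddings buy over this baseline.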


2015
Vol 7 (1)
pp. 17-32
Author(s):
J.S. Shyam Mohan
P. Shanmugapriya
Bhamidipati Vinay Pawan Kumar

Abstract Finding the most widely used URLs from online shopping sites for a particular category is a difficult task, as there are many heterogeneous and multi-dimensional data sets that depend on various factors. Traditional data mining methods are limited to homogeneous data sources, so they fail to sufficiently consider the characteristics of heterogeneous data. This paper presents a consistent Big Data mining search that performs analytics on text data to find the top-rated URLs. Though many heuristic search methods are available, the proposed method solves the search problem more effectively than traditional data mining methods. The sample results are obtained in optimal time and compare favourably with other methods in effectiveness and efficiency.
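The abstract does not spell out its search algorithm, so the following is only a loose sketch of the task it targets: aggregating ratings attached to URLs in heterogeneous text records and returning the top-rated URLs for a category. The record format and helper names are invented for illustration.

```python
import re
from collections import defaultdict

records = [  # hypothetical rows merged from several shopping sites
    "electronics https://shopA.example/item/1 rating:4.5",
    "electronics https://shopB.example/item/9 rating:3.8",
    "electronics https://shopA.example/item/1 rating:4.9",
]

ROW_RE = re.compile(r"(\S+)\s+(https?://\S+)\s+rating:([\d.]+)")

def top_urls(rows, category, k=10):
    """Average the ratings per URL within a category, highest first."""
    scores = defaultdict(list)
    for row in rows:
        m = ROW_RE.match(row)
        if m and m.group(1) == category:
            scores[m.group(2)].append(float(m.group(3)))
    ranked = sorted(scores.items(), key=lambda kv: -sum(kv[1]) / len(kv[1]))
    return ranked[:k]

print(top_urls(records, "electronics"))
```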


2020
Vol 12 (14)
pp. 5595
Author(s):
Ana Lavalle
Miguel A. Teruel
Alejandro Maté
Juan Trujillo

Fostering sustainability is paramount for Smart City development. Lately, Smart Cities have benefited from the rise of Big Data coming from IoT devices, leading to improvements in monitoring and prevention. However, monitoring and prevention processes require visualization techniques as a key component. Indeed, in order to prevent possible hazards (such as fires or leaks) and optimize their resources, Smart Cities require adequate visualizations that provide insights to decision makers. Nevertheless, visualization of Big Data has always been a challenging issue, especially when such data originate in real time. This problem becomes even bigger in Smart City environments, since we have to deal with many different groups of users and multiple heterogeneous data sources. Without a proper visualization methodology, complex dashboards that include data of differing natures are difficult to understand. In order to tackle this issue, we propose a methodology based on visualization techniques for Big Data, aimed at improving the evidence-gathering process by assisting users in decision making in the context of Smart Cities. Moreover, in order to assess the impact of our proposal, a case study based on service calls for a fire department is presented, with our findings applied to data coming from citizen calls. The results of this work thus contribute to the optimization of resources, namely fire extinguishing battalions, helping to improve their effectiveness and, as a result, the sustainability of a Smart City, which can operate better with fewer resources. Finally, in order to evaluate the impact of our proposal, we have performed an experiment with non-expert users in data visualization.
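As a minimal sketch of the kind of decision-maker view the methodology argues for (not the authors' toolchain), the following aggregates hypothetical citizen-call records to the granularity a fire battalion planner might need and renders a single focused chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical citizen-call records; real input would stream from city systems.
calls = pd.DataFrame({
    "timestamp": pd.date_range("2020-07-01", periods=96, freq="h"),
    "district": ["north", "south"] * 48,
})

# Aggregate to what a planner needs: calls per day per district.
daily = calls.groupby([calls["timestamp"].dt.date, "district"]).size().unstack()

daily.plot(kind="bar", stacked=True)
plt.ylabel("service calls")
plt.title("Fire department calls per district")
plt.tight_layout()
plt.show()
```

The design point is that one aggregation level and one chart type are chosen per user group, rather than exposing every raw stream on a single dashboard.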


2021
Vol 8 (1)
Author(s):
Ikbal Taleb
Mohamed Adel Serhani
Chafik Bouhaddioui
Rachida Dssouli

Abstract Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data is all about data: how it is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in data quality may result in unpredictable consequences; in this case, confidence and worthiness in the data and its source are lost. In the Big Data context, data characteristics such as volume, multiple heterogeneous data sources, and fast data generation increase the risk of quality degradation and require efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a very costly and time-consuming process, since excessive computing resources are required. Maintaining quality through the Big Data lifecycle requires quality profiling and verification before any processing decision. A BDQ Management Framework is proposed for enhancing pre-processing activities while strengthening data control. The framework uses a new concept called the Big Data Quality Profile, which captures the quality outline, requirements, attributes, dimensions, scores, and rules. Using the framework's Big Data profiling and sampling components, a faster and more efficient data quality estimation is initiated before and after an intermediate pre-processing phase. The framework's exploratory profiling component plays an initial role in quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions, and it generates quality rules by applying various pre-processing activities and their related functions. These rules feed into the Data Quality Profile and result in quality scores for the selected quality attributes. The framework implementation and dataflow management across the various quality management processes are discussed, and ongoing work on framework evaluation and deployment to support quality evaluation decisions concludes the paper.
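A hedged sketch of the quality-profile idea follows: score a small random sample of a table against a few predefined quality dimensions before committing resources to full pre-processing. The metrics and the sampling fraction are illustrative assumptions, not the framework's actual definitions.

```python
import pandas as pd

def quality_profile(df: pd.DataFrame, sample_frac: float = 0.05) -> dict:
    """Estimate per-column quality scores from a small random sample."""
    sample = df.sample(frac=sample_frac, random_state=42)
    profile = {}
    for col in sample.columns:
        s = sample[col]
        profile[col] = {
            "completeness": 1.0 - s.isna().mean(),            # share of non-missing values
            "uniqueness": s.nunique() / max(len(s), 1),       # distinct-value ratio
            "consistency": float(s.map(type).nunique() == 1), # single runtime type
        }
    return profile

# A downstream quality rule could then gate pre-processing, e.g. trigger
# imputation only for columns whose completeness score falls below 0.9.
```

Because the scores come from a sample rather than the full data set, the profile can be recomputed cheaply before and after each pre-processing phase, which is the cost argument the abstract makes.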


Author(s):  
Yihao Tian

Big data is an unstructured data set of considerable volume, coming in various formats from various sources such as the internet and business organizations. Predicting consumer behavior is a core responsibility for most dealers. Market research can show consumer intentions, but it is a big order for even a well-designed research project to penetrate the veil that protects real customer motivations from closer scrutiny. Customer behavior modelling usually focuses on customer data mining, and each model is structured at one stage to answer one query. Customer behavior prediction is a complex and unpredictable challenge. In this paper, advanced mathematical and big data analytical (BDA) methods are applied to predict customer behavior. Predictive behavior analytics can provide modern marketers with multiple insights to optimize efforts in their strategies. The model goes beyond analyzing historical evidence, using mathematical methods to make the most knowledgeable assumptions about what will happen in the future. Although the method is complex, it is quite straightforward for most customers. As a result, most consumer behavior models include so many variables that the predictions they produce with big data are usually quite accurate. This paper develops an association rule mining model to predict customers' behavior, improve accuracy, and derive major consumer data patterns. The findings show that the recommended BDA method improves big data analytics usability in the organization (98.2%), risk management ratio (96.2%), operational cost (97.1%), customer feedback ratio (98.5%), and demand prediction ratio (95.2%).
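The abstract names association rule mining as the core model but does not publish it; the sketch below shows one conventional way to mine such rules, using mlxtend's apriori over a hypothetical one-hot purchase matrix. The library choice, data, and thresholds are all assumptions for illustration.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transaction matrix: rows are customers, columns are products.
baskets = pd.DataFrame(
    [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 1, 1, 1]],
    columns=["bread", "milk", "eggs", "butter"],
).astype(bool)

# Frequent itemsets above a minimum support, then rules above a confidence bar.
frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.75)

# Each rule's antecedents predict its consequents for future customers.
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

A rule such as {bread, milk} -> {butter} is then the prediction unit: a customer whose basket matches the antecedent is forecast to exhibit the consequent behavior.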


2021
Vol ahead-of-print (ahead-of-print)
Author(s):
Yusheng Lu
Jiantong Zhang

Purpose: The digital revolution, and the use of big data (BD) in particular, has important applications in the construction industry, where massive amounts of heterogeneous data need to be analyzed to improve onsite efficiency. This article presents a systematic review and identifies future research directions, presenting conclusions derived from rigorous bibliometric tools. The results of this study may provide guidelines for construction engineering and global policymaking to change the current low efficiency of construction sites.
Design/methodology/approach: This study identifies research trends from 1,253 peer-reviewed papers, using general statistics, keyword co-occurrence analysis, critical review, and qualitative-bibliometric techniques in two rounds of search.
Findings: The number of studies in this area increased rapidly from 2012 to 2020. A significant number of publications originated in the UK, China, the US, and Australia; the smallest number from one of these countries is more than twice the largest number among the remaining countries. Keyword co-occurrence divides into three clusters: BD application scenarios, emerging technology in BD, and BD management. Approaches currently being developed in BD analytics include machine learning, data mining, and heuristic-optimization algorithms such as graph convolutional networks, recurrent neural networks, and natural language processing (NLP). Studies have focused on safety management, energy reduction, and cost prediction. Blockchain integrated with BD is a promising means of managing construction contracts.
Research limitations/implications: The study of BD is in a stage of rapid development, and this bibliometric analysis is only a part of the necessary practical analysis.
Practical implications: National policies, temporal and spatial distribution, and BD flow are interpreted, and the results may provide guidelines for policymakers. Overall, this work may develop the body of knowledge, producing a reference point and identifying future development.
Originality/value: To our knowledge, this is the first bibliometric review of BD in the construction industry. This study can also benefit construction practitioners by providing them a focused perspective on BD for emerging practices in the construction industry.
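Keyword co-occurrence analysis, the backbone of the clustering reported above, reduces to a simple pairwise count; the sketch below shows the idea on an invented three-paper corpus. Dedicated tools such as VOSviewer automate this at scale, and the paper's own pipeline is not reproduced here.

```python
from collections import Counter
from itertools import combinations

papers = [  # hypothetical author-keyword lists from a reviewed corpus
    ["big data", "construction", "safety management"],
    ["big data", "machine learning", "cost prediction"],
    ["big data", "construction", "blockchain"],
]

# Count how often each pair of keywords appears in the same paper;
# clustering algorithms then group keywords over this co-occurrence graph.
cooccurrence = Counter()
for keywords in papers:
    for pair in combinations(sorted(set(keywords)), 2):
        cooccurrence[pair] += 1

for pair, count in cooccurrence.most_common(5):
    print(pair, count)
```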


Author(s):  
Dominik Krimpmann
Anna Stühmeier

Big Data and Analytics have become key concepts within the corporate world, both commercially and from an information technology (IT) perspective. This paper presents the results of a global quantitative study of 400 IT leaders from different industries, which examined their attitudes toward dedicated roles for an Information Architect and a Data Scientist. The results illustrate the importance of these roles at the intersection of business and technology. They also show that to build sustainable, quantifiable business results and define an organization's competitive positioning, both roles need to be dedicated rather than spread across different people. The research also showed that these dedicated roles contribute actively to a sustainable competitive positioning, driven mainly by the visualization of complex matters.

