web crawling
Recently Published Documents


TOTAL DOCUMENTS: 255 (five years: 62)
H-INDEX: 17 (five years: 2)

2022, Vol 32 (3), pp. 1617-1632
Author(s):  
S. Neelakandan ◽  
A. Arun ◽  
Raghu Ram Bhukya ◽  
Bhalchandra M. Hardas ◽  
T. Ch. Anil Kumar ◽  
...  

Author(s):  
Palika Jajoo

Web crawling is the method by which topics and information are browsed on the World Wide Web and stored in large storage systems, from which users can access them as needed. This paper explains the use of web crawling in the digital world and the difference it makes for search engines. A variety of web crawling techniques exist, and they are explained briefly in this paper. Web crawlers have many advantages over traditional methods of searching for information online, and many tools are available that support web crawling and make the process easy.
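As a rough illustration of the browse-and-store loop this abstract describes, the following is a minimal Python sketch of a breadth-first crawler. It assumes the widely used requests and beautifulsoup4 packages; the seed URL, page limit, and in-memory store are illustrative stand-ins for a real crawler's fetcher and storage device.

```python
# A minimal breadth-first crawl sketch: fetch a page, store its text,
# queue the links it contains. Assumes `requests` and `beautifulsoup4`
# are installed; seed URL and page limit are illustrative.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=50):
    seen, queue, store = {seed_url}, deque([seed_url]), {}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        store[url] = soup.get_text(" ", strip=True)  # stand-in for the "storage device"
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"])
            if urlparse(nxt).scheme in ("http", "https") and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return store
```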


Author(s):  
Moaiad Khder

Web scraping or web crawling refers to the procedure of automatic extraction of data from websites using software. It is a process that is particularly important in fields such as Business Intelligence in the modern age. Web scraping is a technology that allows us to extract structured data from text such as HTML, and it is extremely useful in situations where data is not provided in a machine-readable format such as JSON or XML. Using web scraping to gather data allows us, for example, to collect prices in near real time from retail store sites along with further details; web scraping can also be used to gather intelligence on illicit businesses, such as drug marketplaces on the darknet, providing law enforcement and researchers with valuable data, such as drug prices and varieties, that would be unavailable through conventional methods. It has been found that using a web scraping program yields data that is far more thorough, accurate, and consistent than manual entry. Based on these results, it is concluded that web scraping is a highly useful tool in the information age and an essential one in many modern fields. Implementing web scraping properly requires multiple technologies, such as spidering and pattern matching, which are discussed. This paper looks into what web scraping is, how it works, its stages and technologies, how it relates to Business Intelligence, artificial intelligence, data science, big data, and cybersecurity, and how it can be done with the Python language; it also covers some of the main benefits of web scraping and what the future of web scraping may look like, with a special degree of emphasis placed on highlighting the ethical and legal issues. Keywords: Web Scraping, Web Crawling, Python Language, Business Intelligence, Data Science, Artificial Intelligence, Big Data, Cloud Computing, Cybersecurity, Legal, Ethical.
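Since the abstract names Python and pattern matching as the implementation route, here is a minimal sketch of that stage: turning retail-page HTML into structured price records. The CSS classes ("product", "name", "price") and the price pattern are hypothetical; any real site requires inspecting its actual markup.

```python
# A minimal scraping sketch turning HTML into structured records.
# The CSS selectors below are hypothetical placeholders.
import re

import requests
from bs4 import BeautifulSoup

def scrape_prices(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    records = []
    for item in soup.select("div.product"):
        name = item.select_one(".name")
        price = item.select_one(".price")
        if not (name and price):
            continue
        # pattern matching: pull a numeric amount out of free-form price text
        match = re.search(r"(\d+(?:\.\d{2})?)", price.get_text())
        if match:
            records.append({"name": name.get_text(strip=True),
                            "price": float(match.group(1))})
    return records  # structured data, ready for JSON/CSV export
```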


2021, Vol 10 (5), pp. 01-07
Author(s):  
Ikechukwu Onyenwe ◽  
Ebele Onyedinma ◽  
Chidinma Nwafor ◽  
Obinna Agbata

Websites are regarded as domains of limitless information which anyone and everyone can access. New trends in technology have shaped the way we do and manage our businesses. Today, advancements in Internet technology have given rise to a proliferation of e-commerce websites. This, in turn, has made the activities and lifestyles of marketers/vendors, retailers, and consumers (collectively regarded as users in this paper) easier, as it provides convenient platforms for selling and ordering items over the Internet. Unfortunately, these desirable benefits are not without drawbacks, as these platforms require users to spend a lot of time and effort searching for the best product deals, product updates, and offers on e-commerce websites. Furthermore, users need to filter and compare search results by themselves, which takes a lot of time and can produce ambiguous results. In this paper, we applied web crawling and scraping methods on an e-commerce website to obtain HTML data for identifying product updates based on the current time. These HTML data are preprocessed to extract product details such as name, price, and post date and time, to serve as useful information for users.
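A minimal sketch of the preprocessing step this abstract describes might look as follows: parse the crawled HTML into (name, price, posted) records and keep only those posted within a recent window. The tag names, attribute, and timestamp format are hypothetical, not the paper's actual markup.

```python
# Sketch: extract product details from crawled HTML and filter by recency.
# Selectors, the data-posted attribute, and the time format are illustrative.
from datetime import datetime, timedelta

from bs4 import BeautifulSoup

def recent_products(html, hours=24):
    soup = BeautifulSoup(html, "html.parser")
    cutoff = datetime.now() - timedelta(hours=hours)
    products = []
    for card in soup.select("li.product-card"):
        posted = datetime.strptime(card["data-posted"], "%Y-%m-%d %H:%M")
        if posted >= cutoff:  # "product updates based on the current time"
            products.append({
                "name": card.select_one(".title").get_text(strip=True),
                "price": card.select_one(".price").get_text(strip=True),
                "posted": posted,
            })
    return products
```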


2021, Vol 13 (21), pp. 11694
Author(s):  
Jaehong Kim ◽  
Sangpil Youm ◽  
Yongwei Shan ◽  
Jonghoon Kim

Fire safety on construction sites has rarely been studied because fire accidents occur less frequently than construction's "Fatal Four". Despite their lower frequency, construction fire accidents tend to have more severe impacts. This study aims to use news media data and big data analysis techniques to identify patterns and factors related to fire accidents on construction sites. News reports on various construction accidents were first collected through web crawling. Then, the authors identified the level of media exposure for various keywords related to construction accidents and analyzed the similarities between them. The results show that the level of media exposure for fire accidents on construction sites is much higher than for fall accidents, which suggests that fire accidents may have a greater impact on their surroundings than other accidents. It was found that the main causes of fire accidents on construction sites are violations of fire safety regulations and the absence of inspections, both of which are preventable. This study contributes to the body of knowledge by exploring factors related to fire safety on construction sites and their interrelationships, as well as by providing evidence that the fire accident type should be emphasized in safety-related regulations and codes for construction sites.
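A toy sketch of the exposure-and-similarity analysis described here: count how many crawled articles mention each accident keyword, then compare keywords by the cosine similarity of their article co-occurrence vectors. The keyword list and the choice of cosine similarity are assumptions; the paper's actual keywords and measures may differ.

```python
# Sketch: media exposure per keyword and pairwise keyword similarity,
# computed over a list of crawled article texts. Keywords are illustrative.
import math

KEYWORDS = ["fire", "fall", "electrocution", "struck-by"]

def exposure_and_similarity(articles):
    # one binary occurrence vector per keyword, one slot per article
    vectors = {kw: [int(kw in art.lower()) for art in articles]
               for kw in KEYWORDS}
    exposure = {kw: sum(v) for kw, v in vectors.items()}  # media exposure level

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    similarity = {(k1, k2): cosine(vectors[k1], vectors[k2])
                  for i, k1 in enumerate(KEYWORDS)
                  for k2 in KEYWORDS[i + 1:]}
    return exposure, similarity
```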


Author(s):  
Ahrii Kim ◽  
Yunju Bak ◽  
Jimin Sun ◽  
Sungwon Lyu ◽  
Changmin Lee

With the advent of Neural Machine Translation, the more often human-machine parity is claimed at WMT, the more we come to ask ourselves whether the evaluation environment can be trusted. In this paper, we argue that the low quality of the source test set of the news track at WMT may lead to an overrated human parity claim. First, we report nine types of so-called technical contaminants in the data set, originating from the absence of meticulous inspection after web crawling. Our empirical findings show that when they are corrected, about 5% of the segments that previously achieved a human parity claim turn out to be statistically invalid. This tendency becomes more evident when only the contaminated sentences are considered. To the best of our knowledge, this is the first attempt to question the "source" side of the test set as a potential cause of the overclaim of human parity. We present evidence for this phenomenon: according to sentence-level TER scores, these trivial errors change a good part of the system translations. We conclude that overlooking them would be a mistake, especially when it comes to NMT evaluation.
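One way to picture the sentence-level TER check mentioned above is the following sketch: translate a segment before and after its contaminant is fixed, then score how much the system output shifted. It assumes the sacrebleu package's TER implementation; translate is a stand-in for whatever NMT system is under evaluation, not the paper's setup.

```python
# Sketch: measure how much a system's output changes when a trivial
# source-side contaminant is corrected, via sentence-level TER.
# Assumes the `sacrebleu` package; `translate` is a placeholder.
from sacrebleu.metrics import TER

ter = TER()

def output_shift(translate, raw_src, clean_src):
    """TER between translations of the contaminated and corrected source."""
    raw_out = translate(raw_src)      # translation of the raw, crawled segment
    clean_out = translate(clean_src)  # same segment with the contaminant fixed
    # treat the clean-source translation as the reference
    return ter.sentence_score(raw_out, [clean_out]).score
```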


2021, Vol 23 (3), pp. 87-104
Author(s):  
Daniel Borysowski

The author of this article has collected approximately 2.7 million Russian-language online news items. The main aims of this text are: to discuss the concept of web crawling in relation to acquiring textual data from the Internet, to discuss the issue of structuring such data in unannotated text corpora, and to present selected aspects of the analysis of data structured in this way. The author treats online news items as a combination of the main text and the metadata that identifies and characterizes it (distinguished during their automatic extraction from web pages). Separating news into main text and metadata makes it possible to analyze them from two perspectives, textual and meta-informational (and additionally, e.g. for chronologization studies, from a perspective combining both levels). The author complements this outline of possible linguistic studies of the collected material with an evaluation of selected multi-word units extracted from these texts using the delimiting function of quotation marks.
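As a loose illustration of the two-level structure and the quotation-mark extraction described above, here is a small Python sketch. The metadata fields and the regular expression (covering Russian guillemets « » as well as plain quotes) are illustrative assumptions, not the author's actual corpus schema.

```python
# Sketch: a news item as main text plus identifying metadata, and
# extraction of quote-delimited multi-word units from the text.
import re

QUOTED = re.compile(r'[«"]([^»"]{2,80})[»"]')

def to_record(url, published, title, body):
    """Split a crawled news item into main text and its metadata."""
    return {"text": body,
            "meta": {"url": url, "published": published, "title": title}}

def quoted_units(record):
    """Multi-word units delimited by quotation marks in the main text."""
    return [m for m in QUOTED.findall(record["text"]) if " " in m]
```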


2021, pp. 291-300
Author(s):  
Kapil Madan ◽  
Rajesh Bhatia
