Big Data

Great Reads

Hierarchy of Categories and Classifying Wikipedia Articles using XML Dump

by Ilia Reznik, Vladimir Shatalov

How to classify articles on Wikipedia using XML dump

Apache Spark/Cassandra 2 of 2

by Sacha Barber

Looking at Spark/Cassandra working together

BriefMaker – An App for Processing Real-Time Market Data

by Ryan Scott White

Converts past and real-time stock market tick data into time-sliced summaries called Briefs

How to Delete a Large Folder in Windows

by Joezer BH

Explains the benefits of using the command line in the large folder delete case and shows an example of the syntax

Latest Articles

Hadoop Beginners Guide - How to Install

by Fazlur Rahman

Step by step procedure to install Hadoop 2.7.3 version on Ubuntu 16.04 operating system

BriefMaker – An App for Processing Real-Time Market Data

by Ryan Scott White

Converts past and real-time stock market tick data into time-sliced summaries called Briefs

Parsing Wikipedia XML Dump

by Ilia Reznik, Vladimir Shatalov

Parser for Wikipedia pages from XML dump is presented. Extraction of biographical data and categories with their parents is shown as an example.

Hierarchy of Categories and Classifying Wikipedia Articles using XML Dump

by Ilia Reznik, Vladimir Shatalov

How to classify articles on Wikipedia using XML dump

All Articles

Big Data

Learning Machine Learning, Part 1: An Introduction

18 Jan 2017 by Alibaba Cloud

This post features a basic introduction to Machine Learning. This post on Machine Learning will not only help you to understand the latest trends in the Internet industry, but increase your understanding of the technology that plays a major role in many services that make our lives easier.

machine-learning

Cloud and the Era of AR/VR Technology: What's Next

18 Jan 2017 by Alibaba Cloud

Cloud and the Era of AR/VR Technology: What's Next

The Secret to Success on IoT for Business

18 Jan 2017 by Alibaba Cloud

Connected devices – popularly known as the Internet of Things (IoT) or ubiquitous computing – represent a tremendous potential for the enhancement of social and business life, and a brand new frontier for market growth.

5 Tips to Get the Most out of Your Cloud Infrastructure

18 Jan 2017 by Alibaba Cloud

Here are five top tips from our expert team to help you maximize the benefits of your cloud infrastructure.

Learning Machine Learning, Part 3: Application

14 Feb 2017 by Alibaba Cloud

This post features a basic introduction to machine learning (ML). You don’t need any prior knowledge about ML to get the best out of this article. Before getting started, let’s address this question: "Is ML so important that I really need to read this post?"

artificial-intelligence

parallel-processing

machine-learning

Loaded library lib-native-libhadoop.so.1.0.0 might have disabled stack guard. How to resolve it?

25 Jul 2018 by anjitaa

Loaded library lib-native-libhadoop.so.1.0.0 might have disabled stack guard. How to resolve it? What I have tried: I have tried loaded library lib-native-libhadoop I think 1.0.0 might have disabled stack guard

In mapreduce how to sort intermediate output based on values?

27 Jul 2018 by anjitaa

"MapReduce sorts the intermediate data (between the mapper and reducer phases) by key by default. If we want the data sorted by value, we need secondary sorting. There are two approaches to achieve this. 1. If reducers will get all the value for a particular key and buffer...

Explain the word count implementation via hadoop framework?

28 Jul 2018 by anjitaa

What do you understand by Word Count implementation via Hadoop framework? Explain in detail. What I have tried: I am not able to implement Word Count via the Hadoop framework.

Why not a single point of contact, only namenode can be used for handling all read/write requests in HDFS?

31 Jul 2018 by anjitaa

"No, it is not feasible given the distributed architecture of HDFS. If ‘n’ clients issue read/write requests simultaneously, the overhead on the NameNode increases. To avoid these bottlenecks, a distributed computing architecture in master-slave fashion is proposed."

How to enable trash/recycle bin in hadoop?

16 Aug 2018 by anjitaa

"To enable the trash feature and to set the time delay for the trash removal in Hadoop, we have to edit the fs.trash.interval property in core-site.xml to the delay (and this has to be in minutes). Ex: if you want users to have 10 hours (600 minutes) to restore a deleted file, you should specify...

How to configure hadoop to reuse JVM for mappers?

20 Aug 2018 by anjitaa

"To configure Hadoop to reuse the JVM for mappers, we just need to add an entry to the configuration file $HADOOP_HOME/conf/mapred-site.xml: mapred.job.reuse.jvm.num.tasks = -1. We need to specify a number for how many times the JVM is to be reused...

In mapreduce how to sort intermediate output based on values?

27 Jul 2018 by Bansal himani

How to sort intermediate output based on values in MapReduce? What I have tried: How to sort intermediate output based on values in MapReduce?

Explain the word count implementation via hadoop framework?

28 Jul 2018 by Bansal himani

"Word Count Implementation will be as follows: For ex: Input File 1 contains data: “This is December Month.” Input File 2 contains data: “December is the last month of the year.” Step 1: Mapper will generate the following output: Input File 1 output ...

How to enable trash/recycle bin in hadoop?

16 Aug 2018 by Bansal himani

How can I enable Trash/Recycle Bin in Hadoop? What I have tried: I was not able to enable Trash/Recycle Bin in Hadoop.

How to optimize query running on hive

12 Dec 2020 by BedantBiswal

Below is my query, which takes around 5k mappers and 1k reducers and around 2.2 hours to finish. Is there any scope for optimization here? What I have tried: SELECT sum(B.item_net_amount) net_amount, sum(B.item_gross_amount) gross_amount,...

Is there a clear procedure to install HDInsight on a Windows platform?

6 Mar 2015 by BillWoodruff

http://azure.microsoft.com/en-...

What is big data & what classifies as big data?

1 Apr 2016 by Chendur Srinivasan

I have gone through a lot of articles but I don't seem to get a perfectly clear answer on what exactly BIG DATA is. On one page I saw "any data which is bigger for your usage, is big data i.e. 100 MB is considered big data for your mailbox but not your hard disc". Whereas another article said...

Cloudera hadoop - daemon process not running

4 Mar 2016 by Chendur Srinivasan

I'm self-learning Hadoop and started off by installing Cloudera QuickStart on a VMware Workstation running CentOS. I was under the impression that the QuickStart VM has most of the configurations predefined. Do I need to set up any other configurations to set up data and name node? Reason being...

Requirements Eng for Traditional and Big Data Business Intelligence

27 Apr 2016 by Daniel Joubert

Comparing Requirements Engineering for Traditional and BigData Business Intelligence

The Basics of Azure Data Factory

20 Oct 2018 by DataBytzAI

How to transform raw data into actionable business insights with Azure Data Factory

Plotting top 10 values in big data Python

13 Jul 2022 by E L 2022

I need help plotting some categorical and numerical values in Python. The code is given below: %%time import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns %%time ...

Beginners Guide - Introduction of Big Data & Hadoop

21 Jan 2017 by Fazlur Rahman

What is Big Data, and how was Hadoop introduced to overcome the problems associated with Big Data?

Hadoop Beginners Guide - How To Setup Developer Environment

13 Feb 2017 by Fazlur Rahman

Step by step procedure to install NetBeans on Ubuntu 16.04 operating system with Hadoop 2.7.3 version. This may work for any other versions of Hadoop and Ubuntu.

Hadoop Beginners Guide - How to Install

22 May 2022 by Fazlur Rahman

Step by step procedure to install Hadoop 2.7.3 version on Ubuntu 16.04 operating system

How do I implement a data lake solution

7 Jul 2016 by GoodyGoodyGoody

I need advice on implementing a data lake. Any good references or examples of how to implement the data lake concept (tutorial), or pointing me in the right direction, will suffice. Thanks in advance. What I have tried: I'm looking to set this up for my organization and I have no idea...

Creating OHLC bars from un-homogenized ticks using LINQ

13 Mar 2015 by GoogleMonster

Hi All, I am trying to create OHLC data from un-homogenised data. I have googled and discovered an article at Stack Overflow, "How to group a time series by interval (OHLC bars) with LINQ", which to be honest I have found to be really useful. However, the results I get are not in line with...

Run Your First Big Data Example with Microsoft HDInsight

3 Jan 2016 by Hadrich Mohamed

Working with Hadoop in the big data domain is very interesting, especially given the growth of data in this era. In this tip, we are going to run our first big data example using Microsoft's well-known tool, HDInsight.

How to use weka.smote sampling algorithm in java code?

23 Feb 2018 by HusseinAl-haj

I have a question about the correct way to use the SMOTE sampling algorithm. I have read a lot about this algorithm. I am forced to use SMOTE within my code, so I can't use any tools like KNIME or WEKA. After a few days of searching, I can say that there are two implementations of SMOTE, one in R...

machine-learning

Parsing Wikipedia XML Dump

10 Apr 2021 by Ilia Reznik, Vladimir Shatalov

Parser for Wikipedia pages from XML dump is presented. Extraction of biographical data and categories with their parents is shown as an example.

Hierarchy of Categories and Classifying Wikipedia Articles using XML Dump

9 Apr 2021 by Ilia Reznik, Vladimir Shatalov

How to classify articles on Wikipedia using XML dump