What is Big Data?
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
Big Data is still data, just at a huge scale. The term describes a collection of data that is huge in volume and keeps growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.
The problem is this: our storage and compute resources are limited (on the order of terabytes or petabytes), but as technology advances we generate more and more data (petabytes, heading toward yottabytes). The more data we collect, the more we run into insufficient storage. So we can say that data is business, but Big Data is a problem for the world.
Problems due to Big Data:-
i) Volume: The name Big Data itself refers to enormous size. The size of the data plays a crucial role in determining how much value can be extracted from it, and whether a given data set counts as Big Data at all depends on its volume. Hence, ‘Volume’ is one characteristic that must be considered when dealing with Big Data.
ii) Velocity: The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
iii) Variety: Variety refers to heterogeneous sources and the nature of the data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications.
iv) Variability: This refers to the inconsistency that the data can show at times, which hampers the ability to handle and manage it effectively.
Top 5 Sources of Big Data:-
Let us discuss how much data Google, Facebook, etc. handle every day:-
According to Facebook, its data system processes 2.5 billion pieces of content each day, amounting to 500+ terabytes of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily. By processing data within minutes, Facebook can roll out new products, understand user reactions, and modify designs in near real-time.
Facebook says that it scans roughly 105 TB of data each half hour. While 500 TB is a lot of data, that’s a mere drop in the bucket compared to the amount of data stored in a single Facebook Hadoop disk cluster. According to Facebook’s VP of engineering, Jay Parikh, Facebook’s Hadoop disk cluster has 100 petabytes of data.
Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide. Google stores and handles all of this data in its data centers. Google doesn’t hold the biggest data centers, but it still handles a huge amount of data. A data center normally holds petabytes to exabytes of data.
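The per-second figure can be converted into daily and yearly totals with a quick calculation. The snippet below is only a sanity check on the numbers quoted above; the variable names are illustrative, and it assumes a constant rate of exactly 40,000 queries per second (the quoted rate is "over" that, so the real totals are somewhat higher).

```python
# Sanity check of the search-volume figures quoted above.
# Assumption: a constant average rate of exactly 40,000 queries per second.
QUERIES_PER_SECOND = 40_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY
queries_per_year = queries_per_day * 365

print(f"per day:  {queries_per_day / 1e9:.2f} billion")    # per day:  3.46 billion
print(f"per year: {queries_per_year / 1e12:.2f} trillion")  # per year: 1.26 trillion
```

So 40,000 queries per second works out to roughly 3.5 billion searches per day and a little over 1.2 trillion per year, which matches the figures in the text.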
Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.
Amazon, rather than giving us a nice, easy number of petabytes, instead announces the total number of objects stored by its S3 cloud storage service. As of April 2012, Amazon S3 stored 905 billion objects. If we assume an average size of 100KB, that’s around 90 petabytes; if the average size is 1MB, that’s 900 petabytes — almost an exabyte.
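The estimate above is simple multiplication, and it is easy to redo for other average object sizes. Here is a small sketch; the helper name `total_petabytes` is made up for this post, and it assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes). The average object sizes themselves are guesses, as in the text.

```python
# Rough estimate of total S3 storage from the object count quoted above.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 PB = 10**15 bytes).
OBJECT_COUNT = 905_000_000_000  # 905 billion objects (April 2012)
KB = 10**3
MB = 10**6
PB = 10**15


def total_petabytes(object_count: int, avg_object_bytes: int) -> float:
    """Return total storage in petabytes for a given average object size."""
    return object_count * avg_object_bytes / PB


print(total_petabytes(OBJECT_COUNT, 100 * KB))  # 90.5
print(total_petabytes(OBJECT_COUNT, 1 * MB))    # 905.0
```

The result scales linearly with the assumed average size, so the real total could sit anywhere in that range or outside it.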
As the ancestral home of Hadoop, Yahoo is a big user of the open source software. In fact, its 32,000-node cluster is still the largest in the world. Now the Web giant is souping up its massive investment in Hadoop to give it a deep learning environment that’s the envy of the Valley. With more than 600 petabytes of data spread across 40,000 Hadoop nodes, Yahoo is obviously a big believer in all things Hadoop.
Let’s see the overall solution for Big Data:-
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
Hadoop is one of the solutions to Big Data problems. Hadoop uses the concept of distributed storage, implemented by HDFS (Hadoop Distributed File System). A Hadoop cluster follows a master-slave architecture: a large file is split into blocks, and the blocks are distributed across the slave nodes. Note that HDFS cuts a file into fixed-size blocks (128 MB by default in recent versions) rather than into one block per slave node, and it keeps several copies of each block. This way the data is stored persistently, and it can be located easily through the master node, which keeps track of where every block lives.
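To make the idea of distributed storage concrete, here is a small simulation in plain Python. It does not use Hadoop or any HDFS API. The function names (`split_into_blocks`, `place_blocks`), the node names, and the round-robin placement are all assumptions made for illustration; real HDFS uses its own rack-aware placement policy. The block size of 128 MB and the replication factor of 3 match the usual HDFS defaults.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # 128 MB, 128 MB, 44 MB
slave_nodes = ["node1", "node2", "node3", "node4"]  # made-up node names
block_map = place_blocks(len(blocks), slave_nodes, replication=3)

for block_id, holders in block_map.items():
    print(f"block {block_id} ({blocks[block_id] // MB} MB) -> {holders}")
```

In this sketch, `block_map` plays the role of the metadata kept on the master node: it stores no file contents, only which slave nodes hold which block. Because every block lives on three different nodes, losing a single node does not lose any data.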
In a Hadoop HDFS cluster, the master node is also called the name node, and a slave node is also known as a data node.
Hadoop is critical for Yahoo’s business. Yahoo stores all of its data in HDFS, and relies on the power of YARN to bring different computational engines to the data. So it was natural for Cnudde’s team to build a deep learning environment atop Hadoop, and now that it’s built, various teams within Yahoo are being encouraged to find ways to use it.
Thanks for reading……