How is Facebook dealing with Big Data?

Sainadh
4 min read · Aug 21, 2021
Photo by Franki Chamaki on Unsplash

In this blog, you will learn how much data is generated every minute, what problems rising data volumes create, and how big companies are overcoming them.

Introduction to the Big Data Problem

Have you ever wondered how the data we upload to Facebook is stored and shown back to us at any time, and how our messages stay synchronised every time?

Here is a small glimpse of how much data is generated every minute at some well-known companies.

Snapchat: Over 527,760 photos shared by users

LinkedIn: Over 120 professionals join the network

YouTube: 4,146,600 videos watched

Twitter: 456,000 tweets sent or created

Instagram: 46,740 photos uploaded

Netflix: 69,444 hours of video watched

Giphy: 694,444 GIFs served

Tumblr: 74,220 posts published

Skype: 154,200 calls made by users

Big data is a problem because it makes storing, retrieving, querying, and processing data extremely hard at scale. Many companies are overcoming this by using distributed computing and distributed storage.

The five V's you hear about are volume, velocity, variety, veracity, and value. Together they cover every aspect of the data.

Challenges with Big Data

How much data can we store, and at what speed can we store it? How long does it take to query the data and retrieve an answer? Can we analyse it in real time? With so many varieties of data, how do we distinguish between them? And how do we extract value from the data while maintaining its quality?

Let's look at a simple example. If your hard drive takes 10 minutes to store a 1 GB file, then storing 85 GB takes 850 minutes. So if Facebook generates 85 GB per minute, a single drive would need 850 minutes just to store one minute's worth of data.

Let's see whether distributed storage can solve this problem. Gather 10 laptops, connect them over a network, and divide the incoming data between them, so each laptop takes 8.5 GB. Now we can store Facebook's one minute of data within 85 minutes. That is a good sign: even though we are not yet storing data at 85 GB/min, we have drastically reduced the time gap, and adding more laptops shrinks it further, as the sketch below shows.
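Here is a minimal Python sketch of that back-of-the-envelope math. The write speed and incoming rate come straight from the example above; everything else is just arithmetic.

```python
# How long it takes to store one minute of incoming data
# when writes are spread evenly across N machines.

WRITE_MINUTES_PER_GB = 10   # one drive stores 1 GB in 10 minutes (from the example)
INCOMING_GB_PER_MIN = 85    # data generated every minute

def minutes_to_store(num_machines: int) -> float:
    """Time to persist one minute of data, split evenly across machines."""
    gb_per_machine = INCOMING_GB_PER_MIN / num_machines
    return gb_per_machine * WRITE_MINUTES_PER_GB

for n in (1, 10, 100, 850):
    print(f"{n:>4} machines -> {minutes_to_store(n):7.1f} minutes")

#    1 machines ->   850.0 minutes
#   10 machines ->    85.0 minutes
#  100 machines ->     8.5 minutes
#  850 machines ->     1.0 minutes  <- now we keep up with the incoming stream
```

At 850 machines the cluster ingests one minute of data in one minute, which is exactly the scale-out trick distributed storage relies on.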

The setup above is called a master-slave topology, and the group of laptops is called a cluster. The way files are organised on a system is called a file system, and for storing data at this scale we have a technology called Hadoop.

The Hadoop Distributed File System (HDFS), built in Java and driven early on by Yahoo, is now used by almost every big company.
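To make that concrete, here is a hedged sketch of pushing a file into HDFS from Python. It assumes a WebHDFS endpoint is enabled and uses the third-party hdfs package (pip install hdfs); the host name, user, and paths are placeholders, not anything Facebook-specific.

```python
from hdfs import InsecureClient

# NameNode's WebHDFS address (9870 is the usual default in Hadoop 3.x);
# hostname and user here are made up for illustration.
client = InsecureClient('http://namenode.example.com:9870', user='sainadh')

client.makedirs('/user/sainadh/uploads')           # like `mkdir -p` on HDFS
client.upload('/user/sainadh/uploads/photo.jpg',   # destination path in HDFS
              'photo.jpg')                         # local file to copy in

# HDFS splits the file into blocks and replicates them across DataNodes;
# the client only contacts the NameNode (the master) for metadata.
print(client.list('/user/sainadh/uploads'))
```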

Now let's discuss how Facebook uses the Hadoop ecosystem to seamlessly integrate and process all of that data.

*Note: Facebook was generating around 5 petabytes of data daily as of 2020 (the figure varies).

Facebook's Integration of the Hadoop Ecosystem

Photo by Carlos Muza on Unsplash

Facebook runs the world’s largest Hadoop cluster.

The main ingredients are data processing, querying, analysis, and real-time analytics, and Facebook has developed a different product for each operation.

Cassandra, originally developed at Facebook to power inbox search, is a distributed storage system dedicated to managing large amounts of structured data across multiple commodity servers.
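As an illustration only (not Facebook's internal setup), here is a minimal sketch of talking to Cassandra with the open-source DataStax Python driver (pip install cassandra-driver); the keyspace and table are hypothetical.

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])     # contact point(s) of the cluster
session = cluster.connect()

# Replication is declared per keyspace; factor 1 is only for a local demo.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# Rows are partitioned by user_id and clustered by send time.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.messages (
        user_id text, sent_at timestamp, body text,
        PRIMARY KEY (user_id, sent_at)
    )
""")

session.execute(
    "INSERT INTO demo.messages (user_id, sent_at, body) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ('alice', 'hello from a commodity server'),
)
for row in session.execute(
        "SELECT * FROM demo.messages WHERE user_id = %s", ('alice',)):
    print(row.sent_at, row.body)
```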

Scuba speeds up the analysis and surfaces the results in real time.

Hive, also created at Facebook, is a tool that improved Hadoop's query capability with a subset of SQL (HiveQL) and soon gained popularity in the unstructured world.
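For a feel of what querying with Hive looks like, here is a sketch using the PyHive package (pip install pyhive) against a HiveServer2 endpoint; the host and the photo_uploads table are made-up placeholders.

```python
from pyhive import hive

conn = hive.Connection(host='hiveserver.example.com', port=10000,
                       username='sainadh')
cursor = conn.cursor()

# HiveQL looks like SQL but compiles down to jobs on the cluster,
# so it can scan files sitting in HDFS rather than a traditional database.
cursor.execute("""
    SELECT country, COUNT(*) AS uploads
    FROM photo_uploads
    WHERE dt = '2021-08-21'
    GROUP BY country
    ORDER BY uploads DESC
    LIMIT 10
""")
for country, uploads in cursor.fetchall():
    print(country, uploads)
```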

Prism was developed by Facebook to work around a limitation of Hadoop: it was not designed to run across multiple data centers.

Corona (yes, and no, it is not a virus) is a product that allows multiple jobs to be processed at a time on a single Hadoop cluster without crashing the system.

Peregrine is dedicated to answering queries as quickly as possible.

I have not gone very deep into each product here, because each has its own significance and each is a topic to be discussed on its own.

The data we upload passes through a series of these products, which shows that anything trying to solve the big data problem must process data within seconds.

Next time your profile picture or home page is not loading, check your internet connection, or just recall how much processing your data goes through.

Conclusion

A quick review of what we learned: the problems of big data, a worked example showing why storing data at high speed matters, and how Facebook is dealing with big data.

