The newspaper reads “Data Scientist: The Sexiest Job of the 21st Century”. For a minute you think, “I thought medicine and engineering were the only two domains.” Then you smirk, “Does it even pay?” Then somehow you start to read the article. “Best paid job of the century.” Ahem, ahem! You close the newspaper and realize you know nothing.

Who’s a data scientist, you ask? The data science Venn diagram by Drew Conway explains it quite well. A data scientist is a blend of maths and statistics, hacking skills, and substantive expertise. So what does a data scientist do? He sees zettabytes of data and goes “Zetta, play no vendetta”. But wait, where did you get zettabytes of data? Let’s see: how many photos have you uploaded to Facebook till date? How many answers have you submitted to Quora? How many code patches have you contributed? Online shopping: how many items do you buy, and which ones? Do you know what all that is, primarily? It’s all data. Bytes making up photos, which are basically arrays of RGB values. (Data?) Text in the ASCII format.

(Data, again!) Online shopping records your preference for items. So, data, and a lot of it. To quote sources on the Internet: “As for Facebook proper, it gets 208,300 photos uploaded every minute. Facebook has always been a bit vague about exactly how many photos they get over a period of time, but they’ve gone on record to say that it’s over six billion photos per month. It’s likely even more than that now.” I am sure we all agree that a picture speaks a thousand words and consumes a thousand times more memory. Quora till date has approximately 8.6 million questions. Yahoo! Answers has 300 million. Just imagine how much data we have out there. Now, what does the data scientist do with all that data? He tries to find structure in it, process it, learn from it, and so much more.

Take online shopping, for example. Recommendation systems are a huge YES for this industry. As Steve Jobs said, “People don’t know what they want until you show it to them.”

They provide you with diverse options based on your selection and recommend what you might like. Based on the items you bought in the past, they find the similarity between you and other shoppers. Now, if a similar shopper has bought something and liked it or rated it highly, you might get a recommendation to try it out as well.
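That idea is called collaborative filtering. Here is a minimal, purely illustrative sketch in Python: the shoppers, items, and ratings are all made up, and a real recommender would use far larger matrices and cleverer models.

```python
# A toy user-based collaborative filter. All users, items, and ratings
# below are hypothetical examples, not real data.
from math import sqrt

ratings = {
    "alice": {"book": 5, "laptop": 3, "headphones": 4},
    "bob":   {"book": 5, "laptop": 3, "camera": 5},
    "carol": {"camera": 2, "headphones": 1},
}

def similarity(a, b):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][i] * ratings[b][i] for i in common)
    norm_a = sqrt(sum(ratings[a][i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings[b][i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Suggest items the most similar user rated highly but `user` lacks."""
    others = [u for u in ratings if u != user]
    best = max(others, key=lambda u: similarity(user, u))
    return [item for item, r in ratings[best].items()
            if item not in ratings[user] and r >= 4]

print(recommend("alice"))  # bob rated the same items the same way -> ["camera"]
```

Alice and Bob rated the books and laptops identically, so Bob is her nearest neighbour, and his highly rated camera becomes her recommendation.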

Remember Google’s PageRank algorithm? It’s based on similar stuff. Why is it that the moment you type “F” on your keyboard, Facebook is the first link that appears? Because millions of people typed “F” and then clicked on Facebook, certifying the fact that they were actually searching for Facebook. Hence, its rank increases. Another example: while searching for an error, you get StackOverflow, Cross Validated, Ubuntu Forums, and other such sites, because people who searched for the same error before you clicked these sites, thus increasing their ranks. This is just the basic idea; in reality it’s a lot more complicated than that.
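At its core, PageRank treats a link from one page to another as a vote, and keeps redistributing rank until the scores settle. Here is a small power-iteration sketch on a made-up three-site link graph; the site names and links are hypothetical, and the real algorithm runs over the entire web graph with many refinements.

```python
# Toy PageRank via power iteration. The link graph below is invented
# purely for illustration.
links = {
    "facebook.com":      ["stackoverflow.com"],
    "stackoverflow.com": ["facebook.com", "ubuntuforums.org"],
    "ubuntuforums.org":  ["stackoverflow.com"],
}

def pagerank(links, damping=0.85, iterations=50):
    n = len(links)
    rank = {page: 1.0 / n for page in links}  # start with uniform rank
    for _ in range(iterations):
        new = {}
        for page in links:
            # Rank flowing in: each linking page shares its rank equally
            # among its outgoing links.
            incoming = sum(rank[src] / len(outs)
                           for src, outs in links.items() if page in outs)
            new[page] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

ranks = pagerank(links)
# stackoverflow.com receives the most inbound links, so it ranks highest
```

Because two of the three sites link to stackoverflow.com, its score ends up the largest, which is exactly the “more votes, higher rank” intuition from the paragraph above.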

Coming back to the question at hand: processing the data. Since the data is titanic, processing it takes an enormous amount of time. You could process the records one by one, or in parallel. You will probably agree that the parallel approach is much, much faster. That’s why they use Hadoop. It is a framework for the storage and large-scale processing of data sets spanning petabytes. Pretty name, eh? Doug Cutting named it after his son’s toy elephant. Doug Cutting and Mike Cafarella created Hadoop. Its implementation is based on the Google File System and Google MapReduce. Since it is Java-based, Hadoop runs on all major platforms. Hadoop comprises the Hadoop Distributed File System (HDFS) and MapReduce. There are newer tools in this ecosystem, like Spark and Cassandra. The documentation provided by Yahoo! is great for understanding the architecture and workings of Hadoop.
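The MapReduce model itself is simple enough to sketch without a cluster. Below is a plain-Python word count, the classic MapReduce example, showing the three phases; in real Hadoop each phase runs in Java across many machines, while here everything runs locally and the input documents are made up.

```python
# A local, single-machine sketch of the MapReduce idea behind Hadoop:
# the classic word count. The documents are hypothetical.
from collections import defaultdict

documents = ["big data big insights", "data drives insights"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all emitted values by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}

print(counts)  # {'big': 2, 'data': 2, 'insights': 2, 'drives': 1}
```

The payoff of this structure is that the map calls are independent of one another, and so are the reduce calls per key, which is what lets Hadoop spread them over thousands of machines.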

This has been published in the YUGMA BridgeNIT magazine; you can also view the published version here.



10 October 2014