Data Scientists: Who, What, How?
The newspaper reads “Data Scientist: The Sexiest Job of the 21st Century”. For a minute you are like, “I thought medicine and engineering were the only two career options”. Then you smirk: “Does it even pay?” Somehow you keep reading the article. “Best-paid job of the century.” Ahem, ahem! You close the newspaper and realize you know nothing.
Who’s a data scientist, you ask? The data science Venn diagram by Drew Conway explains it quite well: a data scientist blends maths and statistics knowledge, hacking skills and substantive expertise. So what does a data scientist do? He sees zettabytes of data and goes “Zetta, play no vendetta”. But wait, where did zettabytes of data come from? Let’s see: how many photos have you uploaded to Facebook to date? How many answers have you submitted to Quora? How many code patches have you contributed? When you shop online, how much do you buy, and which items? Do you know what all of that is, primarily? It’s all data. A photo is a pile of bytes, basically an array of RGB values. <Data?> Text is a sequence of characters in an encoding such as ASCII.
<Data, again!> Online shopping records your preferences for items.
Take online shopping, for example. Recommendation systems are a huge YES for this industry. As Steve Jobs said, “People don’t know what they want until you show it to them.”
Recommendation systems provide you with diverse options based on your selections and recommend what you might like. Based on the items you bought in the past, they measure the similarity between you and other people. If a person similar to you has bought something and liked it or rated it highly, you might get a recommendation to try it out as well.
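The idea of recommending what similar people liked can be sketched in a few lines of Python. This is only a toy version of user-based collaborative filtering, not how any particular shop does it: the `ratings` data, the user names, and the helper functions `cosine_similarity` and `recommend` are all made up for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "you":   {"headphones": 5, "keyboard": 4},
    "alice": {"headphones": 5, "keyboard": 4, "mouse": 5},
    "bob":   {"headphones": 1, "novel": 5},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    # Only items rated by both users contribute to the dot product.
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                # A similar user's high rating counts for more.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("you", ratings))
```

Here "alice" has tastes very close to yours, so her highly rated "mouse" ends up ahead of "novel", which comes from the much less similar "bob".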
Remember Google’s search ranking? It rests on similar ideas. (PageRank, its best-known algorithm, scores a page by the links pointing to it rather than by purchases or ratings.) Why is it that the moment you type “F” on your keyboard, Facebook is the first suggestion that appears? Because millions of people typed “F” and then clicked on Facebook, confirming that they were actually searching for Facebook. Hence, its rank increases. Another example: when you search for an error message, you get Stack Overflow, Cross Validated, Ubuntu Forums and other such sites, because people who searched for the same error before you clicked on these sites, thus increasing their ranks. This is just the basic idea; in reality it’s a lot more complicated than that.
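To make the "people typed it, then clicked it" intuition concrete, here is a toy sketch that ranks sites by how often they were clicked for a given query. It is not Google's algorithm, just the basic idea described above; the `click_log` data and the function names `build_click_counts` and `rank_sites` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (what the user typed, which site they then clicked)
click_log = [
    ("f", "facebook.com"),
    ("f", "facebook.com"),
    ("f", "flickr.com"),
    ("segmentation fault", "stackoverflow.com"),
    ("f", "facebook.com"),
]

def build_click_counts(log):
    """Count, per query, how many times each site was clicked."""
    counts = defaultdict(Counter)
    for query, site in log:
        counts[query][site] += 1
    return counts

def rank_sites(query, counts):
    """Return the sites clicked for this query, most-clicked first."""
    return [site for site, _ in counts.get(query, Counter()).most_common()]

counts = build_click_counts(click_log)
print(rank_sites("f", counts))
```

Every extra click on a site for the same query pushes it further up the list, which is the feedback loop the paragraph describes.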
Coming back to the question at hand: processing the data. Since the data is titanic, processing it takes an enormous amount of time. You could process the records one by one, or in parallel. You will probably agree that the parallel approach is much, much faster. That’s why they use Hadoop, a framework for storage and large-scale processing of petabytes of data. Pretty name, eh? Doug Cutting, who created Hadoop together with Mike Cafarella, named it after his son’s toy elephant. Its design is based on Google’s published work on the Google File System and MapReduce. Since it is Java-based, Hadoop runs on any platform with a Java Virtual Machine. At its core, Hadoop consists of the Hadoop Distributed File System (HDFS) and MapReduce. There are also newer tools in the same space, such as Spark for processing and Cassandra for distributed storage. The documentation provided by Yahoo! is great for understanding the architecture and working of Hadoop.
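The MapReduce model mentioned above can be imitated in plain Python with the classic word-count example. This is a single-machine sketch of the map, shuffle and reduce steps, not Hadoop's actual API: the function names are made up, and the thread pool only stands in for the many machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(chunks, workers=4):
    # Each chunk is mapped independently, so the map step can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))

chunks = ["big data is big", "data is everywhere"]
print(word_count(chunks))
```

The point of the design is that no map call needs to know about any other chunk, which is what lets a framework like Hadoop spread the work across many nodes and combine the results at the end.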