Recently I attended a Big Data conference (Euroforum Big Data, Switzerland). The conference was nicely organized and the topics were interesting. Participants came mostly from the IT industry, and from the list of participants one could expect rather executive-summary-style talks. Nothing against that! However, making things too simple is dangerous. As Einstein said: “make things as simple as possible, but not simpler.” Make things too simple and you will miss some of the really important points.
Talk about Big Data, Machine Learning, and Personalization invites many such oversimplifications. Here are some:
#1: Big Data means installing Hadoop
Listening to some of the speakers, one sometimes got the impression that installing and maintaining a Hadoop cluster is all you need to benefit from Big Data.
This is really bad, because it is not technology but business-related questions that are the most important ingredient of a successful Big Data strategy. The nature of the questions dictates the required technology. For example: you want to detect communities within a large network consisting of millions of nodes and billions of edges? Hadoop would be a bad choice. These kinds of questions are answered more efficiently using GraphLab. On the other hand, GraphLab is not primus inter pares for other kinds of questions. Hadoop is not the only kid in town and for sure not the solution for everything.
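To make the contrast concrete, here is a toy sketch of the kind of iterative, vertex-centric computation that graph questions require — a minimal PageRank in plain Python over a made-up four-node graph (the graph, damping factor, and iteration count are illustrative choices, not from any real system):

```python
# Tiny directed graph as adjacency lists (hypothetical example data).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank: each round, every vertex pushes a share of its
    rank to its out-neighbors. Graph frameworks like GraphLab keep the
    graph in memory across such rounds; expressing the same loop as a
    chain of MapReduce jobs forces a full pass over the data per round."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in graph}
        for v, outs in graph.items():
            share = damping * ranks[v] / len(outs)
            for w in outs:
                new[w] += share
        ranks = new
    return ranks

ranks = pagerank(graph)
print(ranks)  # "c" accumulates the most rank; "d", with no in-links, the least
```

The many-round, neighbor-to-neighbor communication pattern is exactly what makes such workloads a poor fit for plain MapReduce.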
#2: Using Machine Learning Algorithm x works best
This kind of statement is even worse. Firstly, there is probably no serious Big Data project that applies Machine Learning methods by just feeding algorithms with features (attributes) as they appear in the data. If you do that, you will almost surely fail.
Feature engineering is the magic word here. Feature engineering is the art of combining attributes and/or using only the relevant attributes of the data. There is no way to make this selection fully automatically. Moreover, the selected set of features depends on the business context and the data topology. Note: having more and more data increases the noise level, and intelligent feature selection becomes indispensable.
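A minimal sketch of what this looks like in practice (the records, field names, and derived features below are all invented for illustration): raw attributes are combined, re-encoded with domain knowledge, and noisy ones are dropped before any algorithm sees them.

```python
from datetime import datetime

# Hypothetical raw purchase records -- attributes "as they appear in the data".
raw = [
    {"user": "u1", "ts": "2013-06-03T08:15:00", "price": 19.90, "qty": 2},
    {"user": "u2", "ts": "2013-06-03T23:40:00", "price": 5.00,  "qty": 1},
]

def engineer(record):
    """Derive features a learner can actually use: combine raw attributes
    (price * qty -> basket value), re-encode others (timestamp -> hour,
    weekend flag), and drop ones that mostly add noise (the raw user id)."""
    ts = datetime.fromisoformat(record["ts"])
    return {
        "basket_value": record["price"] * record["qty"],  # combined feature
        "hour": ts.hour,                                  # re-encoded feature
        "is_weekend": ts.weekday() >= 5,                  # domain knowledge
    }

features = [engineer(r) for r in raw]
print(features)
```

Which combinations and encodings are worth keeping is exactly the part that cannot be automated away — it depends on the business question at hand.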
Secondly, having successfully applied algorithm x to dataset y does not imply that algorithm x will perform nicely on a dataset z as well; e.g., a successful recommender algorithm for books will fail at recommending movies (for sure). Generalization is hardly ever achieved across different industries.
In practice one chooses a set of algorithms to solve a problem on a given data set. An impressive example of this methodology was demonstrated by the winners of the famous Netflix recommendation contest, who blended the predictions of a large number of algorithms.
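The idea behind such a blend can be sketched in a few lines (the predictions and weights below are invented toy numbers; the Netflix winners combined on the order of a hundred predictors and learned their weights, rather than fixing them by hand):

```python
# Hypothetical held-out predictions from three different recommender
# algorithms for the same user/item pairs, plus the true ratings.
preds_a = [3.5, 4.0, 2.0, 5.0]
preds_b = [3.0, 4.5, 2.5, 4.0]
preds_c = [4.0, 3.5, 1.5, 4.5]
truth   = [3.0, 4.0, 2.0, 4.5]

def rmse(pred, truth):
    """Root-mean-square error -- the metric used in the Netflix contest."""
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

def blend(models, weights):
    """Linear blend: a weighted average of several models' predictions.
    The individual models err in different places, so the average
    can be more accurate than any single model."""
    return [sum(w * m[i] for w, m in zip(weights, models))
            for i in range(len(models[0]))]

blended = blend([preds_a, preds_b, preds_c], [0.5, 0.3, 0.2])
print(rmse(blended, truth))  # lower than any single model's RMSE here
```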
#3: We hired a data scientist and he will solve all our problems
Here and there you can read about a new species called the data scientist, who juggles with data. Ideally, a data scientist is educated in math/statistics/machine learning, has deep domain knowledge across different industries, speaks Python, Java, C, C++, Erlang, and every other programming language – you name it – doesn’t lack communication skills, knows how to visualize data, and… and… and.
Come on! This is a joke. Nobody will ever be talented in all these areas. Perhaps you will find somebody with mediocre skills in each of the above fields. But for sure this is not what you are looking for.
Look at companies that are successful in designing data-driven business processes (this means making money): they run data science teams. These teams consist of specialists in different areas. The key factor for a performant team is the team members’ ability to develop a common language!
#4: Big Data is only for big companies
There are some common ‘definitions’ of the term Big Data. Many of them are extensions of the famous three v’s (volume, velocity, and variety). Then there are people who have in mind: Big Data equals Hadoop (sorry, but this is complete nonsense).
Big Data should rather be understood as a particular situation where the available resources lag behind what is needed to efficiently maintain data-driven business processes and tasks. This situation can arise in a company of any size. Applying the latest developments in data mining/machine learning does require a certain amount of data (to maintain a particular confidence in the statistics), but it does not ask for petabytes of data. It’s a pity that most of the use cases shown (even here in Europe) always refer to LinkedIn, Google, Facebook etc. Mid-sized and even smaller companies can benefit from developments within the Big Data movement, i.e., machine learning, data visualisation, and other interesting areas.
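To put that "certain amount of data" in perspective, a textbook back-of-the-envelope calculation (the 2% margin of error and 95% confidence level are arbitrary example choices) shows how quickly statistical confidence saturates:

```python
import math

def sample_size(margin, z=1.96, p=0.5):
    """Observations needed to estimate a proportion (e.g. a conversion
    rate) within +/- margin at ~95% confidence (z = 1.96), assuming the
    worst case p = 0.5. Standard formula: n = z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# A few thousand observations suffice for a 2% margin -- no petabytes needed.
print(sample_size(0.02))
```

Halving the margin of error quadruples the required sample, but even then the numbers stay far below "web-scale" — which is why smaller companies are not excluded from this game.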