One size fits all – no way!

I recently attended a Big Data conference (Euroforum Big Data, Switzerland). The conference was nicely organized and the topics were interesting. Participants came mostly from the IT industry, and judging from the list of participants one could expect rather executive-summary-style talks. Nothing against that! However, making things too simple is dangerous. As Einstein said: “Make things as simple as possible, but not simpler.” Make things too simple and you will miss some of the really important points.

Talks about Big Data, Machine Learning, and Personalization offer many opportunities to make things too simple. Here are some of the pitfalls:

#1: Big Data means installing Hadoop
Listening to some of the speakers, one sometimes got the impression that installing and maintaining a Hadoop cluster is all you need to benefit from Big Data.
This is really bad, because the most important ingredient of a successful Big Data strategy is not technology but business-related questions. The nature of the questions dictates the required technology. For example: you want to detect communities within a large network consisting of millions of nodes and billions of edges? Hadoop would be a bad choice. Questions of this kind are answered more efficiently using GraphLab. On the other hand, GraphLab is not the first choice for other kinds of questions. Hadoop is not the only kid in town, and it is certainly not the solution for everything.

#2: Using Machine Learning Algorithm x works best
This kind of statement is even worse. Firstly, there is probably no serious Big Data project that applies Machine Learning methods by just feeding algorithms with features (attributes) as they appear in the data. If you do that, you will almost surely fail.
Feature engineering is the magic word here. Feature engineering is the art of combining attributes of the data and/or using only the relevant ones. There is no way to make this selection automatically. Moreover, the selected set of features depends on the business context and the data topology. Note: having more and more data increases the noise level, and intelligent feature selection becomes indispensable.

Secondly, having successfully applied algorithm x to dataset y does not imply that algorithm x will perform equally well on a dataset z. For example, a successful recommender algorithm for books will (for sure) fail at recommending movies. Generalization across different industries is hardly ever achieved.
In practice, one chooses a set of algorithms to solve a problem on a given data set. An impressive example of this methodology was demonstrated by the winners of the famous Netflix recommendation contest.

#3: We hired a data scientist and he will solve all our problems
Here and there you can read about a new species called the data scientist, who juggles data. Ideally, a data scientist is educated in math/statistics/machine learning, has deep domain knowledge across different industries, speaks Python, Java, C, C++, Erlang, and every other programming language you can name, doesn’t lack communication skills, knows how to visualize data, and… and… and.
Come on! This is a joke. Nobody will ever be talented in all these areas. Perhaps you will find somebody with mediocre skills in each of the above fields. But for sure this is not what you are looking for.

Companies that are successful in designing data-driven business processes (this means making money) run data science teams. These teams consist of specialists in different areas. The key factor for a high-performing team is the team members’ ability to develop a common language!

#4: Big Data is only for big companies
There are some common ‘definitions’ of the term Big Data. Many of them are extensions of the famous three V’s (volume, velocity, and variety). Then there are people who think that Big Data equals Hadoop (sorry, but this is complete nonsense).

Big Data should rather be understood as a particular situation in which the available resources fall short of what is needed to efficiently maintain data-driven business processes and tasks. This situation can hold for any company of whatever size. Applying the latest developments in data mining/machine learning does require a certain amount of data (to maintain a particular confidence in the statistics), but it does not require petabytes of data. It’s a pity that most of the use cases shown (even here in Europe) always refer to LinkedIn, Google, Facebook, etc. Mid-sized and even smaller companies can benefit from developments within the Big Data movement, e.g., machine learning, data visualisation, and other interesting areas.

Posted in Big Data, data, Data Science

WebSci 13. Highlights, impressions.

I spent the week from the 1st to the 5th of May at the Web Science 2013 conference in Paris. It was for sure the most diverse conference I have ever attended, because the Web Science community itself is very diverse. There were participants from a wide array of disciplines, from philosophy to computer science (and everything in between).

The conference kicked off on the 2nd of May with a keynote speech by Vint Cerf (http://www.websci13.org/keynotes/2013/04/vint-cerf-keynote/).
The keynote was followed by the first series of talks under the name “Face in the Crowd”.
In the afternoon we had the “Pecha Kucha” session. Pecha Kucha is a simple presentation format where you show 20 images, each for 20 seconds. The images advance automatically and you talk along to the images.

Is this a good format for an academic conference? Well, I think it is not suitable for all presentations, but only for conferences with a homogeneous audience, where background and methodological explanations are not required. And this was not the case here!
In the late afternoon we heard another keynote speech, by Cory Doctorow, a novelist and technology activist (details at http://www.websci13.org/keynotes/2013/04/cory-doctorow-keynote/).

Day 2 started with the ECRC Panel about the “Future of Computer Science”. The panel covered a broad spectrum of views, from legal experts exploring the privacy implications of such technologies to perspectives on the sociological and technological growth of such services.

Then we heard more presentations under the headings of ‘Web of the Mind’ and ‘Competition’ before lunch and ‘Governance & Trust’ and ‘Web Technologies’ afterwards.

At the end of the day there was another panel session, named “The new Village Pump”.

The final day started with the panel “How will the Web Revolutionize Society”. The panel brought together four guests, each of whom has made ground-breaking socio-technical contributions, to debate the future of society and the Web (for details, see http://www.websci13.org/keynotes/2013/05/saturday-panel-how-the-web-will-revolutionize-society/).

The rest of the day was once again all about presentations, under the headings “Representation”, “News”, and “Networks”.

Overall, the conference was a success for all those who attended. The quality of the papers and presentations really highlighted the fact that good research is being conducted in Web Science in many different directions.

 

Posted in misc

The Role of Trends in Evolving Networks

Modelling complex networks has been the focus of much research for over a decade [1]-[3]. Preferential attachment (PA) [4] is considered a common explanation for the self-organization of evolving networks, suggesting that new nodes prefer to attach to more popular nodes. The result is a broad degree distribution, found in many real-world networks. Another feature present in observed real-world networks is the tendency to build clusters, i.e., groups of tightly connected nodes. The traditional PA model does not exhibit this feature. In general, the PA model is driven by aging effects, i.e., the older a node, the higher the probability that a newly arrived node connects to it. Clearly, there are other effects in networks, such as trends: a newly arrived node may become a very strong driver in a network, i.e., become a trend. Our latest paper describes a model in which we incorporate the concept of trendiness. Namely, in trending networks, newly arriving nodes may become central at random, forming new clusters (groups). In particular, we show that a young network is more susceptible to trends, but even older networks may have trendy new nodes that become central in their structure.

The Model (TPA)
We assume an evolving network where in each time step a node is added with m links. We denote the network’s tendency to adhere to trends by r. A node with degree k that has acquired \Delta k links in the last step then acquires new links at a rate that is a monotonically increasing function of k + r \Delta k. The trendier the network, the bigger the effect of small changes. Hence, new nodes are more likely to attach to nodes that either have a high degree or are gaining momentum in the number of new links, and hence are trendy.

As in the PA model, we start with m_{0} nodes. Then, at each step, a new node with m \le m_{0} links is added. The other ends of the links are chosen with a probability that correlates with the node’s importance, expressed by its relative weight W_{i}:

\Pi(W_{i}) = \frac{W_{i}}{\sum_{j}W_{j}}.

Here W_{i} = k_{i} + r \Delta k_{i}, and \Delta k_{i} is the recent growth in the degree of node $i$. We have chosen the simplest form, \Delta k_{i} = k_{i}(t) - k_{i}(t-1). The total weight at time $t$ is therefore 2mt + mr.
The striking feature of the model is the dynamics of young trending networks, namely, when r \gg N, with $N$ the number of nodes in the network. When the growth model allows for the addition of one node at each step, N \sim t. Consequently, for very trendy young networks, we get:

\Pi(t) = \frac{W_{i}(t)}{\sum_{j}W_{j}(t)} \rightarrow r^{-0.5}.

This shows that newly arriving nodes have a similar probability of becoming important as older nodes in the network. This property has to be contrasted with the PA model, which favours older nodes becoming more dominant.
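To make the contrast with plain PA concrete, here is a minimal Python sketch of the attachment probability \Pi(W_{i}) = W_{i} / \sum_{j} W_{j} with W_{i} = k_{i} + r \Delta k_{i}. The function name and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
def attachment_probabilities(degrees, recent_growth, r):
    """Return Pi(W_i) = W_i / sum_j W_j with W_i = k_i + r * delta_k_i."""
    weights = [k + r * dk for k, dk in zip(degrees, recent_growth)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return [w / total for w in weights]


# Toy network (hypothetical numbers): an old hub of degree 9 that gained
# nothing in the last step, and a newcomer of degree 1 that gained its
# single link in the last step.
degrees = [9, 1]
recent_growth = [0, 1]

print(attachment_probabilities(degrees, recent_growth, r=0))  # plain PA: [0.9, 0.1]
print(attachment_probabilities(degrees, recent_growth, r=8))  # weights 9 and 9: [0.5, 0.5]
```

With r = 0 the sketch reduces to plain PA and the old hub dominates; with a sufficiently large r the newcomer's recent growth compensates for its low degree.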

The following plot shows a network generated by the TPA model. The labels correspond to each node’s arrival time step. It is clearly visible that even late-arriving nodes have the chance to become a trend, i.e., the center of a cluster.

[Figure: network generated by the TPA model; image label: r2000]
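A network of this kind can be grown with a few lines of Python. The following is only a sketch under stated assumptions, not the code used for the paper: the seed graph is taken to be a complete graph on m_{0} nodes, a freshly added node counts as having gained m links in its arrival step, the m targets of a new node are distinct, and \Delta k_{i} only records growth during the most recent step. The function name `grow_tpa` is hypothetical.

```python
import random


def grow_tpa(n_new, m=1, m0=2, r=0.0, seed=None):
    """Grow a TPA-style network; return (edge list, degree list)."""
    if m0 < 2 or not 1 <= m <= m0:
        raise ValueError("need m0 >= 2 and 1 <= m <= m0")
    rng = random.Random(seed)

    # Assumption: start from a complete graph so every seed node has weight > 0.
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    degree = [m0 - 1] * m0
    delta = [0] * m0  # delta_k_i: links gained in the previous step

    for _ in range(n_new):
        # W_i = k_i + r * delta_k_i for all existing nodes
        weights = [k + r * d for k, d in zip(degree, delta)]
        candidates = list(range(len(degree)))

        # Draw m distinct targets with probability proportional to W_i.
        targets = []
        for _ in range(m):
            pick = rng.choices(range(len(candidates)), weights=weights)[0]
            targets.append(candidates.pop(pick))
            weights.pop(pick)

        new_node = len(degree)
        # Only growth in this step counts as "recent" in the next step.
        delta = [0] * new_node + [m]
        degree.append(m)
        for t in targets:
            degree[t] += 1
            delta[t] = 1
            edges.append((new_node, t))

    return edges, degree


if __name__ == "__main__":
    edges, degree = grow_tpa(n_new=200, m=1, m0=2, r=2000, seed=42)
    hubs = sorted(range(len(degree)), key=degree.__getitem__, reverse=True)[:5]
    print("five highest-degree nodes (labelled by arrival order):", hubs)
```

Setting r=0 gives plain PA under the same assumptions, so the two growth rules can be compared by looking at which arrival times end up among the highest-degree nodes.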

The full paper can be downloaded here.

[1] D. Watts and S. Strogatz, “Collective dynamics of ‘small-world’ networks”, Nature, vol. 393, pp. 440-442, 1998.
[2] R. Albert and A. Barabasi, “Statistical mechanics of complex networks”, Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002.
[3] M. Newman, A. Barabasi, and D. Watts, “The structure and dynamics of networks”, Princeton University Press, 2011.
[4] A. Barabasi and R. Albert, “Emergence of scaling in random networks”, Science, vol. 286, no. 5439, pp. 509-512, 1999.

Posted in algos, complex_system, data

WebScience 13: our contribution

We are pleased to announce that our paper

“Preferential Attachment in Online Networks: Measurement and Explanations” has been accepted for the ACM Web Science Conference (www.websci13.org).

The paper is joint work with Jerome Kunegis from the University of Koblenz, Germany, and Christine Moser from VU University Amsterdam, Netherlands.

Abstract:

In this paper we perform an empirical study of the preferential attachment phenomenon in temporal networks and show that on the Web, networks follow a nonlinear preferential attachment model in which the exponent depends on the type of network considered. The classical preferential attachment model for networks (Barabasi and Albert 1999) assumes a linear relationship between the number of neighbours of a node in a network and the probability of attachment.

Although this assumption is widely made in Web Science and related fields, the underlying linearity is rarely measured. We performed an empirical longitudinal (time-based) study on forty-seven diverse Web network datasets from seven network categories. We show that contrary to the usual assumption, preferential attachment is nonlinear in the networks under consideration. We observe a dependency between the nonlinearity and the type of network under consideration – sublinear preferential attachment in certain types of networks, and superlinear attachment in others.

We propose explanations for the behaviour of that network measure, based on the mechanisms underlying the growth of the network in question.
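To illustrate what “measuring the exponent” can mean, here is a minimal Python sketch that assumes the attachment rate of a node with k neighbours is proportional to k^\beta (\beta = 1 is the classical linear case, \beta < 1 sublinear, \beta > 1 superlinear) and estimates \beta by a least-squares fit in log-log space. This is an illustrative assumption, not the measurement procedure of the paper; the function name and the synthetic data are hypothetical.

```python
import math


def fit_attachment_exponent(degrees, new_links):
    """Least-squares slope of log(new_links) against log(degree).

    degrees:   number of neighbours of each node at the first snapshot
    new_links: links each node gained afterwards (averages are fine)
    Nodes with zero degree or zero new links are skipped, since log(0)
    is undefined.
    """
    points = [
        (math.log(k), math.log(y))
        for k, y in zip(degrees, new_links)
        if k > 0 and y > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two usable data points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all degrees are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic, noise-free data generated with exponent 0.7 (sublinear).
degrees = [1, 2, 4, 8, 16, 32]
new_links = [0.5 * k ** 0.7 for k in degrees]
beta = fit_attachment_exponent(degrees, new_links)
print(f"estimated exponent: {beta:.3f}")
```

On real data the fit would be noisy, and dropping nodes with zero new links biases the estimate, so this should be read as a first look rather than a careful estimator.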

 You can access the paper here.

Posted in complex_system, data

Originally posted on Galileo's Pendulum:

The fake table of contents from the April 1, 1986 issue of the Astrophysical Journal.

Recently, I acquired a stack of books, papers, conference proceedings, and other scientific memorabilia, including the fake journal table of contents on the right. Though the material varied a lot in content and tone, I noted a lot of things in common: nearly every set of items emphasized a debate that was very significant at one time, but now is pretty much dead and buried. Though it’s intended to be humorous, the first three items in the Astrophysiological Journey (a parody of the prominent Astrophysical Journal) table of contents highlighted a real conflict in the cosmology community.

That conflict was over the rate of the expansion of the Universe. Since the 1920s, we’ve known that most galaxies are moving away from us, and the farther they are, the faster they…

Posted in Uncategorized

hd bbc – our lab local virtual hadoop cluster

After I finished installing our local cluster (which took a few days), I finally came across this video on YouTube. Nice video, cheers masterschema!

But I’ll give my little report here in any case, for the record.

Let’s start with the host machines. At the lab we have a two-node “cluster”, each node with:

  • OS: CentOS 6.3 (6.0 upgraded to 6.3, with the exception of the kernel)
  • CPU: 2 x AMD Opteron(TM) Processor 6276, 2.3 GHz (16 cores each)
  • RAM: 64 GB (8 x 8 GB)
  • DISK: 2 x 600 GB SAS-II 15000 rpm + RAID controller Adaptec 6405

BTW: we called the two machines buffalo and bill, and the full cluster is now bbc!

We wanted to test Hadoop to see whether it could be useful even in such a small environment. Actually we had two targets in mind: the first was simply to have a local Hadoop for development before production; the second was to test whether it might solve some problems which, in other codes, are normally limited by RAM. The previous post about sorting cats was run on this small cluster.

We tried to install Hadoop directly on our two nodes, but we soon ran into problems finding a satisfactory configuration. Moreover, we realized that it would be difficult to control the resources, in particular if Hadoop had to share the nodes with other jobs. So the idea was to create a virtual cluster dedicated only to Hadoop. I found this combination comfortable: libvirtd/qemu/KVM. I have used VMware and VirtualBox in the past, but libvirtd seems easier and more straightforward to use for our purpose. You can just type virt-manager and the rest goes very smoothly.

The first thing to do is to define a first virtual machine, which we will clone to create the cluster. You can download a qcow2 CentOS image from the internet, even one that includes Cloudera. I did play a bit with Oz-Image-Build. Getting a ready-made qcow2 file saves you the installation time, in particular the time spent choosing the main packages to install.
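As an illustration of the cloning step, here is a small Python sketch that builds one `virt-clone` command per node from a base virtual machine. It assumes the `virt-clone` tool (which ships alongside virt-manager) is installed and that the base VM is shut off; the VM names `hadoop-base` and `node01`, `node02`, … are hypothetical, and this is not necessarily how the cluster described here was created. The script only prints the commands, it does not run them.

```python
import shlex


def clone_commands(base_vm, n_nodes, prefix="node"):
    """Build one virt-clone command line per cluster node.

    --original    name of the existing (shut-off) guest to copy
    --name        name of the new guest
    --auto-clone  let virt-clone pick storage paths and a new MAC address
    """
    commands = []
    for i in range(1, n_nodes + 1):
        commands.append([
            "virt-clone",
            "--original", base_vm,
            "--name", f"{prefix}{i:02d}",
            "--auto-clone",
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed before execution.
    for cmd in clone_commands("hadoop-base", 10):
        print(shlex.join(cmd))
```

Each printed line can be run by hand or passed to `subprocess.run(cmd, check=True)` once the names match the local setup. Every clone still needs its own hostname and network configuration inside the guest.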

Posted in misc

Sorting Cats with Hadoop and psort

This is my first “self” tutorial on Hadoop MapReduce streaming. If you are really IT oriented, you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document with respect to Hadoop MapReduce streaming. Here I play a bit with “sort” on the command line. You might want to read my previous notes first: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).

The question I had was: what does Hadoop MapReduce streaming actually do? So the best I could do was to run it with a minimum of coding: using /bin/cat.
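To get a feeling for what to expect, the map, sort, and reduce phases can be mimicked locally in a few lines of Python. This is a toy emulation and not Hadoop itself: it assumes a single reducer, uses the streaming convention that the text up to the first tab is the key (the whole line if there is no tab), and all function names are made up for this illustration.

```python
def identity(lines):
    """Stand-in for /bin/cat: pass every line through unchanged."""
    yield from lines


def key_of(line):
    """Streaming convention: the key is the text before the first tab."""
    return line.split("\t", 1)[0]


def run_streaming(lines, mapper=identity, reducer=identity):
    """Emulate map -> sort by key -> reduce for a single reducer."""
    mapped = list(mapper(lines))
    # Hadoop sorts the map output by key before handing it to the reducer.
    shuffled = sorted(mapped, key=key_of)
    return list(reducer(shuffled))


records = ["pear\t3", "apple\t1", "zebra\t9", "apple\t0"]
print(run_streaming(records))
# ['apple\t1', 'apple\t0', 'pear\t3', 'zebra\t9']
```

So with cat as both mapper and reducer, the only visible effect is the sort by key that happens between the two phases. In this emulation, lines with the same key keep their input order because Python’s sort is stable; Hadoop does not guarantee the order of values within a key, and with several reducers each output file is sorted separately.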

Posted in data, misc