WebScience 13: our contribution

We are pleased to announce that our paper

“Preferential Attachment in Online Networks: Measurement and Explanations” has been accepted for the ACM Web Science Conference (www.websci13.org).

The paper is joint work with Jerome Kunegis from the University of Koblenz, Germany, and Christine Moser from VU University Amsterdam, Netherlands.

Abstract:

In this paper we perform an empirical study of the preferential attachment phenomenon in temporal networks and show that on the Web, networks follow a nonlinear preferential attachment model in which the exponent depends on the type of network considered. The classical preferential attachment model for networks (Barabasi and Albert 1999) assumes a linear relationship between the number of neighbours of a node in a network and the probability of attachment.

Although this assumption is widely made in Web Science and related fields, the underlying linearity is rarely measured. We performed an empirical longitudinal (time-based) study on forty-seven diverse Web network datasets from seven network categories. We show that, contrary to the usual assumption, preferential attachment is nonlinear in the networks under consideration. We observe a dependency between the nonlinearity and the type of network under consideration: sublinear preferential attachment in certain types of networks, and superlinear attachment in others.

We propose explanations for the behaviour of that network measure, based on the mechanisms underlying the growth of the network in question.
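To make the linear versus nonlinear distinction concrete, here is a minimal Python sketch. It assumes the attachment probability of a node is proportional to its degree raised to an exponent beta (beta = 1 is the classical linear case, beta < 1 sublinear, beta > 1 superlinear). The function names and the simple log-log least-squares fit are illustrative assumptions, not the estimation procedure used in the paper.

```python
import math


def attachment_probabilities(degrees, beta):
    """Probability of attaching to each node, proportional to degree ** beta."""
    weights = [d ** beta for d in degrees]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_exponent(degrees, new_edges):
    """Slope of a least-squares line through (log degree, log new edges).

    Assumption for illustration: the number of new edges a node receives
    is roughly proportional to degree ** beta, so the slope estimates beta.
    """
    pairs = [(math.log(d), math.log(n))
             for d, n in zip(degrees, new_edges) if d > 0 and n > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two nodes with positive degree and new edges")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        raise ValueError("all degrees are equal, the slope is undefined")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return cov_xy / var_x
```

With beta = 1 a node with twice the degree is twice as likely to receive the next edge; with beta = 2 it is four times as likely, which is the superlinear case.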

You can access the paper here.

Posted in complex_system, data | 1 Comment

Scientific Debates Are (Mostly) a Waste of Time

Reblogged from Galileo's Pendulum:


Recently, I acquired a stack of books, papers, conference proceedings, and other scientific memorabilia, including the fake journal table of contents on the right. Though the material varied a lot in content and tone, I noted a lot of things in common: nearly every set of items emphasized a debate that was very significant at one time, but now is pretty much dead and buried.

Read more… 1,277 more words

Posted in Uncategorized | Leave a comment

hd bbc – our lab local virtual hadoop cluster

Only after I had finished installing our local cluster (which took a few days) did I finally come across this video on YouTube. Nice video, cheers masterschema!

In any case, here is my own short report, for the record.

Let's start with the host machines. At the lab we have a two-node “cluster”, each node with:

  • OS: CentOS 6.3 (6.0 upgraded to 6.3 with the exception of the kernel)
  • CPU: 2 x AMD Opteron(TM) Processor 6276, 2.3 GHz (16 cores each)
  • RAM: 64 GB (8 x 8 GB)
  • DISK: 2 x 600 GB SAS-II 15000 rpm + RAID Controller Adaptec 6405

BTW: we called the two machines buffalo and bill, and the full cluster is now bbc!

We wanted to test hadoop to see whether it could be useful even in such a small environment. We had two goals in mind: the first was simply to have a local hadoop for development before production; the second was to test whether it might solve some problems which, in other codes, are normally limited by RAM. The previous post about sorting cats ran on this small cluster.

We tried to install hadoop directly on our two nodes, but we soon ran into problems finding a satisfactory configuration. Moreover, we realized that it would be difficult to control the resources, in particular if hadoop had to share the nodes with other jobs. So the idea was to create a virtual cluster dedicated only to hadoop. I found this combination comfortable: libvirtd/qemu/KVM. I have used VMware and VirtualBox in the past, but libvirtd seems easier and more straightforward for our purpose. You can just type virt-manager and the rest goes very smoothly.

The first thing to do is to define a first virtual machine, which we will clone to create the cluster. You can download a qcow2 CentOS image from the internet including cloudera. I did play a bit with Oz-Image-Build. Getting a ready-made qcow2 file saves you the installation time, in particular the time spent choosing the main packages to install.
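As a rough illustration of the cloning step, the sketch below builds one virt-clone command line per virtual node. The base VM name "centos-base" and the node prefix "hdnode" are hypothetical, and nothing is executed: the commands are only printed. It assumes the virt-clone tool that ships alongside virt-manager, which expects the original guest to be shut down or paused before cloning.

```python
import shlex


def clone_commands(base_vm, prefix, count):
    """Build one virt-clone command per virtual node (nothing is executed)."""
    commands = []
    for i in range(1, count + 1):
        name = f"{prefix}{i:02d}"  # hypothetical naming scheme: hdnode01, hdnode02, ...
        commands.append([
            "virt-clone",
            "--original", base_vm,  # the template VM defined first
            "--name", name,         # name of the new clone
            "--auto-clone",         # let virt-clone pick disk paths and MAC address
        ])
    return commands


# "centos-base" and "hdnode" are made-up names for illustration only.
for cmd in clone_commands("centos-base", "hdnode", 3):
    print(shlex.join(cmd))
```

Each list could be handed to subprocess.run(cmd, check=True) on a host where libvirt is installed; printing first makes it easy to review the commands before running them.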

Continue reading

Posted in misc | Leave a comment

Sorting Cats with Hadoop and psort

This is my first “self” tutorial on hadoop mapreduce streaming. If you are really IT oriented you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document with respect to hadoop mapreduce streaming. Here I play a bit with “sort” on the command line. You might want to read my previous notes first: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).

The question I had was: what does hadoop mapreduce streaming actually do? So the best I could do was to run it with the minimum coding: using /bin/cat.
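To give a feeling for what happens when both the mapper and the reducer are /bin/cat, here is a small local simulation in Python. It is only a sketch under simplifying assumptions: a single reducer, each line split into key and value at the first tab, and the shuffle step modelled as a plain sort on the key. The function names are made up for this example and are not part of hadoop.

```python
def identity(lines):
    """Behaves like /bin/cat: passes every line through unchanged."""
    for line in lines:
        yield line


def key_of(line):
    """Text before the first tab is the key; without a tab the whole line is the key."""
    key, _, _ = line.partition("\t")
    return key


def run_streaming(lines, mapper=identity, reducer=identity):
    """Simulate map -> shuffle/sort by key -> reduce on a single machine."""
    mapped = list(mapper(lines))
    shuffled = sorted(mapped, key=key_of)  # the framework sorts between map and reduce
    return list(reducer(shuffled))


cats = ["tom\t3", "felix\t1", "garfield\t2"]
for line in run_streaming(cats):
    print(line)
```

Even though neither cat does any work, the output comes back ordered by key, because the sorting is done by the framework between the map and the reduce phase. On a real cluster with several reducers each output part is sorted on its own, and the order of lines sharing the same key is not guaranteed.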

Continue reading

Posted in data, misc | Leave a comment

RecSys2012 Conference – a personal summary

I attended the RecSys conference 2012 in Dublin. I summarise my personal impressions in this blog post. First of all, the conference was well organised and I met many bright people. I got some really new ideas and inspiring thoughts for future work in the field of recommender systems.

My conclusions first:

Conclusions and lessons learned
The recommendation research ecosystem has become diverse and covers many different aspects. In particular, the community has shifted from a purely algorithmic point of view to a broader scope. That’s a promising move.

However, I think we still lack sufficient interdisciplinarity, in particular when it comes to topics like decision making processes in recommender systems. Branches of science like sociology and psychology should be explored for useful results to better model and understand the users of recommendation systems. We should try to “operate” more on the interfaces between different research areas, e.g., how can we combine results from social network analysis with user generated opinions in recommender systems?

Another takeaway is the following: real world recommenders like the ones from LinkedIn, Netflix etc. have a huge amount of context related data and signals from user behaviour. These data are essential to drive a recommender system successfully. However, academic researchers lack such data. Because the purely algorithmic aspect accounts for only a small part of a successful recommendation system, it is hard for academic researchers to make useful contributions for the industry. How can we solve this problem? It might be true that special agreements between a company and a research unit are possible, but that’s a hard way to go most of the time. I think from an application point of view it is essential to think about solutions.

Our own contribution “Recommendation systems in the scope of opinion formation: a model” to the Conference was a talk in the decision@recsys2012 workshop.

Conference Summary
The conference was divided into three parts: 1) workshops 2) paper sessions and tutorials 3) industry sessions. I attended two workshops, the full paper sessions, some of the tutorials, and part of the industry session. Because I left the conference on Thursday morning I could not attend the RecSys data challenge part.

Continue reading

Posted in misc, recommender_systems | 2 Comments

psort: Parallel sorting on the command line. An example.

I am in the process of understanding hadoop and the map-reduce framework.

This introductory line will be clarified in the next post, but keep in mind that in this post I am not looking for the fastest sort, but rather for a sort within a parallel framework.

I needed a simple code which would work on my Q6600 processor and also on my 2 nodes with 2 x 16 cores each. Sorting seems to be a good example: easy to understand, easy to implement with the sort command, and a pretty typical problem. Moreover, hadoop recently (maybe years ago on the IT time scale) won one of the sorting competitions (see here or also here; Google it for up-to-date data). It sounded like a good starting point for a simple and dummy comparison.
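The general pattern behind a parallel sort can be sketched in a few lines of Python: split the input into chunks, sort each chunk independently, then merge the sorted chunks. This is only an illustration of the split/sort/merge idea, not the psort script from this post; the function names are made up. On the command line the last step corresponds to merging already sorted files with sort -m. A thread pool is used here to keep the example self-contained; for real CPU parallelism in Python a process pool would be needed.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Cut the input into at most n_chunks consecutive pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sort(items, n_chunks=4):
    """Sort every chunk on its own worker, then merge the sorted chunks."""
    chunks = split_into_chunks(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # heapq.merge walks all sorted chunks at once, like "sort -m" on sorted files
    return list(heapq.merge(*sorted_chunks))


print(parallel_sort(["tom", "felix", "garfield", "sylvester", "duchess"], n_chunks=2))
```

The merge step is cheap compared with the sorting itself, which is why the chunks can be sorted on different cores or different machines.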

Continue reading

Posted in data, misc | 1 Comment

wishalyse. Our new machine learning project

Some days ago I could not fall asleep. I just got up and went to my living room with a cup of tea. I was thinking about this (machine learning) and that (machine learning). And yes, I would love to learn more about the topic, to understand all the methods better and so on. It’s a wish. So I was wondering what other people’s wishes are. I decided to conduct a machine learning project called wishalyse. The idea is simple: collect as many wishes as possible from all over the world during one year and apply some machine learning methods to the collected data.

We will soon provide a mobile app to let people easily tell their wishes. For now the wishes can be entered as a comment on our wishalyse blog.

Every month we will publish our analysis and code. The first milestone will be the end of September 2012.

So, we need you. Start telling the world your wishes and help us make our wishalyse project a success!

Thank you.

Posted in misc | Leave a comment