When I finished installing our local cluster (after a few days) I could finally watch this video on YouTube. Nice video, cheers masterschema!
But here is my little report anyway, for the record.
Let’s start with the host machines. At the lab we have a two-node “cluster”, each node with:
- OS: CentOS 6.3 (6.0 upgraded to 6.3 with the exception of the kernel)
- CPU: 2 x AMD Opteron(TM) Processor 6276 at 2.3 GHz (16 cores each)
- RAM: 64 GB (8 x 8 GB)
- DISK: 2 x 600 GB SAS-II 15000 rpm + Adaptec 6405 RAID controller
BTW: we called the two machines buffalo and bill, and the full cluster is now bbc!
We wanted to test hadoop to see whether it might be useful even in such a small environment. We had two targets in mind: the first was simply to have a local hadoop installation for development before production; the second was to test whether it might solve some problems which, in other codes, are normally limited by RAM. The previous post about sorting cats was run on this small cluster.
We tried to install hadoop directly on our two nodes but soon ran into problems finding a satisfactory configuration. Moreover, we realized it would be difficult to control resources, in particular if hadoop had to share the nodes with other jobs. So the idea was to create a virtual cluster dedicated only to hadoop. I found this combination comfortable: libvirtd/qemu/KVM. I used VMware and VirtualBox in the past, but libvirtd seems easier and more straightforward for our purpose. You can just type virt-manager and the rest goes very smoothly.
The first thing to do is to define a first virtual machine, which we will then clone to create the cluster. You can download a qcow2 CentOS image from the internet that already includes Cloudera. I also played a bit with Oz-Image-Build. Getting a ready qcow2 file saves you the installation time, in particular the step of choosing the main packages to install.
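As a sketch, the cloning step can be done with copy-on-write qcow2 overlays, so each node only stores the blocks it changes and cloning is nearly instant. All paths, VM names, and the node count below are hypothetical, not from my actual setup; adjust them to your libvirt environment.

```shell
# Hypothetical base image and names -- adjust to your libvirt setup.
BASE=/var/lib/libvirt/images/centos-base.qcow2

for i in 1 2 3; do
  # Thin clone: a copy-on-write qcow2 overlay on top of the base image,
  # so each node's disk only holds its own changes.
  qemu-img create -f qcow2 \
    -o backing_file=$BASE,backing_fmt=qcow2 \
    /var/lib/libvirt/images/node$i.qcow2
  # Register the clone with libvirt: new name and MAC address,
  # reusing the overlay disk we just created (--preserve-data).
  virt-clone --original centos-base --name node$i --preserve-data \
    --file /var/lib/libvirt/images/node$i.qcow2
done
```

Note that writing to the base image afterwards would corrupt the overlays, so the base VM should stay shut off once the clones exist.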
This is my first “self” tutorial on hadoop mapreduce streaming. If you are really IT oriented you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document with respect to hadoop mapreduce streaming; here I just play a bit with sort on the command line. You might first want to read my previous notes: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).
The question I had was: what does hadoop mapreduce streaming actually do? So the best I could do was to run it with minimal coding: using /bin/cat.
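A minimal streaming job of that kind uses /bin/cat as both mapper and reducer, i.e. the identity job. The jar path and the input/output directories below are assumptions that depend on your installation; the second command shows the rough single-machine equivalent of the streaming data flow.

```shell
# Minimal streaming job: /bin/cat as both mapper and reducer.
# Jar path and HDFS directories are assumptions -- adjust to your install.
if command -v hadoop >/dev/null 2>&1; then
  hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
    -input  /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /bin/cat
fi

# Conceptually, streaming pipes each input split through the mapper,
# sorts by key between the two phases, and pipes the groups to the reducer.
# On one machine the same data flow is roughly: cat | mapper | sort | reducer
printf 'banana\napple\ncherry\n' | /bin/cat | sort | /bin/cat
```

Even with identity mapper and reducer the output comes back sorted by key, which is the framework’s shuffle/sort phase showing through.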
I attended the RecSys conference 2012 in Dublin. I summarise my personal impressions in this blog post. First of all, the conference was well organised and I met many bright people. I got some really new ideas and inspiring thoughts for future work in the field of recommender systems.
My conclusions first:
Conclusions and lessons learned
The recommendation research ecosystem has become diverse and covers many different aspects. In particular, the community has shifted from a purely algorithmic point of view to a broader scope. That’s a promising move.
However, I think we still lack interdisciplinarity, in particular when it comes to topics like decision-making processes in recommender systems. Disciplines like sociology and psychology should be explored for useful results to better model and understand the users of recommendation systems. We should try to “operate” more on the interfaces between different research areas, e.g., how can we combine results from social network analysis with user-generated opinions in recommender systems?
Another takeaway is the following: real-world recommenders like the ones from LinkedIn, Netflix etc. have a huge amount of context-related data and signals from user behaviour. These data are essential to drive a recommender system successfully. However, academic researchers lack such data. Because the purely algorithmic aspect of a successful recommendation system contributes only a small percentage, it is hard for academic researchers to make useful contributions for the industry. How can we solve this problem? Special agreements between a company and a research unit are possible, but that’s a hard way to go most of the time. I think from an application point of view it is essential to think about solutions.
Our own contribution to the conference, “Recommendation systems in the scope of opinion formation: a model”, was a talk in the decision@recsys2012 workshop.
The conference was divided into three parts: 1) workshops, 2) paper sessions and tutorials, and 3) industry sessions. I attended two workshops, the full paper sessions, some of the tutorials, and part of the industry session. Because I left the conference on Thursday morning I could not attend the RecSys data challenge part.
I am in the process of understanding hadoop and the map-reduce framework.
This introductory line will be clarified in the next post, but keep in mind that in this post I am not seeking the fastest sort, but rather a sort within a parallel framework.
I needed a simple code which would work on my Q6600 processor and also on my two nodes with 2 x 16-core CPUs. Sorting seems to be a good example: easy to understand, easy to implement with the sort command, and a pretty typical problem. Moreover, hadoop recently (maybe years ago on the IT time scale) won one of the sorting competitions (see here or also here; google it for up-to-date data). It sounded like a good starting point for a simple and dummy comparison.
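The parallel-sort idea from the psort post can be sketched entirely in the shell: split the input into chunks, sort the chunks concurrently, then merge the already-sorted runs. This is only an illustration of the structure, not my psort implementation, and the file names are made up.

```shell
# Shell-only parallel sort sketch: split, sort chunks concurrently, merge.
seq 20 | shuf > input.txt          # some unsorted test data
split -l 5 input.txt chunk.        # four 5-line chunks: chunk.aa .. chunk.ad
for c in chunk.aa chunk.ab chunk.ac chunk.ad; do
  sort -n "$c" -o "$c.sorted" &    # one sort process per chunk, in parallel
done
wait                               # let all the chunk sorts finish
sort -n -m chunk.*.sorted > sorted.txt   # -m merges already-sorted runs
rm -f input.txt chunk.*            # clean up the intermediate files
```

The split/sort/merge structure is essentially what a map-reduce sort does across machines; on a single box, GNU sort’s --parallel option reaches a similar effect with much less ceremony.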
Some days ago I could not fall asleep. I just got up and went to my living room with a cup of tea. I was thinking about this (machine learning) and that (machine learning). And yes, I would love to learn more about the topic, to understand all the methods better, and so on. It’s a wish. So I was wondering what other people’s wishes are. I decided to conduct a machine learning project called wishalyse. The idea is simple: collect as many wishes as possible from all over the world during one year and apply some machine learning methods to the collected data.
We will soon provide a mobile app to let people easily tell us their wishes. For now, wishes can be entered as comments on our wishalyse blog.
Every month we will publish our analysis and code. The first milestone will be at the end of September 2012.
So, we need you. Start telling the world your wishes and help us make our wishalyse project a success!