This is my first “self” tutorial on Hadoop MapReduce streaming. If you are really IT oriented you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version); this post doesn’t add much to that document with respect to Hadoop MapReduce streaming. Here I play a bit with “sort” on the command line. You might want to first read my previous notes: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 compute nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).
The question I had was: what does Hadoop MapReduce streaming actually do? So the best I could do was to run it with minimal coding: using /bin/cat as both mapper and reducer.
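To make this concrete, the streaming pipeline with an identity mapper and reducer can be simulated locally with plain shell tools. The jar path in the comment is an assumption and varies between Hadoop installs; the local pipeline below only mimics what happens inside a single reducer partition.

```shell
# The actual streaming call would look roughly like this (jar path is an
# assumption; adjust to your install):
#   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
#       -input in -output out -mapper /bin/cat -reducer /bin/cat
#
# Conceptually, per reducer, streaming runs: mapper | sort-by-key | reducer.
# With /bin/cat on both sides this collapses to a plain sort:
printf 'banana\napple\ncherry\n' > input.txt
cat input.txt | sort | cat > output.txt   # identity map, shuffle/sort, identity reduce
cat output.txt                            # lines come back sorted by key
```

In other words, even with "no-op" map and reduce steps, the framework's shuffle phase sorts the records, which is exactly what makes /bin/cat an instructive minimal experiment.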
I attended the RecSys conference 2012 in Dublin. I summarise my personal impressions in this blog post. First of all, the conference was well organised and I met many bright people. I got some really new ideas and inspiring thoughts for future work in the field of recommender systems.
My conclusions first:
Conclusions and lessons learned
The recommendation research ecosystem has become diverse and covers many different aspects. In particular, the community has shifted from a purely algorithmic point of view to a broader scope. That’s a promising move.
However, I think we still lack interdisciplinarity, in particular when it comes to topics like decision-making processes in recommender systems. Disciplines such as sociology and psychology should be explored for useful results to better model and understand recommender system users. We should try to “operate” more on the interfaces between different research areas, e.g., how can we combine results from social network analysis with user-generated opinions in recommender systems?
Another takeaway is the following: real-world recommenders like the ones from LinkedIn, Netflix, etc. have a huge amount of context-related data and signals from user behaviour. These data are essential to drive a recommender system successfully. However, academic researchers lack such data. Because the purely algorithmic aspect contributes only a small part to a successful recommendation system, it is hard for academic researchers to make useful contributions for industry. How can we solve this problem? Special agreements between a company and a research unit are possible, but that’s a hard way to go most of the time. I think from an application point of view it is essential to think about solutions.
Our own contribution, “Recommendation systems in the scope of opinion formation: a model”, was presented as a talk in the decision@recsys2012 workshop.
The conference was divided into three parts: 1) workshops, 2) paper sessions and tutorials, and 3) industry sessions. I attended two workshops, the full-paper sessions, some of the tutorials, and part of the industry session. Because I left the conference on Thursday morning, I could not attend the RecSys data challenge part.
I am in the process of understanding Hadoop and the map-reduce framework.
This introductory line will be clarified in the next post, but keep in mind that in this post I am not seeking the fastest sort, but rather a sort within a parallel framework.
I needed a simple piece of code that would work on my Q6600 processor and also on my 2 nodes with 16×2-core CPUs. Sorting seems to be a good example: easy to understand, easy to implement with the sort command, and a pretty typical problem. Moreover, Hadoop recently (maybe years ago on the IT time scale) won one of the sorting competitions (see here or also here; Google it for up-to-date data). It sounded like a good starting point for a simple and dummy comparison.
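The psort idea from my earlier notes can be sketched in a few lines of shell: split the input into chunks, sort the chunks in parallel background jobs, then merge the already-sorted chunks with `sort -m`. File names and chunk sizes here are illustrative, not psort's actual implementation.

```shell
# Minimal parallel-sort sketch: split, sort chunks concurrently, merge.
seq 1000 | shuf > data.txt              # toy unsorted input
split -l 250 data.txt chunk.            # four chunks: chunk.aa .. chunk.ad
for f in chunk.??; do
  sort -n "$f" -o "$f.sorted" &         # one background sort per chunk
done
wait                                    # let all chunk sorts finish
sort -n -m chunk.??.sorted > sorted.txt # -m merges already-sorted files
```

The merge step is cheap because `sort -m` only interleaves inputs that are each already sorted; the expensive comparison work happens concurrently in the per-chunk sorts.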
Some days ago I could not fall asleep. I just got up and went to my living room with a cup of tea. I was thinking about this (machine learning) and that (machine learning). And yes, I would love to learn more about the topic, to understand all the methods better, and so on. It’s a wish. So I was wondering what other people’s wishes are. I decided to conduct a machine learning project called wishalyse. The idea is simple: collect as many wishes as possible from all over the world during one year and apply some machine learning methods to the collected data.
We will soon provide a mobile app to let people easily tell us their wishes. For now, wishes can be entered as comments on our wishalyse blog.
Every month we will publish our analysis and code. The first milestone will be at the end of September 2012.
So, we need you. Start telling the world your wishes and help us make our wishalyse project a success!
Most interesting phenomena in physics, social sciences, engineering, and other
disciplines are highly non-linear. This limits the ability to analytically
investigate such systems. Simulations of the dynamical processes are then the tool of
choice to explore the system. However, it is sometimes very important to have a
basic understanding in terms of approximate solutions. Non-linear differential
equations describing the dynamics are known to be harder to solve than linear ones.
One often has to resort to asymptotic techniques or classical perturbation theory to obtain analytical approximations.
Classical perturbation theory strongly depends on small/large physical
parameters; therefore, such methods are valid only for weakly non-linear systems.
The Homotopy Analysis Method (HAM) is a relatively new approach to exploring highly non-linear systems. The method decomposes the non-linear system into linear subproblems and approximates the ‘real’ solution by an iterative process. An embedding parameter q deforms an initial guess towards the true solution, while the speed of convergence is governed by an auxiliary tuning parameter. The approximate solution can then be found as a linear combination of
base functions. The main advantages of HAM are:
- Independence of small/large physical parameters
- Flexibility on the choice of the base functions
Moreover, while for a perturbative method convergence is typically guaranteed only in a small interval 0 <= e <= 1, HAM allows convergence in the whole interval 0 <= e <= N, with N arbitrarily large.
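For reference, the core of HAM can be written down as Liao's zeroth-order deformation equation. The notation below is the standard one from the HAM literature, not taken from a specific derivation in this post:

```latex
% Zeroth-order deformation equation (Liao):
% \mathcal{L} is an auxiliary linear operator, \mathcal{N} the non-linear
% operator of the original problem, u_0 an initial guess, \hbar the
% convergence-control parameter, and q \in [0,1] the embedding parameter.
(1-q)\,\mathcal{L}\!\left[\phi(t;q) - u_0(t)\right]
  = q\,\hbar\,\mathcal{N}\!\left[\phi(t;q)\right]

% Expanding \phi in q gives the series of linear subproblems; the solution
% of the original non-linear problem is recovered at q = 1:
\phi(t;q) = u_0(t) + \sum_{m=1}^{\infty} u_m(t)\, q^m,
\qquad u(t) = \phi(t;1)
```

At q = 0 the equation reduces to the initial guess u_0, at q = 1 it reduces to the original non-linear problem; the freedom in choosing the linear operator and ħ is what gives the method its flexibility.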
This post is a bit geeky, but never mind!
In our lab we are not wedded to a particular tool or method for coding or analyzing data. We stick to a simple strategy: choose the tool that fits the task best! OK, sometimes you don’t know in advance what the best tool is, but with time you gain experience and develop a pretty high hit rate in selecting the best tool for your purposes. Our experience has led to the following preferences:
- General data analysis: R, Python, and tools like awk, sed.
- Fast prototyping and simulations: R, Python
- Mathematical analysis for differential equations etc. : Mathematica
- Performance related stuff: C, C++ (yes). One team-member still favors FORTRAN 🙂
- Databases: MySQL and some NoSQL
If you are not interested in technical details, then here is the message to take from this post:
Don’t try to do everything with the same method, programming language, algorithms, and tools. Be flexible, and figure out the most efficient method to solve each particular task! Don’t be addicted.
So far, so good. For our recommender research tasks we mainly use Python. This is our comfort zone. Sometimes it’s good to go out of the comfort zone, right? For fun I decided to implement a simple recommender algorithm (B-Rank) in Mathematica. This is definitely out of our comfort zone, since for this kind of task we already have our best option: Python with NumPy and SciPy, and we are happy with it in terms of performance and scalability. The code translation from Python to Mathematica was straightforward. However, the Mathematica code reads a bit cryptically if one is not used to it. I was wondering how the performance of this implementation compares to the Python version. The difference is striking.
Last week I attended a study tour organized by swissnex San Francisco. The topic: social media. We visited the big players – Twitter, Facebook, LinkedIn, YouTube – and universities like Stanford and Berkeley to learn about their social media strategies. The lessons learned will be posted later. However, because I am pretty jet-lagged, I decided instead to hack a bit on Twitter data related to the study tour.
Hack #1: follower network
Firstly, I wanted to visualize the follower network of participants tweeting during the study tour. For this I used the networkx and tweepy Python libraries to grab and visualize the data. This hack was pretty straightforward. I started with an ego-centered network by fetching all participants following me on Twitter. Next I grabbed their followers. The data were passed to networkx to represent them as a graph, and finally I visualized the network using a spring layout. There are clear clusters visible. The reasons for this are the way the network was constructed and the fact that participants of the tour were not highly linked to each other before the tour started. The labels of the nodes correspond to the names registered on Twitter. To enlarge the picture, simply click on it. Enjoy!