BSC opens its doors again for 48h Open House Barcelona

Published on 23/10/2014

The Barcelona Supercomputing Center is taking part, for the third year, in 48h Open House Barcelona, an event that aims to promote the city's architectural heritage.

Torre Girona, specifically the chapel that houses the MareNostrum supercomputer, will be open to the public on Saturday 25 October from 11:00 to 19:00 to show visitors the interior of the chapel and to explain the research carried out at BSC-CNS.

The visit will also feature 4 researchers, who will give a presentation on the center and its scientific work. No booking is required; visits will start approximately every half hour.

Don't miss the chance to visit this unique space. We look forward to seeing you!

Databricks-Spark comes to Barcelona!

Published on 09/10/2014

We did it: a meetup with engineers coming from the USA to tell us first-hand what is cooking around Spark at Databricks!

This fourth meeting will feature Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks) speaking about ‘Building a Unified Data Pipeline in Spark’ (talk in English).

The talk will take place on Thursday 20 November at 18:30 in the sala de actos (auditorium) of the FIB, on the UPC's Campus Nord. We look forward to seeing all of you; it is sure to be impressive!

If you are interested, it is very important that you sign up as soon as possible on the meetup's list of confirmed attendees, since the auditorium holds 80 people and this time we will not be able to increase the capacity under any circumstances. (As agreed jointly by the attendees of the previous meetup, from now on we are going to try a small fee of 2 euros to cover minor expenses.)

Abstract: One of the promises of Apache Spark is to let users build unified data analytic pipelines that combine diverse processing types. In this talk, we’ll demo this live by building a machine learning pipeline with 3 stages: ingesting JSON data from Hive; training a k-means clustering model; and applying the model to a live stream of tweets. Typically this pipeline might require a separate processing framework for each stage, but we can leverage the versatility of the Spark runtime to combine Shark, MLlib, and Spark Streaming and do all of the data processing in a single, short program. This allows us to reuse code and memory between the components, improving both development time and runtime efficiency. Spark as a platform integrates seamlessly with Hadoop components, running natively in YARN and supporting arbitrary Hadoop InputFormats, so it brings the power to build these types of unified pipelines to any existing Hadoop user.

This talk will be a fully live demo and code walkthrough where we’ll build up the application throughout the session, explain the libraries used at each step, and finally classify raw tweets in real-time.

InnoApps Challenge: a good opportunity for young people!

Published on 09/10/2014

Let me share with you the InnoApps Challenge, which encourages young people to develop innovative mobile apps. In my opinion it will be a great opportunity for your students to learn, network and win.

The InnoApps Hackathon, officially launched at the European Youth Event in the European Parliament in Strasbourg on 10 May, builds on last year’s highly successful competition to provide a cross-continental bridge for young people to collaborate, share ideas and jointly create mobile apps that address today’s pressing societal challenges. This year’s edition of InnoApps will see young people teaming up to jointly create innovative mobile apps for smart cities of the future.

The competition is designed around a collaborative work environment with built-in tools and support: during the virtual collaboration phase and the Brussels-based hackathon, the developers will benefit from extensive technical education, training and mentoring, online and offline. The winners will share 35000 EUR, which they can use to turn their apps into market-ready products, while keeping full ownership of their apps.

InnoApps, open to young students and professionals up to 28 years old, is a joint initiative of the European Young Innovators Forum and Huawei, in partnership with AIESEC and Pinch.

The InnoApps Challenge runs in several phases. The deadline for submitting the individual application form is 9 November. The selected finalists will be invited to an all-expenses-paid hackathon in Brussels in early February 2014, which culminates in the live-pitching final on the last day of the hackathon.

This could be a good opportunity for young people!

The First International Conference on Predictive APIs and Apps will be held in Barcelona

Published on 22/09/2014

PAPIs.io will take place on 17-18 November 2014 in Barcelona, at UPC Barcelona Tech campus, right before Strata. It will be the first ever international conference dedicated to Predictive APIs and Predictive Apps.

We want PAPIs.io to become an open forum for technologists and researchers on distributed, large-scale machine learning services and developers of real-world predictive applications.

We aim at seeding highly technical discussions on the way common and uncommon predictive problems are being solved. However, we want PAPIs to be an eminently hands-on conference.

In this first edition, we will focus on the pragmatic issues and challenges that companies in the trenches must face to make predictive APIs and applications a reality, and add academic tracks in future editions, once we understand them better.

So if you are working on an interesting Predictive API or Application and want to show the rest of the world your new advancements or discuss the challenges that you are facing, please send us your proposal (the Call for Proposals is open until 8 October 2014).

Predictive APIs and Applications cover a wider area of application than Recommender Systems. Therefore, their impact on our everyday lives will be orders of magnitude higher and will affect more industries than we can now imagine. So please don’t miss the opportunity to join this nascent community early on.

See you all at PAPIs in Barcelona!!!

Third Spark Barcelona Meeting (CSIC)

Published on 04/09/2014

This third meeting will feature Jesús Cerquides (Tenured Researcher at Consejo Superior de Investigaciones Científicas-CSIC) speaking about GraphX: An introduction to distributed graph processing in Spark (talk in Spanish).

As usual in meetups, there will be beer, this time courtesy of Estrella Damm. We are looking forward to seeing you!

JOIN US at: http://www.meetup.com/Spark-Barcelona/events/186861962/

Date: Monday 22/09/2014
Time: 19:00
Place: itnig, C/ Àlaba 61, 5-2, Barcelona

Big Data Open Source Landscape: Processing Technologies

Published on 15/07/2014

Hadoop is a well-established software framework that analyses structured and unstructured big data and distributes applications across thousands of servers. Hadoop was created in 2005, and since then several projects have appeared in the Hadoop space that try to complement it. Sometimes those technologies overlap with each other and sometimes they are partially complementary. I will try to draw a brief map of them.


Programming Model

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. The Apache Hadoop project provides an open source MapReduce implementation.
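To make the paradigm a bit more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps for the classic word count. It is only an illustration in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample fragments are assumptions of mine.

```python
from collections import defaultdict

def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)

def word_count(fragments):
    pairs = []
    # Each fragment is mapped independently, so on a real cluster every
    # call could run (or be re-run after a failure) on a different node.
    for fragment in fragments:
        pairs.extend(map_phase(fragment))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input, already split into two fragments of work.
fragments = ["big data big clusters", "data moves to clusters"]
counts = word_count(fragments)
print(counts)
```

Because `map_phase` depends only on its own fragment, re-executing it always yields the same pairs, which is what lets a framework retry a failed fragment on another node.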


Management layer

The scalability needed for big data processing is supported by the Hadoop Distributed File System (HDFS). Data in a Hadoop cluster is broken down into blocks and distributed throughout the cluster. Although there are many alternatives to the HDFS layer (some of them known as NoSQL), it is well established in the present scenario. For this reason, in this post I will only describe the technologies related to the data processing layer that can be supported by HDFS. The alternatives for the data management layer will be considered in a future post.


Hadoop Ecosystem

Beyond HDFS, the Apache Hadoop Ecosystem is now commonly considered to include a number of related projects, a core group of Apache technologies built to run on top of Hadoop clusters. Three important ones are Apache Hive and Apache Pig, which add data processing and warehousing capabilities, and Apache Sqoop, which integrates HDFS with relational data stores. Other important Apache technologies in the open source Hadoop ecosystem are: Apache Mahout, an open source machine-learning library that facilitates building scalable machine learning applications; Apache Flume, a distributed service for efficiently collecting, aggregating and moving large amounts of log data to HDFS; Apache ZooKeeper, a high-performance coordination service for distributed applications; and Apache Avro and Apache Thrift, two very popular data serialization systems; among other less important projects. Some projects replace the MapReduce programming model; for instance, Apache Giraph offers an iterative graph-processing model in place of MapReduce.


Managing compute resources

Although MapReduce’s batch approach was a driving factor in the initial adoption of Hadoop, its inability to multitask and provide satisfactory real-time processing has been a difficulty for developers in recent years. For this reason Apache Hadoop YARN (Yet Another Resource Negotiator), a cluster management technology, appeared. Basically, this new layer splits key functions into two separate daemons, with resource management in one and job scheduling and monitoring in the other, broadening Hadoop’s processing features. With YARN the community generalized Hadoop MapReduce to provide a general-purpose resource management framework in which MapReduce became merely one of the applications that could process data in a Hadoop cluster. However, developers required a more general data-processing application to benefit the entire ecosystem, and this is the role of Apache Tez. Tez is a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez eliminates unnecessary tasks, synchronization barriers, and reads from and writes to HDFS. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem.
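The idea of executing an arbitrary DAG of tasks can be sketched in a few lines of plain Python. This is not the Tez API, just a toy illustration under my own assumptions: `run_dag`, the task names and the dependency table are hypothetical, and everything runs in one process. Intermediate results are handed directly from one task to the next in memory, which loosely mirrors how a DAG engine can avoid writing to storage between steps.

```python
from collections import deque

def run_dag(tasks, deps):
    """Run every task once all of its upstream tasks have finished.

    tasks: name -> function taking a dict {upstream name: upstream result}
    deps:  name -> list of upstream task names
    """
    pending = {name: len(deps.get(name, [])) for name in tasks}
    downstream = {name: [] for name in tasks}
    for name, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(name)

    ready = deque(name for name, count in pending.items() if count == 0)
    results, order = {}, []
    while ready:
        name = ready.popleft()
        inputs = {up: results[up] for up in deps.get(name, [])}
        results[name] = tasks[name](inputs)  # result stays in memory
        order.append(name)
        for nxt in downstream[name]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("the task graph contains a cycle")
    return results, order

# Hypothetical four-task graph: one source, two branches, one join.
tasks = {
    "load": lambda _: [1, 2, 3, 4, 5, 6],
    "evens": lambda i: [x for x in i["load"] if x % 2 == 0],
    "squares": lambda i: [x * x for x in i["load"]],
    "total": lambda i: sum(i["evens"]) + sum(i["squares"]),
}
deps = {"evens": ["load"], "squares": ["load"], "total": ["evens", "squares"]}

results, order = run_dag(tasks, deps)
print(order, results["total"])
```

The two branches (`evens` and `squares`) do not depend on each other, so a real engine would be free to run them in parallel; the sketch runs them one after the other for simplicity.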


Streaming Data

An important requirement for many current big data applications is processing streaming data in real time. Apache Storm appeared for this purpose. Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop. Storm is by far the most widely used real-time computation system at the moment. Mesosphere released a project that makes it easier to run Storm on Apache Mesos (an alternative to YARN), a cluster manager that simplifies running applications on a shared pool of servers. Storm often goes together with Apache Kafka, a distributed message broker used to store, send and subscribe to data streams. A less widespread alternative to Storm for streams of data is Apache S4. Related projects include Suro, a pipeline service for large volumes of event data that can be used to dispatch events for both batch and real-time processing. Summingbird is an open source library from Twitter that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms. Finally, also related to streaming, we can find SAMOA, a distributed streaming machine-learning framework for mining big data streams.


User-facing Abstractions

Apache Spark, a framework that will play an important role in the Big Data arena, deserves special attention. It has been extensively featured in this blog (please refer to my previous posts about Spark). Another comparable project is Apache Stratosphere. Both are distributed general-purpose compute engines that offer user-facing APIs, and both can run in a Hadoop cluster on top of HDFS and YARN. There are several other projects in the Hadoop space that offer user-facing abstractions, such as Cascading or Scalding. Cascading is a Java-based framework that abstracts and hides the complex implementation details involved in writing big data applications. Scalding is an extension to Cascading that enables application development with Scala, a powerful language for solving functional problems that is very popular in the Big Data community.

I hope that you find this post and the links to resources useful. Please let me know if you find any mistakes or have any suggestions for improving it. Thank you! I would also like to thank Marc de Palol and Nico Poggi for their comments on the first draft of this post.


Links to the related projects:

Open PhD position in Multimedia Big Data Computing at Barcelona Supercomputing Center

Published on 08/07/2014


Catalan Government’s FI grants / Spanish Government’s FPU grants (ref. BSC-Autonomic-Multimedia 09/2014)

The Autonomic Systems and eBusiness Platforms research group at Barcelona Supercomputing Center (BSC) invites outstanding candidates to apply for a full-time PhD position under the Catalan Government’s FI grants or the Spanish Government’s FPU grants. The PhD work will focus on all aspects of the algorithms, design and implementation of “big data” distributed computing systems to enable massive scale image and video analytics. Topics of interest include, but are not limited to:

  • Algorithms for large scale content analysis of massive scale image and video
  • Efficient or distributed high-dimensional indexing of multimedia
  • Multimedia Big Data analytics and visualization

The candidate must satisfy the UPC’s doctoral program access requirements and must also have a very good academic record, which is required to pass the competitive selection for the grants.

The following knowledge / skills will be considered for internal selection (not mandatory):

  • Excellent programming skills in different languages (C, C++, Java, Python, Scala, …).
  • Knowledge of multimedia analysis and indexing (image analysis, automatic image annotation, machine learning, SURF, BoW, SVM, etc.).
  • Knowledge of distributed large-scale data processing environments (Hadoop, NoSQL, Spark, etc.).
  • Knowledge of High Performance Computing environments (parallel programming experience, parallel programming models, …).

If you are interested in being selected by our group within this program, please send the following information (all in English) to Ruben Tous at rtous@ac.upc.edu (quoting “position autonomic-Multimedia 09/2014”) by 31st of August 2014: student CV, official transcripts of previous and current academic records, and any other relevant information.

Databricks Cloud: Next Step For Spark

Published on 01/07/2014


This morning, during the Spark Summit, Databricks announced a new step forward that will allow users to leverage Apache Spark technology to build end-to-end pipelines that underlie advanced analytics running on Amazon AWS. The name is Databricks Cloud. Spark is already deployable on AWS, but Databricks Cloud is a managed service based on Spark that will be supported directly by Databricks. They showed us an impressive demo of the platform.

The Databricks Workspace (photo taken with my iPhone :-) ) consists of:

  • Notebooks. Provide a rich interface that allows users to perform data discovery and exploration and to plot the results interactively.
  • Dashboards. Create and host dashboards quickly and easily. Users can pick any outputs from previously created notebooks, assemble these outputs in a one-page dashboard with a WYSIWYG editor, and publish the dashboard to a broader audience.
  • Job Launcher. Enables anyone to run arbitrary Apache Spark jobs and trigger their execution, simplifying the process of building data products.

I really enjoyed today's sessions of the Summit and was impressed with the advancements of Spark.

On the other hand, I read that Databricks also announced $33 million in Series B funding. The company has raised $47 million, including a $14 million round from last September. Congratulations!

Google launches Dataflow (a successor to MapReduce)

Published on 30/06/2014

I’m in San Francisco, ready to attend the 2014 Spark Summit tomorrow. As I already mentioned in this blog, Apache Spark is one technology that has emerged as a potential alternative to MapReduce/Hadoop. But it seems that it is not the only one. Last week, also here in San Francisco, at its Google I/O 2014 conference, Google unveiled its successor to MapReduce, called Dataflow, which it is selling through its hosted cloud service (equivalent to Amazon's Data Pipeline service and Kinesis for real-time data processing).

Urs Hölzle (Google’s senior vice president of technical infrastructure and a Google Fellow) introduced how Dataflow is used for analytics during a keynote address at the Google I/O 2014 conference (minute 2:06:30 in this video of the keynote). The service lets you construct an analytics workflow and then send it off to the Google Cloud for execution. A Google engineer gave an interesting demo that analyzed the sentiment of soccer fans during the World Cup as expressed on Twitter.

As you can see this is a very active area!

Barcelona Supercomputing Center starts to work on Deep Learning

Published on 26/06/2014

What is Deep Learning?

We can consider Deep Learning a new area of Machine Learning research whose objective is to move Machine Learning closer to Artificial Intelligence (one of its original goals). Our research group has been working on Machine Learning for a long time, thanks to Ricard Gavaldà, who introduced us to this wonderful world. It was during the summer of 2006, together with Toni Moreno, Josep Ll. Berral and Nico Poggi. Unforgettable moments! Now, after 8 years, we will take a step forward and start working with Deep Learning. I first heard about “Deep Learning” from Jordi Nin during a group retreat held last September.

Deep Learning comes from neural nets, conceived in the 1940s and inspired by the synaptic structures of the human brain. But early neural networks could simulate only a very limited number of neurons at once, so they could not recognise patterns of great complexity. Neural networks had a resurgence in the 1980s, when researchers helped spark a revival of interest in them with new algorithms, but complex speech or image recognition required more computer power than was then available.

In the last decade researchers made some fundamental conceptual breakthroughs, but until a few years ago computers weren’t fast or powerful enough to process the enormous collections of data that these types of algorithms require. Right now, companies like Google, Facebook, Baidu, Yahoo or Microsoft are using deep learning to better match products with consumers by building more effective recommendation engines.

Deep Learning attempts to mimic the activity in layers of neurons in the neocortex with a software system. This software creates a set of virtual neurons and then assigns random weight values to the connections between them. These weights determine how each simulated neuron responds to a digitised feature. The system is trained by blitzing it with digitised versions of images containing the objects. An important point is that the system can do all that without asking a human to provide labels for objects (as is often the case with traditional machine learning tools). If the system does not accurately recognize a particular pattern, an algorithm adjusts the weights of the neurons.
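To give a feeling for what "adjusting the weights" without labels can look like, here is a deliberately tiny sketch: a single linear neuron that learns to compress and reconstruct unlabelled two-dimensional points by gradient descent on its reconstruction error. This is my own toy example, not the method used by any of the systems mentioned in this post; the data, the starting weights (fixed instead of random so the run is reproducible), the learning rate and the function names are all assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruction_loss(w, data):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        h = dot(w, x)  # activation of the single neuron
        total += sum((xi - h * wi) ** 2 for xi, wi in zip(x, w))
    return total / len(data)

def train(w, data, lr=0.02, steps=300):
    """Adjust the weights by gradient descent; no labels are involved."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            h = dot(w, x)
            # Error between the input and what the neuron reconstructs.
            e = [xi - h * wi for xi, wi in zip(x, w)]
            ew = dot(e, w)
            for j in range(len(w)):
                grad[j] += -2.0 * (ew * x[j] + h * e[j])
        w = [wj - lr * gj / len(data) for wj, gj in zip(w, grad)]
    return w

# Hypothetical unlabelled data lying close to the diagonal.
data = [(1.0, 1.0), (2.0, 2.1), (-1.0, -0.9), (-2.0, -2.0), (0.5, 0.4)]
start = [0.3, 0.1]
learned = train(start, data)
print(reconstruction_loss(start, data), reconstruction_loss(learned, data))
```

The neuron ends up responding to the one direction along which the data varies the most, which is a (very) small-scale version of a layer discovering a feature on its own.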

The first layer of neurons learns primitive features, like an edge in an image. It does this by finding combinations of digitized pixels that occur more often than they should by chance. Once that layer accurately recognizes those features, they are fed to the next layer, which trains itself to recognize more complex features, like a corner. The process is repeated in successive layers until the system can reliably recognize objects or phonemes. An interesting paper that Jordi Nin sent me comes from Google, which used a neural network with a billion connections. The authors consider the problem of building high-level, class-specific feature detectors from only unlabelled data, training a 9-layer network of virtual neurons (the model has 1 billion connections) on a dataset of 10 million images. Training the many layers of virtual neurons in the experiment required 16,000 computer cores! Is it clear now why our research group is entering this amazing world?

(*) Picture from Andrew Ng (Stanford)

BSC releases COMPSs software for large scale parallelisation

Published on 24/06/2014

The Grid Computer and Clusters team at the Barcelona Supercomputing Centre has released COMPSs, a set of tools designed to help developers run applications efficiently on distributed systems such as clusters, grids, and clouds. Our research group is using this programming model in some of our ongoing research work.

COMPSs is a task-based programming model known for notably improving the performance of large scale applications by automatically parallelising their execution. The new release includes PyCOMPSs, a new binding for Python, which provides support for a large number of scientific disciplines. It also includes some important features, such as a new tracing system using the Extrae tool and an Integrated Development Environment (IDE) for COMPSs applications that helps in developing the applications and deploying them in the distributed environment.
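For readers who have never seen a task-based model, the following sketch conveys the general idea using only Python's standard `concurrent.futures` module. It is not the COMPSs or PyCOMPSs API: the `task` decorator defined here is a hypothetical stand-in of mine, and it runs on local threads rather than on a cluster, grid or cloud. The point is that the program is written sequentially, while calls to functions marked as tasks return immediately and the data dependencies between them decide what can run in parallel.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# Local thread pool standing in for the distributed resources.
_pool = ThreadPoolExecutor(max_workers=4)

def task(func):
    """Hypothetical decorator: calling the function submits it as a task."""
    @wraps(func)
    def submit(*args):
        def run():
            # Wait for any argument that is the result of an earlier task.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)  # returns a Future straight away
    return submit

@task
def square(x):
    return x * x

@task
def add(a, b):
    return a + b

# Sequential-looking code; the four squares are independent tasks.
partials = [square(n) for n in range(1, 5)]
left = add(partials[0], partials[1])
right = add(partials[2], partials[3])
total = add(left, right)
print(total.result())
```

Only the final `total.result()` blocks the main program; everything before it merely describes the work and its dependencies.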

In the last year the team’s efforts have been focusing on emerging virtualisation technologies, adopted by cloud environments. In such systems, COMPSs provides scalability and elasticity features by dynamically adapting the number of resources to the actual workload. COMPSs is designed to be interoperable with both public and private cloud providers like Amazon EC2, OpenNebula, BSC EMOTIVE Cloud and with OCCI compliant offerings.

Version 1.1.2 of the COMPSs programming environment is already available in three main programming languages: Java, C/C++ and Python. The packages and the complete list of features are available on the Downloads page. A virtual appliance is also available to test the functionality of COMPSs through a step-by-step tutorial that guides the user through developing and executing a set of example applications. For additional information, contact Rosa M Badia, the leader of the Grid & Cluster team.

Congratulations to my colleagues in the Grid Computer and Clusters team for the great effort and excellent results!

Barcelona Spark Meetup: 200 members!!!

Published on 23/06/2014

Hi, the Barcelona Spark Meetup has reached the magical number of 200 sparkers! Great news for all of us! Thank you!

After the successful kickoff, the second meeting will feature David Rodriguez (CTO at Urbiotica) speaking about Apache Spark as a scalable and fault-tolerant environment for batch, speed and serving layer (talk in Spanish). We hope to see you next Thursday 10th July at this second meetup.

Join Barcelona Spark meetup


And if you are interested in other Spark meetups around the world, here you can find the list:

A Master's in Smart Healthcare starts in September 2014

Published on 19/06/2014

A good colleague from the University of Girona, Beatriz López, told me about the Smart Healthcare Master. The Master aims to prepare technologists for the current revolution of health services based on the intensive use of data.

Have you ever analysed how the use of mobile phones by chronic patients will impact the healthcare system? Is it possible to know how personalized medicine will be implemented to deal with the ageing society? The Master is designed to educate technologists who will help find answers to these challenging questions.

The Smart Healthcare Master is inter-disciplinary, joining disciplines such as Health Sciences (healthcare processes; decision-making in healthcare processes),  Artificial Intelligence (intelligent systems; machine learning), Data Analysis (intelligent data analysis; applied biostatistics),  and Organisation and Management (quality and standards; health informatics).

Accordingly, the faculty staff come from these different disciplines. Of particular note is the collaboration of the Artificial Intelligence Research Institute (IIIA), the reference research center on AI belonging to the Spanish National Research Council (CSIC).

The Master aims to maintain a strong connection with the professional reality of the health sector and to give students closer contact with healthcare providers (service companies, hospital corporations, administrations, etc.). To this end, it is supported by the Master’s Advisory Board, a group of health organisations as well as professional associations, actively involved in the Master.

You can follow the Master’s activities on Twitter at @UdGSmartHealth.

Kickoff of the “Barcelona Spark Meetup” with Telefónica I+D

Published on 10/06/2014

Next Thursday brings the first monthly gathering of the Barcelona Spark Meetup. The reception this recently created Spark group has had is impressive. Thank you all!

Encouraged by this, we have already planned possible activities for the coming months. At this first gathering we will be joined by Daniel Tapiador [1] and Ignacio Blasco [2] from Telefónica I+D, who will tell us about the role this emerging technology plays in Telefónica's R&D. In July our guest will be David Rodriguez, CTO of Urbiotica. After the summer, in September, we already have a visit planned from Daniel Villatoro, Senior Data Scientist at BBVA, followed in October by Jordi Aranda, one of the leading Big Data researchers at BSC. And in November we hope to have a surprise for everyone, taking advantage of the Strata Conference 2014 being held in Barcelona. And, as is customary at meetups, we will always finish with a “discussion with some beers”. We look forward to seeing you next Thursday at 19:00!

Remember that Apache Spark, which I have been writing about for some time on this blog, is one of the most active projects in the Big Data world, with more contributors over the last year than Hadoop. If you are interested in Big Data, I suggest you join the group at this link. You can also follow the group's activities through @SparkBarcelona on Twitter.


[1] Daniel Tapiador has over 11 years of experience in distributed systems and parallel computing and is currently leading the development of a core insights platform at Telefónica. He has also worked at the European Space Agency for around 8 years where he was mainly focused on scientific data processing and archiving. He is also undertaking some research within the team devoted to ESA Gaia mission data mining.

[2] Ignacio Blasco has over 15 years of experience in software development, 7 of them at Telefónica I+D. He has worked on a wide range of projects with a wide range of technologies, such as signaling network monitoring, cloud computing and big data. Passionate about functional languages, he is currently leading the Functional Programming Community in Telefónica I+D to push the adoption of FP technologies in the enterprise.

Apache Spark 1.0.0: Spark SQL replaces Shark

Published on 02/06/2014

Apache Spark 1.0.0 was released on May 30th. Alessandro Chacon, a former student, noticed that there is a new addition to the Spark ecosystem called Spark SQL. Spark SQL is separate from Shark (the system currently used), and does not use Hive under the hood. With the advent of Hadoop and NoSQL databases, building a data warehouse for processing big data became easier; however, it requires specialized development skills and a non-trivial amount of effort. Hive solved this problem by providing a familiar SQL querying engine on top of Hadoop that translates SQL queries into MapReduce code. Spark provides a similar SQL querying engine called Shark. Shark still relies on Hive for query planning, but uses Spark instead of Hadoop during the physical execution phase. In conclusion, Spark SQL is an alternative SQL engine, one that is divorced from Hive! It provides schema-aware data modeling and SQL language support in Spark.

Some of the other improvements in this new release are that MLlib has expanded to include several new algorithms, and there are major updates to both Spark Streaming and GraphX.

Business-Driven Resource Allocation and Management for Data Centers in the Cloud Markets

Published on 01/06/2014

This week, Mario Macias, one of the researchers in our research group, defended his PhD dissertation. The work is centered on the Cloud Computing arena. Cloud Computing markets arise as an efficient way to allocate resources for the execution of tasks and services within a set of geographically dispersed providers from different organisations. Client applications and service providers meet in a market and negotiate the sale of services by signing a Service Level Agreement that contains the Quality of Service terms that the Cloud provider has to guarantee by properly managing its resources. Current implementations of Cloud markets have certain weaknesses at this level. Mario’s work presents interesting solutions for them. I’m really proud of Mario’s work. My sincere congratulations to Mario and also to his advisor, Jordi Guitart, who has also done an excellent job. In the following link you can find the PDF of the PhD document. Enjoy!


3rd International Workshop on Citizen Networks

Published on 01/06/2014

We are pleased to announce the organisation of the workshop ‘CitiNet 2014’ by BSC during ECCS 2014. CitiNet 2014 will be a one-day event that will take place on the 25th of September 2014 in Lucca (Italy). The topics addressed include geo-spatial analytics, urban analytics, urban modelling and simulation, and citizen sensor networks, among others. Detailed information about the topics of interest, the workshop programme and the call for papers can be found at the workshop website.



Important Dates
* Submission Deadline: June 15, 2014
* Authors Notification: July 14, 2014
* Conference: September 25, 2014


Adaptive MapReduce Scheduling in Shared Environments

Published on 31/05/2014

Jordà Polo presented our latest research on MapReduce at the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, held in Chicago. In this paper we present a MapReduce task scheduler for shared environments in which MapReduce is executed along with other resource-consuming workloads, such as transactional applications. All workloads may potentially share the same data store, with some of them consuming data for analytics purposes while others act as data generators. This kind of scenario is becoming increasingly important in data centers, where improved resource utilization can be achieved through workload consolidation, and it is especially challenging because workloads of different nature interact and compete for limited resources. The proposed scheduler aims to improve resource utilization across machines while observing completion time goals. Unlike other MapReduce schedulers, our approach also takes into account the resource demands of non-MapReduce workloads, and assumes that the amount of resources made available to the MapReduce applications varies over time. As shown in our experiments, our proposal improves the management of MapReduce jobs in the presence of variable resource availability, increasing the accuracy of the estimations made by the scheduler and thus meeting completion time goals more often, without an impact on the fairness of the scheduler.
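The following plain-Python sketch illustrates why variable resource availability matters when estimating completion times. It is not the scheduler from the paper: the function names, the uniform task duration and the "fluid" approximation (work measured in slot-steps, ignoring task boundaries) are all simplifying assumptions made for illustration.

```python
def estimated_completion(remaining_tasks, task_duration, slots_per_step):
    """Estimate how many time steps a job needs to finish.

    remaining_tasks: number of tasks still to run
    task_duration:   time steps each task occupies one slot (assumed uniform)
    slots_per_step:  forecast of slots available to MapReduce at each step,
                     i.e. what is left after non-MapReduce workloads

    Returns the step at which the work is exhausted, or None if the
    forecast is too short for the job to finish.
    """
    work = remaining_tasks * task_duration  # total slot-steps still required
    for step, slots in enumerate(slots_per_step, start=1):
        work -= slots
        if work <= 0:
            return step
    return None


def meets_goal(remaining_tasks, task_duration, slots_per_step, goal):
    """Check whether the estimated completion time respects the goal."""
    finish = estimated_completion(remaining_tasks, task_duration, slots_per_step)
    return finish is not None and finish <= goal


# Same job, same goal: a constant-capacity assumption vs. a variable forecast
# in which a transactional workload takes slots away during steps 3 and 4.
constant = [4, 4, 4, 4, 4, 4]
variable = [4, 4, 2, 2, 6, 6]
print(meets_goal(10, 2, constant, goal=5))
print(meets_goal(10, 2, variable, goal=5))
```

An estimator that assumes constant capacity would report the goal as met, while the variable forecast shows the job finishing one step late; a scheduler aware of the forecast can react earlier, for example by giving the job more slots while they are still available.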

A PDF copy of the paper can be downloaded here.


Want to learn more about Big Data technology?

Published on 29/05/2014

Apache Spark is one of the most active projects in the Big Data world, with more contributors over the past year than Hadoop, which I have been writing about on this blog for some time. Are you interested in Big Data? Do you have questions about what it is used for and what kinds of problems it can solve? If so, and you live near Barcelona, we look forward to seeing you at the Barcelona Spark Meetup, which we have just created: an interdisciplinary group of people interested in the most diverse areas of the emerging Big Data world. Our main goal as a group is to meet people interested in this technology, hear about their related projects and have a good time together. As is customary at this kind of event, there will even be beer.

If you are interested, I suggest you join the group via this link and select the meetups you would like to attend.

Our first meeting takes place next Thursday, June 12, at 19:00. It will include a presentation of the group and a talk by David Rodriguez (Director of Technology at Urbiotica) on “Apache Spark as a scalable, high-availability environment”. For the same day we have organised a guided tour of MareNostrum, one of the most powerful supercomputers in Europe, housed in the Torre Girona chapel.

And if you cannot attend this first meeting, I suggest you join the group anyway to receive information about upcoming meetings with talks of great interest (we plan to hold roughly one meeting a month). I look forward to seeing you!


Barcelona Spark Meetup link



Is Hadoop showing its age?

Published on 22/05/2014

In my opinion, yes! The Hadoop framework is showing its age, and new processing models are a must, not only for performance but also because of Hadoop's lack of flexibility. In a way, the same thing is happening in Big Data management: because of their limited query flexibility, NoSQL databases are adding new SQL-based query features; conversely, SQL databases are bringing some measure of NoSQL performance to relational models.

Recently, some colleagues and I decided to explore the Spark ecosystem. Spark is a Hadoop MapReduce alternative that improves on Hadoop's performance, in part thanks to its ability to cache intermediate results in memory. Additionally, Spark addresses the lack of flexibility of the MapReduce model. Spark also allows us to use Scala in addition to Python and Java (some members of our research team are also members of Scala Developers Barcelona :-) ).
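As a rough illustration of why keeping intermediate results in memory helps iterative jobs, here is a small plain-Python sketch. It does not use Spark or reproduce its API; the `Pipeline` class and its `parse_calls` counter are hypothetical and only count how often an expensive intermediate step is recomputed with and without caching.

```python
class Pipeline:
    """Toy dataset whose parsed form is an expensive intermediate result."""

    def __init__(self, raw_lines, cache=False):
        self.raw_lines = raw_lines
        self.cache = cache
        self._parsed = None
        self.parse_calls = 0  # how many times the parse step actually ran

    def parsed(self):
        """Return the parsed data, reusing the in-memory copy when caching."""
        if self.cache and self._parsed is not None:
            return self._parsed
        self.parse_calls += 1
        result = [int(line) for line in self.raw_lines]
        if self.cache:
            self._parsed = result
        return result


def iterate(pipeline, rounds):
    """Run an iterative job that reads the parsed data once per round."""
    total = 0
    for _ in range(rounds):
        total += sum(pipeline.parsed())
    return total


uncached = Pipeline(["1", "2", "3"], cache=False)
cached = Pipeline(["1", "2", "3"], cache=True)
iterate(uncached, rounds=4)
iterate(cached, rounds=4)
print(uncached.parse_calls, cached.parse_calls)
```

Both runs compute the same totals, but the uncached pipeline repeats the parse step in every round, which is roughly the situation of a chain of MapReduce jobs that re-read their input from disk on each iteration.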

The question that arises is “Why isn’t it more widely used?” In my opinion, it is because of the lack of commercial support until now and because it only recently emerged from academia. I think Databricks is doing a good job of addressing both issues.

We have clearly opted to include the Spark ecosystem as a testbed in two research lines, led by Ruben Tous [1] and Jordi Nin [2], in real-time social big data analytics. Time will tell whether the bet has been successful :-)

[1] Ruben Tous is working on Multimedia Big Data Computing, a new topic that aims to provide novel algorithms and frameworks for large-scale content analysis of massive image and video collections. The work, which targets the hundreds of photos and videos that users share online every second, will integrate tools for multimedia indexing and analysis with stream-processing frameworks such as Apache Spark.

[2] Jordi Nin is developing a decision support system to help start-up companies determine which cloud services are the most suitable (cost vs. risk) for their products. To do that, several social networks are analyzed to extract the social reliability of the different services offered by current cloud providers.
