
PhD grant at the Barcelona Supercomputing Center

Published on 17/12/2014

"la Caixa" grants for doctoral studies at Spanish universities (ref. BSC-Autonomic 01/2015)


The call for "la Caixa" grants for doctoral studies at Spanish universities has just opened, and our research group has a position for a researcher to pursue a PhD within the Barcelona Supercomputing Center (BSC-CNS) and the Computer Architecture doctoral programme of UPC Barcelona Tech (which meets the quality-mention requirement of the call), for a candidate who obtains one of these grants.

To apply for this grant you must have Spanish nationality and meet all the admission requirements of an official doctoral programme by September 2015 (enrolment in the doctoral programme would take place between September 2015 and January 2016).

The PhD work would focus on the study of systems, algorithms and data structures for processing massive streams of highly dimensional data, such as photos and videos coming from social networks, on high-performance architectures such as the MareNostrum supercomputer. The position falls within the multimedia big data computing area of the Autonomic Systems and eBusiness Platforms research group, without doubt an exciting multidisciplinary research field with great future applications.

The grant's funding, as well as other details (duration, incompatibilities, etc.), is described in the terms of the call.

To get through the competitive selection for these grants, the candidate needs a strong academic record and an excellent command of English (which, if finally chosen as our candidate for these grants, they will have to accredit through one of the certificates specified in the terms of the call).

In addition, the following knowledge will be valued during the selection process:

  1. Programming skills in several languages (Scala, Java, C++, Python, etc.), as well as proficiency with Linux environments and their scripting languages.
  2. Knowledge of highly parallel and distributed systems and architectures.
  3. Knowledge of probability theory, linear algebra and mathematical analysis.
  4. Knowledge of Big Data systems (Hadoop, Cassandra, Apache Spark, etc.).
  5. Knowledge of data analysis techniques (clustering, machine learning, etc.).
  6. Knowledge of computer vision techniques (OpenCV, etc.).

Only one of those interested can be our candidate for these grants. We therefore ask interested applicants who meet all the conditions above to contact us as soon as possible, and no later than 24 January 2015; we will confirm our choice before 27 January, so that there is enough time afterwards to prepare the application properly and so that those not selected have time to find other opportunities (the deadline for the grants is 23 February 2015). Interested applicants can send an email with a short cover letter (<300 words) to rtous@ac.upc.edu (Professor Rubèn Tous) and torres@ac.upc.edu (Professor Jordi Torres) with the subject "position autonomic 01/2015", including the following documents (in .pdf format, all compressed into a single .zip file):

  • Academic transcript of the bachelor's degree, including the average grade and, if possible, your relative position within the class.
  • Academic transcript of the master's degree (if applicable).
  • A short curriculum vitae (2 pages is enough, 4 at most) including at least:
    • Basic personal information (place of residence, date of birth, gender and a recent photograph).
    • Academic history.
    • Professional experience.
    • Projects relevant to the application (for example, bachelor's and master's final projects).
    • Stays at other centres during your studies (if any).
    • Details of the knowledge and skills relevant to the application (those listed in the six points above), indicating the level of knowledge and how it was acquired.
    • Other notable merits: awards, publications, etc.

Given the complexity of the process, applications that do not include all the required information, or that do not present it in the indicated format, will not be considered.

 

David Carrera selected for ERC Starting Grant

Published on 17/12/2014

David Carrera, one of the senior researchers in our research group, has been selected for a prestigious European Research Council (ERC) Starting Grant for his project Holistic Integration of Emerging Supercomputing Technologies (Hi-EST). This first Starting Grant competition under the EU's Horizon 2020 programme supports early-career talent in developing ambitious high-risk, high-gain research projects.

David received his MS degree in 2002 and his PhD in 2008, and since then he has been leading several EU and industrial research projects in our group. David is an outstanding researcher, a Messi in our research group! Congratulations David! You deserved this recognition! You're the best!

 

 

A new start-up that does deep learning

Published on 06/12/2014

"Deep learning is a rapidly growing branch of artificial intelligence. It comprises a set of techniques that don't require domain experts to program knowledge into algorithms. Instead, these techniques can learn by observing data." This is the definition we can find on the website of MetaMind, a Palo Alto startup launched on Friday that uses deep learning to analyze images, text and other data. The company has raised $8 million! Yoshua Bengio (pictured), of the University of Montreal, whom we consider one of the handful of deep learning masters, is one of MetaMind's advisers. Professor Bengio says: "Metamind is one of the few deep learning startups with recognized and strong academic credentials in the deep learning research community, in both areas of visual data and natural language (and their combination), as well as regarding algorithms and architectures. They have achieved state-of-the-art performance on difficult academic benchmarks in both of these areas and are committed in advancing the research in difficult and exciting challenges for deep learning ahead of us".

More details about this important news can be found on the GIGAOM website and the MetaMind web page.

 

 

The programme "Big Data, el petroli del segle XXI" on RTVE-Catalunya

Published on 30/11/2014

RTVE in Catalonia, as part of its programme #tincunaidea, has produced a report on BIG DATA entitled "Big Data, el petroli del segle XXI" (Big Data, the oil of the 21st century), for which they also came to ask for the opinion of UPC Barcelona Tech and the Barcelona Supercomputing Center (BSC-CNS), and we took part. Thanks to the programme's team for the pleasant time we had during the recording. I hope this programme helps spread knowledge of this new technology, which will change all our lives whether we like it or not!

Link to the video: http://www.rtve.es/alacarta/videos/tinc-una-idea/tinc-idea-projectes-big-data-petroli-del-segle-xxi/2878007/

 

Building a Unified Data Pipeline in Spark

Published on 24/11/2014

Excellent reception from sparkers at the last session of the Barcelona Spark Meetup, featuring Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) speaking about 'Building a Unified Data Pipeline in Spark'.

If you missed the presentation or want to revisit it, check out the video recorded here  (talk in English). Enclosed you will find some pictures of the session.

Thank you very much to Aaron Davidson for accepting our invitation, and also to Paco Nathan, Alex Sicoe, Sameer Farooqui and Olivier Girardot for their support of this meetup. I hope you enjoyed Barcelona and come back soon.


 

 

Strata + Hadoop World in Barcelona 2014: Videos & Slides

Published on 22/11/2014

The conference is over and, from my point of view, it was a great success. The conference programme was very good, with great networking opportunities and a good sponsor pavilion. I really enjoyed it.

Let me say to the organisers that Barcelona is delighted to welcome conferences like Strata+Hadoop. And all the attendees with whom I spoke were excited to be in Barcelona. Congratulations on choosing Barcelona!

If you missed the conference or want to revisit the main presentations or keynotes, check out the keynote videos or speaker slides. You can also check out the official photos.

 

BSC releases COMPSs version 1.2 at SC14

Published on 20/11/2014

The Grid Computing and Clusters group of the Barcelona Supercomputing Center is proud to announce the release of COMPSs version 1.2 during the Supercomputing Conference 2014. COMPSs is a framework for easily implementing distributed applications.

This release implements the following main features:

* N implementations for task methods, each with its own constraints.
* Constraint-aware resource management.
* Support for multicore tasks.
* Pluggable schedulers: facilitate the addition of new schedulers and policies.
* Extended support for objects in the C/C++ binding.
* Extended IDE for N implementations and deployment through PMES.
* Updated cloud connector for rOCCI to work with rocci-cli v4.2.5.
* Enhanced rOCCI connector to compute the real VM creation times.
* Extended resources schema to support Virtual Appliances pricing.
* New LSF GAT adaptor.

For more details and downloads, please visit the COMPSs webpage: http://compss.bsc.es

Install the IDE through the Eclipse Marketplace:

http://marketplace.eclipse.org/content/comp-superscalar-integrated-development-environment

 

Get certified for Apache Spark in Barcelona

Published on 13/11/2014

As all my students know, I think that Hadoop is showing its age and Apache Spark is exploding. Let me share with you an important opportunity to get the Developer Certification for Apache Spark in Barcelona. Yes, I said in Barcelona! It will take place at the upcoming Strata + Hadoop World next week in the CCIB – Centre Convencions Internacional de Barcelona. If you want to learn more you can visit this web page. It is a good opportunity! I hope to see you at the Strata + Hadoop World event!

You are also invited to attend the next meeting of the Barcelona Spark Meetup. This fourth meeting will feature Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks) speaking about 'Building a Unified Data Pipeline in Spark' (talk in English). The talk will start next Thursday, 20 November, at 18:30 in the sala de actos of the FIB (Campus Nord – UPC). It is necessary to register here. We look forward to seeing you all!

 

 

APIs that make it easier to create predictive models

Published on 04/11/2014

Our research group works on Data Analytics, where predictive modelling plays an important role.

Predictive modelling is an important process by which a model is created or chosen to try to best predict the probability of an outcome. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. Now, you can use an increasing number of API products offering predictive analytics services that make it easier to create and deploy predictive models in your business or in your app.

Last year I invited one of them, BigML, to give a guest lecture in one of my courses. Together with the Google Prediction API, they were the only ones I knew. However, as Louis Dorard explains in his blog, there are actually many more great tools in this space (some of which only came out this year): Datagami, Dataiku, Indico, Intuitics, Graphlab, Openscoring, PredictionIO, Rapidminer, Yhat…

What better way to learn how to use them than from the very people who made them, through hands-on sessions illustrated with concrete case studies? Well, that’s what’s waiting for you at PAPIs.io on 17 and 18 November at UPC campus in Barcelona — right before Strata conference.

The Predictive APIs and Apps conference — PAPIs.io — is the first of its kind, aimed at giving voice to the increasing number of API products offering predictive analytics services. Our research group is collaborating in the organisation of this important workshop. Check out the full schedule and list of speakers on Lanyrd, or download the PDF with the information. If you are interested in attending, you can register online.

Hoping to meet you in person soon in Barcelona!

 

Cognitive computing in the newspaper El País

Published on 01/11/2014

Today the newspaper El País (technology section) published the article "Ordenadores que entienden a los humanos para buscar petróleo" (Computers that understand humans to search for oil), written by the journalist Daniel Mediavilla, with whom I had the chance to exchange impressions, as the article reflects, about one of the "latest trends in computing", known as cognitive computing.

I will soon try to explain on this blog what this field of computing means and what we are doing in it at BSC+UPC. You know I am one of those who believe that researchers have an obligation to explain what we do if we want society to support us, but the day-to-day is so overwhelming that lately I struggle to find time to devote to this small outreach window.

For now, I attach Daniel's article and suggest you read it. I think Daniel has managed to explain in a very plain way what we mean when we talk about cognitive computing. I am sure you will like it. Besides feeling very comfortable during our conversation, I found his article very informative and well focused (I do not always get the same feeling when I see the filter/effort made by a journalist to make a research topic we had previously discussed understandable). Enjoy it!

 

Computers that understand humans to search for oil

by DANIEL MEDIAVILLA, 30 OCT 2014 – 17:29 CET
Repsol announces an agreement with IBM to develop cognitive computing systems that interpret huge amounts of data to detect oil and gas fields

Energy companies like Repsol have a problem of scarcity and another of abundance. Gas and oil, those underground masses of energy accumulated over millions of years in plant remains, are no longer found at the surface as in the good old days. In 1949, the average depth of a well in the United States was 1,171 metres; in 2008 it was 1,926. Fossil fuels are running out, and they are doing so just now, when tens of millions of people in African and Asian countries are reaching the middle class, with their cars, their air conditioning and their appetite for meat. If nothing changes, by 2035 the planet's energy consumption will be 50% higher and 65% of that voracity will be satisfied with hydrocarbons.

To face this challenge, find the fuel in the hiding places where it lurks and keep their business growing, companies like Repsol count on a resource that, unlike the raw materials they search for, is very abundant. Knowledge about all the aspects surrounding the complex task of extracting hydrocarbons is more plentiful than ever. So much so that no human brain can take advantage of it, and the computers used to extend our data-management capabilities are starting to prove insufficient.

Today Repsol announced an agreement with IBM so as not to drown in the abundance of information and to put it at its service. Within a project named Pegasus, the two companies are developing two applications of what is known as cognitive computing to improve the energy corporation's strategic capability when selecting new oil fields in which to invest and when optimising the use of its reserves. The project will be carried out jointly by a mixed Repsol-IBM team, which will work in the most advanced facilities in this field, such as the world's first cognitive laboratory, owned by IBM and located in New York, and at the Repsol Technology Centre in Móstoles (Madrid).

Computers that ask questions

With today's computing systems, an engineer could pose a hypothesis, teach the computer to test it and, taking a large amount of data into account, check whether it holds. "The new systems will be able to learn and pose new questions," explains Jordi Torres, a researcher at the Barcelona Supercomputing Center. "These systems make it possible to take a large amount of data from different sources, from scientific articles to newspaper stories or images, analyse it within a context and, for example, discover a correlation that had not even occurred to you and pose a new question," he adds.

"In a way, Watson emulates the way people reason," says Elisa Martín Garijo, director of Innovation and Technology at IBM Spain. "Given a question, it formulates hypotheses and chooses the answer in which it has the highest level of confidence, shows the steps taken to reach that answer, shows a line of reasoning and learns from its experience," she continues. "These machines do not give you the correct answer, they give you the best possible answer taking the context into account; they are able to manage the ambiguity of real life," adds Torres.

"As a human you do not have the capacity to access and process such a huge amount of data," says Santiago Quesada, director of Exploration and Production Technology at Repsol. "With the new system, you could tell the computer how many fields in the world are exploited in carbonate or sandstone terrains and give it information to set out the geological context," he continues. "Then the computer would combine that data with access to information across the whole web, associated databases, reports… And afterwards it would give its conclusion to the technicians, who would always be the ones responsible for making the decision," Quesada concludes. "Rather than giving you a final answer, it works like an assistant that makes recommendations," Torres points out.

According to Quesada, the development of this technology would help minimise the number of erroneous surveys, increasing the company's profits and limiting the environmental impact of useless drilling. The company had previously collaborated with scientists from the CSIC, Stanford University and IBM to develop technological projects such as Caleidoscopio. This system, which makes it possible to process seismic images faster and more reliably, increases the chances of finding oil and gas thousands of metres underground and, according to the company, has played an important role in its more than 50 hydrocarbon field discoveries over the last eight years.

The capabilities of cognitive computing will not only have applications in the extraction of fossil fuels. IBM is also working in the health field to manage the huge amount of information that genomics is producing, giving specialists the ability to interpret it and make it available to patients. With these computerised advisers, analysis capabilities that are now only available at hospitals with the best specialists can be brought closer to health centres with fewer resources.

Combining computers' capacity to process natural language, manage ambiguity and understand contexts, Jordi Torres says that some of his students are applying this knowledge to social networks to predict the future. "Through tweets it would be possible to foresee, for example, whether a lot of people are going to show up at an event in a square, and with that information the necessary police presence could be planned or taxi drivers could be informed," he explains. As in the case of the exploitation of natural resources, the ability to bring computers' way of reasoning closer to that of humans may change humanity itself.

Taken from http://elpais.com/elpais/2014/10/30/ciencia/1414686554_028577.html

 

BSC opens its doors again for 48h Open House Barcelona

Published on 23/10/2014

The Barcelona Supercomputing Center joins the 48h Open House Barcelona event for the third year; the event aims to showcase the city's architectural heritage.

Torre Girona, specifically the chapel that houses the MareNostrum supercomputer, will be open to the public on Saturday 25 October from 11:00 to 19:00 to show visitors the inside of the chapel and explain the research carried out at BSC-CNS.

The visit will also feature four researchers who will present the centre and the scientific work carried out there. No booking is needed; visits will take place roughly every half hour.

Don't miss the opportunity to visit this unique space. We look forward to seeing you!

 

Databricks-Spark comes to Barcelona!

Published on 09/10/2014

We did it: a meetup with engineers coming over from the USA to tell us first-hand what is cooking around Spark at Databricks!

This fourth meeting will feature Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks) speaking about 'Building a Unified Data Pipeline in Spark' (talk in English).

The talk will take place next Thursday, 20 November, at 18:30 in the sala de actos of the FIB, on the Campus Nord of UPC. We look forward to seeing you all there; it is sure to be impressive!

If you are interested, it is very important that you sign up as soon as possible on the meetup's list of confirmed attendees, since the capacity of the room is 80 people and this time we will not be able to increase it under any circumstances. (As jointly agreed by the attendees of the previous meetup, from now on we will try a small fee of 2 euros to cover minor expenses.)

Abstract: One of the promises of Apache Spark is to let users build unified data analytic pipelines that combine diverse processing types. In this talk, we’ll demo this live by building a machine learning pipeline with 3 stages: ingesting JSON data from Hive; training a k-means clustering model; and applying the model to a live stream of tweets. Typically this pipeline might require a separate processing framework for each stage, but we can leverage the versatility of the Spark runtime to combine Shark, MLlib, and Spark Streaming and do all of the data processing in a single, short program. This allows us to reuse code and memory between the components, improving both development time and runtime efficiency. Spark as a platform integrates seamlessly with Hadoop components, running natively in YARN and supporting arbitrary Hadoop InputFormats, so it brings the power to build these types of unified pipelines to any existing Hadoop user.

This talk will be a fully live demo and code walkthrough where we’ll build up the application throughout the session, explain the libraries used at each step, and finally classify raw tweets in real-time.
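For readers who want to map the abstract onto concrete API calls, here is a minimal sketch of such a unified pipeline. It is not the speakers' demo code: it assumes Spark 1.x-era MLlib and Spark Streaming APIs, reads pre-extracted numeric features from a hypothetical HDFS path instead of JSON from Hive, and scores a simple socket stream rather than live tweets.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnifiedPipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unified-pipeline-sketch"))

    // 1. Batch ingestion: load pre-extracted numeric feature vectors (hypothetical path).
    val training = sc.textFile("hdfs:///data/tweet_features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // 2. Train a k-means clustering model with MLlib (k = 10 clusters, 20 iterations).
    val model = KMeans.train(training, 10, 20)

    // 3. Apply the same model to a live stream of feature vectors (5-second micro-batches).
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .map(v => (model.predict(v), v)) // assign each incoming point to a cluster
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the batch training and the streaming scoring share one SparkContext, the model object is simply reused in memory, which is exactly the point the abstract makes about reusing code and memory between pipeline stages.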

 

InnoApps Challenge: a good opportunity for young people!

Published on 09/10/2014

Let me share with you the InnoApps Challenge, which encourages young people to develop innovative mobile apps. In my opinion it will be a great opportunity for your students to learn, network and win.

The InnoApps Hackathon, officially launched at the European Youth Event in the European Parliament in Strasbourg on 10 May, builds on last year’s highly successful competition to provide a cross-continental bridge for young people to collaborate, share ideas and jointly create mobile apps that address today’s pressing societal challenges. This year’s edition of InnoApps will see young people teaming up to jointly create innovative mobile apps for smart cities of the future.

The competition is designed around a collaborative work environment with built-in tools and support: during the virtual collaboration phase and the Brussels-based hackathon, the developers will benefit from extensive technical education, training and mentoring, both online and offline. The winners will share EUR 35,000, which they can use to make their apps market-ready, while fully keeping ownership of their apps.

The InnoApps Challenge, open to young students and professionals up to 28 years old, is a joint initiative of the European Young Innovators Forum and Huawei, in partnership with AIESEC and Pinch.

The InnoApps Challenge runs in several phases; the deadline for submitting the individual application form is 9 November. The selected finalists will be invited to an all-expenses-paid hackathon in Brussels in early February 2015, which culminates in a live-pitching final on the last day of the hackathon.

This could be a good opportunity for young people!

 

The First International Conference on Predictive APIs and Apps will be held in Barcelona

Published on 22/09/2014

PAPIs.io will take place on 17-18 November 2014 in Barcelona, at the UPC Barcelona Tech campus, right before Strata. It will be the first ever international conference dedicated to Predictive APIs and Predictive Apps.

We want PAPIs.io to become an open forum for technologists and researchers on distributed, large-scale machine learning services and developers of real-world predictive applications.

We aim at seeding highly technical discussions on the way common and uncommon predictive problems are being solved. However, we want PAPIs to be an eminently hands-on conference.

In this first edition, we will focus on the pragmatic issues and challenges that companies in the trenches must face to make predictive APIs and applications a reality, and we will add academic tracks in future editions, once we understand them better.

So if you are working on an interesting Predictive API or Application and want to show the rest of the world your new advancements, or discuss the challenges that you are facing, please send us your proposal (the Call for Proposals is open until 8 October 2014).

Predictive APIs and Applications cover a wider area of application than Recommender Systems. Therefore, their impact on our everyday lives will be orders of magnitude higher and will affect more industries than we can now imagine. So please don't miss the opportunity to join this nascent community early on.

See you all at PAPIs in Barcelona!!!

 

Third Spark Barcelona Meeting (CSIC)

Published on 04/09/2014

This third meeting will feature Jesús Cerquides (Tenured Researcher at Consejo Superior de Investigaciones Científicas-CSIC) speaking about GraphX: An introduction to distributed graph processing in Spark (talk in Spanish).

As usual in meetups, there will be beer, this time courtesy of Estrella Damm. We are looking forward to seeing you!

JOIN US at: http://www.meetup.com/Spark-Barcelona/events/186861962/

Date: Monday 22/09/2014
Time: 19:00
Place: itnig  C/ àlaba 61, 5-2. Barcelona

 

Big Data Open Source Landscape: Processing Technologies

Published on 15/07/2014

Hadoop is a well-established software framework for analysing structured and unstructured big data and distributing applications across thousands of servers. Hadoop was created in 2005, and afterwards several projects appeared in the Hadoop space that tried to complement it. Sometimes those technologies overlap with each other and sometimes they are partially complementary. I will try to sketch a brief map of them.

 

Programming Model

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. The Apache Hadoop project provides an open source MapReduce implementation.
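To make the paradigm concrete, here is a toy word count expressed with plain Scala collections rather than the actual Hadoop API: the flatMap/map steps play the role of the map phase, the groupBy stands in for the shuffle, and the final per-key sum is the reduce phase. In real Hadoop these steps run in parallel over HDFS blocks on many nodes.

```scala
object MapReduceToy {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do is to be")

    val counts = lines
      .flatMap(_.split("\\s+"))   // "map": split each record into words
      .map(word => (word, 1))     //         and emit one (word, 1) pair per word
      .groupBy(_._1)              // "shuffle": group all pairs with the same key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // "reduce": sum per key

    counts.foreach(println)       // e.g. (to,4), (be,3), (or,1), ...
  }
}
```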

 

Management layer

The scalability needed for big data processing is supported by the Hadoop Distributed File System (HDFS). Data in a Hadoop cluster is broken down into blocks and distributed throughout the cluster. Although there are many alternatives to the HDFS layer (some of them known as NoSQL), it is well established in the present scenario. For this reason, in this post I will only describe the technologies related to the data processing layer that can be supported by HDFS. The data management layer alternatives will be considered in a future post.

 

Hadoop Ecosystem

Beyond HDFS, the entire Apache Hadoop ecosystem is now commonly considered to consist of a number of related projects as well. There is a main group of Apache technologies built to run on top of Hadoop clusters, known as the Hadoop ecosystem. Three important ones are Apache Hive and Apache Pig, which add data processing and warehousing capabilities, and Apache Sqoop, which integrates HDFS with relational data stores. Other important Apache technologies that are part of the open source Hadoop ecosystem are: Apache Mahout, an open source library that facilitates building scalable machine learning applications; Apache Flume, a distributed service for efficiently collecting, aggregating and moving large amounts of log data to HDFS; Apache ZooKeeper, a high-performance coordination service for distributed applications; and Apache Avro and Apache Thrift, two very popular data serialization systems; among other less prominent projects. Some projects replace the MapReduce programming model; for instance, Apache Giraph is an iterative graph processing system that runs on top of Hadoop.

 

Managing compute resources

Although MapReduce's batch approach was a driving factor in the initial adoption of Hadoop, its inability to multitask and provide satisfactory real-time processing has been a difficulty for developers in recent years. For this reason Apache Hadoop YARN (Yet Another Resource Negotiator), a cluster management technology, appeared. Basically, this new layer splits key functions into two separate daemons, with resource management in one and job scheduling and monitoring in the other, broadening Hadoop's processing features. With YARN the community generalized Hadoop MapReduce into a general-purpose resource management framework in which MapReduce became merely one of the applications that can process data in a Hadoop cluster. However, developers needed a more general data-processing engine for the benefit of the entire ecosystem, and this is the role of Apache Tez. Tez is a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez eliminates unnecessary tasks, synchronization barriers, and reads from and writes to HDFS. Tez is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem.

 

Streaming Data

An important requirement of many current big data applications is processing streaming data in real time. Apache Storm appeared for this purpose. Storm is a distributed real-time computation system for processing fast, large streams of data, and it adds reliable real-time data processing capabilities to Apache Hadoop. Storm is by far the most widely used real-time computation system at the moment. Mesosphere released a similar integration for Apache Mesos (an alternative to YARN), a cluster manager that simplifies the complexity of running applications on a shared pool of servers, making it easier to run Storm on Mesos clusters. Storm often goes together with Apache Kafka, a distributed message broker used to store, send and subscribe to data streams. A less widespread alternative to Storm for data streams is Apache S4. Related projects are Suro, a pipeline service for large volumes of event data that can be used to dispatch events for both batch and real-time processing, and Summingbird, an open source library from Twitter that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms. Finally, also related to streaming, there is SAMOA, a distributed streaming machine-learning framework for mining big data streams.

 

User-facing Abstractions

Apache Spark deserves special attention; it is a framework that will play an important role in the Big Data arena, and it has been extensively featured in this blog (please refer to my previous posts related to Spark). Another equivalent project is Apache Stratosphere. Both are distributed general-purpose compute engines that offer user-facing APIs, and both can run in a Hadoop cluster on top of HDFS and YARN. There are several other projects in the Hadoop space that offer user-facing abstractions, such as Cascading or Scalding. Cascading is a Java-based framework that abstracts and hides the complex implementation details involved in writing big data applications. Scalding is an extension to Cascading that enables application development with Scala, a powerful language for solving functional problems that is very popular in the Big Data community.
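As a rough illustration of the abstraction level these frameworks provide, below is the classic Scalding word-count job (essentially the example from the Scalding documentation, assuming the scalding-core dependency): the same computation sketched earlier with plain collections, but here compiled by Cascading into MapReduce jobs that run on a Hadoop cluster.

```scala
import com.twitter.scalding._

// Run with input/output paths as job arguments, e.g.:
//   --input hdfs:///books/*.txt --output hdfs:///out/wordcount
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                       // one tuple per input line
    .flatMap('line -> 'word) { line: String => tokenize(line) } // emit one tuple per word
    .groupBy('word) { _.size }                                  // count occurrences of each word
    .write(Tsv(args("output")))                                 // write (word, count) pairs

  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}
```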

I hope you find this post and the links to resources useful. Please let me know if you find any mistakes or have any suggestions to improve it. Thank you! I would also like to thank Marc de Palol and Nico Poggi for their comments on the first draft of this post.

 


 

Open PhD position in Multimedia Big Data Computing at Barcelona Supercomputing Center

Published on 08/07/2014


Catalan Government’s FI grants / Spanish Government’s FPU grants (ref. BSC-Autonomic-Multimedia 09/2014)

The Research Group Autonomic Systems and eBusiness Platforms at the Barcelona Supercomputing Center (BSC) invites outstanding candidates to apply for a full-time PhD under the Catalan Government's FI grants or the Spanish Government's FPU grants. The PhD work will focus on all aspects of the algorithms, design and implementation of "big data" distributed computing systems to enable massive-scale image and video analytics. Topics of interest include, but are not limited to:

  • Algorithms for large scale content analysis of massive scale image and video
  • Efficient or distributed high-dimensional indexing of multimedia
  • Multimedia Big Data analytics and visualization

The candidate must satisfy the UPC's doctoral programme access requirements and must have a very good academic record, which is required to get through the competitive selection of the grants.

The following knowledge and skills will be considered for internal selection (not required):

  • Excellent programming skills with different languages ​​(C, C++, Java, Python, Scala …).
  • Knowledge of multimedia analysis and indexing (image analysis, automatic image annotation, machine learning, SURF, BoW, SVM, etc.).
  • Knowledge of distributed large-scale data processing environments (Hadoop, NoSQL, Spark, …etc.).
  • Knowledge of High Performance Computing environments (parallel programming experience, parallel programming models, etc.).

Those of you interested in being selected by our group within this programme, please send the following information (all in English) to Ruben Tous at rtous@ac.upc.edu (with the subject "position autonomic-Multimedia 09/2014") by 31 August 2014: student CV, official transcripts of previous and current academic records, and any other relevant information.

 

Databricks Cloud: Next Step For Spark

Published on 01/07/2014


This morning, during the Spark Summit, Databricks announced a new step forward that will allow users to leverage Apache Spark technology to build end-to-end pipelines that underlie advanced analytics running on Amazon AWS. The name is Databricks Cloud. Spark is already deployable on AWS, but Databricks Cloud is a managed service based on Spark that will be supported directly by Databricks. They showed us an impressive demo of the platform.

The Databricks Workspace (photo taken with my iPhone :-) ) is composed of:

  • Notebooks. Provides a rich interface that allows users to perform data discovery and exploration and to plot the results interactively.
  • Dashboards. Create and host dashboards quickly and easily. Users can pick any outputs from previously created notebooks, assemble these outputs in a one-page dashboard with a WYSIWYG editor, and publish the dashboard to a broader audience.
  • Job Launcher. Enables anyone to run arbitrary Apache Spark jobs and trigger their execution, simplifying the process of building data products.

I really enjoyed today's sessions of the Summit and was impressed by the advancements of Spark.

On the other hand, I read that Databricks also announced $33 million in Series B funding. The company has raised $47 million in total, including a $14 million round from last September. Congratulations!

 

Google launches DataFlow (a successor to MapReduce)

Published on 30/06/2014

I'm in San Francisco, ready to attend the 2014 Spark Summit tomorrow. As I already mentioned in this blog, Apache Spark is one technology that has emerged as a potential alternative to MapReduce/Hadoop. But it seems it is not the only one. Last week, also here in San Francisco, at its Google I/O 2014 conference, Google unveiled its successor to MapReduce, called Dataflow, which it is selling through its hosted cloud service (equivalent to the Amazon Data Pipeline service and Kinesis for real-time data processing).

Urs Holzle (Google's senior vice president of technical infrastructure and a Google Fellow) introduced how Dataflow is used for analytics during a keynote address at the Google I/O 2014 conference (minute 2:06:30 in this video of the keynote). The service lets you construct an analytics workflow and then send it off to the Google Cloud for execution. A Google engineer gave an interesting demo that analyzed the sentiment of soccer fans during the World Cup, expressed via Twitter (you can see it in the same video, starting at minute 2:06:30).

As you can see this is a very active area!

 

Barcelona Supercomputing Center starts to work on Deep Learning

Published on 26/06/2014

What is Deep Learning?

We can consider Deep Learning a new area of Machine Learning research with the objective of moving Machine Learning closer to Artificial Intelligence (one of its original goals). Our research group has been working in Machine Learning for a long time, thanks to Ricard Gavaldà, who introduced us to this wonderful world. It was during the summer of 2006, together with Toni Moreno, Josep Ll. Berral and Nico Poggi. Unforgettable moments! However, after 8 years we are taking a step forward and starting to work on Deep Learning. It was during a group retreat held last September that I heard "Deep Learning" from Jordi Nin for the first time.

Deep Learning comes from neural nets, conceived in the 1940s and inspired by the synaptic structure of the human brain. But early neural networks could simulate only a very limited number of neurons at once, so they could not recognise patterns of great complexity. Neural networks had a resurgence in the 1980s, when researchers helped spark a revival of interest in them with new algorithms, but complex speech or image recognition required more computing power than was then available.

In the last decade researchers made some fundamental conceptual breakthroughs, but until a few years ago computers weren't fast or powerful enough to process the enormous collections of data that these types of algorithms require. Right now, companies like Google, Facebook, Baidu, Yahoo or Microsoft are using deep learning to better match products with consumers by building more effective recommendation engines.

Deep Learning attempts to mimic the activity of layers of neurons in the neocortex with a software system. This software creates a set of virtual neurons and then assigns random weight values to the connections between them. These weights determine how each simulated neuron responds to a digitised feature. The system is trained by blitzing it with digitised versions of images containing the objects. An important point is that the system can do all of this without asking a human to provide labels for the objects (as is often the case with traditional machine learning tools). If the system doesn't accurately recognise a particular pattern, an automatic algorithm adjusts the weights of the neurons.

The first layer of neurons learns primitive features, like an edge in an image. It does this by finding combinations of digitised pixels that occur more often than they should by chance. Once that layer accurately recognises those features, they are fed to the next layer, which trains itself to recognise more complex features, like a corner. The process is repeated in successive layers until the system can reliably recognise objects or phonemes. An interesting paper that Jordi Nin sent me is from Google, which used a neural network with a billion connections. They consider the problem of building high-level, class-specific feature detectors from unlabelled data only, training 9 layers of virtual neurons (the model has 1 billion connections) with a dataset of 10 million images. Training the many layers of virtual neurons in that experiment required 16,000 computer cores! Is it clear now why our research group is entering this amazing world?
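As a very rough sketch of what "virtual neurons with random weights" means in practice, the snippet below pushes a fake image through two fully connected layers. It is only a schematic illustration of the forward pass; everything that makes deep learning actually work (training the weights on millions of images, convolutional layers, many more units) is deliberately left out.

```scala
import scala.util.Random

object TinyForwardPass {
  val rng = new Random(42)

  // One fully connected layer: each output is a weighted sum of all inputs
  // passed through a sigmoid activation.
  def layer(input: Array[Double], weights: Array[Array[Double]]): Array[Double] =
    weights.transpose.map { incoming =>
      val z = incoming.zip(input).map { case (w, x) => w * x }.sum
      1.0 / (1.0 + math.exp(-z))
    }

  // Random initial weights, as described above: shape (inputs x outputs).
  def randomWeights(in: Int, out: Int): Array[Array[Double]] =
    Array.fill(in, out)(rng.nextGaussian() * 0.1)

  def main(args: Array[String]): Unit = {
    val pixels = Array.fill(64)(rng.nextDouble())  // a fake 8x8 "image"
    val w1 = randomWeights(64, 16)                 // first layer: primitive features
    val w2 = randomWeights(16, 4)                  // next layer: more complex features
    val output = layer(layer(pixels, w1), w2)
    println(output.mkString(", "))
    // Training would repeatedly adjust w1 and w2 whenever the output is wrong;
    // that adjustment step is what this sketch leaves out.
  }
}
```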

(*) Picture from Andrew Ng (Stanford)

 