Thursday, February 5, 2015

Speakers: Aaron Davidson (Apache Spark committer and Software Engineer at Databricks) and Paco Nathan (Community Evangelism Director at Databricks)

Thursday, November 20, at 18:30 in the Sala d'Actes of the FIB (building B6). Please confirm attendance by email to torres@ac.upc.edu, as capacity is limited.

Abstract: One of the promises of Apache Spark is to let users build unified data analytic pipelines that combine diverse processing types. In this talk, we’ll demo this live by building a machine learning pipeline with 3 stages: ingesting JSON data from Hive; training a k-means clustering model; and applying the model to a live stream of tweets.

Typically this pipeline might require a separate processing framework for each stage, but we can leverage the versatility of the Spark runtime to combine Shark, MLlib, and Spark Streaming and do all of the data processing in a single, short program. This allows us to reuse code and memory between the components, improving both development time and runtime efficiency.

Spark as a platform integrates seamlessly with Hadoop components, running natively in YARN and supporting arbitrary Hadoop InputFormats, so it brings the power to build these types of unified pipelines to any existing Hadoop user. This talk will be a fully live demo and code walkthrough where we’ll build up the application throughout the session, explain the libraries used at each step, and finally classify raw tweets in real time.
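To make the three stages in the abstract concrete, here is a minimal single-process sketch in plain Python. It is not the speakers' demo code and does not use the Spark API; in the talk's setting each stage would be handled by a Spark component instead. The sample texts, the two features in `featurize` (character count and hashtag count), and every function name are assumptions chosen purely for illustration.

```python
import json
import math
import random


def featurize(text):
    # Hypothetical features: character count and number of '#' characters.
    return (float(len(text)), float(text.count("#")))


def nearest(centroids, point):
    # Index of the centroid closest to `point` (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(centroids[i], point))


def train_kmeans(points, k, iterations=20, seed=0):
    # Plain Lloyd's algorithm, starting from k randomly chosen points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids


def classify_stream(centroids, tweets):
    # Lazily yields (cluster index, text) for each incoming tweet.
    for text in tweets:
        yield nearest(centroids, featurize(text)), text


# Stage 1: ingest JSON records (made-up sample data, one object per line).
sample_texts = [
    "hi", "ok #a", "good day", "lol #x #y",
    "spark " * 20, "hadoop yarn " * 9, "#mllib kmeans " * 8, "tweets " * 15,
]
raw_json = [json.dumps({"text": t}) for t in sample_texts]
training_points = [featurize(json.loads(line)["text"]) for line in raw_json]

# Stage 2: train a k-means model with two clusters.
centroids = train_kmeans(training_points, k=2)

# Stage 3: apply the model to a stand-in for the live tweet stream.
incoming = iter(["yo", "streaming " * 11])
for cluster, text in classify_stream(centroids, incoming):
    print(cluster, text[:30])
```

Because `classify_stream` is a generator, it handles one tweet at a time and never needs the whole stream in memory, which loosely mirrors the idea of applying a trained model to data as it arrives.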
© Facultat d'Informàtica de Barcelona
