Triangle Hadoop Users Group
1 week ago

Slides from Intro to HBase presentation January 2012

Thanks to Chris Shain from Tresata for coming to Durham last night to talk about HBase.



TriHUG January 2012 Talk by Chris Shain
2 weeks ago

Next Meeting: January 17, 2012 @ Bronto Software

Title: Intro to Apache HBase by Chris Shain of Tresata

Location: Bronto Software in Durham, NC

RSVP

Abstract: Chris will provide an introduction to Apache HBase, aiming to discuss:

  1. What is HBase? (High level overview)
  2. Details of the HBase architecture
  3. How do clients interact with HBase?
  4. Some general HBase patterns and anti-patterns
  5. What are the use cases for HBase vs. Relational DB?

Bio: Chris Shain is the software development lead at Tresata, a provider of Big Data solutions for the financial industry in Charlotte, NC. His background includes 7+ years of software development experience in the financial services industry, with a focus on customer-facing data management applications and data warehousing. Lately he works with Hadoop and HBase on data volumes in the multi-terabyte range, and tinkers with geographic information systems. He lives in Charlotte, NC, and can be reached at chris@tresata.com or on Twitter at @ChrisShain.

2 months ago

Slides from Alan Gates Presentation on Nov. 15, 2011

Thanks to Alan Gates of Hortonworks for the two excellent presentations on Apache Pig and Apache HCatalog. Links to the slides for the two talks are included below and are also available on Slideshare.

3 months ago

Slides from Oct. 11 TriHUG meeting featuring Josh Patterson of Cloudera

3 months ago

Next Meeting: November 15, 2011 @ Bronto Software

Our next meeting will be November 15 at Bronto Software.  The speaker will be Alan Gates, the author of Programming Pig and a member of the Hortonworks team.  RSVP here.

----------

Title: New Features in Pig 0.9 and Introducing HCatalog

Abstract: Pig 0.9 added several features to make Pig a more powerful data processing platform, including macros, include statements, and the ability to embed Pig in Python for control flow. We’ll cover these, talk about some new features that have been added since 0.9, and look at what’s next on Pig’s roadmap.

HCatalog is a table management and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, Hive, Streaming – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, sequence files.  This talk will include an overview of HCatalog’s features and a discussion of its current roadmap.

Bio: Alan is a co-founder of Hortonworks as well as an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan also designed HCatalog and guided its adoption as an Apache Incubator project. Alan has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a forthcoming book from O’Reilly Press. Follow Alan on Twitter: @alanfgates.

4 months ago

TriHUG Next Meeting featuring Josh Patterson of Cloudera set for Oct. 11

The next Triangle Hadoop User Group meeting will be October 11th at Bronto Software and will feature Josh Patterson of Cloudera. RSVP here.

Title: Lumberyard: Time series Indexing at Scale

Abstract:

As time series data explodes in volume in the genomic, sensor, and financial realms [1], companies are looking for more effective ways to store and query this data. To handle this explosion in scale, systems are looking to the Hadoop, HBase, and NoSQL domain for components to build on. In this talk we introduce Lumberyard [3], a system which can potentially (1) store terabytes of time series data and (2) allow this data to be interactively queried at low latencies to provide real-time access. Lumberyard stores iSAX [4] indexes in HBase’s multi-dimensional sorted map storage system, which gives Lumberyard the reliability of HDFS and the low latencies of HBase. Our approach leverages a multidimensional indexing structure which is stored in HBase’s highly available distributed multi-dimensional sorted map. We present the design of Lumberyard’s implementation and illustrate the differences between an in-memory iSAX index and a persisted HBase-backed iSAX index.

Sponsored by Cloudera and Bronto Software.

More info at www.trihug.org.

Bio:

Master’s Thesis: self-organizing mesh networks

Published in IAAI-09: TinyTermite: A Secure Routing Algorithm

Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff). Led small team which designed classification techniques for timeseries and Map Reduce. Open source work at http://openpdc.codeplex.com

Now: Sr. Solutions Architect at Cloudera

4 months ago

Slides from Ted Dunning’s Sept. 2011 talk

Thanks to everyone for attending last night’s talk!  Ted’s slides are available for download below.

MapR, Implications for Integration

4 months ago

Starfish Talk Slides from April 2011

Filed under "better late than never": here are the slides from the April 2011 TriHUG meeting on Starfish.

Starfish: A Self-tuning System for Big Data Analytics

5 months ago

Next Meeting: Sept. 13 @ Bronto Software

We trust everyone has had a good summer and is equally excited to get back into learning more about Apache Hadoop and scaling.  Our next meeting will be Sept. 13 at Bronto Software.  Food and drinks start at 6:30 and the talks start at 7.

We are pleased to announce that our speaker will be Ted Dunning from MapR Technologies.   See below for more details.  Please RSVP here.

Title: MapR, Architecture and Implications

Abstract:

The talk will be a description of how MapR’s architectural advances allow significant improvements in speed, reliability and scalability over stock Hadoop.  This will include a dive into the MapR file system and a discussion of how the map-reduce layer has been changed and the impact on other Hadoop eco-system components.  This will include actual test results.

In the second section of my talk, I will describe how this new architecture has surprising consequences.  In particular, I will show how tasks like machine learning, data visualization and search indexing can all work better on the MapR platform.

Ted’s Bio:

Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.

8 months ago

RTP Scaling Hackathon (Planning Stages)

Some TriHUG members are in the early stages of putting together an all-day hackathon on all things scaling (Hadoop, Cassandra, Hive, Pig, Mahout, etc.) and wanted to get some info out to the community, along with a call for volunteers and sponsors.

The basic gist is that we get together and spend the day hacking and learning about writing scalable, fault-tolerant systems. All ranges of experience are welcome, and we fully expect that one of the groups that forms will be a “tutorial” group, while other groups will be doing more advanced things. The key is to get lots of interaction and cross-fertilization of ideas.

Our tentative plan is that we will make available:

1. Compute cluster time (likely Amazon EC2) along with ready-to-use instances with the appropriate tools already installed.  (More later)
2. Some public data sets, but feel free to bring your own, as long as they are publicly available on Amazon S3
3. Food, drinks, etc. including pizza/beer at the end
4. Network connectivity
5. Space to work in
6. (TBD) Machine to submit jobs using a fair scheduler

You need to bring your laptop and an open mind. Having your favorite tools on your machine would also be good, and a GitHub account or something similar would be useful.

Our likely date for this is June 18th, with a backup date of June 25 (pending space availability), from 9 AM - 6 (?) PM. Attendance will require an RSVP, and we will send out sign-up info later. For now, we are targeting it to be free (including EC2 compute time), but that is predicated on us getting sponsorships to cover costs, so if you think you or your company can sponsor, please let us know ASAP.

Tentative Schedule (strawman):  
8:30:  Doors open/networking/coffee/snacks
9 - 9:30: Idea pitches and seed projects announced, and teams formed. People can stand up and say what they are interested in, and then team up based on shared interests. For instance, I will probably work on Mahout and machine learning.
9:30 - 12: Hack
12-1: Food/networking/hacking
1-5(?): Hack
5-6 (no firm cut-off time): Share what you learned with the group over pizza and drinks.  Demo if you have one.

How you can help:

- Help us get data sets organized and a Chef/Puppet recipe set up with all the appropriate tools/languages/SCM/etc.  Also, think of interesting problems to work on.
- Sponsor food/coffee/drinks/t-shirts/compute time/etc.  Please contact Grant Ingersoll at info@trihug.org.  I don’t think we are talking about a lot of money here (maybe $1000-1500 total; more on this as things develop).
- Let us know you are interested. The sooner we hear from you, the better we can plan space.  Please reply on this list if you are interested.
- Are you graphically capable?  Help us design a t-shirt.
- Once we firm up some details, help us spread the word