Skip to content

alefbt/BigData-flume-hbase-serializer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BigData-flume-hbase-serializer

BigData flume AsyncHbaseEventSerializer implemented with CSV reader and RowKey creator Most of CSV handlers in flume thay split string with comma delimiter "," so the string for ex. a,b,"c,c2",d wil not parse correctly. This project uses Apache common lib for parsing CSV and handles thouse issues.

There are two (2) Types of serialziers:

  • AsyncHbaseCSVEventSerializer
  • HbaseCSVEventSerializer

Install

# Build mvn clean install # copy the jar to Lib in flume agent cp target/ 

Example

When input CSV is like:

ts,36882775649162,5610505046685714 ti,3528885721650035,4903559682490813 ti,30049400044021,6374608174602560 

Using in flume configuration

agent007.sinks.sink1.type = org.apache.flume.sink.hbase.AsyncHBaseSink agent007.sinks.sink1.channel = c1 agent007.sinks.sink1.table = transactions agent007.sinks.sink1.columnFamily = clients agent007.sinks.sink1.batchSize = 5000 # Add this agent007.sinks.sink1.serializer = com.alefbt.bigdata.flume.sink.hbase.AsyncHbaseCSVEventSerializer agent007.sinks.sink1.serializer.columns = type,id1,id2 agent007.sinks.sink1.serializer.key = type,id2 

It will Run hbase put 'table', 'rowkey', 'column', 'value' So each row in example it will run 3 put commands

the RowKey will generated from type,id2 fields

so,

# the config agent007.sinks.sink1.serializer.columns = type,id1,id2 agent007.sinks.sink1.serializer.key = type,id2 # csv value : ts,36882775649162,5610505046685714 # row key: ts:5610505046685714 # The actions on HBase will be: hbase put 'transactions', 'ts:5610505046685714', 'clients:type', 'ts' hbase put 'transactions', 'ts:5610505046685714', 'clients:id1', '36882775649162' hbase put 'transactions', 'ts:5610505046685714', 'clients:id2', '5610505046685714' 

Developers

Me: Yehuda Korotkin yehuda@alefbt.com

Tested

  • On cloudera 5.10 with AsyncHBase mode (was few issues:

Issues

On cloudera 5.10 - HBase "too many items queued"

org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ? According OpenTSDB/opentsdb#783 You should increase hbase.regionserver.handler.count I found that it solved this issue (but opend other issues)

See also

License

MIT. (see LICENSE file)

About

BigData flume AsyncHbaseEventSerializer implemented with CSV reader and RowKey creator

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages