hankcs/LDA4j

LDA4j

A Java implementation of LDA (Latent Dirichlet Allocation). Infer topics from a set of documents with a few lines of Java code.

How To Use

  • code
    public static void main(String[] args)
    {
        // 1. Load the corpus from disk
        Corpus corpus = Corpus.load("data/mini");
        // 2. Create an LDA sampler
        LdaGibbsSampler ldaGibbsSampler = new LdaGibbsSampler(corpus.getDocument(), corpus.getVocabularySize());
        // 3. Train it
        ldaGibbsSampler.gibbs(10);
        // 4. The phi matrix is the LDA model; use LdaUtil to interpret it.
        double[][] phi = ldaGibbsSampler.getPhi();
        Map<String, Double>[] topicMap = LdaUtil.translate(phi, corpus.getVocabulary(), 10);
        LdaUtil.explain(topicMap);
    }
  • output
    topic 0 :
    公司=0.009538408630174017
    市场=0.008848009751698062
    中国=0.008756489189917975
    企业=0.0068280510303913395
    发展=0.005991900977658479
    目前=0.004408401842957633
    产品=0.0041981128106208625
    服务=0.003756081561227181
    已经=0.003410105744626914
    记者=0.003289155629929911

    topic 1 :
    专业=0.00872496522205349
    工作=0.008108171408190876
    学生=0.00793944661866665
    学校=0.006307480899983371
    考生=0.005295205701518912
    大学=0.0052671267600129445
    教育=0.0051547106121291805
    考试=0.00507254577329609
    人才=0.004037747449851247
    招聘=0.003913811857165103

    topic 2 :
    医院=0.006197066939127888
    治疗=0.0048149451145789455
    患者=0.0032264139617756145
    健康=0.0026521203697810374
    手术=0.0025525793863978826
    女性=0.0023724111474892357
    专家=0.0021711200905248276
    发现=0.0021645199996586885
    病人=0.0021567877663232846
    医生=0.002155356316589454

    topic 3 :
    没有=0.008818728535385055
    问题=0.00476170232225101
    中国=0.00476161560515722
    工作=0.004610303190509696
    生活=0.004283310385880329
    文化=0.0036558079614339278
    孩子=0.003327977201447208
    不能=0.0032901108349775716
    知道=0.003127437274214269
    已经=0.0030419673256694545

    topic 4 :
    公司=0.018241005428669386
    股东=0.009281048036676322
    股份=0.0078638937643388
    搜狐=0.0065617441267974705
    有限公司=0.006139808167975946
    直播员=0.005439495997416965
    股权=0.005353954615162839
    项目=0.004984451830097043
    发行=0.004511099443364358
    改革=0.004489038403046334

    topic 5 :
    旅游=0.013331508385667979
    游客=0.004296589238778804
    城市=0.0032312276892446116
    文化=0.0026831367778820704
    旅行社=0.002242817493567529
    世界=0.0021001546909288965
    成都=0.001991337289815279
    活动=0.001894687770595843
    北京=0.0017106388886854072
    公园=0.0016134766410937638

    topic 6 :
    美国=0.007679518424242107
    日本=0.004777746687572576
    训练=0.003947682941243526
    系统=0.003926562149803556
    飞机=0.0038757503504304267
    部队=0.00365041154980242
    进行=0.003644226666909795
    军事=0.003637873811678725
    作战=0.003407296869780034
    装备=0.003319112427162246

    topic 7 :
    比赛=0.0092171879571152
    队员=0.0036851386114063237
    联赛=0.0032845199043377146
    球队=0.0029432131822116707
    冠军=0.0024090127058022104
    俱乐部=0.002348957542679953
    球员=0.0022159606741087795
    决赛=0.002192739194333911
    赛季=0.0020352324832133267
    对手=0.001974829226645783

    topic 8 :
    The=0.002190604616155811
    意思=0.001186435720799536
    It=0.0011515962078723501
    理解=0.0010433831740419728
    What=9.560997173453189E-4
    They=9.345962358267594E-4
    听力=8.362275772461826E-4
    In=8.166984660263638E-4
    阅读=7.775969918239417E-4
    译文=7.568900132152651E-4

    topic 9 :
    毛泽东=0.002633793448326645
    曹操=0.0018832599387516155
    曹丕=0.0016567353952110328
    皇帝=0.001629990508040292
    甄洛=0.0012930147890964736
    中央=0.0012783947883529055
    蒋介石=0.0010732052837016102
    曹睿=8.511476483731437E-4
    女王=8.125680914406854E-4
    皇后=8.013815303127338E-4
  • corpus: data/mini is a small set of documents included in this project. Words in each document are separated by spaces. Feel free to replace it with your own corpus.
  • algorithm: The implementation is based mainly on Gregor Heinrich's work. Read more about this implementation in 《LDA入门与Java实现》 ("Introduction to LDA and Its Java Implementation").
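
Judging from the output above, step 4 of the usage example turns each topic row of phi into a list of its most probable words. The sketch below illustrates that idea in plain Java so the structure of phi is easier to see. It is not LDA4j's own `LdaUtil.translate` code: the class `TopWordsSketch`, the method `topWords`, and the toy vocabulary and probabilities are all hypothetical, and it assumes `phi[k][w]` holds the probability of word `w` under topic `k`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of LDA4j.
public class TopWordsSketch {

    /**
     * For each topic row of phi, return the `limit` most probable words,
     * ordered from most to least probable.
     * Assumes phi[k][w] is the probability of word w under topic k.
     */
    public static List<Map<String, Double>> topWords(double[][] phi, String[] vocabulary, int limit) {
        List<Map<String, Double>> result = new ArrayList<>();
        for (double[] topicRow : phi) {
            // Collect word indices and sort them by descending probability.
            List<Integer> wordIds = new ArrayList<>();
            for (int w = 0; w < topicRow.length; w++) {
                wordIds.add(w);
            }
            wordIds.sort((a, b) -> Double.compare(topicRow[b], topicRow[a]));

            // LinkedHashMap keeps the words in the order they were inserted.
            Map<String, Double> top = new LinkedHashMap<>();
            for (int i = 0; i < Math.min(limit, wordIds.size()); i++) {
                int w = wordIds.get(i);
                top.put(vocabulary[w], topicRow[w]);
            }
            result.add(top);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data made up for this example: 2 topics over a 4-word vocabulary.
        String[] vocabulary = {"market", "company", "hospital", "doctor"};
        double[][] phi = {
            {0.5, 0.4, 0.05, 0.05},
            {0.1, 0.1, 0.3, 0.5}
        };
        List<Map<String, Double>> topics = topWords(phi, vocabulary, 2);
        for (int k = 0; k < topics.size(); k++) {
            System.out.println("topic " + k + " : " + topics.get(k));
        }
        // prints:
        // topic 0 : {market=0.5, company=0.4}
        // topic 1 : {doctor=0.5, hospital=0.3}
    }
}
```

Each row of phi sums to 1 over the whole vocabulary, which is why the per-word probabilities in the real output above are small: the mass is spread across many words.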
