jumbojett/extract-content-javascript

ExtractContentJS

Text extraction JavaScript library

What it can do

  • Text extraction

  • Tag recommendation

Files

Basically, load the following files in this order:

lib/lib.js

Common utilities

lib/extract-content.js

Main text extraction

Running make package in the repository root generates extract-content-all.js, which is the concatenation of these files.
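As a rough illustration of what "concatenation in load order" amounts to, here is a minimal sketch. It is not the repository's Makefile: the helper name concatInOrder and the in-memory sources map are hypothetical, and the file contents are placeholders.

```javascript
// Hypothetical helper: joins source texts in the given load order.
// `sources` maps a file path to its contents; nothing here reads the real repository.
function concatInOrder(order, sources) {
  return order.map(function (path) {
    if (!Object.prototype.hasOwnProperty.call(sources, path)) {
      throw new Error('missing source: ' + path);
    }
    return sources[path];
  }).join('\n');
}

// Load order taken from the file list above; the contents are placeholders.
var loadOrder = ['lib/lib.js', 'lib/extract-content.js'];
var bundle = concatInOrder(loadOrder, {
  'lib/lib.js': '/* common utilities */',
  'lib/extract-content.js': '/* main text extraction */'
});
```

The order matters because lib/extract-content.js is listed after lib/lib.js, so anything the first file defines should already exist when the second one runs.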

For details on actual usage, see:

sketch/extract-content.test.js

Main text extraction test

lib/scoring-words.js

Tag scoring (sample)

How to use

Text extraction interface

Use this when you only want text extraction, or when you want to specify which handlers are used.

ExtractContentJS.LayeredExtractor

    var ex = new ExtractContentJS.LayeredExtractor();
    //ex.addHandler( ex.factory.getHandler('Description') );
    //ex.addHandler( ex.factory.getHandler('Scraper') );
    //ex.addHandler( ex.factory.getHandler('GoogleAdsence') );
    ex.addHandler( ex.factory.getHandler('Heuristics') );
    var res = ex.extract(document);

    if (res.isSuccess) {
        res.url;     // URL string
        res.title;   // title string
        res.engine;  // the handler that was used for extraction
        res.content; // instance of the Content class (see below)
    }

So far, only the Heuristics handler has been implemented.
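To make the idea of layered handlers concrete, here is a minimal sketch of a handler chain. It is illustrative only and is not ExtractContentJS's implementation. It assumes that handlers are tried in the order they were added and that the first one to produce a result wins, which this README does not confirm. TinyLayeredExtractor and both sample handlers are hypothetical names.

```javascript
// Illustrative only: a tiny handler chain, not the library's real code.
// ASSUMPTION: handlers run in the order added, and the first success wins.
function TinyLayeredExtractor() {
  this.handlers = [];
}

TinyLayeredExtractor.prototype.addHandler = function (handler) {
  this.handlers.push(handler);
};

TinyLayeredExtractor.prototype.extract = function (doc) {
  for (var i = 0; i < this.handlers.length; i++) {
    // Each hypothetical handler returns a content value, or null to pass.
    var content = this.handlers[i].extract(doc);
    if (content !== null) {
      return { isSuccess: true, engine: this.handlers[i], content: content };
    }
  }
  return { isSuccess: false };
};

var tiny = new TinyLayeredExtractor();
tiny.addHandler({ name: 'AlwaysPasses', extract: function () { return null; } });
tiny.addHandler({ name: 'UsesBody', extract: function (doc) { return doc.body; } });
var tinyResult = tiny.extract({ body: 'hello' });
```

In this sketch tinyResult.engine is the second handler, mirroring how res.engine reports which handler performed the extraction.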

Content class

    content.asLeaves();       // returns an array of Leaf class instances (see below), one per leaf node judged to be body text
    content.asNode();         // returns the deepest common ancestor of all the leaf nodes
    content.asTextFragment(); // returns the concatenated text of the nodes included in asLeaves()
    content.toString();       // returns the textContent of asNode()
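A small sketch of how an extraction result might be consumed. It uses only the properties and methods named in this README; summarizeResult and the stand-in fakeResult object are hypothetical, and a real result would come from ex.extract(document).

```javascript
// Collects commonly used fields of an extraction result into a plain object.
// Returns null when extraction did not succeed.
function summarizeResult(res) {
  if (!res || !res.isSuccess) {
    return null;
  }
  return {
    url: res.url,
    title: res.title,
    text: res.content.toString(),
    leafCount: res.content.asLeaves().length
  };
}

// Stand-in result for illustration only; it mimics the documented shape.
var fakeResult = {
  isSuccess: true,
  url: 'http://example.com/',
  title: 'Example',
  content: {
    asLeaves: function () { return [{ node: null, depth: 3 }, { node: null, depth: 4 }]; },
    toString: function () { return 'Body text'; }
  }
};
var summary = summarizeResult(fakeResult);
```

Checking res.isSuccess first matters because the README only describes res.content for successful extractions.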

Leaf class

    leaf.node;  // the leaf node
    leaf.depth; // depth of the node, counted from the body
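As one possible use of leaf.depth, the sketch below orders leaves from shallowest to deepest. sortLeavesByDepth and the sample objects are hypothetical; real leaves would come from content.asLeaves() and hold DOM nodes in leaf.node.

```javascript
// Returns a new array of leaves ordered by depth, shallowest first.
// The input array is copied with slice() so it is not mutated.
function sortLeavesByDepth(leaves) {
  return leaves.slice().sort(function (a, b) {
    return a.depth - b.depth;
  });
}

// Sample leaves for illustration; strings stand in for DOM nodes.
var sampleLeaves = [
  { node: 'p#second', depth: 5 },
  { node: 'div#intro', depth: 3 },
  { node: 'span#note', depth: 7 }
];
var sortedLeaves = sortLeavesByDepth(sampleLeaves);
```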

AUTHOR

INA Lintaro

Copyright

Copyright © 2009 INA Lintaro / Hatena. All rights reserved.

Copyright of the original implementation

Copyright © 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.

LICENCE

MIT License
