Commit 198c9fb

committed
add docs
1 parent 3e42965 commit 198c9fb

1 file changed

+42
-0
lines changed

docs/PROPOSAL.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# NLP Shared Types Rationale

## Introduction
Modern NLP libraries tend toward standardization of the core types and values used in language analysis. Although some languages might seem very different, their structure can always be boiled down to a specific set of categories that has already been researched.

## Standard Definitions

1) Part-of-speech (POS) tags come in two types: coarse-grained and fine-grained.

- Coarse-grained: the universal set initially presented in the paper http://www.petrovi.de/data/lrec.pdf
- Fine-grained

Universal tag sets are now widely adopted within the scientific community; see http://universaldependencies.org/

Both of those can be represented as a sum type like this:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)

-- Coarse-grained (universal) part-of-speech tags.
data PosCg =
    VERB  -- verbs (all tenses and modes)
  | NOUN  -- nouns (common and proper)
  | PRON  -- pronouns
  | ADJ   -- adjectives
  | ADV   -- adverbs
  | ADP   -- adpositions (prepositions and postpositions)
  | CONJ  -- conjunctions
  | DET   -- determiners
  | NUM   -- cardinal numbers
  | PRT   -- particles or other function words
  | X     -- other: foreign words, typos, abbreviations
  | PUNCT -- punctuation
  deriving (Show, Eq, Generic)
```

CoreNLP and SyntaxNet differ in small details (for example, the former labels punctuation `punc` while the latter uses `.`), but both should generally be converted to the same ADT.
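A minimal sketch of that conversion, assuming each library hands back its tag as a plain string. The function names `fromCommonTag`, `fromCoreNlpTag`, and `fromSyntaxNetTag` are hypothetical, and the only library-specific spellings handled are the `punc` and `.` cases mentioned above; real tag inventories would need their own tables.

```haskell
import Data.Char (toUpper)

-- Same constructors as PosCg above, repeated so this sketch compiles on its own.
data PosCg
  = VERB | NOUN | PRON | ADJ | ADV | ADP | CONJ | DET | NUM | PRT | X | PUNCT
  deriving (Show, Eq)

-- Labels assumed to be spelled the same way by both libraries (case-insensitive).
fromCommonTag :: String -> Maybe PosCg
fromCommonTag tag = case map toUpper tag of
  "VERB"  -> Just VERB
  "NOUN"  -> Just NOUN
  "PRON"  -> Just PRON
  "ADJ"   -> Just ADJ
  "ADV"   -> Just ADV
  "ADP"   -> Just ADP
  "CONJ"  -> Just CONJ
  "DET"   -> Just DET
  "NUM"   -> Just NUM
  "PRT"   -> Just PRT
  "X"     -> Just X
  "PUNCT" -> Just PUNCT
  _       -> Nothing

-- Hypothetical converter: punctuation arrives as "punc".
fromCoreNlpTag :: String -> Maybe PosCg
fromCoreNlpTag "punc" = Just PUNCT
fromCoreNlpTag tag    = fromCommonTag tag

-- Hypothetical converter: punctuation arrives as ".".
fromSyntaxNetTag :: String -> Maybe PosCg
fromSyntaxNetTag "." = Just PUNCT
fromSyntaxNetTag tag = fromCommonTag tag
```

With this shape, `fromCoreNlpTag "punc"` and `fromSyntaxNetTag "."` both evaluate to `Just PUNCT`, and an unknown label yields `Nothing` instead of a runtime error.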

2) Parsing logic should be separate from the type definition.

While bundling parsing logic into a type may seem like a nice way to describe a type with additional features, it makes the type much more difficult to adopt with other libraries. I think it's better to have simple record types and fill them in with external parsing functions specific to each library.
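As a rough illustration of that separation, the sketch below keeps the record free of any parsing logic and puts the format knowledge in standalone functions. The `Token` record, its field names, and the `word/TAG` and `word_TAG` input formats are all assumptions made up for this example, not formats taken from a particular library.

```haskell
-- A plain record: it knows nothing about where its values come from.
data Token = Token
  { tokenForm :: String  -- surface form of the word
  , tokenTag  :: String  -- raw tag string, still library-specific
  } deriving (Show, Eq)

-- Split on the LAST occurrence of the delimiter, so forms that
-- contain the delimiter themselves (e.g. "1/2/NUM") still parse.
parseDelimited :: Char -> String -> Maybe Token
parseDelimited delim s =
  case break (== delim) (reverse s) of
    (revTag, _ : revForm)
      | not (null revTag) && not (null revForm) ->
          Just (Token (reverse revForm) (reverse revTag))
    _ -> Nothing

-- Hypothetical library A writes tokens as "word/TAG".
parseSlashToken :: String -> Maybe Token
parseSlashToken = parseDelimited '/'

-- Hypothetical library B writes tokens as "word_TAG".
parseUnderscoreToken :: String -> Maybe Token
parseUnderscoreToken = parseDelimited '_'
```

Supporting another library then means adding one more function that returns `Token`, without touching the type or its existing users.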

3) Parsing

CoNLL is basically tab-separated CSV (!), so why do we need our own parser when we already have the very fast `cassava`? I am also in favor of using already established solutions for parsing in general. We need custom parsers only when we encounter a specific format that established libraries do not cover.
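A minimal sketch of what this could look like with `cassava`, assuming the `cassava`, `bytestring`, `text`, and `vector` packages are available. The `ConllRow` record and `decodeConll` are hypothetical names, and only the first four columns are read, on the assumption that they hold the ID, form, lemma, and coarse POS tag. Real CoNLL files also contain comment lines, and a field beginning with a double quote may be treated as CSV quoting, so some preprocessing would likely still be needed.

```haskell
import Control.Monad (mzero)
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Char (ord)
import Data.Csv
  ( DecodeOptions (decDelimiter)
  , FromRecord (parseRecord)
  , HasHeader (NoHeader)
  , decodeWith
  , defaultDecodeOptions
  , (.!)
  )
import qualified Data.Text as T
import qualified Data.Vector as V

-- Hypothetical row type: the first four columns of a CoNLL-style line.
data ConllRow = ConllRow
  { rowId    :: T.Text  -- kept as text, since IDs are not always plain integers
  , rowForm  :: T.Text
  , rowLemma :: T.Text
  , rowUpos  :: T.Text
  } deriving (Show, Eq)

-- Positional decoding; rows with fewer than four fields are rejected.
instance FromRecord ConllRow where
  parseRecord r
    | V.length r >= 4 = ConllRow <$> r .! 0 <*> r .! 1 <*> r .! 2 <*> r .! 3
    | otherwise       = mzero

-- Same decoder cassava uses for CSV, with the delimiter switched to a tab.
tabOptions :: DecodeOptions
tabOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

decodeConll :: BL.ByteString -> Either String (V.Vector ConllRow)
decodeConll = decodeWith tabOptions NoHeader
```

The `rowUpos` field is left as raw text here so that the conversion to `PosCg` stays a separate, library-specific step.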
