# What is it?
TextFeatureSelection is a Python library which helps improve text classification models through feature selection. It has two methods, `TextFeatureSelection` and `TextFeatureSelectionGA`.
**First method: TextFeatureSelection**
It follows the `filter` method for feature selection. It provides a score for each word token, and we can set a threshold for the score to decide which words to include. There are 4 algorithms in this method, as follows.
- **Chi-square** It measures the lack of independence between term (t) and class (c). It has a natural value of zero if t and c are independent; a higher value means the term is dependent on the class. It is not reliable for low-frequency terms.
- **Mutual information** Rare terms will have a higher score than common terms. For multi-class categories, the MI value is calculated for all categories and the maximum MI across categories is taken at the word level.
- **Proportional difference** It measures how close two numbers are to being equal. It helps find unigrams that occur mostly in one class of documents or the other.
- **Information gain** It gives the discriminatory power of the word.
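To make the filter idea concrete, the chi-square statistic for a term/class pair can be computed from a 2x2 contingency table of document counts. The sketch below is an illustration only (the counts and the helper function are made up, not the package's internals):

```python
def chi_square(n_tc, n_t, n_c, n):
    """Chi-square statistic for term t and class c.

    n_tc: documents in class c containing t
    n_t:  documents containing t
    n_c:  documents in class c
    n:    total documents
    """
    # Fill the 2x2 contingency table from the marginal counts.
    a = n_tc                  # t present, class c
    b = n_t - n_tc            # t present, other classes
    c = n_c - n_tc            # t absent, class c
    d = n - n_t - n_c + n_tc  # t absent, other classes
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Term occurs in class c exactly as often as chance predicts -> zero.
print(chi_square(n_tc=5, n_t=10, n_c=50, n=100))   # 0.0
# Term occurs only in class c -> strongly dependent, large statistic.
print(chi_square(n_tc=10, n_t=10, n_c=50, n=100))
```

A higher statistic means stronger term/class dependence, which is why thresholding the score selects discriminative tokens.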
# Input parameters
It has the below parameters:
- **target** List object which has the categories of labels. For more than one category, there is no need to dummy code; instead, provide label-encoded values as a list object.
- **input_doc_list** List object which has the text. Each element of the list is a text corpus. There is no need to tokenize, as the text is tokenized in the module while processing. `target` and `input_doc_list` should have the same length.
- **stop_words** Words for which you do not want metric values calculated. Default is blank.
```python
result_df=fsOBJ.getScore()
print(result_df)
```
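`getScore()` returns per-token scores, so selecting the vocabulary is then a simple threshold filter. A minimal sketch of that step (the token scores and threshold below are made-up illustrative values, using a plain dict in place of the returned table):

```python
# Hypothetical per-token scores, standing in for getScore() output.
token_scores = {
    "dinner": 0.25,
    "vacation": 0.25,
    "happy": 3.84,
    "wastage": 5.02,
}

# Keep only the tokens whose score clears the chosen threshold.
threshold = 1.0
selected = sorted(t for t, s in token_scores.items() if s >= threshold)
print(selected)  # ['happy', 'wastage']
```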
**Second method: TextFeatureSelectionGA**
It follows the `genetic algorithm` method, a population-based metaheuristic search algorithm. It returns the optimal set of word tokens which gives the best possible model score.
Its parameters are divided into 2 groups.
a) Genetic algorithm parameters: These are provided during object initialization.
- **generations** Number of generations to run the genetic algorithm. Default is 500, as used in the original paper.
- **population** Number of individual chromosomes. Default is 50, as used in the original paper.
- **prob_crossover** Probability of crossover. Default is 0.9, as used in the original paper.
- **prob_mutation** Probability of mutation. Default is 0.1, as used in the original paper.
- **percentage_of_token** Percentage of word features to be included in a given chromosome. Default is 50, as used in the original paper.
- **runtime_minutes** Number of minutes to run the algorithm. This is checked between generations: at the start of each generation, if the runtime has exceeded the allotted time, the best result from the generations executed so far is returned. Default is 120 minutes (2 hours).
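The parameters above map onto a standard genetic-algorithm loop over bit-mask chromosomes (one bit per candidate token). The toy sketch below is not the package's implementation: the fitness function simply counts selected tokens, whereas the real library evaluates a machine learning model's score for each chromosome.

```python
import random

def run_ga(n_tokens, generations=20, population=8,
           prob_crossover=0.9, prob_mutation=0.1, seed=0):
    rng = random.Random(seed)
    # Toy fitness: reward chromosomes that select many tokens.
    # The real fitness would be a cross-validated model score.
    fitness = sum
    # Each chromosome is a bit mask over the token vocabulary.
    pop = [[rng.randint(0, 1) for _ in range(n_tokens)]
           for _ in range(population)]
    for _ in range(generations):
        # Tournament selection of parents.
        parents = [max(rng.sample(pop, 2), key=fitness)
                   for _ in range(population)]
        nxt = []
        for i in range(0, population, 2):
            p1, p2 = parents[i], parents[i + 1]
            if rng.random() < prob_crossover:  # one-point crossover
                cut = rng.randrange(1, n_tokens)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            nxt += [p1, p2]
        for chrom in nxt:  # bit-flip mutation
            for j in range(n_tokens):
                if rng.random() < prob_mutation:
                    chrom[j] = 1 - chrom[j]
        pop = nxt
    return max(pop, key=fitness)

best = run_ga(n_tokens=10)
print(sum(best))  # number of tokens kept by the best chromosome found
```

The `runtime_minutes` check described above would sit at the top of the generation loop, returning the best chromosome so far once the time budget is spent.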
b) Machine learning model and tfidf parameters: These are provided during function call.
Data Parameters
- **doc_list** Text documents in a Python list. Example: `['i had dinner','i am on vacation','I am happy','Wastage of time']`
- [Entropy based feature selection for text categorization](https://hal.archives-ouvertes.fr/hal-00617969/document) by Christine Largeron, Christophe Moulin, Mathias Géry
- [Categorical Proportional Difference: A Feature Selection Method for Text Categorization](https://pdfs.semanticscholar.org/6569/9f0e1159a40042cc766139f3dfac2a3860bb.pdf) by Mondelle Simeon, Robert J. Hilderman
- [Feature Selection and Weighting Methods in Sentiment Analysis](https://www.researchgate.net/publication/242088860_Feature_Selection_and_Weighting_Methods_in_Sentiment_Analysis) by Tim O'Keefe and Irena Koprinska
- [Feature Selection For Text Classification Using Genetic Algorithms](https://ieeexplore.ieee.org/document/7804223) by Noria Bidi and Zakaria Elberrichi