
Commit 26ced06

Added genetic algorithm for feature selection in the library
1 parent f9cbff9 commit 26ced06

10 files changed: +1538 -85 lines changed

README.md

Lines changed: 114 additions & 3 deletions
@@ -1,11 +1,16 @@
# What is it?
TextFeatureSelection is a Python library which helps improve text classification models through feature selection. It provides 2 methods, `TextFeatureSelection` and `TextFeatureSelectionGA`.

**First method: TextFeatureSelection**
It follows the `filter` method for feature selection. It provides a score for each word token, and a threshold can be set on the score to decide which words to include. There are 4 algorithms in this method, as follows.

- **Chi-square** Measures the lack of independence between a term (t) and a class (c). It is zero when t and c are independent; higher values indicate the term is dependent on the class. It is not reliable for low-frequency terms.
- **Mutual information** Rare terms receive a higher score than common terms. For multi-class categories, the MI value is calculated for all categories and the Max(MI) value across all categories is taken at the word level.
- **Proportional difference** Measures how close two numbers are to being equal. It helps find unigrams that occur mostly in one class of documents or the other.
- **Information gain** Measures the discriminatory power of a word.

It takes the parameters below.

- **target** List object which has the label categories. For more than one category, there is no need to dummy code; instead, provide label-encoded values as a list object.
- **input_doc_list** List object which has the text. Each element of the list is a text corpus. There is no need to tokenize, as the text is tokenized in the module while processing. target and input_doc_list should have the same length.
- **stop_words** Words for which you do not want metric values to be calculated. Default is blank.

@@ -31,6 +36,110 @@ result_df=fsOBJ.getScore()
print(result_df)

```
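The hunk above shows only the tail of this usage example. A minimal sketch of the full call, assuming an illustrative corpus (the `target=[1,1,0,1]` context line is visible in the README.rst hunk further down; the documents here are made up):

```python
from TextFeatureSelection import TextFeatureSelection

# Illustrative inputs: raw documents plus label-encoded targets of the same length
input_doc_list=['i am happy','i had a great weekend','this is a sad story','i am feeling wonderful']
target=[1,1,0,1]

fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()   # one score per word token, per the 4 algorithms above
print(result_df)
```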

**Second method: TextFeatureSelectionGA**
It follows the `genetic algorithm` method. This is a population-based metaheuristic search algorithm. It returns the optimal set of word tokens which gives the best possible model score.
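To make the idea concrete, here is a toy, self-contained sketch of genetic-algorithm feature selection. This illustrates the general technique only, not the library's internal implementation; `ga_feature_search` and its arguments are hypothetical names, and it assumes at least two examples per class for the 2-fold cross-validation.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ga_feature_search(doc_list, label_list, generations=10, population=8,
                      prob_crossover=0.9, prob_mutation=0.1, token_fraction=0.5):
    # Each chromosome is a 0/1 mask over the vocabulary; fitness is a model score.
    vocab = sorted(TfidfVectorizer().fit(doc_list).vocabulary_)
    n, k = len(vocab), max(1, int(len(vocab) * token_fraction))

    def random_chromosome():
        genes = [0] * n
        for i in random.sample(range(n), k):  # activate token_fraction of the tokens
            genes[i] = 1
        return genes

    def fitness(genes):
        words = [w for w, g in zip(vocab, genes) if g]
        if not words:
            return 0.0
        X = TfidfVectorizer(vocabulary=words).fit_transform(doc_list)
        return cross_val_score(LogisticRegression(), X, label_list,
                               cv=2, scoring='f1_weighted').mean()

    pop = [random_chromosome() for _ in range(population)]
    for _ in range(generations):
        # Keep the fitter half as parents, then breed a new population
        parents = sorted(pop, key=fitness, reverse=True)[: max(2, population // 2)]
        children = []
        while len(children) < population:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:] if random.random() < prob_crossover else a[:]
            if random.random() < prob_mutation:  # flip one random gene
                j = random.randrange(n)
                child[j] = 1 - child[j]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return [w for w, g in zip(vocab, best) if g]
```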

Its parameters are divided into 2 groups.

a) Genetic algorithm parameters: These are provided during object initialization; see the initialization sketch after this list.
- **generations** Number of generations to run the genetic algorithm. 500 as default, as used in the original paper.
- **population** Number of individual chromosomes. 50 as default, as used in the original paper.
- **prob_crossover** Probability of crossover. 0.9 as default, as used in the original paper.
- **prob_mutation** Probability of mutation. 0.1 as default, as used in the original paper.
- **percentage_of_token** Percentage of word features to be included in a given chromosome. 50 as default, as used in the original paper.
- **runtime_minutes** Number of minutes to run the algorithm. At the start of each generation, the elapsed time is checked against the allotted time; if the limit has been exceeded, the best result from the generations executed so far is returned. Default is 2 hours, i.e. 120 minutes.
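A minimal initialization sketch with every GA parameter spelled out (the values shown are the documented defaults):

```python
from TextFeatureSelection import TextFeatureSelectionGA

# All keyword arguments below are the initialization parameters documented above
getGAobj=TextFeatureSelectionGA(generations=500,population=50,prob_crossover=0.9,
                                prob_mutation=0.1,percentage_of_token=50,runtime_minutes=120)
```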

b) Machine learning model and tfidf parameters: These are provided during the function call.

Data Parameters

- **doc_list** Text documents in a python list.
  Example: ['i had dinner','i am on vacation','I am happy','Wastage of time']
- **label_list** Labels in a python list.
  Example: ['Neutral','Neutral','Positive','Negative']

Modelling Parameters

- **model** A model which has a .fit function to train and a .predict function to predict on test data. The model should also be able to train a classifier using TfidfVectorizer features. Default is logistic regression from sklearn.
- **model_metric** Metric used to score the classifier. Select one from ['f1','precision','recall']. Default is 'f1'.
- **avrg** Averaging used in model_metric. Select one from ['micro', 'macro', 'samples', 'weighted', 'binary']. For binary classification the default is 'binary', and for multi-class classification the default is 'micro'.
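As an illustration of swapping in a different estimator, a hedged sketch (assuming, per the grouping above, that `getGeneticFeatures` accepts these as keyword arguments; `doc_list` and `label_list` follow the Data Parameters examples):

```python
from sklearn.naive_bayes import MultinomialNB
from TextFeatureSelection import TextFeatureSelectionGA

doc_list=['i had dinner','i am on vacation','I am happy','Wastage of time']
label_list=['Neutral','Neutral','Positive','Negative']

getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
# Any estimator with .fit/.predict that works on TfidfVectorizer features should do
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list,
                                            model=MultinomialNB(),model_metric='f1',avrg='micro')
```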

TfidfVectorizer Parameters

- **analyzer** {'word', 'char', 'char_wb'} or callable, default='word'
  Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- **min_df** float or int, default=2
  When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; an integer represents absolute counts. This parameter is ignored if vocabulary is not None.
- **max_df** float or int, default=1.0
  When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; an integer represents absolute counts. This parameter is ignored if vocabulary is not None.
- **stop_words** {'english'}, list, default=None
  If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value; there are several known issues with 'english' and you should consider an alternative (see the scikit-learn stop_words documentation). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.
- **tokenizer** callable, default=None
  Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
- **token_pattern** str, default=r"(?u)\\b\\w\\w+\\b"
  Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
- **lowercase** bool, default=True
  Convert all characters to lowercase before tokenizing.

# How to use it?
```python
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
```
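The TfidfVectorizer parameters above belong to the same function-call group, so a fuller call might look like the following sketch (assuming `getGeneticFeatures` forwards these keyword arguments, as the parameter grouping suggests):

```python
# Tune how candidate vocabularies are vectorized during the search
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list,
                                            analyzer='word',min_df=2,max_df=1.0,
                                            stop_words='english',lowercase=True)
```
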
# Where to get it?
`pip install TextFeatureSelection`

@@ -47,4 +156,6 @@ print(result_df)

- [Categorical Proportional Difference: A Feature Selection Method for Text Categorization](https://pdfs.semanticscholar.org/6569/9f0e1159a40042cc766139f3dfac2a3860bb.pdf) by Mondelle Simeon, Robert J. Hilderman
- [Feature Selection and Weighting Methods in Sentiment Analysis](https://www.researchgate.net/publication/242088860_Feature_Selection_and_Weighting_Methods_in_Sentiment_Analysis) by Tim O'Keefe and Irena Koprinska
- [Feature Selection For Text Classification Using Genetic Algorithms](https://ieeexplore.ieee.org/document/7804223) by Noria Bidi and Zakaria Elberrichi

README.rst

Lines changed: 119 additions & 6 deletions
@@ -1,18 +1,18 @@
What is it?
===========

TextFeatureSelection is a Python library which helps improve text classification models through feature selection. It provides 2 methods, `TextFeatureSelection` and `TextFeatureSelectionGA`.

First method: TextFeatureSelection
==================================
It follows the `filter` method for feature selection. It provides a score for each word token, and a threshold can be set on the score to decide which words to include. There are 4 algorithms in this method, as follows.

- **Chi-square** Measures the lack of independence between a term (t) and a class (c). It is zero when t and c are independent; higher values indicate the term is dependent on the class. It is not reliable for low-frequency terms.
- **Mutual information** Rare terms receive a higher score than common terms. For multi-class categories, the MI value is calculated for all categories and the Max(MI) value across all categories is taken at the word level.
- **Proportional difference** Measures how close two numbers are to being equal. It helps find unigrams that occur mostly in one class of documents or the other.
- **Information gain** Measures the discriminatory power of a word.

It takes the parameters below.

- **target** List object which has the label categories. For more than one category, there is no need to dummy code; instead, provide label-encoded values as a list object.
- **input_doc_list** List object which has the text. Each element of the list is a text corpus. There is no need to tokenize, as the text is tokenized in the module while processing. target and input_doc_list should have the same length.

@@ -40,9 +40,120 @@ target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
```

Second method: TextFeatureSelectionGA
=====================================
It follows the `genetic algorithm` method. This is a population-based metaheuristic search algorithm. It returns the optimal set of word tokens which gives the best possible model score.

Its parameters are divided into 2 groups.

a) Genetic algorithm parameters: These are provided during object initialization.

- **generations** Number of generations to run the genetic algorithm. 500 as default, as used in the original paper.
- **population** Number of individual chromosomes. 50 as default, as used in the original paper.
- **prob_crossover** Probability of crossover. 0.9 as default, as used in the original paper.
- **prob_mutation** Probability of mutation. 0.1 as default, as used in the original paper.
- **percentage_of_token** Percentage of word features to be included in a given chromosome. 50 as default, as used in the original paper.
- **runtime_minutes** Number of minutes to run the algorithm. At the start of each generation, the elapsed time is checked against the allotted time; if the limit has been exceeded, the best result from the generations executed so far is returned. Default is 2 hours, i.e. 120 minutes.

b) Machine learning model and tfidf parameters: These are provided during the function call.

Data Parameters

- **doc_list** Text documents in a python list.
  Example: ['i had dinner','i am on vacation','I am happy','Wastage of time']
- **label_list** Labels in a python list.
  Example: ['Neutral','Neutral','Positive','Negative']

Modelling Parameters

- **model** A model which has a .fit function to train and a .predict function to predict on test data. The model should also be able to train a classifier using TfidfVectorizer features. Default is logistic regression from sklearn.
- **model_metric** Metric used to score the classifier. Select one from ['f1','precision','recall']. Default is 'f1'.
- **avrg** Averaging used in model_metric. Select one from ['micro', 'macro', 'samples', 'weighted', 'binary']. For binary classification the default is 'binary', and for multi-class classification the default is 'micro'.

TfidfVectorizer Parameters

- **analyzer** {'word', 'char', 'char_wb'} or callable, default='word'
  Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- **min_df** float or int, default=2
  When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; an integer represents absolute counts. This parameter is ignored if vocabulary is not None.
- **max_df** float or int, default=1.0
  When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; an integer represents absolute counts. This parameter is ignored if vocabulary is not None.
- **stop_words** {'english'}, list, default=None
  If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value; there are several known issues with 'english' and you should consider an alternative (see the scikit-learn stop_words documentation). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.
- **tokenizer** callable, default=None
  Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
- **token_pattern** str, default=r"(?u)\\b\\w\\w+\\b"
  Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
- **lowercase** bool, default=True
  Convert all characters to lowercase before tokenizing.

How to use it?
==============

```python
from TextFeatureSelection import TextFeatureSelectionGA

#Input documents: doc_list
#Input labels: label_list

getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
```
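Once the search finishes, the returned `best_vocabulary` is the set of selected word tokens, so one natural follow-up (an illustrative sketch, not part of the library) is to train a final model restricted to it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Restrict the vectorizer to the GA-selected tokens and fit a final classifier;
# doc_list and label_list are the same inputs used in the call above
vectorizer=TfidfVectorizer(vocabulary=list(best_vocabulary))
X=vectorizer.fit_transform(doc_list)
clf=LogisticRegression().fit(X,label_list)
```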

Where to get it?
================

@@ -66,3 +177,5 @@ References
- [Entropy based feature selection for text categorization](https://hal.archives-ouvertes.fr/hal-00617969/document) by Christine Largeron, Christophe Moulin, Mathias Géry
- [Categorical Proportional Difference: A Feature Selection Method for Text Categorization](https://pdfs.semanticscholar.org/6569/9f0e1159a40042cc766139f3dfac2a3860bb.pdf) by Mondelle Simeon, Robert J. Hilderman
- [Feature Selection and Weighting Methods in Sentiment Analysis](https://www.researchgate.net/publication/242088860_Feature_Selection_and_Weighting_Methods_in_Sentiment_Analysis) by Tim O'Keefe and Irena Koprinska
- [Feature Selection For Text Classification Using Genetic Algorithms](https://ieeexplore.ieee.org/document/7804223) by Noria Bidi and Zakaria Elberrichi
