89 changes: 89 additions & 0 deletions contrib/machine-learning/Tf-IDF.md
@@ -0,0 +1,89 @@
## TF-IDF (Term Frequency-Inverse Document Frequency)

### Introduction
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a powerful statistical measure used in the fields of information retrieval and text mining. It helps determine the significance of a word within a single document while also considering its rarity across a collection of documents, known as a corpus. By balancing these two aspects, TF-IDF effectively highlights the most important terms that characterize the content of a document.


### Terminologies
* Term Frequency (TF): measures how frequently a term appears in a document and thus captures the prominence of a word within that document. The basic idea is that the more often a word appears in a document, the more important it might be; this count is normalized by the total number of terms in the document so that documents of different lengths can be compared fairly.
Mathematically, the term frequency tf(t,d) of term t in document d is given by:
$$tf(t,d) = N(t) / T(d)$$
where,
N(t) = number of times term t appears in document d &
T(d) = total number of terms in document d.


* Inverse Document Frequency (IDF): measures how rare a term is across the entire corpus. While TF captures the local importance of a term, IDF adjusts this importance by considering the term's distribution across all documents. The core idea is that a term appearing in many documents is not very useful for distinguishing one document from another, whereas terms that are rare across the corpus are often more discriminative.
The IDF for a term t is calculated as:
$$idf(t) = \log(N / df(t))$$
where,
df(t) = number of documents containing term t &
N = total number of documents.

* TF-IDF: the product of TF and IDF, giving a single measure that reflects both the frequency of a term in a document and its rarity across the corpus. This combination allows TF-IDF to identify terms that are both significant within a specific document and distinctive across the entire collection of documents (see the sketch after this list).
The TF-IDF score for a term t in document d within a corpus D is computed as:
$$\text{TF-IDF}(t,d,D) = tf(t,d) \times idf(t,D)$$
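
A minimal from-scratch sketch of these three formulas is shown below. It is illustrative only: the function names, the tiny two-document corpus, and the choice of a base-10 logarithm are assumptions made for this example, not part of any particular library.

```python
import math

def tf(term, document):
    """tf(t, d): occurrences of `term` divided by the total number of terms in `document`."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """idf(t): log10(N / df(t)), where df(t) is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

def tf_idf(term, document, corpus):
    """TF-IDF(t, d, D): the product of the two quantities above."""
    return tf(term, document) * idf(term, corpus)

# Hypothetical two-document corpus, tokenised by simple whitespace splitting.
corpus = [doc.lower().split() for doc in ["A red apple", "A green pear and a red plum"]]
print(tf_idf("red", corpus[1], corpus))   # "red" appears in both documents, so idf = log10(2/2) = 0
print(tf_idf("pear", corpus[1], corpus))  # rarer term -> higher score
```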

### Applications of TF-IDF
TF-IDF is widely used in various applications across different fields, including:
* Information Retrieval: Enhancing search engines to return more relevant results.
* Content Tagging: Automatically tagging documents with relevant keywords.
* Text Mining and Analysis: Identifying important words in large text corpora.
* Text Similarity Measurement: Comparing documents to find similarities.
* Document Clustering and Classification: Grouping documents into clusters and classifying them based on their content.
* Natural Language Processing (NLP): Improving various NLP tasks like sentiment analysis, topic modeling, etc.
* Recommendation Systems: Recommending content based on text analysis.


### Advantages
* Simple to Implement: TF-IDF is straightforward to compute and implement.
* Useful for Information Retrieval: It helps in identifying the most relevant documents for a given query.
* Effective in Highlighting Important Words: It balances term frequency with the rarity of terms across the corpus.
* Does Not Require Labeled Data: It can be applied to any text corpus without the need for labeled data.
* Versatile: Applicable across a wide range of text analysis tasks.


### Disadvantages
* Ignores Word Order and Context: TF-IDF treats text as a bag of words, disregarding the order and context of terms.
* Does Not Capture Semantic Relationships: It cannot capture the meanings and relationships between words.
* Not Effective for Polysemous Words: Words with multiple meanings can lead to inaccuracies.
* Assumes Independence of Terms: Assumes that terms are independent of each other.
* Large Vocabulary Size: Can increase computational complexity with very large corpora.


### Working
###### Consider a simple example with three documents:

* Document 1: "the cat in the hat"
* Document 2: "the quick brown fox"
* Document 3: "the cat and the mouse"

###### Calculating TF-IDF for the term "cat":

1) TF (cat, Document 1):

* Term Frequency: 1 (appears once)
* Total Terms: 5
* TF: 1/5 = 0.2

2) IDF (cat, All Documents):

* Total Documents: 3
* Documents containing "cat": 2 (Document 1 and Document 3)
* IDF: log(3/2) ≈ 0.176 (using a base-10 logarithm)

3) TF-IDF (cat, Document 1):

* TF-IDF: 0.2 × 0.176 = 0.0352 (the short script after these steps checks this arithmetic)
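
A quick check of these numbers in plain Python, assuming the base-10 logarithm used in step 2:

```python
import math

tf_cat_doc1 = 1 / 5                       # "cat" appears once among 5 terms in Document 1
idf_cat = math.log10(3 / 2)               # 3 documents in total, 2 of them contain "cat"

print(round(tf_cat_doc1, 3))              # 0.2
print(round(idf_cat, 3))                  # 0.176
print(round(tf_cat_doc1 * idf_cat, 4))    # 0.0352
```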

###### Interpretation
The TF-IDF scores indicate the importance of the term "cat" in each document:
* In Document 1, "cat" has a moderate importance with a TF-IDF score of 0.0352.
* In Document 2, "cat" does not appear, so its TF-IDF score is 0.
* In Document 3, "cat" also appears once among 5 terms, so it has the same TF-IDF score of 0.0352.

This example shows how TF-IDF balances a term's frequency within an individual document against its rarity across the entire corpus, allowing us to identify the most significant terms in context. A ready-made library version of the same computation is sketched below.
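
In practice, TF-IDF is rarely computed by hand. The following is a minimal sketch using scikit-learn's TfidfVectorizer; note that recent scikit-learn versions default to raw term counts, a smoothed natural-log IDF, and L2-normalised document vectors, so its absolute scores will not match the hand calculation above even though the overall idea is the same.

```python
# Requires scikit-learn (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat in the hat",
    "the quick brown fox",
    "the cat and the mouse",
]

vectorizer = TfidfVectorizer()            # default settings: smoothed idf, L2-normalised rows
matrix = vectorizer.fit_transform(docs)   # sparse matrix of shape (3 documents, vocabulary size)

vocab = list(vectorizer.get_feature_names_out())
cat_col = vocab.index("cat")
for i in range(len(docs)):
    print(f"TF-IDF of 'cat' in Document {i + 1}: {matrix[i, cat_col]:.3f}")
```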



### Conclusion
TF-IDF (Term Frequency-Inverse Document Frequency) is a robust technique in text mining and information retrieval. It adeptly balances the frequency of terms within a document with their rarity across a corpus, making it an invaluable tool for highlighting significant terms. Whether used for enhancing search engines, tagging content, analyzing texts, or improving natural language processing tasks, TF-IDF remains a cornerstone technique in the realm of text analysis.
2 changes: 1 addition & 1 deletion contrib/machine-learning/index.md
@@ -1,3 +1,3 @@
# List of sections

- [Section title](filename.md)
- [Term Frequency-Inverse Document Frequency](Tf-IDF.md)