89 changes: 89 additions & 0 deletions contrib/machine-learning/Tf-IDF.md
@@ -0,0 +1,89 @@
## TF-IDF (Term Frequency-Inverse Document Frequency)

### Introduction
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a powerful statistical measure used in the fields of information retrieval and text mining. It helps determine the significance of a word within a single document while also considering its rarity across a collection of documents, known as a corpus. By balancing these two aspects, TF-IDF effectively highlights the most important terms that characterize the content of a document.


### Terminologies
* Term Frequency (TF): measures how frequently a term appears in a document and thus captures the prominence of a word within that document. The basic idea is that the more often a word appears in a document, the more important it might be; this count is normalized by the total number of terms in the document so that documents of different lengths can be compared fairly.
Mathematically, the term frequency tf(t,d) of term t in document d is given by:
$$tf(t,d) = N(t) / T(d)$$
where,
N(t) = number of times term t appears in document d &
T(d) = total number of terms in document d.


* Inverse Document Frequency (IDF): measures how rare a term is across the entire corpus. While TF captures the local importance of a term, IDF adjusts this importance by considering the term's distribution across all documents. The core idea is that a term appearing in many documents is not very useful for distinguishing one document from another, whereas terms that are rare across the corpus are often more discriminative.
The IDF for a term t is calculated as:
$$idf(t) = \log(N / df(t))$$
where,
df(t) = number of documents containing term t &
N = total number of documents.

* TF-IDF: the product of TF and IDF, giving a single measure that reflects both the frequency of a term in a document and its rarity across the corpus. This combination allows TF-IDF to identify terms that are both significant within a specific document and distinctive across the entire collection of documents (see the sketch after this list).
The TF-IDF score for a term t in document d within a corpus D is computed as:
$$\text{TF-IDF}(t,d,D) = tf(t,d) \times idf(t,D)$$
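
A minimal from-scratch sketch of these three formulas is shown below. It is illustrative only: the function names, the tiny two-document corpus, and the choice of a base-10 logarithm are assumptions made for this example, not part of any particular library.

```python
import math

def tf(term, document):
    """tf(t, d): occurrences of `term` divided by the total number of terms in `document`."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """idf(t): log10(N / df(t)), where df(t) is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

def tf_idf(term, document, corpus):
    """TF-IDF(t, d, D): the product of the two quantities above."""
    return tf(term, document) * idf(term, corpus)

# Hypothetical two-document corpus, tokenised by simple whitespace splitting.
corpus = [doc.lower().split() for doc in ["A red apple", "A green pear and a red plum"]]
print(tf_idf("red", corpus[1], corpus))   # "red" appears in both documents, so idf = log10(2/2) = 0
print(tf_idf("pear", corpus[1], corpus))  # rarer term -> higher score
```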

### Applications of TF-IDF
TF-IDF is widely used in various applications across different fields, including:
* Information Retrieval: Enhancing search engines to return more relevant results.
* Content Tagging: Automatically tagging documents with relevant keywords.
* Text Mining and Analysis: Identifying important words in large text corpora.
* Text Similarity Measurement: Comparing documents to find similarities.
* Document Clustering and Classification: Grouping documents into clusters and classifying them based on their content.
* Natural Language Processing (NLP): Improving various NLP tasks like sentiment analysis, topic modeling, etc.
* Recommendation Systems: Recommending content based on text analysis.


### Advantages
* Simple to Implement: TF-IDF is straightforward to compute and implement.
* Useful for Information Retrieval: It helps in identifying the most relevant documents for a given query.
* Effective in Highlighting Important Words: It balances term frequency with the rarity of terms across the corpus.
* Does Not Require Labeled Data: It can be applied to any text corpus without the need for labeled data.
* Versatile: Applicable across a wide range of text analysis tasks.


### Disadvantages
* Ignores Word Order and Context: TF-IDF treats text as a bag of words, disregarding the order and context of terms.
* Does Not Capture Semantic Relationships: It cannot capture the meanings and relationships between words.
* Not Effective for Polysemous Words: Words with multiple meanings can lead to inaccuracies.
* Assumes Independence of Terms: Assumes that terms are independent of each other.
* Large Vocabulary Size: Can increase computational complexity with very large corpora.


### Working
###### Consider a simple example with three documents:

* Document 1: "the cat in the hat"
* Document 2: "the quick brown fox"
* Document 3: "the cat and the mouse"

###### Calculating TF-IDF for the term "cat":

1) TF (cat, Document 1):

* Term Frequency: 1 (appears once)
* Total Terms: 5
* TF: 1/5 = 0.2

2) IDF (cat, All Documents):

* Total Documents: 3
* Documents containing "cat": 2 (Document 1 and Document 3)
* IDF: log(3/2) ≈ 0.176 (using a base-10 logarithm)

3) TF-IDF (cat, Document 1):

* TF-IDF: 0.2 × 0.176 = 0.0352 (the short script after these steps checks this arithmetic)
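
A quick check of these numbers in plain Python, assuming the base-10 logarithm used in step 2:

```python
import math

tf_cat_doc1 = 1 / 5                       # "cat" appears once among 5 terms in Document 1
idf_cat = math.log10(3 / 2)               # 3 documents in total, 2 of them contain "cat"

print(round(tf_cat_doc1, 3))              # 0.2
print(round(idf_cat, 3))                  # 0.176
print(round(tf_cat_doc1 * idf_cat, 4))    # 0.0352
```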

###### Interpretation
The TF-IDF scores indicate the importance of the term "cat" in each document:
* In Document 1, "cat" has a moderate importance with a TF-IDF score of 0.0352.
* In Document 2, "cat" does not appear, so its TF-IDF score is 0.
* In Document 3, "cat" also appears once among 5 terms, so it has the same TF-IDF score of 0.0352.

This example shows how TF-IDF balances a term's frequency within an individual document against its rarity across the entire corpus, allowing us to identify the most significant terms in context. A ready-made library version of the same computation is sketched below.
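
In practice, TF-IDF is rarely computed by hand. The following is a minimal sketch using scikit-learn's TfidfVectorizer; note that recent scikit-learn versions default to raw term counts, a smoothed natural-log IDF, and L2-normalised document vectors, so its absolute scores will not match the hand calculation above even though the overall idea is the same.

```python
# Requires scikit-learn (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat in the hat",
    "the quick brown fox",
    "the cat and the mouse",
]

vectorizer = TfidfVectorizer()            # default settings: smoothed idf, L2-normalised rows
matrix = vectorizer.fit_transform(docs)   # sparse matrix of shape (3 documents, vocabulary size)

vocab = list(vectorizer.get_feature_names_out())
cat_col = vocab.index("cat")
for i in range(len(docs)):
    print(f"TF-IDF of 'cat' in Document {i + 1}: {matrix[i, cat_col]:.3f}")
```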



### Conclusion
TF-IDF (Term Frequency-Inverse Document Frequency) is a robust technique in text mining and information retrieval. It adeptly balances the frequency of terms within a document with their rarity across a corpus, making it an invaluable tool for highlighting significant terms. Whether used for enhancing search engines, tagging content, analyzing texts, or improving natural language processing tasks, TF-IDF remains a cornerstone technique in the realm of text analysis.
2 changes: 1 addition & 1 deletion contrib/machine-learning/index.md
@@ -1,3 +1,3 @@
# List of sections

- [Section title](filename.md)
- [Term Frequency-Inverse Document Frequency](Tf-IDF.md)