|
879 | 879 | "\n", |
880 | 880 | "### Disadvantages\n", |
881 | 881 | "1. The size of vector increases with the size of the vocabulary. Thus, sparsity continues to be a problem. One way to control it is by limiting the vocabulary to n number of the most frequent words.\n", |
882 | | - "2. It does not capture the similarity between different words that mean the same thing. Say we have three documents: “walk”, “walked”, and “walking”. BoW vectors of all three documents will be equally apart.\n", |
| 882 | + "2. It does not capture the similarity between different words that mean the same thing. i.e. Semantic Meaning is not captured.\n", |
| 883 | + "> a. \"walk\", \"walked\", and \"walking\". BoW vectors of all three tokens will be equally apart. \n", |
| 884 | + "> b. \"search\" and \"explore\" are synonyms. BoW won't capture the semantic similarity of these words.\n", |
883 | 885 | "3. This representation does not have any way to handle out of vocabulary (OOV) words (i.e., new words that were not seen in the corpus that was used to build the vectorizer).\n", |
884 | 886 | "4. As the name indicates, it is a “bag” of words. Word order information is lost in this representation. One way to control it is by using n-grams.\n", |
885 | 887 | "5. It suffers from **curse of high dimensionality.**" |
|
1124 | 1126 | "print(dtm.toarray())" |
1125 | 1127 | ] |
1126 | 1128 | }, |
1127 | | - { |
1128 | | - "cell_type": "code", |
1129 | | - "execution_count": 29, |
1130 | | - "metadata": {}, |
1131 | | - "outputs": [ |
1132 | | - { |
1133 | | - "name": "stdout", |
1134 | | - "output_type": "stream", |
1135 | | - "text": [ |
1136 | | - "{'best': 1, 'time': 3, 'worst': 5, 'age': 0, 'wisdom': 4, 'foolishness': 2}\n", |
1137 | | - "dict_keys(['best', 'time', 'worst', 'age', 'wisdom', 'foolishness'])\n", |
1138 | | - "[('age', 0), ('best', 1), ('foolishness', 2), ('time', 3), ('wisdom', 4), ('worst', 5)]\n", |
1139 | | - "['age', 'best', 'foolishness', 'time', 'wisdom', 'worst']\n" |
1140 | | - ] |
1141 | | - } |
1142 | | - ], |
1143 | | - "source": [ |
1144 | | - "vocabulary = vocab.vocabulary_\n", |
1145 | | - "\n", |
1146 | | - "print(vocabulary)\n", |
1147 | | - "\n", |
1148 | | - "print(vocabulary.keys())\n", |
1149 | | - "\n", |
1150 | | - "sort_vocab_tup = sorted(vocabulary.items(), key = lambda x : x[1])\n", |
1151 | | - "\n", |
1152 | | - "print(sort_vocab_tup)\n", |
1153 | | - "\n", |
1154 | | - "sort_vocab = [tup[0] for tup in sort_vocab_tup]\n", |
1155 | | - "\n", |
1156 | | - "print(sort_vocab)" |
1157 | | - ] |
1158 | | - }, |
1159 | 1129 | { |
1160 | 1130 | "cell_type": "code", |
1161 | 1131 | "execution_count": 30, |
|
1178 | 1148 | }, |
1179 | 1149 | { |
1180 | 1150 | "cell_type": "code", |
1181 | | - "execution_count": 31, |
| 1151 | + "execution_count": 32, |
1182 | 1152 | "metadata": {}, |
1183 | 1153 | "outputs": [ |
1184 | 1154 | { |
|
1259 | 1229 | "3 1 0 1 0 0 0" |
1260 | 1230 | ] |
1261 | 1231 | }, |
1262 | | - "execution_count": 31, |
| 1232 | + "execution_count": 32, |
1263 | 1233 | "metadata": {}, |
1264 | 1234 | "output_type": "execute_result" |
1265 | 1235 | } |
|
1270 | 1240 | }, |
1271 | 1241 | { |
1272 | 1242 | "cell_type": "code", |
1273 | | - "execution_count": 32, |
| 1243 | + "execution_count": 33, |
1274 | 1244 | "metadata": {}, |
1275 | 1245 | "outputs": [], |
1276 | 1246 | "source": [ |
|
1283 | 1253 | }, |
1284 | 1254 | { |
1285 | 1255 | "cell_type": "code", |
1286 | | - "execution_count": 33, |
| 1256 | + "execution_count": 34, |
1287 | 1257 | "metadata": {}, |
1288 | 1258 | "outputs": [ |
1289 | 1259 | { |
|
1300 | 1270 | }, |
1301 | 1271 | { |
1302 | 1272 | "cell_type": "code", |
1303 | | - "execution_count": 34, |
| 1273 | + "execution_count": 35, |
1304 | 1274 | "metadata": { |
1305 | 1275 | "scrolled": true |
1306 | 1276 | }, |
|
1323 | 1293 | }, |
1324 | 1294 | { |
1325 | 1295 | "cell_type": "code", |
1326 | | - "execution_count": 35, |
| 1296 | + "execution_count": 36, |
1327 | 1297 | "metadata": { |
1328 | 1298 | "scrolled": true |
1329 | 1299 | }, |
|
1432 | 1402 | "3 0 0 " |
1433 | 1403 | ] |
1434 | 1404 | }, |
1435 | | - "execution_count": 35, |
| 1405 | + "execution_count": 36, |
1436 | 1406 | "metadata": {}, |
1437 | 1407 | "output_type": "execute_result" |
1438 | 1408 | } |
|
1457 | 1427 | "source": [ |
1458 | 1428 | "## Term Frequency Inverse Document Frequency\n", |
1459 | 1429 | "\n", |
1460 | | - "In BOW approach all the words in the text are treated as equally important i.e. there's no notion of some words in the document being more important than others. TF-IDF, or term frequency-inverse document frequency, addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus." |
| 1430 | + "In BOW approach all the words in the text are treated as equally important i.e. there's no notion of some words in the document being more important than others. TF-IDF, or term frequency-inverse document frequency, addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus.\n", |
| 1431 | + "\n", |
| 1432 | + "***\n", |
| 1433 | + "\n", |
| 1434 | + "Let's now try to understand:\n", |
| 1435 | + "1. Term Frequency \n", |
| 1436 | + "2. Inverse Document Frequency\n", |
| 1437 | + "\n", |
| 1438 | + "$$ TF \\ IDF = TF(word_i, doc_j) * IDF(word_i, corpus) $$\n", |
| 1439 | + "\n", |
| 1440 | + "$$ TF(word_i, doc_j) = \\frac{No \\ of \\ time \\ word_i \\ occurs \\ in \\ doc_j}{Total \\ no \\ of \\ words \\ in \\ doc_j} $$\n", |
| 1441 | + "\n", |
| 1442 | + "$$ IDF(word_i, corpus) = \\log_n(\\frac{No \\ of \\ docs \\ in \\ corpus}{No \\ of \\ docs \\ which \\ contains \\ word_i}) $$\n", |
| 1443 | + "\n", |
| 1444 | + "***\n", |
| 1445 | + "\n", |
| 1446 | + "### Advantages\n", |
| 1447 | + "1. If the word is rare in the corpus, it will be given more importance. (i.e. IDF)\n", |
| 1448 | + "2. If the word is more frequent in a document, it will be given more importance. (i.e. TF)\n", |
| 1449 | + "\n", |
| 1450 | + "### Disadvantages\n", |
| 1451 | + "> **Same as BOW**" |
1461 | 1452 | ] |
1462 | 1453 | }, |
1463 | 1454 | { |
1464 | 1455 | "cell_type": "code", |
1465 | | - "execution_count": 36, |
| 1456 | + "execution_count": 37, |
1466 | 1457 | "metadata": {}, |
1467 | 1458 | "outputs": [], |
1468 | 1459 | "source": [ |
|
1477 | 1468 | }, |
1478 | 1469 | { |
1479 | 1470 | "cell_type": "code", |
1480 | | - "execution_count": 37, |
| 1471 | + "execution_count": 38, |
1481 | 1472 | "metadata": {}, |
1482 | 1473 | "outputs": [ |
1483 | 1474 | { |
|
1494 | 1485 | }, |
1495 | 1486 | { |
1496 | 1487 | "cell_type": "code", |
1497 | | - "execution_count": 38, |
| 1488 | + "execution_count": 39, |
1498 | 1489 | "metadata": { |
1499 | 1490 | "scrolled": true |
1500 | 1491 | }, |
|
1518 | 1509 | }, |
1519 | 1510 | { |
1520 | 1511 | "cell_type": "code", |
1521 | | - "execution_count": 39, |
| 1512 | + "execution_count": 40, |
1522 | 1513 | "metadata": {}, |
1523 | 1514 | "outputs": [ |
1524 | 1515 | { |
|
1599 | 1590 | "3 0.61913 0.000000 0.785288 0.00000 0.000000 0.000000" |
1600 | 1591 | ] |
1601 | 1592 | }, |
1602 | | - "execution_count": 39, |
| 1593 | + "execution_count": 40, |
1603 | 1594 | "metadata": {}, |
1604 | 1595 | "output_type": "execute_result" |
1605 | 1596 | } |
|
1612 | 1603 | "cell_type": "markdown", |
1613 | 1604 | "metadata": {}, |
1614 | 1605 | "source": [ |
| 1606 | + "## Word Embeddings\n", |
| 1607 | + "\n", |
| 1608 | + "In natural language processing (NLP), [word embedding](https://en.wikipedia.org/wiki/Word_embedding) is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of [language modeling](https://en.wikipedia.org/wiki/Language_model) and [feature learning](https://en.wikipedia.org/wiki/Feature_learning) techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.\n", |
| 1609 | + "\n", |
| 1610 | + "Methods to generate this mapping include **neural networks**, **dimensionality reduction on the word co-occurrence matrix**, **probabilistic models**, **explainable knowledge base method**, and **explicit representation in terms of the context in which words appear**.\n", |
| 1611 | + "\n", |
| 1612 | + "Traditionally, one of the main **limitations of word embeddings** (word vector space models in general) is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, [polysemy](https://en.wikipedia.org/wiki/Polysemy) and [homonymy](https://en.wikipedia.org/wiki/Homonym) are not handled properly. \n", |
| 1613 | + "\n", |
| 1614 | + "\n", |
| 1615 | + "\n", |
1615 | 1616 | "## Word2Vec\n", |
1616 | 1617 | "\n", |
1617 | 1618 | "\"You shall know the word by the company it keeps.\" by JR Firth\n", |
|