This repository covers web crawling, information extraction, and knowledge graph construction.
It contains Python 3 implementations of utilities and algorithms for building your own knowledge graph.
I will enrich these implementations and descriptions from time to time. If you include any of my work in your website or project, please add a link to this repository and send me an email to let me know.
Your comments are welcome. Thanks,
| Programs | Description | Link |
|---|---|---|
| JSONLines | Once your crawler has downloaded many pages, how can you aggregate all of those files into a single one? JSON Lines is the answer. The program packages each file as a single JSON object and writes it to one output file that contains one JSON object per line. | Source Code |
| Conditional Random Field | This program demonstrates how to use a CRF to extract textbook information from syllabus web pages. | Source Code |
| Wrapper and BeautifulSoup | This program demonstrates how to extract information from JSON Lines with BeautifulSoup. | Source Code |
| Facebook Crawler | This is a crawler that collects Facebook posts via the Facebook Graph API. | Source Code |
| SPARQL | This is an exercise in querying the DBpedia Virtuoso SPARQL Query Editor to answer university-related questions. | Source Code |
| Market Index Prediction | This is the final project for building a knowledge graph. My teammate YuCheng Kuo and I combine stock price information with social media listening data and feed both into an LSTM (Long Short-Term Memory) machine learning model to predict the trend of the Dow Jones Industrial Average index for the next day and the next 30 days. | Project Repository |
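
The JSONLines idea in the table above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual program: the function names `pack_to_jsonl` and `read_jsonl`, the `file`/`content` field names, and the `*.html` pattern are all assumptions made for the example.

```python
import json
from pathlib import Path


def pack_to_jsonl(input_dir, output_path, pattern="*.html"):
    """Write one JSON object per crawled file into a single JSON Lines file.

    The record layout ("file", "content") is a hypothetical choice for this sketch.
    Returns the number of files packaged.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            record = {
                "file": path.name,
                "content": path.read_text(encoding="utf-8", errors="replace"),
            }
            # json.dumps escapes embedded newlines, so each record stays on one line.
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Calling `pack_to_jsonl("pages/", "pages.jsonl")` on a hypothetical `pages/` folder would produce a single file that `read_jsonl` can stream back record by record, without loading every page into memory at once.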
- CRF suite Example: https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb
- CRF Suite: https://python-crfsuite.readthedocs.io/en/latest/
- Facebook Crawler by Jacob: https://github.com/chenjr0719/Facebook-Page-Crawler/edit/master/Facebook_Page_Crawler.py
Cheng-Lin Li @ University of Southern California: chenglil@usc.edu or clark.cl.li@gmail.com