Commit 0a1d6bc

Script ready, Readme updated.
Author: rishabhiitbhu
1 parent dfb34d3 commit 0a1d6bc

3 files changed, 44 additions and 0 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -87,3 +87,4 @@ ENV/
 
 # Rope project settings
 .ropeproject
+old/

README.md

Lines changed: 5 additions & 0 deletions
@@ -1,2 +1,7 @@
 # BHU_mail
+
 A python script to get the emails of all the professors of BHU.
+
+Just run `get_the_mails.py` and it will crawl the contact section of BHU website, download all the docs containing details of every department, then use regex to get the emails out and paste it in results.txt.
+
+Enjoy!
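
The README describes the last two steps as running a regex over each downloaded document and writing the matches to results.txt. A minimal Python 3 sketch of those two steps follows, reusing the pattern that appears in `get_the_mails.py` in this commit. The function names `extract_emails` and `write_results`, the latin-1 decoding, and the de-duplication are assumptions added for illustration; the committed script is Python 2 and writes every match, repeats included.

```python
import re

# Same pattern as the committed script. It is deliberately loose and can
# capture trailing dots or hyphens around an address.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')


def extract_emails(data):
    """Return unique pattern matches in first-seen order.

    `data` may be bytes (a raw download) or str.
    """
    if isinstance(data, bytes):
        # Assumption: latin-1 is used only because it maps every byte to a
        # character, so decoding a binary document never raises.
        data = data.decode('latin-1')
    seen = set()
    unique = []
    for match in EMAIL_RE.findall(data):
        if match not in seen:
            seen.add(match)
            unique.append(match)
    return unique


def write_results(emails, path='results.txt'):
    """Write one address per line, as the script does for results.txt."""
    with open(path, 'w') as f:
        for email in emails:
            f.write(email + '\n')
```

Calling `write_results(extract_emails(doc_data))` once per downloaded document would overwrite the file each time, so in practice the matches from all documents would be collected first, as the script does with its `emails` list.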

get_the_mails.py

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+import httplib2
+import urllib2
+from bs4 import BeautifulSoup, SoupStrainer
+import re
+
+http = httplib2.Http()
+website = 'http://www.bhu.ac.in/telephone/'
+doc_links = []
+emails = []
+
+status, response = http.request(website)
+
+for link in BeautifulSoup(response,"lxml", parse_only=SoupStrainer('a')):
+    if link.has_attr('href'):
+        if 'doc' in link['href']:
+            doc_links.append(website+link['href'])
+
+print "Got doc links.."
+print doc_links
+
+for i in doc_links:
+    try:
+        doc = urllib2.urlopen(i)
+        doc_data = doc.read()
+        match = match = re.findall(r'[\w\.-]+@[\w\.-]+', doc_data)
+        emails.extend(match)
+        print "done", i
+    except Exception as e:
+        print e
+
+print 'Yo, got all emails'
+
+print 'writing data in results.txt'
+with open('results.txt', "w") as f:
+    for email in emails:
+        f.write(email)
+        print(email)
+        f.write("\n")
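
The script builds each document URL by concatenating `website` with the raw `href`, which only works while every link is relative to the listing page. One possible alternative is sketched below in Python 3, using the standard library's `html.parser` and `urllib.parse.urljoin` in place of BeautifulSoup. This is a swapped-in technique and not what the commit does. The names `DocLinkParser` and `collect_doc_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DocLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains 'doc'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.doc_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Same substring filter as the committed script: 'doc' in href.
        if href and 'doc' in href:
            # urljoin resolves relative, root-relative, and absolute hrefs
            # against the page URL instead of blindly concatenating.
            self.doc_links.append(urljoin(self.base_url, href))


def collect_doc_links(html, base_url):
    """Parse `html` (a str) and return the matching document URLs."""
    parser = DocLinkParser(base_url)
    parser.feed(html)
    return parser.doc_links
```

With the listing URL from the script as `base_url`, a made-up relative link such as `physics.doc` resolves under `/telephone/`, while a root-relative link such as `/telephone/chem.doc` is not prefixed a second time.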
