Uscrapper (v2.0)
Introducing Uscrapper 2.0:
*   Introduced multiple modules to bypass anti-web-scraping techniques.
*   Introduced Crawl and scrape: an advanced module that crawls a website and scrapes its pages from within.
*   Implemented multithreading to make these processes faster.
z0m31en7 committed Aug 2, 2023
1 parent 0b1f265 commit 23680de
Showing 7 changed files with 223 additions and 86 deletions.
26 changes: 15 additions & 11 deletions README.md
@@ -1,12 +1,12 @@
<h1 align="center" id="title">Uscrapper</h1><br>
<h1 align="center" id="title">Uscrapper 2.0</h1><br>

<p align="center"><img src="https://socialify.git.ci/z0m31en7/Uscrapper/image?font=Source%20Code%20Pro&amp;name=1&amp;owner=1&amp;pattern=Plus&amp;theme=Dark" alt="project-image"></p><br>

<p id="description">Uscrapper is an OSINT tool built on python that allows users to extract various personal information from a website. It leverages web scraping techniques and regular expressions to extract email addresses social media links author names geolocations phone numbers and usernames from both hyperlinked and non-hyperlinked sources on the webpage. The tool also provides an option to generate a report containing the extracted details.</p><br><br>
<p id="description">Introducing Uscrapper 2.0, a powerful OSINT web scraper that lets users extract various personal information from a website. It leverages web-scraping techniques and regular expressions to extract email addresses, social media links, author names, geolocations, phone numbers, and usernames from both hyperlinked and non-hyperlinked sources on a webpage. It supports multithreading to make this process faster, ships with advanced modules for bypassing anti-web-scraping measures, and can crawl and scrape sublinks within the same domain. The tool also provides an option to generate a report containing the extracted details.</p><br><br>
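
At its core the extraction step boils down to "fetch the page, run regular expressions over its text". A minimal illustrative sketch of that idea (not the tool's actual code; the helper name, timeout, and email pattern below are only for demonstration):

```
import re
import requests

def quick_email_scan(url):
    # Fetch the raw HTML and collect anything that looks like an email address.
    html = requests.get(url, timeout=10).text
    return set(re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', html))

print(quick_email_scan("https://example.com"))
```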

<p align="center"><img src="https://img.shields.io/badge/Windows-0078D6?style=for-the-badge&amp;logo=windows&amp;logoColor=white" alt="shields"><img src="https://img.shields.io/badge/Linux-FCC624?style=for-the-badge&amp;logo=linux&amp;logoColor=black" alt="shields"><img src="https://img.shields.io/badge/tmux-1BB91F?style=for-the-badge&amp;logo=tmux&amp;logoColor=white" alt="shields"><img src="https://img.shields.io/badge/windows%20terminal-4D4D4D?style=for-the-badge&amp;logo=windows%20terminal&amp;logoColor=white" alt="shields"><img src="https://img.shields.io/badge/iTerm2-000000?style=for-the-badge&amp;logo=iterm2&amp;logoColor=white" alt="shields"><img src="https://img.shields.io/badge/Python-3776AB?style=for-the-badge&amp;logo=python&amp;logoColor=white" alt="shields"></p><br><br>

<p align="center"><img src="https://lh3.googleusercontent.com/drive-viewer/AITFw-y1bNoooFgNgN_EUPlQ0Wco6ffMdJZf95GpWbX-Uadc9P7y3kmG-DxXQtDCmDJcvVjCZLTJlzPMKxrV2iS_sfsyKp1wgg=s1600" alt="project-logo"></p><br>
<p align="center"><img src="https://lh3.googleusercontent.com/drive-viewer/AITFw-yL2zYKX1yEZPYLPK5brCOz_jSMLH1ilEPi7jeSAv0XUIbkf4ardW0pflUV7ltxpqppYrmdOt5NWf24PjpgqxkE1zBl=s1600" alt="project-logo"></p><br>

<h2>💡 Extracted Details:</h2><br>

@@ -20,8 +20,16 @@ Uscrapper extracts the following details from the provided website:

<br><h2>📽 Preview:</h2><br>

<p align="center"><img src="https://lh3.googleusercontent.com/drive-viewer/AITFw-z2CjGQ8DNq_sCVDS7NLM4g82_eRUwkOM9hVwR56Gukzll_suuVg08mARAxVbPPI1grXOzbAdkdvUG7xnmCYd2nvD4P=s1600" alt="project-ss"></p><br>
<p align="center"><img src="https://lh3.googleusercontent.com/drive-viewer/AITFw-y-7PS48iC0sU2HPSjlBanpM4RPKJn3GGmmnFYmqZ5PqLyLvO4aefDzqITpO52fPwY5FH8y4stik_yYVW_RzsnlipDUxg=s2560" alt="project-ss"></p><br>
<p align="center"><img src="https://lh3.googleusercontent.com/drive-viewer/AITFw-x6V0zw3mgqnBcvKlWRLYvNQvjusTk-nvLeXCp3GmECsYLeibxnSCFJtqYt50OG1YVwPU22T1Q6FXRGdTBRe2mh4ne8Kw=s1600" alt="project-ss2"></p><br>

<br><h2>🤩 What's New?</h2><br>

Uscrapper 2.0:

* Introduced multiple modules to bypass anti-web-scraping techniques (see the sketch below).
* Introduced Crawl and scrape: an advanced module that crawls a website and scrapes its pages from within.
* Implemented multithreading to make these processes faster.
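
The anti-web-scraping bypass follows the approach used in Uscrapper.py: when a plain request is answered with HTTP 403, the page is re-rendered through headless Firefox via Selenium and the resulting page source is scraped instead. A rough, self-contained sketch of that idea (the function name and timeout are illustrative, and geckodriver/Firefox must be available):

```
import requests
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def fetch_html(url):
    resp = requests.get(url, timeout=10)
    if resp.status_code != 403:
        return resp.text
    # Blocked by the server: render the page in headless Firefox instead.
    opts = Options()
    opts.add_argument('-headless')
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```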

<h2>🛠️ Installation Steps:</h2><br>

@@ -33,23 +41,19 @@ cd Uscrapper/install/
chmod +x ./install.sh && ./install.sh #For Unix/Linux systems
```

<br><p>For Windows systems, run:</p>

```
Uscrapper/install/install.bat
```

<br><h2>🔮 Usage:</h2>

<p>To run Uscrapper, use the following command-line syntax:</p>

```
python Uscrapper.py [-h] [-u URL] [-O] [-ns]
python Uscrapper.py [-h] [-u URL] [-c INT] [-t THREADS] [-O] [-ns]
```
<br><b>Arguments</b> (see the example invocation after this list):

* -h, --help: Show the help message and exit.
* -u URL, --url URL: Specify the URL of the website to extract details from.
* -c INT, --crawl INT: Specify the maximum number of links to crawl and scrape within the same scope.
* -t INT, --threads INT: Specify the number of threads to use while crawling and scraping.
* -O, --generate-report: Generate a report file containing the extracted details.
* -ns, --nonstrict: Display non-strict usernames during extraction.
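
For example, a typical invocation that crawls up to 20 in-scope links with 8 threads, shows non-strict usernames, and writes a report might look like this (the URL is a placeholder):

```
python Uscrapper.py -u https://example.com -c 20 -t 8 -O -ns
```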

264 changes: 206 additions & 58 deletions Uscrapper.py
@@ -1,31 +1,144 @@

import requests
from bs4 import BeautifulSoup
import random
import argparse
import re
from termcolor import colored
from urllib.parse import urlparse
from urllib.parse import urlparse, urljoin
from concurrent.futures import ThreadPoolExecutor
from collections import OrderedDict
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.service import Service as FirefoxService
import signal

print("\n")
print(" █░██▀ █▀▀ █▀█ ▄▀█ █▀█ █▀█ █▀▀ █▀█")
print(" █▄█▄█ █▄▄ █▀▄ █▀█ █▀▀ █▀▀ ██▄ █▀▄ (v1)")
print(colored(" █░█","blue"),colored("█▀ █▀▀ █▀█ ▄▀█ █▀█ █▀█ █▀▀ █▀█ ","white",attrs=['bold']))
print(colored(" █▄█","blue"),colored("▄█ █▄▄ █▀▄ █▀█ █▀▀ █▀▀ ██▄ █▀▄ ","white", attrs=['bold']),colored("(v2.0)","blue",))

print(colored("\n A Webpage scrapper for OSINT.","yellow"))
print(colored("\n A Powerful OSINT Web Scraper","yellow"))
print(colored(" ~By: Pranjal Goel (z0m31en7)\n", "red"))

extracted_usernames0 = []
extracted_phone_numbers0 = []
extracted_emails0 = []
geolocations0 = []
author_names0 = []
social_links0 = []
email_addresses0 = []
counter = 0
driver = 0

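# SIGINT handler: ask for confirmation before exiting on Ctrl-C.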
def handler(signum, frame):
res = input(colored("\n[x] Ctrl-c was pressed. Do you really want to exit? y/n: ","red"))
if res == 'y':
print(colored("[exiting..]","red"))
exit(1)

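# Fetch the page source with headless Firefox (used to bypass 403 responses);
# a single WebDriver instance is created on first use and reused afterwards.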
def selenium_wd(url):

global counter
global driver
options = Options()
options.add_argument('-headless')
if counter == 0:
driver = webdriver.Firefox(options=options)
counter = 1
driver.get(url)
source = driver.page_source
return source

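# Collect same-domain http/https links from a page, re-fetching it through the
# Selenium fallback when the plain request is answered with HTTP 403.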
def get_links_from_page(url):

response = requests.get(url)

if response.status_code == 200:
soup = BeautifulSoup(response.content, "html.parser")
domain = urlparse(url).netloc


links = set()
for anchor in soup.find_all("a", href=True):
link = anchor["href"]
absolute_link = urljoin(url, link)
parsed_link = urlparse(absolute_link)


if parsed_link.netloc == domain and parsed_link.scheme in {"http", "https"}:
links.add(absolute_link)

return links

if response.status_code == 403:
soup = BeautifulSoup(selenium_wd(url),"html.parser")
domain = urlparse(url).netloc


links = set()
for anchor in soup.find_all("a", href=True):
link = anchor["href"]
absolute_link = urljoin(url, link)
parsed_link = urlparse(absolute_link)


if parsed_link.netloc == domain and parsed_link.scheme in {"http", "https"}:
links.add(absolute_link)

return links

else:
print(f"Error: Unable to fetch {url}. Status code: {response.status_code}")
return set()

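# Crawl up to max_pages links within the starting domain; each page is scraped
# with extract_details and its in-scope links are queued, with page work
# submitted to a ThreadPoolExecutor.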
def web_crawler(start_url, max_pages=10, num_threads=4):

if num_threads is None:
num_threads = 4
visited_links = set()
queue = [start_url]

def crawl_page(url):
if url in visited_links:
return set()

print(f"Crawling: {url}")
extract_details(url, args.generate_report, args.nonstrict)
links_on_page = get_links_from_page(url)
visited_links.add(url)
return links_on_page

with ThreadPoolExecutor(max_workers=num_threads) as executor:
while queue and len(visited_links) < max_pages:
current_url = queue.pop(0)

future = executor.submit(crawl_page, current_url)
links_on_page = future.result()

for link in links_on_page:
if link not in visited_links:
queue.append(link)

print("Crawling finished.")


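# Scrape a single page: fetch it with a random User-Agent (Selenium fallback on
# 403), then pull emails, social links, author names, geolocations, phone
# numbers and usernames out of the HTML with BeautifulSoup and regexes,
# appending them to the module-level result lists.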
def extract_details(url, generate_report, non_strict):

user_agents_list = [
'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.3',
'Mozilla/5.0 (iPhone; CPU iPhone OS 16_5_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Mobile/15E148 Safari/604.',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
]

response = requests.get(url, headers={'User-Agent': random.choice(user_agents_list)})
soup = BeautifulSoup(response.text, 'html.parser')

if response.status_code == 403:
soup = BeautifulSoup(selenium_wd(url), 'html.parser')

usernames = []
if non_strict:
usernames = set(username.string for username in soup.find_all('a', href=True, string=re.compile(r'^[^\s]+$')))
@@ -44,91 +157,113 @@ def extract_details(url, generate_report, non_strict):
phone_regex2 = r'(?:\+\d{1,3}[- ]?)?\(?\d{3}\)?[- ]?\d{3}\)?[- ]?\d{4}\b'
phone_regex_combined = '|'.join('(?:{0})'.format(x) for x in (phone_regex, phone_regex2, phone_regex3))
extracted_phone_numbers = set(re.findall(phone_regex_combined, webpage_text))

username_regex = r'@[A-Za-z0-9_]+'
extracted_usernames = set(re.findall(username_regex, webpage_text))

if email_addresses:
print(colored("\n[+] Email Addresses:", "cyan"))
for email in email_addresses:
print(email)
email_addresses0.append(email)

if social_links:
print(colored("\n[+] Social Media Links:", "cyan"))
social_media_platforms = ['instagram', 'facebook', 'whatsapp', 'snapchat', 'github', 'reddit', 'youtube', 'linkedin', 'twitter', 'telegram', 'imo', 'discord']
social_media_platforms = ['instagram', 'facebook', 'whatsapp', 'snapchat', 'github', 'reddit', 'youtube', 'linkedin', 'twitter', 'telegram', 'discord','pinterest']
for link in social_links:
for platform in social_media_platforms:
if platform in link:
print(link)
social_links0.append(link)

if author_names:
print(colored("\n[+] Author Names:", "cyan"))
for author in author_names:
print(author)
author_names0.append(author)

if geolocations:
print(colored("\n[+] Geolocations:", "cyan"))
for location in geolocations:
geolocations0.append(location)

if extracted_emails:
for email in extracted_emails:
if email.lower().startswith("email"):
email = email[5:]
extracted_emails0.append(email)

if extracted_phone_numbers:
for phone in extracted_phone_numbers:
extracted_phone_numbers0.append(phone)

if extracted_usernames and non_strict:
for username in extracted_usernames:
extracted_usernames0.append(username)

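# De-duplicate the results accumulated across pages (order preserved via
# OrderedDict.fromkeys), print them with a summary count, then exit.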
def printlist():

email_addresses1 = []
social_links1 = []
extracted_emails1 = []
author_names1 = []
geolocations1 = []
extracted_phone_numbers1 = []
extracted_usernames1 = []

if email_addresses0:
print(colored("\n[+] Email Addresses:", "cyan"))
email_addresses1 = list(OrderedDict.fromkeys(email_addresses0))
for email in email_addresses1:
print(email)

if social_links0:
print(colored("\n[+] Social Media Links:", "cyan"))
social_links1 = list(OrderedDict.fromkeys(social_links0))
for links in social_links1:
print(links)

if author_names0:
print(colored("\n[+] Author Names:", "cyan"))
author_names1 = list(OrderedDict.fromkeys(author_names0))
for author in author_names1:
print(author)

if geolocations0:
print(colored("\n[+] Geolocations:", "cyan"))
geolocations1 = list(OrderedDict.fromkeys(geolocations0))
for location in geolocations1:
print(location)

if generate_report:
with open('report.txt', 'w') as report_file:
if usernames:
report_file.write("[+] Usernames:\n")
for username in usernames:
report_file.write(username + '\n')

if email_addresses:
report_file.write("\n[+] Email Addresses:\n")
for email in email_addresses:
report_file.write(email + '\n')

if extracted_phone_numbers:
report_file.write("\n[+] Phone Numbers:\n")
for phone in extracted_phone_numbers:
report_file.write(phone + '\n')

if social_links:
report_file.write("\n[+] Social Media Links:\n")
for link in social_links:
report_file.write(link + '\n')

if author_names:
report_file.write("\n[+] Author Names:\n")
for author in author_names:
report_file.write(author + '\n')

if geolocations:
report_file.write("\n[+] Geolocations:\n")
for location in geolocations:
report_file.write(location + '\n')

if extracted_emails or extracted_phone_numbers or extracted_usernames:
if extracted_emails0 or extracted_phone_numbers0 or extracted_usernames0:
print(colored("\n----------Non-Hyperlinked Details----------", "yellow"))

if extracted_emails:
if extracted_emails0:
print(colored("\n[+] Email Addresses:", "cyan"))
for email in extracted_emails:
if email.lower().startswith("email"):
email = email[5:]
extracted_emails1 = list(OrderedDict.fromkeys(extracted_emails0))
for email in extracted_emails1:
print(email)

if extracted_phone_numbers:
if extracted_phone_numbers0:
print(colored("\n[+] Phone Numbers:", "cyan"))
for phone in extracted_phone_numbers:
extracted_phone_numbers1 = list(OrderedDict.fromkeys(extracted_phone_numbers0))
for phone in extracted_phone_numbers1:
print(phone)

if extracted_usernames and non_strict:
if extracted_usernames0 and non_strict:
print(colored("\n[+] Usernames:", "cyan"))
for username in extracted_usernames:
extracted_usernames1 = list(OrderedDict.fromkeys(extracted_usernames0))
for username in extracted_usernames1:
print(username)

concl = "Email Addresses:"+str(len(email_addresses1)+len(extracted_emails1)),"Social Links:"+str(len(social_links1)),"Phone Numbers:"+str(len(extracted_phone_numbers1)), "Geolocations:"+str(len(geolocations1))
print("\n")
print(colored(concl, "green", attrs=['blink']))
print("\n")
exit(1)

if __name__ == '__main__':
parser = argparse.ArgumentParser(description='OSINT Tool for Webpage scraping')
parser.add_argument('-u', '--url', help='URL of the website')
parser.add_argument('-O', '--generate-report', action='store_true', help='Generate a report')
parser.add_argument('-ns', '--nonstrict', action='store_true', help='Display non-strict usernames (may show inaccurate results)')
parser.add_argument('-c', '--crawl', type=int, help='Specify the maximum number of links to crawl and scrape within the same scope')
parser.add_argument('-t', '--threads', type=int, help= 'Number of threads to utilize while crawling (default=4)')
args = parser.parse_args()
signal.signal(signal.SIGINT, handler)
counter = 0

if args.url:
url = args.url
@@ -137,8 +272,21 @@ def extract_details(url, generate_report, non_strict):
url = 'https://' + url
try:
response = requests.get(url)
if response.status_code == 200:
if args.crawl:
web_crawler(url, args.crawl, args.threads)
printlist()
extract_details(url, args.generate_report, args.nonstrict)
printlist()

if response.status_code == 403:
print(colored("\n[!] Status code 403 (Forbidden): the website might be using anti-web-scraping methods.", "red"))
print(colored("[+] Trying to bypass...","green"))
if args.crawl:
web_crawler(url, args.crawl, args.threads)
extract_details(url, args.generate_report, args.nonstrict)
printlist()
driver.quit()
else:
print(f"URL is down: Status code {response.status_code}")
except requests.exceptions.RequestException as e:
Binary file removed images/logo.png
Binary file not shown.
Binary file removed images/uscrapper.png
Binary file not shown.
