Michael Tucker Portfolio

Scraping Info from Web Pages Using Python

Demonstrating how to extract phone numbers and email addresses from web pages using Selenium and Python for efficient data scraping.

In cybersecurity, automation plays a crucial role in gathering intelligence and identifying potential threats efficiently. One part of this process is extracting information, such as phone numbers and email addresses, from web pages. Manual data collection is slow and prone to error. With Python and the Selenium library, cybersecurity analysts can automate much of this task. In this blog post, we will explore how using Selenium and Python for web scraping can improve the efficiency and accuracy of gathering information. Get ready to unlock the potential of cybersecurity automation and elevate your threat intelligence gathering to the next level!

Start by importing the necessary libraries: webdriver and By from selenium for driving the browser and locating elements, re for regular expression matching, and urlparse from urllib.parse for parsing URLs.

Next, specify the target URL and initialize a WebDriver for Firefox.

Open the specified URL using driver.get(url) and get the domain from the URL with domain = urlparse(url).netloc. We extract the domain in order to limit the search to the target URL and other pages on the same domain.
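The domain step can be tried on its own with the standard library. This is a minimal sketch; the URL is a placeholder, not one from the original program.

```python
from urllib.parse import urlparse

# Placeholder target; substitute a site you are authorized to scan.
url = "https://www.example.com/about/team?ref=nav"

# netloc is the host portion of the URL (plus the port, if one is given).
domain = urlparse(url).netloc
print(domain)  # www.example.com
```

Note that netloc keeps the subdomain, so www.example.com and example.com are different values when compared exactly.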

Initialize two empty lists, phone_numbers and email_addresses, to store the phone numbers and email addresses found during the scraping process.

Retrieve the page source using driver.page_source and search for phone numbers and email addresses using regular expressions (re.findall). The regular expressions used are:

- Phone number: r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}'

- Email address: r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

The found phone numbers and email addresses are appended to the respective lists.
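The matching step can be sketched without a browser by running the two patterns against a sample string. The HTML snippet and contact details below are made up and stand in for driver.page_source.

```python
import re

# Phone pattern: optional parentheses around the area code, then 3 and 4 digits.
PHONE_RE = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}')
# Email pattern: the final class is written [A-Za-z], because a | inside
# square brackets would match a literal pipe character.
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Stand-in for driver.page_source (illustrative values only).
page_source = """
<p>Call (555) 123-4567 or 555.987.6543.</p>
<a href="mailto:info@example.com">info@example.com</a>
"""

phone_numbers = []
email_addresses = []

# findall returns every non-overlapping match as a string.
phone_numbers.extend(PHONE_RE.findall(page_source))
email_addresses.extend(EMAIL_RE.findall(page_source))

print(phone_numbers)    # ['(555) 123-4567', '555.987.6543']
print(email_addresses)  # ['info@example.com', 'info@example.com']
```

The email address appears twice because it occurs both in the href attribute and in the link text, which is why the duplicates are removed later.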

Find all the links on the main page using driver.find_elements(By.TAG_NAME, "a"). The links are stored in the links list. We then extract the URLs from the links using a list comprehension and store them in the urls list.
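A small sketch of the URL extraction is shown here. The helper name extract_hrefs and the StubLink class are hypothetical; StubLink only imitates the get_attribute method of a Selenium element so the example runs without a browser.

```python
def extract_hrefs(link_elements):
    """Return the href of each element, skipping anchors that have none."""
    hrefs = (element.get_attribute("href") for element in link_elements)
    return [href for href in hrefs if href]


class StubLink:
    """Hypothetical stand-in for a Selenium element, for demonstration only."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


# With a live session the input would come from:
#   links = driver.find_elements(By.TAG_NAME, "a")
urls = extract_hrefs([StubLink("https://www.example.com/contact"), StubLink(None)])
print(urls)  # ['https://www.example.com/contact']
```

Filtering out empty values matters because anchors without an href attribute return None, which would otherwise fail when the URL is parsed.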

Create a simple loop to go over each URL in the urls list and check if the domain of the URL matches the original domain. If it does, the scraper visits the URL using driver.get(url), retrieves the page source, and searches for phone numbers and email addresses using the same regular expressions as before. The found phone numbers and email addresses are appended to the respective lists.
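The domain check in that loop could look roughly like this. The function name same_domain_urls and the sample links are assumptions made for illustration.

```python
from urllib.parse import urlparse


def same_domain_urls(urls, domain):
    """Keep only URLs whose host matches the starting domain exactly."""
    return [link for link in urls if urlparse(link).netloc == domain]


domain = "www.example.com"
urls = [
    "https://www.example.com/contact",
    "https://twitter.com/example",
    "mailto:info@example.com",  # no host, so netloc is an empty string
    "https://www.example.com/about#team",
]

in_scope = same_domain_urls(urls, domain)
print(in_scope)
# ['https://www.example.com/contact', 'https://www.example.com/about#team']
```

The loop variable is named link rather than url so that the starting URL is not overwritten while iterating.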

Remove duplicates from the phone_numbers and email_addresses lists by converting them to sets using set() and then converting them back to lists using list().
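As a quick illustration with made-up values, the set() round trip removes duplicates but does not guarantee the original order. The dict.fromkeys variant is an alternative not mentioned in the original program; it keeps the first-seen order.

```python
phone_numbers = ["555-123-4567", "555-987-6543", "555-123-4567"]
email_addresses = ["info@example.com", "info@example.com"]

# Approach described above: convert to a set and back to a list.
unique_phones = list(set(phone_numbers))
unique_emails = list(set(email_addresses))

# Alternative (an assumption, not from the original): preserves first-seen order.
ordered_phones = list(dict.fromkeys(phone_numbers))

print(ordered_phones)  # ['555-123-4567', '555-987-6543']
```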

Print the collected phone numbers and email addresses using the print() function.

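Finally, shut down the WebDriver to release system resources using driver.quit(). Note that driver.close() only closes the current browser window, while driver.quit() ends the whole session.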
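Putting the steps together, one possible arrangement is sketched below. The function names scan_page and scrape_site are hypothetical, the results are returned sorted as a convenience, and the Selenium imports are placed inside scrape_site so the helper can be exercised without a browser installed. It uses driver.quit() in a finally block so the browser session ends even if a page fails to load.

```python
import re
from urllib.parse import urlparse

PHONE_RE = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}')
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')


def scan_page(driver, page_url, phone_numbers, email_addresses):
    """Load one page and append any matches to the two result lists."""
    driver.get(page_url)
    source = driver.page_source
    phone_numbers.extend(PHONE_RE.findall(source))
    email_addresses.extend(EMAIL_RE.findall(source))


def scrape_site(url):
    """Scan the start page and every same-domain page it links to."""
    # Imported here so scan_page can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    domain = urlparse(url).netloc
    phone_numbers, email_addresses = [], []
    driver = webdriver.Firefox()
    try:
        scan_page(driver, url, phone_numbers, email_addresses)
        # Collect the hrefs before navigating away from the start page.
        links = driver.find_elements(By.TAG_NAME, "a")
        urls = [link.get_attribute("href") for link in links]
        for link_url in urls:
            if link_url and urlparse(link_url).netloc == domain:
                scan_page(driver, link_url, phone_numbers, email_addresses)
    finally:
        # quit() ends the session and closes every window it opened.
        driver.quit()
    # Deduplicate; sorting is an added convenience for stable output.
    return sorted(set(phone_numbers)), sorted(set(email_addresses))
```

A call such as phones, emails = scrape_site("https://www.example.com") would need Firefox and a working Selenium installation, and should only be pointed at sites you have permission to scan.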

Conclusion

While the program above demonstrates how to print out phone numbers and email addresses, the basic process remains the same for any data one would wish to collect. In addition, the data can also be stored to files to be analyzed by other programs.
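For example, the results could be written to a CSV file for later analysis. This is only a sketch: the function name save_results, the file name contacts.csv, the two-column layout, and the sample values are all assumptions.

```python
import csv
import os
import tempfile


def save_results(path, phone_numbers, email_addresses):
    """Write one row per finding, tagged with its type."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["type", "value"])
        writer.writerows(("phone", number) for number in phone_numbers)
        writer.writerows(("email", address) for address in email_addresses)


# Example with made-up values, written to a temporary directory.
output_path = os.path.join(tempfile.mkdtemp(), "contacts.csv")
save_results(output_path, ["555-123-4567"], ["info@example.com"])
```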

Web scraping using Python and Selenium offers cybersecurity analysts an efficient and automated approach to gather information such as phone numbers and email addresses from web pages. By automating, analysts can save time and minimize errors that are often associated with manual data collection. This method allows for streamlined threat intelligence gathering and enhances the overall efficiency and accuracy of the process.

However, it is important to note that web scraping should be performed within legal boundaries and in compliance with the terms of service of the targeted websites. Scraping websites without permission or violating their terms of service can result in legal consequences. It is crucial to respect the policies and guidelines set by the website owners to maintain ethical practices while extracting data from web pages.

By using Python and Selenium for web scraping, cybersecurity professionals can realize the potential of automation, grow their threat intelligence gathering capabilities, and improve their overall cybersecurity practices. However, it is necessary to exercise caution, stay within legal boundaries, and comply with terms of service to ensure responsible and ethical data scraping practices.
