Introduction
Following two months of initial learning and experimentation with Selenium WebDriver and Python, I decided to pursue a project that automates capturing data and placing it into a spreadsheet. The output could serve a range of purposes: data science, taking a snapshot of records at a given date, or presenting search results in a simpler form. The specific goal was to assist with job searching, a task that can be confusing for some people.
Before tackling the main objective of creating a web crawler for searching and documenting job search results, I set out to learn the basics of web scraping using Python and Selenium. That meant starting with alternative websites to build the fundamentals, then branching out into different kinds of websites as I gained more confidence.
Learning the Basics
Football Results Data Scraper
I began by following this tutorial by YouTuber ThePyCoach, which went through the steps of creating a web scraper for this website containing football results and saving the output to a CSV file.
While it's not a subject I personally have much interest in, the objective was more about getting to grips with the following (a rough code sketch follows the list):
- Opening a website and reading a data table using Selenium WebDriver.
- Saving the data into a dictionary using Python.
- Using Pandas to create a dataframe that transfers the collected data into a spreadsheet.
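As a rough illustration of those three steps, here is a minimal sketch. The URL, table layout and column names are assumptions for illustration, not the tutorial's actual code:

```python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open the target page (placeholder URL, not the tutorial's real site)
driver = webdriver.Chrome()
driver.get('https://example.com/football-results')

# Read every table row and collect the columns into a dictionary of lists
results = {'home': [], 'score': [], 'away': []}
for row in driver.find_elements(By.TAG_NAME, 'tr'):
    cells = row.find_elements(By.TAG_NAME, 'td')
    if len(cells) >= 3:  # skip any rows that are not match results
        results['home'].append(cells[0].text)
        results['score'].append(cells[1].text)
        results['away'].append(cells[2].text)

driver.quit()

# Turn the dictionary into a Pandas dataframe and save it as a CSV file
pd.DataFrame(results).to_csv('football_results.csv', index=False)
```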
Further lessons in the tutorial also covered interacting with drop-down web elements using Selenium's Select class, and the screenshot below shows what the code became once the tutorial was complete.
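The Select class itself is a thin wrapper around a drop-down element; a minimal example, assuming a made-up element ID:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('https://example.com/football-results')  # placeholder URL

# Wrap the drop-down and pick an option by its visible text;
# 'country' is an illustrative ID, not the site's real element
dropdown = Select(driver.find_element(By.ID, 'country'))
dropdown.select_by_visible_text('England')
```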
Further Learning without Tutorials
MyAnimeList Web Scraper
To build on what I had just learned and develop my newly acquired web scraping skills further, I decided to tackle a web scraper that takes the rankings, titles and scores of Japanese anime from MyAnimeList. While this didn't require drop-downs or other user-defined queries, the table was coded with a somewhat different structure to the football results website.
Both websites used a 'tr' tag for table rows, but a complication with the MyAnimeList structure was that the header row also used the 'tr' tag, whereas the football results website had only pure data in its tables.
Adam Choi's football results website table structure
MyAnimeList table structure
This presented me with a problem: how could I exclude the header row from MyAnimeList? What I found was that each data row had a class name, so instead of selecting by tag, I selected by the class name 'ranking-list'. As for the columns, the page was structured with 'td' tags, which could be used as XPath elements to extract the text in the columns of interest, particularly rank and score.
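In code, the row selection looked something along these lines. Only the 'ranking-list' class and the use of 'td' cells come from the page itself; the column positions are assumptions for illustration:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://myanimelist.net/topanime.php')

# Selecting by class name skips the header row, which is a plain <tr>
# without the 'ranking-list' class
for row in driver.find_elements(By.CLASS_NAME, 'ranking-list'):
    rank = row.find_element(By.XPATH, './td[1]').text   # column order assumed
    score = row.find_element(By.XPATH, './td[3]').text
```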
In the case of the image below, I had to dig deeper into the markup to get hold of the title text, because the title cell also contained additional information such as episode counts and airing dates. This meant defining the XPath much further than just the 'td[2]' tag. It was ultimately successful, but it required a significant amount of manual reading of the HTML just to find the information I wanted to extract, and it could have benefitted from alternative methods, such as extracting link text instead of drilling deep into the code.
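Continuing inside the row loop from the sketch above, the two approaches to the title cell look roughly like this; the nested path is an assumed stand-in rather than MyAnimeList's exact markup:

```python
# Option 1: drill down through the nested elements of the title cell
# (the div/h3/a path here is illustrative, not the page's real structure)
title = row.find_element(By.XPATH, './td[2]/div/h3/a').text

# Option 2: take the first link in the cell and read its text directly
title = row.find_element(By.XPATH, './td[2]//a').text
```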
In the end, I was able to repurpose the script originally written following the tutorial into a new script for scraping data from MyAnimeList.
Indeed Job Search
Following the successful creation of the previous web scrapers, I felt ready to create the job search web scraper. It was intended to search for a job title and location on Indeed, scan the search results, collect each job title, location and job advert link, and save them into a spreadsheet.
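One straightforward way to drive the search is to encode the job title and location into the results URL rather than typing into the search boxes. This is a sketch of that approach; the 'q' and 'l' parameter names are an assumption about Indeed's URL scheme at the time:

```python
from urllib.parse import urlencode
from selenium import webdriver

driver = webdriver.Chrome()

# Build the results URL from a job title and a location;
# the q/l parameters are assumed, not confirmed from the site
params = urlencode({'q': 'test analyst', 'l': 'London'})
driver.get(f'https://www.indeed.com/jobs?{params}')
```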
This came with many more challenges than the previous web scrapers. The data was not stored in a table format, so I needed another way to get unique data for each result. After a significant amount of experimentation in which results did not come back unique, I arrived at the class name 'jcs-JobTitle', which allowed me to return unique results.
As the class name sat on a link element, I took advantage of this by extracting the text and the link attribute of the same object in one pass.
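Because 'jcs-JobTitle' sits on the anchor element itself, one lookup yields both values; a sketch, continuing from the driver above:

```python
from selenium.webdriver.common.by import By

jobs = {'title': [], 'link': []}

# Each 'jcs-JobTitle' element is the advert's <a> tag, so the same object
# provides the visible title text and the href attribute in one pass
for anchor in driver.find_elements(By.CLASS_NAME, 'jcs-JobTitle'):
    jobs['title'].append(anchor.text)
    jobs['link'].append(anchor.get_attribute('href'))
```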
Using the same dataframe technique to save the extracted data into a spreadsheet, I eventually arrived at the earliest prototype, which started out giving just job titles and links.
With all of that done, there was still much more to work on: making the project less reliant on hard-coded values, allowing user input, and optimising it further. That will be covered in part 2.