Creating the Python Web Scrapers Part 2

In the previous post on this topic, I went over how I got the web scrapers for MyAnimeList, Adam Choi's football results website and job searching with Indeed up and running, albeit in a hard-coded fashion, just to make sure that the initial functionality was there and working as intended.

Converting from hard-coded to more dynamic

After reviewing the initial code with the wider testing community, I was given the following suggestions for my code:
- Extract everything that's hard-coded into a variables section at the top, so anything that can change sits in one block of code rather than having to be hunted down throughout the script.
- Split the code into parameterised functions, so there's a common, logical structure that makes sense to anyone looking at it.

Beyond that, I decided to change two of the scripts, the football results scraper and the job search scraper, to take in user input before running, so the end user can get the results they want without needing to change the code.

Football Data Scraper Updates

For this data scraper, I needed to do the following actions:
- Allow user input for Country and League.
- Update selection criteria to work with both the Country and League as defined by the user.
- Safely close the program if the Country and League choices defined by the user are invalid.
- End the use of hard-coded values and implement functions instead.

To start with, I created variables that were set from the country and league names entered by the user, including text in the prompts to warn the user that the choices are case-sensitive, to minimise the risk of triggering errors later on.

This input would inform what selections the script would make in the drop-downs on the website. For instance, a user could enter 'England' and 'Championship' as their country/league combination and the script would attempt to select those options.
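As a minimal illustration, collecting those two values can be as simple as a pair of input() calls; the prompt wording here is illustrative rather than the script's exact text:

```python
# Minimal sketch of collecting the country/league choices up front.
# The prompt text is illustrative, not the original script's wording.
print("Note: Country and League names are case-sensitive, e.g. 'England' and 'Championship'.")
country_choice = input("Enter the country: ").strip()
league_choice = input("Enter the league: ").strip()
```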

However, what if someone entered a country that's not listed on the website? What if they had a valid country but not a valid league? This would have caused the program to crash before any data could be retrieved, so I needed to include a failsafe that informs the user their input was invalid and then ends the program safely.

This was where I made use of Python's 'try' and 'except' blocks, along with functions that handle both the valid and invalid cases. The functions below all relate to setting up the page prior to scraping the data: the user input drives the country and league selections, and if one or both selections are invalid, the program closes without triggering a crash.
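The sketch below shows the general shape of that setup logic; the site URL and the 'country'/'league' element IDs are placeholders I'm using for illustration rather than the real locators.

```python
import sys

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException

SITE_URL = "https://example.com/football-results"  # placeholder, not the real URL


def open_site(driver):
    """Load the results page before any selections are made."""
    driver.get(SITE_URL)


def select_country(driver, country):
    """Pick the user's country from the drop-down, or exit cleanly if it isn't listed."""
    try:
        dropdown = Select(driver.find_element(By.ID, "country"))  # placeholder locator
        dropdown.select_by_visible_text(country)
    except NoSuchElementException:
        print(f"'{country}' is not a valid country on this site. Closing the program.")
        driver.quit()
        sys.exit(1)


def select_league(driver, league):
    """Pick the league, again exiting cleanly rather than crashing on bad input."""
    try:
        dropdown = Select(driver.find_element(By.ID, "league"))  # placeholder locator
        dropdown.select_by_visible_text(league)
    except NoSuchElementException:
        print(f"'{league}' is not a valid league for that country. Closing the program.")
        driver.quit()
        sys.exit(1)
```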

With the logic for page setup sorted out, the program needed to start doing the work of collecting the data presented, putting it into a dataframe and then saving it as a spreadsheet. Firstly, variables containing the XPATH references were created, along with a variable holding the name of the file to be created.

I then created functions that would take in the data, create the dataframe and then save it as a spreadsheet, using the originally written code but making use of the newly created variables instead of hard-coded values.
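Taken together, those two steps might look something like the sketch below, where the XPATH values and file name are placeholders rather than the ones used against the real site:

```python
import pandas as pd
from selenium.webdriver.common.by import By

# Locator and output-file variables pulled out of the scraping logic.
# The XPATH values here are placeholders standing in for the real results-table locators.
HOME_TEAM_XPATH = "//td[@class='home-team']"   # placeholder
AWAY_TEAM_XPATH = "//td[@class='away-team']"   # placeholder
SCORE_XPATH = "//td[@class='score']"           # placeholder
OUTPUT_FILE = "football_results.xlsx"


def scrape_results(driver):
    """Collect the text of each matching element into parallel lists."""
    home_teams = [el.text for el in driver.find_elements(By.XPATH, HOME_TEAM_XPATH)]
    away_teams = [el.text for el in driver.find_elements(By.XPATH, AWAY_TEAM_XPATH)]
    scores = [el.text for el in driver.find_elements(By.XPATH, SCORE_XPATH)]
    return home_teams, away_teams, scores


def save_results(home_teams, away_teams, scores):
    """Build a dataframe from the scraped lists and write it out as a spreadsheet."""
    df = pd.DataFrame({"Home": home_teams, "Away": away_teams, "Score": scores})
    df.to_excel(OUTPUT_FILE, index=False)
```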

MyAnimeList Scraper Updates

Updating the MyAnimeList scraper was a bit more straightforward, since it only deals with a single table and there's no need to worry about user input, searches and so on.

For the most part, updating the MyAnimeList scraper was purely about using variables instead of hard-coded values and putting the code into functions. Firstly, I took the XPATH values used previously and turned them into variables, made sure the lists were ready to go, and created a variable for the name of the file the extracted data would be saved to.

Following that, I moved the data scraping functionality and the creation of the pandas dataframe into functions and executed them as function calls at the end of the script.
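A minimal sketch of that structure, with placeholder XPATHs in place of the real ones that target the MyAnimeList ranking table:

```python
import pandas as pd
from selenium.webdriver.common.by import By

# Locators and output file as module-level variables rather than inline literals.
# The XPATHs are illustrative placeholders, not the real ranking-table locators.
TITLE_XPATH = "//td[@class='title']//h3/a"   # placeholder
SCORE_XPATH = "//td[@class='score']//span"   # placeholder
OUTPUT_FILE = "myanimelist_top.xlsx"


def scrape_table(driver):
    """Pull the title and score columns into lists ready for the dataframe."""
    titles = [el.text for el in driver.find_elements(By.XPATH, TITLE_XPATH)]
    scores = [el.text for el in driver.find_elements(By.XPATH, SCORE_XPATH)]
    return titles, scores


def save_table(titles, scores):
    """Turn the scraped lists into a dataframe and save it as a spreadsheet."""
    df = pd.DataFrame({"Title": titles, "Score": scores})
    df.to_excel(OUTPUT_FILE, index=False)


# Function calls at the end of the script, as described above:
# titles, scores = scrape_table(driver)
# save_table(titles, scores)
```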

Updating the Indeed Job Search scraper

For this data scraper, I needed to carry out the following actions:
- Use variables and functions instead of hard-coded values and behaviour.
- Take in user input for the Job Title and the Location of the job search.
- Display the job location and the company advertising each job in the spreadsheet, in addition to the job title and the URLs of the job adverts.

Much like the football results scraper, I made use of variables that took user input to set the job title and location values that would be entered into the search criteria on Indeed. However, as this uses a search box rather than drop-downs, I didn't need to worry too much about case sensitivity.
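As a rough illustration of the search step, assuming 'text-input-what' and 'text-input-where' as the IDs of Indeed's search boxes (an assumption for this sketch, not something confirmed here):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# User input sets the search criteria; prompt wording is illustrative.
job_title = input("Enter the job title to search for: ").strip()
job_location = input("Enter the location to search in: ").strip()


def run_search(driver, title, location):
    """Type the user's criteria into the two search boxes and submit the search."""
    what_box = driver.find_element(By.ID, "text-input-what")     # assumed locator
    what_box.send_keys(title)
    where_box = driver.find_element(By.ID, "text-input-where")   # assumed locator
    where_box.send_keys(location)
    where_box.send_keys(Keys.RETURN)
```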

In preparation for running the data scraping functionality, I made sure that all previously used references to IDs, XPATHs, Class names and CSS Selectors were variables instead of hard-coded values. Some of the variables worked at global scope, while others only worked when defined at function level.

Following this, I implemented the functions for setting up the page (carrying out the search), performing the scraping, and using pandas to create a dataframe and save the extracted data to a spreadsheet, again avoiding hard-coded values.
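Sketched roughly below, with placeholder selectors standing in for the real Class name and CSS Selector values, kept at module level so they can be shared across the functions:

```python
import pandas as pd
from selenium.webdriver.common.by import By

# Placeholder selectors standing in for the real locators used against Indeed's result cards.
TITLE_SELECTOR = "h2.jobTitle a"           # placeholder CSS selector
COMPANY_SELECTOR = "span.companyName"      # placeholder CSS selector
LOCATION_SELECTOR = "div.companyLocation"  # placeholder CSS selector
OUTPUT_FILE = "indeed_jobs.xlsx"


def scrape_jobs(driver):
    """Collect title, company, location and advert URL for each job card."""
    titles, urls = [], []
    for link in driver.find_elements(By.CSS_SELECTOR, TITLE_SELECTOR):
        titles.append(link.text)
        urls.append(link.get_attribute("href"))
    companies = [el.text for el in driver.find_elements(By.CSS_SELECTOR, COMPANY_SELECTOR)]
    locations = [el.text for el in driver.find_elements(By.CSS_SELECTOR, LOCATION_SELECTOR)]
    return titles, companies, locations, urls


def save_jobs(titles, companies, locations, urls):
    """Assemble the scraped fields into a dataframe and write the spreadsheet."""
    df = pd.DataFrame({
        "Job Title": titles,
        "Company": companies,
        "Location": locations,
        "URL": urls,
    })
    df.to_excel(OUTPUT_FILE, index=False)
```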

Further refining of code with AI

Following the successful creation and implementation of the web scrapers, I made use of ChatGPT to further refine the code. The result was code that worked exactly as it was designed to but now adhered to PEP8 standards, something I wasn't overly aware of given my background in languages like C++ and C# from learning games programming at university. The refinement also tweaked function and variable names, removed unnecessary global variables, improved readability and maintainability, and made the code more organised.

The code improved by ChatGPT, with added edits to make use of headless Firefox, can be viewed here.

Why make use of AI to improve my code? To put it into perspective, I built the web scrapers starting from a single tutorial, plus the Selenium knowledge I had picked up from learning the WebDriver earlier in 2024, and with very little experience of Python as a programming language. As I was self-learning with no prior experience, I needed to make use of whatever was at my disposal to aid the development, improvement and iteration of the web scrapers. It helped a lot that ChatGPT explained the changes it made; I also made sure to read over the code changes and add my own comments, to confirm that I fully understood what the AI had done to my code and that it still functioned as I had originally designed it to.

This ultimately goes to show that AI can be a great assistant for improving and iterating on code, but here it was used to augment the project rather than being asked to write the scripts for me from nothing. At the very least, the XPATH references and other values I had already turned into variables helped the AI to ensure that the code would still work as I had originally written it.