Automating tasks using a web crawler

html, python, automate, web crawling

I’ve recently developed a deep fascination with web crawlers and scrapers. Although I’ve known about them for a while, I never had any real need or desire to build one. That changed with the gruelling job application process I’ve been going through.

As I made the decision to sign up to multiple online job search platforms, I’ve become familiar with some of the shortcomings of each site. However, instead of sending the devs an email and hoping for the best, I decided that I would take matters into my own hands wherever I can and use the tools I’m equipped with in order to solve these issues.

One of the sites that I registered with is WOBB, a Malaysian platform that connects professionals and employers. This is definitely one of my favourite platforms to date. The signup process is simple and streamlined, and once you’ve uploaded your resume and entered your details, applying for a job takes a single button click.

However, I have one small pet peeve about this platform: I don’t get notified when my application status changes. This means I have to keep reloading the job applications page to get an update. This isn’t optimal, as I also have to send out applications on other sites and build a decent portfolio. The solution? Build a web crawler that generates the notifications for me! It was a fairly simple solution that saved me tonnes of time.

Because I have a penchant for sub-par puns, I decided to name it “wobbot”. Anyway, I digress. The rest of this article will go into the technical details of the bot.

The code in this article is written within a “main” function and is called as shown by the code below:

if __name__ == "__main__":
    main()

First, let’s go over the packages that I used in this project. The first is selenium, a browser automation tool that sits at the heart of this project; it is what we use to navigate the site and extract information from it. Next is tinydb, a lightweight, serverless document database library that stores its data in JSON files. yagmail is a Gmail/SMTP client that I used to send email notifications to myself. If you’re going to develop similar functionality, I’d recommend creating a dedicated development email account so that you don’t have to grant the library access to your private email. Finally, I included the standard-library time and json modules to handle sleeps and JSON parsing respectively.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import time
import json
from tinydb import TinyDB, Query
import yagmail

The next step is to initialise everything. I prefer to declare and initialise everything at the top of the file or function so that all my variable initialisations are in one place that’s easy to find. TinyDB takes the name of the file that will store the data we scrape, and “Job” is the query object used to search it. “user_credentials” is a dictionary that holds the user-specific values used throughout the script; these are read from a JSON file (“credentials.json”) in the same directory as the script. I then initialise the driver object, passing the “--headless” option so that the script runs without opening a browser window.

    db = TinyDB('jobs.json')
    Job = Query()
    user_credentials = {}
    url_home = "https://my.wobbjobs.com"
    url_histories = url_home + "/v2/users/job_histories"
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get(url_home)

WOBB sessions seem to time out very quickly. I’m not sure whether this is a universal issue or something specific to my account, but I noticed that I have to log in almost every time I visit the website. Since the initial tab title is the same whether I’m logged in or not, I check for the login button to determine whether I’m logged in. If I’m not, the credentials file is read and parsed into the “user_credentials” dictionary.

# login_button is assumed to hold the login element located earlier (lookup not shown here)
if login_button:
    with open("credentials.json", "r") as f:
        user_credentials = json.load(f)
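
The credentials step could be wrapped in a small helper that fails early when the file is incomplete. This is only a sketch: the helper name load_credentials and the key names "email" and "password" are hypothetical, since the article does not show which fields the real credentials.json contains.

```python
import json

# Hypothetical field names; adjust to whatever the real credentials.json holds.
REQUIRED_KEYS = ("email", "password")


def load_credentials(path="credentials.json"):
    """Read the JSON file and fail early if an expected key is missing."""
    with open(path, "r", encoding="utf-8") as f:
        credentials = json.load(f)  # parses directly from the file object
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError("credentials file is missing: " + ", ".join(missing))
    return credentials
```

Failing at load time gives a clearer error than a KeyError halfway through the login form.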

Once we’re logged in, we navigate to the job cards section and open the “I’ve applied” tab. This is where all the applications that have been sent out live. Here I stumbled upon a minor challenge: WOBB only displays 5 jobs at a time, and at the time of writing I had 27 jobs in this section. Selenium, however, does not allow interaction with hidden elements, so the only way I could work with all the cards was to make all of them visible first.

I did this by setting job_card_len to 0 and grabbing the first 5 job cards into a job_cards list. I then click the “view more” button for as long as the number of visible job cards is greater than job_card_len, fetching all the cards again on each iteration and updating job_card_len. I added a sleep between clicking the button and fetching the cards to allow for loading time. 5 seconds might be overkill, but it’s better to be safe than to risk a crash. Catching NoSuchElementException is necessary because the “view more” button disappears once the end of the list is reached. The while loop ends once job_card_len and the number of retrieved cards are equal (i.e. all the cards have been retrieved).

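# Note: Selenium 4.3+ removed find_element(s)_by_*; use find_elements(By.CLASS_NAME, ...) there.
job_card_len = 0
job_cards = driver.find_elements_by_class_name("job-card")
try:
    # Keep clicking "view more" while each click reveals additional cards.
    while len(job_cards) > job_card_len:
        job_card_len = len(job_cards)
        view_more = driver.find_element_by_class_name("button-settings")
        view_more.click()
        time.sleep(5)  # give the next batch of cards time to load
        job_cards = driver.find_elements_by_class_name("job-card")
except NoSuchElementException:
    # The button disappears once every card is visible.
    pass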
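
If the fixed 5 second sleep feels wasteful, one alternative is to poll until the card count actually grows. The sketch below illustrates the idea with the standard library only; wait_until, expand_all, count_cards and click_view_more are hypothetical names, and the two callables are assumed to wrap the Selenium lookups shown above. Selenium also ships its own WebDriverWait class for this kind of explicit wait.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.25):
    """Poll condition() until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return condition()  # one last check at the deadline


def expand_all(count_cards, click_view_more, timeout=5.0, poll=0.25):
    """Click "view more" until the number of cards stops growing.

    count_cards() returns the number of visible cards; click_view_more()
    clicks the button and returns False once the button no longer exists.
    Both are hypothetical callables supplied by the caller.
    """
    seen = count_cards()
    while click_view_more():
        grew = wait_until(lambda: count_cards() > seen, timeout, poll)
        if not grew:
            break  # nothing new loaded within the timeout
        seen = count_cards()
    return seen
```

With Selenium, count_cards could be something like `lambda: len(driver.find_elements_by_class_name("job-card"))`, while click_view_more would catch NoSuchElementException and return False.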

Now comes one of the more interesting parts: for each card I retrieve, I extract the job title, the posting company, the application status and the date applied. There was no apparent way to get a unique job id, so database queries had to use the “and” operator on all the details other than status to identify each job. I did this on the assumption that no two postings will share the same title, company and date. A bit of a risk, but one I was willing to take.

The first task is to check whether the current job exists in the jobs JSON file. If it does not, the job is inserted into the file and added to inserted_jobs. If it does exist, we retrieve it and compare the scraped status to the saved status; if they are not equal, the saved status is updated and the job is added to updated_jobs.

inserted_jobs, updated_jobs = [], []  # initialised here so the snippet stands alone
for card in job_cards:
    title = card.find_element_by_class_name("mdc-typography--subtitle1").text
    company = card.find_element_by_class_name("mdc-typography--subtitle2").text
    status = card.find_element_by_class_name("mdc-chip__text").text
    date = card.find_element_by_class_name("ja-created-at").text
    # No unique job id is exposed, so title + company + date acts as the key.
    same_job = (Job.title == title) & (Job.company == company) & (Job.date == date)
    job = db.get(same_job)
    if job is None:
        record = {'title': title, 'company': company, 'status': status, 'date': date}
        db.insert(record)
        inserted_jobs.append(record)
    elif job['status'] != status:
        db.update({'status': status}, same_job)
        job['status'] = status  # report the new status, not the stale one
        updated_jobs.append(job)
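
The insert-or-update logic can also be looked at in isolation from TinyDB and Selenium. The following sketch uses a plain dictionary as the store; job_key and sync_jobs are hypothetical helper names, and the composite key reflects the same assumption as above, namely that title, company and date together identify a posting.

```python
def job_key(job):
    """Composite key standing in for the job id the page does not expose."""
    return (job["title"], job["company"], job["date"])


def sync_jobs(stored, scraped):
    """Bring `stored` (a dict keyed by job_key) up to date with `scraped`.

    Returns two lists: jobs seen for the first time, and jobs whose
    status differs from the stored one.
    """
    inserted, updated = [], []
    for job in scraped:
        key = job_key(job)
        known = stored.get(key)
        if known is None:
            stored[key] = dict(job)  # store a copy so later edits don't leak
            inserted.append(job)
        elif known["status"] != job["status"]:
            known["status"] = job["status"]
            updated.append(job)
    return inserted, updated
```

Keeping this logic in a pure function makes it easy to test without launching a browser.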

Finally, we check whether any jobs were inserted or updated. If there are, we append them to a string that becomes the body of the email we send out using yagmail. The yagmail documentation explains how to set it up. I use a development Gmail account to send these emails.

    # yag (assumed to be a yagmail.SMTP client) and target (the recipient
    # address) are set up earlier in main().
    if inserted_jobs:
        subject = "Wobb - New Jobs Found"
        body = ""  # start empty so the two emails do not share text
        for job in inserted_jobs:
            body += "\n\nTitle: " + job['title'] + "\nCompany: " + job['company'] + "\nStatus: " + job['status'] + "\nCreated At: " + job['date']
        yag.send(target, subject, body)
    if updated_jobs:
        subject = "Wobb - Jobs Updated!"
        body = ""
        for job in updated_jobs:
            body += "\n\nTitle: " + job['title'] + "\nCompany: " + job['company'] + "\nStatus: " + job['status'] + "\nCreated At: " + job['date']
        yag.send(target, subject, body)
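
Since both emails format jobs the same way, the string building could be pulled into a helper. This is a sketch rather than the script's actual code; format_job and build_body are hypothetical names, and unlike the concatenation above the result has no leading blank lines.

```python
def format_job(job):
    """Render one job as the four-line block used in the notification email."""
    return (
        f"Title: {job['title']}\n"
        f"Company: {job['company']}\n"
        f"Status: {job['status']}\n"
        f"Created At: {job['date']}"
    )


def build_body(jobs):
    """Join the per-job blocks with a blank line between them."""
    return "\n\n".join(format_job(job) for job in jobs)
```

The send calls would then become, for example, `yag.send(target, subject, build_body(inserted_jobs))`.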

The snippets above only highlight the important parts of the script (which is most of it) because I wanted to keep the article fairly short. If you’re interested in the full project, the code is available on my GitHub.

Enjoyed the article or found it useful? Help out with a share and spread the word to other developers! Also feel free to drop a comment below. I’d love to hear your thoughts.

Kelvin Mwinuka

I am a software developer with a BS in Computer Science from The University of Nottingham. I’m passionate about web technologies. In my free time, I like blogging and challenging myself physically.
