Job Recommendation System

Caleb Weng
13 min read · Jun 8, 2021


by UCI MSBA NLP Team 1B

Project description and objectives

With the large and constant demand for jobs, people now search for openings online, and job boards have become the major channel for job seekers in the digital era. Another challenge addressed in this study is the difficulty of matching candidates' experience with job requirements: according to research, each job opening receives about 250 resumes, and only 50% of those resumes meet the minimum requirements. To save both recruiters and candidates from endless searching, we propose a job recommendation system for more efficient matching. The purpose of this project is to build a recommendation system that analyzes job descriptions and identifies potential areas of interest for candidates.
The primary objective of this project is to create models that recommend jobs applicants might be interested in, based on their previous experience, positions of interest, and other demographic information.
Job recommendation has traditionally been treated as a filter-based match, or as a recommendation based on the features of jobs and candidates as discrete entities. In this project, we use bag of words, TF-IDF, and KNN, leveraging the progression of job selections made by candidates through Natural Language Processing. Job recommendation is primarily aimed at helping users discover jobs that may interest them. It should also be dynamic, since the jobs a candidate views and selects over time can be a good indicator of their motivations and preferences.

Data Description

Our dataset comes from four tables on Kaggle. Since our project aims to help candidates find the most suitable jobs based on their work experience and other associated information such as job descriptions, we created two corpora: job openings and applicants. Details of the two corpora are shown below.

Exploratory data analysis

WordCloud

The word cloud below for the applicant view text indicates some high-frequency words such as customer service, part time, administrative assistant, and service rep.

Job postings (left) and Applicant experience (right)

Word frequency

We used RegexpTokenizer to count the frequency of each unique word in the text. The graph above shows the 20 most frequent words in the applicant view text; words including service, sales, customer, time, part, assistant, rep, and provide each appear more than 2,000 times.
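As a minimal sketch of how this kind of frequency count can be reproduced with NLTK's RegexpTokenizer and a Counter (the sample documents below are placeholders, not the project's actual data):

from collections import Counter

from nltk.tokenize import RegexpTokenizer

# Placeholder documents standing in for the applicant view text.
applicant_texts = [
    "Customer service rep, part time",
    "Administrative assistant providing customer service",
]

# Keep only alphabetic tokens, dropping digits and punctuation.
tokenizer = RegexpTokenizer(r"[a-zA-Z]+")

word_counts = Counter()
for doc in applicant_texts:
    word_counts.update(tokenizer.tokenize(doc.lower()))

# The 20 most frequent words, analogous to the bar chart of applicant view text.
print(word_counts.most_common(20))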

The four graphs below show the top 50 states and cities ranked by the number of job openings and by the number of jobs viewed by applicants.

State wise job openings

As we can see from the bar chart above, California has the largest number of job openings, with over 10,000. Florida and Texas follow with nearly 6,000 job openings each, and the remaining states have fewer openings.

City wise job openings

The graph shows that Chicago, Houston, and New Albany are the top three cities by number of job openings, each with close to 800.

State wise job applicants

As the bar graph shows, the top three most popular states are California, New York, and Illinois; in California alone, over 650 jobs were viewed by applicants.

City wise job applicants

The city-wise graph of job seekers shows that jobs in New York, Los Angeles, Chicago, and Washington were each viewed more than 100 times.

Data pre-processing

Fill in NAs values

To discover which variables contained NA values, we analyzed the missing values in each column. In the jobs data, 135 City values, 171 State Name values, 10 Employment Type values, and 267 Education Required values were missing. In the applicants data, only 22 State Name values were missing.

For the missing City and State values, we took the company names, searched for them on Google to find each company's location, and filled in the missing cities and states accordingly.

For Employment Type, we filled NAs with 'full/part time', since it is the most common employment type in our dataset.

For Education Required, we filled NAs with 'not specified'.
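A minimal pandas sketch of this fill-in logic, assuming illustrative column names (Company, City, State Name, Employment Type, Education Required) and a hypothetical manually built lookup of company locations:

import pandas as pd

# Hypothetical mapping from company name to (city, state), built from manual Google lookups.
company_locations = {"Acme Corp": ("Chicago", "IL")}

def fill_missing_values(jobs: pd.DataFrame) -> pd.DataFrame:
    jobs = jobs.copy()
    # Fill missing city/state from the manual company lookup.
    missing_loc = jobs["City"].isna() | jobs["State Name"].isna()
    for idx, row in jobs[missing_loc].iterrows():
        city, state = company_locations.get(row["Company"], (None, None))
        if pd.isna(row["City"]) and city:
            jobs.loc[idx, "City"] = city
        if pd.isna(row["State Name"]) and state:
            jobs.loc[idx, "State Name"] = state
    # Most common employment type in this dataset.
    jobs["Employment Type"] = jobs["Employment Type"].fillna("full/part time")
    # Explicit placeholder for missing education requirements.
    jobs["Education Required"] = jobs["Education Required"].fillna("not specified")
    return jobs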

Job openings
Applicants

Text cleaning

The text contains various symbols and words that do not convey meaning to the model. First, we imported the stopwords list from the Natural Language Toolkit (NLTK) to exclude words that are not meaningful for recommending jobs, and we converted the text to lowercase. In addition, we cleaned the text using regular expressions to remove digits, symbols, and punctuation. After applying these steps, the text was split into words and converted to lemmas through tokenization and lemmatization.
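A sketch of this cleaning pipeline using NLTK's stopwords, regular expressions, tokenization, and the WordNet lemmatizer (the function name and example string are illustrative, not taken from the project code):

import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("stopwords"), nltk.download("punkt"), nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    # Lowercase, then strip digits, symbols, and punctuation.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize, drop stop words, and reduce each remaining word to its lemma.
    tokens = [
        lemmatizer.lemmatize(tok)
        for tok in word_tokenize(text)
        if tok not in stop_words
    ]
    return " ".join(tokens)

print(clean_text("Provided 24/7 customer service & administrative support!"))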

Final Dataset

Below are examples of the final datasets for job openings and applicant information after text cleaning.

Job openings
Applicants

Topic models

We applied topic modeling to our job description corpus using the LDA model from the Gensim package in Python. Latent Dirichlet Allocation (LDA) with Gensim transforms the job descriptions into different topics, and each topic has a different weighting of words.

Before modeling, we removed stop words using the SpaCy package and created bigrams from our job description text. We then applied lemmatization to the job description corpus and kept only nouns, adjectives, verbs, and adverbs as our final text. Finally, we used id2word.doc2bow to convert each job document into a bag-of-words corpus, i.e., a document-term representation of word ids and their frequencies. The ten topics learned from this corpus, with their top weighted words, are shown below.

[(0,
'0.037*"park" + 0.027*"service" + 0.024*"guest" + 0.022*"time" + '
'0.014*"part" + 0.011*"clean" + 0.011*"attendant" + 0.010*"perform" + '
'0.010*"food" + 0.010*"equipment"'),
(1,
'0.069*"account" + 0.037*"accountemp" + 0.032*"clerk" + 0.018*"accountant" + '
'0.018*"payable" + 0.017*"specify" + 0.016*"temp" + 0.016*"experience" + '
'0.015*"seasonal" + 0.015*"bull"'),
(2,
'0.091*"time" + 0.056*"part" + 0.040*"full" + 0.037*"opportunity" + '
'0.034*"rsquo" + 0.030*"work" + 0.028*"pay" + 0.021*"offer" + 0.019*"look" + '
'0.019*"schedule"'),
(3,
'0.105*"sale" + 0.069*"customer" + 0.042*"retail" + 0.042*"product" + '
'0.034*"associate" + 0.032*"store" + 0.027*"service" + 0.023*"part" + '
'0.022*"time" + 0.022*"representative"'),
(4,
'0.037*"patient" + 0.036*"care" + 0.032*"nurse" + 0.019*"health" + '
'0.017*"medical" + 0.015*"hospital" + 0.013*"provide" + 0.013*"resident" + '
'0.012*"healthcare" + 0.012*"plan"'),
(5,
'0.079*"home" + 0.063*"care" + 0.037*"health" + 0.024*"nurse" + '
'0.022*"caregiver" + 0.022*"bayada" + 0.021*"senior" + 0.018*"provide" + '
'0.017*"time" + 0.017*"client"'),
(6,
'0.034*"entry" + 0.021*"officeteam" + 0.020*"assistant" + 0.020*"office" + '
'0.019*"specify" + 0.017*"customer" + 0.016*"seasonal" + 0.014*"call" + '
'0.013*"administrative" + 0.013*"level"'),
(7,
'0.019*"career" + 0.014*"school" + 0.013*"train" + 0.011*"program" + '
'0.011*"job" + 0.011*"grow" + 0.010*"work" + 0.009*"center" + 0.009*"level" '
'+ 0.009*"part"'),
(8,
'0.022*"service" + 0.014*"client" + 0.013*"maintain" + 0.011*"provide" + '
'0.011*"manager" + 0.010*"management" + 0.010*"work" + 0.010*"customer" + '
'0.009*"ensure" + 0.009*"information"'),
(9,
'0.035*"ability" + 0.030*"able" + 0.023*"job" + 0.022*"skill" + '
'0.022*"towne" + 0.021*"work" + 0.018*"macys" + 0.017*"time" + '
'0.017*"customer" + 0.017*"experience"')]

The LDA model summarizes each document and assigns it to one specific topic; applying it gives us ten topics, each with a different weight/distribution of words. A condensed sketch of this workflow follows.
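Below is a condensed sketch of the Gensim workflow described above (bigrams, SpaCy lemmatization with part-of-speech filtering, doc2bow, then LDA); the token lists and parameters such as passes and random_state are placeholders rather than the project's exact settings:

import spacy
from gensim import corpora
from gensim.models import LdaModel, Phrases
from gensim.models.phrases import Phraser

# Placeholder tokenized job descriptions.
job_descriptions = [
    ["customer", "service", "part", "time", "retail"],
    ["registered", "nurse", "patient", "care", "hospital"],
]

# Build bigrams, then lemmatize keeping only nouns, adjectives, verbs, and adverbs.
bigram = Phraser(Phrases(job_descriptions, min_count=5, threshold=100))
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # requires the SpaCy English model
allowed_postags = {"NOUN", "ADJ", "VERB", "ADV"}
texts = []
for tokens in job_descriptions:
    doc = nlp(" ".join(bigram[tokens]))
    texts.append([t.lemma_ for t in doc if t.pos_ in allowed_postags])

# id2word dictionary and bag-of-words corpus of (word id, frequency) pairs.
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=10, random_state=42)
print(lda_model.print_topics())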

From the figure below, we grouped the job descriptions into ten unique categories, listed underneath. In the intertopic distance plot, the bubble size shows the importance of each topic relative to the job data. These topics fall into four groups because some topics sit close to one another: 1) topics 5 and 8, 2) topics 2, 9, and 10, 3) topics 1, 6, and 3, and 4) topics 4 and 7. Topics within a group are similar, while the four groups are far apart from each other.

  • Topic 1 — Business Management
  • Topic 2 — Trainer/ Tutor
  • Topic 3 — Healthcare
  • Topic 4 — Office Clerk / Administrative Assistant
  • Topic 5 — Sales Representative
  • Topic 6 — Server/ Security/ Technician
  • Topic 7 — Accountant
  • Topic 8 — Hourly-pay and part-time
  • Topic 9 — Customer Service
  • Topic 10 — Home-care

Recommendation models

Considering that an applicant usually applies for jobs near their own location, we created a new dataset containing only job postings located in the same state as the applicant. The first step of text processing is to convert the corpus into numerical vectors, so we fed the corpora for both job openings and applicant experiences into two models: bag of words and TF-IDF. Bag of words converts the text data into word frequencies. TF-IDF also starts from word frequency, but additionally calculates the inverse document frequency, which gives importance to the rarity of a word. The top 10 recommendations were selected by cosine similarity. In addition, we tried a combination of KNN and TF-IDF with the metric set to 'cosine' as our third model. The structure of our recommendation system is illustrated in the graph below.

Model Structure
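The state filter mentioned above might look like the following minimal sketch (the column name State Name and the function name are placeholders, not the project's actual schema):

import pandas as pd

def jobs_in_applicant_state(jobs: pd.DataFrame, applicant_state: str) -> pd.DataFrame:
    # Restrict candidate job postings to the applicant's state before scoring similarity.
    return jobs[jobs["State Name"] == applicant_state].reset_index(drop=True)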

Below are two applicant examples we used to test the recommendation system.

Applicant 1 (ID 14235):

“personal assistant administrative assistant quest llc new york office cleaner assign various company perform office cleaner duties duties include polish office furniture sweep vacuum floor clean kitchen area put dish dishwasher need empty trash bin metro north office assistant collect customers relate information periodic demographic market survey provide customer service information relate customers travel directions around metro area new york mta reception office assistant clerical perform data entry type letter office memoranda make copy file work various company use microsoft network research update file use quickbooks answer telephone take brief message cashier food supervisor greet customers provide customer service operate cash register make food order perform supervisor duties server host line cook receptionist book keeper customer service rep”

Applicant 2 (ID 13960):

“administrative assistant quest llc los angeles executive manager implement market strategies techniques accord customer need responsibilities include execute operational function order achieve exceed cial goals establish store identify opportunities impact sales store events community outreach ensure outstanding service provide customer sales leadership direction order plan direct day day operations store product price human resources customer relations cash disbursement personal train visual merchandiser creative maximize sales potential create maintain overall store visual proposition line brand strategy plan future range consider past present future trend analyze current information order identify key learn optimize profit next season range along corporate forecast sales review balance stock develop strategic promotions success company visual merchandiser customer service suggest design ideas incorporate current trend fashion analyze develop implement sales strategies constant communication managers develop trend new merchandize promote advertise display windows store recognize consumer trend brand forever product image implement ongoing need business customer customer service use fashion merchandise methodology practical application along constant replenish merchandize receptionist”

Here is a summary of these two applicants. Based on their information, we obtained the following results.

Approach 1 Bag of words with cosine similarity

For this method, we used bag of words to create a dictionary of word frequencies, with all the words in the corpus as keys and their occurrence counts as values. The bag-of-words model produces a matrix whose columns correspond to the most frequent words in the dictionary and whose rows correspond to the documents. After running the model, we took the 10 job postings with the highest cosine similarity to the applicant's experience as his/her job recommendations.
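A minimal sketch of this approach using scikit-learn's CountVectorizer and cosine similarity (the job texts and applicant text below are placeholders for the project's cleaned corpora):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_texts = ["accounts payable clerk invoice entry", "line cook food prep kitchen"]  # placeholder
applicant_text = "bookkeeper receptionist customer service line cook"                # placeholder

# Fit the bag-of-words vocabulary on the job postings, then project the applicant text into it.
vectorizer = CountVectorizer()
job_vectors = vectorizer.fit_transform(job_texts)
applicant_vector = vectorizer.transform([applicant_text])

# Rank postings by cosine similarity and keep up to the top 10.
scores = cosine_similarity(applicant_vector, job_vectors).ravel()
top10 = scores.argsort()[::-1][:10]
print(top10, scores[top10])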

Below are the results for applicants 14235 and 13960.

Applicant 14235's positions of interest are Server, Host, Line Cook, Receptionist, Bookkeeper, and Customer Service, and their current job position is administrative assistant in New York.

The position of interest of applicant 13960 is receptionist.

Results

Based on applicant 14235's current position, our model recommended ops associate and activities assistant. For this person's position of interest, bookkeeper, we successfully matched accounts payable clerk, a job position similar to the one they are interested in.

Based on applicant 13960's current position as administrative assistant, our model recommended exactly that position, administrative assistant. For this person's position of interest, receptionist, the model matched similar positions: PT teller, file clerk, order management representative, and retail sales consultant.

Approach 2 TF-IDF with cosine similarity

TF-IDF is another method for performing job recommendation. With TF-IDF, words are given weights: TF measures the term frequency within a given document, but words that appear frequently across documents should be discounted. In general, the more frequently a word appears across all documents, the less valuable it is, and the inverse document frequency (IDF) is intended to give more weight to distinctive words. As in approach one, we took the 10 job postings with the highest cosine similarity to the applicant's experience as his/her job recommendations.
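The only change from the bag-of-words sketch above is the vectorizer; a hedged sketch with the same placeholder data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_texts = ["accounts payable clerk invoice entry", "line cook food prep kitchen"]  # placeholder
applicant_text = "bookkeeper receptionist customer service line cook"                # placeholder

# TfidfVectorizer down-weights words that are common across postings
# and up-weights rare, distinctive ones.
vectorizer = TfidfVectorizer()
job_vectors = vectorizer.fit_transform(job_texts)
applicant_vector = vectorizer.transform([applicant_text])

scores = cosine_similarity(applicant_vector, job_vectors).ravel()
print(scores.argsort()[::-1][:10])  # indices of the top recommendations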

Results

Based on applicant 14235's current and past experience, our model recommended customer service representative, client service coordinator, retail sales consultant, retail field representative, receptionist, retail commission sales, and service representative. For this person's positions of interest, bookkeeper and line cook, we successfully matched accounts payable clerk and line cook, which are similar to the positions of interest. The TF-IDF model's recommendations were more accurate than the bag-of-words ones.

Based on applicant 13960's past and current positions, our model recommended similar positions: administrative assistant, assistant manager, Macy's Coordinator, office manager, and Macy's retail stock merchandising.

Approach 3 KNN

We also applied k-Nearest Neighbors (KNN), used here as an unsupervised machine learning algorithm, to find the most similar positions based on the applicant's previous experience, state, and position of interest. We specified the metric as cosine so that the algorithm computes cosine similarity between the job description text vectors and the applicant vectors, and we made recommendations with the kneighbors function to find the 10 closest positions.
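A sketch of this step using scikit-learn's NearestNeighbors with the cosine metric on TF-IDF vectors (placeholder data again; the real model is fit on the state-filtered job corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

job_texts = ["accounts payable clerk invoice entry", "line cook food prep kitchen"]  # placeholder
applicant_text = "bookkeeper receptionist customer service line cook"                # placeholder

vectorizer = TfidfVectorizer()
job_vectors = vectorizer.fit_transform(job_texts)

# Fit an unsupervised nearest-neighbor index over the job posting vectors.
knn = NearestNeighbors(metric="cosine")
knn.fit(job_vectors)

# kneighbors returns cosine distances and indices of the closest postings.
distances, indices = knn.kneighbors(
    vectorizer.transform([applicant_text]),
    n_neighbors=min(10, job_vectors.shape[0]),
)
print(indices.ravel(), 1 - distances.ravel())  # posting indices and their similarities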

Results

Based on applicant 14235's current and past experience, our model recommended client service coordinator, retail sales consultant, retail field representative, receptionist, retail commission sales, service representative, and account executive. For this person's positions of interest, bookkeeper and line cook, we successfully matched accounts payable clerk and line cook, which are similar to the positions of interest. This model performs as well as the TF-IDF model.

Based on applicant 13960's past and current positions, our model recommended similar positions: Macy's stock merchandising, assistant manager, office manager, and Macy's Coordinator.

Recommendation System Test

Comparing the results for applicant 13960, we found that the TF-IDF and KNN recommenders produce consistent results, meaning that both models recommend similar positions with similar job IDs and rankings. However, the bag-of-words recommender produces quite different job recommendations. Based on this analysis, we are confident that the TF-IDF and KNN-with-TF-IDF recommendation models perform better than bag of words. The likely explanation is that TF-IDF treats rare words as more valuable, which is a better fit for a job recommendation system.

Challenges

For topic modeling, we applied the LDA model to our job description corpus to transform the job descriptions into different topics. However, some duplicate words still appear across topics.

One challenge we faced with the recommendation models was matching candidates' past experience with their location preferences. There are many missing values for job openings in several states, and some past experiences are completely different from candidates' current and desired positions. When we calculate cosine similarity, the models may therefore match on past experiences that are unrelated to the current ones and recommend the wrong job postings. We could improve our models by putting more weight on recent experiences and less on older ones. Additionally, custom features that fit a job recommendation system, such as the compound, positive, and negative scores from VADER and the subjectivity and polarity from TextBlob, could be added to enhance our recommendation models, as in the sketch below.
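A hedged sketch of how such sentiment features could be extracted with VADER and TextBlob (these features are not part of the current models, and the function name is illustrative):

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

# Requires: nltk.download("vader_lexicon") and `pip install textblob`.
sia = SentimentIntensityAnalyzer()

def sentiment_features(text: str) -> dict:
    vader = sia.polarity_scores(text)      # neg / neu / pos / compound
    blob = TextBlob(text).sentiment        # polarity, subjectivity
    return {
        "vader_compound": vader["compound"],
        "vader_pos": vader["pos"],
        "vader_neg": vader["neg"],
        "polarity": blob.polarity,
        "subjectivity": blob.subjectivity,
    }

print(sentiment_features("friendly customer service representative with strong attention to detail"))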

Conclusion

What we learned from this project is that California, Florida, and Texas have the most job opportunities. Since the project focuses only on job search data in the U.S., it would be interesting to see how job recommendations vary from country to country.

In terms of the number of applicants, New York and California have the most people applying for jobs. With a bigger applicant pool in those states, our job recommendation system could help a greater audience.

To sum up, the TF-IDF and KNN recommendation models produce very similar results: for both examples, 9 out of the 10 jobs recommended by TF-IDF are the same as those recommended by KNN. In contrast, bag of words shares very few recommendations with the other two models.

Ref: Please find the dataset and the Python code on GitHub.
