CONTENTS

12.1 Introduction and Motivation
    12.1.1 Computational Topics
12.2 Exploring Different Web Sites
12.3 Preliminary/Exploratory Scraping: The Kaggle Job List
    12.3.1 Processing the Text
    12.3.2 Generalizing to Other Posts
    12.3.3 Scraping the Kaggle Post List
12.4 Scraping CyberCoders.com
    12.4.1 Getting the Skill List from a Job Post
    12.4.2 Finding the Links to Job Postings in the Search Results
    12.4.3 Finding the Next Page of Job Post Search Results
    12.4.4 Putting It All Together
12.5 A Reusable Generic Framework for Arbitrary Sites
12.6 Scraping Career Builder
12.7 Scraping Monster.com
12.8 Analyzing the Results: The Important Skills
12.9 Note on Web Scraping
12.10 Exercises
Bibliography

12.1 Introduction and Motivation

In this case study, we will explore on-line job postings for different professions or types of positions. We are interested in finding which skills different types of positions require, and which are valuable but not required. We also want to find out what educational level an applicant should have (i.e., BSc, MSc, or PhD) for different types of jobs, how many years of experience are needed, what the salary ranges are, and how these differ geographically. We will work with on-line postings because they are up-to-date and easily accessed programmatically. We expect the resulting information will be interesting for both students and instructors. Also, in the case study, you will learn some of the skills that are

to

Ideally, there would be a single Web site with all job postings of interest. We could then query these with a rich query language and the results would be returned to us in a convenient form such as JSON or XML. We could then convert these results into data structures in any language such as R [6] or Python and start to explore the data. Each job posting would, ideally, have the same structure, with fields for salary range as a minimum and a maximum, location, list of required skills, educational background, and so on. We would be able to extract these easily, given the standard structure for each job. Furthermore, we could do text processing and even natural language processing (NLP) on the less structured aspects of each posting. We might even think of retrieving the job postings and housing them in a local database, or perhaps even better, a text search engine such as Lucene [1] or ElasticSearch [2].
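To make this ideal scenario concrete, the following minimal sketch shows what converting such a query result into data structures might look like in Python. Everything in it is an assumption made for illustration: the JSON layout, the field names (`salary`, `required_skills`, `education`), the sample postings, and the `JobPosting` and `parse_postings` names are all hypothetical and do not describe any real job site.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import List

# Hypothetical query result: the field names and values below are invented
# for illustration and do not correspond to any real job site's format.
SAMPLE_RESPONSE = """
[
  {"title": "Data Scientist",
   "salary": {"min": 90000, "max": 120000},
   "location": "San Francisco, CA",
   "required_skills": ["R", "Python", "SQL"],
   "education": "MSc"},
  {"title": "Statistician",
   "salary": {"min": 70000, "max": 95000},
   "location": "Davis, CA",
   "required_skills": ["R", "SAS"],
   "education": "PhD"}
]
"""


@dataclass
class JobPosting:
    """One posting with the uniform fields described in the text."""
    title: str
    salary_min: int
    salary_max: int
    location: str
    required_skills: List[str]
    education: str


def parse_postings(text: str) -> List[JobPosting]:
    """Convert a JSON array of postings into JobPosting records."""
    return [
        JobPosting(
            title=rec["title"],
            salary_min=rec["salary"]["min"],
            salary_max=rec["salary"]["max"],
            location=rec["location"],
            required_skills=rec["required_skills"],
            education=rec["education"],
        )
        for rec in json.loads(text)
    ]


postings = parse_postings(SAMPLE_RESPONSE)

# With a uniform structure, tallying skills across postings is one line.
skill_counts = Counter(skill for p in postings for skill in p.required_skills)
print(skill_counts.most_common())
```

With postings in this uniform shape, questions such as which skills appear most often reduce to a simple tally over the records.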