In the last post about a Craigslist bot we were able to generate a list of posts by ID. While this by itself is super cool, it's not particularly useful. It can give a nice list of posts, but what I want is data about these listings. Where are they? Who are they? When are they?
Thankfully the work done during the last post can be leveraged here as well. Each unique post ID has a unique URL to a post listing (which is obvious - you want one post to equal one URL...would be a nightmare if they shared!). By grabbing the URL that the searched post links to, we can scrape that URL for more details about the post.
I'm going to first define a function that takes in a base URL to start with, and an array of post IDs:
    const got = require('got');
    const cheerio = require('cheerio');

    const get_updated = async (url, ids) => {

    }
Easy enough. These post IDs will be the difference between the baseline and current post listing, but for testing purposes I'm going to throw a ton of IDs in.
Firstly, we need to get the HTML data from the URL provided. I know that I'm going to have a ton of different posts to get data from, and I want this to be done efficiently and in parallel, since there are no dependencies between posts. Because of this I'm going to use Promise.all() to run a large number of HTTP requests concurrently.
    const got = require('got');
    const cheerio = require('cheerio');

    const get_updated = async (url, ids) => {

      const data = await got(url).text();

      return await Promise.all(ids.map(async id => {

      }))
    }
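To convince myself that Promise.all() behaves the way I want, here's a minimal sketch that swaps the real HTTP requests for a made-up `fakeFetch` helper (just a timer, nothing to do with got). It shows the two things I'm relying on: every task is started up front, and the results come back in the same order as the input IDs even when the later ones finish first.

```javascript
// Hypothetical stand-in for an HTTP request: resolves after `ms` milliseconds.
const fakeFetch = (id, ms) =>
  new Promise(resolve => setTimeout(() => resolve(`post ${id}`), ms));

const fetchAll = async ids => {
  const started = Date.now();
  // Start every "request" at once; the first ID gets the longest delay.
  const results = await Promise.all(
    ids.map((id, idx) => fakeFetch(id, 50 * (ids.length - idx)))
  );
  return { results, elapsed: Date.now() - started };
};

fetchAll(['a', 'b', 'c']).then(({ results, elapsed }) => {
  console.log(results); // → [ 'post a', 'post b', 'post c' ]
  console.log(`took roughly ${elapsed}ms`);
});
```

The elapsed time should land near the longest single delay rather than the sum of all three, which is the whole point of firing the requests together.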
For each of the posts, I only want to search through the elements whose IDs match the list of those that are different. No need to wade through a ton of text if you can downselect using the IDs. jQuery (and cheerio, which mirrors its API) has a selector that lets you select elements with specific attributes & values, and I want to use the same data-pid attribute that I used last time to generate the list of IDs to help me downselect. To do that, you need to specify the attribute and value in the form [attr=value].
    const got = require('got');
    const cheerio = require('cheerio');

    const get_updated = async (url, ids) => {

      const data = await got(url).text();

      return await Promise.all(ids.map(async id => {
        const results = cheerio.load(data)(`[data-pid=${id}]`).html();
        const $ = cheerio.load(results);
      }))
    }
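One small caveat with building a selector from a template string: the ID goes straight into the selector text. Here's a hedged sketch of a helper (`pidSelector` is my own name, not something cheerio provides) that checks the ID is all digits, which is an assumption based on the IDs seen in this post, and quotes the value so stricter CSS parsers accept it too.

```javascript
// Hypothetical helper: build an attribute selector for one post ID.
const pidSelector = id => {
  const value = String(id);
  // Assumption: post IDs are purely numeric, like the ones in this post.
  if (!/^\d+$/.test(value)) {
    throw new TypeError(`Unexpected post ID: ${value}`);
  }
  // Quote the value: strict CSS parsers reject unquoted values that start with a digit.
  return `[data-pid="${value}"]`;
};

console.log(pidSelector(7408541258)); // → [data-pid="7408541258"]
```

The returned string can be passed wherever the template string was used before.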
Within this downselection, there's only one <a> element to the specific post, which is really handy. We can follow the href attribute of the <a> element to get more information about the post, and to grab the URL that this link links to, we just need to access the href attribute using the .attr() function.
    const got = require('got');
    const cheerio = require('cheerio');

    const get_updated = async (url, ids) => {

      const data = await got(url).text();

      return await Promise.all(ids.map(async id => {
        const results = cheerio.load(data)(`[data-pid=${id}]`).html();
        const $ = cheerio.load(results);
        const url = $('a').attr('href');
      }))
    }
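If the href ever turns out to be relative or missing (an assumption on my part, I haven't checked every listing), Node's built-in URL class can resolve it against the search URL before it gets handed to got. `resolve_link` is a hypothetical helper name, and this is only a sketch.

```javascript
// Hypothetical helper: turn an href from the results page into an absolute URL.
const resolve_link = (href, base) => {
  // .attr('href') returns undefined when the attribute is missing.
  if (!href) return null;
  try {
    // Absolute hrefs pass through unchanged; relative ones are resolved against base.
    return new URL(href, base).href;
  } catch (err) {
    return null; // href and base together did not form a valid URL
  }
};

const base = 'https://maine.craigslist.org/d/furniture/search/fua?postal=04402&search_distance=10';
console.log(resolve_link('/fuo/d/bangor-oak-table/7408541258.html', base));
// → https://maine.craigslist.org/fuo/d/bangor-oak-table/7408541258.html
```

Returning null instead of throwing makes it easy to skip a bad link without taking down the rest of the batch.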
The URL we just extracted leads to a more detailed page about this particular listing. If we follow it and grab its contents using got, we can get more information for this listing.
    const got = require('got');
    const cheerio = require('cheerio');

    const get_updated = async (url, ids) => {

      const data = await got(url).text();

      return await Promise.all(ids.map(async id => {
        const results = cheerio.load(data)(`[data-pid=${id}]`).html();
        const $ = cheerio.load(results);
        const url = $('a').attr('href');

        return got(url).text()
          .then(data => {
            // I'm using a double $ to remind myself that this is an 'inner'
            // jQuery selector function
            const $$ = cheerio.load(data);
          });
      }))
    }
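One thing to keep in mind with this shape: Promise.all() rejects as soon as any single request fails, so one dead listing would throw away the whole batch. A possible alternative is Promise.allSettled(), sketched below with a made-up `fetch_post` that stands in for the got call and a made-up example.org URL that always fails.

```javascript
// Hypothetical stand-in for got(url).text(): fails for URLs containing "deleted".
const fetch_post = async url => {
  if (url.includes('deleted')) throw new Error(`Request failed: ${url}`);
  return `<html>${url}</html>`;
};

const fetch_all_posts = async urls => {
  // allSettled never rejects; each entry reports its own status.
  const settled = await Promise.allSettled(urls.map(url => fetch_post(url)));
  // Keep only the pages that actually loaded.
  return settled
    .filter(result => result.status === 'fulfilled')
    .map(result => result.value);
};

fetch_all_posts(['https://example.org/1.html', 'https://example.org/deleted.html'])
  .then(pages => console.log(pages.length)); // → 1
```

Whether silently dropping failed listings is the right call depends on the bot; logging the rejected entries might be the better move.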
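There are a lot of fields available in the post listing, but not all of them are useful! I'm specifically interested in the following fields: URL, price, title, location, distance, posting time, description, thumbnail image, and attributes.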
I dug around a bit in the HTML of a Craigslist post and was able to come up with the following. I'm sure it's not perfect, but it seems to work for searches that I'm interested in! Feel free to edit as necessary.
Here's the final function:
    const got = require('got');
    const cheerio = require('cheerio');

    const get_updated = async (url, ids) => {

      const data = await got(url).text();

      return await Promise.all(ids.map(async id => {
        const results = cheerio.load(data)(`[data-pid=${id}]`).html();
        const $ = cheerio.load(results);
        const url = $('a').attr('href');

        return got(url).text()
          .then(data => {
            const $$ = cheerio.load(data);

            return {
              url,
              price: $('.result-meta > .result-price').text(),
              title: $('.result-title').text(),
              // Strip everything except letters, digits, whitespace, colons, and accented Latin-1 characters
              location: $('.result-hood').text().replace(/[^a-zA-Z\d\s:\u00C0-\u00FF]/g, '').trim(),
              distance: $('.maptag').text(),
              time: $('time').attr('datetime'),
              // There seems to be 70 random characters at the start of every posting body...
              // Trim start & remove newlines and whitespace
              description: $$('#postingbody').text().slice(70).split('\n').filter(Boolean).map(el => el.trim()).join(' '),
              // Grab the thumbnail, if applicable
              img: $$('img[title=1]').attr('src'),
              // Filter and remove unwanted attributes
              attributes: $$('.attrgroup > span').map((idx, el) => $$(el).text()).toArray().filter(Boolean).filter(el => !el.includes('ads'))
            }
          });
      }))
    }
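The text cleanup is the fiddliest part, so here's a small sketch of the same idea pulled out into standalone helpers that can be tried without any network calls. The helper names and the sample strings are made up for illustration. One difference from the chain above: it trims each line before filtering, so whitespace-only lines get dropped as well.

```javascript
// Hypothetical helper: collapse a multi-line posting body into one line.
const clean_description = text =>
  text
    .split('\n')
    .map(line => line.trim()) // trim first...
    .filter(Boolean)          // ...so whitespace-only lines are removed too
    .join(' ');

// Hypothetical helper: keep letters, digits, whitespace, colons and characters
// in the accented Latin-1 range; strip everything else (e.g. parentheses).
const clean_location = text =>
  text.replace(/[^a-zA-Z\d\s:\u00C0-\u00FF]/g, '').trim();

console.log(clean_description('  Oak table \n   \n six chairs  ')); // → Oak table six chairs
console.log(clean_location(' (Bangor) ')); // → Bangor
```

Keeping these as plain functions also makes it easier to tweak the rules when a particular search returns oddly formatted listings.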
How does this tie into the code from last time? If we combine the two like so:
    // get_ids comes from the last post; uuidv4 and email are defined elsewhere
    (async () => {
      const url = `https://maine.craigslist.org/d/furniture/search/fua?postal=04402&search_distance=10`;
      const update_interval = `86400`;
      const ids = await get_ids(url);

      console.log({
        uuid: uuidv4(),
        url,
        email,
        update_interval,
        // Ignore the ids[0] here...I don't want to copy in 50+ records lol
        posts: await get_updated(url, [ids[0]])
      });
    })();
we can get the following output:
    {
      uuid: '222393f5-3a27-4b21-9c26-d18755303ae5',
      url: 'https://maine.craigslist.org/d/furniture/search/fua?postal=04402&search_distance=10',
      update_interval: '86400',
      posts: [
        {
          url: 'https://maine.craigslist.org/fuo/d/bangor-oak-table/7408541258.html',
          price: '$1,200',
          title: 'oak table',
          location: 'Bangor',
          distance: '0mi',
          time: '2021-11-15 21:21',
          description: 'Oak table center tier folds down, underside cabinet and six high back chairs great shape. 42 by 54 two tiers 54x54 with three tiers',
          img: 'https://images.craigslist.org/01717_i51JiGM6Uo1z_0CI0t2_600x450.jpg',
          attributes: ["condition: excellent"]
        }
      ]
    }
Lookie here! Looks good :)
I lied - next post we will pull together a DB for some persistence of posts & begin to create an API to interact with it. We'll also need to build out a notifier of some sort...
Thanks for reading!