Parse sites with Node.js


Scraping sites plays a significant role in the web world, and there are all sorts of tools and libraries for it. I once had to parse data from several different sites, so I went looking for ways to do it.
First I found a suitable library for making HTTP requests, the request package (its maintainers deprecated it in 2020, so treat it as legacy).
To install it, run:

npm i request

www.npmjs.com/package/request
Now it can be used to fetch data from a source, as shown below:

const request = require("request");

// Fetches the page at `url` and resolves with its HTML body as a string.
const getData = (url) => {
  return new Promise((resolve, reject) => {
    request(url, (error, response, body) => {
      if (error) {
        return reject(error);
      }
      return resolve(body);
    });
  });
};

This function returns a Promise. Inside it, a GET request is made, and the callback either rejects the Promise with the error or resolves it with the response body.
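One thing worth knowing: request passes an error to the callback only for network-level failures, so a 404 or 500 page still resolves the Promise. Below is a minimal sketch of the same wrapper with a status check added. The requestFn parameter and the fakeRequest stand-in are my own assumptions for illustration (they let the example run without network access); they are not part of the request package.

```javascript
// Wraps any function with the callback shape (url, (error, response, body) => ...)
// in a Promise and rejects on non-2xx status codes.
function getData(requestFn, url) {
  return new Promise((resolve, reject) => {
    requestFn(url, (error, response, body) => {
      if (error) {
        return reject(error);
      }
      if (response.statusCode < 200 || response.statusCode >= 300) {
        return reject(new Error(`Unexpected status ${response.statusCode} for ${url}`));
      }
      return resolve(body);
    });
  });
}

// Hypothetical stand-in for the network call: serves one known page, 404 otherwise.
function fakeRequest(url, callback) {
  const pages = { "https://example.com/": '<a href="/about">About</a>' };
  setImmediate(() => {
    if (url in pages) {
      callback(null, { statusCode: 200 }, pages[url]);
    } else {
      callback(null, { statusCode: 404 }, "Not found");
    }
  });
}

getData(fakeRequest, "https://example.com/")
  .then((body) => console.log(body))
  .catch((err) => console.error(err.message));
```

With the real package, the call would presumably look like getData(require("request"), url).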
Once we have the HTML, we need to parse it. The most widespread tool for this is cheerio.
Install it:

npm i cheerio

www.npmjs.com/package/cheerio
An example of how to use it:

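const cheerio = require("cheerio");

// Parses an HTML string and returns the <body> selection plus every link's href.
const getBody = async (body) => {
  if (body) {
    const $ = cheerio.load(body); // load() is synchronous, no await needed
    const links = [];
    $('a').each(function (index, link) {
      links.push($(this).attr('href'));
    });
    return {
      body: $('body'),
      links,
    };
  }
  // Nothing to parse: keep the same shape with an empty list of links.
  return { body, links: [] };
};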

cheerio.load parses the HTML synchronously, so as soon as it returns, the document is ready to query and you can do whatever is needed with it.
The returned object contains the whole <body> selection and an array with the href of every link.
With cheerio it's easy to get any elements or attributes you need.
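The href values collected this way come back exactly as they are written in the page, so they can be relative paths, anchors, or mailto: links. If the links are going to be fetched next, they probably need normalizing first. Here is a small sketch using Node's built-in URL class; normalizeLinks and the example.com addresses are hypothetical names for illustration.

```javascript
// Turns raw href values into absolute, de-duplicated http(s) URLs.
function normalizeLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue; // <a> tags without an href give undefined
    let url;
    try {
      url = new URL(href, baseUrl); // resolves relative paths against the page URL
    } catch {
      continue; // skip values that cannot be parsed as a URL
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue; // mailto:, javascript:, ...
    url.hash = ""; // "#section" still points to the same page
    seen.add(url.href);
  }
  return [...seen];
}

const rawLinks = [
  "/about",
  "contact.html",
  undefined,
  "mailto:someone@example.com",
  "https://other.example/x#top",
  "/about#team",
];

console.log(normalizeLinks(rawLinks, "https://example.com/blog/post"));
```

The Set keeps only one copy of each address, which helps when the same page is linked several times.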

In addition, it's often useful to run the parsing regularly without anyone sitting at the PC. Cron or other tools for launching scripts automatically make this possible. I usually use node-schedule because it's a library that is simple to configure.
The command to install it:

npm i node-schedule

www.npmjs.com/package/node-schedule
To understand the basic usage:

const schedule = require('node-schedule');

// Runs `work` on a cron schedule built from `period`
// (minute, hour, day of month; month and day of week are always "*").
function scheduleWork(work = () => {}, period = { minutes: '59', hours: '*', days: '*' }) {
  const periodToLaunch = `${period.minutes} ${period.hours} ${period.days} * *`;
  return schedule.scheduleJob(periodToLaunch, function() {
    work();
  });
}

module.exports = {
  scheduleWork,
};

With the default arguments, the script is launched once an hour, at minute 59. Both the work to run and the period come from the function arguments.
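To make the schedule string less mysterious: the template in scheduleWork produces a five-field cron expression (minute, hour, day of month, month, day of week), with the last two fields fixed to *. Here is a small sketch that builds the same string in isolation, so it can be checked without starting a job. toCronExpression is a hypothetical helper of mine, not part of node-schedule.

```javascript
// Builds the same five-field cron string as scheduleWork:
// minute, hour, day of month, then "*" for month and day of week.
function toCronExpression({ minutes = "59", hours = "*", days = "*" } = {}) {
  const fields = [minutes, hours, days].map(String);
  for (const field of fields) {
    // An empty field or one with spaces would shift every field after it.
    if (field === "" || /\s/.test(field)) {
      throw new Error(`Invalid cron field: "${field}"`);
    }
  }
  return `${fields.join(" ")} * *`;
}

// Every hour, at minute 59 (the defaults).
console.log(toCronExpression()); // 59 * * * *

// Every day at 03:30.
console.log(toCronExpression({ minutes: "30", hours: "3" })); // 30 3 * * *

// On the first day of each month, at midnight.
console.log(toCronExpression({ minutes: "0", hours: "0", days: "1" })); // 0 0 1 * *
```

The per-field defaults are a deliberate choice here: when only part of the period object is passed, they keep undefined from ending up in the cron string.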
Thanks for reading this. I hope it saves you some time.
Best regards.