How to turn any website into an RSS feed

What if a website you want to integrate does not provide an RSS feed? In this article, we’ll show you how to build a simple crawler and publish its content in an RSS feed.

Content

If you're like me and still follow the internet the good old way, via RSS, then once in a while you find an interesting website with no RSS feed available. A sad but not unresolvable situation.

Can all websites have an RSS feed?

One of our main claims on the Apify website is that we turn websites into APIs. So we should know how to solve this kind of situation, right? The only missing ingredient for this article is a website without a proper RSS feed. Well, it's the cobbler's children that go barefoot, so let's use the Apify change log!


We will be using our most popular generic web scraping tool - Web Scraper (apify/web-scraper).

Get a free Apify account if you haven't already got one, then open the scraper in Apify Console and create a task for it:

The option to create a new task for the web scraper actor.

After you've created a new task, open the "Input and options" tab. Here we will have to configure three fields:

  • Start URLs - the URL where the scraper starts: https://apify.com/change-log.
  • Pseudo-URLs - a pattern for the URLs you want the scraper to visit. These all take the form https://apify.com/change-log/[.*], where [.*] stands for any sequence of characters.
  • Page function - here the programming fun starts 🎡.
The actor has a page function input form.

We won't need any other configuration fields to accomplish our task. But to put together the page function, we will have to look more deeply into the HTML source code of the changelog page.

For more information on various inputs of Web Scraper, see its documentation.
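Put together, the task input looks roughly like this as JSON (a sketch; the field names follow the Web Scraper input schema at the time of writing, and the page function is abbreviated):

```json
{
    "startUrls": [{ "url": "https://apify.com/change-log" }],
    "pseudoUrls": [{ "purl": "https://apify.com/change-log/[.*]" }],
    "pageFunction": "async function pageFunction(context) { /* full version below */ }"
}
```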

To summarize what we have just configured:

  • Start URLs - the changelog landing page, where the crawl begins.
  • Pseudo-URLs - a pattern matching the individual changelog posts.
  • Page function - the code that extracts data from each post.

How to create an RSS feed XML file

If you open the Wikipedia page about RSS, you will find an example of an RSS item containing the following fields:

  <item>
    <title>Example entry</title>
    <description>Here is some text ...</description>
    <link>http://www.example.com/blog/post/1</link>
    <guid>7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
    <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
  </item>

This means that our page function needs to extract the following from each changelog post (such as this one):

  • Title
  • Description - the body of the post
  • URL
  • Some unique ID
  • Publication date
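If you ever need to assemble the feed yourself rather than relying on an export, mapping these fields onto an RSS `<item>` is straightforward. A minimal JavaScript sketch (the `escapeXml` helper and the record shape are illustrative assumptions, not part of Web Scraper):

```javascript
// Escape characters that are special in XML text content.
function escapeXml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Turn one scraped record into an RSS <item> element.
// isPermaLink="false" because our GUID is a date, not a URL.
function toRssItem({ title, description, url, guid, date }) {
    return [
        '<item>',
        `  <title>${escapeXml(title)}</title>`,
        `  <description>${escapeXml(description)}</description>`,
        `  <link>${escapeXml(url)}</link>`,
        `  <guid isPermaLink="false">${escapeXml(guid)}</guid>`,
        `  <pubDate>${date}</pubDate>`,
        '</item>',
    ].join('\n');
}
```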

Here you can see the HTML structure of a changelog detail page with the data we need:

The location of the description, title, and date in the HTML.

I'll use jQuery, which is available in the page function via context.jQuery, to extract the data. The whole code looks like this:

async function pageFunction(context) {
    // Skip the landing page and extract data from detail pages only.
    if (context.request.url === 'https://apify.com/change-log') return;

    const $ = context.jQuery;
    const title = $('h1').first().text();
    const date = $('.ChangeLogItem-date').text();
    // There is one <div> between the header and the description,
    // so skip it with next().next(). Also trim() the text to get
    // rid of surrounding whitespace.
    const description = $('.ChangeLogItem-header')
        .next().next()
        .text()
        .trim();
    // toUTCString() renders the date in the RFC 822 style that RSS expects.
    const pubDate = new Date(date).toUTCString();

    return {
        url: context.request.url,
        title,
        date: pubDate,
        guid: pubDate, // The date is unique per post, so we can use it as a GUID.
        description,
    };
}
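The date conversion deserves a quick sanity check. The browser's Date constructor parses the scraped date string, and toUTCString() formats it the way RSS readers expect (the timestamp below is an illustrative value, not real changelog markup):

```javascript
// An ISO timestamp with a 'Z' suffix parses as UTC, so the
// output is deterministic regardless of the local timezone.
const parsed = new Date('2009-09-06T16:20:00Z');
console.log(parsed.toUTCString()); // "Sun, 06 Sep 2009 16:20:00 GMT"
```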

Now let's run our task, and after it finishes, preview the data in the dataset. You should get the following results:

JSON preview of the scraped data.

How often do you want your RSS feed to update?

Now let's configure Apify Scheduler to run the task every hour so the feed stays fresh. Finally, copy this API URL, which returns the results of your task's last run in RSS format:

https://api.apify.com/v2/actor-tasks/[TASK_ID]/runs/last/dataset/items?token=[YOUR_API_TOKEN]&format=rss

into the RSS reader of your choice, and this is the result:

Apify Change Log.

There you go. Now that you know how to create an automatic RSS feed from any website, staying on top of the most important news updates is easy. Let us know how it worked for you on Twitter!

Marek Trunkát
CTO and one of the earliest Apifiers. Writing about challenges our development team faces when building and scaling the Apify platform, which automates millions of tasks every month.