### Overview
Broadly, there are three tasks in our Newspaper application -
1. Gathering - Scrape articles based on the current date and topic.
2. Scoring - The gathered content needs to be prioritised based on rules the user provides, feedback from the user, and the 'significance' of the news.
3. Aggregating - Based on the sources provided, the articles need to be aggregated and summarized, handling any contradictory information gracefully.
---
### Gathering
Input - URL, category, date
Output - list of { url, score, date, summary }
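As a minimal sketch (assuming Python), the output record could be modelled like this:

```python
from dataclasses import dataclass

@dataclass
class Article:
    """One gathered article, matching the output spec above."""
    url: str
    score: float  # relevance/priority score assigned later
    date: str     # publication date, e.g. "2025-05-01"
    summary: str
```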
#### Non-Agentic Scraper Approach
The first approach we will attempt for gathering is to directly scrape sources from the URLs provided by users. This is a completely non-agentic approach that relies purely on code: the website is scraped and the articles are then filtered using the date and category filters. I used Claude Code to generate a few different scrapers and tested them out on the BBC, NY Times and The Hindu websites.
I initially tried to brute-force the process, but the site structures were too different, so I switched to an approach that breaks the scraping down into two stages. The first stage finds appropriate listing pages based on the category that is provided. The second extracts article pages from those listing pages. We use LLMs both to check the match between the provided category and the listing pages, and to verify that candidate article pages are truly articles.
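To make this concrete, here is a minimal sketch of the two-stage flow. It assumes Python with `requests`, `BeautifulSoup` and the OpenAI client; the helper names and prompts are illustrative, not the actual code from the repo.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_yes_no(question: str) -> bool:
    """Ask the model a yes/no question and parse the answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question + " Answer only YES or NO."}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def get_links(url: str) -> list[str]:
    """Collect absolute URLs of all anchors on a page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def find_listing_pages(home_url: str, category: str) -> list[str]:
    """Stage 1: keep links the model judges to be listing pages for the category."""
    return [
        link for link in get_links(home_url)
        if llm_yes_no(f"Is {link} likely a '{category}' section or listing page?")
    ]

def find_article_pages(listing_url: str) -> list[str]:
    """Stage 2: keep links the model judges to be individual article pages."""
    return [
        link for link in get_links(listing_url)
        if llm_yes_no(f"Is {link} likely an individual news article page?")
    ]
```

In practice you would batch the candidate links into a single prompt rather than making one model call per URL, both for speed and for cost.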
##### Challenges Faced
- Many sites block bots, so simply getting access to the pages was a challenge.
- This approach completely loses the information carried by the website's layout and the prominence the source gives a particular article.
#### Web Search Approach
The second approach we'll look at is to leverage search engines. We will use the OpenAI web search tools and ask the model to find articles based on the category and source. We will use the evaluator-optimiser pattern: one agent performs the search while another critiques the results and suggests approaches it can take.
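As a rough sketch of that loop (assuming the OpenAI Responses API and its built-in `web_search_preview` tool; the prompts and the stopping rule are my own simplifications):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_articles(source: str, category: str, rounds: int = 3) -> str:
    """Evaluator-optimiser loop: one agent searches, another critiques it."""
    query = f"Find today's '{category}' articles from {source} and list their URLs."
    result = ""
    for _ in range(rounds):
        # Searcher: runs the web search with the current instruction.
        result = client.responses.create(
            model="gpt-4o",
            tools=[{"type": "web_search_preview"}],
            input=query,
        ).output_text
        # Evaluator: judges the results and proposes a better instruction.
        critique = client.responses.create(
            model="gpt-4o",
            input=(
                f"Search instruction: {query}\nResults:\n{result}\n"
                "If these look like good, current article URLs, reply DONE. "
                "Otherwise reply with an improved search instruction."
            ),
        ).output_text
        if critique.strip().upper().startswith("DONE"):
            break
        query = critique
    return result
```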
##### Challenges Faced
- This is far too slow, and many sites are blocked to GPT, so the error rate is high and the method is very inefficient.
- The agents tend to gravitate towards convoluted logic and overly complex approaches.
#### Search Engine Approach
The third approach is to see what we can get out of the Google APIs. We will use the Google search API to reach our aggregator pages. We will also consider what tools like Exa can offer.
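For illustration, a minimal call to Google's Custom Search JSON API, which needs an API key and a Programmable Search Engine ID (`cx`); restricting the query to one news site and to the last day are assumptions about how we'd use it:

```python
import requests

def google_search(query: str, site: str, api_key: str, cx: str) -> list[dict]:
    """Query Google's Custom Search JSON API, restricted to one news site."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": api_key,
            "cx": cx,                     # Programmable Search Engine ID
            "q": f"{query} site:{site}",  # e.g. "technology site:bbc.com"
            "dateRestrict": "d1",         # only results from the last day
            "num": 10,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"url": item["link"], "title": item["title"]}
        for item in resp.json().get("items", [])
    ]
```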
##### Challenges Faced
- This does not allow us to be opinionated enough; the results depend largely on how the search tool interprets our query.
#### Scraping Approach with MCP
With the initial scraping approach we ran into a lot of issues because we started from the site's homepage. Here, instead, I start from a specific listing page that defines the category and write an MCP server to help the agent scrape. This should be more robust than the earlier scraping approach, since it combines plain scraping with agentic capabilities to make decisions on the fly. The MCP server also exposes tools that understand the context and position of URLs within the page, which handles some of the relevance assessment and scoring as well.
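A stripped-down sketch of what such an MCP server might expose, using the Python `mcp` SDK's FastMCP helper; the tool names and the position heuristic here are illustrative rather than the actual implementation:

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("newspaper-scraper")

def _get_html(url: str) -> str:
    # A browser-like User-Agent reduces (but doesn't eliminate) blocking.
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    return requests.get(url, headers=headers, timeout=10).text

@mcp.tool()
def fetch_page(url: str) -> str:
    """Fetch the raw HTML of a page for the agent to inspect."""
    return _get_html(url)

@mcp.tool()
def extract_links_with_position(url: str) -> list[dict]:
    """Return each link on the page with its relative position,
    so the agent can weigh prominence (0.0 is the top of the page)."""
    soup = BeautifulSoup(_get_html(url), "html.parser")
    anchors = soup.find_all("a", href=True)
    return [
        {
            "url": urljoin(url, a["href"]),
            "text": a.get_text(strip=True),
            "position": i / max(len(anchors) - 1, 1),
        }
        for i, a in enumerate(anchors)
    ]

if __name__ == "__main__":
    mcp.run()  # serve the tools over stdio for the agent to call
```

The `position` field is what lets the agent factor prominence into scoring: links near the top of a listing page are usually the stories the source considers most important.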
##### Challenges Faced
- There is still some difficulty in avoiding being blocked by websites.
- The prompts need careful tuning so that the best articles are surfaced, not just some arbitrary article.
### Final Thoughts
The results are still not that satisfactory and don't inspire confidence that all the relevant articles are being aggregated. While the approach works, it will need some time and effort to become comprehensive, and comprehensiveness is important in a newspaper application.
I had a lot of fun working on this, but I'll take a pause here and wait for improvements in model capabilities around browsing the web.
The code can be found on GitHub here - https://github.com/vigneshprasad/personal-newspaper