For the past few years I've been trying to move away from social media like Reddit and Twitter. The main challenge has been that I keep going back to these platforms because of the value they provide in keeping me updated on the "happenings" in the world. Apart from these two, I find myself visiting other websites each day as well - Cricinfo for cricket news, HackerNews for tech news, Nature for science news, etc.
Recently, I decided to explore building a personal newspaper application to aggregate information from the sources I typically consume. The idea was to build a personal newsletter that helps me not live under a rock. I know there are newsletters and RSS feeds out there solving this exact problem (I've found StrictlyVC's newsletter quite good for tech news) but the idea was to build something opinionated and personal to me.
I took a quick stab at this and while the results are not exactly where I want them to be, they get pretty close, and I think with more time and effort there's something worth building here. I pretty much used Claude Code and Codex for this application. This is not a discussion of the tech stack or the technology I used but more so of the approach. Having said that, [Modal](https://modal.com/) is awesome and you should definitely consider using it to spin up projects really quickly.
Caveat - I wish I had stored my outputs and retained old code, but I optimized for a quick and easy solution. So here are my thoughts on how I went about designing this system, sans evidence (you'll have to just take my word for it) -
### Design
To give a high degree of control to the user and allow users to live in their own "echo chambers" (or not), I decided to allow the user (in this case me) to define the sources and categories they're interested in.
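For illustration, the configuration can be as simple as a mapping from categories to source URLs. The shape below is hypothetical - the actual structure in the repo may differ -

```python
# Hypothetical user configuration: categories mapped to the source URLs
# the user wants the newspaper to pull from.
SOURCES = {
    "sport": ["https://www.bbc.com/sport", "https://www.espncricinfo.com"],
    "tech": ["https://news.ycombinator.com"],
    "science": ["https://www.nature.com"],
}
```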
With the sources clearly defined, there are broadly three tasks in our newspaper application -
1. Gathering - Scrape articles based on the current date and topic.
2. Scoring - The content gathered needs to be prioritised based on rules the user gives, feedback from the user, as well as the 'significance' of the news.
3. Aggregating - Based on the sources provided, the articles need to be aggregated and summarized, handling any contradictory information gracefully.
### Gathering
As you may have realised from the outline, I never really got past gathering, but I do think this is the "meat" of the problem and the part that needs solving.
For gathering as well, there were two main tasks -
1. Get the listing pages based on the source. For example, for sports news on the BBC you would want to get to this URL - https://www.bbc.com/sport
2. Get the article links from the category pages.
#### Non-Agentic Scraper Approach
The first approach I attempted for gathering was to directly scrape sources based on the URLs provided by the user. This was a completely non-agentic approach, relying purely on code using the Beautiful Soup library. The website is scraped and the articles are later filtered by date and category. I used Claude Code to generate a few different scrapers and tested them out with the BBC, NY Times and The Hindu websites.
I initially tried to brute force the process, but the site structures were too different, so I switched to breaking the scraping process down into steps. I assumed that the category page would be available as a URL on the home page. I used LLMs to check the match between the category provided and the listing pages, and also to check whether article pages were truly article pages.
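To give a flavour of the pure-code part of this approach, here's a rough sketch along the lines of what the generated scrapers did, assuming `requests` and Beautiful Soup; the URL heuristic is illustrative, not the exact rule from my scrapers -

```python
# Non-agentic scraping sketch: fetch a listing page and keep links that
# look like article URLs.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_candidate_article_links(listing_url: str) -> list[str]:
    """Scrape a listing page and return absolute links that look like articles."""
    html = requests.get(
        listing_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"}
    ).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        href = urljoin(listing_url, anchor["href"])
        # Crude filter: article URLs on news sites often contain a date
        # or an "articles"/"news" path segment.
        if any(token in href for token in ("/articles/", "/news/", "/2025/")):
            links.add(href)
    return sorted(links)

print(get_candidate_article_links("https://www.bbc.com/sport"))
```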
##### Challenges Faced
- This approach completely loses the information in the layout of the website and the prominence the source has given to a particular article.
- I ended up with way too many links to make sense of.
- There were multiple points of failure, and finding the category page often proved challenging, particularly for sources like HackerNews where the home page itself is the category page.
#### Agentic Web Search Approach
The second approach I looked at was to leverage search engines. The idea was to get to the category page using OpenAI's web search tools. I asked OpenAI to find articles based on the category and source. I used the evaluator-optimiser pattern, with one agent performing the search while another responded to it and suggested approaches it could take.
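A compressed sketch of that loop is below, assuming the OpenAI Responses API and its built-in web search tool; the tool and model names may differ from what I actually ran, and the prompts are heavily abbreviated -

```python
# Evaluator-optimiser sketch: one call performs the web search, another
# critiques the result and suggests a refined instruction.
from openai import OpenAI

client = OpenAI()

def search_articles(source: str, category: str, max_rounds: int = 3) -> str:
    query = f"Find today's {category} articles from {source} and list their URLs."
    result = ""
    for _ in range(max_rounds):
        # "Optimiser": performs the search with the current instruction.
        searcher = client.responses.create(
            model="gpt-4.1",
            tools=[{"type": "web_search_preview"}],
            input=query,
        )
        result = searcher.output_text
        # "Evaluator": checks the result and proposes a better instruction.
        evaluator = client.responses.create(
            model="gpt-4.1",
            input=(
                f"You are reviewing article links for '{category}' from {source}.\n"
                f"Results:\n{result}\n"
                "Reply DONE if these look like real, current article URLs from "
                "that source; otherwise suggest a better search instruction."
            ),
        )
        feedback = evaluator.output_text
        if feedback.strip().upper().startswith("DONE"):
            break
        query = feedback
    return result
```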
##### Challenges Faced
- This was far too slow, and the sites were often blocked to GPT. The error rate was too high and it seemed far too inefficient a method. The agents often gravitated towards very convoluted logic and complex approaches.
- The costs also ballooned very quickly. I often needed to use reasoning models, and this proved to be too expensive.
#### Search Engine Approach
The third approach I tried was to see what I could get out of the Google and Exa APIs. I queried the search engines, including the source URL alongside what I was looking for. I thought search engines could be great for solving part 1 of my gathering problem in particular.
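As an example of the kind of query I ran, here's a sketch using Exa's Python SDK (`exa_py`); the parameter names follow my memory of the SDK and may need adjusting, and the date filter is illustrative -

```python
# Search-engine sketch: restrict results to the user's source domain and
# to articles published today.
from datetime import date
from exa_py import Exa

exa = Exa(api_key="YOUR_EXA_API_KEY")  # placeholder key

results = exa.search(
    "latest sport news",
    include_domains=["bbc.com"],                    # restrict to the source
    start_published_date=date.today().isoformat(),  # today's articles only
    num_results=10,
)
for result in results.results:
    print(result.title, result.url)
```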
##### Challenges Faced
- When I included URLs, the results were very sparse, perhaps because recent articles hadn't been indexed yet.
- The other issue is that the results are largely dependent on the search engine's interpretation of the query.
- It was hard to discern whether we were solving both part 1 and part 2 of the gathering problem or only one of them.
#### Scraping Approach with MCP
I realised that my whole idea of tackling gathering in two parts was a mistake, and that we would probably want to do away with part 1 altogether. Part 1 is a compile-time problem - humans can make that decision when setting the sources.
To create a more robust system than my earlier scraping approach, I decided to combine the agentic capability of making decisions on the fly with basic scraping by creating an MCP server with some tools. With this approach we're also able to understand the context and position of URLs on the page.
These are the tools I gave the agent -
1. Open URL - Open a URL in the browser and return the page content. This is the first step before any scraping or navigation.
2. Click Element - Click on an element on the current page. Can click by CSS selector or by text content.
3. Scrape Content - Scrape and extract content from the current page. Returns text content and publish date (if available), optionally filtered by selector.
4. Get Links - Get all links from the current page, optionally filtered by pattern or selector.
5. Get Page Info - Get information about the current page including URL, title, and basic metadata.
6. Screenshot - Take a screenshot of the current page. Useful for debugging or visual inspection.
7. Get Link URL - Get the URL of a link by its text content. Returns the href without clicking or navigating.
8. Get Links with Context - Get links with rich context including position, styling, and prominence indicators. Use this to identify featured/prominent articles.
9. Analyze Page Structure - Analyze the page structure to identify main content areas, navigation, and featured sections. Useful for understanding page layout.
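Here's a minimal sketch of what such a server can look like, assuming the official `mcp` Python SDK (FastMCP) and Playwright; the tool names and signatures below are trimmed-down stand-ins for the ones listed above, not the exact implementations from my repo -

```python
# Minimal MCP server exposing browser-backed scraping tools to an agent.
from mcp.server.fastmcp import FastMCP
from playwright.async_api import async_playwright, Page

mcp = FastMCP("newspaper-scraper")

_page: Page | None = None  # shared page so tools operate on "the current page"

async def _current_page() -> Page:
    global _page
    if _page is None:
        pw = await async_playwright().start()
        browser = await pw.chromium.launch(headless=True)
        _page = await browser.new_page()
    return _page

@mcp.tool()
async def open_url(url: str) -> str:
    """Open a URL in the browser and return the page title."""
    page = await _current_page()
    await page.goto(url, wait_until="domcontentloaded")
    return await page.title()

@mcp.tool()
async def get_links(pattern: str = "") -> list[dict]:
    """Get all links from the current page, optionally filtered by a substring."""
    page = await _current_page()
    anchors = await page.eval_on_selector_all(
        "a[href]",
        "els => els.map(a => ({text: a.innerText.trim(), href: a.href}))",
    )
    return [a for a in anchors if pattern in a["href"]]

if __name__ == "__main__":
    mcp.run()
```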
The agent primarily relied on `Get Links` and `Get Links with Context`, with some limited usage of `Analyze Page Structure` and `Get Link URL`. I tried this approach with both GPT and Claude and set a limit of 30 tool calls. I started with 10 articles for each category page but decided to increase that to 20 and lean more on the scoring step.
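The 30-call cap was enforced in the agent loop itself. Below is a sketch of that loop using OpenAI's chat completions with function calling, assuming the MCP tools have already been wrapped as a `tools_schema` plus a `call_tool` dispatcher (both hypothetical here); the real run spoke MCP directly -

```python
# Agent loop sketch: keep calling the model until it answers or the tool
# call budget (30) is exhausted.
import json
from openai import OpenAI

client = OpenAI()
MAX_TOOL_CALLS = 30

def run_agent(source_url: str, category: str, tools_schema: list, call_tool) -> str:
    messages = [
        {"role": "system", "content": "You gather article links for a personal newspaper."},
        {"role": "user", "content": f"Find up to 20 {category} articles from {source_url}."},
    ]
    calls = 0
    while calls < MAX_TOOL_CALLS:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools_schema
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the model is done and returned its article list
        messages.append(msg)
        for tool_call in msg.tool_calls:
            calls += 1
            result = call_tool(tool_call.function.name, json.loads(tool_call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}
            )
    return "Tool call budget exhausted."
```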
##### Results
The results were actually quite promising and I feel pretty good about this approach. For some links in particular, the agent performed exactly as I'd have liked. Sample results for a few links and sources can be found in the [GitHub](https://github.com/vigneshprasad/personal-newspaper) repository. I tried to include a wide variety of link types in this run.
##### Challenges Faced
- There's some difficulty in making sure we don't get blocked by websites.
- The prompts need to be tuned quite well so that the best articles are found, and not just random articles.
- Oftentimes the agents tended to return other aggregate pages. Perhaps this is something that can be handled in later steps.
### Final Thoughts
The main challenge with all the approaches is that they don't inspire confidence that all articles are being evaluated. While the approach works, it will need some time and effort to be comprehensive. The idea behind building this was to help me stop going down internet rabbit holes first thing in the morning, but I don't think it's there yet.
I had a lot of fun working on this but will take a pause here and get back to this project at a later point in time. The code can be found on GitHub here - https://github.com/vigneshprasad/personal-newspaper. Feel free to fork, contribute or do whatever you please.
<div class="subscribe-box"> <p>Get notified of new essays and receive an email whenever I publish something new.</p>
<form action="https://buttondown.com/api/emails/embed-subscribe/vignesh-prasad" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/vigneshprasad', 'popupwindow')" class="embeddable-buttondown-form" > <label for="bd-email">Enter your email</label> <input type="email" name="email" id="bd-email" /> <input type="submit" value="Subscribe" /> </form></div>