### Overview

Broadly, there are three tasks in our Newspaper application:

1. Gathering - there are two cases:
    1. If the user has provided sources - scrape articles from those sources based on the current date and topic.
    2. If no sources are provided - browse the web and scrape articles from the top two or three results.
2. Scoring - the gathered content needs to be prioritised based on the rules the user gives, the user's feedback, and the 'significance' of the news.
3. Aggregating - the articles need to be aggregated across the sources provided and summarised, handling any contradictory information gracefully.

---

### Gathering

- Input: URL, category, date
- Output: list of `{ url, score, date, summary }` (a sketch of this record type appears at the end of this section)

#### Non Agentic Scraper Approach

The first approach we will attempt for gathering is to directly scrape the sources at the URLs provided by users. This is a completely non agentic approach that relies purely on code: the website is scraped, and the articles are later filtered by date and category.

I used Claude Code to generate a few different scrapers and tested them against the BBC, NY Times, and The Hindu websites. I initially tried to brute force the process, but the site structures were too different, so I switched to breaking the scraping down into two stages. The first stage finds the listing pages that match the provided category; the second extracts article pages from those listing pages. We will use LLMs both to check how well each listing page matches the category and to check whether candidate article pages are truly article pages (see the pipeline sketch at the end of this section).

##### Challenges Faced

- Getting sites to allow bot access is difficult; many sites blocked me outright.
- This approach completely loses the information carried by the site's layout, i.e. the prominence the source has given to a particular article.

#### Web Search Approach

The second approach is to leverage search engines. We will use the OpenAI web search tools and ask them to find articles based on the category and source. We will use the evaluator and optimiser pattern: one agent performs the search while another critiques its results and suggests approaches it can take (see the loop sketch at the end of this section).

##### Challenges Faced

- This is far too slow, and many sites are blocked to GPT. The error rate is too high, and the method is far too inefficient overall. The agents also tend to gravitate towards very convoluted logic and overly complex approaches.

#### Search Engine Approach

The third approach is to see what we can get out of the Google APIs. We will use the Google search API to reach our aggregator pages and then use a scraper to go over the list of results.
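Here is a minimal sketch of that search step, assuming the Google Custom Search JSON API; the API key and search-engine ID are placeholders, and the query format is only one guess at how to combine category and source.

```python
import requests

API_KEY = "YOUR_API_KEY"      # placeholder credential
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder Programmable Search Engine ID

def search_aggregator_pages(category: str, source: str, num: int = 5) -> list[dict]:
    """Query Google for likely aggregator/listing pages for a category on a source."""
    params = {
        "key": API_KEY,
        "cx": ENGINE_ID,
        "q": f"{category} news site:{source}",  # assumed query shape
        "num": num,
    }
    resp = requests.get("https://www.googleapis.com/customsearch/v1",
                        params=params, timeout=10)
    resp.raise_for_status()
    items = resp.json().get("items", [])
    # Each result carries a title, URL, and snippet; the scraper takes over from here.
    return [{"title": i["title"], "url": i["link"], "snippet": i.get("snippet", "")}
            for i in items]
```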
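For reference, the record type promised under Gathering: the field names come from the spec above, while the concrete types are my assumptions.

```python
from dataclasses import dataclass
import datetime

@dataclass
class GatheredArticle:
    url: str
    score: float         # assigned later by the scoring stage
    date: datetime.date  # publication date
    summary: str         # short summary consumed by the aggregation stage

# Gathering then has the shape: (source URL, category, date) -> list of records.
def gather(source_url: str, category: str, on_date: datetime.date) -> list[GatheredArticle]:
    ...
```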
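A minimal sketch of the two-stage pipeline from the Non Agentic Scraper Approach, assuming `requests` and BeautifulSoup; `llm_yes_no` is a hypothetical helper standing in for the LLM checks and would be wired to whichever model you use.

```python
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def llm_yes_no(question: str) -> bool:
    """Hypothetical helper: ask an LLM a yes/no question about a page or link."""
    raise NotImplementedError  # wire this to your LLM of choice

def page_links(url: str) -> dict[str, str]:
    """Fetch a page and return {absolute URL: link text} for every anchor."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return {urljoin(url, a["href"]): a.get_text(strip=True)
            for a in soup.find_all("a", href=True)}

def find_listing_pages(home_url: str, category: str) -> list[str]:
    """Stage 1: keep homepage links an LLM judges to match the category."""
    return [url for url, text in page_links(home_url).items()
            if llm_yes_no(f"Is '{text}' ({url}) a listing page for {category} news?")]

def find_article_pages(listing_url: str) -> list[str]:
    """Stage 2: keep listing-page links an LLM judges to be real articles."""
    return [url for url in page_links(listing_url)
            if llm_yes_no(f"Does {url} look like a news article page?")]
```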
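And a rough sketch of the evaluator and optimiser loop from the Web Search Approach, assuming the official OpenAI Python client. `run_search_agent` is a placeholder for the searching agent (a model call with the web search tool enabled, whose exact interface I am not pinning down here), and the model name is an assumption; the loop is capped at `max_rounds` so a stubborn evaluator cannot spin forever.

```python
from openai import OpenAI

client = OpenAI()

def run_search_agent(instructions: str) -> str:
    """Placeholder for the searching agent: a model call with the web search
    tool enabled, returning found article URLs and notes as text."""
    raise NotImplementedError

def evaluate(results: str, category: str, source: str) -> str:
    """Evaluator: critique the search results and either accept them or
    suggest a better strategy for the next round."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; substitute as needed
        messages=[{
            "role": "user",
            "content": (
                f"We want recent {category} articles from {source}.\n"
                f"Search results so far:\n{results}\n"
                "Reply DONE if these are adequate; otherwise suggest a "
                "more effective search strategy in one short paragraph."
            ),
        }],
    )
    return resp.choices[0].message.content

def gather_via_search(category: str, source: str, max_rounds: int = 3) -> str:
    """Optimiser loop: search, get feedback, fold the feedback back into the
    instructions, and stop on DONE or after max_rounds."""
    instructions = f"Find today's {category} articles from {source}."
    results = ""
    for _ in range(max_rounds):
        results = run_search_agent(instructions)
        feedback = evaluate(results, category, source)
        if feedback.strip().upper().startswith("DONE"):
            break
        instructions += f"\nEvaluator feedback: {feedback}"
    return results
```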