Attempt to directly establish a RAG system using CrewAI's web scraping tool
-
Today, I tried to build a RAG system with CrewAI's ScrapeWebsiteTool. Given a URL, the tool scrapes the page content, and the agent answers directly from that content without storing it.
Using the GameFi consulting section as an example, this takes two agents:
- GameLinkSearchAgent: Uses ScrapeWebsiteTool on the game listing pages it is given to find links to the detailed game pages.
- GameDetailAgent: Uses ScrapeWebsiteTool to search the specified webpage for the answer to the user's query and organize the response.
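The data flow between the two agents can be sketched in plain Python. This is not the CrewAI code itself, only a minimal illustration of the two steps (find the detail link, then scrape that page as context for the model). The `fetch` callable, the `/game/` link pattern, the `example.com` pages, and all function names are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TextCollector(HTMLParser):
    """Collects the visible text of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def find_detail_links(listing_url: str, fetch: Callable[[str], str],
                      marker: str = "/game/") -> List[str]:
    """Step 1 (GameLinkSearchAgent role): listing page -> detail links."""
    parser = LinkCollector()
    parser.feed(fetch(listing_url))
    parser.close()
    # The "/game/" marker is a hypothetical URL pattern for detail pages.
    return [urljoin(listing_url, h) for h in parser.links if marker in h]


def scrape_text(url: str, fetch: Callable[[str], str]) -> str:
    """Step 2 (GameDetailAgent role): detail page -> plain text context."""
    parser = TextCollector()
    parser.feed(fetch(url))
    parser.close()
    return " ".join(parser.chunks)


def build_context(listing_url: str, game_slug: str,
                  fetch: Callable[[str], str]) -> str:
    """Runs both steps; returns "" when no matching detail page is found."""
    for link in find_detail_links(listing_url, fetch):
        if game_slug.lower() in link.lower():
            return scrape_text(link, fetch)
    return ""


# Made-up pages standing in for a real site, so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/games":
        '<a href="/game/demo-quest">Demo Quest</a> <a href="/about">About</a>',
    "https://example.com/game/demo-quest":
        "<h1>Demo Quest</h1><p>Sample description</p>",
}


def fake_fetch(url: str) -> str:
    return FAKE_SITE[url]


context = build_context("https://example.com/games", "demo-quest", fake_fetch)
```

In the real setup, both steps fetch pages over the network at query time and the model then reads `context`, which is where the latency discussed later comes from.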
Comparison with a regular RAG system:
- Regular RAG system:
  - Offline work: Scrape content and store it in Vector DB.
  - Online service: User query > Vector search > Document + AI analysis > Response.
- Online crawler RAG system:
  - Offline work: None.
  - Online service: User query > GameLinkSearchAgent > Link + GameDetailAgent > Response.
The online crawler RAG system is clearly simpler, since it eliminates the offline work of building a vector database. However, it has a significant drawback: responses are far too slow.
The following two images show the time comparison between the online crawler RAG system I tried and the regular RAG system I previously built. The average response time for the regular system is about 4 seconds, while the online crawler RAG system takes over 30 seconds.

The reason is easy to understand: live scraping takes time. In contrast, vector search in a vector database is very fast. It only requires computing the similarity between the query and each stored document, which can be done in parallel, followed by a quick sort of the scores. Compared with finding content on a live webpage, this saves significant time. Additionally, the documents in the vector database are already parsed, so no time is spent on parsing at query time.
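To make "similarity computation followed by sorting" concrete, here is a minimal sketch of a brute-force vector search using cosine similarity. The metric, the tiny two-dimensional vectors, and the document IDs are assumptions for illustration; a real vector database typically uses model-generated embeddings and an approximate index rather than a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float], docs: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Scores every stored vector against the query, then sorts by score."""
    # Each score is independent of the others, so this step parallelizes well.
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" chosen by hand, not produced by any model.
stored = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
results = top_k([1.0, 0.0], stored, k=2)
```

Here `results` ranks `doc_a` first and `doc_c` second. No network request or HTML parsing happens at query time, which is why this path is so much faster than live scraping.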
Completely eliminating the offline crawling work is still unrealistic in actual development, but that does not mean CrewAI is useless here. CrewAI can be used in two scenarios:
- Offline crawling + storing in vector database: Use CrewAI for the offline crawl, parse the scraped content, and store it in a vector database. This makes the crawling work faster to develop and does not affect retrieval speed for the online service.
- Combining online crawler with RAG system: A vector database can store a large number of documents, but its coverage is limited and it cannot hold all information, especially the latest news. Combining the online crawler setup with the RAG system handles the latest content that has not yet been stored in the vector database, using the crawler's real-time nature to compensate for the vector database's limitations.
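One possible way to wire up the second scenario is to try the vector database first and fall back to the live crawler only when nothing relevant is stored. The sketch below assumes a relevance threshold (`min_score=0.75`, an arbitrary value) and two hypothetical callables, `vector_search` and `live_scrape`, standing in for the existing RAG retrieval and the online crawler agents.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (document text, similarity score)


def get_context(query: str,
                vector_search: Callable[[str], List[Hit]],
                live_scrape: Callable[[str], str],
                min_score: float = 0.75) -> Tuple[str, List[str]]:
    """Returns (source, documents) to pass to the model as context."""
    hits = vector_search(query)
    relevant = [text for text, score in hits if score >= min_score]
    if relevant:
        # Fast path: the content is already parsed and stored.
        return "vector_db", relevant
    # Slow path: nothing relevant is stored yet, so scrape at query time.
    return "live_scrape", [live_scrape(query)]


# Stub implementations so the sketch runs on its own.
def stored_search(query: str) -> List[Hit]:
    return [("stored document", 0.9)]


def stale_search(query: str) -> List[Hit]:
    return [("unrelated old document", 0.2)]


def stub_scrape(query: str) -> str:
    return "freshly scraped page text"


fast = get_context("known topic", stored_search, stub_scrape)
slow = get_context("latest news", stale_search, stub_scrape)
```

With this split, most queries keep the fast response time, and only queries about content missing from the vector database pay the scraping cost. The scraped text could also be written back to the vector database afterwards, though that step is not shown here.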