My question is: can we scrape 2 websites and use both of them for 1 chat agent? What is the limit on using the scraped websites (2 or more) as a KB for the chat agent?
What are the limitations of scraping a website, and what is the cost of it?
I know it depends on the size of the website, but isn't there some sort of limit we have?
To put it in an example: I have 2 websites from 1 organisation, one related to finance and one related to HR.
So from a business perspective, if we are using it internally, the finance users should only get information from the finance website; they should not be able to get info about HR or from the HR website.
So basically, are we able to create a wall / boundary between the crawled websites within 1 agent?
And also, what is the cost for scraping? I understand it depends on the website, but there would be a standard price, right?
Hi @Irene_Sibi,
To my understanding, there would not be an issue scraping 2 websites; however, I don't believe the agents currently have a built-in function that can create a barrier between the ingested information for specific users.
Instead, I would suggest setting up separate Spaces, one for each department, and then having one crawler for each so that it only has access to the proper information.
In terms of web scraper pricing, there is little documentation that clearly outlines it; it may have more to do with the storage of the information you're scraping, but you can review the attached documentation for the additional information that's available.
Hi @Irene_Sibi, great questions. Let me try to address them all:
Can we scrape 2 websites and use both of those for 1 chat agent?
Yes. You can use the web crawler integration.
What is the limit of using the scraped websites (2 or more) as a KB for the chat agent? What are the limitations of scraping the website?
The limits are documented in the docs above. In Quick you have three ways to browse the web:
Through a UI agent in Quick Automate or Quick Flows. This is agentic browsing and works great for finding content on a page when you don't know where the information is, e.g. "Go to , find where x is, and tell me about x" / "Go to the site, fill out the survey with information <x, y, z>, and click submit."
Through the web search capabilities built into the service. This is simple to use and is great for live / shallow research.
Through a web crawler integration. This will index the entire page and save the context for vectorized search in the Quick Index. This is great if you are doing offline analysis, utilizing it in Quick research, etc.
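To make "vectorized search" a bit more concrete, here is a minimal, generic sketch of embedding-based retrieval over crawled page text. This is not how the Quick Index is implemented internally; the library, model name, and chunking here are my own assumptions, purely to illustrate the idea of storing extracted text as vectors and searching by similarity.

```python
# Generic sketch of vectorized search over crawled text chunks.
# NOT the Quick Index implementation -- just an illustration using the open
# sentence-transformers library and cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Pretend these chunks came out of the web crawler's text extraction.
chunks = [
    "Expense reports must be submitted by the 5th of each month.",
    "New employees complete HR onboarding within their first week.",
    "Quarterly budgets are reviewed by the finance controller.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")          # model choice is an assumption
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    """Return the chunks most similar to the query by cosine similarity."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector                # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]) for i in best]

for score, text in search("When are expense reports due?"):
    print(f"{score:.2f}  {text}")
```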
All three above have limitations based on several factors like:
robots.txt (can you visit this page as a bot, or are you restricted from doing so? See the sketch after this list)
ethical reasons, e.g. does this violate the site's terms of service, or are you deliberately stealing IP?
CDNs like Cloudflare / Akamai often block bots from accessing the site. You will often run into reCAPTCHAs and other mechanisms that prevent you from crawling the data.
authentication - is the site behind a login, and does it require MFA? With web search you will not be able to log in. With the UI agent / web crawler you can access some sites behind authentication; however, you should be cautious as to how you provide the user / pass. MFA (AFAIK) does not work.
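On the robots.txt point: before crawling, it's worth checking whether the site actually allows bots to fetch the paths you care about. A quick sketch using Python's standard-library urllib.robotparser (the site, paths, and user-agent string below are placeholders):

```python
# Quick robots.txt check before crawling -- standard library only.
from urllib.robotparser import RobotFileParser

site = "https://example.com"          # placeholder site
user_agent = "MyCrawlerBot"           # placeholder user-agent string

rp = RobotFileParser()
rp.set_url(f"{site}/robots.txt")
rp.read()                              # fetches and parses robots.txt

for path in ["/finance/reports", "/hr/policies"]:
    allowed = rp.can_fetch(user_agent, f"{site}{path}")
    print(f"{path}: {'allowed' if allowed else 'blocked'} for {user_agent}")

# If the site declares a crawl-delay, it is worth respecting it too.
print("crawl-delay:", rp.crawl_delay(user_agent))
```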
What is the cost of it?
UI agent: it's based on agent hours.
Web search comes with your subscription for free.
Web crawler (integration) is based on index pricing, i.e. costs based on the extracted text (a rough sizing sketch below).
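Since index pricing is driven by the volume of extracted text, one way to get a rough feel for cost before crawling is to measure how much plain text a few representative pages actually contain. A hedged sketch using requests and BeautifulSoup; the URLs and the idea of sampling a handful of pages are my own assumptions, and the per-unit price would come from your own pricing docs:

```python
# Rough estimate of extracted-text volume for a handful of pages.
# The URLs are placeholders; check the pricing docs for how extracted text is metered.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

pages = [
    "https://example.com/finance/overview",   # placeholder URLs
    "https://example.com/hr/handbook",
]

total_chars = 0
for url in pages:
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    total_chars += len(text)
    print(f"{url}: ~{len(text):,} characters of extracted text")

# Scale the sample up to the whole site to approximate what the index would ingest.
print(f"Sampled total: ~{total_chars / 1024:.1f} KB of plain text")
```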
From a business perspective, if we are using it internally, the finance users should only get information from the finance website; they should not be able to get info about HR or from the HR website.
If you truly need separation of concerns, you can create a couple of knowledge bases that are each centric to one topic. This is a best practice and helps narrow down agent searches. You can still give the finance team and the HR team access to the same agent, but they would have access to a space with separate knowledge bases for HR / finance. Each knowledge base has its own permissions, so the agent will not search across permission boundaries.
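To illustrate the idea (this is purely conceptual, not an actual Quick API): retrieval is restricted to the knowledge bases the current user's group is permitted to read, so finance users never hit the HR index.

```python
# Conceptual sketch of permission-scoped knowledge-base routing.
# None of these names are real Quick APIs; they only illustrate the idea that
# the agent searches just the knowledge bases a user's group can access.
KB_PERMISSIONS = {
    "finance-kb": {"finance-team"},
    "hr-kb": {"hr-team"},
}

def knowledge_bases_for(user_groups: set[str]) -> list[str]:
    """Return only the knowledge bases the user's groups are allowed to read."""
    return [kb for kb, allowed in KB_PERMISSIONS.items() if allowed & user_groups]

def answer(question: str, user_groups: set[str]) -> str:
    kbs = knowledge_bases_for(user_groups)
    if not kbs:
        return "No accessible knowledge bases for this user."
    # A real agent would run vector search over just these knowledge bases.
    return f"Searching {kbs} for: {question!r}"

print(answer("What is the travel expense limit?", {"finance-team"}))
print(answer("What is the parental leave policy?", {"finance-team"}))  # HR KB is never searched
```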
Here’s a good experiment to try:
Try indexing these pages:
Then ask your questions in chat with the knowledge base selected. You can then formulate a cost analysis based on the available data. Also FYI, Quick has an internal knowledge base that already has that data available; I'm just proposing this as an experiment.
Hi @Jedl, first query:
As you said, to create barriers we can use separate knowledge bases instead of one. But will that affect the cost? Even if we do it under one, we should be able to customise it, right?
And also, of the three options you have mentioned,
I am going with the web crawler because my plan is to deploy it as an agent.
But thanks for the experiment idea.
Just to also confirm, I was told that the scraped knowledge is stored internally in Amazon Q. Is it stored in vector format?
And do we have access to move the vectorised data into a new space that we specifically create to store it?
And, for example, if we are scraping a site and it has a login, how can we scrape the details from it?
I haven't tried increasing it yet, but in the experiment I did, I used a scraping depth of 1. Maybe because of that, it scrapes everything else fine but does not scrape the content inside a specific PDF; it just returns the PDF. If we want it to scrape the content of that PDF too, will it do so if I increase the depth? (A small sketch below of what I mean by "the content of the PDF".)
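For reference, outside the crawler I can pull the PDF text myself with something like the following; the pypdf library and the URL here are my own assumptions, just to show what I'm hoping the crawler can do for linked PDFs:

```python
# What I mean by "scraping the content of the PDF":
# fetch the linked PDF and extract its text, rather than just returning the link.
from io import BytesIO

import requests
from pypdf import PdfReader  # pip install pypdf

pdf_url = "https://example.com/docs/policy.pdf"   # placeholder link found on the crawled page

response = requests.get(pdf_url, timeout=30)
reader = PdfReader(BytesIO(response.content))

# Concatenate the text of every page; extract_text() can return None for image-only pages.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"{len(reader.pages)} pages, ~{len(text):,} characters of text")
print(text[:500])  # first part of the extracted content
```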
And I also have a doubt regarding incremental sync. In Quick Suite we have daily / monthly incremental sync. Suppose we do a daily sync, and say it syncs daily at 10:30 am.
Suppose you synced on Monday at 10:30 am, and on that day, around 12 noon, an update happened and some of the links changed. Before Tuesday 10:30 am comes, if a person asks the agent for that link, it will deliver the old link, right? How do we solve that gap?