Publication date: 27-08-2015
Number: US20150242508A1
Systems and methods for scheduling documents for crawling are disclosed, in which sitemap information is updated for a first website identified by a sitemap by downloading updated sitemap information for the first website, and documents are scheduled for crawling in accordance with the updated sitemap information for the first website. The sitemap information includes one or more sitemap indexes. Each sitemap index includes a list of URLs corresponding to documents stored at a corresponding website in a plurality of websites, the plurality of websites including the first website. Each sitemap index also includes information identifying one or more of: a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a document title, an authority of the document, and a priority of the document.

1. A method of scheduling documents for crawling, performed on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
identifying an updated sitemap using a last change date in a sitemap index, the sitemap index including a list of sitemaps for a website, each sitemap having a URL and a last change date;
updating sitemap information for the sitemap by downloading updated sitemap information, wherein the sitemap information includes a list of URLs corresponding to documents stored at the website and each URL is associated with two or more of: a last modification date for the URL, a change frequency of a document specified by the URL, and a priority of the document; and
scheduling documents for crawling in accordance with the updated sitemap information.

2. (canceled)

3. (canceled)

4. The method of claim 1, wherein the sitemap is generated using (i) an access log, (ii) a pre-made URL list, and (iii) information obtained from a content ...
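
The three steps of claim 1 (identify updated sitemaps from the index, download them, schedule their URLs) can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the sitemap index and sitemaps use the public sitemaps.org XML format, that dates are plain `YYYY-MM-DD` strings, and that scheduling means ordering by priority and then by most recent modification. The function names, the `fetch` callable, the `last_seen` mapping, and the example.com URLs are all hypothetical.

```python
import heapq
import xml.etree.ElementTree as ET
from datetime import date

# Namespace of the public sitemaps.org schema (assumed format for this sketch).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_updated_sitemaps(index_xml, last_seen):
    """Return URLs of sitemaps whose last change date is newer than the one recorded."""
    updated = []
    for node in ET.fromstring(index_xml).findall("sm:sitemap", NS):
        loc = node.findtext("sm:loc", namespaces=NS).strip()
        lastmod = date.fromisoformat(node.findtext("sm:lastmod", namespaces=NS).strip())
        if loc not in last_seen or lastmod > last_seen[loc]:
            updated.append(loc)
    return updated


def schedule_urls(sitemap_xml):
    """Order URLs for crawling: highest priority first, then most recently modified.

    This ordering is an assumed policy chosen for illustration only.
    """
    heap = []
    for node in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = node.findtext("sm:loc", namespaces=NS).strip()
        priority = float(node.findtext("sm:priority", default="0.5", namespaces=NS))
        lastmod = date.fromisoformat(
            node.findtext("sm:lastmod", default="1970-01-01", namespaces=NS).strip()
        )
        # heapq is a min-heap, so negate both keys to pop the largest first.
        heapq.heappush(heap, (-priority, -lastmod.toordinal(), loc))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def crawl_schedule(index_xml, last_seen, fetch):
    """Run all three steps; `fetch` is a hypothetical callable mapping URL -> XML text."""
    schedule = []
    for sitemap_url in find_updated_sitemaps(index_xml, last_seen):
        schedule.extend(schedule_urls(fetch(sitemap_url)))
    return schedule


# Made-up sample data standing in for real downloads.
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/a.xml</loc><lastmod>2015-08-20</lastmod></sitemap>
  <sitemap><loc>http://example.com/b.xml</loc><lastmod>2015-01-01</lastmod></sitemap>
</sitemapindex>"""

SITEMAP_A = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/news</loc><lastmod>2015-08-19</lastmod><priority>0.8</priority></url>
  <url><loc>http://example.com/about</loc><lastmod>2015-08-01</lastmod><priority>0.3</priority></url>
  <url><loc>http://example.com/home</loc><lastmod>2015-08-20</lastmod><priority>0.8</priority></url>
</urlset>"""

LAST_SEEN = {
    "http://example.com/a.xml": date(2015, 6, 1),
    "http://example.com/b.xml": date(2015, 3, 1),
}
FAKE_WEB = {"http://example.com/a.xml": SITEMAP_A}

if __name__ == "__main__":
    for url in crawl_schedule(INDEX_XML, LAST_SEEN, FAKE_WEB.__getitem__):
        print(url)
```

In this sample only `a.xml` has a last change date newer than the recorded one, so `b.xml` is never fetched. Passing `fetch` as a parameter keeps the sketch runnable without network access; a real crawler would substitute an HTTP client and would likely also use the change frequency field, which is omitted here.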