I tried the following curl command:
curl -H "X-Respond-With: markdown" 'http://127.0.0.1:3000/https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
Below are the logs from the Docker container:
2024-10-13 18:08:01 [Crawler] INFO: Crawl request received for URL: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl method called with request: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 req.headers: {"host":"127.0.0.1:3000","user-agent":"curl/8.7.1","accept":"*/*","x-respond-with":"markdown"}
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Added to circuit breaker hosts: 127.0.0.1
2024-10-13 18:08:01 Cookies: []
2024-10-13 18:08:01 Configured crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [Crawler] INFO: Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Using default scraping method
2024-10-13 18:08:01 Scraping options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Scraping https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf {
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Attempting to set cookies: []
2024-10-13 18:08:01 Formatting snapshot {
2024-10-13 18:08:01 mode: 'markdown',
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Processing HTML content
2024-10-13 18:08:01 Getting Turndown service {
2024-10-13 18:08:01 url: URL {
2024-10-13 18:08:01 href: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 origin: 'https://assets.airtel.in',
2024-10-13 18:08:01 protocol: 'https:',
2024-10-13 18:08:01 username: '',
2024-10-13 18:08:01 password: '',
2024-10-13 18:08:01 host: 'assets.airtel.in',
2024-10-13 18:08:01 hostname: 'assets.airtel.in',
2024-10-13 18:08:01 port: '',
2024-10-13 18:08:01 pathname: '/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 search: '',
2024-10-13 18:08:01 searchParams: URLSearchParams {},
2024-10-13 18:08:01 hash: ''
2024-10-13 18:08:01 },
2024-10-13 18:08:01 imgDataUrlToObjectUrl: true
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Adding Turndown rules
2024-10-13 18:08:01 Adding data-url-to-pseudo-object-url rule
2024-10-13 18:08:01 Turndown service configured
2024-10-13 18:08:01 Skipping parsed content processing
I did not get any response, and the last log line is "Skipping parsed content processing". Am I missing something?
Also, if possible, please add Helm charts showing how to deploy this on a Kubernetes cluster. That would be really helpful.
What should the crawl request look like? It would help if you could add an example to the README.
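For reference, the curl request above can also be written as a small Python sketch, which makes it easier to set a timeout and to see the status code and body length that come back. Only the local endpoint, the target PDF URL, and the `X-Respond-With: markdown` header are taken from my request; the helper names `build_reader_url` and `fetch_markdown` and the 30-second timeout are assumptions for illustration, not part of the project.

```python
import urllib.error
import urllib.request


def build_reader_url(base_url: str, target_url: str) -> str:
    """Append the target URL to the reader endpoint, as in the curl call."""
    return f"{base_url.rstrip('/')}/{target_url}"


def fetch_markdown(base_url: str, target_url: str, timeout: float = 30.0) -> tuple[int, str]:
    """Send the request with the X-Respond-With header; return (status, body).

    The timeout value is an assumption. A timeout or connection failure
    is not caught here and propagates as an exception.
    """
    request = urllib.request.Request(
        build_reader_url(base_url, target_url),
        headers={"X-Respond-With": "markdown"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code and possibly a body.
        return err.code, err.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    status, body = fetch_markdown(
        "http://127.0.0.1:3000",
        "https://assets.airtel.in/teams/simplycms/ADTECH/docs/"
        "Integrated_Report_and_Annual_Financial_Statements.pdf",
    )
    print(status, len(body))
```

Printing the status code and body length should at least show whether the server answers with an empty body or never answers at all.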