Does it support PDF, and how do we crawl a website? #10

Open
rostwal95 opened this issue Oct 13, 2024 · 0 comments


I tried the following curl command:

curl -H "X-Respond-With: markdown" 'http://127.0.0.1:3000/https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
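
For anyone reproducing this from a script, here is a minimal Python sketch of the same request. It only mirrors the curl command above: the target URL is appended to the server address as the path, and the `X-Respond-With` header selects the output format. The function names and the 60-second timeout are my own assumptions, not part of the project.

```python
import urllib.request


def build_reader_request(base_url: str, target_url: str,
                         respond_with: str = "markdown") -> urllib.request.Request:
    """Build the same GET request as the curl command above.

    The target URL is appended verbatim to the server's base URL,
    and the X-Respond-With header asks for the given output format.
    """
    full_url = base_url.rstrip("/") + "/" + target_url
    return urllib.request.Request(full_url, headers={"X-Respond-With": respond_with})


def fetch(request: urllib.request.Request, timeout_s: float = 60.0) -> str:
    """Send the request and return the body as text.

    The timeout is an assumed value so that a hung request fails
    loudly instead of waiting forever.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    req = build_reader_request(
        "http://127.0.0.1:3000",
        "https://assets.airtel.in/teams/simplycms/ADTECH/docs/"
        "Integrated_Report_and_Annual_Financial_Statements.pdf",
    )
    # Print only the start of the body; a timeout or empty output here
    # matches the "no response" behaviour described in this issue.
    print(fetch(req)[:500])
```

With an explicit timeout, a request that never returns raises an error in the client instead of hanging silently.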

Below are the logs from the Docker container:

2024-10-13 18:08:01 [Crawler] INFO: Crawl request received for URL: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl method called with request: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 req.headers: {"host":"127.0.0.1:3000","user-agent":"curl/8.7.1","accept":"*/*","x-respond-with":"markdown"}
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Added to circuit breaker hosts: 127.0.0.1
2024-10-13 18:08:01 Cookies: []
2024-10-13 18:08:01 Configured crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [Crawler] INFO: Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Using default scraping method
2024-10-13 18:08:01 Scraping options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Scraping https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf {
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Attempting to set cookies: []
2024-10-13 18:08:01 Formatting snapshot {
2024-10-13 18:08:01 mode: 'markdown',
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Processing HTML content
2024-10-13 18:08:01 Getting Turndown service {
2024-10-13 18:08:01 url: URL {
2024-10-13 18:08:01 href: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 origin: 'https://assets.airtel.in',
2024-10-13 18:08:01 protocol: 'https:',
2024-10-13 18:08:01 username: '',
2024-10-13 18:08:01 password: '',
2024-10-13 18:08:01 host: 'assets.airtel.in',
2024-10-13 18:08:01 hostname: 'assets.airtel.in',
2024-10-13 18:08:01 port: '',
2024-10-13 18:08:01 pathname: '/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 search: '',
2024-10-13 18:08:01 searchParams: URLSearchParams {},
2024-10-13 18:08:01 hash: ''
2024-10-13 18:08:01 },
2024-10-13 18:08:01 imgDataUrlToObjectUrl: true
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Adding Turndown rules
2024-10-13 18:08:01 Adding data-url-to-pseudo-object-url rule
2024-10-13 18:08:01 Turndown service configured
2024-10-13 18:08:01 Skipping parsed content processing

I did not get any response. Am I missing something?

Also, if possible, please add Helm charts showing how to deploy it on a Kubernetes cluster. That would be really helpful.

What should the crawl request look like? It would help if you could add an example to the README.
