
Firecrawl

firecrawl

Firecrawl automates web crawling and data extraction, enabling organizations to gather content, index sites, and gain insights from online sources at scale.

Actions: 7
Triggers: 0
Authentication
Managed OAuth: No

Technical information: the parameter, schema, and trigger details on this page are intended for integration teams. If you only need to know whether your favorite tool is available, the list of actions is enough.

Available actions (7)

Each action is an operation that the agent can run against this connector. Click an action to see its parameters.

Cancel a crawl job: `FIRECRAWL_CANCEL_A_CRAWL_JOB` (Action)

Cancels an active or queued web crawl job by its `id`. Attempting to cancel a completed, failed, or previously canceled job does not change its state.

Input parameters

  • id (string, required)

    The unique identifier (UUID) of the crawl job to be canceled.

Output parameters

  • data (object, required)
  • error (string)

    Error message, if any occurred during execution of the action.

  • successful (boolean, required)

    Whether the action executed successfully.
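
Because `id` must be a UUID, a caller may want to reject malformed values before invoking the action. The following is a minimal sketch of that check; the helper name `cancel_crawl_arguments` is hypothetical and not part of the connector, and the returned dictionary only mirrors the single input parameter listed above.

```python
import uuid


def cancel_crawl_arguments(job_id: str) -> dict:
    """Build the input for FIRECRAWL_CANCEL_A_CRAWL_JOB (hypothetical helper).

    Raises ValueError if job_id is not a well-formed UUID.
    """
    try:
        parsed = uuid.UUID(job_id)
    except (ValueError, AttributeError, TypeError) as exc:
        raise ValueError(f"not a valid UUID: {job_id!r}") from exc
    # str(UUID) yields the canonical lowercase, hyphenated form.
    return {"id": str(parsed)}
```

Normalizing to the canonical form is a design choice of this sketch; the page does not say whether the connector is case-sensitive.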

Start a web crawl: `FIRECRAWL_CRAWL` (Action)

Starts a Firecrawl web crawl from a given URL, applies the configured filtering and content-extraction rules, and polls until the job is complete. Ensure the URL is accessible and that any regex path patterns are valid.

Input parameters

  • url (string, required)

    The base URL to start crawling from. This is the initial entry point for the web crawler.

  • delay (integer)

    Delay in milliseconds between requests, to avoid overwhelming the server.

  • limit (integer)

    Maximum number of pages to crawl. The crawl stops once this limit is reached. Default is 10.

  • webhook (string)

    An optional webhook URL to receive real-time updates on the crawl job. Events include crawl start (`crawl.started`), page crawled (`crawl.page`), and crawl completion (`crawl.completed` or `crawl.failed`). The payload structure matches the `/scrape` endpoint response.

  • maxDepth (integer)

    Maximum depth of subpages to crawl relative to the base URL. A depth of 0 crawls only the base URL, 1 crawls the base URL and its direct links, and so on. Default is 2.

  • excludePaths (string[])

    A list of regular expression (regex) patterns for URL paths to exclude from the crawl. URLs whose paths match any of these patterns are ignored. For example, `"blog/archive/.*"` excludes all paths under `/blog/archive/`.

  • includePaths (string[])

    A list of regular expression (regex) patterns for URL paths to include in the crawl. Only URLs whose paths match at least one of these patterns are processed. For example, `"products/featured/.*"` includes only paths under `/products/featured/`.

  • ignoreSitemap (boolean)

    If true, the crawler ignores any sitemap.xml found on the website. Defaults to true.

  • maxDiscoveryDepth (integer)

    Maximum depth for discovering new links, separate from the crawling depth.

  • allowBackwardLinks (boolean)

    If true, allows the crawler to navigate to pages that were linked from pages already visited (that is, to navigate "backwards"). Defaults to false.

  • allowExternalLinks (boolean)

    If true, allows the crawler to follow links that lead to external websites (different domains). Defaults to false.

  • scrapeOptions_proxy (string)

    Proxy configuration for requests.

  • scrapeOptions_maxAge (integer)

    Maximum age in seconds for cached content. Content older than this is re-scraped.

  • scrapeOptions_mobile (boolean)

    If true, emulates a mobile device when scraping.

  • ignoreQueryParameters (boolean)

    If true, ignores query parameters when determining whether a URL has already been visited.

  • scrapeOptions_actions (object[])

    List of actions to perform on each page before scraping (e.g., clicking buttons, waiting).

  • scrapeOptions_formats (string[])

    The desired output formats for the scraped content of each page. Default is `["markdown"]`. If `json` is among the formats, `scrapeOptions_jsonOptions` is required.

  • scrapeOptions_headers (object)

    Custom HTTP headers to send with each request.

  • scrapeOptions_timeout (integer)

    Timeout in milliseconds for each page request. Default is 30000 ms (30 seconds).

  • scrapeOptions_waitFor (integer)

    The duration in milliseconds to wait for page JavaScript to execute and content to load before scraping. Useful for pages with dynamically loaded content. Default is 123 ms.

  • scrapeOptions_blockAds (boolean)

    If true, blocks advertisements during scraping.

  • scrapeOptions_location (object)

    Geolocation settings for the scraper.

  • scrapeOptions_parsePDF (boolean)

    If true, attempts to parse PDF files encountered during crawling.

  • scrapeOptions_excludeTags (string[])

    A list of HTML tags to exclude from the scraped output. Content within these tags (and their children) is removed before processing.

  • scrapeOptions_includeTags (string[])

    A list of HTML tags to include in the scraped output. Only content within these tags is processed. If empty or null, all relevant content is considered based on the other options.

  • scrapeOptions_jsonOptions (object)

    Options for JSON format extraction, including schema and prompts.

  • scrapeOptions_storeInCache (boolean)

    If true, stores scraped content in the cache for future use.

  • scrapeOptions_onlyMainContent (boolean)

    If true, attempts to extract only the main content of each page, excluding common elements such as headers, navigation bars, and footers. Default is true.

  • scrapeOptions_removeBase64Images (boolean)

    If true, removes base64-encoded images from the scraped content.

  • scrapeOptions_skipTlsVerification (boolean)

    If true, skips TLS certificate verification.

  • scrapeOptions_changeTrackingOptions (object)

    Options for tracking changes between crawls.
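
Since the action asks callers to make sure `includePaths` and `excludePaths` contain valid regex patterns, it can help to compile them locally and preview which URLs they would keep. The sketch below is only an approximation under stated assumptions: the helper names are hypothetical, it uses Python's `re` syntax (the crawler's regex dialect may differ), and it assumes a pattern may match anywhere in the URL path, which is consistent with the `"blog/archive/.*"` example but is not confirmed by this page.

```python
import re
from urllib.parse import urlparse


def compile_patterns(patterns):
    """Compile each pattern; raise ValueError naming the first invalid one."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            raise ValueError(f"invalid regex {pattern!r}: {exc}") from exc
    return compiled


def would_crawl(url, include_paths=(), exclude_paths=()):
    """Local preview of the include/exclude rules (assumed semantics)."""
    path = urlparse(url).path
    includes = compile_patterns(include_paths)
    excludes = compile_patterns(exclude_paths)
    # Any exclude match drops the URL.
    if any(rx.search(path) for rx in excludes):
        return False
    # When include patterns are given, at least one must match.
    if includes and not any(rx.search(path) for rx in includes):
        return False
    return True
```

For instance, `would_crawl("https://example.com/blog/archive/2020", exclude_paths=["blog/archive/.*"])` evaluates to `False` under these assumptions.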

Output parameters

  • data (object, required)

    A dictionary containing the crawled data. This typically includes a job ID, a status, and an array of page data if the crawl completed successfully. The structure can vary based on the crawl outcome (e.g., success, failure, ongoing).

  • error (string)

    Error message, if any occurred during execution of the action.

  • successful (boolean, required)

    Whether the action executed successfully.
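
Every action on this page returns the same three output fields (`data`, `error`, `successful`), so a caller can handle them uniformly. The following is a minimal sketch of such a wrapper; `ActionError` and `unwrap` are hypothetical names, and the result is assumed to arrive as a plain dictionary with those keys.

```python
class ActionError(RuntimeError):
    """Raised when an action result reports successful == False."""


def unwrap(result: dict):
    """Return result['data'] or raise ActionError carrying the error text."""
    if not result.get("successful", False):
        message = result.get("error") or "action failed without an error message"
        raise ActionError(message)
    return result["data"]
```

A usage example with an illustrative payload: `unwrap({"successful": True, "error": None, "data": {"status": "completed"}})` returns the inner `data` dictionary.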

Extract structured data: `FIRECRAWL_EXTRACT` (Action)

Extracts structured data from web pages by starting an extraction job and polling for completion. At least one of a natural-language `prompt` or a JSON `schema` must be provided.

Input parameters

  • urls (string[], required)

    A list of URLs from which to extract data. Wildcards (e.g., `https://example.com/blog/*`) can be used to crawl multiple pages under a specific path.

  • prompt (string)

    Natural-language query describing the information to extract from the URL content, e.g., 'Extract the company mission, whether it supports SSO, etc.'

  • schema (object)

    JSON object defining the desired structure of the extracted data (e.g., field names and types). It dictates the output format.

  • enable_web_search (boolean)

    If `true`, allows crawling links outside the initial domains in `urls`; if `false`, restricts crawling to the same domains.

Output parameters

  • data (object, required)

    A dictionary containing the structured data extracted from the URLs. Its structure conforms to the provided `schema` or to the LLM's interpretation of the `prompt`.

  • error (string)

    Error message, if any occurred during execution of the action.

  • successful (boolean, required)

    Whether the action executed successfully.
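
The "prompt or schema" requirement can be enforced before the action is called. This is a sketch under assumptions: `extract_arguments` is a hypothetical helper, and it allows `prompt` and `schema` to be sent together, which this page neither confirms nor rules out.

```python
def extract_arguments(urls, prompt=None, schema=None, enable_web_search=None):
    """Build the input for FIRECRAWL_EXTRACT (hypothetical helper)."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if prompt is None and schema is None:
        raise ValueError("provide a prompt, a schema, or both")
    args = {"urls": list(urls)}
    # Only include optional fields the caller actually set.
    if prompt is not None:
        args["prompt"] = prompt
    if schema is not None:
        args["schema"] = schema
    if enable_web_search is not None:
        args["enable_web_search"] = enable_web_search
    return args
```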

Get the status of a crawl job: `FIRECRAWL_GET_THE_STATUS_OF_A_CRAWL_JOB` (Action)

Retrieves the current status, progress, and details of a web crawl job, using the job `id` obtained when the crawl was started.

Input parameters

  • id (string, required)

    Unique identifier (UUID) of the crawl job.

Output parameters

  • data (object, required)

    Details of the crawl job, including `status` (e.g., "scraping", "completed", "failed"), `total` (pages attempted), `completed` (pages crawled successfully), `creditsUsed`, and `expiresAt` (data expiration timestamp).

  • error (string)

    Error message, if any occurred during execution of the action.

  • successful (boolean, required)

    Whether the action executed successfully.
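
The `status`, `total`, and `completed` fields are enough to derive a progress figure and decide whether to keep polling. The sketch below assumes `data` is a dictionary with those keys; the function names are hypothetical, and the set of terminal statuses is limited to the two named on this page, so other terminal values may exist.

```python
TERMINAL_STATUSES = {"completed", "failed"}  # only the values named on this page


def crawl_progress(data: dict) -> float:
    """Fraction of attempted pages crawled so far, in the range 0.0 to 1.0."""
    total = data.get("total") or 0
    completed = data.get("completed") or 0
    return completed / total if total else 0.0


def is_finished(data: dict) -> bool:
    """True once the job reports a terminal status."""
    return data.get("status") in TERMINAL_STATUSES
```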

Map multiple URLs: `FIRECRAWL_MAP_MULTIPLE_URLS_BASED_ON_OPTIONS` (Action)

Maps a website by discovering URLs from a starting base URL, with options to customize the mapping via a search query, subdomain inclusion, sitemap handling, and result limits. Search effectiveness is site-dependent.

Input parameters

  • url (string, uri format, required)

    The primary base URL from which the mapping process begins.

  • limit (integer)

    Maximum number of unique links/pages to discover and return; helps control mapping scope and duration.

  • search (string)

    Optional search query to guide URL mapping, prioritizing or finding specific page types. 'Smart' search is limited to 1000 initial results in Alpha, but overall mapping can exceed this.

  • ignoreSitemap (boolean)

    If true, the crawler ignores sitemap.xml files and relies on page links for discovery.

  • includeSubdomains (boolean)

    If true, includes subdomains of the base URL in the mapping. For example, if `url` is example.com, blog.example.com is also mapped.

Output parameters

  • data (object, required)

    Dictionary containing the URL mapping results, typically a list of discovered URLs or a structured sitemap representation.

  • error (string)

    Error message, if any occurred during execution of the action.

  • successful (boolean, required)

    Whether the action executed successfully.
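
The effect of `includeSubdomains` can be illustrated with a small host comparison. This is only a local approximation of the behavior described above, not the service's actual matching logic; `in_mapping_scope` is a hypothetical name.

```python
from urllib.parse import urlparse


def in_mapping_scope(candidate: str, base_url: str, include_subdomains: bool = False) -> bool:
    """Approximate whether candidate belongs to the site rooted at base_url."""
    base_host = urlparse(base_url).hostname
    if not base_host:
        raise ValueError(f"base_url has no host: {base_url!r}")
    host = urlparse(candidate).hostname or ""
    if host == base_host:
        return True
    # A subdomain ends with ".<base host>"; a bare suffix match is not enough.
    return bool(include_subdomains) and host.endswith("." + base_host)
```

Matching on `"." + base_host` rather than on the bare suffix keeps unrelated hosts such as notexample.com out of scope.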

Scrape URL: `FIRECRAWL_SCRAPE` (Action)

Scrapes a publicly accessible URL to retrieve its content in the specified formats, optionally performing pre-scrape browser actions or extracting structured JSON using an LLM.

Input parameters

  • url (string, required)

    The fully qualified URL of the web page to scrape.

  • actions (object[])

    An optional list of browser actions (e.g., click, type, wait) to perform on the page *before* scraping occurs. Useful for interacting with dynamic content, filling forms, or navigating through page elements.

  • formats (string[])

    A list of desired output formats for the scraped content. Defaults to `["markdown"]`. If `json` is included, `jsonOptions` *must* be provided.

  • timeout (integer)

    Maximum time in milliseconds to wait for the scraping request to complete. Defaults to 30000.

  • waitFor (integer)

    Time in milliseconds to wait for the page to load or for dynamic content to render before starting the scrape. Defaults to 0.

  • location (object)

    Location settings for the request.

  • excludeTags (string[])

    A list of HTML tags to exclude from the output. Content within these tags is removed.

  • includeTags (string[])

    A list of HTML tags to include in the output. Content within these tags is prioritized.

  • jsonOptions (object)

    Options for JSON extraction.

  • onlyMainContent (boolean)

    If true, attempts to extract only the main article content, excluding headers, footers, navigation bars, and ads. Defaults to true.

Output parameters

  • data (object, required)

    A dictionary containing the scraped data. Keys correspond to the requested formats (e.g., 'markdown', 'html', 'json', 'screenshot'), and values are the extracted content or metadata for those formats.

  • error (string)

    Error message, if any occurred during execution of the action.

  • successful (boolean, required)

    Whether the action executed successfully.
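
The rule that `json` in `formats` requires `jsonOptions` is easy to miss, so a caller might check it up front. This is a minimal sketch: `scrape_arguments` is a hypothetical helper, and the `{"prompt": ...}` shape shown for `jsonOptions` in the usage note is an assumption, since this page does not document that object's fields for the scrape action.

```python
def scrape_arguments(url, formats=None, json_options=None, **extra):
    """Build the input for FIRECRAWL_SCRAPE (hypothetical helper).

    extra carries any other documented parameters, e.g. waitFor=500.
    """
    # Mirror the documented default when no formats are given.
    formats = list(formats) if formats is not None else ["markdown"]
    if "json" in formats and json_options is None:
        raise ValueError("jsonOptions is required when 'json' is in formats")
    args = {"url": url, "formats": formats, **extra}
    if json_options is not None:
        args["jsonOptions"] = json_options
    return args
```

For example, `scrape_arguments("https://example.com", formats=["markdown", "json"], json_options={"prompt": "List the product names"})` passes the check, while omitting `json_options` in that call raises `ValueError`.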

Search: `FIRECRAWL_SEARCH` (Action)

Performs a web search for a query, scrapes content from the top search results using Firecrawl, and returns the details in the specified formats.

Input parameters

  • lang (string)

    Language code for search results (e.g., 'en' for English; default 'en').

  • limit (integer)

    Maximum number of search results to return (1-10, default 5).

  • query (string, required)

    The search query to execute.

  • country (string)

    Country code to tailor search results (e.g., 'us' for United States; default 'us').

  • formats (string[])

    Desired output formats for the scraped content of each search result (e.g., 'markdown', 'html'). If omitted, default scraping applies. Available: 'markdown', 'html', 'rawHtml', 'links', 'screenshot', 'screenshot@fullPage'.

  • timeout (integer)

    Maximum time in milliseconds for search and scrape operations (1000-300000, default 60000).

Output parameters

  • data (object[])

    List of search result items, each with details and potentially scraped content.

  • error (string)

    Error message, if any occurred during execution of the action.

  • success (boolean)

    Indicates whether the overall search operation was successful.

  • warning (string)

    Optional warning message about the search operation.

  • successful (boolean, required)

    Whether the action executed successfully.
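
The documented ranges and format names for the search action lend themselves to a client-side check. The sketch below encodes only the limits and defaults stated above; `search_arguments` and `ALLOWED_FORMATS` are hypothetical names, and the connector may apply additional validation of its own.

```python
ALLOWED_FORMATS = {
    "markdown", "html", "rawHtml", "links", "screenshot", "screenshot@fullPage",
}


def search_arguments(query, limit=5, lang="en", country="us", formats=None, timeout=60000):
    """Build the input for FIRECRAWL_SEARCH (hypothetical helper)."""
    if not query or not query.strip():
        raise ValueError("query is required")
    if not 1 <= limit <= 10:
        raise ValueError("limit must be between 1 and 10")
    if not 1000 <= timeout <= 300000:
        raise ValueError("timeout must be between 1000 and 300000 ms")
    unknown = sorted(set(formats or []) - ALLOWED_FORMATS)
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    args = {"query": query, "limit": limit, "lang": lang,
            "country": country, "timeout": timeout}
    # Leave formats out entirely so that default scraping applies.
    if formats:
        args["formats"] = list(formats)
    return args
```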