Skip to content

Data Extraction

Methods for extracting text, attributes, and structured data from the page.


get_text

Returns the visible text of an element.

text = await browser.get_text(selector)

Example

balance = await browser.get_text(".account-balance")
# "$1,200.00"

title = await browser.get_text("h1")
# "Dashboard"

get_attribute

Returns the value of an HTML attribute.

value = await browser.get_attribute(selector, attribute="name")

Example

href = await browser.get_attribute("a.report", attribute="href")
# "https://portal.com/report.pdf"

css_class = await browser.get_attribute("#btn", attribute="class")
# "btn btn-primary"

get_full_text

Returns all visible text on the page.

text = await browser.get_full_text()

extract_data

Extracts structured data from repetitive elements (cards, lists, divs).

data = await browser.extract_data(
    container="parent-selector",
    row="each-item-selector",
    columns={"column_name": "inner-selector"},
)

Parameters

Parameter Type Default Description
container str "" Parent container selector
row str "" Selector for each repeated item
columns dict None Name -> inner selector mapping
next_page str None "Next page" button selector
max_pages int 100 Maximum pages to extract

Return

List of dicts:

[
    {"text": "Quote 1...", "author": "Einstein"},
    {"text": "Quote 2...", "author": "Tolkien"},
]

Example -- Quotes

quotes = await browser.extract_data(
    container="body",
    row=".quote",
    columns={
        "text": ".text",
        "author": ".author",
        "tags": ".tags",
    }
)

Example -- Product cards

products = await browser.extract_data(
    container="#product-list",
    row=".product-card",
    columns={
        "Name": ".card-title",
        "Price": ".card-price",
        "Stock": ".card-stock span",
    },
    next_page="#btn-next",
    max_pages=5,
)

How to discover selectors

Use inspect() to map the structure:

# 1. View the general page
await browser.inspect(depth=5)
# Shows: <div.quote> x 10 items

# 2. View inside an item
await browser.inspect(".quote", depth=5)
# Shows: <span.text>, <small.author>, <a.tag>

# 3. Use in extract_data
data = await browser.extract_data(
    container="body", row=".quote",
    columns={"text": ".text", "author": ".author"}
)

extract_table

Extracts data from an HTML <table>. Uses headers as keys.

data = await browser.extract_table(selector)

Parameters

Parameter Type Default Description
selector str -- <table> selector
next_page str None Next page button selector
max_pages int 100 Maximum pages

Return

[
    {"Name": "John", "SSN": "123-45-6789", "Balance": "1,200.00"},
    {"Name": "Mary", "SSN": "987-65-4321", "Balance": "3,500.00"},
]

Example

# Simple table
clients = await browser.extract_table("#clients-table")

# With pagination
clients = await browser.extract_table(
    "#clients-table",
    next_page="#btn-next",
    max_pages=10,
)

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(clients)