Jaime Pillora 81c04938e4
Add tests, harden server, expand extractor + jq pipeline support
- main: wrap http.Server with Read/Write/Idle timeouts and SIGINT/SIGTERM
  graceful shutdown via Server.Shutdown.
- handler: collapse repeated query params (?tag=a&tag=b) to a comma-joined
  value instead of silently dropping the second.
- endpoint (json mode): treat each field's extractor list as a jq pipeline
  joined by " | ", matching HTML-mode chaining semantics.
- extractors: add join(sep) for explicit multi-match separators (quoted
  separators are unquoted via strconv); add $N backref support to
  s/.../.../ via re.ExpandString and replace the manual loop with
  re.ReplaceAllString for the global flag.
- extractors: factor sed parsing into parseSed so the matcher and
  generator share one validated grammar (rejects bad delimiters, empty
  match, unknown flags, extra parts).
- tests: add unit coverage for template, extractor generators (default,
  attr, regex match, sed first/global/backref/custom-delim, first, html,
  trim, query-param, join), Extractors.UnmarshalJSON, Extractor chaining,
  jsonValueString across types, extractHTML row completeness, extractJSON
  with chaining, unsupported method/mode rejection, and gostruct's
  panic-safety on bad inputs.
- README: document new join() and first() extractors, $N backref + custom
  delimiter for sed, multi-value query collapse, and JSON mode pipelines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 11:18:41 +10:00
.github Upgrade deps, fix latent bugs, modernise the scraper package 2026-04-27 11:07:32 +10:00
doc Add JSON mode support with jq selectors 2025-10-06 11:53:19 +00:00
example Add JSON mode support with jq selectors 2025-10-06 11:53:19 +00:00
scraper Add tests, harden server, expand extractor + jq pipeline support 2026-04-27 11:18:41 +10:00
.gitignore Upgrade deps, fix latent bugs, modernise the scraper package 2026-04-27 11:07:32 +10:00
go.mod Upgrade deps, fix latent bugs, modernise the scraper package 2026-04-27 11:07:32 +10:00
go.sum Upgrade deps, fix latent bugs, modernise the scraper package 2026-04-27 11:07:32 +10:00
LICENSE enable ci builds with github actions 2021-01-26 06:20:36 +11:00
main.go Add tests, harden server, expand extractor + jq pipeline support 2026-04-27 11:18:41 +10:00
README.md Add tests, harden server, expand extractor + jq pipeline support 2026-04-27 11:18:41 +10:00
TASKS.md Add JSON mode support with jq selectors 2025-10-06 11:53:19 +00:00

scraper


A dual-interface Go module for building simple web scrapers

Features

  • Go struct-tag interface
  • Command-line interface
    • HTML⇒JSON API server
    • Single binary
    • Simple configuration
    • Zero-downtime config reload with kill -s SIGHUP <scraper-pid>

Install

Binaries

See the latest release or download it with this one-liner: curl https://i.jpillora.com/scraper | bash

Source

$ go get -v github.com/jpillora/scraper

Go Example

package main

import (
	"fmt"
	"log"

	"github.com/jpillora/scraper/scraper"
)

func main() {
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}

	type google struct {
		URL    string   `scraper:"https://www.google.com/search?q={{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}

	g := google{Query: "hello world"}

	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}

	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}
#1: 'Helloworld Travel  Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/

CLI Example

Given google.json

{
  "/search": {
    "url": "https://www.google.com/search?q={{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href"]
    }
  }
}
$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
$ curl "localhost:3000/search?query=hellokitty"
[
  {
    "title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
    "url": "http://www.sanrio.com/"
  },
  {
    "title": "Hello Kitty - Wikipedia, the free encyclopedia",
    "url": "http://en.wikipedia.org/wiki/Hello_Kitty"
  },
  ...

JSON API

{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...],
      ...
    }
  }
}
  • <path> - Required The path of the scraper
    • Accessible at http://<host>:<port>/<path>
    • You may define path variables, for example my/path/:var. A request to /my/path/foo then sets :var = "foo"
  • <url> - Required The URL of the remote server to scrape
    • It may contain template variables in the form {{ var }}. scraper first looks for a path variable named var and, if none is found, falls back to a query parameter named var
  • result - Required The shape of the resulting JSON object. Each field's value is produced by executing its <extractor> on the current DOM context. A field may use a sequence of <extractor>s to perform more complex queries.
  • <method> - The HTTP request method (defaults to GET)
  • <extractor> - A string, which must be one of:
    • a regex in the form /abc/ - searches the text of the current DOM context (extracts the first group when provided).
    • a regex in the form s/abc/xyz/ - searches the text of the current DOM context and replaces matches with the provided text (sed-like syntax). Supports $N backreferences (s/v(\d+)/version-$1/) and the g flag for replace-all (s/a/b/g). Any single character may be used as the delimiter (s|/|-|g).
    • an attribute in the form @abc - gets the attribute abc from the DOM context.
    • a function in the form html() - gets the DOM context as a string
    • a function in the form trim() - trims space from the beginning and the end of the string
    • a function in the form first() - narrows the selection to the first matched element.
    • a function in the form join(sep) - joins the text of every matched element with sep. Quoted separators (join("\n"), join(", ")) are unescaped via Go's strconv rules; bare separators (join(|)) are taken literally.
    • a query param in the form query-param(abc) - parses the current context as a URL and extracts the provided param
    • a CSS selector abc (if not in the forms above) alters the DOM context.
  • list - Optional A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.

Multiple matched elements are comma-joined by default; use join(sep) for a different separator. Repeated query params (?tag=a&tag=b) are collapsed to a comma-joined value before template substitution.

JSON mode

Setting "mode": "json" switches the endpoint to a JSON-API scraper. list and the result fields are then jq selectors instead of CSS selectors. As with HTML mode, fields can be a string or an array; arrays are joined with | to form a jq pipeline ([".count", "tonumber"] becomes .count | tonumber).

Go API

Replace <variable> with your configuration, documented above.

  1. Define your endpoint struct:
type endpoint struct {
  Method string   `scraper:"<method>"`
  URL    string   `scraper:"<url>"`
  Result []result `scraper:"<list>"`
  <param>  string `scraper:"<param>"`
}

Method, URL, Result and Debug are special fields; the remaining string fields are treated as input parameters. By default, an input parameter is named after its field with the first character lowercased.

  2. Define your result struct:
type result struct {
  <field> string `scraper:"<extractor>"`
  <field> string `scraper:"<extractor> | <extractor>"`
}

The result struct defines field-to-extractor mappings. All fields must be strings. Struct tags cannot contain arrays, so multiple extractors are joined with | instead.

  3. Execute it:
e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
  ...
}
// e.Result is now set

Similar projects