mirror of https://github.com/jpillora/scraper.git synced 2026-07-15 19:35:29 -06:00

Jaime Pillora 81c04938e4 Add tests, harden server, expand extractor + jq pipeline support - main: wrap http.Server with Read/Write/Idle timeouts and SIGINT/SIGTERM graceful shutdown via Server.Shutdown. - handler: collapse repeated query params (?tag=a&tag=b) to a comma-joined value instead of silently dropping the second. - endpoint (json mode): treat each field's extractor list as a jq pipeline joined by " \| ", matching HTML-mode chaining semantics. - extractors: add join(sep) for explicit multi-match separators (quoted separators are unquoted via strconv); add $N backref support to s/.../.../ via re.ExpandString and replace the manual loop with re.ReplaceAllString for the global flag. - extractors: factor sed parsing into parseSed so the matcher and generator share one validated grammar (rejects bad delimiters, empty match, unknown flags, extra parts). - tests: add unit coverage for template, extractor generators (default, attr, regex match, sed first/global/backref/custom-delim, first, html, trim, query-param, join), Extractors.UnmarshalJSON, Extractor chaining, jsonValueString across types, extractHTML row completeness, extractJSON with chaining, unsupported method/mode rejection, and gostruct's panic-safety on bad inputs. - README: document new join() and first() extractors, $N backref + custom delimiter for sed, multi-value query collapse, and JSON mode pipelines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>		2026-04-27 11:18:41 +10:00

scraper

A dual-interface Go module for building simple web scrapers

Features

- Go struct-tag interface
- Command-line interface
  - HTML⇒JSON API server
  - Single binary
  - Simple configuration
  - Zero-downtime config reload with kill -s SIGHUP <scraper-pid>

Install

Binaries

See the latest release or download it with this one-liner: curl https://i.jpillora.com/scraper | bash

Source

$ go get -v github.com/jpillora/scraper

Go Example

package main

import (
	"fmt"
	"log"

	"github.com/jpillora/scraper/scraper"
)

func main() {
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}

	type google struct {
		URL    string   `scraper:"https://www.google.com/search?q={{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}

	g := google{Query: "hello world"}

	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}

	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}

#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/

CLI Example

Given google.json

{
  "/search": {
    "url": "https://www.google.com/search?q={{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href"]
    }
  }
}

$ scraper google.json
2015/05/16 20:10:46 listening on 3000...

$ curl "localhost:3000/search?query=hellokitty"
[
  {
    "title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
    "url": "http://www.sanrio.com/"
  },
  {
    "title": "Hello Kitty - Wikipedia, the free encyclopedia",
    "url": "http://en.wikipedia.org/wiki/Hello_Kitty"
  },
  ...

JSON API

{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...],
      ...
    }
  }
}

<path> - Required. The path of the scraper.
- Accessible at http://<host>:<port>/<path>
- You may define path variables such as my/path/:var; a request to /my/path/foo then sets :var = "foo"
<url> - Required. The URL of the remote server to scrape.
- It may contain template variables in the form {{ var }}. scraper first looks for a path variable named var and, if none is found, falls back to a query parameter named var.
result - Required. The resulting JSON object, produced by executing each <extractor> against the current DOM context. A field may use a sequence of <extractor>s to perform more complex queries.
<method> - The HTTP request method (defaults to GET).
<extractor> - A string which must take one of these forms:
- a regex in form /abc/ - searches the text of the current DOM context (extracts the first group when provided).
- a regex in form s/abc/xyz/ - searches the text of the current DOM context and replaces with the provided text (sed-like syntax). Supports $N backreferences (s/v(\d+)/version-$1/) and g flag for replace-all (s/a/b/g). Any single character may be used as the delimiter (s|/|-|g).
- an attribute in the form @abc - gets the attribute abc from the DOM context.
- a function in the form html() - gets the DOM context as a string.
- a function in the form trim() - trims space from the beginning and the end of the string
- a function in the form first() - narrows the selection to the first matched element.
- a function in the form join(sep) - joins the text of every matched element with sep. Quoted separators (join("\n"), join(", ")) are unescaped via Go's strconv rules; bare separators (join(|)) are taken literally.
- a query param in the form query-param(abc) - parses the current context as a URL and extracts the provided param
- a css selector abc (if not in the forms above) alters the DOM context.
list - Optional. A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.
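
The sed-like extractor described above can be approximated with Go's standard regexp package. The following is a minimal sketch, not the scraper package's own code: applySed is a hypothetical helper, it assumes a single-byte delimiter, and it does not handle escaped delimiters inside the pattern or replacement.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// applySed is a hypothetical helper (not part of the scraper package).
// It applies an expression of the form s<d>pattern<d>replacement<d>[g],
// where <d> is any single-byte delimiter, to text.
func applySed(expr, text string) (string, error) {
	if len(expr) < 2 || expr[0] != 's' {
		return "", fmt.Errorf("not a sed expression: %q", expr)
	}
	delim := string(expr[1])
	parts := strings.Split(expr[2:], delim)
	if len(parts) != 3 {
		return "", fmt.Errorf("malformed sed expression: %q", expr)
	}
	pattern, repl, flags := parts[0], parts[1], parts[2]
	if pattern == "" {
		return "", fmt.Errorf("empty match in %q", expr)
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	switch flags {
	case "g":
		// ReplaceAllString expands $N backreferences in repl.
		return re.ReplaceAllString(text, repl), nil
	case "":
		// Replace only the first match, expanding $N via ExpandString.
		loc := re.FindStringSubmatchIndex(text)
		if loc == nil {
			return text, nil
		}
		out := re.ExpandString(nil, repl, text, loc)
		return text[:loc[0]] + string(out) + text[loc[1]:], nil
	default:
		return "", fmt.Errorf("unknown flags %q", flags)
	}
}

func main() {
	out, _ := applySed(`s/v(\d+)/version-$1/`, "v12 and v13")
	fmt.Println(out) // version-12 and v13

	out, _ = applySed(`s|/|-|g`, "a/b/c")
	fmt.Println(out) // a-b-c
}
```

Without the g flag only the first match is rewritten, which is why the second "v13" is left untouched in the first call.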

Multiple matched elements are comma-joined by default; use join(sep) for a different separator. Repeated query params (?tag=a&tag=b) are collapsed to a comma-joined value before template substitution.
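
To illustrate the query-param collapse and the {{ var }} substitution that follows it, here is a rough sketch using only the standard library. collapseQuery and fillTemplate are hypothetical names, example.com is a placeholder URL, and escaping the substituted values with url.QueryEscape is an assumption of this sketch rather than documented scraper behaviour. Path variables are left out.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// templateVar matches {{ name }} with optional surrounding spaces.
var templateVar = regexp.MustCompile(`\{\{\s*(\w+)\s*\}\}`)

// collapseQuery is a hypothetical helper: it joins repeated query
// parameters with commas, so ?tag=a&tag=b yields tag = "a,b".
func collapseQuery(rawQuery string) (map[string]string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, err
	}
	params := make(map[string]string, len(values))
	for key, vals := range values {
		params[key] = strings.Join(vals, ",")
	}
	return params, nil
}

// fillTemplate replaces each {{ var }} with its query-escaped value.
// Unknown variables become the empty string in this sketch.
func fillTemplate(tmpl string, params map[string]string) string {
	return templateVar.ReplaceAllStringFunc(tmpl, func(m string) string {
		name := templateVar.FindStringSubmatch(m)[1]
		return url.QueryEscape(params[name])
	})
}

func main() {
	params, err := collapseQuery("tag=a&tag=b&query=hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(params["tag"]) // a,b
	fmt.Println(fillTemplate("https://example.com/search?q={{query}}&tags={{ tag }}", params))
	// https://example.com/search?q=hello&tags=a%2Cb
}
```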

JSON mode

Setting "mode": "json" switches the endpoint to a JSON-API scraper. list and the result fields are then jq selectors instead of CSS selectors. As with HTML mode, fields can be a string or an array; arrays are joined with | to form a jq pipeline ([".count", "tonumber"] becomes .count | tonumber).

Go API

Replace <variable> with your configuration, documented above.

Define your endpoint struct:

type endpoint struct {
  Method string   `scraper:"<method>"`
  URL    string   `scraper:"<url>"`
  Result []result `scraper:"<list>"`
  <param>  string `scraper:"<param>"`
}

Method, URL, Result and Debug are special fields; the remaining string fields are treated as input parameters. By default, an input parameter is named after its field with the first character lowercased.
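
The naming rule can be sketched with reflection. This is an illustration only: inputParams and lowerFirst are hypothetical helpers, and treating a non-empty scraper tag as an override of the default name is an assumption based on the google example above, not confirmed library behaviour.

```go
package main

import (
	"fmt"
	"reflect"
	"unicode"
	"unicode/utf8"
)

// special lists the field names the README describes as special.
var special = map[string]bool{"Method": true, "URL": true, "Result": true, "Debug": true}

// lowerFirst lowercases the first character of s.
func lowerFirst(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if size == 0 {
		return s
	}
	return string(unicode.ToLower(r)) + s[size:]
}

// inputParams is a hypothetical helper that collects the remaining
// string fields of a struct pointer as name/value input parameters.
func inputParams(v interface{}) map[string]string {
	params := map[string]string{}
	rv := reflect.Indirect(reflect.ValueOf(v))
	rt := rv.Type()
	for i := 0; i < rt.NumField(); i++ {
		f := rt.Field(i)
		if special[f.Name] || f.Type.Kind() != reflect.String {
			continue
		}
		name := f.Tag.Get("scraper") // assumed: the tag overrides the default
		if name == "" {
			name = lowerFirst(f.Name)
		}
		params[name] = rv.Field(i).String()
	}
	return params
}

// endpoint is an example struct; example.com is a placeholder URL.
type endpoint struct {
	URL      string `scraper:"https://example.com/?q={{query}}&n={{pageSize}}"`
	Query    string `scraper:"query"`
	PageSize string
}

func main() {
	e := endpoint{Query: "hello", PageSize: "10"}
	params := inputParams(&e)
	fmt.Println(params["query"])    // hello
	fmt.Println(params["pageSize"]) // 10
}
```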

Define your result struct:

type result struct {
  <field> string `scraper:"<extractor>"`
  <field> string `scraper:"<extractor> | <extractor>"`
}

The result struct defines field-to-extractor mappings. All fields must be strings. Struct tags cannot contain arrays, so multiple extractors are joined with | instead.

Execute it:

e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
  ...
}
// e.Result is now set

Similar projects

https://github.com/ernesto-jimenez/scraperboard
