scraper
A dual interface Go module for building simple web scrapers
Features
- Go struct-tag interface
- Command-line interface
- HTML⇒JSON API server
- Single binary
- Simple configuration
- Zero-downtime config reload with kill -s SIGHUP <scraper-pid>
Install
Binaries
See the latest release or download it with this one-liner: curl https://i.jpillora.com/scraper | bash
Source
$ go get -v github.com/jpillora/scraper
Go Example
package main
import (
"fmt"
"log"
"github.com/jpillora/scraper/scraper"
)
func main() {
type result struct {
Title string `scraper:"h3 span"`
URL string `scraper:"a[href] | @href"`
}
type google struct {
URL string `scraper:"https://www.google.com/search?q={{query}}"`
Result []result `scraper:"#rso div[class=g]"`
Query string `scraper:"query"`
}
g := google{Query: "hello world"}
if err := scraper.Execute(&g); err != nil {
log.Fatal(err)
}
for i, r := range g.Result {
fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
}
}
#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/
CLI Example
Given google.json
{
"/search": {
"url": "https://www.google.com/search?q={{query}}",
"list": "#rso div[class=g]",
"result": {
"title": "h3 span",
"url": ["a[href]", "@href"]
}
}
}
$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
$ curl "localhost:3000/search?query=hellokitty"
[
{
"title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
"url": "http://www.sanrio.com/"
},
{
"title": "Hello Kitty - Wikipedia, the free encyclopedia",
"url": "http://en.wikipedia.org/wiki/Hello_Kitty"
},
...
JSON API
{
<path>: {
"method": <method>,
"url": <url>,
"list": <selector>,
"result": {
<field>: <extractor>,
<field>: [<extractor>, <extractor>, ...],
...
}
}
}
- <path> - Required. The path of the scraper.
  - Accessible at http://<host>:port/<path>
  - You may define path variables like my/path/:var; when the request path is /my/path/foo, then :var = "foo".
- <url> - Required. The URL of the remote server to scrape.
  - It may contain template variables in the form {{ var }}. scraper looks for a var path variable first and, if none is found, for a query parameter var.
- result - Required. Represents the resulting JSON object, after executing the <extractor> on the current DOM context. A field may use a sequence of <extractor>s to perform more complex queries.
- <method> - The HTTP request method (defaults to GET).
- <extractor> - A string which must be one of:
  - a regex in the form /abc/ - searches the text of the current DOM context (extracts the first group when provided).
  - a regex in the form s/abc/xyz/ - searches the text of the current DOM context and replaces it with the provided text (sed-like syntax). Supports $N backreferences (s/v(\d+)/version-$1/) and the g flag for replace-all (s/a/b/g). Any single character may be used as the delimiter (s|/|-|g).
  - an attribute in the form @abc - gets the attribute abc from the DOM context.
  - a function in the form html() - gets the DOM context as a string.
  - a function in the form trim() - trims space from the beginning and the end of the string.
  - a function in the form first() - narrows the selection to the first matched element.
  - a function in the form join(sep) - joins the text of every matched element with sep. Quoted separators (join("\n"), join(", ")) are unescaped via Go's strconv rules; bare separators (join(|)) are taken literally.
  - a query param in the form query-param(abc) - parses the current context as a URL and extracts the provided param.
  - a CSS selector abc (if not in the forms above) - alters the DOM context.
- list - Optional. A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.
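To make the sed-like extractor semantics concrete, here is a minimal sketch built on Go's standard regexp package. applySed is a hypothetical helper written for illustration, not the scraper package's own code, and it assumes the delimiter never appears escaped inside the pattern or replacement.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// applySed is a hypothetical helper (not part of the scraper package).
// It interprets expr as s<d>match<d>replacement<d>[g], where <d> is any
// single-byte delimiter, and applies it to input.
// Assumption: the delimiter is never escaped inside match or replacement.
func applySed(expr, input string) (string, error) {
	if len(expr) < 4 || expr[0] != 's' {
		return "", errors.New("not a sed expression")
	}
	delim := string(expr[1])
	parts := strings.Split(expr[2:], delim)
	if len(parts) != 3 || parts[0] == "" {
		return "", errors.New("expected s/match/replacement/[g]")
	}
	re, err := regexp.Compile(parts[0])
	if err != nil {
		return "", err
	}
	switch parts[2] {
	case "g":
		// Replace every match; $N in the replacement expands to group N.
		return re.ReplaceAllString(input, parts[1]), nil
	case "":
		// Replace only the first match, expanding $N by hand.
		loc := re.FindStringSubmatchIndex(input)
		if loc == nil {
			return input, nil
		}
		repl := re.ExpandString(nil, parts[1], input, loc)
		return input[:loc[0]] + string(repl) + input[loc[1]:], nil
	default:
		return "", fmt.Errorf("unknown flag %q", parts[2])
	}
}

func main() {
	for _, tc := range [][2]string{
		{`s/v(\d+)/version-$1/`, "v12 and v3"},
		{`s/a/b/g`, "banana"},
		{`s|/|-|g`, "a/b/c"},
	} {
		out, err := applySed(tc[0], tc[1])
		fmt.Println(out, err)
	}
}
```

With these inputs the three expressions yield "version-12 and v3", "bbnbnb" and "a-b-c". Note that Go's expansion rules read $1x as a group named 1x, so ${1}x is needed when a backreference is followed directly by a letter or digit.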
Multiple matched elements are comma-joined by default; use join(sep) for a different separator. Repeated query params (?tag=a&tag=b) are collapsed to a comma-joined value before template substitution.
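The template lookup order and the repeated-parameter collapsing described above can be sketched as follows. resolve and templateVar are hypothetical names, example.com is a placeholder URL, and the sketch assumes values are substituted without URL escaping; the real package may behave differently on that point.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// templateVar matches {{ name }} with optional whitespace inside the braces.
var templateVar = regexp.MustCompile(`\{\{\s*([\w-]+)\s*\}\}`)

// resolve is a hypothetical helper (not the scraper package's API).
// Path variables take precedence; otherwise the query parameter is used,
// with repeated values collapsed into one comma-joined string.
func resolve(tmpl string, pathVars map[string]string, query map[string][]string) string {
	return templateVar.ReplaceAllStringFunc(tmpl, func(m string) string {
		name := templateVar.FindStringSubmatch(m)[1]
		if v, ok := pathVars[name]; ok {
			return v
		}
		// ?tag=a&tag=b becomes "a,b"; a missing parameter becomes "".
		return strings.Join(query[name], ",")
	})
}

func main() {
	q, err := url.ParseQuery("tag=a&tag=b")
	if err != nil {
		panic(err)
	}
	fmt.Println(resolve("https://example.com/items?tags={{ tag }}", nil, q))
}
```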
JSON mode
Setting "mode": "json" switches the endpoint to a JSON-API scraper. list and the result fields are then jq selectors instead of CSS selectors. As with HTML mode, fields can be a string or an array; arrays are joined with | to form a jq pipeline ([".count", "tonumber"] becomes .count | tonumber).
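As an illustration of the string-or-array rule, the sketch below decodes a field that may be either form and joins it into one jq expression. The pipeline type and its JQ method are hypothetical names used only for this example; they are not taken from the scraper package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pipeline is a hypothetical type (not from the scraper package) that
// accepts either a JSON string or a JSON array of strings.
type pipeline []string

// UnmarshalJSON tries the single-string form first, then the array form.
func (p *pipeline) UnmarshalJSON(b []byte) error {
	var one string
	if err := json.Unmarshal(b, &one); err == nil {
		*p = pipeline{one}
		return nil
	}
	var many []string
	if err := json.Unmarshal(b, &many); err != nil {
		return err
	}
	*p = many
	return nil
}

// JQ joins the stages with " | " to form a single jq expression.
func (p pipeline) JQ() string {
	return strings.Join(p, " | ")
}

func main() {
	cfg := `{"count": [".count", "tonumber"], "name": ".name"}`
	var fields map[string]pipeline
	if err := json.Unmarshal([]byte(cfg), &fields); err != nil {
		panic(err)
	}
	fmt.Println(fields["count"].JQ())
	fmt.Println(fields["name"].JQ())
}
```

Here the count field becomes .count | tonumber, while the single-string name field passes through unchanged as .name.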
Go API
Replace <variable> with your configuration, documented above.
- Define your endpoint struct:
type endpoint struct {
Method string `scraper:"<method>"`
URL string `scraper:"<url>"`
Result []result `scraper:"<list>"`
<param> string `scraper:"<param>"`
}
Method, URL, Result and Debug are special fields; the remaining string fields are treated as input parameters. By default, an input parameter is named after its field with the first character lowercased.
- Define your result struct:
type result struct {
<field> string `scraper:"<extractor>"`
<field> string `scraper:"<extractor> | <extractor>"`
}
The result struct is used to define field to extractor mappings. All fields must be strings. Struct tags cannot contain arrays so instead we join multiple extractors with |.
- Execute it:
e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
...
}
// e.Result is now set
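The default naming rule for input parameters (field name with its first character lowercased) can be pictured with a small sketch. paramName is a hypothetical helper written for this example; how the package derives the name internally is not shown in this document.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// paramName is a hypothetical helper (not part of the scraper package).
// It lowercases only the first character of a struct field name.
func paramName(field string) string {
	r, size := utf8.DecodeRuneInString(field)
	if size == 0 {
		return field // empty input stays empty
	}
	return string(unicode.ToLower(r)) + field[size:]
}

func main() {
	fmt.Println(paramName("MyParam"))
	fmt.Println(paramName("Query"))
}
```

Under this rule the MyParam field above would be referenced as {{myParam}} in the URL template, just as the Query field maps to {{query}} in the Go example.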
Similar projects
- https://github.com/ernesto-jimenez/scraperboard