Problem

How do you fetch an HTML element by class or id using Go’s html package (golang.org/x/net/html)?

Solution

Let’s say you want to fetch an element with the id upcoming-episode from https://mydramalist.com/705723-strange-lawyer-woo-young-woo

There are two ways to do this:

1. Using tokenizer

The tokenizer lets you pick out the element you want while Go is still reading through the HTML, one token at a time.

Here’s how you can do it with the tokenizer:

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://mydramalist.com/705723-strange-lawyer-woo-young-woo"

func main() {
	// Fetch the HTML from the URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Stop if the HTTP request doesn't succeed with 200.
	// resp.Status already contains the numeric code, e.g. "404 Not Found".
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("Status code error: %s", resp.Status)
	}

	// Initialize the tokenizer.
	h := html.NewTokenizer(resp.Body)

	for {
		// Stop if there are no further tokens to process.
		// ErrorToken is also what Next() returns at the normal end of input.
		if h.Next() == html.ErrorToken {
			break
		}

		// `h.Token()` can be called only once after `h.Next()`.
		// Check this SO thread for more info:
		// https://stackoverflow.com/q/73031647/6874596
		t := h.Token()

		// The attributes of an HTML element are stored as a slice of
		// html.Attribute values, each with a Key and a Val.
		for _, attr := range t.Attr {
			if attr.Key == "id" && attr.Val == "upcoming-episode" {
				fmt.Println("upcoming episode", t.String())
			}
		}
	}
}

The example above is a modified version of the tokenizer example shown in the official docs.

Output

$ go run main.go
upcoming episode <div id="upcoming-episode" class="box">

2. Using parser

The parser reads the whole HTML document into a tree first. You can then walk the tree to find the element you want.

package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

const url = "https://mydramalist.com/705723-strange-lawyer-woo-young-woo"

func main() {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("Status code error: %s", resp.Status)
	}

	// html.Parse returns the root node of the parsed tree.
	doc, err := html.Parse(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Recursive function that starts from the root node and walks the tree.
	// The return below only stops the descent into the matching node;
	// the rest of the tree is still visited.
	var f func(*html.Node)
	f = func(n *html.Node) {
		// Loop over the node's attributes (a slice of html.Attribute values)
		// to see if the id matches `upcoming-episode`.
		for _, attr := range n.Attr {
			if attr.Key == "id" && attr.Val == "upcoming-episode" {
				fmt.Println("upcoming episode", n.Attr, n.Data)
				return
			}
		}
		// Loop over the children: start with the first child and
		// move to the next sibling until there is none left.
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
	f(doc)
}

The example above is a modified version of the parser example shown in the official docs.

Gotchas

  1. When you are using the tokenizer, you need to follow the right call sequence

Tokenizer has a kind of funny interface, and you aren’t allowed to call Token() more than once between calls to Next(). As the doc says:

In EBNF notation, the valid call sequence per token is:
Next {Raw} [ Token | Text | TagName {TagAttr} ]

Which is to say: after calling Next() you may call Raw() zero or more times; then you can either:

  • Call Token() once,
  • Call Text() once,
  • Call TagName() once followed by TagAttr() zero or more times (presumably, either not at all because you don’t care about the attributes, or enough times to retrieve all of the attributes).
  • Or do nothing (maybe you’re skipping tokens).

The results of calling things out of sequence are undefined, because the methods modify internal state; they’re not pure accessors. If you call Token() multiple times between calls to Next(), the result is invalid: all of the attributes are consumed by the first call and aren’t returned by the later ones.

Source: https://stackoverflow.com/a/73032012/6874596

  2. The HTML response won’t have all the HTML elements that you see in the browser

If you haven’t noticed it yet, the id upcoming-episode points to the following div in the browser:

[image: upcoming-episode]

As you can see, the div with the id upcoming-episode has many children under it, but the output of the tokenizer or parser program above doesn’t show any children. This is because the children are created dynamically using JavaScript.

This is what the page looks like without JavaScript (I temporarily disabled JavaScript in my browser. Check this for how to do it):

[image: upcoming-episode-without-js]

You might want to use a headless browser package like chromedp to fetch the same response as the browser.