Web Scraping with Colly

Parsing Websites with Golang and Colly

I stumbled across a scraper and crawler framework written in Go called Colly. Colly makes it easy to scrape content from web pages thanks to its speed and simple interface. I have been interested in web scrapers ever since I did a project for my university studies, and you can read about that project here. Before continuing, please note that scraping websites is not always allowed and is sometimes even illegal. In the guide below we will be parsing this blog, GoPHP.io.

To begin, let’s take a look at the Colly GitHub page and scroll down to the example code listed there. We will create a new project with a main.go file that looks like this:

package main

import (
   "fmt"
   "github.com/gocolly/colly"
)

func main() {
   c := colly.NewCollector()

   // Find and visit all links
   c.OnHTML("a[href]", func(e *colly.HTMLElement) {
      e.Request.Visit(e.Attr("href"))
   })

   c.OnRequest(func(r *colly.Request) {
      fmt.Println("Visiting", r.URL)
   })

   c.Visit("http://go-colly.org/")
}

You may need to run go get -u github.com/gocolly/colly/... to download the framework into your Go directory. Now let’s change the URL to the gophp.io website.

c.Visit("https://gophp.io/")

We can then run the program by typing go run main.go in a terminal, making sure we are in the project directory. The crawl may run for a long time, so use Ctrl+C to cancel it. What do we get as our output? For me it looked like this:

[Image: Scraping the web with Colly]

What we see here is exactly what you would expect. Our program found all the links on the main gophp.io page and then followed the first one. This first link is a post on gophp.io, but the first link on that page points to VirtualBox, and our program will keep following links until it runs out of them. That could take a long time, and unless you want to build a search engine spider it is not very efficient. What I want is a server that I can call from a PHP script and that just fetches and formats the data I need. Luckily Colly has a complete example of what we need, a scraper server.
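As an aside, if you did want to keep the first crawler but stop it from wandering off to other sites, one option is to check each link's host before visiting it. Colly's own examples use collector options such as colly.AllowedDomains and colly.MaxDepth for this. The sketch below only shows the underlying check using the standard library; sameHost is a hypothetical helper name, not part of Colly.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sameHost reports whether link, resolved against base, points at the
// same host as base. Relative links resolve to the base host.
// (Hypothetical helper for illustration, not a Colly function.)
func sameHost(base, link string) bool {
	b, err := url.Parse(base)
	if err != nil || b.Host == "" {
		return false
	}
	// Parse resolves relative references such as "/category/go/" against base.
	l, err := b.Parse(link)
	if err != nil {
		return false
	}
	return strings.EqualFold(b.Hostname(), l.Hostname())
}

func main() {
	base := "https://gophp.io/"
	for _, link := range []string{
		"/category/go/",
		"https://gophp.io/page/2/",
		"https://www.virtualbox.org/",
	} {
		fmt.Println(link, sameHost(base, link))
	}
}
```

In the crawler above, a check like this could wrap the e.Request.Visit call inside the OnHTML callback, assuming the current page URL is passed as base. The scraper server mentioned above takes a different route and does not follow links at all: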

package main

import (
   "encoding/json"
   "log"
   "net/http"

   "github.com/gocolly/colly"
)

type pageInfo struct {
   StatusCode int
   Links      map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
   URL := r.URL.Query().Get("url")
   if URL == "" {
      log.Println("missing URL argument")
      return
   }
   log.Println("visiting", URL)

   c := colly.NewCollector()

   p := &pageInfo{Links: make(map[string]int)}

   // count links
   c.OnHTML("a[href]", func(e *colly.HTMLElement) {
      link := e.Request.AbsoluteURL(e.Attr("href"))
      if link != "" {
         p.Links[link]++
      }
   })

   // extract status code
   c.OnResponse(func(r *colly.Response) {
      log.Println("response received", r.StatusCode)
      p.StatusCode = r.StatusCode
   })
   c.OnError(func(r *colly.Response, err error) {
      log.Println("error:", r.StatusCode, err)
      p.StatusCode = r.StatusCode
   })

   c.Visit(URL)

   // dump results
   b, err := json.Marshal(p)
   if err != nil {
      log.Println("failed to serialize response:", err)
      return
   }
   w.Header().Add("Content-Type", "application/json")
   w.Write(b)
}

func main() {
   // example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'
   addr := ":7171"

   http.HandleFunc("/", handler)

   log.Println("listening on", addr)
   log.Fatal(http.ListenAndServe(addr, nil))
}

What does the above code do? It starts a web server running locally on your machine on port 7171. It takes a url parameter and returns all the links found on the page at that URL. Let’s give it a go by visiting http://127.0.0.1:7171/?url=https://gophp.io/. Here is an example of the JSON-encoded output we get:

{
  "StatusCode": 200,
  "Links": {
    "http://185.201.144.162:9181/static/": 1,
    "http://humanstxt.org/": 1,
    "http://pierrickcalvez.com/journal/a-five-minutes-guide-to-better-typography": 1,
    "http://www.gjermundbjaanes.com/understanding-ethereum-smart-contracts/": 1,
    "http://www.neopets.com/": 1,
    "http://www.zoon.cc/stupid/": 1,
    "https://archives.tenghamn.com": 1,
    "https://archives.tenghamn.com/2013/02/10/best-php-ide-jetbrains-phpstorm-review-2013.html": 1,
    "https://bcrypt.fun": 1,
    "https://bitinfocharts.com/vertcoin/address/VcMhEJrnYKNjTSrkazwJXgLHVRB5vKuouv": 1,
    "https://bittrex.com/": 1,
    "https://caddy.community/c/plugins": 1,
    "https://caddyserver.com/": 3,
    "https://code.tutsplus.com/tutorials/apache-vs-nginx-pros-cons-for-wordpress--cms-28540": 1,
    "https://coincall.io/": 1,
    "https://cryptozombies.io/": 1,
    "https://en.bitcoin.it/wiki/Hardware_wallet": 1,
    "https://etherscan.io/address/0xbbb2917f759a09299490d443b82d5324aefe8f9f": 1,
    "https://ferdinand-muetsch.de/caddy-a-modern-web-server-vs-nginx.html": 1,
    "https://github.com/Password-Fun/bcrypt_fun": 1,
    "https://github.com/bayandin/awesome-awesomeness": 1,
    "https://github.com/caddyserver/examples/blob/master/laravel/Caddyfile": 1,
    "https://github.com/egonelbre/gophers": 1,
    "https://github.com/joshbuchea/HEAD": 1,
    "https://github.com/markustenghamn/golang-cryptotracker": 1,
    "https://github.com/markustenghamn/golang-steem-cryptotracker": 1,
    "https://github.com/mholt/caddy": 1,
    "https://github.com/thedaviddias/Front-End-Checklist": 1,
    "https://github.com/vertcoin-project/One-Click-Miner/releases": 1,
    "https://godotengine.org/": 1,
    "https://golang.org/dl/": 1,
    "https://gophp.io/": 2,
    "https://gophp.io/a-simple-bcrypt-hash-generator-website/": 4,
    "https://gophp.io/author/markustenghamngophp/": 10,
    "https://gophp.io/category/caddy/": 2,
    "https://gophp.io/category/crypto/": 3,
    "https://gophp.io/category/crypto/games/": 1,
    "https://gophp.io/category/crypto/hardware-wallet/": 1,
    "https://gophp.io/category/crypto/vertcoin/": 1,
    "https://gophp.io/category/general/": 9,
    "https://gophp.io/category/go/": 5,
    "https://gophp.io/category/html/": 1,
    "https://gophp.io/category/notifications/": 1,
    "https://gophp.io/cli-crypto-portfolio-tracker-in-go/": 3,
    "https://gophp.io/cryptokitties-a-game-played-on-the-blockchain/": 4,
    "https://gophp.io/facebook-free-10-credit-will-applied/": 4,
    "https://gophp.io/go-html-head/": 4,
    "https://gophp.io/how-to-developing-with-go-on-linux/": 4,
    "https://gophp.io/how-to-make-a-simple-go-program-to-track-the-price-of-steem-via-an-api/": 3,
    "https://gophp.io/how-to-protect-your-cryptocurrencies/": 4,
    "https://gophp.io/page/2/": 2,
    "https://gophp.io/switching-from-nginx-to-caddy/": 3,
    "https://gophp.io/tag/ads/": 1,
    "https://gophp.io/tag/api/": 1,
    "https://gophp.io/tag/bcrypt/": 1,
    "https://gophp.io/tag/blockchain/": 1,
    "https://gophp.io/tag/caddy/": 2,
    "https://gophp.io/tag/cryptocurrencies/": 2,
    "https://gophp.io/tag/cryptokitties/": 1,
    "https://gophp.io/tag/description/": 1,
    "https://gophp.io/tag/development/": 1,
    "https://gophp.io/tag/ethereum/": 1,
    "https://gophp.io/tag/facebook/": 1,
    "https://gophp.io/tag/games/": 1,
    "https://gophp.io/tag/go/": 1,
    "https://gophp.io/tag/goland/": 1,
    "https://gophp.io/tag/golang-crypto-portfolio-bitcoin-steemit/": 1,
    "https://gophp.io/tag/golang/": 4,
    "https://gophp.io/tag/hardware-wallet/": 1,
    "https://gophp.io/tag/head/": 1,
    "https://gophp.io/tag/html/": 1,
    "https://gophp.io/tag/ledger-nano-s/": 1,
    "https://gophp.io/tag/linux/": 1,
    "https://gophp.io/tag/mining-crypto/": 1,
    "https://gophp.io/tag/mining-vertcoin/": 1,
    "https://gophp.io/tag/nginx/": 1,
    "https://gophp.io/tag/notifications/": 1,
    "https://gophp.io/tag/safety/": 1,
    "https://gophp.io/tag/steem/": 1,
    "https://gophp.io/tag/steemit/": 1,
    "https://gophp.io/tag/title/": 1,
    "https://gophp.io/tried-mining-vertcoin-1-month/": 4,
    "https://www.beubo.com/gophp/wp-content/uploads/sites/4/2018/05/goland-screenshot.png": 1,
    "https://hackernoon.com/why-isnt-agile-working-d7127af1c552": 1,
    "https://jeiwan.cc/posts/building-blockchain-in-go-part-1/": 1,
    "https://keepass.info/": 1,
    "https://letsencrypt.org/": 1,
    "https://lightning.network/": 1,
    "https://ma.rkus.io/": 1,
    "https://metamask.io/": 1,
    "https://password.fun": 1,
    "https://revel.github.io/": 1,
    "https://steemit.com/@tenghamn": 1,
    "https://steemit.com/cryptocurrency/@tenghamn/how-to-make-a-simple-go-program-to-track-the-price-of-steem-via-an-api": 1,
    "https://steemit.com/cryptocurrency/@tenghamn/i-made-a-command-line-cryptocurrency-tracker-in-go": 1,
    "https://support.google.com/webmasters/answer/79812?hl=en": 1,
    "https://thousandetherhomepage.com/": 1,
    "https://tobsta.github.io/OpenSource/": 1,
    "https://trezor.io/": 2,
    "https://vtcpool.io/how-to-start-mining-vertcoin/": 1,
    "https://wordpress.org/": 1,
    "https://www.binance.com/": 2,
    "https://www.coinbase.com/": 1,
    "https://www.coinbase.com/join/56a40201e9f0bb14ae000072": 1,
    "https://www.cryptokitties.co/": 4,
    "https://www.cryptokitties.co/profile/0xbbb2917f759a09299490d443b82d5324aefe8f9f": 1,
    "https://www.facebook.com/ads/manager": 1,
    "https://www.facebook.com/ads/manager/account_settings/notification_preferences": 1,
    "https://www.google.com/webmasters/tools/home?hl=en": 1,
    "https://www.jetbrains.com/go/": 2,
    "https://www.jetbrains.com/store/?fromMenu": 1,
    "https://www.lastpass.com/": 1,
    "https://www.ledgerwallet.com/r/eb88": 2,
    "https://www.ledgerwallet.com/r/eb88?path=/products/ledger-nano-s": 3,
    "https://www.nginx.com/resources/wiki/start/topics/examples/full/": 1,
    "https://www.sublimetext.com/": 1,
    "https://www.theguardian.com/technology/2016/aug/31/dropbox-hack-passwords-68m-data-breach": 1,
    "https://www.travian.com/international": 1,
    "https://www.virtualbox.org/": 1,
    "https://www.yes-www.org/why-use-www/": 1
  }
}

The JSON output above is only one level deep: the server does not follow the links it finds. This is great because we can now use this program as a sort of microservice. A PHP application could call this microservice and receive all links for the specified URL, which it could then process itself. Links are good, but we might want to parse other content on the page. Let’s customize our code for this purpose.

Queries For Specific Content With Colly

If we take a look at the source of gophp.io we can see that every title has the CSS class entry-title, which we can use for our query. We will modify the handler function and add another map for headings. I am only including the sections of code that I have changed below:

...
type pageInfo struct {
   StatusCode int
   Links      map[string]int
   // Added headings
   Headings   map[string]int
}

func handler(w http.ResponseWriter, r *http.Request) {
   URL := r.URL.Query().Get("url")
   if URL == "" {
      log.Println("missing URL argument")
      return
   }
   log.Println("visiting", URL)

   c := colly.NewCollector()

   // We add Headings here
   p := &pageInfo{Links: make(map[string]int), Headings: make(map[string]int)}

   // count links
   c.OnHTML("a[href]", func(e *colly.HTMLElement) {
      link := e.Request.AbsoluteURL(e.Attr("href"))
      if link != "" {
         p.Links[link]++
      }
   })

   // count headings
   c.OnHTML(".entry-title", func(e *colly.HTMLElement) {
      // We are looping through the .entry-title elements and then getting the text of the a element
      heading := e.ChildText("a")
      if heading != "" {
         p.Headings[heading]++
      }
   })
...

Now if we restart our program and request the same URL on port 7171 again, we will see some additional output in our JSON response.

...
  "Headings": {
    "CLI Crypto Portfolio Tracker In Go": 1,
    "CryptoKitties – A Game Played On The Blockchain": 1,
    "Facebook – A Free $10 Credit Will Be Applied When …": 1,
    "How To Make A Simple Go Program To Track The Price Of Steem Via An API": 1,
    "How To Protect Your Cryptocurrencies": 1,
    "How to: Developing with Go on Linux": 1,
    "I made a simple Bcrypt hash generator website with Golang": 1,
    "I tried mining Vertcoin for 1 month": 1,
    "Switching From Nginx To Caddy": 1,
    "What should go in the HTML head?": 1
  }
}

As you can see, we have now parsed all the titles on the page and added them to our JSON output. Using queries we can make very general or very specific parsers for any kind of website.
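One thing worth considering before exposing a server like this to other applications: the handler above only checks that the url parameter is non-empty. A minimal sketch of stricter validation, assuming we only ever want to fetch absolute http or https URLs, could look like this. The validateTarget name and the exact rules are my own assumptions, not part of the Colly example.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateTarget checks that raw is an absolute http or https URL with a host.
// (Hypothetical helper for illustration.)
func validateTarget(raw string) (*url.URL, error) {
	if raw == "" {
		return nil, errors.New("missing url parameter")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid url: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return nil, errors.New("url has no host")
	}
	return u, nil
}

func main() {
	for _, raw := range []string{"https://gophp.io/", "", "file:///etc/passwd", "gophp.io"} {
		if _, err := validateTarget(raw); err != nil {
			fmt.Printf("%q rejected: %v\n", raw, err)
			continue
		}
		fmt.Printf("%q accepted\n", raw)
	}
}
```

In the handler, a failed check could be reported to the caller with http.Error and a 400 status instead of returning an empty response.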

I hope this guide helps someone get started with web scraping. There are several real world examples in the documentation if you would like to learn more. I would love to hear your feedback, questions and comments below!
