I have an obsession with data.

In this post I’ll show you what I built with data, Elixir, Poolboy, Mogrify, and AndreaMosaic.

I usually try out new technologies on real problems so I can learn from them.

Some years ago I built a web crawler that scraped the news from the most popular media websites in my city. I also built a Facebook bot so people could read all the news in one place.

The first time I built the crawler, it was “OK.” I wrote it in Java and learned that concurrency was very difficult to deal with. For example, it is easy to introduce race conditions into the code. In the end, I got it all to work, but it took a lot of time and effort.

Elixir

I’ve been trying to learn Elixir for the past two years. I learned about basic things like pattern matching, OTP, and macros, but I hadn’t had a chance to build something from scratch. So, I decided to redo the Java crawler, but this time using Elixir.

I won’t go through every detail of how I built it, but I will describe how it works and the tools I used.

Challenges

  • Read all links from the front page of the media website
  • Identify which links match the URL pattern of a single news item
  • Generate an object of type Article with fields like title, content, URL, etc.
  • Save it to the database
  • Save the thumbnail on my computer
  • Resize the thumbnail
  • Do all this recursively for child nodes
  • Use concurrent Elixir workers for these tasks without exhausting my system resources
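
The recursive step in the list above can be sketched with nothing but the standard library. This is only an illustration, not the code from my crawler: the module name CrawlSketch, the fetch_links function, and the max_depth limit are all assumptions made for the example. The idea is to remember every URL already visited so that pages linking back to each other cannot cause an endless loop.

```elixir
defmodule CrawlSketch do
  # Hypothetical sketch: `fetch_links` is any function that takes a URL
  # and returns the list of URLs found on that page.
  def crawl(start_url, fetch_links, max_depth) do
    do_crawl([start_url], fetch_links, max_depth, MapSet.new())
  end

  # No URLs left at this level: return everything seen so far.
  defp do_crawl([], _fetch_links, _depth, visited), do: visited

  defp do_crawl([url | rest], fetch_links, depth, visited) do
    if MapSet.member?(visited, url) do
      # Already seen: skip it so link cycles cannot loop forever.
      do_crawl(rest, fetch_links, depth, visited)
    else
      visited = MapSet.put(visited, url)

      # Follow the child links one level deeper while depth budget remains.
      visited =
        if depth > 0 do
          do_crawl(fetch_links.(url), fetch_links, depth - 1, visited)
        else
          visited
        end

      do_crawl(rest, fetch_links, depth, visited)
    end
  end
end
```

In a real crawler, fetch_links would download the page and extract its links; in a quick experiment it can simply look URLs up in a map of fake pages.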

Libraries

These are some of the libraries I used:

  • HTTPotion: an HTTP client
  • Floki: an HTML parser
  • Ecto: a database wrapper and language-integrated query for Elixir
  • Mogrify: a wrapper around an awesome tool called ImageMagick
  • Poolboy: a worker pool factory

How It Works

I used HTTPotion to fetch the HTML of every single page. The first thing I did was crawl the home page of the media site. Then, with the help of Floki, I got the href attribute of every <a> tag.

The code looked something like this:

def extract_links(html) do
  html
  |> Floki.find("a")  # get all <a> tags
  |> get_only_links() # get href attributes
  |> filter_links()   # keep only links to single news items
  |> Enum.uniq()      # remove duplicate links
end
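
The filter_links helper is not shown above, so here is a minimal sketch of what such a filter could look like. The URL pattern is an assumption made up for the example (article paths ending in "/news/<slug>"); every media site has its own pattern, and the module name LinkFilterSketch is hypothetical too.

```elixir
defmodule LinkFilterSketch do
  # Assumed pattern: article paths end in "/news/<slug>", where the slug is
  # made of lowercase letters, digits, and hyphens. Real sites will differ.
  defp article_pattern, do: ~r{/news/[a-z0-9-]+/?$}

  # Keep only the links whose path matches the article pattern.
  def filter_links(links) do
    Enum.filter(links, fn link ->
      # URI.parse/1 returns a nil path for links such as "https://example.com"
      path = URI.parse(link).path || ""
      Regex.match?(article_pattern(), path)
    end)
  end
end
```

Matching on the parsed path instead of the raw string means query strings and fragments do not get in the way of the pattern.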

Once I had extracted the URLs, I looped through them and crawled each one to build an object like this,

 %{title: title, content: content, thumbnail: thumbnail … etc}

The function that builds it looked something like this:

def get_article(html, url) do
  # extract the title first so the slug can be derived from it
  title = title(html)

  %ArticleStruct{
    title: title,
    slug: Slugger.slugify_downcase(title, ?_), # slug
    content: content(html),
    url: url,
    thumbnail: thumbnail(html)
  }
end

Notice that I also save the slug of the title. I use it later to name each thumbnail file.

Once this object was filled in, I could go ahead and save it to my database using Ecto.

To make our beautiful mosaic, we need tons of images stored locally. I used HTTPotion again to download the image from the thumbnail URL, and Mogrify to resize it.

def save_image(article) do
  case HTTPotion.get(article.thumbnail) do
    # only write the file when the download succeeded
    %HTTPotion.Response{status_code: 200, body: body} ->
      basepath = "/path/images/"
      filename = Path.join(basepath, "#{article.slug}.png")
      File.write!(filename, body)
      resize_image(filename, 200, 200)
      article

    # error responses and non-200 statuses are skipped
    _ ->
      nil
  end
end

Here is how I resized the image and saved it:

def resize_image(image_path, width, height, _opts \\ []) do
  image_path
  |> Mogrify.open()
  |> Mogrify.resize_to_limit(~s(#{width}x#{height}))
  |> Mogrify.save(path: image_path) # overwrite the original file
end

Once I had all this working, I needed to set up a pool of Elixir workers so my computer could do all this concurrent work without dying.

This is where Poolboy comes into play. I used it to configure a Supervisor that keeps a pool of workers available at all times.

defmodule ScrapperApp.Application do
  @moduledoc false
  use Application

  # Pool settings: 3 permanent workers, plus up to 4 temporary ones under load
  defp poolboy_config do
    [
      {:name, {:local, :worker}},
      {:worker_module, ScrapperApp.MyWorker},
      {:size, 3},
      {:max_overflow, 4}
    ]
  end

  def start(_type, _args) do
    children = [
      :poolboy.child_spec(:worker, poolboy_config())
    ]

    opts = [strategy: :one_for_one, name: Scrapper.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
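
The configuration above points at ScrapperApp.MyWorker, which is not shown in this post. As a rough sketch only, a worker of that kind could be a small GenServer like the one below. The {:process, url, fun} message and the crawl_article function mentioned in the comment are hypothetical names invented for the example; the one real requirement is that Poolboy starts each worker through start_link/1.

```elixir
defmodule ScrapperApp.MyWorker do
  use GenServer

  # Poolboy starts each worker by calling start_link/1 on the worker module.
  def start_link(_args) do
    GenServer.start_link(__MODULE__, nil)
  end

  @impl true
  def init(state), do: {:ok, state}

  # Hypothetical message: run `fun` on `url` inside the worker process.
  @impl true
  def handle_call({:process, url, fun}, _from, state) do
    {:reply, fun.(url), state}
  end
end

# With the pool running, a caller would check out a worker like this
# (crawl_article/1 is a made-up name for the per-URL work):
#
#   :poolboy.transaction(:worker, fn pid ->
#     GenServer.call(pid, {:process, url, &crawl_article/1})
#   end)
```

Because every job has to go through a checked-out worker, the size and max_overflow settings cap how many pages are processed at the same time.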

Running the App

AndreaMosaic

AndreaMosaic is free software that creates mosaic images for you, and it’s really fast. I love this tool.
Here is a screenshot of how it looks.

To make it work, choose a background image and the folder that holds the images to use as mosaic tiles. You can specify whether tiles may repeat, the size of the final image, etc. Give it a try; it’s really easy to use.

Conclusion

I’m really impressed by how easy it is to use Elixir. I highly recommend that you build something from scratch. It worked really well for me.

The Elixir community is still growing, and this is the time to get on board.

Resources