31 Aug 15

p1p1_viewer

Foreword

The purpose of this technical overview is to present approaches for tackling and solving challenges from start to finish. I am aiming this series at intermediate-level developers: I expect readers to be fairly familiar with programming and some web technologies (i.e. not complete beginners), and I will spend very little time delving into implementation details, so very experienced developers may already be familiar with most of the content.

The code

https://github.com/royk/draft_scraper/

The stack

Server

  • Java (Spark framework)
  • Groovy

Client

  • EmberJS (without EmberData)
  • Twitter Bootstrap 3

Platform

  • Heroku

DB

  • MongoDB

The project

This is a personal project revolving around my hobby of the card game Magic: The Gathering.

The company behind the game has an online app that shows choices that pro players made during major competitions. The problem is that the app has an outdated UI, and each choice is presented on a separate web page. If you want to view all the choices of all the players, 360 choices per game, you need to view 360 web pages.

I set out to display all those choices on a single web page, using a method demonstrated by a pro Magic player in one of his videos. Inspired by the video, I tried recreating the single-page screen using Photoshop, only to realize that it's a classic task for a computer to do, not a human. This realization is our cue as developers to either write down the idea for some future time when we want to work on a project, or open the IDE and get cracking immediately 🙂

The plan

Since there is no API available to get the information from the company's website, I decided to scrape it from the online app. Scraping is not some sort of weird YouTube dance; it means collecting data from a web page. The company makes the data available in a user-friendly manner, i.e. embedded inside HTML, so technically you could go over each page and copy the data out by hand. Or write an app to do that for you! This is a classic way to overcome the lack of an API, and from conversations with other developers I get the sense it is a very common tool in the arsenal of the hobbyist hacker. I highly recommend becoming good at it.
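As a minimal illustration of what "getting the data out" means, here is a sketch that pattern-matches on an HTML string. The markup and card names are made up for the example; the real site's structure will differ, and a real scraper would use a proper HTML parser rather than regular expressions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScrapeSketch {
    // Hypothetical markup; the real site's HTML will differ.
    static final String HTML =
        "<li class=\"pick\">Lightning Bolt</li>" +
        "<li class=\"pick\">Counterspell</li>";

    // Pull the text out of every <li class="pick"> element.
    static List<String> extractPicks(String html) {
        List<String> picks = new ArrayList<>();
        Matcher m = Pattern.compile("<li class=\"pick\">(.*?)</li>").matcher(html);
        while (m.find()) {
            picks.add(m.group(1));
        }
        return picks;
    }

    public static void main(String[] args) {
        System.out.println(extractPicks(HTML)); // [Lightning Bolt, Counterspell]
    }
}
```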

One approach to scraping is writing a piece of JavaScript, injecting it into the page via the console or by overwriting the site's HTML file (a useful technique that I will not get into in this article), and having it spit out a string result for you to copy and use in your app. This could work if the data were in a single page or a few pages, but since we're talking about hundreds of pages, it would get tedious fast. Another approach, the one I chose, is to have a server run a headless browser that goes over each page for us, scrapes the data, and outputs some string representation of it (or puts it in a DB).
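The server-side loop boils down to enumerating all the page URLs and visiting each one. The URL scheme below is invented for illustration (the real site's paging parameters are different), but it shows the shape of the iteration:

```java
import java.util.ArrayList;
import java.util.List;

public class PageList {
    // Hypothetical URL scheme; the real site's paging parameters will differ.
    static List<String> buildPageUrls(String baseUrl, int pickCount) {
        List<String> urls = new ArrayList<>();
        for (int pick = 1; pick <= pickCount; pick++) {
            urls.add(baseUrl + "?pick=" + pick);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = buildPageUrls("http://example.com/draftviewer", 360);
        System.out.println(urls.size()); // 360
        // A headless browser would now visit each URL, scrape it,
        // and accumulate the results into one string or a DB.
    }
}
```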

Once we have the data one way or the other, we need to write a web app that accepts this data and displays it to the user in the single page format I had chosen. That's all there is to it.

Choosing the tools - Back end

Since I planned this project to run on a free Heroku server, I wanted a very lightweight framework to work with. I had a very bad experience trying to run a full-blown Ruby on Rails app on a free Heroku server, so I started reading about Java microframeworks. I chose Java because my workplace at the time was using it, and I wanted to get more experience by starting a Java codebase from scratch. I found Spark, a Java framework that promised to be lightweight and easy to work with. Sounded like a perfect match. I set up a hello-world Spark app and confirmed it ran on my local machine. Great start!
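For reference, a Spark hello world really is this short (it needs the com.sparkjava:spark-core dependency on the classpath; Spark starts an embedded Jetty server on port 4567 by default):

```java
import static spark.Spark.get;

public class HelloWorld {
    static String greeting() {
        return "Hello World";
    }

    public static void main(String[] args) {
        // Registers GET /hello; Spark spins up its embedded server on first use.
        get("/hello", (request, response) -> greeting());
    }
}
```

Run it and visit http://localhost:4567/hello to see the greeting.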

Then I needed a way to run a browser on the server and query the HTML using a syntax I was already familiar with. Looking for a Java solution with a CSS-selector-like syntax, I found the Geb framework. Geb uses Groovy, a great JVM-based language that is quite easy for JavaScript developers to pick up, and since it runs on the JVM, you can call it from Java. I set up a basic scraping test, and it worked. Yay!
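Geb itself is written in Groovy, so to keep the examples here in Java, the following sketch uses jsoup as a stand-in to show what selector-style querying looks like (the project itself used Geb, not jsoup; the markup is made up):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorSketch {
    public static void main(String[] args) {
        // Made-up markup; the real pages' structure will differ.
        Document doc = Jsoup.parse(
            "<div class='pick'><span class='card'>Lightning Bolt</span></div>");
        // Query with familiar CSS selector syntax instead of walking the DOM.
        Elements cards = doc.select("div.pick span.card");
        System.out.println(cards.text()); // Lightning Bolt
    }
}
```

The appeal is exactly this: anyone who knows CSS (or jQuery) can read and write the queries without learning a new traversal API.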

I decided to stop here and move to the client. I figured that setting up a DB for one string of JSON or XML was overkill for the project; I could just store it in the code in some variable. I like to simplify when possible and move forward as fast as I can. If this decision turns out to be wrong, backtracking to correct it is not a lot of work.

A note about keeping myself motivated

I got the idea for this project while I was employed full time at a chronically understaffed start-up, and I didn't have a lot of spare time or energy to program in the evenings. If you've read my post Don't be overwhelmed, you will be familiar with the mental tricks I employed to tackle this project without getting discouraged by the workload. I separated my work plan into multiple manageable steps, and whenever I picked up a step, I treated it as a project in and of itself, allowing myself to feel a sense of accomplishment each time I completed one.

Step one was to go over the current situation and come up with a short-term plan for solving it. Once I figured out that I needed a lightweight server that could also scrape, I let myself feel great about having solved the problem in theory. Then I focused on finding a lightweight server and getting it running. Getting a "hello world" from a framework I've never seen before is an accomplishment, no matter how many huge and complex projects I've worked on in the past.

The key to enjoying the project is partitioning the workload into manageable steps and allowing myself to enjoy the completion of each one. I also advocate skipping unnecessary steps, though some people have criticized me in the past for presenting projects with no minification/obfuscation or dependency management. I can see their point, so I'm not taking a hard stance on this. But I like to maximize the chance of completing a project, and that means cutting or postponing steps that aren't crucial.

Choosing the tools - Front end

In the list of tools I wrote that I used EmberJS for the front end, but that came in a much later step. I started (and actually completely wrote) the project with plain inline JavaScript (and jQuery).

The reason is that setting up a front-end framework is a step in itself, and I didn't consider it a necessary one. I wanted to do only work that moved me toward seeing something working. In my evaluation of the project's initial phase, I figured there wasn't going to be enough front-end logic to require a framework.

For the CSS I used Bootstrap, a pretty popular choice that I had used before and knew what to expect from. A small twist was that I was familiar with Bootstrap 2 and decided to use Bootstrap 3 purely for its educational value. It ended up being quite straightforward.

Putting it all together for version 1

Once I had a server, a scraper, and a client all buzzing together, all that was left were pure programming tasks: scrape a page, send the data to the client, display it.
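Wired together, the server side of version 1 amounts to little more than a route that hands the scraped string to the client. The route path, field name, and static-files location below are illustrative assumptions, not the project's actual names:

```java
import static spark.Spark.get;
import static spark.Spark.staticFiles;

public class Server {
    // In version 1 the scraped result lived in a variable rather than a DB.
    static String scrapedJson = "[]"; // filled in by the scraper at startup

    public static void main(String[] args) {
        staticFiles.location("/public"); // serves the HTML/JS client
        get("/picks", (request, response) -> {
            response.type("application/json");
            return scrapedJson;
        });
    }
}
```

The client then fetches /picks with an AJAX call and renders every choice onto the single page.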

[Image: filling_the_blanks]

One intense programming session, and I had a basic version of my project up and running. Not bad!

Closing thoughts

You will notice that very few words were spent on the actual programming of this project. But this is an honest reflection of the project: planning, researching, and setting up and hooking up the development environments simply took more time than implementing the code I set out to write.

An unfortunate reality of modern development, and of web development in particular, is that setting up environments is a necessary and significant part of our job. This means that programming is definitely NOT our only responsibility as developers. Setting up servers, clients, programming environments, etc. are all parts of the job, and people often underestimate how complex they can be. Be prepared to spend chunks of your time figuring out why some dependencies aren't compiling in Java, why your one-line Node server running on a Linux machine is not accessible through port 80, or why your client asset pipeline works locally but not in production. My advice is to get these things out of your way as early as you can, so that you run into problems with your chosen tools as soon as possible (and avoid huge rewrites), and so that you can group as many of your programming tasks together as possible.

In this project I made the mistake of completing the first iteration before deploying it to Heroku even once. I ran into some problems running Spark on Heroku, and had other developers not run into similar issues and posted about them online, I could've been stumped. Going back to choose a new framework, or even a new programming language, would have been quite a setback. The correct order of business would've been to deploy my client and server hello worlds to Heroku and confirm they worked there before continuing with the implementation. Lesson learned!

 

In part 2 I will go into how and why I rewrote the entire front end to use EmberJS, and some interesting choices I had to make along the way. Stay tuned!
