How to be a Great Engineer
…at Coursera!

As an engineering manager, I’m often asked by colleagues and interview candidates, “What does my career progression look like as an engineer at Coursera?” Though developing as a people manager appeals to some, I find that most are interested in hearing about how they can grow as individual contributors.

All engineers at Coursera share the same title: “Software Engineer”. You may think this would result in an ambiguous career progression, but we actually prefer this model to a rigid hierarchy for the following reasons.

  • We’re a small, tight-knit engineering organization within a startup and we’re laser-focused on working together to achieve our mission of providing universal access to the world’s best education. Everyone is encouraged to contribute to their fullest potential without being held back by artificial organizational structures, titles and roles.
  • Our culture embodies humility. Great engineers are recognized for their contributions, leadership and attitude, not their title.
  • Everyone is a leader. Our culture is very open and inclusive; some of the best ideas we’ve heard come from new college graduates or from our interns. We’re eager to help everyone in our organization grow into technical leaders.

As technologists, we are continuously striving to improve our craft, and at Coursera we want to support that continuous improvement. We hope working at Coursera is a transformative experience for our engineers, and, in the same way, we hope our engineers transform the company’s trajectory.[1]

To guide our engineering teams, we’ve put together a list of qualities that we think are embodied by great, high-performing engineers. These are qualities that we’ve admired in our colleagues at Coursera, as well as in colleagues and cultures at other Silicon Valley tech companies such as LinkedIn, Google and Facebook. We’re sharing our own list in the hopes that it will inspire other engineering teams to think about the qualities they value, and how to build a culture that nurtures and rewards outstanding talent.

Without further ado…

How to be a Great Engineer at Coursera

xkcd: “goto”[2]

Results

Great engineers produce great results. Coursera values engineers who’ve directly designed, implemented and delivered major initiatives. Here’s why:

  • For any major project, the devil is in the details of rolling out, “productionizing” and operating services and product features. Delivering and operating a service or product demonstrates ownership, which is one of our core engineering values.
  • Results directly add value to our business. We consider the cumulative portfolio of an employee’s contributions to measure value added. Impact can be measured across any number of dimensions, including growth, engagement, revenue, engineering productivity, site uptime, site scalability and more. Significant impact is rarely achieved just by shipping an MVP. Constant iteration is required to maximize value from the products and features we build.

Driven by our guiding principle of “climbing effectively”, we appreciate results that thoughtfully balance speed of execution with extensibility and code quality. We also value the “10x engineer” who not only delivers quality results quickly, but also inspires and mentors others around them to work smarter and faster as well.

Leadership

You don’t have to be a manager to be a leader. Technical leadership is about the way you do your job. You’re making your projects, your team, and the entire engineering organization better. Most great engineers exhibit at least some of these qualities:

  • Project leadership: Great engineers can play the role of technical lead on a project of significant scope or on numerous, smaller, high-value projects. They drive ideation, clarify design, remove roadblocks and continuously ship improvements. They work well with product to sequence the right products and features to build, and they know how to balance the trade-offs between quality, completeness, and speed. When applicable, they drive the project to completion by making data driven decisions.
  • Identifying gaps: Great engineers are able to think broadly about the gaps and problems we face. More importantly, they are the first to identify problems we never knew we had. They value solving problems over complaining - in fact, they are eager to get their hands dirty, and tackle the challenge at hand with creativity and genuine enthusiasm.
  • “Up-leveling”: Great engineers make the engineers around them better. They are highly productive mentors who lead by example and inspire others. They use code and design reviews as a venue for thoughtful, asynchronous mentorship.
  • Love of learning: Great engineers make themselves better by continuously improving their craft. They enthusiastically devour technical documentation, research papers, and blog posts. They take classes and absorb the experience of others.
  • Organizational presence: Great engineers are known throughout the organization for their knowledge and experience. They share their work via tech talks, show-and-tells, make-a-thons, and more. A great engineer’s presence can extend outside of Coursera in the form of external blog posts, speaking at conferences, and publishing research papers.
  • Influence: Great engineers influence other engineers to adopt new technologies, architectures, processes, and standards. This can also be measured by the length of the “line outside their cube” or the size of their Differential queue.
  • Attitude: Like all Coursera employees, great engineers care for teammates and exhibit humility. They realize that every mistake gives them an opportunity to become better at what they do.

Technical Excellence

Great engineers at Coursera are technically excellent in many ways: they can be brilliant product hackers, algorithmic masterminds, detail-oriented infrastructure engineers, or all of the above. We value engineers who think deeply in designing solutions to complex product and infrastructure problems.

Great engineers produce designs that are robust, intuitive, extensible, flexible, maintainable, operable, scalable, and efficient. In doing so, they strive to achieve a balance between quality and speed of execution.

Leverage

In addition to contributing to business objectives, great engineers enhance the engineering organization as a whole by boosting the productivity of engineering teams, building reusable components, improving tooling, and generally making the codebase better. This could mean building services or components abstractly so they serve multiple product needs or increase developer productivity. It could also mean taking the initiative to build tools, extract libraries, fix broken windows, write engineering documentation, or write tests.


This is not a checklist!

Great engineers need not excel in all areas listed above, but must excel in some. They may be exceptionally well-rounded, or exceptionally strong in a few areas. Per the montage below, it’s unlikely you’ll be able to “max out” your stats like Cecil (left); you’ll probably be more balanced like Gorath (right).

RPG character stat screens[3][4]

How do we use this list at Coursera?

  • We all internally call out great engineers who exemplify the criteria in the document.
  • Individual contributors use this document to track their progress towards greatness and we all add notes, stories and examples so that others can learn about exemplary engineers who’ve done great things.
  • Engineering managers at Coursera use this document to structure mentorship of team members and feedback during 1:1s and performance reviews.
  • Anyone can shout out to their peers when they see them doing great things. This can happen in 1:1s, team meetings, eng all-hands, via our developers’ Slack channel, or via email.

Final Thoughts

At Coursera, we provide universal access to the world’s best education. We level the playing field by making high-quality education something that’s no longer available only to the elite. Likewise, in our engineering organization, we create an environment where every engineer can achieve greatness. We promote transparency and inclusiveness and provide this list of qualities as guidance to help engineers understand how they can improve.

I hope that was helpful! If you like what you’ve read here and you think you might be interested in working with us, check out our careers page or email us at joinus@coursera.org.



  1. Ben Casnocha, “The Alliance at LinkedIn: LinkedIn Speaker Series with Reid Hoffman and Jeff Weiner,” LinkedIn, 8 July 2014.

  2. Randall Munroe, “goto,” xkcd.

  3. Final Fantasy II, Square, 1991. Video game.

  4. Betrayal at Krondor, Sierra On-Line, 1993. Video game.

Redshift SSD Benchmark

Our warehouse runs completely on Redshift, and query performance is extremely important to us. Earlier this year, the AWS team announced the release of SSD instances for Amazon Redshift, which trade raw storage capacity for more CPU and memory per dollar. Is the extra CPU truly worth it? We do a lot of processing with Redshift, so this question is big for us. To answer it, we decided to benchmark SSD performance and compare it to our original HDD performance.

Redshift is easy to use because its PostgreSQL JDBC drivers allow us to use a range of familiar SQL clients. Its speedy performance is achieved through columnar storage and data compression.
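For example, because Redshift speaks the PostgreSQL wire protocol, timing a query from Scala takes nothing more than standard JDBC. The sketch below is illustrative only: the cluster endpoint, database, credentials, and query are placeholders.

import java.sql.DriverManager

// Placeholders throughout: endpoint, database, credentials, and query are illustrative.
Class.forName("org.postgresql.Driver")
val conn = DriverManager.getConnection(
  "jdbc:postgresql://example-cluster.us-east-1.redshift.amazonaws.com:5439/warehouse",
  "benchmark_user", "password")
try {
  val start = System.nanoTime()
  val rs = conn.createStatement().executeQuery("SELECT count(*) FROM enrollments")
  rs.next()
  println(s"rows = ${rs.getLong(1)}, elapsed = ${(System.nanoTime() - start) / 1e9} s")
} finally {
  conn.close()
}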

Experiment Setup

The Redshift instance specs below are based on on-demand pricing; reserved instances can be up to 75% cheaper. The benchmark results are the mean run times over three runs of each query.

            HDD Setup 1     HDD Setup 2     SSD Setup 1     SSD Setup 2
Nodes       4 dw1.xlarge    8 dw1.xlarge    32 dw2.large    4 dw2.8xlarge
Storage     8 TB            16 TB           5.12 TB         10.24 TB
Memory      60 GB           120 GB          480 GB          976 GB
vCPU        8               16              64              128
Price       $3.40 / hr      $6.80 / hr      $8.00 / hr      $19.20 / hr

Query 1.

First, we ran a simple join query between a table with 1 billion rows and a table with 50 million rows. The total amount of data processed was around 46GB. The results fell in favor of the SSDs.

Query 2.

This complex query features regex matching and aggregate functions across 1 million rows produced by 4 joins. The total amount of data processed was around 100GB. The results fell even further in favor of the SSDs, with a 5x to 15x performance improvement.

Query 3.

A query that runs window functions on a table of 1 billion rows showed surprising results. The total amount of data in this table is about 400GB. Although the SSDs performed better overall, the smaller SSD cluster out-performed the bigger one, even though the latter has double the memory and CPU power.

Query 4.

This last query has 4 join statements with a subquery that itself includes 2 joins. The amount of data processed is around 107GB. Since this query is very compute-heavy, it is not surprising that the SSDs perform 10x better. What is surprising is that the smaller SSD cluster is once again more performant than the bigger one.

Conclusion

We also ran some other queries, and the performance improvement from HDD to SSD was consistently about 5 to 10 times. From these experiments, the DW2 machines are clearly promising in terms of computation time. For the same price, the SSDs provide 3.4 times more CPU power and memory; however, they offer only about 25% of the disk storage of the HDDs.
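As a back-of-the-envelope check of that claim, using the on-demand numbers from the setup table above (a sketch of the arithmetic only, not additional benchmark data):

// Price-normalized comparison of HDD Setup 1 (4x dw1.xlarge) vs. SSD Setup 1 (32x dw2.large).
val hddVcpuPerDollar      = 8   / 3.4   // ≈ 2.4 vCPU per $/hr
val ssdVcpuPerDollar      = 64  / 8.0   // = 8.0 vCPU per $/hr  → ~3.4x more CPU per dollar
val hddMemoryGbPerDollar  = 60  / 3.4   // ≈ 17.6 GB per $/hr
val ssdMemoryGbPerDollar  = 480 / 8.0   // = 60 GB per $/hr     → ~3.4x more memory per dollar
val hddStorageTbPerDollar = 8.0  / 3.4  // ≈ 2.4 TB per $/hr
val ssdStorageTbPerDollar = 5.12 / 8.0  // = 0.64 TB per $/hr   → roughly a quarter of the HDD storage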

A limitation of the dw2.large SSD instances is that a Redshift cluster can support at most 32 of them, which means a dw2.large cluster can provide at most 5.12 TB of disk storage. The only other option is to upgrade to dw2.8xlarge nodes, but this experiment shows little performance benefit in moving from dw2.large to dw2.8xlarge despite doubling the memory and CPU.


Bringing Data To Teaching

When an instructor teaches a class on Coursera, they get to make a direct impact on hundreds of thousands of learners all around the world. This broad reach is a huge attraction for teachers. But we think the benefits of teaching a MOOC can go even further. The streams of data coming in from learners can give instructors an unprecedentedly detailed view into how learning happens, where they can make improvements, and which pedagogical directions to explore next.

The University Product team at Coursera develops new tools and features for our university partners. A major part of this is building windows onto these rich and complex streams of data. We take raw data from learners’ activity and, with the help of analysts and designers, shape it into a form that instructors can act on. The visualizations and metrics we present help instructors understand their learners and make informed decisions. By building user-friendly tools, we are making data a part of the everyday act of teaching.

Dashboards

This spring, we launched a “Google Analytics”-style Course Dashboard to give teaching staff a top-level view of what was going on in their courses: Who is taking my course? Where are they coming from? How are they doing? What are the trouble spots, where learners fall off track? This dashboard was received with enthusiasm. Every week, over half of our instructors stop by to check on their course’s progress.

But this first version really just skimmed the surface. To go deeper, we worked this summer on extending the dashboard into quizzes and peer assessments, giving our instructors a question-by-question and option-by-option view of a course’s interactive content. This kind of feedback is essential to instructors. It lets them find the answers to questions like: What’s easy for my learners? What’s hard? What are the common mistakes, and where should I focus my instructional attention?

As an example, take the first quiz in our founder Andrew Ng’s Machine Learning course, currently in its 8th offering. At the top of the quiz dashboard (shown above) we show a few vital stats, as well as three interactive charts showing details on overall performance. Below this lies an itemized table of the quiz’s questions, giving basic metrics on a question-by-question basis.

These stats are all pretty straightforward, but you can do a lot with them. A good first step is to sort by “First attempt average score”, so that the questions which stump learners the most float to the top. In this case, there are some clear outliers:

Now this is surprising! The expanded details pane shows that the first two questions have only two choices each. A 54% average means learners are performing almost as poorly as if they just chose options at random.

At this point, it’s up to the instructor to look closer. Is this level of performance what the instructor anticipated? Or is there some unexpected problem – a concept that wasn’t communicated clearly, or an incorrectly worded question? In this case, with the issue brought to his attention by the dashboard, Andrew compared these low-scoring questions to similar ones, and found that these two questions were more confusing than the rest. As a result, he went into our quiz editor and clarified these questions for the next run of his course. The dashboard will be there to check that these fixes worked, continuing the cycle of iteration and improvement.

Peer-graded assignments

The new detailed dashboard also supports peer-graded assignments. For these assignments, simple metrics of averages and counts are not sufficient. One thing we wanted to shed light on was the accuracy of submitted assignments’ final scores. We compute a final score for a submission by combining the scores that different peer evaluators give it. If these evaluators tend to agree on a score, we can be fairly confident that we’ve pinned down a good guess of the submission’s true score. But if evaluators assign varied scores to the same submission, that means that random noise is affecting the final score much more.

[Chart: score distributions (0% to 100%) for three example submissions, comparing low inter-grader variability with high inter-grader variability.]

For our dashboard, we use a bootstrap sampling process to simulate the scoring process. From this simulation, we compute a simple measure of the average sampling error. It tells, on average, how far the score we give a submission is from the ideal score we would give it if we had an infinite number of evaluations. For example, the low inter-grader variability submissions on the left (above) result in an average error of 6%, and the high inter-grader variability submissions on the right result in an average error of 17%.
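To make the metric concrete, here is a minimal sketch in Scala (not Coursera’s actual implementation) of estimating the average sampling error for one submission by bootstrap resampling its evaluators’ scores:

import scala.util.Random

// Resample the evaluators' scores with replacement many times and measure how far the
// resampled means wander from the mean of all real evaluations.
def averageSamplingError(scores: IndexedSeq[Double],
                         iterations: Int = 1000,
                         rng: Random = new Random()): Double = {
  require(scores.nonEmpty, "need at least one evaluation")
  val observedMean = scores.sum / scores.size
  val deviations = (1 to iterations).map { _ =>
    val resample = IndexedSeq.fill(scores.size)(scores(rng.nextInt(scores.size)))
    math.abs(resample.sum / resample.size - observedMean)
  }
  deviations.sum / deviations.size
}

Tightly clustered scores (say, 0.80, 0.85, 0.75) produce a small average error, while widely spread scores (say, 0.20, 0.90, 0.50) produce a much larger one, mirroring the low- and high-variability cases above.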

If an instructor finds a peer-grading score item with an abnormally high average error, they might be able to reduce this error by making the scoring criteria more clear or increasing the number of peer evaluators per submission. We hope this metric will help instructors monitor the health of their course’s peer-grading systems.

Data pipeline

As anyone who has built a data-driven product knows, the engineering behind systems like this goes much deeper than a pretty interface. Here’s an overview showing the flow of data from raw production databases all the way to a dashboard user:

  • Production DBs (MySQL, Cassandra): The data driving these dashboards starts in our production MySQL and Cassandra databases. This data is spread across multiple systems and is not yet in a format suitable for analytics.

  • ETLs → Data warehouse, primary tables (Amazon Redshift): To consolidate, clean, and preprocess the data, it is ETLed into our Amazon Redshift-hosted data warehouse. This process runs every 24 hours, using our recently open-sourced Dataduct system.

  • Aggregation queries → Data warehouse, intermediate tables (Amazon Redshift): At this point, we could directly query the data warehouse to generate reports for all courses, but the complex joins and aggregations required would make these per-course queries cost-prohibitive. So first, we generate Coursera-wide intermediate tables of aggregates.

  • Report generation → Report storage (Amazon S3): Simpler per-course queries can now be run against these intermediate tables. The results are processed to generate JSON reports of data for each class, which are stored in S3.

  • Report delivery API → Browser: The instructor’s browser hits a REST layer, which pulls the appropriate report down from S3; our single-page-app front-end renders the result.
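For concreteness, the report-storage step might look roughly like the sketch below; the bucket name and key layout are hypothetical, and it uses the AWS SDK for Java from Scala.

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectMetadata
import java.io.ByteArrayInputStream

// Illustrative only: write one JSON report per course and report type to S3 so the
// REST layer can later fetch exactly one object per dashboard request.
val s3 = new AmazonS3Client()   // credentials come from the instance role or environment

def storeReport(courseId: String, reportType: String, json: String): Unit = {
  val bytes = json.getBytes("UTF-8")
  val metadata = new ObjectMetadata()
  metadata.setContentType("application/json")
  metadata.setContentLength(bytes.length.toLong)
  s3.putObject("dashboard-reports", s"$reportType/$courseId.json",
    new ByteArrayInputStream(bytes), metadata)
}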

The end result: Instructors have instant access to up-to-the-day metrics on every one of their quiz questions and peer assessment criteria.

Conclusion

We are very glad to be able to offer these kinds of features to our instructors, and we are excited to see what they do with them. But really, this is just the beginning. We want to do far more to pull insight-needles out of the data-haystack, directing instructors’ attention to the most important patterns and points of interest. We are also working on completing the feedback loop, by integrating dashboards and analysis tools with the authoring tools instructors use to create and edit course content. Imagine using analytics to identify a location for improvement, making a revision, and then, within days, seeing the impact of your change on learners’ success. This vision – of platforms which allow instructors to rapidly advance the effectiveness of their instruction – drives a lot of what we do here in the University Product team.

(And like the rest of Coursera, we’re growing fast. Let us know if you’re excited about the work we’re doing and think you might want to join the fun.)


Long-running jobs at Coursera

Out with the old…

In the early days of Coursera, we had a variety of long-running jobs that were needed to support our platform, such as batch email sending, class-wide quiz regrades, gradebook exports for our instructors, and more. This led us to build Cascade, a simple PHP framework that uses worker threads to poll Amazon SQS for new jobs and execute them.

However, we found that there were a number of drawbacks with the system we had built, such as a lack of isolation between colocated workers and a fragile and manual deployment process. In addition, tight integration with SQS resulted in a poor development story that made it difficult for developers to easily prototype and test new jobs on our framework. At first, building Cascade in PHP allowed us to integrate tightly with existing code for our online PHP stack. However, as we transitioned to Scala for both the online and offline worlds, confining our jobs to PHP became a hindrance rather than an advantage. As a result, we decided to write a more flexible successor to Cascade, without the inefficiencies of our first system.

… in with the new

We named this new system “Iguazú,” after the famous South American waterfalls. Rather than construct Iguazú from scratch, we chose to leverage Docker and Mesos, which were a great fit for our needs in several ways. We also generalized the framework to support pluggable queuing services, thereby streamlining the development lifecycle by allowing local queues.
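The real interfaces are internal to Iguazú, but a hypothetical sketch of what such a pluggable queue abstraction could look like is below; the names and signatures are assumptions, not the actual API.

import scala.concurrent.Future

// Illustrative only: in production this would be backed by SQS, while tests and local
// development can plug in a simple in-memory implementation.
case class JobRequest(jobType: String, payload: String)

trait JobQueue {
  def enqueue(job: JobRequest): Future[Unit]
  def poll(): Future[Option[JobRequest]]
  def ack(job: JobRequest): Future[Unit]
}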

As a lightweight packaging tool, Docker allows us to easily transition away from Cascade, simply by bundling our existing code for long-running jobs inside a Docker image. Moreover, our deployment process has become a quick two-step process: build a new Docker image using a Dockerfile, and upload it to a private registry.

While Docker helps us manage our job code, Mesos does the heavy lifting in managing how the jobs are run. By design, Mesos allows us to isolate our jobs and ensure that no one runaway job will cause other jobs to be terminated. Furthermore, Mesos still leaves us with enough control over how our jobs are scheduled and run, allowing us to autoscale without terminating machines that are still running jobs.

Conclusion

By using Mesos and Docker, we have built a new job-running system that we plan to use for many functions across Coursera, with use cases ranging from export jobs for instructors to grading student-submitted programming assignments to running batch analytics jobs for internal teams. We are currently vetting Iguazú in production and making it as robust and performant as we need it to be. Nevertheless, Mesos and Docker already provide us with numerous wins that we believe will make our new system a great tool for the many kinds of jobs we want to run.



Coursera Engineering Open House: Update

We’d like to thank everyone who made it to Coursera last Thursday for our open house. We had a great time answering questions, introducing everyone to Coursera’s culture, and sharing the mission behind our company. If you weren’t able to make it, don’t worry; this definitely wasn’t our last open house! For anyone interested, a few pictures from last night, along with a video of the whole event, have been included below.

Be sure to keep updated with the blog for announcements on upcoming engineering events and talks at Coursera. Further, if anything you’ve seen at the open house or on the blog resonates with you, we’d love it if you joined our team! Check out: https://www.coursera.org/about/careers for all our open positions.

Efficient Front-end Development at Coursera

we like to go fast!

As a front-end developer, you find yourself mucking around in the browser. A lot. Whether it’s ironing out interaction flows, figuring out transition durations, or making sure your API calls are valid, the browser is your battleground; anything that takes you out of it is a distraction.

As such, I took a look at what it might be like to develop on the Coursera front-end without having our hefty virtual machine running… and came away with some really interesting realizations about the impact of being able to do so.

How I did it.

With my virtual machine turned off, I needed a solution to two things:

  1. Something to serve our static assets (HTML, JavaScript, and CSS) to the browser.
  2. Something to proxy our API requests to get the data we needed to construct the client-side application (class information, user models, that sort of stuff).

Without the webserver that our virtual machine normally hosts, we turned to the lightweight Express.JS to serve all our compiled assets. Express proved to be extremely easy to set up, and I was able to get our static assets properly hosted at localhost:9000/static in no time at all. One down, one to go.

The second problem was a little more cumbersome. I needed a fast, efficient way to resolve our API requests with the data required to construct our single page application, which could include things like course information and student information. I didn’t want to have to burden developers with having to generate this data, so I took a look at our existing data sources for ways to construct the application. We came up with two different ways to address this problem, depending on our development needs.

For our first approach, we found a solution that worked well if you just wanted to see the UI and quickly iterate on the frontend. As part of our testing suite, we generate mock data in order to simulate models that had already come back from the server. Since we use Require.JS, we were able to take advantage of the map config in order to reroute requests for data to the mocks that we had generated for testing. And because we test almost everything, it was incredibly easy to map our data resources to the mock data. This worked really well, and in conjunction with the Express.JS server we had a complete solution to the problem of running Coursera without a virtual machine.

For our second approach, we realized we had an existing service powering all the APIs we wanted: our staging servers. Thus, we enabled a proxy between our Express.JS web server and staging instances of Coursera. This allowed us to continue running the front-end on our local machines but use backends running on beefier servers. This solution allows us to iterate on the front-end quickly while still having the persistent data we need for the full interaction flows that our product supports.

Revelations

After going through this process, I realized a couple of important things that I hadn’t envisioned at the outset: added benefits of being able to develop against a lightweight server.

1. Validate mock data

We create a lot of mock data to get proper test coverage for all of our client-side code, but you can imagine that in the pursuit of those green checks, some data falls through the cracks. Some of the mock data had essential fields omitted, some data was too expansive, and some was just plain wrong. Being able to construct the entire application in the browser, just like a client would, makes sure that our test data is actually worth using in our test suites.

2. Increase battery life

Don’t discount this one; no one gets bothered more by a hot (or dead) laptop than a developer in the zone. Virtual machines are usually the culprit behind shorter-than-normal battery life; without one running, we can work where we want, in the places we are most productive.

The Future of Front-end Development

As front-end developers, we should focus on the front-end. Courser.JS gives us the ability to do just that, making our front-end development process faster and more efficient. Being able to run our applications more easily also greatly lowers the barrier to entry for contributing to front-end code; think about the technical designers and marketers who want to get down and dirty in the code and see how the application looks and feels without having to figure out all the odds and ends of a virtual machine. Running front-ends without backends allows us to focus on the work we’re doing, in the medium it runs in, in the most efficient way possible.

Coursera Engineering Open House

If you’ve been following our engineering blog (or just happened to stop by), and are interested in learning more about engineering at Coursera, please join us on Thursday November 6th for Coursera’s engineering open house. Throughout the night, you’ll have the opportunity to both learn more about the tools and technologies used at Coursera, as well as meet the engineers behind the blog posts.

We’ll be starting off the open house with an introduction from co-founder Andrew Ng. Following will be a variety of lightning talks covering everything from the monetization and business side of MOOCs, to life as an engineer at Coursera, and finally, our newest products and the technologies that run them. We’ve set aside most of the night for you to check out our demo stations, where engineers from each team will show off some of the work they’re proud of and answer any questions you may have about engineering at Coursera.

Doors at our Downtown Mountain View office open at 6pm. Food, drinks, and swag will be provided. If you’re interested and would like to find out more, please check out: http://courseraengoh.splashthat.com/. Don’t forget to RSVP!

See you there.

Writing a Custom Control for iOS 8 using Swift and Auto Layout


Here at Coursera, every new line of code we write for iOS is in Swift. As a result, I’ve written a few custom controls lately in Swift using Auto Layout and iOS 8’s new IBDesignable/IBInspectable attributes which are supposed to live render custom views in Storyboard files. I’ve seen a simple example setting up the new Interface Builder (IB) attributes and a custom control rewritten using Swift, but nothing written from scratch with all these technologies in mind. Today, I’m going to walk through the strategies I used when writing a custom scrubber we recently open sourced from Coursera. Before we get started, it’s worth setting up your project using the simple custom control example by WeHeartSwift I mentioned earlier since those IB attributes require certain project configuration.

Creating the ScrubberBar with Auto Layout

The scrubber we want to build today will look something like this:


There are a lot of components here, so let’s start by creating the ScrubberBar as a subclass of UIView. Since we are considering Auto Layout from the beginning, there are three things to immediately think about:

  1. Turn off the autoresizing mask in the init.
    required init(coder aDecoder: NSCoder) {
       super.init(coder: aDecoder)
       setTranslatesAutoresizingMaskIntoConstraints(false)
    }
  2. Override the intrinsicContentSize to specify what size this component would like to be. I typically define these intrinsic values as private properties, but for now I’ll simply specify some reasonable values.
    override func intrinsicContentSize() -> CGSize {
       return CGSizeMake(100, 70)
    }
  3. Indicate to other views that this view requires auto-layout.
    override class func requiresConstraintBasedLayout() -> Bool {
      return true
    }


Finally, we need to set the corner radius on the ScrubberBar’s layer to get those rounded corners. In order to do this while respecting Auto Layout, we will override layoutSubviews, which gets called whenever the constraints and frame change.

override func layoutSubviews() {
   super.layoutSubviews()
   layer.cornerRadius = frame.height/2
}


Establishing View Hierarchy


Components of the ScrubberControl

Now that we have our ScrubberBar component created, we need to think about the view hierarchy for our control. While it makes sense for UI aspects like ScrubberEvents and the BufferBar to be laid out by the ScrubberBar, other things like the ScrubberElement seem more independent. Therefore, we likely need an overarching container for both the ScrubberElement and ScrubberBar components. We’ll call this container something über creative like “ScrubberControl”. We are able to choose such a generic name because frameworks are namespaced in Swift; if a name conflict occurs, you can always refer to the scrubber by prefixing it with the framework’s name.

Now as with the ScrubberBar, this custom view will need the same initializations as before. However, this time the ScrubberControl will also create a ScrubberBar in the init method.

public var scrubberBarColor: UIColor = UIColor.grayColor()

required public init(coder aDecoder: NSCoder) {

   scrubberBar = ScrubberBar(coder: aDecoder)
   super.init(coder: aDecoder)
   
   setTranslatesAutoresizingMaskIntoConstraints(false)
   addSubview(scrubberBar)
   scrubberBar.backgroundColor = scrubberBarColor
   setupLayout()
}
You’ve probably noticed we also set the ScrubberBar’s color; this is so we can see the view when it’s rendered later.


Next we need to use Auto Layout to position the ScrubberBar in our ScrubberControl’s view. For this we’ll implement the setupLayout method above, creating several layout constraints to center the ScrubberBar in our control and then add them to our view.

func setupLayout() {
	var constraintsArray = Array<NSObject>()
		
	// Background Bar Constraints
	constraintsArray.append(NSLayoutConstraint(item: scrubberBar, attribute: NSLayoutAttribute.CenterX, relatedBy: NSLayoutRelation.Equal, toItem: self, attribute: NSLayoutAttribute.CenterX, multiplier: 1.0, constant: 0.0))

	constraintsArray.append(NSLayoutConstraint(item: scrubberBar, attribute: NSLayoutAttribute.CenterY, relatedBy: NSLayoutRelation.Equal, toItem: self, attribute: NSLayoutAttribute.CenterY, multiplier: 1.0, constant: 0.0))

	constraintsArray.append(NSLayoutConstraint(item: scrubberBar, attribute: NSLayoutAttribute.Width, relatedBy: NSLayoutRelation.Equal, toItem: self, attribute: NSLayoutAttribute.Width, multiplier: 1.0, constant: 1.0))

	constraintsArray.append(NSLayoutConstraint(item: scrubberBar, attribute: NSLayoutAttribute.Height , relatedBy: NSLayoutRelation.Equal, toItem: nil, attribute: NSLayoutAttribute.NotAnAttribute, multiplier: 1.0, constant: 15))

	self.addConstraints(constraintsArray)
}
Always make sure to add the subview to the view hierarchy before adding constraints, and always add those constraints to the parent view.


Provided you’ve added your custom control to the view hierarchy through code or storyboard you should be able to run your project and render something like this:

I colored the background of the ScrubberControl orange to show its frame.

Making the Control Viewable in Interface Builder

At this point we have a basic control; now let’s make it available in our Storyboard. This part requires some project configuration, so if you haven’t been building these classes inside a separate framework, you might want to walk through the simple WeHeartSwift walkthrough I mentioned earlier. First we’ll add the @IBDesignable attribute to our class.

@IBDesignable public class ScrubberControl: UIView {


Then our next step will be making the scrubberBarColor property available for use in the Storyboard. To get this behavior, apply the @IBInspectable attribute to the scrubberBarColor property and change the border color of our scrubber whenever that variable is set.

@IBInspectable var scrubberBarColor: UIColor = UIColor.grayColor() {
   didSet {
      scrubberBar.backgroundColor = scrubberBarColor
   }
}


Now you should be able to change the scrubber bar color in your Storyboard and have that color reflected when you build+run your app.


Why can’t you just view the rendered ScrubberControl in the storyboard with the new value? Well… that’s because, despite Apple launching the feature to live-render custom views, it doesn’t really work. Here are my real-world experiences using the @IBDesignable attribute:

  • Whenever a non-initialized, non-optional property is added to a subclass of UIView, the compiler requires initWithCoder to be overridden. However, if initWithCoder is overridden, then your custom UIView will no longer render in the storyboard.
  • Whenever your custom view doesn’t render (e.g., because of the previous bullet point), there is no debug information to determine why. Often the sidebar will read “Updating” or “Timed Out” with no additional information.
  • If the sidebar does display “Crashed” for the Designable, a clickable button labeled “Debug” appears that brings you to your @IBDesignable class. However, the debugger will just highlight the class definition, throw a BAD_ACCESS exception, and provide no further details.

In summary, @IBDesignable is unusable; @IBInspectable, however, has always been reliable.

Handling Touch Input or Animation with Auto Layout 

At this point we just have a simple ScrubberBar inside our ScrubberControl. However, as we build each new component, the steps I described above can be applied repeatedly to design a static layout:

  1. Turn off resizing mask
  2. Set the intrinsic content size
  3. Indicate to the system this view requires Auto Layout (requiresConstraintBasedLayout)
  4. Modify the view appearance and layout the constraints in the init method


Therefore, I’m going to skip ahead a bit and discuss how we handled user input on the ScrubberElement of our control. The behavior we want for this ScrubberElement is for a drag gesture to update the element’s center, but for that center to never exceed the extent of the scrubber bar.


When handling any user input or animation while using Auto Layout I perform two steps:

  1. Ignore constraints and modify the frame directly
  2. After the touch or animation completes, re-adjust constraints to match the current layout.


Just like previously, the first thing to do is create all constraints within the init method. Next, store any constraints affected by an animation or touch input in a property. For our ScrubberElement the center constraint is the sole aspect that affects our element’s horizontal position. Therefore we will store it in a property so it can be adjusted after the touch gesture completes.

scrubberElementCenterConstraint = NSLayoutConstraint(item: scrubberElement, attribute: NSLayoutAttribute.CenterX, relatedBy: NSLayoutRelation.Equal, toItem: scrubberBar, attribute: NSLayoutAttribute.Left, multiplier: 1.0, constant: centerValue)


Next when the drag gesture occurs, directly set the center of our element to keep it in step with the touch location.

scrubberElement.center = CGPointMake(calculatedXCoordinate, scrubberElement.center.y)


Lastly, after the touch completes, trigger a layout pass and update the constraint with the new value in an override of layoutSubviews.

override public func layoutSubviews() {

   super.layoutSubviews()

   if let centerConstraint = scrubberElementCenterConstraint {

      centerConstraint.constant = scrubberBar.centerValueForItem(scrubberElement.index)

   }

…


You may have noticed that I’m not updating constraints in the updateConstraints method (the typical scenario). This is because the centerValueForItem method used above relies on the scrubberBar’s frame. Therefore, the center constant value must be calculated later on in the view’s layout cycle (later than updateConstraints) in order to return a correct frame value.

Closing Thoughts

As you’ve seen, despite how intimidating UIKit can be, creating a custom UI component isn’t all that difficult. Simply following a repeatable procedure will lead you down the right path 90% of the time. There is a lot more code than what we walked through in this post, so take a look at the custom scrubber control we open sourced for more details. If you want to learn more about the layout life-cycle and using Auto Layout with custom controls, I suggest reading the objc.io blog’s Advanced Auto Layout Toolbox.

About the Author

Brice Pollock is an iOS Software Engineer at Coursera where they make applications which disrupt education using Swift and a modified interpretation of the VIPER architecture. He also writes bi-weekly about technology and Silicon Valley on Medium.

Talks @ Coursera – Counting at Scale with Scalding and Algebird

Data plays an important role at Coursera. We use data to improve our learner experience, gather insights in MOOC pedagogy, and provide instructors insight into their courses via our instructor dashboards. The data infrastructure team at Coursera seeks to provide data consumers with great tools that enable them to transform and analyze data effectively.

One need that arose was the ability to write complicated data flows that leverage Hadoop MapReduce. MapReduce is revolutionary in that two simple distributed operations, map and reduce, can be used to effectively parallelize computations across large datasets. However, this same simple API makes it inconvenient to perform operations like joins, secondary sorts, and aggregations.

Coursera has started writing some of its Hadoop transformations in Scalding, and so far results are great. Scalding is concise, performant (including powerful optimizations like skewed join support), and allows us to write all our transformations in Scala. In addition, Scalding makes it really easy to unit-test our data flows without having to run Hadoop at all.
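To give a flavor of that conciseness, here is a minimal sketch of a Scalding job (the input path and field layout are hypothetical) that counts enrollments per course, an aggregation that would require a full mapper/reducer pair in raw MapReduce:

import com.twitter.scalding._

class EnrollmentCountsJob(args: Args) extends Job(args) {
  // Input: one "<learnerId>\t<courseId>" record per line.
  TypedPipe.from(TextLine(args("input")))
    .map { line => (line.split("\t")(1), 1L) }        // key each record by courseId
    .sumByKey                                         // Algebird supplies the Semigroup[Long]
    .write(TypedTsv[(String, Long)](args("output")))
}

Because the whole flow is ordinary Scala, it can be exercised with Scalding’s JobTest in a unit test, without spinning up Hadoop at all.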

It’s only natural, then, that as part of our Talks @ Coursera series, we had the pleasure of hosting Ian O’Connell, Scalding contributor and Sr. Software Engineer at Twitter, to talk about real world examples of using Scalding and Algebird at Twitter-scale.

We also had Daniel Chia talk about why Coursera uses Scalding, and what our experience has been like so far.

We hope to see you at our next talk soon!



Coursera’s Adoption of Cassandra

Like many startups, Coursera began its data storage journey with MySQL, a familiar and industry-proven database. As Coursera’s user base grew from several thousand to many millions, we found that MySQL provided limited availability and restricted our ability to scale easily. New product initiatives and requirements provided a perfect opportunity to revisit our choice of core workhorse database.

After evaluating several NoSQL databases, including MongoDB, DynamoDB and HBase, we elected to transition to Cassandra. Cassandra’s relative maturity, masterless architecture (for availability), tunable consistency, and stable low-latency performance made it a clear winner for our needs.

Transitioning from MySQL to Cassandra

Given the significant differences between MySQL and Cassandra, the transition to Cassandra took time and thoughtful effort. Relational data models designed for MySQL will perform very poorly on Cassandra, because Cassandra is optimized for many simple point or range queries, while relational data models encourage normalization. MySQL uses joins and filters to extract data, but Cassandra does not support joins, and provides only weak support for arbitrary filtering; instead, each Cassandra query should only read a subset of data for a partition key.

Here are some takeaways from our transition to Cassandra.

Start Small

Don’t migrate your entire product to Cassandra on day one. You will inevitably make mistakes on your first attempt; limiting the scope of early migrations will minimize both migration effort and end-user impact. Start with a small feature for which you can afford some downtime. For example, moving your user login information to Cassandra would probably be a bad idea; better choices would be a new feature that can be slowly rolled out, or an optional feature like a feedback gathering tool that can be omitted if it fails to load.


We started our first Cassandra project with a lightweight polling feature called Quick Questions. As the resident Cassandra expert, I worked closely with the engineers building Quick Questions to ensure smooth development on top of Cassandra. We found it helpful to provide data store libraries, as well as data modeling consulting, both of which are explained in more detail below.

Effect a mindset shift

Cassandra has a very specific set of capabilities. We found that we needed to give developers time to properly understand Cassandra’s data models in order to get good query latencies. As mentioned earlier, treating Cassandra as a relational database like MySQL is a guaranteed way to get terrible query performance. Few developers at Coursera had applicable experience, so we used tech talks – informal engineering seminars held over lunch – to help educate our team.

One of the biggest differences between Cassandra and MySQL is that Cassandra requires much earlier development and understanding of intended query patterns. In MySQL, you store your data in a normalized fashion, and then write appropriate queries to retrieve data based on your application needs. In Cassandra, however, you start with your application first, figure out which queries you need to ask of your data, and store your data in a denormalized fashion according to the queries.

For example, at Coursera we have a many-to-many relationship between learners and courses. We need to keep track of, for each learner, which courses they are part of, and, for each course, which learners are enrolled in it.

A possible SQL table for this relationship would be:

CREATE TABLE `courses_learners`  (
  `id`         INT(11) NOT NULL auto_increment,
  `course_id`  INT(11) NOT NULL,
  `learner_id` INT(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `c_l` (`learner_id`, `course_id`),
  UNIQUE KEY `l_c` (`course_id`, `learner_id`)
)

Translating this schema directly to Cassandra would be problematic for several reasons:

  • The use of an auto-increment primary key: We don’t have a way to generate auto-incrementing keys in Cassandra. Instead, we favor UUIDs or natural keys.
  • Requiring indexes: Our two required queries would require scanning all the data in Cassandra, since we can’t easily create indexes on (course_id, learner_id).

Instead, in Cassandra we’d opt to denormalize our data and write into two tables optimized for our queries:

CREATE TABLE courses_by_learner (
  learner_id uuid,
  course_id uuid,
  PRIMARY KEY (learner_id, course_id)
)

CREATE TABLE learners_by_course (
  course_id uuid,
  learner_id uuid,
  PRIMARY KEY (course_id, learner_id)
)

Here, we’ve used the pair (learner_id, course_id) as the natural key for membership in a course. By denormalizing the data into two tables, we’re able to answer both queries using a single read of one partition in Cassandra, rather than scanning and filtering all data. Cassandra excels at reading one row, or a slice of rows, from exactly one partition. By denormalizing our data and organizing it according to our query pattern, we’re able to ensure we can obtain our data using read patterns that play to Cassandra’s strengths.
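As a rough illustration (the contact point and keyspace are placeholders), here is how those two single-partition reads might look from Scala using the DataStax Java driver:

import com.datastax.driver.core.Cluster
import java.util.UUID
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("course_memberships")   // hypothetical keyspace

// Each call reads exactly one partition from one of the denormalized tables above.
def coursesForLearner(learnerId: UUID): Seq[UUID] =
  session.execute("SELECT course_id FROM courses_by_learner WHERE learner_id = ?", learnerId)
    .all().asScala.map(_.getUUID("course_id"))

def learnersInCourse(courseId: UUID): Seq[UUID] =
  session.execute("SELECT learner_id FROM learners_by_course WHERE course_id = ?", courseId)
    .all().asScala.map(_.getUUID("learner_id"))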

Data modeling consulting

Most applications probably don’t have very complicated query patterns, and application developers should be able to easily identify the right library or Cassandra pattern to apply. For the minority of use cases that are more advanced, it’s useful to have a set of experts whom developers can consult.

Libraries

Libraries offer a way to abstract away the complexities of Cassandra for simpler use cases. For example, many applications only need a simple key-value store, where concurrency is low and read-modify-write is not a problem. In such cases, developers can use a pre-written Cassandra key-value store library and not have to worry about the low-level Cassandra details at all. The library uses a sound Cassandra data model, sets read and write consistency levels appropriately, and converts Java Futures to Scala Futures. As a bonus, it also maintains a history of previous values per key to aid with debugging and troubleshooting. We plan to extend the library in the future to make it easy to build custom secondary indexes, and it might grow other goodies over time.
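The library itself is internal to Coursera, but a hypothetical sketch of the kind of interface it exposes might look like this (names and signatures are illustrative, not the real API):

import scala.concurrent.Future

trait CassandraKeyValueStore[K, V] {
  def get(key: K): Future[Option[V]]               // single-partition read at an appropriate consistency level
  def put(key: K, value: V): Future[Unit]          // write that also records the previous value
  def history(key: K, limit: Int): Future[Seq[V]]  // recent prior values, for debugging and troubleshooting
}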

Conclusion

The transition from MySQL to Cassandra has been an interesting one for Coursera. Cassandra has required us to leave our SQL mindset behind and design our data models according to our queries. We can’t say that our journey is complete, but so far, we’ve been reaping the benefits. Maintenance is much simpler, we’ve had practically no downtime, and performance on SSDs has been great with 95th percentile read latencies of less than 5ms.


If working on Cassandra and other infrastructure problems interest you, or you just want to help scale our education platform to reach every learner in the world, we’re always looking for passionate and talented engineers. Check out our careers page or email us at joinus@coursera.org.