
XBlock Lessons: Plugin Performance and Grading

David Ormsbee · Published in edX Engineering · Jul 6, 2018 · 12 min read


A major architectural goal for Open edX in the coming years is to push more functionality out into pluggable interfaces, so that developers can extend and customize the platform without having to fork the code. With this in mind, I thought it would be helpful to write a series of posts to reflect on some of the lessons we learned from XBlock, our first widely used plugin system. XBlocks are developed in separate repositories by edX and other Open edX community members, pip-installed, and then discovered and executed in-process by the LMS and Studio.

This first post is on plugin performance as seen through the lens of grading.

TLDR

Things to do when designing a performant system with plugins:

  1. For any aggregate queries (e.g. getting many scores together for a final grade), make sure the plugins incrementally push the necessary data into your system so that your system can inspect its own data model to respond to requests. Do not query the plugins at runtime for aggregate requests.
  2. Set conservative constraints. Arbitrary expressive power is hard to rein in and make performant in a backwards compatible way later.
  3. If you’ve already lost battles #1 and #2, asynchronous processing can help to bridge the gap and convert your system to something more performant, at the expense of operational complexity.

For justifications, please continue reading…

A Brief History of XBlock

EdX has emphasized content quality from its earliest days. The inaugural prototype course launched in early 2012 with rich, interactive visualizations and a gradable circuit simulator. Every content type in the courseware — HTML snippet, video, problem, etc. — was represented as a different subclass of XModule. Container types like chapters, sequences, and individual units were also XModules that had references to their child nodes.

An XModule is like a miniature web application in many ways. It declares its own static assets, has state information associated with both the content and the user, defines its own URL handlers, and knows how to render itself. XModules were a bit unusual in that they were built to be composed together on a single page. Introspecting child nodes was a common pattern, so a call to render a container XModule might result in that container then making render calls to all its child XModules and combining the resulting HTML before sending it back to the original caller. The entire course was a large DAG of XModule instances.
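To make the composition pattern concrete, here is a rough sketch of a container rendering its children. The class and method names are illustrative, not the actual XModule API, which differs in its details.

    # Illustrative sketch of the recursive rendering pattern described above.
    # Names and signatures are simplified; the real API differs in detail.
    class ContainerBlock:
        def __init__(self, children):
            self.children = children

        def render(self):
            # Rendering a container renders every child in turn and stitches
            # the resulting HTML fragments together into one response.
            child_html = "".join(child.render() for child in self.children)
            return "<div class='vert'>{}</div>".format(child_html)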

XModules and XBlocks: Turtles all the way down.

XModules worked well enough, but they had a number of drawbacks, the most immediately obvious of which was that their code was built into edx-platform itself, along with all the LMS, Studio, and student management code. Anyone who wanted to add a new content type needed to spin up and run the entire LMS (this was before we used Docker or even Vagrant). Getting code reviews in a timely manner was difficult, and meeting the scaling requirements for edx.org was often an unnecessary burden for people who just wanted to do experiments on their own course of thirty students.

Our solution to this was a new API called XBlock. It would serve the same role that XModule did, but be pluggable so that you could develop a new content type in its own repository and simply pip-install it. Open edX operators could choose to install whichever XBlocks met their requirements without having to wait on pull requests into the central edx-platform repository. In addition, XBlock would address some of the shortcomings of the XModule API and add more features around state storage and composing content.

Note: This is an extremely abbreviated history, and I’ve glossed over a lot of details to talk about only what’s relevant for this article. XModules still exist in edx-platform, though they’ve been refactored to be a subclass of XBlock with compatibility shims on top.

XBlocks and Grading

Grading offers some insight into the more general tradeoffs made in XBlock design. Please keep in mind that most of what’s described here is historical, to demonstrate some of the growing pains our design choices caused over the years. If you want to get an overview of what grading looks like today in Open edX, I highly recommend the Grades Architecture design documents.

Two of the most basic questions that Open edX grading code needs to ask of an XBlock are how many points a user has earned and how many points a problem is worth (if no attempt has been made). While most XBlocks serialize their content data as attributes in a single tag, they are free to define any arbitrary XML structure. More complex XBlocks like the Open Response Assessment Block have XML that defines the question, assessment rubrics, assessment intervals, etc. An ORA2 problem defines its possible score as the sum of the scores of all its rubrics. Capa problems define their max score by how many response fields they have. The details of how any given XBlock figures this out are opaque to the grading system, which simply calls a given XBlock instance’s get_score and max_score methods. Grading also calls get_children to crawl through the list of content that is available to the user.
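As a rough illustration (the method names above are real, but the bodies and signatures here are simplified), a leaf problem block conceptually exposes something like this, with all the interesting logic hidden inside the block:

    # Hedged sketch of the scoring interface the grading code relies on.
    # Bodies are invented for illustration; real blocks derive these values
    # from their own XML definitions and user state.
    class HypotheticalProblemBlock:
        def __init__(self, response_fields, student_answers):
            self.response_fields = response_fields    # from the content definition
            self.student_answers = student_answers    # per-user state

        def max_score(self):
            # e.g. a capa-style problem is worth one point per response field.
            return len(self.response_fields)

        def get_score(self):
            # How a block computes this is entirely its own business.
            return sum(
                1 for expected, given in zip(self.response_fields, self.student_answers)
                if expected == given
            )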

Simple, Powerful, Predictable: Pick Two

Let’s step back and look at the implications of this approach with a real world query. A user has just asked to see their Progress page, which will list their scores for every problem in the course. Our grading code crawls the course structure with get_children and calls some combination of get_score and max_score on all the leaf XBlocks so that it can eventually aggregate the responses and apply a grading policy. How long does this operation take to complete?
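In sketch form, the traversal looks roughly like this (get_children, get_score, and max_score are the real method names; the rest is simplified for illustration):

    # Simplified sketch of the crawl-and-aggregate pattern described above.
    def naive_progress_scores(course_root):
        earned, possible = 0, 0
        stack = [course_root]
        while stack:
            block = stack.pop()
            children = block.get_children()
            if children:
                stack.extend(children)
            else:
                # Every one of these calls runs arbitrary plugin code:
                # XML parsing, sandboxed execution, even network requests.
                earned += block.get_score()
                possible += block.max_score()
        return earned, possible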

The answer is simple: We have no idea.

Our interface synchronously queries XBlocks at runtime for score information, but we know nothing about how individual XBlocks will implement those methods. They could parse long XML snippets. They could start sandboxed processes to run untrusted code, evaluate it, and return results over an IPC call. They could do an HTTP request to a remote server. At various points, different XBlocks have done all these things. Multiply this uncertainty across the hundreds and sometimes thousands of XBlock instances that must be queried across an entire course, and you have a recipe for sluggish pages and even the occasional system outage.

This was the tradeoff we found ourselves with. We had pushed away the complexity of understanding how to model and store the data around grading, significantly simplifying our code and allowing for rapid implementation and extensibility. But in doing so, we also sacrificed our ability to reason about the data or make any promises around the stability or performance of the grading system.

Clawing Towards Predictability with Bandaids

Predictable performance can be ignored for an experimental prototype, but not for a production system used by millions of students. Something had to be done.

The first tactical thing to do in this situation is to explore quick optimizations that can be made without changing the interface, which usually means caching. But what can we cache? What’s safe to cache?

The answer should be familiar: We have no idea.

The get_score and max_score methods don’t take any parameters — the results depend on XBlock logic that we can’t know and XBlock state that we don’t explicitly track. If someone wanted to create an XBlock that returns a different possible score on Thursday afternoons for users with five-letter usernames, there is nothing in the function contract that expressly forbids that. For a more realistic example, the first version of our essay grading XModule stored peer assessments in a separate service and needed to check back with that system in order to determine the score. Another applied late-day penalties to the score returned, depending on how far past the due date the user submitted the answer. In the early days of the system, it was not unrealistic to think that an XBlock would dynamically change its max_score in a randomized or adaptive manner as it surfaced new content to the student over time.

Comments in the base class definition of max_score() for XModule in edx-platform highlight the tension between the initial design intent and the reality of how the code actually evolved.

So then we started digging into known XModule and XBlock code to determine what simplifying, non-breaking assumptions we could make. Could we assume that the max score for a given problem will be the same across all users? What if we assume only that the max score for a problem is the same across all users who have never interacted with it? Can we skip the call to get_score and go straight to the table where we store the scores (there’s actually a good correctness argument for doing this anyway)? Okay, we clearly can’t do that for everything because of how version 1 of essay peer assessment works, but maybe we could do it for most XBlocks if we add a special class attribute that lets us know when it’s safe to do so…
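As an illustration of where that line of thinking leads, here is a hedged sketch of the kind of special-case flag described above; the attribute name is invented for this example, though edx-platform grew similar flags over time:

    # Invented example of a per-class escape hatch: skip the plugin call
    # when a block declares that its stored score is authoritative.
    def score_for(block, stored_scores):
        if getattr(block, "safe_to_read_stored_score", False):
            # Trust our own score table instead of invoking plugin code.
            return stored_scores.get(block.location, 0)
        # Otherwise we still have to run whatever get_score does.
        return block.get_score()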

Conflicts that can arise when performance isn’t factored into plugin design.

Bit by bit, we traded away power and simplicity to get more predictable performance. The grading code quickly became a scary place to operate. Access to XBlock structure was contorted to minimize the need to instantiate user state, leading to duplicated and sometimes fragile code to traverse course content. Scores were cached in various ways at different levels.

Yet these were ultimately all bandaid patches. They kept the system from falling over, but the Progress page would remain terribly sluggish until a more radical overhaul was completed.

Fixing (Creating) the Data Model

By this point we were over four years into the life of edx-platform, and the only aggregated grade information in the database was a summary course grade kept in a certificates-related table during certificate generation. Scores for individual users on individual problems were stored in the courseware_studentmodule and submissions_score tables, but grades for sequences on the Progress page were always recalculated on demand.

Let’s shift our perspective to the course instructor. If there are 100,000 people enrolled in a course, and each one takes two seconds to grade, then the (regrettably single-threaded) grading task has to run for over two days to generate a grade report for all students enrolled in the course.

By this point, the performance issue should come as no surprise, but what about basic correctness? What happens if I want to remove an exam from my course after it’s closed, because I don’t want those questions and answers to be publicly available? Well, if I delete the exam or change permissions on it so that it’s no longer visible to students, then it will no longer be returned when we’re crawling the course structure. As far as the grading code is concerned, the entire exam has disappeared from the course, and any student looking at their progress page will suddenly see that they have no exam score at all, significantly dropping their grade for the course.

All of these issues culminated in the Robust Grades project, which introduced a new set of models to persist course- and subsection-level grade aggregations for individual users. It also persisted a list of the content that was visible to the user in each subsection at the time of grading. Now course teams could remove course content without affecting student grades. Persisting the list of visible content also improved performance, because the grading code no longer had to crawl through the XBlock children to determine which permutation of the sequence a given user had seen (a number of factors can affect this, such as permissions, cohorts, and A/B tests).
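A heavily simplified sketch of what persisting those aggregations might look like as Django models; the field names and structure here are illustrative, not the actual schema in edx-platform:

    # Illustrative models in the spirit of the persisted grades described
    # above; not the real edx-platform schema.
    from django.db import models

    class PersistedSubsectionGrade(models.Model):
        user_id = models.IntegerField(db_index=True)
        course_id = models.CharField(max_length=255, db_index=True)
        subsection_id = models.CharField(max_length=255)
        earned_all = models.FloatField()
        possible_all = models.FloatField()
        # Which blocks were visible to this user at grading time, so later
        # course edits don't silently change an already-recorded grade.
        visible_blocks = models.TextField()
        modified = models.DateTimeField(auto_now=True)

    class PersistedCourseGrade(models.Model):
        user_id = models.IntegerField(db_index=True)
        course_id = models.CharField(max_length=255, db_index=True)
        percent_grade = models.FloatField()
        letter_grade = models.CharField(max_length=255, blank=True)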

This was great work, and a huge leap forward for grading on Open edX. Despite having relatively little traffic, the Progress page had always been the #3 or #4 transaction in terms of total system CPU consumption. Now it wasn’t even in the top twenty. Severe load spikes would still result in queuing and delayed grade computation, but they no longer threatened the stability of the user-facing site.

Yet it wasn’t all rosy. The grading system was now much more operationally complex. Race conditions had to be accounted for. In cases of high load, the grade on a user’s Progress page might not reflect their latest answers. It also meant that there needed to be retry logic to make sure that transient failures did not prevent a user’s updated grade from getting recorded. In the old system, if an XBlock had a scoring-related bug, then just reloading the Progress page after the bug fix was applied would often be enough to fix the issue. But now, we had to have a way to invalidate grades that were generated during a known flaky period.

To be clear, I agree with the tradeoffs that the team made on that project. But it’s important to note that there are still drawbacks, even in our shiny new world.

Lessons Learned

I wasn’t the original author of the grading system, and I wasn’t on the team that did the radical overhaul. I was the caretaker of the grading system for much of the time between those events, however, and I’ve watched the newer work being done. There are a few things I took away from the experience with respect to performance.

Control your data.

You cannot make any guarantees about the performance of your system if you have to call arbitrary plugin code during a request. In single invocation cases, this is often fine. A web framework can’t prevent someone from writing a horribly slow controller/view. But it completely kills a system if it has to be done across many instances of a plugin for aggregate queries, like the ones needed to fetch all the scores or max scores for a course.

As you probably noticed, the entire arc of performance improvement involved reducing the number of invocations of plugin code while rendering the Progress page. At first, we added caching at various layers. By the end, we had flipped the data relationship entirely. Rather than the Progress page pulling aggregated score information and content visibility from XBlocks, we incrementally pushed all that information into data models that the grading system defined.
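A minimal sketch of that flipped relationship, assuming a score-changed signal fires whenever a block records a new score; the signal, receiver, and model names here are illustrative (edx-platform has analogous signals and tables):

    # Illustrative push-based flow: the plugin reports a score change, and
    # the grading system persists it in data it owns and can query cheaply.
    from django.db import models
    from django.dispatch import Signal, receiver

    SCORE_CHANGED = Signal()  # hypothetical signal for this sketch

    class StudentScore(models.Model):
        user_id = models.IntegerField(db_index=True)
        block_id = models.CharField(max_length=255)
        earned = models.FloatField()
        possible = models.FloatField()

    @receiver(SCORE_CHANGED)
    def record_score(sender, user_id, block_id, earned, possible, **kwargs):
        # Aggregate queries (Progress page, grade reports) can now read this
        # table instead of calling back into arbitrary plugin code.
        StudentScore.objects.update_or_create(
            user_id=user_id,
            block_id=block_id,
            defaults={"earned": earned, "possible": possible},
        )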

Constrain your plugins.

Object-oriented programming often tempts us with elegant but costly interfaces. XBlock grading had a number of these that had to be walked back in various ways.

For get_score, the fix was to force the plugin to push the score into the system when changes happen, and never require the grading system to re-query the XBlock (trivia note: there’s a remnant of this code still left in the system, as the always_recalculate_grades attribute on XModule, which is now always false).
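From the block’s side, the push looks roughly like the sketch below: when the learner submits, the block publishes its score through the runtime instead of waiting to be asked for it. (Simplified; real handler wiring, fields, and score semantics involve more than this.)

    # Sketch of a block publishing its own score on submission.
    from xblock.core import XBlock

    class TrivialQuizBlock(XBlock):
        @XBlock.json_handler
        def submit(self, data, suffix=""):
            earned = 1 if data.get("answer") == "42" else 0
            # Hand the score to the platform, which records it in its own
            # tables for later aggregation; grading never calls back here.
            self.runtime.publish(self, "grade", {"value": earned, "max_value": 1})
            return {"earned": earned, "max": 1}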

For max_score, the eventual fix was to ignore the more dynamic possibilities and assume that the maximum possible score for any two users who have not interacted with a problem will always be the same, making it possible to cache those values based purely on the content definition. We could have gone down a different path entirely if backwards compatibility were less of a concern. In edx-platform, there’s a concept of a problem weight, which acts as an optional multiplier on the raw score for a given problem. Instead of doing that, we could have made weighted_score a mandatory field and never had to invoke plugin code to find the max_score. Dynamically computing max_score was never a capability that XBlocks really needed.
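For the caching half of that fix, here is an illustrative sketch; the cache and the definition key are invented, and the memoization is only valid under the assumption stated above:

    # Illustrative only: memoize max_score per content definition, which is
    # safe only if the value is identical for every user who has not yet
    # interacted with the problem.
    _max_score_by_definition = {}

    def cached_max_score(block):
        key = block.definition_key  # hypothetical content identifier
        if key not in _max_score_by_definition:
            _max_score_by_definition[key] = block.max_score()
        return _max_score_by_definition[key]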

Determining the set of problems a given user sees (i.e. which permutation of the sequence they are served) remains unconstrained. Instead, we asynchronously record every user’s permutation into a table as a kind of giant hammer that allows us to ignore many of the details that go into determining content visibility, such as A/B tests, cohorts, etc. But it’s worth asking whether we could have constrained this interface somehow to allow for more performant calculation.

Use asynchronous tasks as a bridge.

Asynchronous tasks can provide a backwards compatible path for converting a system that queries plugins at the time of a user request to one that controls its own data model. As soon as a state change occurs, you trigger the task to pull data from the plugins and push it into your system’s data model in an optimized form. The major difficulties are that you have to carefully map out when state changes actually occur, and account for failure and recovery of your async tasks. Keep in mind that recovery can also mean fixing data that has been persisted erroneously because of a bug.
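As a hedged sketch of that bridge (the task name and helper functions are hypothetical stand-ins, and real grade tasks handle far more edge cases):

    # Illustrative Celery task: fired on a score change, it recomputes and
    # persists the affected aggregate in the background.
    from celery import shared_task

    @shared_task(bind=True, max_retries=3, default_retry_delay=30)
    def recalculate_subsection_grade(self, user_id, subsection_id):
        try:
            # Query the plugins once, off the request path, then store the
            # result in the grading system's own data model.
            earned, possible = compute_subsection_score(user_id, subsection_id)  # hypothetical helper
            persist_subsection_grade(user_id, subsection_id, earned, possible)   # hypothetical helper
        except Exception as exc:
            # Retry so a transient failure doesn't leave a stale or missing
            # persisted grade; bugs may still require explicit invalidation.
            raise self.retry(exc=exc)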

Tradeoffs and Historical Perspective

It’s easy to criticize the choices that were made with the benefit of hindsight. I am certainly guilty of this. But it’s important to keep in mind that there were advantages to the early implementation. Defining a powerful interface for grading is a lot easier than defining an equally powerful and performant data model. State synchronization is more operationally complex than simply recalculating it from scratch every time. Taking a few weeks to do things the Right Way (even assuming we knew enough at the time to determine what that was) is a lot less tenable when you are rolling out critical features days before students are using them. Technical debt is a problem only for organizations that survive long enough to run into it.

Yet there was undeniably a lot of long term pain that came from those early decisions. Months of developer time were spent on stabilizing short-term optimizations to grading code over the years, and addressing various bugs that resulted from the added complexity there. The layers and assumptions built upon the existing system meant that even our radical overhaul didn’t simplify the content access patterns so much as make use of asynchronous processing to shift the burden of computation from “when the user looks at the Progress page” to “when the user’s score changes”. It is quite likely that if we had the freedom to develop an entirely new system from scratch, it would have a more performant data model and not be as operationally complex.

Thank you for reading this article. I hope that it was useful to you in designing your next plugin system.
