What42Pizza/Data-Oriented-Manifesto


When it comes to structuring code, I'm absolutely a perfectionist. I've spent countless hours contemplating the fundamentals of paradigms, including their problems, causes of said problems, and causes of said causes. I care about it so much that I often spend more time perfecting the features I've just added instead of moving on to the next ones. So as you could probably guess, I'm not happy with the current programming landscape. I'm unhappy with readability, with performance, with correctness, and with maintainability. Over the years, I've gained many insights through my analyses, and hopefully I can convince you too that the status quo can and should be improved.



But first, I need to set some definitions. You can skip this if you want, but please remember that a disagreement might stem from people not actually talking about the same thing. Even if you disagree with my definitions, please just go along with them so that everyone stays on the same page.
- Object: A grouping of data and its associated code
- OOP: Creating code one object at a time (only creating data as objects and only creating statements as object methods)
- Procedural programming: Creating code one struct / function at a time
- Mixed codebase: Any codebase that uses multiple paradigms, like OOP + procedural (which is almost every codebase)
- Control flow: The sequence of actions that is ultimately performed by code



Part 1

The first question I need to answer is this: do we actually need change? I believe that OOP is fundamentally flawed, and hopefully I can prove it to you. And when I say OOP, I don't mean whatever combinations of principles and design patterns are currently popular. I'm addressing the core of OOP, programming with the mindset of objects. And though the modern definition of OOP usually contains other aspects like private data and inheritance, I'm only going to consider OOP to be programming with objects. Even if you disagree with that definition, please just go along with it. If I can show that OOP as I define it is flawed, that also shows that OOP as you define it is flawed.

The fundamental problem with OOP, as far as I can tell, is how it handles control flow. Managing control flow is the most important part of programming, as any amount of debugging will clearly show. Features are defined by control flow, bugs in / across features are defined by control flow, user interaction is defined by control flow, and so on. Some say that data is more important, but data is useless if it's static. Plus, the amount of code that defines data is nothing compared to the number of statements. And if you think about it, there are only two halves to programming: defining data, and defining control flow.

And here comes the problem: OOP is built off the idea that control flow across objects doesn't matter. That assertion might seem harsh, but the entire point of OOP is that you no longer have to deal with an entire system at once, you only have to deal with one object at a time. You only have to create one object at a time, only have to think of one object at a time, and so on. After all, if you do consider behavior across objects, most of the appeal of OOP is gone. If you're in a situation where you're forced to consider how two objects interact, there's no encapsulation there. Either you're considering one object at a time and ignoring cross-object control flow, or you're not using OOP. You may want to respond that you yourself don't use OOP according to my definition, but I'd argue that you still do, just inconsistently. You probably always attempt to think of objects one at a time, and inevitably end up considering multiple objects together whenever needed. But is that a bad way to program?

Control flow across objects matters much more than you think. Each object does manage itself to an extent, but only in the same way that a car manages itself. Things like pumping fuel, generating electricity, and creating torque are managed by the vehicle, but the car's movements, lights, and all other features that make it a car are managed by whoever is using it. And it's exactly the same with objects. All the plumbing is managed by the object, but all the important work is managed by every external object that uses it. OOP only makes sense if an object truly does manage itself, but that's rarely ever the case. I think everyone would agree that modifications to an object are rarely ever constrained to just one object, and this is why.

And I want to make something clear: I'm not just saying that it's bad for programs to be made entirely of objects. I'm saying that objects should essentially never be used for large-scale operations because true encapsulation can effectively never be achieved for anything other than simple data types. Sure, OOP offers other benefits, like private data and methods for more stable APIs, but none of its actual benefits are exclusive to OOP. I also want to make it clear that I only consider an object to be a combination of data and methods, and I don't consider structs with only private data to be objects.

Now let's do a little recap. I'm saying that managing control flow is an important, fundamental, and required part of software development; please take a minute to check if you agree. I'm also saying that OOP fundamentally disregards control flow, and again, please take a minute to check if you agree. And if you do agree with both statements, you have to agree that OOP is fundamentally flawed. But this is still just theory. Does it have any effect in the real world?

Look back at the last time you programmed with OOP, I can pretty much guarantee it went like this: you set everything up with an ideal system of whatever design patterns you want, and with some luck, it might even work without any structural changes. But then comes the killer, modifications and new features that you didn't account for. Maybe you try to restructure everything to accommodate, or maybe you try to work the additions into the existing framework. Either way, there are always two stages here: the idealism stage where you build the code, trying to follow all of OOP's rules, and the reality stage where you use a debugger to fix the control flow and make it actually work.

So, what if you agree that OOP does problematically disregard control flow, but you doubt that there are other substantial issues? Well then, let's talk about code reuse. You're taught that you should always reuse code, and you do, because that sounds like obviously good advice. In practice, though, reusing code will sometimes cause much more trouble than it saves. Just think about this: what if an existing function mostly does what you want, but not exactly? Creating a copy of the function is "bad programming practice", so you have to find a way to retrofit the existing function to fit your needs. Maybe add a boolean or enum to the arguments? Add a flag that's set in the object? Wait, I know, you should use inheritance! Restricting yourself to only ever modifying the existing logic means you'll inevitably modify indirectly related code in unexpected ways. And if that's done too much, you could end up with a function that only works on Tuesdays. I've seen some applications, like the game BeamNG, have code that is so unreasonably interconnected that saving a car's configuration changes the camera angle and exits fullscreen (yes, that was a real bug).

But don't get me wrong, reusing code is still a good thing to do, you just have to do it with care. But actually, just consider what would happen if you did just create a copy of a function for your own specific use, you can do whatever you want with it without needing to worry about affecting unrelated code. And when you're done, you can even extract whatever was unchanged so that you aren't left with any duplicate code. Code reuse is often horribly misused, but it is still a necessary tool for maintaining code.
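Here's a minimal Python sketch of that copy-then-extract workflow. Every name in it (`format_price`, `format_price_for_invoice`, `split_cents`) is made up for illustration, and it assumes non-negative amounts:

```python
# Hypothetical example: all names here are invented for illustration.

def format_price(cents: int) -> str:
    """The existing function, used all over the (imaginary) program."""
    dollars, remainder = divmod(cents, 100)
    return f"${dollars}.{remainder:02d}"


# Step 1: instead of adding a flag to format_price, start from a copy and
# change it freely. Nothing that already calls format_price is affected.
def format_price_for_invoice_v1(cents: int) -> str:
    dollars, remainder = divmod(cents, 100)
    return f"{dollars:,}.{remainder:02d} USD"


# Step 2: once the copy does what you need, extract whatever stayed the same.
def split_cents(cents: int) -> tuple[int, int]:
    """The part that both versions still share (assumes cents >= 0)."""
    return divmod(cents, 100)


def format_price_for_invoice(cents: int) -> str:
    dollars, remainder = split_cents(cents)
    return f"{dollars:,}.{remainder:02d} USD"
```

The original `format_price` can be switched over to `split_cents` as well, but the point is that it never had to change while the new behavior was being worked out.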

Now let me guess what you're thinking: "why are you attacking OOP for coding advice that isn't specific to OOP?". While it's true that code reuse can be suggested and applied for any paradigm, there's something very insidious going on here. Just think about this: would it make sense for two methods in an object to be basically identical? Sure, there are many times (especially with utilities) where the answer is yes, but it's certainly not every time. For many methods, it would seem ridiculous to have more than one version of it. Using OOP subtly pushes you toward the negative version of code reuse, where you skip creating a customized function and jump straight to editing the existing code and existing logic. Or what about this: whenever you need to call a method in an object, how often do you consider whether to clone the method for your personal use? Probably never, because that again sounds ridiculous and completely goes against OOP's principles. And that certainly is a problem, because as I described above, there's not actually any downside to cloning a function.

I could go on, and on, and on. How many times have you been in a situation where you suddenly realize that an event in an object needs to happen in the middle of an event in another object? How often does an object's data have to be completely reset and rebuilt because modifying it over time just isn't feasible? I haven't even gone over the "four pillars of OOP", because hopefully it's clear now why they don't work. Encapsulation fails because lines of control flow don't care about the boundaries of objects. Abstraction is the same as encapsulation, change my mind. I'm not going over inheritance. And finally, Polymorphism isn't restricted to OOP.

Bear with me for a second, but there's another piece to this that I need to explain. This might sound strange, but OOP (excluding inheritance) and procedural are mechanically identical. To see why, just think about the semantics of function calls. In one, you effectively have `functionName(arg1, arg2, ...)`, and in the other you have `arg1.functionName(arg2, ...)`. C++ effectively transforms the second form into the first by passing the object as a hidden `this` argument, and every single thing that you can do in OOP (except inheritance) can be recreated one-to-one with procedural code. And I don't just mean that you can accomplish the same tasks in both: all you have to do to change OOP into procedural is swap every method with a function and every object with a struct. The point I want to make here is that the mindset that a paradigm creates can often be more important than the actual mechanics. Even just a naming convention can subtly but substantially impact the way you think of a program. You may not believe me, but just think about this: if you get a list's length by calling a standard library's `length` function with a list supplied, that feels very low-level and distant, but calling a length method feels nice and comfy even though it's the same exact operation.
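As a small illustration of that equivalence, here's a Python sketch where the same update is written once as a method and once as a plain function. `Counter` and `counter_add` are hypothetical names used only for this example:

```python
from dataclasses import dataclass


@dataclass
class Counter:
    count: int = 0

    # "OOP" spelling: the data is the receiver of the method.
    def add(self, amount: int) -> None:
        self.count += amount


# "Procedural" spelling: the same statement, with the data as a plain argument.
def counter_add(counter: Counter, amount: int) -> None:
    counter.count += amount


a = Counter()
b = Counter()
a.add(3)           # arg1.functionName(arg2)
counter_add(b, 3)  # functionName(arg1, arg2)
Counter.add(a, 1)  # Python also lets you call the method as a plain function
```

Both spellings run the same statement on the same kind of data. The only difference is where the first argument is written.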

If you use OOP, even if you do your best to ignore its mechanics, you will be affected by the mindset that OOP creates. Even if you know that you can't actually consider just one object at a time, you're still being pushed towards that mindset. You're subtly trying to think that a piece of code can take full responsibility for itself, that its logic can be encapsulated and not related to the rest of the program's logic, that changes to it won't spill out into other parts of code, that you can understand an object and its quirks without considering big-picture logic, and so on. But none of this can effectively be true, because features and their control flows destroy encapsulation.

But even if you completely agree that OOP is bad, you should now be asking this important and very reasonable question: if it really is that bad then why is it so widely used? Even though more and more people are starting to realize that OOP isn't as infallible as once thought, it still seems like programmers are attracted to OOP like it has its own gravitational pull. Once someone learns what it is, it's almost like other options simply don't exist (or aren't realistic). It's often said that OOP focuses on how humans naturally think, and yeah that sort of explains it, but not entirely. So, here's my explanation for why OOP is used for basically everything: it's comfy. Even if you know that OOP is bad, it's just comfortable to know that your program is just a bunch of objects that talk to each other. OOP promises that you can work with one object at a time, and we believe it. No matter how often you need to coordinate control flow between multiple objects, you always subconsciously think that you'll only have to deal with one at a time. And it doesn't help that people always pit OOP against functional in particular, which for years made me think that the only real options were OOP or Haskell. And besides, there are so many design patterns that surely they'll be able to solve whatever problems pop up, right? Or maybe you can switch to whatever this year's "actual way to use OOP" is, I'm sure they got it right this time!

When it comes to structuring your code, you likely have thought that the structural problems you've run into are only because of the specific ways you've been using OOP. Or, you may even think that these structural problems are just fundamental to the process of programming itself. But no, OOP is almost entirely at fault. The decision to ignore per-feature control flow in favor of per-object control flow is such a fundamental flaw that every part of programming is affected by it. There's this weird situation where everyone seems to know that pure OOP doesn't work by itself, but they also think that it can be fixed as long as you use enough design patterns and other good practices. But I've never seen an OOP fix that addresses this mismatch in thinking about objects and thinking about control flow, which leads to these "fixes" only pushing the problem down the road. Once again, you cannot change the fact that features are added one control flow at a time. If you want a fix that actually allows you to freely add features, here it is: think per-control-flow instead of per-object.



Part 2

So, if you shouldn't use OOP, then what should you do instead? If you want to know how to create programs that can handle the effects of scale, you first need a good grasp on both the essence of programming and the mental effects of paradigms. First off, programming is where you have data, and you update the data. That's all it is. And it holds whether you look at what the CPU is doing or what you're doing. The CPU is just updating data according to control flow, and you're just defining the data and defining the control flow. Yes, you may have very fancy ways of defining and updating data, but that's always ultimately your goal. It's all just data, it's all just updates. We've created countless ways of hiding this fact from ourselves, and I think that has been a huge mistake. You cannot run from the fact that updating data is a dangerous task, and these fancy tricks to hide data and control flow often just kick the can down the road. If you focus on the fact that all you have to do is update the data based on the data, you can write and update code much more efficiently.

But there's still the other side to this, the mental effects of paradigms. Your memory ranges from short-term to long-term, and the very tip of this is called working memory. Working memory is extremely short-term, where ideas can stay for possibly just a few seconds at a time. It's also where logic is processed, and that's a bit of a problem. The logic of a program, or even just a subsection of a program, is often too big to reasonably fit in your working memory, meaning you'll have to take some shortcuts in order to process any logic. This is why programming paradigms are used, and they can be thought of as ways to represent relevant parts of a program in a way that easily fits in your working memory.

You may have noticed that I've talked a lot about control flow, but not much about the topic of this work, data. Control flow is important because it defines features, but data is even more important because it defines control flow. Structuring control flow well should be the goal of any paradigm, but in order to do so, you have to structure your data well.

But before I continue, there are some miscellaneous things I need to address. Most notably, the potential dangers of abstractions. Some abstractions work amazingly well, like in the case of smart pointers. But if you try to take a third of your program and abstract it behind a manager, that will be a disaster. Smart pointers work because they are self-contained, a third of your program is anything but. If you try to abstract away something that still has connections to other systems, you will just be lying to yourself when you think that the system is now simplified.

You may think that adding abstractions is done with active, intentional thought, but no, it happens passively. When you see several chairs close to each other, how long does it take for you to consider it to be a group of chairs? It doesn't even feel right to call them "several chairs close to each other", because you think that I should already be calling them a group. Abstraction is so integral to thought itself that you don't even realize just how much it happens. Yes, abstraction is an amazing tool in programming, but you have to remember that you shouldn't abstract away subsystems that aren't fully encapsulated. Also, some abstractions take the form of abstracting away the actual goal of programming, which is very high-risk, high-reward. Systems can only safely be abstracted when they're fully encapsulated, but you subconsciously do so whether or not it's okay to. I don't think it's possible to not mentally abstract away systems, but it's important to recognize that it happens so that you avoid bad abstractions that you accidentally set up.

This is one reason why functional programming works so well. It removes the ability for systems to have interconnected internals, which allows every system to be safely abstracted. But despite its benefits, I don't see much of a future for functional programming. Most programmers only care about creating the program, not its quality or even its maintenance. FP is very close to what I'd consider to be the optimal way to program, but its obsession with correctness and perfection makes it impractical for most use cases. And it's not shallow to value the upfront cost, because that's the cost that is multiplied by prototyping.

The last item I need to tackle is this: when it comes to deciding on how to program, how do we know if we're on the right track? We need a good definition of what good code is (or clean code, I use the terms interchangeably), and my definition of "clean code" is "flexible code". I know that there's an endless number of thoughts on the term "clean code", but please just remember that when I say "clean code", I am only referring to "flexible code". More flexible means more clean, more clean means more flexible, basically "clean" just means "flexible". So, when structuring code, we must always make sure that the resulting code is flexible, which will make it good / clean. And even if I'm wrong and flexibility shouldn't be the measure of whether or not code is good, that doesn't matter because flexible code can easily be turned into whatever "clean code" actually is. I think I got this definition from Dave Farley, but I can't remember for sure. Also, if you want a reason why inheritance is bad, there you go.

Going back to the start of this section, programming is just where you have data and you update the data. It's unbelievably easy to forget this, though, and I've found that it's always best for this fact to be part of your programming mindset. And that's what Data-Oriented Programming (DOP) is: the mindset of "it's all just data, it's all just updates". Note that it's not a paradigm, but a category of paradigms. Also, don't get this confused with Data-Oriented Design (DOD), which is a strategy for achieving better performance.

Any time you're using the mindset "it's all just data, it's all just updates", you're practicing DOP. But does this actually lead to flexible code? In most cases, a program's data layout is already extremely flexible, because it's already extremely easy to look at and understand. If it's not then you're doing something very wrong. When done properly, difficulty in programming only comes from the statements that update the data. Simplicity is always easier to work with than complexity, with the only exception being abstracted complexity. DOP lets you focus on the fundamentals of what needs to be done, which then allows your code to be greatly simplified, which is a huge step towards flexibility.

ECS (Entity Component System) is an example of DOP, with many of its own strengths. It operates on a collection of Entities, which are just containers for Components. An entity can have `gui_element` components, `position` components, or whatever other components you create. Then, you define Systems that operate on entities that contain specific components. For example, you could have a system that takes in entities that have a position component and a velocity component and adds the velocity to the position. ECS is mostly used for game development, but there's nothing stopping you from using it for other applications. However, this isn't always an option, since it effectively requires its own runtime.
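To make the position/velocity example concrete, here's a toy sketch in Python. Real ECS frameworks handle storage and scheduling very differently, so the layout here (one dict per component type, keyed by an integer entity id) and all of the names are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class Position:
    x: float
    y: float


@dataclass
class Velocity:
    dx: float
    dy: float


# Component storage (an assumed layout): one dict per component type,
# keyed by entity id. An "entity" is nothing more than that id.
positions: dict[int, Position] = {}
velocities: dict[int, Velocity] = {}


def movement_system() -> None:
    """Adds the velocity to the position of every entity that has both."""
    for entity in positions.keys() & velocities.keys():
        pos, vel = positions[entity], velocities[entity]
        pos.x += vel.dx
        pos.y += vel.dy


# Entity 0 has both components, so it moves.
positions[0] = Position(0.0, 0.0)
velocities[0] = Velocity(1.0, 2.0)
# Entity 1 has no velocity, so the system skips it.
positions[1] = Position(5.0, 5.0)

movement_system()
```

The system never asks what kind of thing an entity is. It only checks which components the entity has.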

Procedural is much harder to categorize, though, and I'd say that it (usually) doesn't fall under DOP. When it comes to indirectly updating data, there are two options: managers or utilities. OOP goes with managers, seeing as each object manages its own data. When you need to update an object's data, you "go through the official channels" and let the object "manage its own data". Utilities, on the other hand, are for letting programmers update the data manually. Managers are inherently flawed because they don't actually manage the data. Whoever's using the manager is the actual manager of the data, whether you like it or not. DOP focuses on using utilities, but procedural often focuses on managers, and managers don't fit the mindset of "it's all just data, it's all just updates".
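Here's a rough Python sketch of the difference between the two. `InventoryManager`, `Inventory`, and `add_item` are hypothetical names, and this is only one way to read the manager / utility split:

```python
from dataclasses import dataclass, field


# Manager style: the data is hidden and every update goes "through the
# official channels".
class InventoryManager:
    def __init__(self) -> None:
        self._items: dict[str, int] = {}

    def add_item(self, name: str, amount: int) -> None:
        self._items[name] = self._items.get(name, 0) + amount

    def get_count(self, name: str) -> int:
        return self._items.get(name, 0)


# Utility style: the data is plain and visible, and helper functions are
# optional conveniences rather than gatekeepers.
@dataclass
class Inventory:
    items: dict[str, int] = field(default_factory=dict)


def add_item(inventory: Inventory, name: str, amount: int) -> None:
    inventory.items[name] = inventory.items.get(name, 0) + amount


managed = InventoryManager()
managed.add_item("arrow", 10)

plain = Inventory()
add_item(plain, "arrow", 10)  # use the utility when it fits...
plain.items["arrow"] -= 3     # ...or update the data directly when it doesn't
```

In both versions the caller decides when and why the inventory changes. The utility version just doesn't pretend otherwise.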

However, it's still possible to write procedural code with a DOP mindset. One way to do so is what I call "blob programming", which is composed of three elements: update blips, the update tree, and the data blob. Update blips are (relatively) small update functions, being coherent items that may need to be placed before / after other update blips. For example, you could have "input processing" blips before and after main logic blips. The update tree is a tree of functions which primarily call other functions (being either update blips or other 'update tree' functions). The purpose of this is to help coordinate the control flow of update blips. And lastly, the data blob is just a centralized collection of all persistent data. I've included an example program that uses this paradigm, and just generally shows how I think you should program.
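For a quick feel of the three elements, here's a tiny Python sketch. It is not the example program mentioned above, just a toy with made-up names (`Blob`, `process_input`, `update_player`, and so on):

```python
from dataclasses import dataclass, field


# The data blob: one centralized collection of all persistent data.
@dataclass
class Blob:
    pending_input: list[str] = field(default_factory=list)
    player_x: int = 0
    log: list[str] = field(default_factory=list)


# Update blips: small, coherent update functions.
def process_input(blob: Blob) -> None:
    for key in blob.pending_input:
        if key == "left":
            blob.player_x -= 1
        elif key == "right":
            blob.player_x += 1


def record_position(blob: Blob) -> None:
    blob.log.append(f"x={blob.player_x}")


def clear_input(blob: Blob) -> None:
    blob.pending_input.clear()


# The update tree: functions that mostly just call other functions, which
# makes the order of the blips visible in one place.
def update_player(blob: Blob) -> None:
    process_input(blob)    # input blip placed before the main logic
    record_position(blob)  # main logic blip


def update(blob: Blob) -> None:
    update_player(blob)
    clear_input(blob)      # input blip placed after the main logic


blob = Blob(pending_input=["right", "right", "left"])
update(blob)
```

Reading `update` and `update_player` from top to bottom gives the control flow for a whole update, which is the job of the update tree.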

For many reasons, it's very unfortunate that most of my experience is with OOP. One of these reasons is that I don't have enough experience with non-OOP code to say for sure that my suggestions for how to program are good. Although, I do know for sure that programming is where you have data and you update the data, and even the fanciest of tricks to hide that fact cannot hide its effects. Hopefully, in the future I'll be able to make extensions to this work that will better guide others.



But anyway, I just wanted to give my thoughts on the problems with how we program, hopefully all this theory will be useful. I know I still need a lot more experience, but I want to contribute knowledge as soon as possible. I don't think OOP will ever truly be killed off since there are always new programmers to try it on a large scale, but people who are serious about creating good software will hopefully focus more on "it's all just data, it's all just updates".

About

All my thoughts from over the years
