add Placement New #17057

Open. WalterBright wants to merge 1 commit into master.

WalterBright
Member

This implements Placement New, described in https://github.com/WalterBright/documents/blob/master/placementnew.md

@WalterBright WalterBright added Enhancement WIP Work In Progress - not ready for review or pulling labels Nov 10, 2024
@dlang-bot
Contributor

Thanks for your pull request, @WalterBright!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#17057"

@WalterBright WalterBright force-pushed the placementNew branch 3 times, most recently from 29a5064 to 4db86a4 Compare November 10, 2024 07:46
@thewilsonator thewilsonator added Needs Changelog A changelog entry needs to be added to /changelog Needs Spec PR A PR updating the language specification needs to be submitted to dlang.org Needs Tests New Language Feature labels Nov 10, 2024
@WalterBright WalterBright force-pushed the placementNew branch 6 times, most recently from 311f2cb to fd23a02 Compare November 12, 2024 08:55
@WalterBright
Member Author

I got it to work for some basic cases. Next comes extending it to more complex ones.

@WalterBright WalterBright force-pushed the placementNew branch 3 times, most recently from 6456c9a to c1d3813 Compare November 13, 2024 09:23
@WalterBright
Member Author

Placement new for structs seems to be working now!

@WalterBright WalterBright force-pushed the placementNew branch 4 times, most recently from 5f00e37 to 49f4a4c Compare November 14, 2024 08:17
@Connor-GH

Connor-GH commented Nov 15, 2024

> If one desires to use classes without the GC, such as in BetterC, it's just awkward to use emplace

If I'm reading this right, this will allow classes in betterC, using a kind of "allocator argument" like in $OTHER_LANGUAGES? Once upon a time I made my own internal fork of dmd that forced betterC on all D code, but of course that didn't last long because it couldn't compile Phobos.

(I should clarify, my use case is making a kernel with betterC, and so far I have had to use a hacky mixin and alias this in order to use inheritance. Also, will this unlock the door for interfaces?)

@WalterBright WalterBright force-pushed the placementNew branch 2 times, most recently from 16cdbd6 to 448abed Compare November 15, 2024 04:51
@WalterBright
Member Author

Placement works for class objects now!

@TurkeyMan
Contributor

Did you tweak it to receive pointers as arguments? I've sent several emails recently and they've all gone silent; I'm wondering if they're getting lost. I also tried commenting on the code in GitHub.

@WalterBright
Member Author

I haven't received emails about this. I will check my spam filter. I don't know why you cannot comment on github? I thought it was you who had gone silent! Glad that's not the case.

I stewed about the placement expression as pointer for a while, and figured that making it essentially a ref parameter is more the D way of eschewing pointers in favor of ref. A ref also includes bounds on the size, making that check something the language can do.

Of course, placement new will always be unsafe, but at least we can prevent memory overruns with ref parameters.

@TurkeyMan
Contributor

TurkeyMan commented Nov 16, 2024

Weird, I've sent several messages in various ways.

Nar, it's really not appropriate as a ref.
I think the best way to think about it from a spec perspective is in terms of lifetime; placement new returns a new T, prior to the placement new call, it is NOT a T, it is just raw memory.
I would even make the case that placement new should receive void* and not T*; that would be uniform with the class case, but it would also make the semantics very clear, prior to the call, you are handling a pointer to memory, after the new call, you receive a T as result; the lifetime of T begins with the call to new, before the call, there is no T, and nothing in the expression should give the impression that T pre-exists the call to new.

I feel it's absolutely inappropriate to hand it a ref to a T to be initialized... that certainly implies that the T pre-exists. And also the asymmetry with the class case really bugs me.

Your error message "new must receive an lvalue" gives the wrong impression; an lvalue in my mind suggests that the value exists, but before the new, there is no value, just vacant memory.

I think I've convinced myself while writing this that even T* which I initially felt attracted to is wrong, and placement new should only receive void* as input... that correctly expresses the operation.

@WalterBright
Member Author

Glad your review is getting through this time. Some good thoughts!

  1. void* certainly works from a pragmatic, to-the-metal approach. It does come with a price, though. The compiler is helpless in diagnosing simple mistakes, like the amount of storage pointed to being insufficient. When developing a test case, I promptly made that mistake in not having sufficient size for a class instance. Fortunately, the compiler hit back with "Walter, you ignorant scum. The size is too small!" Properly chastened, I was grateful to get the message rather than baffling memory overflow bugs. It reminds me of when I resisted Andrei's suggestion that array bounds checking be always turned on, unless in @system code and a compiler flag was thrown. History has proven Andrei to be right. P.S. I hear C++ std::vector is going to get array bounds checking on by default! Only 25 years late!!
  2. My main use of emplace in the compiler is there's a union of various Expressions in the ctfe interpreter. The placement new fits in perfectly with that, as the union object will always have sufficient size. The new(u) AddExp just goes perfectly with that.
  3. The lifetime of any pre-existing lvalue is going to be unceremoniously terminated when used by placement new, without passing Go or collecting $200 or running the destructor. This is why it is fundamentally unsafe. But at least refs can be statically checked for correct size!
  4. Maybe we can improve the lifetime support in the future. But we won't be able to do much of anything with void* as an argument.

@WalterBright
Member Author

P.S. lifetime analysis of a pointer to an object is more difficult than lifetime analysis of an object, because of the possibility of multiple pointers to the same object.

@TurkeyMan
Contributor

There is no lifetime before the call to new. The lifetime begins with new, and before new, there is only a pointer to bytes in memory. The placement new expression might in the future have integrated into it some initiation of lifetime for any future lifetime analysis; it's one of the reasons I think it's important to introduce it and concentrate initialisation in an intrinsic expression.

Your particular case with a union is obviously legit, but I assure you it's an edge case; every other case where you need placement new is when calling malloc() (or some similar allocator) to get memory, and then emplacing runtime values into that allocation.

If you're not comfortable with void*, then consider void[] which you can use for bound checking.

Your comment about lifetime analysis on any object leading IN to the new() expression is not something that makes sense to me... there is no lifetime of anything to speak of prior to the new... that's exactly what new means; it takes raw memory and turns it into live values.

The union case is actually a sort of aberration if anything. Personally I don't much like unions fundamentally from a conceptual point of view, because they express an aggregate of values even though all, or all but one, of those values are completely invalid.
You might recall me arguing 5-10 years ago when we first ever started talking about @safe that I felt unions are absolutely fundamentally not @safe, and it's completely improper for @safe code to make contact with a union... I was shot down by several people (who are all wrong), and I still believe that's true today, and this kinda demonstrates a facet of my discomfort; the idea you are supplying an lvalue (ref) of what appears to be a value to a new(), except it's not, in the case of a union, it's actually just raw memory in disguise. Passing the value by ref paints the exact wrong picture about what's going on there and further lends to the deception. new() simply does not accept values as arguments; it accepts memory as argument, and returns a brand "new" start-of-life value.

So, assuming new() receives a void of some kind (after sleeping on it some, I'm prepared to die on that hill), then your union initialisation looks like one of these cases:

If it receives void*, then obviously your union code will look like:

new(&myUnion.member) T();

Where the pointer will naturally cast to void*. This expression is not too annoying from a syntax perspective, but like you say, the type erasure loses the size information, so you can't do a bounds check.

The solution that gives you bounds checking is to define new(void[]) T, and then your union initialisation will look like this:

new((&myUnion.member)[0..1]) T();

So, I'm taking the address of the union member, and from that pointer taking a 1-length slice. That slice will naturally cast to void[] with the proper length... now the placement new will have the buffer size, and can perform the bounds checking you seek.

I grant that looks a tad awkward, but I think it's pretty much only the union case which is subject to this weird contortion, and my feeling is that it's appropriate to place that burden on the union case since they're already an unsavoury and weird case themselves. I would load the union case with the unnatural thunk, rather than requiring all the natural cases to be burdened with an unnatural thunk.

The natural cases look like this:

T* newObject = new(allocateMem(T.sizeof)) T;

Given some allocator void[] allocateMem(size_t bytes), which is how allocators in D are always naturally defined.

Even CRT malloc always seems to be wrapped in some function like void[] d_malloc(size_t bytes) => malloc(bytes)[0..bytes] everywhere I look; that's what's overwhelmingly established in D, so I reckon it's probably most appropriate to make the argument of new match the natural result of an allocator, and also by accepting void[], you get bounds checking.

I think that's the way forward.
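
As a rough illustration of the void[] flow described above, the sketch below uses the existing library emplace rather than the placement new syntax proposed in this PR. The names allocateMem and Point are hypothetical and chosen only for the example. Note that the size check emplace performs on a void[] is an assert, so it happens at run time and only when assertions are enabled.

```d
import core.stdc.stdlib : free, malloc;
import std.conv : emplace;

struct Point
{
    int x, y;
    this(int x, int y) { this.x = x; this.y = y; }
}

// Hypothetical allocator wrapper: returns a sized buffer instead of a bare pointer.
void[] allocateMem(size_t bytes)
{
    void* p = malloc(bytes);
    return p is null ? null : p[0 .. bytes];
}

void main()
{
    void[] mem = allocateMem(Point.sizeof);
    assert(mem.length == Point.sizeof);

    // The library emplace accepts the void[] directly; no cast at the call site.
    Point* p = emplace!Point(mem, 3, 4);
    assert(p.x == 3 && p.y == 4);
    assert(cast(void*) p is mem.ptr);

    free(mem.ptr);
}
```

The typed pointer only appears as the result of the construction step, which is the property argued for in the comment above.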

@TurkeyMan
Contributor

Actually, thinking a little more, I realised one more really common case:

  void opAssign(ref typeof(this) rh)
  {
    this.destroy();
    new(&this) typeof(this)(rh);
  }

It's common to fabricate a copy assignment from a copy constructor using that pattern. And so this is another case where new(void[]) would feel awkward, but here's a wild idea...

It's always sat a little awkwardly with me that you have some T, you may destroy it and then pass that same T reference forward into a reconstruction... there's a period between destruction and reconstruction where you're handing a reference to raw and invalid memory, but it's still typed like a valid T.

From a purist perspective, it would be nice if the language were specified that any time you are holding a T*, then you are holding a valid and live value. That may even be a convenient piece of language spec in the event you were planning to implement lifetime analysis for instance.

Okay, so here's my idea; we have void destroy(T*), which destroys some T... let's just change that to void[] destroy(T*), which destroys T and returns the buffer that used to contain the value back to the application.

This actually feels quite natural to me; if you destroy a value, that terminates the value's lifetime, there is no more value and so there is no more T*, but there IS still the memory where T used to be... it makes sense for the result of destroy() to be the buffer where the value used to exist, and it's useful for the application to receive it, because it can be used for recycling.
Given that new definition for destroy, it makes the case I describe above convenient:

T* object = ...;

void[] mem = destroy(object); // returns the former value's buffer
T* newObject = new(mem) T; // recycle the buffer directly

Or simply:

void opAssign(ref typeof(this) rh)
{
  new(this.destroy) typeof(this)(rh);
}

That reads nicely, the types all flow... but what's really nice is that because destroy returns the sized void buffer, you are no longer encouraged to handle the typed pointer post-destruction!
In the future with lifetime analysis; we can make it a compile error to handle the post-destroyed value, and insist that the user handle the untyped raw buffer returned from destroy instead.

Terrible idea?
I really like it.

After a call to destroy where you receive the value's former buffer as result; it is UB or even a compile error to touch the dead-but-still-typed pointer from that event onwards.

In the future; ideally, destroy would be intrinsic too, and not just call the destructor but also implement the termination of the value's lifetime. Maybe you should recycle void[] delete(T*) for an intrinsic destroy expression to complement placement new?
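
The idea of a destroy that hands back the buffer can be approximated today with a small helper. This is only a sketch under stated assumptions: destroyToBuffer and Tracked are hypothetical names, not existing druntime functions or types, and the existing destroy and emplace are used in place of the proposed intrinsics.

```d
import core.stdc.stdlib : free, malloc;
import std.conv : emplace;

struct Tracked
{
    static int dtorCalls;   // counts destructor runs, for demonstration only
    int value;
    this(int v) { value = v; }
    ~this() { ++dtorCalls; }
}

// Hypothetical helper (not part of druntime): end the object's lifetime and
// hand back the raw storage it occupied.
void[] destroyToBuffer(T)(T* obj) if (is(T == struct))
{
    destroy(*obj);                              // runs the destructor
    return (cast(void*) obj)[0 .. T.sizeof];
}

void main()
{
    void* raw = malloc(Tracked.sizeof);
    assert(raw !is null);
    Tracked* first = emplace!Tracked(raw[0 .. Tracked.sizeof], 1);

    immutable before = Tracked.dtorCalls;
    void[] mem = destroyToBuffer(first);
    assert(Tracked.dtorCalls == before + 1);
    assert(mem.ptr is raw && mem.length == Tracked.sizeof);

    // Recycle the buffer; from here on only `second` should be used.
    Tracked* second = emplace!Tracked(mem, 2);
    assert(second.value == 2);

    destroy(*second);
    free(raw);
}
```

Nothing here stops the caller from touching first after the destroy; that enforcement is the part that would need language support.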

@TurkeyMan
Contributor

> P.S. lifetime analysis of a pointer to an object is more difficult than lifetime analysis of an object, because of the possibility of multiple pointers to the same object.

Yeah, but tough; placement new is FOR dealing with freely allocated objects. Your union case is an edge case, and it's also a language aberration.

For what it's worth; I find that I rarely use unions in my own code. I usually place an embedded buffer and implement emplace operations and casts in the methods that access what might have otherwise been different members in a union.
I only ever find myself using a union when I know I'm only dealing with POD, and in that case, why would I be concerned with placement new?

@WalterBright
Member Author

> I only ever find myself using a union when I know I'm only dealing with POD, and in that case, why would I be concerned with placement new?

This code in dmd's constfold.d:

UnionExp Com(Type type, Expression e1)
{
    UnionExp ue = void;
    Loc loc = e1.loc;
    emplaceExp!(IntegerExp)(&ue, loc, ~e1.toInteger(), type);  // (*)
    return ue;
}

(*) would be:

    new(ue) IntegerExp(loc, ~e1.toInteger(), type);

It looks nice, and no ugly casting or use of &. Compile time checking of sizes, instead of runtime. The "joints" between untyped data and typed data are where user error creeps in. Minimizing this is good. This is why I want to try this approach.

I've made some progress in the dmd source with eliminating the use of the & operator. The results are appealing, more readable, and safer.

I know I haven't addressed all your points, more later.
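
For readers unfamiliar with the union use case, here is a minimal sketch of emplacing different member types into the same union storage with the existing library emplace. Operand, IntVal and RealVal are hypothetical stand-ins, not dmd's actual UnionExp or expression classes.

```d
import std.conv : emplace;

struct IntVal  { long value;   this(long v)   { value = v; } }
struct RealVal { double value; this(double v) { value = v; } }

// Hypothetical stand-in for a union of expression nodes.
union Operand
{
    IntVal i;
    RealVal r;
}

// A union is at least as large as each member, so this holds by construction.
static assert(Operand.sizeof >= IntVal.sizeof && Operand.sizeof >= RealVal.sizeof);

void main()
{
    Operand op = void;                  // raw storage, no member is live yet

    IntVal* pi = emplace!IntVal(&op.i, 42L);
    assert(pi.value == 42);

    // Overwrite the same storage with a different member type.
    RealVal* pr = emplace!RealVal(&op.r, 2.5);
    assert(pr.value == 2.5);
    assert(cast(void*) pr is cast(void*) &op);
}
```

Because the union's size is known statically, the "is the storage big enough" question is settled at compile time, which is the property the comment above wants to keep.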

@TurkeyMan
Contributor

I forgot to respond to one of your points:

> 3. The lifetime of any pre-existing lvalue is going to be unceremoniously terminated when used by placement new, without passing Go or collecting $200 or running the destructor. This is why it is fundamentally unsafe. But at least refs can be statically checked for correct size!

I just want to point out here, that the idea of the lifetime of something being unceremoniously terminated is definitely not something that should be happening in any case ever. Cases of buffer recycling should have been destroyed prior.
This point you make supports my case that new should not accept ref T or T* as argument... by accepting void[], none of ref T, T*, or Class arguments can naturally convert to a void[], so an accidental error is almost impossible.

The proposal I made above where destroy returns you the buffer that you need for a memory recycling operation feels like the natural way to deal with this situation; it naturally avoids this "unceremoniously terminating" live values, since they are not valid function arguments to begin with.

In fact, I would say that the pattern where destroy returns a void[] and new accepts a void[], and consequently there are no blind casts or thunks in sight actually IS safe... with my proposal above, I think placement new (and placement delete) can both be safe operations.

@TurkeyMan
Contributor

TurkeyMan commented Nov 16, 2024

> > I only ever find myself using a union when I know I'm only dealing with POD, and in that case, why would I be concerned with placement new?
>
> This code in dmd's constfold.d:
>
> UnionExp Com(Type type, Expression e1)
> {
>     UnionExp ue = void;
>     Loc loc = e1.loc;
>     emplaceExp!(IntegerExp)(&ue, loc, ~e1.toInteger(), type);  // (*)
>     return ue;
> }
>
> (*) would be:
>
>     new(ue) IntegerExp(loc, ~e1.toInteger(), type);
>
> It looks nice, and no ugly casting or use of &. Compile time checking of sizes, instead of runtime. The "joints" between untyped data and typed data are where user error creeps in. Minimizing this is good. This is why I want to try this approach.
>
> I've made some progress in the dmd source with eliminating the use of the & operator. The results are appealing, more readable, and safer.
>
> I know I haven't addressed all your points, more later.

I'm sorry, but using a union as primary (singular!) case-study is completely out of order.

You can write this:

new((&ue)[0..1]) IntegerExp(loc, ~e1.toInteger(), type);

Yeah, it's not perfect, but the union is actually the problem... don't wreck everything for this one degenerate case.

My wall of text above should be seriously considered... I think it's a really strong design, and solves a whole lot of problems that C++ (and D) has with lifetime flow, it makes the start/end of lifetimes feel natural while also making the typed variables naturally match the lifetimes of the values they represent, and it also solves convenient recycling of memory.
If you don't immediately understand the merit of my suggestion, then let's get on a phone call... I need an opportunity to represent the idea clearly, if it's not clear above.

@WalterBright
Member Author

Given:

T* newObject = new(allocateMem(T.sizeof)) T;

how does this look:

T* newObject = new(mallocate!T()) T;

? Where:

ref T mallocate(T)()
{
    return *cast(T*)malloc(T.sizeof);
}

Not bad! Only one dirty cast for all the new's, no &, and the user needn't use a sizeof.

@TurkeyMan
Contributor

  1. I think that's the most insane permutation yet
  2. I feel like you've dismissed everything I wrote :/

@TurkeyMan
Contributor

This is what a function called mallocate(T)() is expected to do:

void[] mallocate(size_t);

ref T mallocate(T)()
{
    return *new(mallocate(T.sizeof)) T;
}

No casts. It's completely @safe.

No function that returns a T should return an invalid/uninitialised T ever.

@TurkeyMan
Contributor

My whole point is, there's an opportunity here to eliminate UB, invalid memory references, and unsafe casts associated with lifetime management from the language specification entirely... and that's a HUGE deal. If you haven't understood the point of what I'm suggesting, then we need to get on a call and talk.

@WalterBright
Member Author

We both want the same thing. But realistically, there's no way to not have UB when allocating and initializing memory. That's why at some point it's going to have to not be verifiably safe by the compiler. As for void initialized objects,

S s = void;

we already have that in the language, and it is supported. But only in system code. Casting an array of void into a real object is always going to be system code. I appreciate your efforts to make this as safe as practical. But I also want the union case to be easy - this also applies to option types and sum types. And I want things to be correct without adding runtime checks. So here goes, I verified that it works:

import core.stdc.stdlib;

struct S {    int i = 1, j = 4, k = 9; }

ref void[T.sizeof] mallocate(T)() {
    return *(cast(void[T.sizeof]*) malloc(T.sizeof));
}

void main() {
    S* ps = new(mallocate!S()) S;
    assert(ps.i == 1);
    assert(ps.j == 4);
    assert(ps.k == 9);
}

I changed your proposal of a slice void[] to a static array void[T.sizeof], so the compiler knows its size at compile time. But does it blend? Let's check the code generated:

000c:   B8 01 00 00 00           mov       EAX,1
0011:   48 89 45 F8              mov       -8[RBP],RAX
0015:   E8 00 00 00 00           call      mallocate
001a:   48 89 C3                 mov       RBX,RAX              ; RBX gets pointer to storage
001d:   48 8B 45 F8              mov       RAX,-8[RBP]
0021:   89 03                    mov       [RBX],EAX            ; .i = 1
0023:   C7 43 04 04 00 00 00     mov       dword ptr 4[RBX],4   ; .j = 4;
002a:   C7 43 08 09 00 00 00     mov       dword ptr 8[RBX],9   ; .k = 9

Efficient, compact, and no monkey business.

@TurkeyMan
Contributor

return malloc(T.sizeof)[0 .. T.sizeof] should be sufficient; you don't need that weird thunk.

This said though; you keep using malloc() as your experiment, but nobody in D calls malloc without wrapping it in a function that just returns a void[] instead of void*; all the memory allocators, (including Andrei's std API), all established long ago that we return void[] from allocation functions... this is why I'm trying to make it native compatible with that type.

And sorry, let me be more clear; yes I understand there will be unsafe escape hatches (union for instance, that's unsafe by definition), and some low-level coercions are still available and useful, but what I'm trying to achieve is where the common case (allocate and initialise) has no casts, no weird shit, so that flow can actually be completely safe.

I thought about static array briefly too. It seems workable, but it's got some ergonomic problems. You can't pass fixed arrays easily, because you need a different overload of a function for every array size, which means this will always be wrapped in something that receives a dynamic array, does a bounds check, slices the appropriately bounded static array, and then passes it to a call like you show above.

The nice thing about a fixed array is that you can elide runtime bounds checking, which is nice, but I think the ergonomic problem carries more weight. Memory allocation is NEVER a high-frequency operation; I can't imagine any situation where I wouldn't tolerate a runtime bounds check in the void[].

Perhaps you'd consider accepting BOTH fixed array and also dynamic array; if a dynamic array was supplied, perform a runtime bounds check, static array, static bounds check?

I want to reinforce though that my proposal is designed to work very nicely if we also make destroy return the buffer to the user (for recycling or returning to the allocator). The whole point is that in idiomatic and safe code, typed variables will never be handled at any time when they are in an invalid state. I really like that the type is only introduced to the program as a result of new; that is, the moment of its birth, and then you supply it to delete or destroy and it's gone; you get the void buffer back as a result which is the only thing you need to handle to perform a recycle, or just hand that memory back to the allocator. Your final contact with the typed value is supplying it as argument to delete.

@WalterBright
Member Author

> return malloc(T.sizeof)[0 .. T.sizeof] should be sufficient; you don't need that weird thunk.

It doesn't work because it returns a void[], and its size cannot be checked at compile time. The static array works, I tried it and looked at the assembler.

> Memory allocation is NEVER a high-frequency operation

Placement new has a major use case for unions, option types, and sum types, and these can be very high frequency. See CTFE, which I sped up quite a bit by using emplace on a stack allocated union. I've also written a Javascript interpreter - the operand stack is a critical high-frequency piece of code, and the operands get overwritten like any stack. As I've shown, the static array method generates optimal code, so the user doesn't need to hack around it.

> we return void[] from allocation functions

Not a problem, it's still castable to a pointer to a static array, since the length is known in advance.

> You can't pass fixed arrays easily, because you need a different overload of a function for every array size

I haven't implemented placement new for variable sized arrays yet. The proposed solution uses a template to take care of placement new expressions for types of fixed size. I don't think it is a big deal, it's just a few instructions.

> Perhaps you'd consider accepting BOTH fixed array and also dynamic array; if a dynamic array was supplied, perform a runtime bounds check, static array, static bounds check?

I expect a different implementation will be required for new'ing variable length arrays, meaning a void[].

> The whole point is that in idiomatic and safe code, typed variables will never be handled at any time when they are in an invalid state.

Placement new is never going to be safe. Prefixing a destroy() means that void-initialized objects will also be destroyed, and that's UB for sure.
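
The trade-off between static-array and slice storage discussed in this thread can be sketched with today's library emplace. This is an illustration only: placeStatic and placeDynamic are hypothetical helper names, and neither is the interface implemented in this PR.

```d
import core.stdc.stdlib : free, malloc;
import std.conv : emplace;

struct S { int i = 1, j = 4, k = 9; }

// Static-array storage: the size is part of the type, so the template
// constraint rejects undersized buffers at compile time.
T* placeStatic(T, size_t N)(ref void[N] storage) if (N >= T.sizeof)
{
    return emplace!T(storage[]);
}

// Slice storage: the length is only known at run time, so check it there.
T* placeDynamic(T)(void[] storage)
{
    if (storage.length < T.sizeof)
        return null;
    return emplace!T(storage);
}

void main()
{
    auto fixed = cast(void[S.sizeof]*) malloc(S.sizeof);
    assert(fixed !is null);

    S* a = placeStatic!S(*fixed);
    assert(a.i == 1 && a.j == 4 && a.k == 9);

    // An undersized static array does not even compile.
    void[4]* tooSmall = null;
    static assert(!__traits(compiles, placeStatic!S(*tooSmall)));

    // An undersized slice is caught by the run-time check instead.
    void[] mem = (*fixed)[];
    assert(placeDynamic!S(mem[0 .. 4]) is null);
    assert(placeDynamic!S(mem) !is null);

    free(fixed);
}
```

The static form needs one template instantiation per buffer size but no run-time check; the slice form is a single function per type at the cost of a length comparison on each call.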

Labels
Enhancement; Needs Changelog (a changelog entry needs to be added to /changelog); Needs Spec PR (a PR updating the language specification needs to be submitted to dlang.org); Needs Tests; New Language Feature; WIP (Work In Progress - not ready for review or pulling)
6 participants