Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The library uses a lot of stack #13

Open
faern opened this issue Jun 1, 2022 · 6 comments
Open

The library uses a lot of stack #13

faern opened this issue Jun 1, 2022 · 6 comments

Comments

@faern
Copy link
Contributor

faern commented Jun 1, 2022

We have noticed that for the larger key sizes that the library supports, we need to run the functions in a thread with increased stack size. Even if we pass in heap allocated buffers for the key out parameters. My guess is that the library stores the public key on the stack somewhere internally. We have not looked at the code. But it would be great if the caller could somehow tell the library to use the heap more, so it can run in threads with default stack sizes.

This is somewhat related to #12, where we also need a way for the caller to control how the library handles memory buffers.

@prokls
Copy link
Collaborator

prokls commented Jun 3, 2022

Ok, this is possible. Public keys of Classic McEliece are huge and various operations are applied. Fundamentally I preferred to use the stack for allocation over the heap. If there is some unnecessary copy operation somewhere, I will gladly fix it. It is difficult to fix it without knowing the location though.

@dkales
Copy link
Collaborator

dkales commented Aug 24, 2022

Probably at some point it would be nice to look at temporary buffers that are used by keygen,encaps,decaps and see if there are some that are huge and can be reduced.

@faern
Copy link
Contributor Author

faern commented Aug 25, 2022

I tried to run https://crates.io/crates/cargo-call-stack on this crate. But it failed. But using something like that to see which functions require a lot of stack could be very helpful in finding out where the lowest hanging fruits are.

@dkales
Copy link
Collaborator

dkales commented Aug 25, 2022

I actually tried the same project yesterday, got a bit further by using the musl target, but there are a lot of unhandled llvm intrinsics atm. When I get the time ill try a patched version that just assumes all llvm intrinsics lower to machine code.

@meisterluk
Copy link
Collaborator

FTR: My account @prokls is inactive since I left university quite some time ago. I am going to use my private account @meisterluk now to contribute.

I did a code review for the current HEAD and inserted the highest numbers for the configuration with the largest keys. I was able to find the following (larger) stack allocations:

file function / line source code bytes
benes.rs apply_benes line 140 let mut bs = [0u64; 64]; 512
let mut cond = [0u64; 64]; 512
benes.rs apply_benes line 257 let mut r_int_v = [[0u64; 64]; 2]; 1024
let mut r_int_h = [[0u64; 64]; 2]; 1024
let mut b_int_v = [0u64; 64]; 512
let mut b_int_h = [0u64; 64]; 512
benes.rs support_gen line 348 let mut l: [[u8; 1024]; 13] 13312
controlbits.rs controlbitsfrompermutation line 293 let mut temp: [i32, 16384] 49152
let mut pi_as_i32: [i32; 4096] 16384
decrypt.rs decrypt line 18 let mut l = [0u16; 8192]; 16384
let mut images = [0u16; 8192]; 16384
encrypt.rs syndrome line 189 let mut row = [0u8; 1024]; 1024
pk_gen.rs pk_gen line 210 let mut buf = [0u64; 8192]; 65536
let mut mat = [[0u8; 1024]; 1664]; 1703936
let mut g = [0u16; 129]; 258
let mut l = [0u16; 8192]; 16384
let mut inv = [0u16; 8192]; 16384
sk_gen.rs genpoly_gen line 9 let mut mat = [[0u16; 128]; 129]; 33024

Additional attention must be spent on:

  • struct PublicKey<'a>(KeyBuffer<'a, 1357824>); and struct SecretKey<'a>(KeyBufferMut<'a, 14120>); must be heap-allocated
  • On my machine, the maximum stack size is 8388608 (ulimit -s returns the kilobyte count on my Arch Linux kernel 6.10)

Then the following conclusions can be made:

  • PublicKey and SecretKey are handled properly. KeyBuffer* instances provide a proper enum variants to represent borrowed or owned keys. And util::alloc_boxed_array::<SIZE>() provide a utility to enforce heap allocation which is usually done.
  • mat in pk_gen.rs has the largest stack allocation size with 1703936 bytes (i.e. 20% of the allowed maximum stack size on my machine). The second placed buf with 65536 bytes in the same stack amounts to 0.8% of the maximum stack size. Evidently, mat is the elephant in the room.
  • crypto_kem_keypair in operations.rs might call pk_gen several times, but several stack frames with mat are never allocated simultaneously.

I changed let mut mat = [[0u8; SYS_N / 8]; PK_NROWS] to let mut mat = Box::new([[0u8; SYS_N / 8]; PK_NROWS]) and looked at runtime result from cargo test. The results are completely inconclusive and I have to run the actual benchmarks.

@faern
Copy link
Contributor Author

faern commented Oct 16, 2024

This is apparently not as bad as it used to be. Maybe this is even a non-issue? I have not analyzed this in a long time. I think we have lowered the amount of stack we assign the special CME thread at some point. But we still run it in a separate thread, so we don't have to rely on default stack sizes. Yours seems to be 8 MiB, but we can't rely on that being the case everywhere. The code has to run fine on mobile etc for us as well.

This is our current workaround for the stack problem. But the comment seems to indicate that we no longer know if this actually is a problem 🙃
https://github.com/mullvad/mullvadvpn-app/blob/8a06592ba05a2e7cc56261532fdfda2688fef204/talpid-tunnel-config-client/src/classic_mceliece.rs#L4-L8

EDIT: We of course run it in a separate thread in order to make it not block the async runtime. This is a way to offload the computation.

meisterluk added a commit that referenced this issue Oct 27, 2024
When using the library in the variant with the largest public key
(namely mceliece8192128), the stack allocation amounts
to 1,703,936 bytes. This is 20% of the maximum admissible stack size
of my system. This high stack consumption is a problem on several
systems. Thus, I propose to allocate this huge array on the heap.

Context:
  issue #13
  #13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants