Skip to content

fix: apply final hashes deterministically with stable placeholders set #5644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mattkubej
Copy link
Contributor

@mattkubej mattkubej commented Sep 11, 2024

This PR contains:

  • bugfix
  • feature
  • refactor
  • documentation
  • other

Are tests included?

  • yes (bugfixes and features will not be merged without tests)
  • no

Breaking Changes?

  • yes (breaking changes will not be merged unless absolutely necessary)
  • no

List any relevant issue numbers:

Issue does not exist, but I can create one if that would be helpful.

Description

The generateFinalHashes iterates over the renderedChunksByPlaceholder map in a for..of loop, which iterates via insertion order. The renderedChunksByPlaceholder has key/values set asynchronously within transformChunksAndGenerateContentHashes. Due to this, generateFinalHashes does not apply final hashes in the same order between builds with the same input. So, when hash collisions occur, hashes associated with collisions may apply to chunks in different order across multiple builds. This change uses the stable placeholders to iterate over the chunks and deterministically apply the final hashes.

Why is this important?

We encountered an issue where a chunk with the same name had different file contents between builds, which then failed an SRI check on the produced asset. We expected that if the hash did not change, then the contents did not change.

What we found occurred was a manual chunk depended on two dependencies with the same name and same file contents, which then resulted in a collision. The two dependencies of the manual chunk would have their hashes flip between builds depending on who got processed first within generateFinalHashes. Processing chunks in the same order when applying final hashes will insure a deterministic result and avoid this problem.

How this happens

  1. renderChunks invokes render on all chunks to set the preliminaryFileName for each chunk (i.e. placeholder hash) (ref)
  2. transformChunksAndGenerateContentHashes invoked with chunks and returns renderedChunksByPlaceholder (ref)
    1. Iterates through all chunks and transforms them asynchronously (ref)
    2. Sets the transformedChunk on the renderedChunksByPlaceholder map (ref)
      1. This is the beginning of the problem
  3. generateFinalHashes invoked with renderedChunksByPlaceholder (ref)
    1. Iterates through renderedChunksByPlaceholder map (ref)
      1. Maps are iterated over by insertion order. Async insertion in step 2b means even if the map contains the same key/values, then it may not be iterated over in a consistent manner as it may have the key/values inserted in different orders across builds.
      2. For each chunk, it calculates the hash via contentToHash
        1. contentToHash is calculated by combining the hashes of the current chunk and its dependencies (ref, ref)
        2. getHash is invoked on contentToHash and sliced by placeholder length (ref)
          1. If a collision occurs, then the hash becomes contentToHash and is hashed again on the next loop (ref)

Copy link

vercel bot commented Sep 11, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
rollup ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 19, 2024 4:24am

Copy link

codecov bot commented Sep 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.39%. Comparing base (184bc4e) to head (ddf4b13).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5644   +/-   ##
=======================================
  Coverage   99.39%   99.39%           
=======================================
  Files         242      242           
  Lines        9348     9349    +1     
  Branches     2470     2470           
=======================================
+ Hits         9291     9292    +1     
  Misses         48       48           
  Partials        9        9           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lukastaegert
Copy link
Member

Makes sense, though I guess it is probably hard to write a meaningful test. One could write a manual test that bundles the same input twice and introduces a renderChunk hook that slows down the rendering of one or the other chunk in both runs and in the end verifies that the hashes for both runs match.
I think that would fit into misc.js where we already have a test that chunks are sorted.

I wonder, though, if we really need to manually sort the placeholders before iterating. There is a placeholders Set in transformChunksAndGenerateContentHashes that is created synchronously from the chunks respecting their order. I wonder if that Set would have a sufficiently stable insertion order so that when returned from transformChunksAndGenerateContentHashes and used in generateFinalHashes would have the same effect as the manual sorting?

@mattkubej
Copy link
Contributor Author

One could write a manual test that bundles the same input twice and introduces a renderChunk hook that slows down the rendering of one or the other chunk in both runs and in the end verifies that the hashes for both runs match.
I think that would fit into misc.js where we already have a test that chunks are sorted.

I can take a look into drafting that up. Just to insure I found what you're referring to, you're suggesting modeling that test off of this one?

I wonder, though, if we really need to manually sort the placeholders before iterating. There is a placeholders Set in transformChunksAndGenerateContentHashes that is created synchronously from the chunks respecting their order. I wonder if that Set would have a sufficiently stable insertion order so that when returned from transformChunksAndGenerateContentHashes and used in generateFinalHashes would have the same effect as the manual sorting?

Just to be clear, you're suggesting the following?

  1. Have transformChunksAndGenerateContentHashes return the placeholders set, which contains all the placeholders all the hashPlaceholders subsequently used as keys for renderedChunksByPlaceholder
  2. Pass the resulting placeholders set as an additional param to generateFinalHashes
  3. Iterate over the placeholder set and retrieve chunks from renderedChunksByPlaceholder by those placeholder keys

I think that could work too. As you said, it seems that the placeholder set is synchronously built and should be deterministic. This would also avoid the additional overhead of the sort by keys, which is likely fairly minimal, but can empathize with the desire to squeeze performance and memory overhead. I think my only concern with using placeholders is that there is an implicit assumption that the the placeholders set and renderedChunksByPlaceholder keys have the same one-to-one correspondence (i.e. same size and values). That does appear to be the case, but if that assumption no longer holds true, then there might be a bug. Sorting the keys of renderedChunksByPlaceholder avoids that potential problem, but does have more cost.

I can refactor this change with your suggested approach and vet it against my use case if you think that is preferable. I'm good either way.

@lukastaegert
Copy link
Member

you're suggesting modeling that test off of this one?

Yes, roughly, except that of course you would build twice. And note that this is an old test, you can use async-await these days to make the test much nicer ;)

I can refactor this change with your suggested approach and vet it against my use case if you think that is preferable. I'm good either way.

That is why I would also like to have this test. As I see it, the renderChunk hook is the major source of asynchronicity we have, so if we have a test specifically for that, then I would feel good that we do not break it in the future.
I must admit that it is micro-optimization to use placeholders instead of sorting, on the other hand, the data structure is already created. And my expectation is that is is as deterministic as sorting by placeholder name. Which is to say, the latter could also be broken. Which brings me back to the test ;)

In any case, thank you so much for looking into this and providing a solution for this issue in the first place! I know that indeterminism is a major source of pain for those affected by it, and you are making it better for everyone. If you need help, let me know, then I can also have a look at writing that test.

@mattkubej
Copy link
Contributor Author

you're suggesting modeling that test off of this one?

Yes, roughly, except that of course you would build twice. And note that this is an old test, you can use async-await these days to make the test much nicer ;)

I can refactor this change with your suggested approach and vet it against my use case if you think that is preferable. I'm good either way.

That is why I would also like to have this test. As I see it, the renderChunk hook is the major source of asynchronicity we have, so if we have a test specifically for that, then I would feel good that we do not break it in the future. I must admit that it is micro-optimization to use placeholders instead of sorting, on the other hand, the data structure is already created. And my expectation is that is is as deterministic as sorting by placeholder name. Which is to say, the latter could also be broken. Which brings me back to the test ;)

In any case, thank you so much for looking into this and providing a solution for this issue in the first place! I know that indeterminism is a major source of pain for those affected by it, and you are making it better for everyone. If you need help, let me know, then I can also have a look at writing that test.

Thanks for your thoughts and guidance! I'll take a swing at the test shortly and also adjust the PR to use the existing placeholder set. I'll report back once that is done or if I could use any assistance with the test. Thanks again!

@mattkubej
Copy link
Contributor Author

mattkubej commented Sep 17, 2024

@lukastaegert I removed the sorting, switched to the stable placeholder set, and introduced a test.

I validated that the test fails on master as well.

Test failing on master
image

I've updated the PR description to reflect the latest changes. Let me know if you'd like me to make any additional changes.

@mattkubej mattkubej changed the title fix: apply final hashes deterministically via sorted processing of chunks fix: apply final hashes deterministically with stable placeholders set Sep 17, 2024
Copy link
Member

@lukastaegert lukastaegert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, great work!

@lukastaegert lukastaegert added this pull request to the merge queue Sep 19, 2024
Merged via the queue into rollup:master with commit 447c191 Sep 19, 2024
38 checks passed
Copy link

This PR has been released as part of rollup@4.22.0. You can test it via npm install rollup.

lukastaegert added a commit that referenced this pull request Sep 20, 2024
It appears that this might cause some build failures
that need further investigation.
lukastaegert added a commit that referenced this pull request Sep 20, 2024
It appears that this might cause some build failures
that need further investigation.
lukastaegert added a commit that referenced this pull request Sep 20, 2024
It appears that this might cause some build failures
that need further investigation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants