Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay calculation of LCS in pairs #1589

Open
rien opened this issue Sep 20, 2024 · 0 comments
Open

Delay calculation of LCS in pairs #1589

rien opened this issue Sep 20, 2024 · 0 comments
Labels
algorithm experiment This is an experiment and might not be released 🧪

Comments

@rien
Copy link
Member

rien commented Sep 20, 2024

Running the Plutokiller dataset (1000+ files) takes 45 seconds:

  • 25 seconds are spent tokenizing
  • 20 seconds are spent calculating all pairs

During pair calculation, we create all $\Theta(n^2)$ pairs and for each pair we perform a longest-common-substring algorithm to search for the longest fragment, this is an $\Theta(n^2)$ implementation in itself.

Disabling the LCS calculation shaves down the time spent calculating the pairs to 50 ms. As we don't use the LCS for the initial visualizations, we could easily skip this step.

However, we will need the LCS if the user would go to the pairs (or similar) page, so there could be a worker that starts calculating all the LCS' once the user opens the browser.

Changes needed

  • Lib: Do not calculate the LCS when constructing a new pair. This could be easily implemented without API changes by making longest a getter that is lazily calculated.
  • CLI: Disable writing the longest fragment to the output CSV. As this is a change of the CLI API, we likely want this to be an option that can be enabled again.
  • Web: change the Pair model and store to reflect this change.
    • We still want to support CSV's with the longest fragment calculated. We already approach lazily calculating fragments in a similar way by only calculating them when they are required.
    • We cannot wait for calculating the LCS until it is needed, because we do need all of them at once and that can take 20 seconds. Once the data is loaded, we can start a worker in the background that calculates the LCS (using the methods already provided by dolos-core.
    • Implement loading states for components that do use the longest fragment, as the user can possibly still go to these pages.
    • Optionally: we could cache the LCS data using localstorage in order to prevent this data from being recalculated every time the page is loaded again.
@rien rien added algorithm experiment This is an experiment and might not be released 🧪 labels Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
algorithm experiment This is an experiment and might not be released 🧪
Projects
None yet
Development

No branches or pull requests

1 participant