-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for Issue 2988: Change individual to grouped parsing #2989
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…t-specific so multiple targets can be compared side-by-side.
…Use trgen to create these from scratch and remaster everything.
kaby76
changed the title
Fix for Issue 2988
Fix for Issue 2988: Change individual to grouped parsing
Jan 5, 2023
…--not yet available.
…ile, or use "-input" for a string.
…plicate functionality, when it's not clear how best to implement this. Removing newlines indescriminately is NOT the way to do a comparison!
…ctly what is in makefile and test.sh, as they should have been to begin with!! The WHOLE POINT of providing a makefile and test.sh is so the details of building and running the generated parser driver code--which changes for every damn target--are completely hidden, and so I don't need to know how to compile and run Go vs Java vs C# vs Dart, etc. Still, all I want is to use a Powershell environment and type "builder.ps1" or "tester.ps1" and just get the thing working. Prior to this, all I could do was "cd to the root directory", type "pwsh _scripts/test.ps1" and it would test everything.
…l if you use --template-source-directory, but it's an embedded resource in trgen.
@teverett @KvanTTT After working on this PR for two weeks, I think I finally have this PR ready for review. The new Bash and Powershell scripts should resemble each other more closely and across all Antlr targets. That should make it easier to understand and maintain. After this, I plan to fix #2883 |
…it was wrong. Fixed.
…conflicting code.
…r tool call for JavaScript.
…sue with both test.ps1 and test.sh. It does not appear on Github Action boxes, but only on my system. Rather than throw the test out, I'm renaming to bypass the issue.
… files are now .gitignored, but we need to override that when checking for errors.
Needing to refork this repo, and squash commits. Will open a new PR. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Statement of Problem
This PR is to address #2988.
Tests in the V4 repo parse one input file per program invocation, which I call individual parsing. But, we can save a good deal of time by parsing multiple input files per program invocation, called grouped parsing. Let's see what this looks like in the output from a build.
Individual parsing
, one test is processed by one program call:Grouped parsing
, all tests are processed by one program call:With grouped parsing, the time for parsing each successive input should be shorter relative to the time the input took with individual parsing. Internally, Antlr keeps a cache of prediction context between each parse for faster parsing. Grouped parsing should help for Java, CSharp, and probably a few other targets. But, it may not help with all targets, especially PHP, which shows no speed-up with warm-up, likely a bug. See antlr/antlr-php-runtime#36.
Requirements, Design, and Implementation
To implement grouped parsing, the newest version of Trash trgen will need to be employed, as well as the templates in this repo updated. This PR changes the CI build tests, and some of the
.errors
and.tree
files (more on this explained below).After a detailed reading and testing of the current _scripts/test.ps1 and templates, I came to the conclusion that there was a significant disconnect in the requirements, design, and implementation between the Bash and Powershell test scripts. Therefore, we need to state some requirements for these scripts and the generated scripts for trgen.
Requirements
.errors
and.tree
files; it cannot be redirected because output for all inputs would then be grouped together. NB: Careful attention is placed on capturing all parser output. Any output to stderr that is not for trees and standard parsing errors is considered an error in the execution of the parser. For example, this can happen especially with dynamically typed and parsed languages like JavaScript, Python3, and PHP in which the base class is not correct, and the interpreter throws an error..errors
and.tree
files.Discussion of requirements
It's hard to remember how to run the compiler and program of the parser program for a particular target. It's even worse trying to remember the syntax and runtimes for the targets. That's the whole point of trgen.
Unfortunately, tester.psm1 is not a stand-alone Powershell script because the extension is ".psm1", not ".ps1".
Further, the script assumed the test input was always in a directory called "examples/". trgen reads the pom.xml to get the location of the input tests.
Since grouped parsing can't use Bash/Powershell capture of the parse tree--because there would be multiple trees captured for multiple input files--the driver program must place the output in a specific output file. But, then, how does the script check diffs of these parse tree files??
In order to implement parse tree file diffs, I use
git diff
to find those files that are changed. It's laborious to write code to do diffs in Bash and Powershell. In fact, the Powershell "diff" strips out newlines of a multiline output file contained in a string. Let's make the code simpler by just using "git". If there's no git, or the grammar has been removed from the rest of the cloned repo, then the tests should just default to returning any errors found, even if they are expexted.The
.errors
files seem to have extra newlines at the end. Extremely annoying. I don't know if they are hand-edited or the drivers that created them include extra printf's for the hell of it. In any case, I remastered a number of these files with this PR in order for the tests to pass.From this point forward, test.sh or test.ps1 should be used to remaster all
.tree
and.errors
files, as pondered in this comment.Results
Although not scientific, I compared one testing of the V4 grammars with individual parsing vs grouped parsing. On an unloaded Windows 11 machine (Ryzen 7 2700, 8-cores, DDR4 16GB, SanDisk SDSSDH3 1TB, Antlr4.11.1, NET SDK 7.0.101), individual parsing of the CSharp target completed in
1h 44m
, grouped parsing completed in1h 11m
. That is significantly faster.Conclusions
This is an important PR. Grouped parsing helps a lot in the parse time for the build. But, overall, the main bottleneck in the build is the compilation of the generated code. It really goes back to #2883. The "incremental" testing isn't really working because, somehow, the information on what specifically to test is ignored. In addition, we shouldn't be testing combined grammars because they don't have target-specific action code.