Fix for Issue 2988: Change individual to grouped parsing #2989

kaby76 · 2023-01-04T00:56:06Z

Statement of Problem

This PR is to address #2988.

Tests in the V4 repo parse one input file per program invocation, which I call individual parsing. We can save a good deal of time by parsing multiple input files per program invocation, which I call grouped parsing. Here is what each looks like in the output from a build.

For individual parsing, each test is processed by its own program call:

./bin/Debug/net6.0/Test.exe ../examples/AllInOne8.java
Time: 00:00:05.9106526
Parse succeeded.
./bin/Debug/net6.0/Test.exe ../examples/helloworld.java
Time: 00:00:00.5270543
Parse succeeded.
./bin/Debug/net6.0/Test.exe ../examples/IdentifierTest.java
Time: 00:00:00.6478677
Parse succeeded.
...

For grouped parsing, all tests are processed by one program call:

./bin/Debug/net6.0/Test.exe ../examples/AllInOne8.java ../examples/helloworld.java ../examples/IdentifierTest.java ...
CSharp 0 ../examples/AllInOne8.java success 6.1742676
CSharp 1 ../examples/helloworld.java success 0.0255556
CSharp 2 ../examples/IdentifierTest.java success 0.1810686
...

With grouped parsing, each input after the first should take less time than it does with individual parsing, because Antlr keeps a cache of prediction contexts between parses, so later parses start warm. In the output above, helloworld.java drops from about 0.53 s to 0.03 s and IdentifierTest.java from about 0.65 s to 0.18 s, while the first input, AllInOne8.java, takes about the same time either way. Grouped parsing should help for Java, CSharp, and probably a few other targets. It may not help with all targets, though: PHP shows no speed-up with warm-up, which is likely a bug. See antlr/antlr-php-runtime#36.
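
As a rough illustration of the difference between the two invocation styles, the sketch below uses a shell function as a stand-in for the generated driver. The function `driver` and the `launches` counter are hypothetical and are not part of trgen or the templates; the real driver would be an executable such as ./bin/Debug/net6.0/Test.exe. The only point is that individual parsing starts the program once per input, whereas grouped parsing starts it once for all inputs, which is what lets later parses reuse a warm cache.

```shell
# Hypothetical stand-in for the generated parser driver.
# Each call represents one program launch.
driver() {
    launches=$((launches + 1))
    for input in "$@"; do
        echo "parsed $input"
    done
}

# Input files taken from the build output shown earlier.
set -- AllInOne8.java helloworld.java IdentifierTest.java

# Individual parsing: one launch per input file.
launches=0
for f in "$@"; do
    driver "$f" > /dev/null
done
individual_launches=$launches

# Grouped parsing: a single launch that receives every input file.
launches=0
driver "$@" > /dev/null
grouped_launches=$launches

echo "individual: $individual_launches launches, grouped: $grouped_launches launch"
```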

Requirements, Design, and Implementation

To implement grouped parsing, the newest version of Trash trgen must be used, and the templates in this repo must be updated. This PR changes the CI build tests and some of the .errors and .tree files (explained below).

After a detailed reading and testing of the current _scripts/test.ps1 and templates, I concluded that there was a significant disconnect in the requirements, design, and implementation between the Bash and Powershell test scripts. Therefore, we need to state some requirements for these scripts and for the scripts generated by trgen.

Requirements

The basic requirement for the Bash or Powershell test scripts is to provide a simple, self-contained method to build and test the code in a Bash or Powershell environment.
The scripts should take minimal parameters so that people can focus on changing a grammar rather than wasting time figuring out how to test the grammar across eight targets. (Note: to be implemented for "Build is very slow, tests too much, outputs too much" #2883.)
trgen should generate scripts test.sh and test.ps1. The Bash and Powershell scripts should mirror each other step-by-step. The scripts should also mirror each other across targets (e.g., Java and Dart).
There should be scripts for building, testing, and cleaning up, in both Bash and Powershell.
The top-level scripts regtest.sh and test.ps1 (replacing tester.ps1) should be incremental (taking into account git status), and runnable from any directory (including a grammar, or a collection of grammars, like asm/, antlr/, or sql/). (NB: This requirement will be addressed in "Build is very slow, tests too much, outputs too much" #2883.)
With grouped parsing, the parser driver program must itself generate the .errors and .tree files; its output cannot simply be redirected, because the output for all inputs would then be grouped together. NB: Careful attention is paid to capturing all parser output. Any output to stderr other than trees and standard parsing errors is considered an error in the execution of the parser. This happens especially with dynamically typed, interpreted languages like JavaScript, Python3, and PHP, where, for example, the base class is not correct and the interpreter throws an error.
git will be used to implement diffing of the .errors and .tree files.
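
To make the per-input output requirement concrete, here is a minimal sketch in shell. It is not the driver from this PR (the real drivers are generated per target); `parse_one` and `parse_group` are hypothetical names, the error text is made up, and the `<input>.tree` / `<input>.errors` naming is an assumption for illustration.

```shell
# Hypothetical stand-in for parsing one input: the tree goes to stdout,
# syntax errors go to stderr. The error text below is made up.
parse_one() {
    name=$(basename "$1")
    echo "(compilationUnit $name)"
    case $name in
        bad*) echo "line 1:0 syntax error in $name" >&2 ;;
    esac
}

# Grouped run: every input gets its own .tree and .errors file, so the
# results for different inputs never end up concatenated in one stream.
parse_group() {
    for input in "$@"; do
        parse_one "$input" > "$input.tree" 2> "$input.errors"
    done
}
```

Calling `parse_group a.java bad.java` would leave a.java.tree, a.java.errors (empty), bad.java.tree, and bad.java.errors next to the inputs, which is the shape a later diff step can work with.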

Discussion of requirements

It's hard to remember how to compile and run the parser program for a particular target. It's even worse trying to remember the syntax and runtimes for all of the targets. Hiding those details is the whole point of trgen.

Unfortunately, tester.psm1 is not a stand-alone Powershell script because the extension is ".psm1", not ".ps1".

Further, the script assumed the test input was always in a directory called "examples/". trgen reads the pom.xml to get the location of the input tests.

Since grouped parsing can't use Bash/Powershell capture of the parse tree--because there would be multiple trees captured for multiple input files--the driver program must place the output in a specific output file. But then how does the script check diffs of these parse tree files?

In order to implement parse tree file diffs, I use git diff to find the files that have changed. It's laborious to write code to do diffs in Bash and Powershell. In fact, the Powershell "diff" strips out newlines of a multiline output file contained in a string. Let's make the code simpler by just using "git". If there's no git, or the grammar has been removed from the rest of the cloned repo, then the tests should just default to returning any errors found, even if they are expected.
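
A minimal sketch of such a git-based check, assuming the outputs are committed alongside the inputs. The function name `check_outputs` and its return codes are hypothetical and are not taken from the generated test.sh; the git commands themselves (`git rev-parse --is-inside-work-tree`, `git diff --name-only`, `git ls-files --others`) are standard.

```shell
# check_outputs DIR
#   returns 0 if no .tree/.errors file under DIR differs from the index
#             and no untracked one has appeared,
#   returns 1 if something changed (the paths are printed),
#   returns 2 if DIR is not inside a git work tree, so the caller can
#             fall back to reporting every error it found.
check_outputs() {
    dir=$1
    if ! git -C "$dir" rev-parse --is-inside-work-tree > /dev/null 2>&1; then
        echo "no git work tree; cannot diff outputs" >&2
        return 2
    fi
    # Tracked outputs whose content changed in the working tree.
    changed=$(git -C "$dir" diff --name-only -- '*.tree' '*.errors')
    # Outputs on disk that git does not track. --exclude-standard is
    # deliberately omitted so that .gitignored outputs are listed too.
    new=$(git -C "$dir" ls-files --others -- '*.tree' '*.errors')
    if [ -n "$changed" ] || [ -n "$new" ]; then
        printf '%s\n' "$changed" "$new" | sed '/^$/d'
        return 1
    fi
    return 0
}
```

A test script could run the grouped parse first and then call `check_outputs .` from the grammar directory, treating a return value of 1 as a test failure.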

The .errors files seem to have extra newlines at the end. Extremely annoying. I don't know if they are hand-edited or the drivers that created them include extra printf's for the hell of it. In any case, I remastered a number of these files with this PR in order for the tests to pass.

From this point forward, test.sh or test.ps1 should be used to remaster all .tree and .errors files, as pondered in this comment.

Results

Although not scientific, I compared one test run of the V4 grammars with individual parsing against one with grouped parsing. On an unloaded Windows 11 machine (Ryzen 7 2700, 8 cores, DDR4 16GB, SanDisk SDSSDH3 1TB, Antlr 4.11.1, .NET SDK 7.0.101), individual parsing of the CSharp target completed in 1h 44m and grouped parsing in 1h 11m. That is 33 minutes, or roughly 32%, less time.

Conclusions

This is an important PR. Grouped parsing cuts the parse time for the build considerably. Overall, though, the main bottleneck in the build is the compilation of the generated code. It really goes back to #2883. The "incremental" testing isn't really working because, somehow, the information on what specifically to test is ignored. In addition, we shouldn't be testing combined grammars because they don't have target-specific action code.

…ctly what is in makefile and test.sh, as they should have been to begin with!! The WHOLE POINT of providing a makefile and test.sh is so the details of building and running the generated parser driver code--which changes for every damn target--are completely hidden, and so I don't need to know how to compile and run Go vs Java vs C# vs Dart, etc. Still, all I want is to use a Powershell environment and type "builder.ps1" or "tester.ps1" and just get the thing working. Prior to this, all I could do was "cd to the root directory", type "pwsh _scripts/test.ps1" and it would test everything.

kaby76 · 2023-01-16T02:22:29Z

@teverett @KvanTTT After working on this PR for two weeks, I think it is finally ready for review. The new Bash and Powershell scripts should resemble each other more closely, and across all Antlr targets. That should make them easier to understand and maintain.

After this, I plan to fix #2883.

kaby76 · 2023-01-23T11:14:46Z

I need to re-fork this repo and squash the commits. I will open a new PR.

kaby76 added 7 commits January 3, 2023 13:38

Advance to latest stable version.

2676bc0

Changes for #2988--changing tests to group parsings in one call.

e37cad7

The "-file" option not used anymore.

90080be

Update for changed name of directory for generated code--now is target-specific so multiple targets can be compared side-by-side.

f06fbbb

Fix regression in trgen, use latest alpha2.

c018962

Update templates.

dc0a62c

Remaster all existing .errors files since these were created ad hoc. Use trgen to create these from scratch and remaster everything.

aa331d2

kaby76 changed the title ~~Fix for Issue 2988~~ Fix for Issue 2988: Change individual to grouped parsing Jan 5, 2023

kaby76 added 22 commits January 5, 2023 13:20

Fix template for CSharp.

c2459bf

Call latest trgen.

c2d40cb

Set version of trgen. Make sure to comment out ATN tracing capability--not yet available.

da71019

The "-file" command-line option is no longer recognize. Just name a file, or use "-input" for a string.

5fb7d12

Remove .errors compare in Powershell tester--too many testers that duplicate functionality, when it's not clear how best to implement this. Removing newlines indescriminately is NOT the way to do a comparison!

e237415

Fix exit code from parse with *expected* error (it's not really an error).

ebb3da6

Update Dart driver and test.

da1459e

Updates to tester.ps1 script.

aef2aa1

Rename to be consistent in name as well as purpose and code steps with Bash test.sh

fea2ad7

Merge branch 'antlr:master' into issue-2988

75d13ef

Merge branch 'issue-2988' of https://github.com/kaby76/grammars-v4 into issue-2988

0541b2b

Fixes for Antlr4cs, CSharp, Dart targets with Powershell and Bash testers.

2f76f55

Updates to Antlr4cs, CSharp, Dart, Go templates.

eb23348

"files" should be the result of "find . -type f > files". Not critical if you use --template-source-directory, but it's an embedded resource in trgen.

d858208

Fixing tabs/spaces.

27b407d

Fix newline that should not be there.

4b0c42e

Updates for Cpp template.

518020b

Update Java template. Fix exit code.

047eed6

Updates to Java template.

c006ee0

Updates for JavaScript template.

2f8f35e

Updates to PHP templates.

ba9f6c2

kaby76 added 7 commits January 14, 2023 09:56

Add in triconv test for invalid utf-8 input files in tests.

f1c6100

Fixes to Dart and Go templates.

0559460

Fixes to trgen because Antlr4 grammar was messed up.

2a18ae4

Updated templates with build.sh, clean.sh, clean.ps1, simplified makefile.

ca0712b

Fix build Cpp/Ubuntu.

cc386ee

Add in code to get dependencies.

835158e

Setup for release.

b73c8f1

kaby76 marked this pull request as ready for review January 16, 2023 02:12

Not sure where the code for SyntaxError in javascript came from, but it was wrong. Fixed.

f704711

kaby76 marked this pull request as draft January 18, 2023 00:30

Fix templates.

e23956e

kaby76 marked this pull request as ready for review January 18, 2023 10:53

kaby76 marked this pull request as draft January 18, 2023 12:33

kaby76 added 8 commits January 18, 2023 11:36

Fixes to templates.

dbe625d

Remove conflicting implementations with Test.java.

fb68f29

Fix JS test script. Remove kirikiri-tjs from Python3 testing, remove conflicting code.

f76f128

Fix collision in build.sh and clean.sh with trgen templates. Fix antlr tool call for JavaScript.

3abc063

Add scss grammar to skip javascript. Timeout.

e191295

There seems to be a problem with a file system issue and/or locale issue with both test.ps1 and test.sh. It does not appear on Github Action boxes, but only on my system. Rather than throw the test out, I'm renaming to bypass the issue.

077f178

Fix regex pattern, from dev branch of antlr4 repo.

c32f9ea

These grammars are too slow with the Go target in combined parsing.

0213da4

kaby76 marked this pull request as ready for review January 21, 2023 01:18

kaby76 marked this pull request as draft January 21, 2023 01:20

kaby76 added 3 commits January 20, 2023 21:11

If a test file is not tracked, there will be new .errors files. These files are now .gitignored, but we need to override that when checking for errors.

9363da7

Corrections and commenting code.

bbf21ca

Update templates.

72efb43

kaby76 marked this pull request as ready for review January 22, 2023 22:29

kaby76 closed this Jan 23, 2023
