Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update EDTF regex to be more lenient, or use custom format instead #421

Closed
bdarcus opened this issue Jun 2, 2022 · 24 comments
Closed

Update EDTF regex to be more lenient, or use custom format instead #421

bdarcus opened this issue Jun 2, 2022 · 24 comments

Comments

@bdarcus
Copy link
Member

bdarcus commented Jun 2, 2022

While working on #420, I realize the EDTF regex is too strict. In that case, I noticed it doesn't allow date-times, but I'm sure there are others, and my regex skills are too limited to feel comfortable tackling it on that PR.

See also citation-style-language/documentation#145


See #420 (comment)

Regexes cannot realistically test for valid edtf dates across the range of possible dates/date ranges/date combos expressable in edtf, so it'd either be too lenient or too restrictive.

So really the question is what, more lenient, regex will work here?

@bdarcus bdarcus changed the title Update EDTF regex to be more complete Update EDTF regex to be more lenient Jun 2, 2022
@bdarcus bdarcus changed the title Update EDTF regex to be more lenient Update EDTF regex to be more lenient, or use custom format instead Jun 3, 2022
@bdarcus
Copy link
Member Author

bdarcus commented Jun 3, 2022

@retorquere - sorry, I should have moved the thread here earlier.

Personally, I'd add edtf validation as custom format attributes, and provide a reference implementation using ajv + EDTF.js. If I'm reading https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.7.2.3 right, processors that don't know about these custom properties would validate presence but not format of the property.

This makes sense probably. In any case, a regex doesn't seem practical.

Here's what it would like in the schema, except ajv will throw an error if it doesn't understand the format. So maybe best to just defer this.

    "edtf-string": {
      "title": "EDTF date string",
      "description": "CSL input supports EDTF, levels 0 and 1.",
      "type":"string",
      "format": "edtf/level_0+1"
    }

It's seems like this just shouldn't go in the schema, and validation can be recommended independently somehow.

Any suggestions @inukshuk?

bdarcus added a commit that referenced this issue Jun 3, 2022
This may not be the ideal long-term solution, but a regex isn't
well-suited to validating an edtf string.

In the future, we may want to use a dedicated 'format' property here.

See #421
bdarcus added a commit that referenced this issue Jun 3, 2022
This may not be the ideal long-term solution, but a regex isn't
well-suited to validating an edtf string.

In the future, we may want to use a dedicated 'format' property here.

See #421
@retorquere
Copy link

@retorquere - sorry, I should have moved the thread here earlier.

Personally, I'd add edtf validation as custom format attributes, and provide a reference implementation using ajv + EDTF.js. If I'm reading https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.7.2.3 right, processors that don't know about these custom properties would validate presence but not format of the property.

This makes sense probably. In any case, a regex doesn't seem practical.

Here's what it would like in the schema, except ajv will throw an error if it doesn't understand the format. So maybe best to just defer this.

Then it looks like it's out of spec:

6.5. Extending JSON Schema

Additional schema keywords and schema vocabularies MAY be defined by any entity. Save for explicit agreement, schema authors SHALL NOT expect these additional keywords and vocabularies to be supported by implementations that do not explicitly document such support. Implementations SHOULD ignore keywords they do not support.

@retorquere
Copy link

ajv is looks to be out of spec by default, but it can be configured to be compliant:

const Ajv = require("ajv")
const ajv = new Ajv({ strict: false})

const schema = {
  type: "object",
  properties: {
    foo: {type: "integer"},
    bar: {type: "string"},
    baz: {other: true}
  },
  required: ['baz', "foo"],
  additionalProperties: false
}

const data = {foo: 1, bar: "abc", baz: 5}
const valid = ajv.validate(schema, data)
if (!valid) console.log(ajv.errors)

so I'd say it'd be a good thing to ship a schema with this extension, certainly if we provide a reference implementation that does do validation.

@bdarcus
Copy link
Member Author

bdarcus commented Jun 5, 2022

A problem I've noticed in my dabbling with the JSON schema ecosystem is it's not very consistent. So it's not always enough to do things according to spec.

Some questions we'll want to answer:

  1. Beyond ajv, what the other primary validation tools? How do they handle "format"?
  2. How do IDE's handle this?

On 1, at least https://www.jsonschemavalidator.net/ ignores the "format" property.

In any case, I'll leave this issue open and we can figure out WDT as we get closer to a release (including agreement among developers about 1.1 in general).

I agree, though, we ideally want this:

... it'd be a good thing to ship a schema with this extension, certainly if we provide a reference implementation that does do validation.

... which would also mean adding back that "format" property, once we settle what it's value should be, etc..

@retorquere
Copy link

for ajv, this would work. I could look at other json-schema implementations. This implementation could be shipped as a module that adds the validator to ajv by simply requiring the module, it's a common way to ship ajv extensions.

const Ajv = require("ajv")
const ajv = new Ajv({ strict: false})
const { parse } = require('edtf')

const schema = {
  type: "object",
  properties: {
    foo: {type: "integer"},
    bar: {type: "string"},
    baz: {edtf: [0, 1]}
  },
  required: ['baz', "foo"],
  additionalProperties: false
}

function edtfValidator(levels, date) {
  edtfValidator.errors = []

  if (typeof date !== 'string') {
    edtfValidator.errors.push('not a valid EDTF date')
    return false
  }

  try {
    const parsed = parse(date)

    if (!levels.includes(parsed.level)) {
      edtfValidator.errors.push(`expected EDTF level ${levels.map(l => `${l}`).join(',')}, got ${parsed.level}`)
      return false
    }
  }
  catch (err) {
    edtfValidator.errors.push('not a valid EDTF date')
    return false
  }

  return true
}
ajv.addKeyword({
  keyword: 'edtf',
  validate: edtfValidator,
})

const data = {foo: 1, bar: "abc", baz: '2016-XX'}
const valid = ajv.validate(schema, data)
if (!valid) console.log(ajv.errors)

@bdarcus
Copy link
Member Author

bdarcus commented Jun 5, 2022

This implementation could be shipped as a module that adds the validator to ajv by simply requiring the module, it's a common way to ship ajv extensions.

What would that look like?

Care to submit a PR?

@retorquere retorquere mentioned this issue Jun 5, 2022
8 tasks
@retorquere
Copy link

#422

I was wrong on the way to require it in for ajv, but I've included tests which show how to apply the formats for 4 json-schema validators. There are a ton on npmjs, I can add more implementations, but these 4 seemed to be the fastest so they should be popular.

@retorquere
Copy link

alternately this could be spun off as it's own npm module, or offered to @inukshuk for inclusion in EDTF.js

@bdarcus
Copy link
Member Author

bdarcus commented Jun 5, 2022

Many thanks for the PR!

alternately this could be spun off as it's own npm module, or offered to @inukshuk for inclusion in EDTF.js

Yes, I was wondering about this earlier. What do you say @inukshuk?

I definitely think it would be more widely useful beyond CSL.

@bdarcus
Copy link
Member Author

bdarcus commented Jun 6, 2022

BTW, I just tested the "format" property with this supposedly fast and compliant go validator, and it correctly ignored it.

https://github.com/santhosh-tekuri/jsonschema

@inukshuk
Copy link
Member

inukshuk commented Jun 6, 2022

This looks nice! Where you thinking of bundling a validate function with edtf.js or something beyond that?

As for the validation, the levels 0-2 are strictly in order; in the few cases where grammar rules of different levels can both produce a given string, the parser will always report the lower level. That is to say, I don't think it's necessary to define level ranges like 0+1, just specifying the highest level should work as well. Similarly, if you set the parser to a certain level, e.g. edtf.parse('...', { level: 1 }) it will also accept level 0 strings.

Finally, there was the old issue of season intervals. Those are currently not supported by the spec, but edtf.js supports them, specifically because @retorquere and others rightly argued that they are important for bibliographic use. Season intervals are defined at level 3 so we'd have to make up our mind how to handle those during validation.

@retorquere
Copy link

This looks nice! Where you thinking of bundling a validate function with edtf.js or something beyond that?

I think that would be the best option, really. There's currently 4 supported json-schema libraries that could mostly transfer to edtf.js as-is, and adding new ones looks to be trivial.

As for the validation, the levels 0-2 are strictly in order; in the few cases where grammar rules of different levels can both produce a given string, the parser will always report the lower level. That is to say, I don't think it's necessary to define level ranges like 0+1, just specifying the highest level should work as well. Similarly, if you set the parser to a certain level, e.g. edtf.parse('...', { level: 1 }) it will also accept level 0 strings.

I've updated the implementation to support edtf/[0,1,2,3] and edtf/[0,1]+3.

I'd be happy to submit a PR on EDTF.js. The test script (or something like it) on this repo could perhaps remain to show how to use it for this schema.

Finally, there was the old issue of season intervals. Those are currently not supported by the spec, but edtf.js supports them, specifically because @retorquere and others rightly argued that they are important for bibliographic use. Season intervals are defined at level 3 so we'd have to make up our mind how to handle those during validation.

It's for this reason I've included the +3 syntax. Parsing using {level: X} would be convenient, but given the description above, {level: 3} seems different from testing for the case of detected level [0,1,3].

@retorquere
Copy link

Oh, on re-reading your comment: something further, I was thinking to move the ready-to-run schema format functions now in the PR to EDTF.js

@inukshuk
Copy link
Member

inukshuk commented Jun 6, 2022

Sounds good, let's do that!

@retorquere
Copy link

Is there an easy way to say "level 1 with season-range support, but not level 2"? Right now I'm walking the parse results to test for level detected, but I'm very likely re-doing work already done in EDTF.js.

@inukshuk
Copy link
Member

inukshuk commented Jun 6, 2022

In terms of efficiency, that's not a problem. EDTF.js uses an Earley parser which can produce multiple valid derivations for the same string. So what the parse method currently does is pretty much the same thing as your approach: discarding derivations which don't fit the level requirement.

Since 'Level 3' doesn't actually exist I wonder if we shouldn't use a different syntax to say that we want to either include extensions or just season intervals in particular? I.e., the validation itself would stay the same for the time being, just that we use a different format to express 'edtf level 1 + season intervals'.

@bdarcus
Copy link
Member Author

bdarcus commented Jun 6, 2022

Since 'Level 3' doesn't actually exist I wonder if we shouldn't use a different syntax to say that we want to either include extensions or just season intervals in particular? I.e., the validation itself would stay the same for the time being, just that we use a different format to express 'edtf level 1 + season intervals'.

Just "edtf/level-0+level-1+seasons"?

So +-delimited flags, which can either be full levels, or feature components of those levels?

PS - seasons are definitely important for this use case.

@retorquere
Copy link

Technically possible, but it'd mean implementing all format permutations. I reckon 0, 0+season-ranges, 1, 1+season-ranges, 2, 2+season-ranges would suffice for json schema formats, and for the parser, something like { level: 1, seasonRanges: true} should do.

@inukshuk
Copy link
Member

inukshuk commented Jun 6, 2022

The EDTF spec leaves some room when describing compliance. E.g., "Level 1 is supported, and in addition the following features of level 2 are supported (list features)." so I think this is definitely preferable over using the non-existent level 3.

However, I'd definitely go with "edtf/level-1+seasons-intervals" because level 1 includes level 0 already, there's no reason to declare it explicitly.

Perhaps we can use something shorter than "season-intervals"; just "seasons" might be misleading, because I might want to specify something like "edtf/level-0+seasons" (since seasons is a level 1 feature).

@retorquere
Copy link

This way of encoding it in the format string is not the best way though. The implementations I know support type string+a fixed number of explicit formats, or just a new keyword, and then something like edtf: { level:1, seasonInterval: true} rather than a string to parse would be better. The benefit of type string + format is that parsers that don't know about the format will still check stringness and will not error out.

@bdarcus
Copy link
Member Author

bdarcus commented Jun 6, 2022

Does that mean one can use an object for the format property?

@inukshuk
Copy link
Member

inukshuk commented Jun 6, 2022

Oh yes, passing it as an extra option would be much preferable.

@retorquere
Copy link

I don't think string+format allows for format to be an object. We'd be looking at a custom keyword, which has varying support in validator libraries.

@retorquere
Copy link

https://github.com/retorquere/json-schema-edtf

bdarcus added a commit that referenced this issue Jun 12, 2022
This allow use of an edtf.js-based validator on the dates.

https://github.com/retorquere/json-schema-edtf

The json schema spec says tools should ignore the property
if they don't know what to do with its value.

Close #421
bdarcus added a commit that referenced this issue Jun 20, 2022
This allow use of an edtf.js-based validator on the dates.

https://github.com/retorquere/json-schema-edtf

The json schema spec says tools should ignore the property
if they don't know what to do with its value.

Close #421
@bdarcus bdarcus closed this as completed Mar 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants