Unit classification mode: Integrate TextClassificationUnit.setSuffix() with externalId #251

daxenberger · 2015-07-10T07:47:01Z

In unit mode, the id2outcome report contains random IDs for units in documents with more than one unit. It would be much easier to interpret and analyze classification errors if the IDs were interpretable.
To make this happen, we should introduce a new feature on the TextClassificationUnit type which (optionally) stores a user-specified string. This field can be set together with the label for this unit during reading. In ClassificationUnitCasMultiplier, where the units are "converted" into instances/CASes, this field should be used to set the document ID of the CAS (which is later used by the id2outcome report). If no user-specified IDs are set, the current approach (increasing index for multiple units in a document) can be applied instead.

zesch · 2015-07-10T07:54:59Z

We already have something similar in the form of the "suffix" field in the TextClassificationUnit type.
This does not replace the numerical IDs; as the name says, it is added as a suffix.

daxenberger · 2015-07-10T08:10:09Z

True. However, AFAIK this suffix is only explicitly used in a feature (ID_FEATURE_NAME) and is not applied within the id2outcome report. I guess the best option is to append this (string) value to the document ID in ClassificationUnitCasMultiplier, rather than adding a new field. The IDs do not need to be numeric.

zesch · 2015-07-10T08:12:04Z

They need to be unique. When only using a user-defined name, we cannot rely on that.

daxenberger · 2015-07-10T08:15:29Z

We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique.

zesch · 2015-07-10T08:18:38Z

Yes, that is right. As document IDs are usually file names, the problem is less critical there as they need to be kind of unique already.
As we have multiple units per document, there is some danger that we will get CASes with the same name here.

However, I'm just pointing out the problem. Whether we want to take action or leave it to the user is another question.

reckart · 2015-07-10T08:22:29Z

> We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique.

I don't see why uima(FIT) should complain about any IDs not being unique. From the perspective of these frameworks, I'd say the IDs you are speaking about are "user-defined" and opaque to the framework.

daxenberger · 2015-07-10T08:23:23Z

Within a single document, it is easy to ensure that user-specified IDs are unique by checking them in ClassificationUnitCasMultiplier. If user-specified IDs are not unique within the same document, we can append numeric indices and output a warning.
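
A minimal sketch of that check, for illustration only. `UnitIdAssigner`, `assignIds`, and the `_` separator are hypothetical names and choices, not existing DKPro TC code; the real ClassificationUnitCasMultiplier may build its IDs differently. It falls back to the running unit index when no suffix is set, and appends a numeric index (with a warning) when a suffix is repeated within the document.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of DKPro TC. */
final class UnitIdAssigner {

    private UnitIdAssigner() {
    }

    /**
     * Builds one ID per unit of a single document.
     *
     * @param documentId ID of the enclosing document
     * @param suffixes   user-specified suffix per unit, in document order;
     *                   null or empty means that no suffix was set
     */
    static List<String> assignIds(String documentId, List<String> suffixes) {
        Set<String> used = new HashSet<>();
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < suffixes.size(); i++) {
            String suffix = suffixes.get(i);
            // Fall back to the running unit index if no suffix was set.
            String base = (suffix == null || suffix.isEmpty())
                    ? documentId + "_" + i
                    : documentId + "_" + suffix;
            String candidate = base;
            int counter = 1;
            // Append a numeric index until the ID is unique within the document.
            while (!used.add(candidate)) {
                candidate = base + "_" + counter;
                counter++;
            }
            if (!candidate.equals(base)) {
                System.err.println("WARN: duplicate unit ID '" + base
                        + "', using '" + candidate + "' instead");
            }
            ids.add(candidate);
        }
        return ids;
    }
}
```

Tracking the IDs already handed out in a set (rather than counting duplicate suffixes) also covers the corner case where an appended index happens to collide with another user-specified suffix.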

reckart · 2015-07-10T08:25:32Z

Btw. external IDs are a very common request... a related issue is e.g. dkpro/dkpro-core#609

daxenberger · 2015-07-10T08:36:27Z

The feature requested in dkpro/dkpro-core#609 will be bound to sentences or tokens, right? In DKPro TC we need to bind it to units (which can be sentences, but also many other things). If there is a way to combine these requests and to make it a common feature, that would be good; but at the moment I have no idea how this could be done.

reckart · 2015-07-10T08:45:55Z

Well... we don't have a common base-class for all DKPro annotations and I personally would prefer not to introduce one (gut feeling). If we had such a base-class, we could introduce all kinds of common features there and they would be automatically inherited. But we'd risk inheriting such common features where they are not actually needed and there is no way of "removing" features in subclasses.

Unfortunately, UIMA currently doesn't allow adding "interfaces" to JCas classes. One could imagine adding an interface "HasExternalId" to certain types, which would then have to define an "externalId" field along with getters and setters. Cf. https://issues.apache.org/jira/browse/UIMA-3354

So currently, I tend towards introducing a set of "helper" classes, e.g.

ExternalIdHelper.getExternalId(annotation); 
ExternalIdHelper.setExternalId(annotation, id);

If the annotation type defines a field "externalId", the methods would work, otherwise they would throw an exception.
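
To make the intended behaviour concrete, here is a rough, runnable stand-in. It is only a sketch under assumptions: a real helper would presumably look the feature up through the UIMA type system, whereas this version uses plain Java reflection on bean-style `getExternalId`/`setExternalId` accessors so that it runs without UIMA on the classpath. `SampleAnnotation` is a made-up class, and the exception types are my choice.

```java
import java.lang.reflect.Method;

/** Illustrative stand-in; the real helper would use the UIMA feature API. */
final class ExternalIdHelper {

    private ExternalIdHelper() {
    }

    static String getExternalId(Object annotation) {
        return (String) invoke(annotation, "getExternalId", new Class<?>[0]);
    }

    static void setExternalId(Object annotation, String id) {
        invoke(annotation, "setExternalId", new Class<?>[] { String.class }, id);
    }

    private static Object invoke(Object target, String name,
            Class<?>[] paramTypes, Object... args) {
        try {
            // Only succeeds if the type exposes a public accessor for externalId.
            Method accessor = target.getClass().getMethod(name, paramTypes);
            return accessor.invoke(target, args);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(target.getClass().getName()
                    + " does not define an externalId field", e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot access externalId on "
                    + target.getClass().getName(), e);
        }
    }
}

/** Made-up type that defines the field; stands in for a generated JCas class. */
final class SampleAnnotation {
    private String externalId;

    public String getExternalId() {
        return externalId;
    }

    public void setExternalId(String externalId) {
        this.externalId = externalId;
    }
}
```

With this, `ExternalIdHelper.setExternalId(new SampleAnnotation(), "sent-17")` works, while passing an object without the accessors fails with an `IllegalArgumentException`.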

Btw. the same "helper" class solution would allow removing customized code from the generated JCas classes, e.g. from DocumentMetaData. That in turn would allow us to remove the JCas classes from the repository and automatically generate all of them during build. Currently, most JCas classes in DKPro Core are generated during build, but some are not due to customizations.

reckart · 2015-07-10T08:52:01Z

Btw. there are some considerations towards making JCas more dynamic in future versions of UIMA:

https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3#IdeasforUIMAJv3-AutomaticJCasGenofmergedtypesystem

logological · 2015-07-10T09:10:43Z

Not only the instance IDs but the label IDs should be consistent from report to report. Is there already an issue for that?

daxenberger · 2015-07-10T09:38:41Z

> So currently, I tend towards introducing a set of "helper" classes, e.g.
> ExternalIdHelper.getExternalId(annotation);
> ExternalIdHelper.setExternalId(annotation, id);
> If the annotation type defines a field "externalId", the methods would work, otherwise they would throw an exception.

Sounds good to me. That would require us to agree upon a common name for that field.

reckart · 2015-07-10T09:44:05Z

It would merely require that you don't object to using "externalId" ;)

reckart · 2015-07-10T09:46:48Z

Mind that "externalId" is the name I have in mind for DKPro Core. It implies that a single annotation type cannot have more than one external id. In case you store labelId and instanceId in a single type, a different DKPro TC solution would be required.

... or we could maintain multiple external IDs, e.g.

ExternalIdHelper.getExternalId(annotation); 
ExternalIdHelper.setExternalId(annotation, idNamespace, id);

But I'm not sure... it seems a bit of an overkill. I believe in the majority of cases, only one external ID would be required.

daxenberger · 2015-07-10T10:01:12Z

> Not only the instance IDs but the label IDs should be consistent from report to report. Is there already an issue for that?

I'm not exactly sure whether I understand that request. Unit (TextClassificationUnit) and label (TextClassificationOutcome) annotations need to span the same text, so it should be enough to have unique IDs on the units.
Or do you have another application in mind?

logological · 2015-07-10T10:21:21Z

Currently id2outcome.txt uses numeric IDs for the classification outcomes, but (at least for cross-validation experiments) these IDs are not consistent from file to file. For example, BrownPosDemo does a two-fold cross-validation. One of the id2outcome.txt files uses the following mapping of numeric IDs to the original labels:

0=NPg 2=JJ 1=(null) 3=RB 5=TO 4=PPS 6=RP 7=NP 8=NN 10=VBN 9=VB 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

The other id2outcome.txt file uses a slightly different mapping:

0=NPg 2=(null) 1=JJ 3=RB 5=PPS 4=TO 6=RP 7=NP 8=NN 10=VB 9=VBN 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

In order to get the raw classifications, not only do I need to combine the two files, but I first have to manually un-map all the numeric IDs to their original labels.

It would be better if id2outcome.txt didn't use numeric IDs at all, but rather used the original label IDs. If for some reason the mapping to numeric IDs is necessary, it would be helpful if the mapping were consistent across files.
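
One possible way to get a consistent mapping, sketched with made-up names (`StableLabelIndex` and `build` are not DKPro TC code, and I don't know where the current mapping is built): assign the numeric IDs in sorted label order instead of in whatever order the labels are first seen.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of DKPro TC. */
final class StableLabelIndex {

    private StableLabelIndex() {
    }

    /**
     * Maps each distinct label to a numeric ID in sorted label order,
     * so the result does not depend on the order of the input.
     */
    static Map<String, Integer> build(Collection<String> labels) {
        Map<String, Integer> index = new LinkedHashMap<>();
        // TreeSet removes duplicates and iterates in natural (sorted) order.
        for (String label : new TreeSet<>(labels)) {
            index.put(label, index.size());
        }
        return index;
    }
}
```

This only yields identical mappings across folds if every fold sees the same set of labels; otherwise the mapping would have to be built once from the full label set and reused.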

EmilyKJamison · 2015-07-10T11:08:40Z

@tristan: Whoops, we have two different discussions here. User-set unit IDs are a separate issue from maintaining continuity in the numeric IDs of classifications in a CV task.

@johannes, CC: Tristan: Yes, TextClassificationUnit.setSuffix(String v) does show up in id2outcome, along with the auto-generated numeric unit ID. AFAIK, the proposed issue is that it would still be desirable to remove the auto-generated unit ID.

logological · 2015-07-13T09:07:57Z

OK, I've opened a separate issue for the outcome classification labels (Issue 252).

daxenberger · 2015-07-13T09:24:58Z

Emily is right. These are two issues (the second to be tracked in Issue 252).
TextClassificationUnit.setSuffix(String v) does indeed show up in the id2outcome report. I'll add a line to the BrownCorpusReader to show its usage and to make it effective in the BrownUnitPOSDemo. This should solve the original request.

I'm leaving this issue open to integrate it with DKPro's externalId.

Horsmann · 2016-03-26T21:49:59Z

> integrate it with DKPro's externalId.

What is left to do for that?

reckart · 2016-03-27T08:34:21Z

Core hasn't implemented an "externalId" mechanism yet.

Horsmann · 2018-02-09T22:32:41Z

Any updates on this one?

reckart · 2018-02-09T22:33:18Z

Nope.

Horsmann · 2018-02-09T22:35:01Z

OK, then I'll close this for now. Once demand for this kind of change grows, a new issue can be opened.

reckart · 2018-02-09T22:35:42Z

Why not leave it open and push it to the next release? If it gets closed, it's easier to miss past discussions on the topic.

daxenberger added Type-Enhancement labels Jul 10, 2015

daxenberger self-assigned this Jul 10, 2015

daxenberger changed the title ~~Unit classification mode: Augment id2outcome report to contain customized ids~~ Unit classification mode: Integrate TextClassificationUnit.setSuffix() with externalId Jul 13, 2015

reckart modified the milestone: 0.8.0 Aug 8, 2015

reckart removed the Milestone-Release0.8.0 label Aug 8, 2015

reckart added enhancement and removed Type-Enhancement labels Sep 6, 2015

Horsmann modified the milestones: 0.9.0, 0.8.0 Mar 27, 2016

Horsmann modified the milestones: 1.0.0, 0.9.0 Oct 19, 2016

Horsmann closed this as completed Feb 9, 2018

Horsmann reopened this Feb 9, 2018

Horsmann modified the milestones: 1.0.0, 1.1.0 Apr 11, 2018
