Unit classification mode: Integrate TextClassificationUnit.setSuffix() with externalD #251
Comments
We already have something similar in the form of the "suffix" field in the TextClassificationUnit type. |
True. However, AFAIK this suffix is only explicitly used in a feature (ID_FEATURE_NAME) and is not applied within the id2outcome report. I guess the best option is to append this (string) value to the document ID in ClassificationUnitCasMultiplier, rather than adding a new field. The IDs do not need to be numeric. |
They need to be unique. When only using a user-defined name, we cannot rely on that. |
We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique. |
Yes, that is right. As document IDs are usually file names, the problem is less critical there as they need to be kind of unique already. However, just pointing out the problem. Whether we want to take action or leave it to the user is another question. |
I don't see why uima(FIT) should complain about any IDs not being unique. From the perspective of these frameworks, I'd say the IDs you are speaking about are "user-defined" and opaque to the framework. |
Within a single document, it is easy to ensure that user-specified IDs are unique by verifying in the ClassificationUnitCasMultiplier. If user-specified IDs are not unique within the same document, we can append numeric indices and output a warning. |
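The uniqueness check described in the comment above could look roughly like the following. This is only a sketch under stated assumptions: `UnitIdDeduplicator`, `makeUnique`, and the `_` separator are hypothetical names and choices, not existing DKPro TC code, and the warning goes to `System.err` instead of whatever logger the multiplier would actually use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of DKPro TC): makes the user-specified unit IDs
// of a single document unique by appending a numeric index to repeated values.
final class UnitIdDeduplicator {

    private UnitIdDeduplicator() {
        // static helper, no instances
    }

    static List<String> makeUnique(List<String> userIds) {
        Set<String> used = new HashSet<>();
        List<String> result = new ArrayList<>(userIds.size());
        for (String id : userIds) {
            String candidate = id;
            int index = 1;
            // Set.add returns false if the candidate is already taken,
            // so keep trying id_2, id_3, ... until one is free.
            while (!used.add(candidate)) {
                index++;
                candidate = id + "_" + index; // "_" separator is an assumption
            }
            if (!candidate.equals(id)) {
                System.err.println("WARNING: unit ID '" + id
                        + "' is not unique within the document, using '" + candidate + "'");
            }
            result.add(candidate);
        }
        return result;
    }
}
```

Called with the IDs `x`, `y`, `x`, `x`, the helper keeps the first `x` and renames the later ones to `x_2` and `x_3`, printing one warning per renamed ID.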
Btw. external IDs are a very common request... a related issue is e.g. dkpro/dkpro-core#609 |
The feature requested in dkpro/dkpro-core#609 will be bound to sentences or tokens, right? In DKPro TC we need to bind it to units (which can be sentences, but also many other things). If there is a way to combine these requests and to make it a common feature, that would be good; but at the moment I have no idea how this could be done. |
Well... we don't have a common base class for all DKPro annotations and I personally would prefer not to introduce one (gut feeling). If we had such a base class, we could introduce all kinds of common features there and they would be automatically inherited. But we'd risk inheriting such common features where they are not actually needed, and there is no way of "removing" features in subclasses. Unfortunately, UIMA currently doesn't allow adding "interfaces" to JCas classes. One could imagine adding an interface "HasExternalId" to certain types, which would then have to define an "externalId" field along with getters and setters. Cf. https://issues.apache.org/jira/browse/UIMA-3354 So currently, I tend towards introducing a set of "helper" classes, e.g.
If the annotation type defines a field "externalId", the methods would work, otherwise they would throw an exception. Btw. the same "helper" class solution would allow removing customized code from the generated JCas classes, e.g. from DocumentMetaData. That in turn would allow us to remove the JCas classes from the repository and automatically generate all of them during build. Currently, most JCas classes in DKPro Core are generated during build, but some are not due to customizations. |
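To illustrate the "helper class" idea from the comment above (work if the type has an "externalId" field, throw otherwise), here is a minimal sketch. It does not use UIMA: plain Java objects stand in for JCas annotation types, and the lookup is done with Java reflection on getter and setter methods. `ExternalIdUtil`, `SentenceStandIn`, and `TokenStandIn` are hypothetical names; a real implementation would presumably look up the feature on the UIMA type instead.

```java
import java.lang.reflect.Method;

// Hypothetical helper: reads and writes an "externalId" property on any object
// whose class declares public getExternalId()/setExternalId(String) methods.
final class ExternalIdUtil {

    private ExternalIdUtil() {
        // static helper, no instances
    }

    static String getExternalId(Object annotation) {
        try {
            Method getter = annotation.getClass().getMethod("getExternalId");
            return (String) getter.invoke(annotation);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Type " + annotation.getClass().getName()
                    + " does not define an externalId field", e);
        }
    }

    static void setExternalId(Object annotation, String externalId) {
        try {
            Method setter = annotation.getClass().getMethod("setExternalId", String.class);
            setter.invoke(annotation, externalId);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Type " + annotation.getClass().getName()
                    + " does not define an externalId field", e);
        }
    }
}

// Stand-in for an annotation type that defines the field (assumption for illustration).
class SentenceStandIn {
    private String externalId;

    public String getExternalId() {
        return externalId;
    }

    public void setExternalId(String externalId) {
        this.externalId = externalId;
    }
}

// Stand-in for an annotation type without the field (assumption for illustration).
class TokenStandIn {
}
```

With these stand-ins, `ExternalIdUtil.setExternalId(new SentenceStandIn(), "s-17")` succeeds, while calling either method on a `TokenStandIn` throws an `IllegalArgumentException`. The point of the design is that no common base class or interface is needed on the annotation types themselves.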
Btw. there are some considerations towards making JCas more dynamic in future versions of UIMA: |
Not only the instance IDs but the label IDs should be consistent from report to report. Is there already an issue for that? |
Sounds good to me. That would require us to agree upon a common name for that field. |
It would merely require that you don't object to using "externalId" ;) |
Mind that "externalId" is the name I have in mind for DKPro Core. It implies that a single annotation type cannot have more than one external id. In case you store labelId and instanceId in a single type, a different DKPro TC solution would be required. ... or we could maintain multiple external IDs, e.g.
But I'm not sure... it seems a bit of an overkill. I believe in the majority of cases, only one external ID would be required. |
I'm not exactly sure whether I understand that request. units (TextClassificationUnit) and label (TextClassificationOutcome) annotations need to span across the same text. So it should be enough to have unique IDs with the units. |
Currently
The other
In order to get the raw classifications, not only do I need to combine the two files, but I first have to manually un-map all the numeric IDs to their original labels. It would be better if |
@tristan: Whoops, we have two different discussions here. User-set unit @johannes, CC:Tristan: Yes, TextClassificationUnit.setSuffix(String v) does On Fri, Jul 10, 2015 at 12:21 PM, Tristan Miller notifications@github.com
OK, I've opened a separate issue for the outcome classification labels (Issue 252). |
Emily is right. These are two issues (the second to be tracked in Issue 252). I'm leaving this issue open to integrate it with DKPro's externalId. |
Core hasn't implemented an "externalId" mechanism yet. |
Any updates on this one? |
Nope. |
Ok, then I close this for now. Once the demand for this kind of change becomes bigger a new issue can be opened. |
Why not leave it open and push it to the next release? If it gets closed, it's easier to miss past discussions on the topic. |
In unit mode, the id2outcome report contains random IDs for units in documents with more than one unit. It would be much easier to interpret and analyze classification errors if the IDs were interpretable.
To make this happen, we should introduce a new feature to the TextClassificationUnit type which optionally stores a user-specified string. This field can be set together with the label for this unit during reading. In ClassificationUnitCasMultiplier, where the units are "converted" into instances/CASes, this field should be used to set the document ID of the CAS (which is later used by the id2outcome report). If no user-specified IDs are set, the current approach (increasing index for multiple units in a document) can be applied instead.
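The ID selection proposed above could be sketched as follows. This is not the actual ClassificationUnitCasMultiplier code: `UnitInstanceIds` and `instanceId` are hypothetical names, the `_` separator is an assumption, and the sketch assumes that the unit's user-specified string (if any) is appended to the document ID, with the running unit index as the fallback.

```java
// Hypothetical helper (not DKPro TC code): derives the ID of the per-unit CAS
// from the document ID and either a user-specified string or a running index.
final class UnitInstanceIds {

    private UnitInstanceIds() {
        // static helper, no instances
    }

    static String instanceId(String documentId, String userSuffix, int unitIndex) {
        if (userSuffix != null && !userSuffix.isEmpty()) {
            // A user-specified value is present: use it to get an interpretable ID.
            return documentId + "_" + userSuffix;
        }
        // No user-specified value: fall back to the increasing index.
        return documentId + "_" + unitIndex;
    }
}
```

For a document `doc1.txt`, a unit with the user-specified string `headline` would get the ID `doc1.txt_headline`, while a unit without one at index 3 would get `doc1.txt_3`.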