francetem/deduper

deduper

deduper is a set of utilities that help you implement a deduplication process. It is based on some implementation details from: https://addi.ehu.es/handle/10810/28984?locale-attribute=en

Blocking, feature derivation and dataset creation

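// Block the sources, derive features for the remaining pairs and write the dataset as CSV.
mySourcesCollection.stream()
                .collect(Sources.collector())
                // restrict the comparisons with the blocking predicate
                .block(this::blockingPredicate)
                .deriving()
                .withFeatureDerivers(getFeatureDerivers())
                .derive()
                .writeToCsv("myDataSet.csv");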
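The idea behind the blocking and deriving steps can be sketched in plain Java, without the library. This is only an illustration under assumptions: `Person`, `Pair`, `block` and `sameName` are hypothetical names, and the blocking predicate is assumed to take two records and return whether they should be compared. deduper's actual types and signatures are not shown in this README and may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class BlockingSketch {

    // Hypothetical record type; deduper's own source type is not shown here.
    static final class Person {
        final String id;
        final String name;
        final String zip;

        Person(String id, String name, String zip) {
            this.id = id;
            this.name = name;
            this.zip = zip;
        }
    }

    // A candidate pair that survived blocking.
    static final class Pair {
        final Person left;
        final Person right;

        Pair(Person left, Person right) {
            this.left = left;
            this.right = right;
        }
    }

    // Enumerate every unordered pair and keep those accepted by the predicate.
    static List<Pair> block(List<Person> people, BiPredicate<Person, Person> predicate) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < people.size(); i++) {
            for (int j = i + 1; j < people.size(); j++) {
                Person a = people.get(i);
                Person b = people.get(j);
                if (predicate.test(a, b)) {
                    pairs.add(new Pair(a, b));
                }
            }
        }
        return pairs;
    }

    // Example feature: 1.0 when the names match ignoring case, otherwise 0.0.
    static double sameName(Pair pair) {
        return pair.left.name.equalsIgnoreCase(pair.right.name) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("a", "Ann Lee", "48001"),
                new Person("b", "ann lee", "48001"),
                new Person("c", "Bob Ray", "48001"),
                new Person("d", "Ann Lee", "20001"));

        // Blocking predicate (an assumption): only compare records sharing a zip code.
        List<Pair> pairs = block(people, (x, y) -> x.zip.equals(y.zip));

        // One CSV row per candidate pair: both ids plus the derived feature.
        System.out.println("left,right,sameName");
        for (Pair p : pairs) {
            System.out.println(p.left.id + "," + p.right.id + "," + sameName(p));
        }
    }
}
```

With the four sample records, only the three records that share zip 48001 are paired, so three of the six possible comparisons remain.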

Pair resolution & clustering

        // Derive the pair features, restricted to test, into an in-memory CSV.
        ByteArrayOutputStream stream = new ByteArrayOutputStream();

        // try-with-resources closes (and therefore flushes) the writer before the bytes are read back.
        try (BufferedWriter bufferedWriter = new BufferedWriter(new OutputStreamWriter(stream))) {
            sources.stream()
                    .collect(Sources.collector())
                    .onlyIn(test)
                    .block(this::blockingPredicate)
                    .deriving()
                    .withFeatureDerivers(getFeatureDerivers())
                    .withBuckets(test)
                    .derive()
                    .writeToCsv(bufferedWriter);
        }

        // Load the CSV as Weka instances.
        Instances instances = WekaUtils.getCsvInstances(new BufferedInputStream(new ByteArrayInputStream(stream.toByteArray())));

        // Resolve the pairs with the classifier and threshold, then group them into clusters.
        PairResolution resolution = Solver.pairResolve(abstractClassifier, instances, threshold);
        Buckets<String> clusters = resolution.toNormalizedClusters();
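To make the step from pair decisions to clusters concrete, here is a minimal sketch of one common approach: accept every pair whose score reaches the threshold and take the connected components with a union-find structure. This is a swapped-in technique for illustration, not necessarily what `Solver.pairResolve` or `toNormalizedClusters` do internally. `ScoredPair` and `cluster` are hypothetical names, and the score stands in for a classifier's match probability.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ClusteringSketch {

    // Hypothetical scored pair; the score stands in for a classifier's output.
    static final class ScoredPair {
        final String left;
        final String right;
        final double score;

        ScoredPair(String left, String right, double score) {
            this.left = left;
            this.right = right;
            this.score = score;
        }
    }

    // Union-find parent pointers, keyed by record id.
    private final Map<String, String> parent = new HashMap<>();

    // Return the representative of id's set, registering id on first use.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited node straight at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    // Merge the records of every pair scoring at or above the threshold.
    static List<Set<String>> cluster(List<ScoredPair> pairs, double threshold) {
        ClusteringSketch sets = new ClusteringSketch();
        for (ScoredPair p : pairs) {
            // Register both ids so unmatched records end up as singleton clusters.
            sets.find(p.left);
            sets.find(p.right);
            if (p.score >= threshold) {
                sets.union(p.left, p.right);
            }
        }
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String id : new ArrayList<>(sets.parent.keySet())) {
            byRoot.computeIfAbsent(sets.find(id), k -> new TreeSet<>()).add(id);
        }
        return new ArrayList<>(byRoot.values());
    }

    public static void main(String[] args) {
        List<ScoredPair> pairs = List.of(
                new ScoredPair("a", "b", 0.90),
                new ScoredPair("b", "c", 0.80),
                new ScoredPair("c", "d", 0.20),
                new ScoredPair("e", "f", 0.95));
        System.out.println(cluster(pairs, 0.5));
    }
}
```

Connected components are transitive: `a` and `c` land in the same cluster even though that pair was never scored, which is one reason the threshold choice matters.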

Evaluation

   GMD gmd = new GMD();
   GmdCost cost = gmd.cost(clusters, buckets);
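Assuming `GMD` refers to the generalized merge distance used to evaluate entity-resolution results, the measure counts the cost of the cluster splits and merges needed to turn the predicted clustering into the reference one. The sketch below covers only the simplest case, where every split and every merge costs 1. `MergeDistanceSketch` and `distance` are hypothetical names, and the library's `GMD` and `GmdCost` classes may use different cost functions and report more detail.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MergeDistanceSketch {

    // Unit-cost merge distance: the number of splits plus merges needed to
    // turn `predicted` into `truth`. Assumes both are partitions of the same
    // record ids and contain no empty clusters.
    static int distance(List<Set<String>> predicted, List<Set<String>> truth) {
        // Map every record id to the index of its reference cluster.
        Map<String, Integer> truthIndex = new HashMap<>();
        for (int i = 0; i < truth.size(); i++) {
            for (String id : truth.get(i)) {
                truthIndex.put(id, i);
            }
        }

        // Count the non-empty intersections between predicted and reference clusters.
        int overlaps = 0;
        for (Set<String> cluster : predicted) {
            Set<Integer> touched = new HashSet<>();
            for (String id : cluster) {
                Integer index = truthIndex.get(id);
                if (index == null) {
                    throw new IllegalArgumentException("record missing from truth: " + id);
                }
                touched.add(index);
            }
            overlaps += touched.size();
        }

        // Each predicted cluster is first cut into its pieces...
        int splits = overlaps - predicted.size();
        // ...and the pieces are then merged back into the reference clusters.
        int merges = overlaps - truth.size();
        return splits + merges;
    }

    public static void main(String[] args) {
        List<Set<String>> predicted = List.of(Set.of("a", "b", "c"), Set.of("d"), Set.of("e", "f"));
        List<Set<String>> truth = List.of(Set.of("a", "b"), Set.of("c", "d"), Set.of("e", "f"));
        System.out.println(distance(predicted, truth));
    }
}
```

In the sample data one split (`c` leaves `{a, b, c}`) and one merge (`c` joins `d`) are needed, so the distance is 2. Identical clusterings give 0.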
