ll0ruc/ace2005chinese_preprocess

ace2005chinese_preprocess

ACE 2005 corpus preprocessing for the Event Extraction task

Prerequisites

  1. Prepare the ACE 2005 dataset.

    (Download: https://catalog.ldc.upenn.edu/LDC2006T06. Note that the ACE 2005 dataset is not free.)

  2. Install the packages.

    pip install beautifulsoup4 nltk tqdm
    
  3. Choose your own data_list (the train/dev/test split); it is not provided above.

Usage

Run:

sudo python main.py --data=./data/ace_2005/Chinese
  • The parsed data is then written to the output directory.
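
Once the run finishes, the parsed files can be read back with the standard library. The following is a minimal sketch, not part of this repository: the helper name `load_parsed` is hypothetical, and it assumes the output directory contains one or more `.json` files in the format shown in the Output section. The actual file names are not stated here, so the sketch simply picks up every `.json` file it finds.

```python
import json
from pathlib import Path


def load_parsed(output_dir):
    """Return {file stem: list of sentence records} for each *.json file.

    Hypothetical helper: assumes every JSON file in output_dir holds a
    list of sentence records as shown in the Output section.
    """
    data = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        # The sentences are Chinese, so read the files as UTF-8 explicitly.
        with path.open(encoding="utf-8") as f:
            data[path.stem] = json.load(f)
    return data


# Example use (the directory name "output" is an assumption):
# for name, records in load_parsed("output").items():
#     print(name, len(records))
```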

Output

Format

I follow the JSON format described in the nlpcl-lab/ace2005-preprocessing [github] repository, as in the sample below. Currently the output contains only the sentence, event mentions, and entity mentions; other information, such as the dependency tree and pos_tags, will be added later. The data split (data_list.csv) was chosen randomly during the experiment; it is not the authoritative split for the ED task.

If you want to know event types and arguments in detail, read this document (ACE 2005 event guidelines).

sample.json

[
  {
    "sentence": "两个星期来,藤森曾亲自带队搜捕 前情报顾问蒙特西诺斯,迄今蒙特西诺斯仍未落网",
    "golden-event-mentions": [
      {
        "arguments": [
          {
            "start": 29,
            "end": 34,
            "entity-type": "PER:Individual",
            "text": "蒙特西诺斯",
            "role": "Person"
          },
          {
            "start": 0,
            "end": 4,
            "entity-type": "TIM:time",
            "text": "两个星期",
            "role": "Time"
          }
        ],
        "trigger": {
          "start": 36,
          "end": 38,
          "text": "落网"
        },
        "event_type": "Justice:Arrest-Jail"
      }
    ],
    "golden-entity-mentions": [
      {
        "start": 16,
        "entity-type": "PER:Individual",
        "text": "前情报顾问",
        "end": 21,
        "phrase-type": "NOM"
      },
      {
        "start": 21,
        "entity-type": "PER:Individual",
        "text": "蒙特西诺斯",
        "end": 26,
        "phrase-type": "NAM"
      },
      {
        "start": 29,
        "entity-type": "PER:Individual",
        "text": "蒙特西诺斯",
        "end": 34,
        "phrase-type": "NAM"
      },
      {
        "start": 6,
        "entity-type": "PER:Individual",
        "text": "藤森",
        "end": 8,
        "phrase-type": "NAM"
      },
      {
        "start": 0,
        "entity-type": "TIM:time",
        "text": "两个星期",
        "end": 4,
        "phrase-type": "TIM"
      },
      {
        "start": 27,
        "entity-type": "TIM:time",
        "text": "迄今",
        "end": 29,
        "phrase-type": "TIM"
      }
    ]
  }
]
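
In the sample above, `start` and `end` read as character offsets into `sentence`, with `end` exclusive (for example, the trigger "落网" spans 36 to 38). The sketch below checks that reading against a trimmed copy of the sample record. The function name `check_spans` is hypothetical, and the offset convention is inferred from the sample rather than documented by the repository.

```python
def check_spans(record):
    """Return (text, start, end) for every mention whose offsets do not
    reproduce its text when slicing the sentence as sentence[start:end]."""
    sentence = record["sentence"]
    mentions = list(record.get("golden-entity-mentions", []))
    for event in record.get("golden-event-mentions", []):
        mentions.append(event["trigger"])
        mentions.extend(event["arguments"])
    return [
        (m["text"], m["start"], m["end"])
        for m in mentions
        if sentence[m["start"]:m["end"]] != m["text"]
    ]


# Trimmed copy of the sample record (only the fields the check needs).
record = {
    "sentence": "两个星期来,藤森曾亲自带队搜捕 前情报顾问蒙特西诺斯,迄今蒙特西诺斯仍未落网",
    "golden-event-mentions": [
        {
            "trigger": {"start": 36, "end": 38, "text": "落网"},
            "arguments": [
                {"start": 29, "end": 34, "text": "蒙特西诺斯", "role": "Person"},
                {"start": 0, "end": 4, "text": "两个星期", "role": "Time"},
            ],
            "event_type": "Justice:Arrest-Jail",
        }
    ],
    "golden-entity-mentions": [
        {"start": 16, "end": 21, "text": "前情报顾问"},
        {"start": 6, "end": 8, "text": "藤森"},
        {"start": 27, "end": 29, "text": "迄今"},
    ],
}

mismatches = check_spans(record)
print(mismatches)
```

An empty list means every trigger, argument, and entity span matches its `text`; running the same check over a whole output file is a quick way to catch offset drift after any further processing.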

Reference

  • nlpcl-lab's ace2005-preprocessing repository, [github]
