Sketch Grammar Explorer is an API wrapper for Sketch Engine, corpus management software used for linguistic research. The goal is to build a flexible scaffold for any kind of programmatic work with Sketch Engine and NoSketch Engine.
Clone SGEX or install it with `pip install sgex` (main dependencies are `pandas`, `pyyaml`, `aiohttp` and `aiofiles`).
Get a Sketch Engine API key. Be sure to reference SkE's documentation and schema:
wget https://www.sketchengine.eu/apidoc/openapi.yaml -O .openapi.yaml
A quick intro to the API follows (examples use a local NoSketch Engine server). Most things are identical for SkE's main server, apart from using credentials and having more call types available. SGEX currently uses the Bonito API, with URLs ending in `/bonito/run.cgi`, not newer paths like `/search/corp_info`.
- `job`: the primary module; makes requests and manipulates data
- `call`: classes and methods for API call types
- `query`: functions to generate/manipulate CQL queries
- `util`: utility functions
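For orientation, each module can be imported directly; a trivial sketch of the layout above:

```python
# the package's four modules
from sgex import call, job, query, util
```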
Calls are made with the `job` module, which can also be run as a script. The `Job` class has a few options:
from sgex.job import Job
j = Job(
# define API calls
infile: str | list | None = None,
params: str | dict | list | None = None,
# set server info
server: str = "local",
default_servers: dict = default_servers,
# supply credentials
api_key: str | None = None,
username: str | None = None,
# manage caching
cache_dir: str = "data",
clear_cache: bool = False,
# run asynchronous requests
thread: bool = False,
# control request throttling
wait_dict: dict = wait_dict,
# make a dry run
dry_run: bool = False,
# change verbosity
verbose: bool = False,
)
j.run()
Here's how to make a request:
>>> from sgex.job import Job
# instantiate the job with options
>>> j = Job(
... params={"call_type": "View", "corpname": "preloaded/susanne", "q": 'alemma,"bird"'},
... api_key="", # add key
... username="", # add name
... server="ske") # use SkE main server
...
# this example uses a local server (the default)
>>> j = Job(
... params={"call_type": "View", "corpname": "susanne", "q": 'alemma,"bird"'})
...
# run the job
>>> j.run()
# get a summary
>>> dt = j.summary()
>>> for k,v in dt.items():
... print(k, ("<float>" if k == "seconds" else v))
...
seconds <float>
calls 1
errors Counter()
# results are stored in Job.data.<call_type_lowercase>
>>> j.data.view
[View 8cdfca2 {asyn: '0', corpname: susanne, format: json, q: 'alemma,"bird"'}]
# the response gets cached in `data/<hash>.json`: repeating the same request pulls from the cache
# data is accessible via `.text` or `.json()`
>>> j.data.view[0].response.json()["concsize"] # the number of concordances for "bird"
12
Just provide a list of call parameters (a list of dicts) to make more than one call.
# supplying a list of calls
>>> from sgex.job import Job
>>> j = Job(
... params=[
... {"call_type": "CorpInfo", "corpname": "susanne"},
... {"call_type": "View", "corpname": "susanne", "q": 'alemma,"bird"'},
... {"call_type": "Collx", "corpname": "susanne", "q": 'alemma,"bird"'}])
...
>>> j.run()
>>> j.data
<class 'sgex.call.Data'>
collx (1) [Collx 26d29b1 {corpname: susanne, format: json, q: 'alemma,"bird"'}]
corpinfo (1) [Corp_info 9c08055 {corpname: susanne, format: json}]
view (1) [View 8cdfca2 {asyn: '0', corpname: susanne, format: json, q: 'alemma,"bird"'}]
Or supply a JSON, JSONL or YAML file with calls:
// test/example.jsonl
{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"apple\""}
{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"carrot\""}
{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"lettuce\""}
# supplying a file of calls
>>> from sgex.job import Job
>>> j = Job(infile="test/example.jsonl")
>>> j.run()
>>> j.data
<class 'sgex.call.Data'>
collx (3) [Collx bc5d89b {corpname: susanne, format: json, q: 'alemma,"apple"'}, Collx 19495d0 {corpname: susanne, format: json, q: 'alemma,"carrot"'}, Collx 7edee07 {corpname: susanne, format: json, q: 'alemma,"lettuce"'}]
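A YAML infile could take a similar shape. The file below is a hypothetical equivalent of the JSONL file above, assuming SGEX reads YAML into the same list of parameter dicts:

```yaml
# test/example.yml (hypothetical)
- call_type: Collx
  corpname: susanne
  q: 'alemma,"apple"'
- call_type: Collx
  corpname: susanne
  q: 'alemma,"carrot"'
- call_type: Collx
  corpname: susanne
  q: 'alemma,"lettuce"'
```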
The `View` call retrieves concordances by page, defaulting to page 1. Its `fromp` and `pagesize` parameters adjust the current page and the maximum number of concordances per page. Using a large `pagesize` is often fine to get data in one request, but it may be better to use several. For this, try `Job.run_repeat()`, which gets the first page, calculates how many pages remain, and then gets the remaining pages (or up to `max_pages`, if defined).
This example gets all the hits for `work` in the Susanne corpus in sets of 10 concordances per page. There are 93 hits in total, meaning that 10 requests are made (`fromp=1` through `fromp=10`).
>>> from sgex.job import Job
# run job
>>> j = Job(params={"call_type": "View", "corpname": "susanne", "q": 'aword,"work"', "fromp": 1, "pagesize": 10})
>>> j.run_repeat(max_pages=0) # optionally set max_pages to stop after n pages
# the 93 concordances were retrieved in 10 calls
>>> j.data.view[0].response.json()["concsize"] == 93
True
>>> len(j.data.view) == 10
True
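The per-page responses can then be stitched together; a minimal sketch, reusing the job above and the `"Lines"` key from the JSON concordance format:

```python
# collect the concordance lines from every retrieved page
lines = []
for page in j.data.view:
    lines.extend(page.response.json()["Lines"])
print(len(lines))  # expect 93, matching concsize above
```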
Response data can be manipulated by accessing the lists of calls stored in `Job.data`. A few methods are included so far, such as `Freqs.df_from_json()`, which transforms a JSON frequency query into a DataFrame.
# convert frequency JSON to a Pandas DataFrame
>>> from sgex.job import Job
>>> j = Job(
... params={
... "call_type": "Freqs",
... "corpname": "susanne",
... "fcrit": "doc.file 0",
... "q": 'alemma,"bird"'})
>>> j.run()
>>> df = j.data.freqs[0].df_from_json()
>>> df.head(3)
frq rel reltt fpm value attribute arg nicearg corpname total_fpm total_frq fmaxitems
0 7 3625.97107 2892.561983 46.534509 A11 doc.file "bird" bird susanne 79.77 12 None
1 2 1093.37113 872.219799 13.295574 N08 doc.file "bird" bird susanne 79.77 12 None
2 1 525.59748 419.287212 6.647787 G04 doc.file "bird" bird susanne 79.77 12 None
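From here it's ordinary pandas; e.g., to rank text types by absolute frequency (a small sketch using the columns shown above):

```python
# sort doc.file values by absolute frequency
ranked = df.sort_values("frq", ascending=False)[["value", "frq", "fpm"]]
print(ranked.head(3))
```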
Below are a few more considerations for doing corpus linguistics with SGEX.
`Data` is a dataclass used in `Job` to store API data and make associated methods easily available. Whenever requests are made from a list of dictionary parameters, the responses are automatically sorted by `call_type`. Each call type has a list, which is appended to each time a request is made. These lists of responses can be processed using methods shared by a given call type.
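In practice this means each call type's list can be looped over directly; a minimal sketch, assuming a completed job `j` with `View` responses, as in the examples above:

```python
# each item in Job.data.<call_type> wraps its own HTTP response
for call in j.data.view:
    print(call.response.json()["concsize"])
```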
Every call type is a subclass of the `call.Call` base class. All calls share some universal methods, including simple parameter verification to reduce API errors. Every subclass (`Freqs`, `View`, `CorpInfo`, etc.) can also have its own methods for data-processing tasks. These methods tend to focus on manipulating JSON data, which is the only complete format for the API; manipulating other response formats like CSV is also possible.
At least while SGEX is in beta, existing methods aren't stable for production purposes: using your own custom method, like the following example, is a safer bet.
Adding custom methods to a call type is easy:
>>> from sgex.job import Job
>>> from sgex.call import CorpInfo
# write a new method
>>> def new_method_from_json(self) -> str:
... """Returns a string of corpus structures."""
... self.check_format()
... _json = self.response.json()
... return " ".join([k.replace("count", "") for k in _json.get("sizes").keys()])
>>> CorpInfo.new_method_from_json = new_method_from_json
# run the job
>>> j = Job(
... clear_cache=True,
... params={"call_type": "CorpInfo", "corpname": "susanne", "struct_attr_stats": 1})
>>> j.run()
# use the method
>>> j.data.corpinfo[0].new_method_from_json()
'token word doc par sent'
Feel free to suggest more methods for call types if you think they're useful. Be sure to explain the purpose and required parameters in the docstring (e.g., "Requires a `corp_info` call with these parameters: `{"x": "y"}`").
Wait periods are added between calls with a `wait_dict` that defines the required increments for a number of calls. This is the standard dictionary, following SkE's fair use policy:
wait_dict = {"0": 9, "0.5": 99, "4": 899, "45": None}
In other words:
- no wait applied for 9 calls or fewer
- 0.5 seconds added for up to 99 calls
- 4 seconds added for up to 899 calls
- 45 seconds added for any number above that
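A hypothetical helper illustrating how such a dictionary maps job size to a wait period (not part of SGEX):

```python
def wait_for(n_calls: int, wait_dict: dict) -> float:
    """Return the wait period in seconds for a job of n_calls calls.

    Hypothetical helper, not part of SGEX: keys are wait periods,
    values are the largest number of calls that period covers
    (None = no upper bound).
    """
    for wait, max_calls in sorted(wait_dict.items(), key=lambda kv: float(kv[0])):
        if max_calls is None or n_calls <= max_calls:
            return float(wait)
    return float(max(wait_dict, key=float))

wait_dict = {"0": 9, "0.5": 99, "4": 899, "45": None}
print(wait_for(5, wait_dict))     # 0.0
print(wait_for(100, wait_dict))   # 4.0
print(wait_for(1000, wait_dict))  # 45.0
```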
The `aiohttp` package is used to implement async requests. This is activated with `Job(thread=True)`; it's usable with local servers only.
The number of connections for async calling is adjustable by adding a kwarg when running a job. The default of 20 should increase rates while reducing errors, although this depends on how many calls are made, their complexity, and the hardware.
Job.run(connector=aiohttp.TCPConnector(limit_per_host=int))
If a large asynchronous job raises a few exceptions caused by the server struggling to handle requests, it's often simpler to just run the job again: this retries failed calls and loads successful ones from the cache. Trying to adjust the `connector` to eliminate one or two exceptions out of 1,000 calls isn't necessary. If calls are complex and the corpus is large, sequential calling might be the best option.
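A simple retry loop for the rerun strategy above; a sketch assuming `params` holds your list of calls and that repeated runs reload cached successes, as described:

```python
from sgex.job import Job

# rerun until no errors remain: cached calls reload instantly,
# so only previously failed requests hit the server again
for attempt in range(3):
    j = Job(params=params, thread=True)
    j.run()
    if not j.errors:
        break
```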
Data can be retrieved in JSON, XML, CSV or TXT formats with `Job(params={"format": "csv"})`, etc. Only JSON is universal: most API call types can return only some of these formats.
A simple filesystem cache is used to store response data. Files are named with a hashing function and accompanied by response metadata. Once a call is cached, identical requests are loaded from the cache. Calls with `format="json"` and no exceptions or SkE errors get cached. Data in other formats (CSV, XML) is always cached, since error handling isn't implemented for them.
Response data can include credentials in several locations. SGEX strips credentials from URLs and JSON data before caching, although inspecting data before sharing it is still prudent.
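One way to do that inspection; a rough sketch, assuming the default `cache_dir="data"` and your own credential strings:

```python
import pathlib

# scan cached files for leftover credential strings (replace with your own)
secrets = ["MY_API_KEY", "MY_USERNAME"]
for f in pathlib.Path("data").iterdir():
    if f.is_file():
        text = f.read_text(errors="ignore")
        if any(s in text for s in secrets):
            print("check before sharing:", f)
```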
`simple_query` approximates SkE's simple query type: enter a phrase and a CQL rule is returned. The search below uses double hyphens to include tokens with or without hyphens or spaces; wildcard tokens are also possible.
>>> from sgex.query import simple_query
>>> simple_query("home--made * recipe")
'( [lc="homemade" | lemma_lc="homemade"] | [lc="home" | lemma_lc="home"] [lc="made" | lemma_lc="made"] | [lc="home-made" | lemma_lc="home-made"] | [lc="home" | lemma_lc="home"] [lc="-" | lemma_lc="-"] [lc="made" | lemma_lc="made"] ) [lc=".*" | lemma_lc=".*"][lc="recipe" | lemma_lc="recipe"]'
`fuzzy_query` takes a sentence or longer phrase and converts it into a more forgiving CQL rule. This can be helpful for relocating an extracted concordance or finding similar results elsewhere. The returned string is formatted to work with `word` or `word_lowercase` as a default attribute.
>>> from sgex.query import fuzzy_query
>>> fuzzy_query("Before yesterday, it was fine, don't you think?")
'"Before" "yesterday" []{,1} "it" "was" "fine" []{,3} "you" "think"'
>>> fuzzy_query("We saw 1,000.99% more visitors at www.example.com yesterday")
'"We" "saw" []{,6} "more" "visitors" "at" []{,2} "yesterday"'
Numbers, URLs and other challenging tokens are parsed to some extent, but they can prevent `fuzzy_query` from finding concordances.
To cache data, each unique call is identified by hashing an ordered JSON representation of its parameters. Hashes can be derived from input data (the parameters you write) or from response data (the parameters as stored in a JSON API response). Hashes can be accessed like this:
>>> from sgex.job import Job
>>> from sgex.call import CorpInfo
# get shortened hash from input parameters
>>> c = CorpInfo({"corpname": "susanne", "struct_attr_stats": 1})
>>> c.hash()[:7]
'9c28c7a'
# send request
>>> j = Job(
... params={"call_type": "CorpInfo", "corpname": "susanne", "struct_attr_stats": 1})
...
>>> j.run()
# get shortened hash from response
>>> j.data.corpinfo[0].hash()[:7]
'9c28c7a'
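Since cached files are named by hash (`data/<hash>.json`, as noted earlier), the hash also locates a call's file on disk; a sketch assuming the default cache directory:

```python
from pathlib import Path

# find the cache entry for this call by its hash prefix
h = j.data.corpinfo[0].hash()
print(list(Path("data").glob(f"{h[:7]}*")))
```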
Timeouts are disabled for the `local` server, which lets expensive queries run as needed. Other servers use the `aiohttp` default of 5 minutes. Enforce a custom timeout by adding it to `Job.run()` kwargs. (Use this technique to pass other args to the `aiohttp` session as well.)
>>> from sgex.job import Job
>>> import aiohttp
# add a very short timeout for testing
>>> timeout = aiohttp.ClientTimeout(sock_read=0.01)
# design a call with a demanding CQL query
>>> j = Job(
... params={
... "call_type": "Collx",
... "corpname": "susanne",
... "q": "alemma,[]{,10}"})
...
# run with additional session args
>>> j.run(timeout=timeout)
# check for timeout exception [(error, call, index), ...]
>>> isinstance(j.errors[0][0], aiohttp.client_exceptions.ServerTimeoutError)
True
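The same kwarg channel can carry other `aiohttp` session options, per the note above; e.g., custom headers (hypothetical values):

```python
# forward an extra session argument through Job.run()
j.run(headers={"User-Agent": "my-project/0.1"})
```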
Even if a request is timed out by the client, the server may still try to compute results (and continue taking up resources on a local machine, causing unexpected exceptions).
Data from different call types can be combined to construct more complex queries and custom operations. For example, the random sample feature in Sketch Engine's interface uses simple randomization, yet some analyses might require stratified sampling (taking a separate random sample for each category in a text type). This can be done with the code below.
This involves three API call types:

- a `CorpInfo` call
- an `AttrVals` call
- a series of `View` calls (64, requested concurrently)

It retrieves 5 random concordances for each `doc.file` text type in the `susanne` corpus for cases of "the" and a following token.
>>> from sgex.job import Job
# 1. check the corpus's attributes
>>> j = Job(params={"call_type": "CorpInfo", "corpname": "susanne", "struct_attr_stats": 1})
>>> j.run()
>>> j.data.corpinfo[0].structures_from_json()
structure attribute size
0 font type 2
0 head type 2
0 doc file 64
1 doc n 12
2 doc wordcount 1
# 2. get values for one text type
# (make sure avmaxitems is >= the size of the text type)
>>> j0 = Job(params={"call_type": "AttrVals", "corpname": "susanne", "avattr": "doc.file", "avmaxitems": 1000000})
>>> j0.run()
>>> values = j0.data.attrvals[0].response.json()["suggestions"]
# the query ['a<default_attribute>,"<cql_rule>"', "r<sample_size>"]
>>> q = ['alemma,"the" []', "r5"]
>>> call_template = {
... "call_type": "View",
... "corpname": "susanne",
... "viewmode": "sen",
... "pagesize": 1000, # make greater than "r<int>"" to get everything in one request
... "attrs": "word,tag,lemma",
... "attr_allpos": "all"}
# generate list of calls
>>> calls = []
>>> for value in values:
... within = f' within <doc file="{value}" />'
... calls.append(call_template | {"q": [q[0] + within, q[1]]})
# 3. execute job
>>> j1 = Job(params=calls, thread=True)
>>> j1.run()
# process data as needed
# (print the query for the first sample)
>>> print(j1.data.view[0].response.json()["request"]["q"])
['alemma,"the" [] within <doc file="A01" />', 'r5']
# (print the KWICs for the first sample)
>>> for x in range(5):
... kwic = j1.data.view[0].response.json()["Lines"][x]["Kwic"]
... tokens = []
... for dt in kwic:
... if dt["class"] != "attr":
... tokens.append(dt["str"].strip())
... print(" ".join(tokens))
the jury
the Fulton
the state
the recommendations
the jury
If the repo is cloned and is the current working directory, SGEX can be run as a script like this:
# gets collocation data from the Susanne corpus for the lemma "bird"
python sgex/job.py -p '{"call_type": "Collx", "corpname": "susanne","q": "alemma,\"bird\""}'
Basic commands are available for downloading data when running SGEX as a script. For example, one could read a list of API calls from a file (`-i "<myfile.json>"`) and send requests to the SkE server (`-s "ske"`). More complex tasks still require importing the modules in Python.
Run SGEX with `--help` for up-to-date options:
python sgex/job.py --help
usage: SGEX [-h] [-k API_KEY] [--cache-dir CACHE_DIR] [--clear-cache] [--data DATA] [--default-servers DEFAULT_SERVERS] [--dry-run] [-i [INFILE ...]] [-p [PARAMS ...]] [-s SERVER] [-x] [-u USERNAME] [-w WAIT_DICT]
| arg | example | description |
|---|---|---|
| `-k` `--api-key` | `"1234"` | API key, if required by server |
| `--cache-dir` | `"data"` (default) | cache directory location |
| `--clear-cache` | (disabled by default) | clear the cache directory (ignored if `--dry-run`) |
| `--data` | (reserved) | placeholder for API call data |
| `--default-servers` | `'{"server_name": "URL"}'` | settings for default servers |
| `--dry-run` | (disabled by default) | print job settings |
| `-i` `--infile` | `"api_calls.json"` | file(s) to read calls from |
| `-p` `--params` | `'{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"bird\""}'` | JSON/YAML string(s) with a dict of params |
| `-s` `--server` | `"local"` (default) | `local`, `ske`, or a URL to another server |
| `-x` `--thread` | (disabled by default) | run asynchronously, if allowed by server |
| `-u` `--username` | `"J. Doe"` | API username, if required by server |
| `-v` `--verbose` | (disabled by default) | print details while running |
| `-w` `--wait-dict` | `'{"0": 10, "1": null}'` (no wait for ≤10 calls, 1 second above that) | wait period between calls |
Environment variables can be set by exporting them or using an `.env` file. As env variables, argument names are converted to uppercase, hyphens become underscores, and the `SGEX_` prefix is added.
# .env
SGEX_API_KEY="<KEY>"
SGEX_CACHE_DIR="<PATH>"
SGEX_CLEAR_CACHE=False
SGEX_DRY_RUN=True
SGEX_INFILE="<FILE>"
SGEX_SERVER="ske"
SGEX_USERNAME="<USER>"
# export variables in .env
set -a && source .env && set +a
# run SGEX
python sgex/job.py # add args here
# unset variables
unset ${!SGEX_*}
SGEX has been developed to meet research needs at the University of Granada Translation and Interpreting Department. See the LexiCon research group for related projects.
The name refers to sketch grammars, which are series of generalized corpus queries in Sketch Engine (see their bibliography).
Questions, suggestions and support are welcome.
If you use SGEX, please cite it. This paper introduces the package in the context of doing collocation analysis:
@inproceedings{isaacsAggregatingVisualizingCollocation2023,
address = {Lisbon, Portugal},
title = {Aggregating and {Visualizing} {Collocation} {Data} for {Humanitarian} {Concepts}},
url = {https://ceur-ws.org/Vol-3427/short11.pdf},
booktitle = {Proceedings of the 2nd {International} {Conference} on {Multilingual} {Digital} {Terminology} {Today} ({MDTT} 2023)},
publisher = {CEUR-WS},
author = {Isaacs, Loryn and León-Araúz, Pilar},
editor = {Di Nunzio, Giorgio Maria and Costa, Rute and Vezzani, Federica},
year = {2023},
}
See Zenodo for citing specific versions of the software.