-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issues/37 Add function for returning an iterator instead of sequence #89
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frankinspace, thanks for picking this up!
def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]: | ||
""" | ||
Returns all results for the query as an iterator (generator) | ||
|
||
:param limit: The maximum number of results to return. Negative value means no limit. | ||
:param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means | ||
fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries. | ||
:returns: query results as an iterator (generator) | ||
""" | ||
|
||
url = self._build_url() | ||
|
||
headers = dict(self.headers or {}) | ||
more_results = True | ||
page_size = min(max(0, page_size), 2000) | ||
n_results = 0 | ||
if limit < 0: | ||
limit = self.hits() | ||
|
||
while more_results: | ||
# Only get what we need on the last page. | ||
page_size = min(limit - n_results, page_size) | ||
response = requests.get( | ||
url, headers=headers, params={"page_size": page_size} | ||
) | ||
response.raise_for_status() | ||
|
||
# Explicitly track the number of results we have because the length | ||
# of the results list will only match the number of entries fetched | ||
# when the format is JSON. Otherwise, the length of the results | ||
# list is the number of *pages* fetched, not the number of *items*. | ||
n_results += page_size | ||
|
||
if self._format == "json": | ||
yield from response.json()["feed"]["entry"] | ||
else: | ||
yield response.text | ||
|
||
if cmr_search_after := response.headers.get("cmr-search-after"): | ||
headers["cmr-search-after"] = cmr_search_after | ||
|
||
more_results = n_results < limit and cmr_search_after is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that when you and I and @briannapagan spoke about this a few months ago, we agreed upon the name results
for this method.
Also, since this returns an iterator, there's no need to keep the limit
parameter. Callers can freely limit the results in various ways by limiting how many items they iterate over, which is standard practice when dealing with iterators (i.e., you won't find functions/methods that return iterators that also provide any type of "limit" parameter).
Therefore, I suggest the following rename and slight simplification:
def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]: | |
""" | |
Returns all results for the query as an iterator (generator) | |
:param limit: The maximum number of results to return. Negative value means no limit. | |
:param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means | |
fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries. | |
:returns: query results as an iterator (generator) | |
""" | |
url = self._build_url() | |
headers = dict(self.headers or {}) | |
more_results = True | |
page_size = min(max(0, page_size), 2000) | |
n_results = 0 | |
if limit < 0: | |
limit = self.hits() | |
while more_results: | |
# Only get what we need on the last page. | |
page_size = min(limit - n_results, page_size) | |
response = requests.get( | |
url, headers=headers, params={"page_size": page_size} | |
) | |
response.raise_for_status() | |
# Explicitly track the number of results we have because the length | |
# of the results list will only match the number of entries fetched | |
# when the format is JSON. Otherwise, the length of the results | |
# list is the number of *pages* fetched, not the number of *items*. | |
n_results += page_size | |
if self._format == "json": | |
yield from response.json()["feed"]["entry"] | |
else: | |
yield response.text | |
if cmr_search_after := response.headers.get("cmr-search-after"): | |
headers["cmr-search-after"] = cmr_search_after | |
more_results = n_results < limit and cmr_search_after is not None | |
def results(self, page_size: int = 2000) -> Iterator[Any]: | |
""" | |
Return an iterator (generator) of all results matching the query | |
criteria. | |
Because a query may produce a large number of results (perhaps | |
10s or 100s of thousands), such results are fetched using | |
multiple CMR requests, each returning a "page" of results, as | |
returning all results in a single request would be impractical. | |
The size of each page (in terms of the number of results | |
in a page) is controlled by the `page_size` parameter. A smaller | |
page size means fewer items in memory (per page), requiring | |
more CMR queries to fetch all results (if desired). Conversely, | |
a larger page size means more items in memory (per page) | |
and fewer CMR queries. | |
When the query is configured to use the `"json"` format, each | |
element produced by the returned iterator is a element of the | |
"feed.entry" array (see | |
<https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#json>). | |
In this case, the iterator may produce as many elements as there | |
are results matching the query criteria. | |
For all other formats, each element produced by the returned | |
iterator is an unparsed (text) page of results (i.e., the caller | |
is responsible for parsing the page of results into individual | |
elements). In this case, the iterator will produce only as many | |
pages as required (based on `page_size`) to produce all results | |
matching the query criteria. | |
:param page_size: maximum number of results per page (min 1, | |
max 2000 [default]) requested from the CMR | |
:returns: query results as an iterator (generator) | |
""" | |
url = self._build_url() | |
headers = dict(self.headers or {}) | |
params={"page_size": min(max(1, page_size), 2000)} | |
while True: | |
response = requests.get(url, headers=headers, params=params) | |
response.raise_for_status() | |
if self._format == "json": | |
yield from response.json()["feed"]["entry"] | |
else: | |
yield response.text | |
if not (cmr_search_after := response.headers.get("cmr-search-after")): | |
break | |
headers["cmr-search-after"] = cmr_search_after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Yeah I remember discussing it but couldn't recall what the consensus was, will make the update.
awhile if many requests have to be made. | ||
|
||
:returns: query results as a list | ||
""" | ||
|
||
return self.get(self.hits()) | ||
return list(self.get_iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also deprecate get_all
. If a user wants to create a list of all results, they can simply call list(query.results())
themselves, rather than query.get_all()
.
@@ -58,11 +60,12 @@ def __init__(self, route: str, mode: str = CMR_OPS): | |||
self.concept_id_chars: Set[str] = set() | |||
self.headers: MutableMapping[str, str] = {} | |||
|
|||
@deprecated("Use get_iter() instead") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deprecated("Use get_iter() instead") | |
@deprecated("Use the 'results' method instead, but note that it produces an iterator.") |
### Added | ||
- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37)) | ||
|
||
### Deprecated | ||
- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### Added | |
- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37)) | |
### Deprecated | |
- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37)) | |
### Added | |
- Add method `Query.results` for returning results as an iterator instead | |
of sequence ([#37](https://github.com/nasa/python_cmr/issues/37)) | |
### Changed | |
- Deprecate methods `Query.get` and `Query.get_all` in favor of the new | |
`Query.results` method. These deprecated methods will likely be removed | |
for the 1.0.0 release. | |
([#37](https://github.com/nasa/python_cmr/issues/37)) | |
Clicked the wrong button in my IDE and accidentally merged this. Will open a new PR with suggested changes |
Closes #37
Added a new function
get_iter
which returns an iterator that will yield all hits by default. Optional parameters for controlling the limit and page size are also available.Changed the
get_all
function to useget_iter
.Marked
get
as deprecated in favor ofget_iter
.Updated the unit tests to have equivalent tests for the new
get_iter
function. Also updated the usage of vcrpy so that cassettes are named for the test function.