Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues/37 Add function for returning an iterator instead of sequence #89

Merged
merged 3 commits into from
Sep 24, 2024

Conversation

frankinspace
Copy link
Collaborator

Closes #37

Added a new function get_iter which returns an iterator that will yield all hits by default. Optional parameters for controlling the limit and page size are also available.

Changed the get_all function to use get_iter.

Marked get as deprecated in favor of get_iter.

Updated the unit tests to have equivalent tests for the new get_iter function. Also updated the usage of vcrpy so that cassettes are named for the test function.

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankinspace, thanks for picking this up!

Comment on lines +131 to +172
def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]:
"""
Returns all results for the query as an iterator (generator)

:param limit: The maximum number of results to return. Negative value means no limit.
:param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means
fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries.
:returns: query results as an iterator (generator)
"""

url = self._build_url()

headers = dict(self.headers or {})
more_results = True
page_size = min(max(0, page_size), 2000)
n_results = 0
if limit < 0:
limit = self.hits()

while more_results:
# Only get what we need on the last page.
page_size = min(limit - n_results, page_size)
response = requests.get(
url, headers=headers, params={"page_size": page_size}
)
response.raise_for_status()

# Explicitly track the number of results we have because the length
# of the results list will only match the number of entries fetched
# when the format is JSON. Otherwise, the length of the results
# list is the number of *pages* fetched, not the number of *items*.
n_results += page_size

if self._format == "json":
yield from response.json()["feed"]["entry"]
else:
yield response.text

if cmr_search_after := response.headers.get("cmr-search-after"):
headers["cmr-search-after"] = cmr_search_after

more_results = n_results < limit and cmr_search_after is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that when you and I and @briannapagan spoke about this a few months ago, we agreed upon the name results for this method.

Also, since this returns an iterator, there's no need to keep the limit parameter. Callers can freely limit the results in various ways by limiting how many items they iterate over, which is standard practice when dealing with iterators (i.e., you won't find functions/methods that return iterators that also provide any type of "limit" parameter).

Therefore, I suggest the following rename and slight simplification:

Suggested change
def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]:
"""
Returns all results for the query as an iterator (generator)
:param limit: The maximum number of results to return. Negative value means no limit.
:param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means
fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries.
:returns: query results as an iterator (generator)
"""
url = self._build_url()
headers = dict(self.headers or {})
more_results = True
page_size = min(max(0, page_size), 2000)
n_results = 0
if limit < 0:
limit = self.hits()
while more_results:
# Only get what we need on the last page.
page_size = min(limit - n_results, page_size)
response = requests.get(
url, headers=headers, params={"page_size": page_size}
)
response.raise_for_status()
# Explicitly track the number of results we have because the length
# of the results list will only match the number of entries fetched
# when the format is JSON. Otherwise, the length of the results
# list is the number of *pages* fetched, not the number of *items*.
n_results += page_size
if self._format == "json":
yield from response.json()["feed"]["entry"]
else:
yield response.text
if cmr_search_after := response.headers.get("cmr-search-after"):
headers["cmr-search-after"] = cmr_search_after
more_results = n_results < limit and cmr_search_after is not None
def results(self, page_size: int = 2000) -> Iterator[Any]:
"""
Return an iterator (generator) of all results matching the query
criteria.
Because a query may produce a large number of results (perhaps
10s or 100s of thousands), such results are fetched using
multiple CMR requests, each returning a "page" of results, as
returning all results in a single request would be impractical.
The size of each page (in terms of the number of results
in a page) is controlled by the `page_size` parameter. A smaller
page size means fewer items in memory (per page), requiring
more CMR queries to fetch all results (if desired). Conversely,
a larger page size means more items in memory (per page)
and fewer CMR queries.
When the query is configured to use the `"json"` format, each
element produced by the returned iterator is a element of the
"feed.entry" array (see
<https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#json>).
In this case, the iterator may produce as many elements as there
are results matching the query criteria.
For all other formats, each element produced by the returned
iterator is an unparsed (text) page of results (i.e., the caller
is responsible for parsing the page of results into individual
elements). In this case, the iterator will produce only as many
pages as required (based on `page_size`) to produce all results
matching the query criteria.
:param page_size: maximum number of results per page (min 1,
max 2000 [default]) requested from the CMR
:returns: query results as an iterator (generator)
"""
url = self._build_url()
headers = dict(self.headers or {})
params={"page_size": min(max(1, page_size), 2000)}
while True:
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
if self._format == "json":
yield from response.json()["feed"]["entry"]
else:
yield response.text
if not (cmr_search_after := response.headers.get("cmr-search-after")):
break
headers["cmr-search-after"] = cmr_search_after

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yeah I remember discussing it but couldn't recall what the consensus was, will make the update.

awhile if many requests have to be made.

:returns: query results as a list
"""

return self.get(self.hits())
return list(self.get_iter())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also deprecate get_all. If a user wants to create a list of all results, they can simply call list(query.results()) themselves, rather than query.get_all().

@@ -58,11 +60,12 @@ def __init__(self, route: str, mode: str = CMR_OPS):
self.concept_id_chars: Set[str] = set()
self.headers: MutableMapping[str, str] = {}

@deprecated("Use get_iter() instead")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@deprecated("Use get_iter() instead")
@deprecated("Use the 'results' method instead, but note that it produces an iterator.")

Comment on lines +11 to +16
### Added
- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))

### Deprecated
- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Added
- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))
### Deprecated
- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37))
### Added
- Add method `Query.results` for returning results as an iterator instead
of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))
### Changed
- Deprecate methods `Query.get` and `Query.get_all` in favor of the new
`Query.results` method. These deprecated methods will likely be removed
for the 1.0.0 release.
([#37](https://github.com/nasa/python_cmr/issues/37))

@frankinspace frankinspace merged commit bda1cff into develop Sep 24, 2024
7 checks passed
@frankinspace frankinspace deleted the issues/37 branch September 24, 2024 20:05
@frankinspace frankinspace restored the issues/37 branch September 24, 2024 20:10
@frankinspace
Copy link
Collaborator Author

Clicked the wrong button in my IDE and accidentally merged this. Will open a new PR with suggested changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Return query results as an iterator
2 participants