Issues/37 Add function for returning an iterator instead of sequence #89

Merged
merged 3 commits into from
Sep 24, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
Expand Up @@ -8,6 +8,12 @@ adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))

### Deprecated
- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37))

Comment on lines +11 to +16
Collaborator

Suggested change
### Added
- Add method `Query.results` for returning results as an iterator instead
  of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))

### Changed
- Deprecate methods `Query.get` and `Query.get_all` in favor of the new
  `Query.results` method. These deprecated methods will likely be removed
  for the 1.0.0 release.
  ([#37](https://github.com/nasa/python_cmr/issues/37))

## [0.13.0]

### Added
55 changes: 50 additions & 5 deletions cmr/queries.py
Expand Up @@ -7,6 +7,8 @@
from datetime import date, datetime, timezone
from inspect import getmembers, ismethod
from re import search
from typing import Iterator

from typing_extensions import (
Any,
List,
Expand All @@ -20,7 +22,7 @@
Tuple,
TypeAlias,
Union,
- override,
+ override, deprecated,
)
from urllib.parse import quote

Expand Down Expand Up @@ -58,11 +60,12 @@ def __init__(self, route: str, mode: str = CMR_OPS):
self.concept_id_chars: Set[str] = set()
self.headers: MutableMapping[str, str] = {}

@deprecated("Use get_iter() instead")
Collaborator

Suggested change
@deprecated("Use the 'results' method instead, but note that it produces an iterator.")
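
For context, here is a minimal sketch of what a deprecation decorator does at call time. It is a hand-rolled stand-in built on the standard `warnings` module, not the `typing_extensions.deprecated` implementation. The names `deprecated_sketch` and the toy `Query` class are hypothetical and exist only for illustration.

```python
import functools
import warnings


def deprecated_sketch(message):
    """Hypothetical stand-in for a deprecation decorator (illustration only)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not to this wrapper.
            warnings.warn(message, category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


class Query:
    """Toy class used only for this sketch; not the real cmr query class."""

    def results(self):
        yield from (1, 2, 3)

    @deprecated_sketch(
        "Use the 'results' method instead, but note that it produces an iterator."
    )
    def get(self):
        return list(self.results())


# Capture warnings so the effect of calling the deprecated method is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    values = Query().get()
```

The deprecated method still returns its result; the only change the caller sees is the emitted `DeprecationWarning` carrying the message.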

def get(self, limit: int = 2000) -> Sequence[Any]:
"""
Get all results up to some limit, even if spanning multiple pages.

- :limit: The number of results to return
+ :param limit: The number of results to return
:returns: query results as a list
"""

Expand Down Expand Up @@ -117,14 +120,56 @@ def hits(self) -> int:

def get_all(self) -> Sequence[Any]:
"""
- Returns all of the results for the query. This will call hits() first to determine how many
- results their are, and then calls get() with that number. This method could take quite
+ Returns all of the results for the query. This method could take quite
awhile if many requests have to be made.

:returns: query results as a list
"""

- return self.get(self.hits())
+ return list(self.get_iter())
Collaborator

We should also deprecate `get_all`. If a user wants to create a list of all results, they can simply call `list(query.results())` themselves, rather than `query.get_all()`.
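
A small sketch of that equivalence, using a toy generator in place of a real query. The `results` function below is a hypothetical stand-in, not the library method.

```python
from typing import Any, Iterator


def results() -> Iterator[Any]:
    """Hypothetical stand-in for a results iterator: yields items page by page."""
    for page in (["a", "b"], ["c"]):
        yield from page


# Collecting the iterator into a list gives what a "get all" method would return.
all_results = list(results())
```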


    def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]:
        """
        Returns all results for the query as an iterator (generator)

        :param limit: The maximum number of results to return. Negative value means no limit.
        :param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means
            fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries.
        :returns: query results as an iterator (generator)
        """

        url = self._build_url()

        headers = dict(self.headers or {})
        more_results = True
        page_size = min(max(0, page_size), 2000)
        n_results = 0
        if limit < 0:
            limit = self.hits()

        while more_results:
            # Only get what we need on the last page.
            page_size = min(limit - n_results, page_size)
            response = requests.get(
                url, headers=headers, params={"page_size": page_size}
            )
            response.raise_for_status()

            # Explicitly track the number of results we have because the length
            # of the results list will only match the number of entries fetched
            # when the format is JSON. Otherwise, the length of the results
            # list is the number of *pages* fetched, not the number of *items*.
            n_results += page_size

            if self._format == "json":
                yield from response.json()["feed"]["entry"]
            else:
                yield response.text

            if cmr_search_after := response.headers.get("cmr-search-after"):
                headers["cmr-search-after"] = cmr_search_after

            more_results = n_results < limit and cmr_search_after is not None
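
The header-driven paging that this loop relies on can be sketched in isolation. The example below makes no network calls: `fetch_page`, `iter_results`, and the `ITEMS` list are hypothetical stand-ins, and the fake token format (a plain offset) is an assumption made for illustration, not how CMR encodes `cmr-search-after`.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Fake data set standing in for search results (illustration only).
ITEMS = list(range(5))


def fetch_page(
    headers: Dict[str, str], page_size: int
) -> Tuple[List[Any], Dict[str, str]]:
    """Hypothetical stand-in for one HTTP request to a paged search API."""
    start = int(headers.get("cmr-search-after", "0"))
    page = ITEMS[start : start + page_size]
    # Assumption for this fake: a continuation token accompanies every
    # non-empty page, and an empty page carries no token.
    response_headers = {"cmr-search-after": str(start + len(page))} if page else {}
    return page, response_headers


def iter_results(page_size: int) -> Iterator[Any]:
    """Yield items lazily, echoing the continuation token back on each request."""
    headers: Dict[str, str] = {}
    while True:
        page, response_headers = fetch_page(headers, page_size)
        yield from page
        token = response_headers.get("cmr-search-after")
        if not token:
            break
        headers["cmr-search-after"] = token


collected = list(iter_results(page_size=2))
```

Each request sends back the token from the previous response, and iteration ends when a response arrives without one.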
Comment on lines +131 to +172
Collaborator

I believe that when you, @briannapagan, and I spoke about this a few months ago, we agreed upon the name `results` for this method.

Also, since this returns an iterator, there's no need to keep the `limit` parameter. Callers can freely limit the results in various ways by limiting how many items they iterate over, which is standard practice when dealing with iterators (i.e., you won't find functions/methods that return iterators that also provide any type of "limit" parameter).

Therefore, I suggest the following rename and slight simplification:

Suggested change
    def results(self, page_size: int = 2000) -> Iterator[Any]:
        """
        Return an iterator (generator) of all results matching the query
        criteria.

        Because a query may produce a large number of results (perhaps
        10s or 100s of thousands), such results are fetched using
        multiple CMR requests, each returning a "page" of results, as
        returning all results in a single request would be impractical.

        The size of each page (in terms of the number of results
        in a page) is controlled by the `page_size` parameter. A smaller
        page size means fewer items in memory (per page), requiring
        more CMR queries to fetch all results (if desired). Conversely,
        a larger page size means more items in memory (per page)
        and fewer CMR queries.

        When the query is configured to use the `"json"` format, each
        element produced by the returned iterator is an element of the
        "feed.entry" array (see
        <https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#json>).
        In this case, the iterator may produce as many elements as there
        are results matching the query criteria.

        For all other formats, each element produced by the returned
        iterator is an unparsed (text) page of results (i.e., the caller
        is responsible for parsing the page of results into individual
        elements). In this case, the iterator will produce only as many
        pages as required (based on `page_size`) to produce all results
        matching the query criteria.

        :param page_size: maximum number of results per page (min 1,
            max 2000 [default]) requested from the CMR
        :returns: query results as an iterator (generator)
        """
        url = self._build_url()
        headers = dict(self.headers or {})
        params = {"page_size": min(max(1, page_size), 2000)}

        while True:
            response = requests.get(url, headers=headers, params=params)
            response.raise_for_status()

            if self._format == "json":
                yield from response.json()["feed"]["entry"]
            else:
                yield response.text

            if not (cmr_search_after := response.headers.get("cmr-search-after")):
                break

            headers["cmr-search-after"] = cmr_search_after
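
The point about callers limiting an iterator themselves can be sketched with `itertools.islice` from the standard library. The `results` generator below is a hypothetical stand-in that only simulates paging (its `total` parameter and the `requests_made` list exist purely for this illustration). It records each simulated page request so the laziness is visible.

```python
from itertools import islice
from typing import Iterator, List

# Records the starting offset of every simulated page request.
requests_made: List[int] = []


def results(page_size: int = 2, total: int = 10) -> Iterator[int]:
    """Hypothetical stand-in for a paged results iterator (no network calls)."""
    for start in range(0, total, page_size):
        requests_made.append(start)  # one simulated request per page
        yield from range(start, min(start + page_size, total))


# Take only the first three items; later pages are never requested.
first_three = list(islice(results(), 3))
```

Because the generator is consumed lazily, only the two pages needed to produce three items are requested, which is the behavior a `limit` parameter would otherwise have to reimplement.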

Collaborator Author

Thanks. Yeah, I remember discussing it but couldn't recall what the consensus was. Will make the update.


def parameters(self, **kwargs: Any) -> Self:
"""
201 changes: 102 additions & 99 deletions poetry.lock

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Expand Up @@ -8,13 +8,11 @@ interactions:
- gzip, deflate
Connection:
- keep-alive
User-Agent:
- python-requests/2.31.0
method: GET
uri: https://cmr.earthdata.nasa.gov/search/granules.json?short_name=TELLUS_GRAC_L3_JPL_RL06_LND_v04&page_size=0
response:
body:
string: '{"feed":{"updated":"2023-08-14T17:02:36.801Z","id":"https://cmr.earthdata.nasa.gov:443/search/granules.json?short_name=TELLUS_GRAC_L3_JPL_RL06_LND_v04&page_size=0","title":"ECHO
string: '{"feed":{"updated":"2024-09-24T00:24:58.509Z","id":"https://cmr.earthdata.nasa.gov:443/search/granules.json?short_name=TELLUS_GRAC_L3_JPL_RL06_LND_v04&page_size=0","title":"ECHO
granule metadata","entry":[]}}'
headers:
Access-Control-Allow-Origin:
Expand All @@ -25,37 +23,41 @@ interactions:
CMR-Hits:
- '163'
CMR-Request-Id:
- 5855d714-8aff-4d0f-b4cc-e556f02ef96a
- 64ce09ea-2037-48a1-b5e3-1e9706811229
CMR-Took:
- '52'
- '131'
Connection:
- keep-alive
Content-MD5:
- 3c1bb7d108b84325434e60a36dda1159
Content-SHA1:
- 3d871ed3d2791fefc0fb58701b67811f121caa63
Content-Type:
- application/json;charset=utf-8
Date:
- Mon, 14 Aug 2023 17:02:36 GMT
- Tue, 24 Sep 2024 00:24:58 GMT
Server:
- ServerTokens ProductOnly
Strict-Transport-Security:
- max-age=31536000
- max-age=31536000; includeSubDomains; preload
Transfer-Encoding:
- chunked
Vary:
- Accept-Encoding, User-Agent
Via:
- 1.1 cc58556a6e846289f4d3105969536e4c.cloudfront.net (CloudFront)
- 1.1 b837267595110a1135bf4fb036d71e1e.cloudfront.net (CloudFront)
X-Amz-Cf-Id:
- qj9VuAc1JQu-rnMVDg3mGwstR-jGQA4rd7MKVRAEpXeTDbKZT5p5jg==
- nCF7mfer1omvbZi5CTMRTv9-9uPozEm7zBM8NhFZ8nJ_sXVz-tBAgw==
X-Amz-Cf-Pop:
- SFO53-C1
- LAX50-C1
X-Cache:
- Miss from cloudfront
X-Content-Type-Options:
- nosniff
X-Frame-Options:
- SAMEORIGIN
X-Request-Id:
- qj9VuAc1JQu-rnMVDg3mGwstR-jGQA4rd7MKVRAEpXeTDbKZT5p5jg==
- nCF7mfer1omvbZi5CTMRTv9-9uPozEm7zBM8NhFZ8nJ_sXVz-tBAgw==
X-XSS-Protection:
- 1; mode=block
content-length:
Expand All @@ -72,13 +74,11 @@ interactions:
- gzip, deflate
Connection:
- keep-alive
User-Agent:
- python-requests/2.31.0
method: GET
uri: https://cmr.earthdata.nasa.gov/search/granules.json?short_name=TELLUS_GRAC_L3_JPL_RL06_LND_v04&page_size=163
response:
body:
string: '{"feed":{"updated":"2023-08-14T17:02:40.416Z","id":"https://cmr.earthdata.nasa.gov:443/search/granules.json?short_name=TELLUS_GRAC_L3_JPL_RL06_LND_v04&page_size=163","title":"ECHO
string: '{"feed":{"updated":"2024-09-24T00:24:58.790Z","id":"https://cmr.earthdata.nasa.gov:443/search/granules.json?short_name=TELLUS_GRAC_L3_JPL_RL06_LND_v04&page_size=163","title":"ECHO
granule metadata","entry":[{"boxes":["-89.5 0.5 89.5 180","-89.5 -180 89.5
-0.5"],"time_start":"2002-04-04T00:00:00.000Z","updated":"2023-04-17T15:27:21.022Z","dataset_id":"JPL
TELLUS GRACE Level-3 Monthly Land Water-Equivalent-Thickness Surface Mass
Expand Down Expand Up @@ -2045,39 +2045,43 @@ interactions:
CMR-Hits:
- '163'
CMR-Request-Id:
- 60eb29b2-95e1-453c-8efe-6e59cf649eb5
- b82d5198-a729-42bd-b597-a1132b2652c3
CMR-Search-After:
- '["pocloud",1495497600000,2658328520]'
CMR-Took:
- '4959'
- '125'
Connection:
- keep-alive
Content-MD5:
- 2f2981275f193e1579bea1c3e9f1acf5
Content-SHA1:
- 4fa7e296cc7b77f83dcf2bcda4237d29dc885fd1
Content-Type:
- application/json;charset=utf-8
Date:
- Mon, 14 Aug 2023 17:02:42 GMT
- Tue, 24 Sep 2024 00:24:58 GMT
Server:
- ServerTokens ProductOnly
Strict-Transport-Security:
- max-age=31536000
- max-age=31536000; includeSubDomains; preload
Transfer-Encoding:
- chunked
Vary:
- Accept-Encoding, User-Agent
Via:
- 1.1 44933b72098305e9c31fc50b2e6554a0.cloudfront.net (CloudFront)
- 1.1 be66acbcc5d85e825abf1047b034d722.cloudfront.net (CloudFront)
X-Amz-Cf-Id:
- 9TJ3JRMGc6mUxKegR4f2HSLC_1Cfwei5QHZuicg_aLsWEJS3T6XCNg==
- JIeDUJvd8TodeetWYvcK5xBnxBTh8jvsNt8if-ZsMjTUWW4sbZ9P2A==
X-Amz-Cf-Pop:
- SFO53-C1
- LAX50-C1
X-Cache:
- Miss from cloudfront
X-Content-Type-Options:
- nosniff
X-Frame-Options:
- SAMEORIGIN
X-Request-Id:
- 9TJ3JRMGc6mUxKegR4f2HSLC_1Cfwei5QHZuicg_aLsWEJS3T6XCNg==
- JIeDUJvd8TodeetWYvcK5xBnxBTh8jvsNt8if-ZsMjTUWW4sbZ9P2A==
X-XSS-Protection:
- 1; mode=block
content-length:
Large diffs are not rendered by default.