Issues/37 Add function for returning an iterator instead of sequence #89

frankinspace · 2024-09-24T00:56:15Z

Closes #37

Added a new function get_iter which returns an iterator that will yield all hits by default. Optional parameters for controlling the limit and page size are also available.

Changed the get_all function to use get_iter.

Marked get as deprecated in favor of get_iter.

Updated the unit tests to have equivalent tests for the new get_iter function. Also updated the usage of vcrpy so that cassettes are named for the test function.

…ed unit tests

chuckwondo

@frankinspace, thanks for picking this up!

chuckwondo · 2024-09-24T15:26:03Z

cmr/queries.py

+    def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]:
+        """
+        Returns all results for the query as an iterator (generator)
+
+        :param limit: The maximum number of results to return. Negative value means no limit.
+        :param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means
+        fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries.
+        :returns: query results as an iterator (generator)
+        """
+
+        url = self._build_url()
+
+        headers = dict(self.headers or {})
+        more_results = True
+        page_size = min(max(0, page_size), 2000)
+        n_results = 0
+        if limit < 0:
+            limit = self.hits()
+
+        while more_results:
+            # Only get what we need on the last page.
+            page_size = min(limit - n_results, page_size)
+            response = requests.get(
+                url, headers=headers, params={"page_size": page_size}
+            )
+            response.raise_for_status()
+
+            # Explicitly track the number of results we have because the length
+            # of the results list will only match the number of entries fetched
+            # when the format is JSON.  Otherwise, the length of the results
+            # list is the number of *pages* fetched, not the number of *items*.
+            n_results += page_size
+
+            if self._format == "json":
+                yield from response.json()["feed"]["entry"]
+            else:
+                yield response.text
+
+            if cmr_search_after := response.headers.get("cmr-search-after"):
+                headers["cmr-search-after"] = cmr_search_after
+
+            more_results = n_results < limit and cmr_search_after is not None


I believe that when you and I and @briannapagan spoke about this a few months ago, we agreed upon the name results for this method.

Also, since this returns an iterator, there's no need to keep the limit parameter. Callers can freely limit the results in various ways by limiting how many items they iterate over, which is standard practice when dealing with iterators (i.e., you won't find functions/methods that return iterators that also provide any type of "limit" parameter).

Therefore, I suggest the following rename and slight simplification:

Suggested change

def get_iter(self, limit: int = -1, page_size: int = 2000) -> Iterator[Any]:

"""

Returns all results for the query as an iterator (generator)

:param limit: The maximum number of results to return. Negative value means no limit.

:param page_size: The page size (min 0, max 2000) of results retrieved from CMR. Smaller page size means

fewer items in memory and more cmr queries. Larger page size means more items in memory and fewer cmr queries.

:returns: query results as an iterator (generator)

"""

url = self._build_url()

headers = dict(self.headers or {})

more_results = True

page_size = min(max(0, page_size), 2000)

n_results = 0

if limit < 0:

limit = self.hits()

while more_results:

# Only get what we need on the last page.

page_size = min(limit - n_results, page_size)

response = requests.get(

url, headers=headers, params={"page_size": page_size}

)

response.raise_for_status()

# Explicitly track the number of results we have because the length

# of the results list will only match the number of entries fetched

# when the format is JSON. Otherwise, the length of the results

# list is the number of *pages* fetched, not the number of *items*.

n_results += page_size

if self._format == "json":

yield from response.json()["feed"]["entry"]

else:

yield response.text

if cmr_search_after := response.headers.get("cmr-search-after"):

headers["cmr-search-after"] = cmr_search_after

more_results = n_results < limit and cmr_search_after is not None

def results(self, page_size: int = 2000) -> Iterator[Any]:

"""

Return an iterator (generator) of all results matching the query

criteria.

Because a query may produce a large number of results (perhaps

10s or 100s of thousands), such results are fetched using

multiple CMR requests, each returning a "page" of results, as

returning all results in a single request would be impractical.

The size of each page (in terms of the number of results

in a page) is controlled by the `page_size` parameter. A smaller

page size means fewer items in memory (per page), requiring

more CMR queries to fetch all results (if desired). Conversely,

a larger page size means more items in memory (per page)

and fewer CMR queries.

When the query is configured to use the `"json"` format, each

element produced by the returned iterator is a element of the

"feed.entry" array (see

<https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#json>).

In this case, the iterator may produce as many elements as there

are results matching the query criteria.

For all other formats, each element produced by the returned

iterator is an unparsed (text) page of results (i.e., the caller

is responsible for parsing the page of results into individual

elements). In this case, the iterator will produce only as many

pages as required (based on `page_size`) to produce all results

matching the query criteria.

:param page_size: maximum number of results per page (min 1,

max 2000 [default]) requested from the CMR

:returns: query results as an iterator (generator)

"""

url = self._build_url()

headers = dict(self.headers or {})

params={"page_size": min(max(1, page_size), 2000)}

while True:

response = requests.get(url, headers=headers, params=params)

response.raise_for_status()

if self._format == "json":

yield from response.json()["feed"]["entry"]

else:

yield response.text

if not (cmr_search_after := response.headers.get("cmr-search-after")):

break

headers["cmr-search-after"] = cmr_search_after

Thanks. Yeah I remember discussing it but couldn't recall what the consensus was, will make the update.

chuckwondo · 2024-09-24T15:27:47Z

cmr/queries.py

        awhile if many requests have to be made.

        :returns: query results as a list
        """

-        return self.get(self.hits())
+        return list(self.get_iter())


We should also deprecate get_all. If a user wants to create a list of all results, they can simply call list(query.results()) themselves, rather than query.get_all().

chuckwondo · 2024-09-24T15:28:58Z

cmr/queries.py

@@ -58,11 +60,12 @@ def __init__(self, route: str, mode: str = CMR_OPS):
        self.concept_id_chars: Set[str] = set()
        self.headers: MutableMapping[str, str] = {}

+    @deprecated("Use get_iter() instead")


Suggested change

@deprecated("Use get_iter() instead")

@deprecated("Use the 'results' method instead, but note that it produces an iterator.")

chuckwondo · 2024-09-24T15:34:26Z

CHANGELOG.md

+### Added
+- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))
+
+### Deprecated
+- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37))
+


Suggested change

### Added

- New function `get_iter` for returning results as an iterator instead of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))

### Deprecated

- Function `get` has been marked as deprecated in favor of the new `get_iter` function. `get` will likely be removed for the 1.0.0 release. ([#37](https://github.com/nasa/python_cmr/issues/37))

### Added

- Add method `Query.results` for returning results as an iterator instead

of sequence ([#37](https://github.com/nasa/python_cmr/issues/37))

### Changed

- Deprecate methods `Query.get` and `Query.get_all` in favor of the new

`Query.results` method. These deprecated methods will likely be removed

for the 1.0.0 release.

([#37](https://github.com/nasa/python_cmr/issues/37))

frankinspace · 2024-09-24T20:12:54Z

Clicked the wrong button in my IDE and accidentally merged this. Will open a new PR with suggested changes

frankinspace added 3 commits September 23, 2024 17:48

Added function for returning an iterator instead of a sequence. Updat…

e4590ef

…ed unit tests

Added function for returning an iterator instead of a sequence. Updat…

1dfc115

…ed unit tests

Added function for returning an iterator instead of a sequence. Updat…

8ce8e64

…ed unit tests

frankinspace requested a review from chuckwondo September 24, 2024 00:56

chuckwondo requested changes Sep 24, 2024

View reviewed changes

frankinspace merged commit bda1cff into develop Sep 24, 2024
7 checks passed

frankinspace deleted the issues/37 branch September 24, 2024 20:05

frankinspace restored the issues/37 branch September 24, 2024 20:10

frankinspace mentioned this pull request Sep 24, 2024

Revert "Issues/37 Add function for returning an iterator instead of sequence" #90

Merged

frankinspace mentioned this pull request Sep 24, 2024

Issues/37 Add function for returning an iterator instead of sequence #91

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues/37 Add function for returning an iterator instead of sequence #89

Issues/37 Add function for returning an iterator instead of sequence #89

frankinspace commented Sep 24, 2024

chuckwondo left a comment

chuckwondo Sep 24, 2024

frankinspace Sep 24, 2024

chuckwondo Sep 24, 2024

chuckwondo Sep 24, 2024

chuckwondo Sep 24, 2024

frankinspace commented Sep 24, 2024

	@deprecated("Use get_iter() instead")
	@deprecated("Use the 'results' method instead, but note that it produces an iterator.")

Issues/37 Add function for returning an iterator instead of sequence #89

Issues/37 Add function for returning an iterator instead of sequence #89

Conversation

frankinspace commented Sep 24, 2024

chuckwondo left a comment

Choose a reason for hiding this comment

chuckwondo Sep 24, 2024

Choose a reason for hiding this comment

frankinspace Sep 24, 2024

Choose a reason for hiding this comment

chuckwondo Sep 24, 2024

Choose a reason for hiding this comment

chuckwondo Sep 24, 2024

Choose a reason for hiding this comment

chuckwondo Sep 24, 2024

Choose a reason for hiding this comment

frankinspace commented Sep 24, 2024