Releases: deepset-ai/haystack
v2.7.0
Release Notes
✨ Highlights

🚅 Rework `Pipeline.run()` logic to better handle cycles

The internal logic of `Pipeline.run()` has been heavily reworked to be more robust and reliable than before. The new implementation makes it easier to run `Pipeline`s that have cycles in their graph. It also fixes some corner cases in `Pipeline`s that don't have any cycle.

📝 Introduce LoggingTracer

With the new `LoggingTracer`, users can inspect the logs in real time to see everything that is happening in their Pipelines. This feature aims to improve the user experience during experimentation and prototyping.
```python
import logging

from haystack import tracing
from haystack.tracing.logging_tracer import LoggingTracer

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.DEBUG)

tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
tracing.enable_tracing(LoggingTracer())
```
⬆️ Upgrade Notes
- Removed `Pipeline` init argument `debug_path`. We do not support this anymore.
- Removed `Pipeline` init argument `max_loops_allowed`. Use `max_runs_per_component` instead (see the migration sketch after this list).
- Removed the `PipelineMaxLoops` exception. Use `PipelineMaxComponentRuns` instead.
- The deprecated default converter class `haystack.components.converters.pypdf.DefaultConverter` used by `PyPDFToDocument` has been removed. Pipeline YAMLs from `haystack<2.7.0` that use the default converter must be updated in the following manner:

  ```yaml
  # Old
  components:
    Comp1:
      init_parameters:
        converter:
          type: haystack.components.converters.pypdf.DefaultConverter
      type: haystack.components.converters.pypdf.PyPDFToDocument

  # New
  components:
    Comp1:
      init_parameters:
        converter: null
      type: haystack.components.converters.pdf.PDFToTextConverter
  ```

  Pipeline YAMLs from `haystack<2.7.0` that use custom converter classes can be upgraded by simply loading them with `haystack==2.6.x` and saving them to YAML again.
- `Pipeline.connect()` will now raise a `PipelineConnectError` if `sender` and `receiver` are the same Component. We do not support this use case anymore.
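A minimal migration sketch for the removed `max_loops_allowed` argument, assuming a pipeline previously constructed with it:

```python
from haystack import Pipeline

# Before (haystack < 2.7.0) -- removed in 2.7.0:
# pipeline = Pipeline(max_loops_allowed=10)

# After: the cap is now expressed per component
pipeline = Pipeline(max_runs_per_component=10)
```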
🚀 New Features
- Added component `StringJoiner` to join strings from different components into a list of strings.
- Improved serialization/deserialization errors to provide extra context about the offending components when possible.
- Enhanced the DOCX converter to support table extraction in addition to paragraph content. The converter supports both CSV and Markdown table formats, providing flexible options for representing tabular data extracted from DOCX documents.
- Added a new parameter `additional_mimetypes` to the `FileTypeRouter` component. This allows users to specify additional MIME type mappings, ensuring correct file classification across different runtime environments and Python versions (see the sketch after this list).
- Introduced a `LoggingTracer` that sends all traces to the logs. It can be enabled as follows:

  ```python
  import logging

  from haystack import tracing
  from haystack.tracing.logging_tracer import LoggingTracer

  logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
  logging.getLogger("haystack").setLevel(logging.DEBUG)

  tracing.tracer.is_content_tracing_enabled = True  # to enable tracing/logging content (inputs/outputs)
  tracing.enable_tracing(LoggingTracer())
  ```

- Fundamentally reworked the internal logic of `Pipeline.run()`. The rework makes it more reliable and covers more use cases. We fixed some issues that made `Pipeline`s with cycles unpredictable and their Component execution order unclear.
- Each tracing span of a component run is now attached to the pipeline run span object. This allows users to trace the execution of multiple pipeline runs concurrently.
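A minimal sketch of the new `additional_mimetypes` parameter; the MIME-type-to-extension mapping and the file name are illustrative:

```python
from haystack.components.routers import FileTypeRouter

# Register a .docx MIME type that may be missing from the host's mimetypes
# database (e.g., in slim containers or AWS Lambda), so such files are not
# routed to "unclassified".
docx_mime = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
router = FileTypeRouter(
    mime_types=[docx_mime],
    additional_mimetypes={docx_mime: ".docx"},
)
result = router.run(sources=["report.docx"])  # hypothetical file path
```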
⚡️ Enhancement Notes
- Added a `streaming_callback` run parameter to `HuggingFaceAPIGenerator` and `HuggingFaceLocalGenerator`, allowing users to pass a callback function that is called after each chunk of the response is generated.
- The `SentenceWindowRetriever` now supports the `window_size` parameter at run time, overriding the value set in the constructor.
- Added output type validation in `ConditionalRouter`. Setting `validate_output_type` to `True` enables a check that verifies whether the actual output of a route matches the declared type. If it doesn't, a `ValueError` is raised (see the sketch after this list).
- Reduced `numpy` usage to speed up imports.
- Improved file type detection in `FileTypeRouter`, particularly for Microsoft Office file formats like .docx and .pptx. This enhancement ensures more consistent behavior across different environments, including AWS Lambda functions and systems without pre-installed office suites.
- The `FileTypeRouter` now supports passing metadata (`meta`) in the `run` method. When metadata is provided, the sources are internally converted to `ByteStream` objects and the metadata is added. This new parameter simplifies working with preprocessing/indexing pipelines.
- `SentenceTransformersDocumentEmbedder` now supports `config_kwargs` for additional parameters when loading the model configuration.
- `SentenceTransformersTextEmbedder` now supports `config_kwargs` for additional parameters when loading the model configuration.
- Previously, `numpy` was pinned to `<2.0` to avoid compatibility issues in several core integrations. This pin has been removed, and Haystack can work with both `numpy` `1.x` and `2.x`. If necessary, we will pin the `numpy` version in specific core integrations that require it.
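A minimal sketch of `validate_output_type`, following the standard two-route `ConditionalRouter` setup; the route conditions and output names are illustrative:

```python
from typing import List

from haystack.components.routers import ConditionalRouter

routes = [
    {
        "condition": "{{ streams|length > 2 }}",
        "output": "{{ streams }}",
        "output_name": "enough_streams",
        "output_type": List[int],
    },
    {
        "condition": "{{ streams|length <= 2 }}",
        "output": "{{ streams }}",
        "output_name": "insufficient_streams",
        "output_type": List[int],
    },
]
# With validation enabled, a route whose actual output does not match its
# declared output_type raises a ValueError at run time.
router = ConditionalRouter(routes=routes, validate_output_type=True)
result = router.run(streams=[1, 2, 3])
print(result)  # {'enough_streams': [1, 2, 3]}
```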
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
🐛 Bug Fixes
- Serialized data of components is now explicitly enforced to be one of the following basic Python datatypes: `str`, `int`, `float`, `bool`, `list`, `dict`, `set`, `tuple`, or `None`.
- Addressed an issue where certain file types (e.g., .docx, .pptx) were incorrectly classified as "unclassified" in environments with limited MIME type definitions, such as AWS Lambda functions.
- Fixed logs containing JSON data getting lost due to string interpolation.
- Use forward references for Hugging Face Hub types in the `HuggingFaceAPIGenerator` component to prevent import errors.
- Fixed the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Reverted a change to `PyPDFConverter` that broke the deserialization of pre-`2.6.0` YAMLs.
v1.26.4
Release Notes
⚡️ Enhancement Notes
- Upgraded the `transformers` dependency requirement to `transformers>=4.46,<5.0`.
- Updated the `tokenizer.json` URL for Anthropic models, as the old URL was no longer available.
v2.6.1
Release Notes
Bug Fixes
- Reverted a change to `PyPDFConverter` that broke the deserialization of pre-`2.6.0` YAMLs.
v2.6.0
Release Notes
⬆️ Upgrade Notes
- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on the OpenAI API.
- Support for the legacy filter syntax and operators (e.g., `"$and"`, `"$or"`, `"$eq"`, `"$lt"`, etc.), which originated in Haystack v1, has been fully removed. Users must now use only the new filter syntax (sketched below). See the docs for more details.
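For reference, a minimal sketch of the new filter syntax that replaces the legacy operators; the field names and values are illustrative:

```python
# Legacy (removed):
# filters = {"$and": {"type": {"$eq": "article"}, "date": {"$gte": "2015-01-01"}}}

# New syntax:
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": "2015-01-01"},
    ],
}
```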
🚀 New Features
- Added a new component `DocumentNDCGEvaluator`, which is similar to `DocumentMRREvaluator` and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground-truth relevant documents and the order in which they are retrieved is important.
- Added new `CSVToDocument` component. It loads the file as a bytes object and adds the loaded string as a new document that can be used for further processing by the DocumentSplitter.
- Added support for zero-shot document classification via the new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.
- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings representing the split units. To use the feature, initialise `DocumentSplitter` with `split_by="function"`, providing the custom splitting function as `splitting_function=custom_function` (see the sketch after this list).
- Added new `JSONConverter` component to convert JSON files to Documents. Optionally, it can use jq to filter the source JSON files and extract only specific parts:
```python
import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"])

results = converter.run(sources=[source])
documents = results["documents"]

print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors'
print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```
- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.
- Updated `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so that `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.
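A minimal sketch of the custom splitting function feature; the pipe-delimiter rule is an illustrative assumption:

```python
from typing import List

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

def custom_function(text: str) -> List[str]:
    # Hypothetical splitting rule: split on a pipe delimiter
    return text.split("|")

splitter = DocumentSplitter(split_by="function", splitting_function=custom_function)
result = splitter.run(documents=[Document(content="part one|part two|part three")])
print([d.content for d in result["documents"]])
# ['part one', 'part two', 'part three']
```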
⚡️ Enhancement Notes
- Adapted how `ChatPromptBuilder` creates `ChatMessage`s. Messages are deep-copied to ensure all meta fields are copied correctly.
- Exposed `default_headers` to pass custom headers to the Azure API, including an APIM subscription key.
- Added an optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.
- Added the ability to insert the current date inside a `PromptBuilder` template using the following syntax (see the sketch after this list):
  - `{% now 'UTC' %}`: Get the current date for the UTC timezone.
  - `{% now 'America/Chicago' + 'hours=2' %}`: Add two hours to the current date in the Chicago timezone.
  - `{% now 'Europe/Berlin' - 'weeks=2' %}`: Subtract two weeks from the current date in the Berlin timezone.
  - `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: Display only the number of hours after adding two hours to the Fiji timezone.
  - `{% now 'Etc/GMT-4', '%I:%M %p' %}`: Change the date format to AM/PM for the GMT-4 timezone.

  Note that if no date format is provided, the default will be `%Y-%m-%d %H:%M:%S`. Please refer to the tz database for a list of timezones.
- Added a `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.
- Added new `GreedyVariadic` input type. It behaves similarly to the `Variadic` input type in that it can be connected to multiple output sockets, but the Pipeline runs it as soon as it receives an input, without waiting for the others. This replaces the `is_greedy` argument in the `@component` decorator. If you had a Component with a `Variadic` input type and `@component(is_greedy=True)`, you need to change the type to `GreedyVariadic` and remove `is_greedy=True` from `@component`.
- Added new Pipeline init argument `max_runs_per_component`; it behaves identically to the existing `max_loops_allowed` argument but is more descriptive of its actual effects.
- Added new `PipelineMaxComponentRuns` exception to reflect the new `max_runs_per_component` init argument.
- Added batching at inference time to the `TransformersSimilarityRanker` to help prevent OOMs when ranking large numbers of Documents.
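A minimal sketch of the `{% now %}` tag inside a `PromptBuilder` template; the template text is illustrative:

```python
from haystack.components.builders import PromptBuilder

builder = PromptBuilder(template="As of {% now 'UTC' %}, the top answer is: {{ answer }}")
result = builder.run(answer="42")
print(result["prompt"])
# e.g. "As of 2024-11-11 09:30:00, the top answer is: 42"
```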
⚠️ Deprecation Notes
- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
- Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- The `@component` decorator's `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use the `GreedyVariadic` type instead (see the sketch after this list).
- Deprecated connecting a Component to itself when calling `Pipeline.connect()`; it will raise an error from version 2.7.0 onwards.
- Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- The `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.
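A minimal sketch of that migration; the `Collector` component is hypothetical, and the import path for `GreedyVariadic` is assumed to mirror that of `Variadic`:

```python
from haystack import component
from haystack.core.component.types import GreedyVariadic

# Before: @component(is_greedy=True) with a Variadic[int] input.
# After: a plain @component with a GreedyVariadic[int] input.
@component
class Collector:
    @component.output_types(total=int)
    def run(self, numbers: GreedyVariadic[int]):
        # The Pipeline runs this component as soon as any input arrives,
        # without waiting for the other connected senders.
        return {"total": sum(numbers)}
```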
🐛 Bug Fixes
- Fixed the serialization of the `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Added constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not contain a variadic keyword argument.
- Prevented `set_output_types` from being called when the `output_types` decorator is used.
- Updated the `CHAT_WITH_WEBSITE` Pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fixed a Pipeline visualization issue due to changes in the new release of Mermaid.
- Fixed the filters in the `SentenceWindowRetriever`, adding support for three more Document Stores: Astra, PGVector, and Qdrant.
- Fixed Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously, this caused an `AttributeError`.
- Made the `from_dict` method of `PyPDFToDocument` more robust to cases where the converter is not provided in the dictionary.
v2.5.1
Release Notes
⚡️ Enhancement Notes
- Added `default_headers` init argument to `AzureOpenAIGenerator` and `AzureOpenAIChatGenerator` (see the sketch below).
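A minimal sketch of passing an APIM subscription key via `default_headers`; the endpoint, deployment name, and header value are illustrative assumptions:

```python
from haystack.components.generators.chat import AzureOpenAIChatGenerator

# Expects AZURE_OPENAI_API_KEY (or an Azure AD token) in the environment.
generator = AzureOpenAIChatGenerator(
    azure_endpoint="https://example-resource.openai.azure.com",  # hypothetical endpoint
    azure_deployment="gpt-4o-mini",                              # hypothetical deployment name
    default_headers={"Ocp-Apim-Subscription-Key": "<your-apim-key>"},
)
```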
🐛 Bug Fixes
- Fixed the Pipeline visualization issue due to changes in the new release of Mermaid.
- Fixed `Pipeline` not running Components with Variadic input even if it received inputs only from a subset of its senders.
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously, this caused an `AttributeError`.