In short: Is there any way to index both the stemmed and the non-stemmed version of Fulltext fields in Drupal Search API Solr without hardcoding field names into copyField declarations in the schema file?

(and if not, what's the safest, most Drupal / Search API friendly approach to doing this? e.g. using Drupal field machine names in the Solr schema file, maybe?)

Background: A common practice when working with Apache Solr is to use a stemmer like SnowballPorterFilterFactory (stemmers make searches match on the grammatical 'stem' of a word, so, for example, a search on "walking" matches content containing "walked"), and then to duplicate each fulltext field (e.g. using copyField), stemming one of the two copies and leaving the other unstemmed.

This popular approach has two advantages (at the cost of more processing):

  • Exact matches are ranked more highly than close matches - a search on "walking" matches content with "walking" twice (stemmed and unstemmed) and content with "walked" once (stemmed only)
  • You're guaranteed to not have any awkward cases where a search on a stemmable term fails to match content that contains that original exact term*.

As I understand it, when Solr is used outside of Drupal, this is usually done by hardcoding <copyField> declarations into the Solr schema file for each field that is to be stemmed.

(for the sake of this question, imagine a Search API Solr search indexing nodes, processing as fulltext the fields Title, Body, Teaser, and one custom field named Notes)

The problem: With Drupal and the Search API module's Search API Solr, the fields are configured dynamically by Drupal rather than being hardcoded in the Solr schema.xml file. It's not clear how to make the two approaches play nicely together.

Is there a better, more Drupal-friendly way to index stemmed and unstemmed content than hard-coding the field names into copyField declarations in the schema.xml file?

If hard-coding is the answer, what names should be used, and what care needs to be taken to avoid namespace problems?

Another popular approach with Solr outside of Drupal is to create a composite field of all your fulltext fields, combined, and treat it as a string rather than fulltext - so it bypasses filters and simply mops up and boosts any words that are searched for exactly as they appear in the text. Search API has two features under Workflow that look promising for this, Aggregated fields and Complete entity view - but (as far as I can tell) both can only index as Fulltext and would therefore be stemmed just like the regular field.

I can't find anything on this specific to Search API or Search API Solr. The closest I can see is this issue for D8 core search, but that's rather different.

*(for example, I find that with stemming, searches on 'unravelling' don't match content containing 'unravelling', but searches on 'unravel' do. It's being stemmed down to 'unravel', but for some reason it's not recognising 'unravelling' as a valid extension of the stem 'unravel'. 'Unpublished' is another example (unpublish matches, unpublished doesn't). I've seen various reports of similar language-specific stemming issues. Keeping an unstemmed copy seems to be the standard approach, but I can't see any clean way to do this in Drupal)

Also posted as a support issue on the Search API Solr queue (yes, cross posting like this is okay, actually encouraged to make life easier for maintainers)


Good question, and also a topic that should probably be brought up as a feature request for both Apache Solr Integration and Search API Solr at some point. Searching both stemmed and unstemmed content can often lead to significantly better results.

For Search API Solr, there's no way to do this without modifying the schema.xml bundled with the module. But you're probably already aware of that since you're using stemming and the bundled schema does not have stemming enabled by default.

The key to doing this without lots of copyField definitions is to use a combination of dynamicField and copyField. You'll also need two different fieldType definitions, one for unstemmed and one for stemmed text.

Try these steps:

  1. Duplicate the entire definition for <fieldType name="text"> as <fieldType name="stemmed_text"> and uncomment the SnowballPorterFilterFactory filter.
  2. Add a dynamicField definition using this new fieldType. The data does not need to be stored, only indexed. Example: <dynamicField name="stemmed_*" type="stemmed_text" termVectors="true" stored="false" />
  3. Add a wildcard copyField definition that copies all text fields into corresponding stemmed text fields. Example: <copyField source="t_*" dest="stemmed_*" />
  4. Reindex.
  5. Add "qf" params to the Solr query for all stemmed fields you want to search. You can do this via hook_search_api_solr_query_alter(). If you want all text fields, you can look for field names prefixed with "t_".
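
Steps 1 to 3 might look roughly like this in schema.xml. This is only a sketch: the analyzer chain shown for stemmed_text is a simplified assumption (the real one should be a copy of whatever the bundled "text" fieldType contains), and the "English" language setting is likewise assumed. The dynamicField and copyField lines are the ones from the steps above.

```xml
<!-- Inside <types>: a stemmed variant of the existing "text" fieldType.
     Assumption: simplified analyzer chain for illustration; in practice,
     copy the bundled "text" definition and enable the Snowball filter. -->
<fieldType name="stemmed_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: English content; pick the language that matches yours. -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- Inside <fields>: indexed-only home for the stemmed copies. -->
<dynamicField name="stemmed_*" type="stemmed_text" termVectors="true" stored="false"/>

<!-- After </fields>: copy every fulltext field into its stemmed twin. -->
<copyField source="t_*" dest="stemmed_*"/>
```

With this in place no per-field copyField lines are needed, so fulltext fields added later in the Search API index configuration should be picked up after a reindex, provided they are indexed under the "t_" prefix.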

Bonus: Sometimes it helps to boost matches against the unstemmed field in order to make more exact results appear first. You can add boost factors to the field names in the qf param.

Bonus #2: The pf param ("phrase field") is another good way to boost exact matches. This lets you give priority to results where terms are in the same order as what the user entered.

Apache Solr Integration module:

For those using apachesolr.module, matching against both stemmed/unstemmed can be done without changing schema.xml. The bundled schema includes lots of dynamicFields and is quite flexible. By default, searchable fields are stemmed. You can simply copy content fields to unstemmed dynamicFields in hook_apachesolr_index_documents_alter() and then add "qf" params for those fields via hook_apachesolr_query_alter().

