Search engine providers usually restrict insights into the ranking process of search results. This article explains how search results are ordered by the search engine on umwelt.info.
About the ranking of search results
What is the purpose of ranking search results?
The ranking of search results is important to obtain the most relevant information as quickly as possible. A submitted search query is compared with all entries in the search index - the database of referenced information and data in umwelt.info. This comparison identifies the entries that are most relevant to the respective search query. Based on this, the order is determined and suitable search results are placed at the top of the results list.
Which factors influence the ranking?
The ranking on umwelt.info is based on five factors: BM25 ranking, recency, metadata quality, popularity and status. These five factors are used to determine a score for each entry. The higher the overall score, the higher up an entry is positioned in the results list. The individual factors are explained in the following.
The BM25 ranking algorithm is often used by default in search engines. Among other things, it considers how often a search term occurs within an entry compared to the entire text corpus, that is, the complete search catalogue of umwelt.info. The length of the respective entry is also relevant. More information on the algorithm can be found here. We use the implementation of the algorithm provided by the tantivy library. The algorithm is applied to the title, description, region, keywords, origin, type and measured values of all entries in the search index. An occurrence of a search term in an entry is weighted once, except in the title (weighted twice), the region (weighted three times) and the keywords (weighted three times). Furthermore, the number of search terms of the query that are matched within an entry is taken into account by multiplying the score by this number. For instance, if an entry contains all three words of the example query "groundwater nitrate saxony", its score is multiplied by three.
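The field weighting and the multiplication by the number of matched terms can be illustrated with a minimal Python sketch. It is not the umwelt.info implementation: the per-field BM25 scores are assumed to be given (in practice they come from tantivy), and the names `FIELD_WEIGHTS`, `text_score` and the input layout are hypothetical. Only the weights and the multiplication rule are taken from the text above.

```python
# Field weights as described in the text: title counts double,
# region and keywords count triple, all other fields count once.
FIELD_WEIGHTS = {
    "title": 2.0,
    "description": 1.0,
    "region": 3.0,
    "keywords": 3.0,
    "origin": 1.0,
    "type": 1.0,
    "measured_values": 1.0,
}


def text_score(per_term_field_scores, query_terms):
    """Combine per-field BM25 scores for one entry.

    per_term_field_scores is a hypothetical input of the form
    {term: {field: bm25_score}} containing only the fields in which
    the term occurs in this entry.
    """
    total = 0.0
    matched_terms = 0
    for term in query_terms:
        field_scores = per_term_field_scores.get(term, {})
        term_total = sum(
            FIELD_WEIGHTS[field] * score for field, score in field_scores.items()
        )
        if term_total > 0:
            matched_terms += 1
        total += term_total
    # The weighted sum is multiplied by the number of matched query terms.
    return total * matched_terms


query = ["groundwater", "nitrate", "saxony"]
entry = {
    "groundwater": {"title": 1.0},
    "nitrate": {"keywords": 0.5},
    "saxony": {"region": 1.0},
}
print(text_score(entry, query))
```

In this sketch an entry that matches all three query terms has its weighted sum multiplied by three, whereas an entry that matches only two terms has it multiplied by two.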
Recency describes how up to date the respective data set is, so that users preferably find the latest entries. The factors considered for recency are the time elapsed since the publication date and whether the publication date lies more than six years in the past.
Metadata quality is assessed according to the FAIR principles. Another article provides the information on the meaning and calculation of metadata quality. This ranking factor prioritises search results with original data or information that is easier to find and easier to reuse. The average value of all metadata quality criteria considered is used in the ranking.
Popularity rating is based on the click rate of a search entry. The more often a search result has been accessed, the higher the score for the entry. Current results are rated more highly than results from further back (you can find the corresponding exponential function in our code here).
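One way to read the time-dependent popularity rating is as an exponential decay applied to each access, so that recent accesses count more than older ones. The following Python sketch works under that assumption. The half-life of 30 days and the names `HALF_LIFE_DAYS` and `popularity` are hypothetical and not taken from the umwelt.info code, which should be consulted for the actual function.

```python
import math

# Hypothetical half-life: an access from 30 days ago counts half as much
# as an access from today. The real parameter may differ.
HALF_LIFE_DAYS = 30.0


def popularity(access_ages_days, half_life_days=HALF_LIFE_DAYS):
    """Sum of exponentially decayed accesses for one entry.

    access_ages_days lists, for each recorded access, how many days
    ago it happened.
    """
    decay_rate = math.log(2) / half_life_days
    return sum(math.exp(-decay_rate * age) for age in access_ages_days)


# Two recent accesses outweigh two accesses from two months ago.
print(popularity([0.0, 1.0]) > popularity([60.0, 61.0]))
```

With this form, every additional access increases the value, and the contribution of each access fades smoothly rather than dropping to zero at a fixed cut-off.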
The status of a search entry falls into one of four categories: obsolete, active, in development and in planning. Usually, the data providers assign these categories themselves. The default is active, i.e. the entry is current; in this case the score remains unchanged. Obsolete entries are those that are out of date.
How are the factors weighted?
All factors considered are weighted and combined into a total score for each entry. This score then determines the order in which the search entries are displayed. The BM25 score enters with a weighting of 85%, so a close match between the entry and the search query has the strongest influence on the ranking. It is followed by the recency of the entry (around 9%), whereas metadata quality (2.5%) and popularity (around 3%) only have a minor effect on the positioning (you can find the implementation in our code here).
For example, if two entries are close to each other in the BM25 score, a current entry with higher metadata quality and/or popularity would tend to appear higher up. On the other hand, the weighting is meant to avoid prioritising non-current records with high quality and popularity too strongly. Otherwise, entries with a relatively low match with the search query could still be positioned at the top of the results list.
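The effect of the weighting can be illustrated with a small Python sketch. It assumes that each factor has been normalised to the range 0 to 1 and that the factors are combined as a weighted sum. Both are assumptions made for illustration, as are the names `WEIGHTS` and `combined_score` and the example values. The weights are the rounded percentages given above, which is why they do not add up to exactly 1.

```python
# Rounded weights from the text (approximately 85%, 9%, 2.5% and 3%).
WEIGHTS = {
    "bm25": 0.85,
    "recency": 0.09,
    "metadata_quality": 0.025,
    "popularity": 0.03,
}


def combined_score(factors):
    """Weighted sum of factor scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


# Two entries with similar BM25 scores: the more recent one with better
# metadata and more accesses ends up ahead.
recent = {"bm25": 0.80, "recency": 0.9, "metadata_quality": 0.8, "popularity": 0.5}
older = {"bm25": 0.82, "recency": 0.2, "metadata_quality": 0.4, "popularity": 0.1}

# An entry that matches the query poorly stays behind, even with perfect
# values for all other factors.
poor_match = {"bm25": 0.30, "recency": 1.0, "metadata_quality": 1.0, "popularity": 1.0}

for name, entry in [("recent", recent), ("older", older), ("poor_match", poor_match)]:
    print(name, round(combined_score(entry), 3))
```

Under these assumptions the secondary factors act as tie-breakers between entries with similar text relevance, while they cannot lift a poorly matching entry above well-matching ones.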
The status is irrelevant in most cases: it is set to active by default, which does not affect the ranking. It is mainly used to devalue obsolete entries, whose score is multiplied by 0.6 so that they are placed lower in the results list. This means that users are more likely to see active and therefore more relevant entries higher up. Entries that are in development or in planning are also devalued slightly by multiplying their score by 0.95.
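The status adjustment described above amounts to a simple multiplier on the score. In the following Python sketch the multipliers are taken from the text, while the status identifiers, the names `STATUS_MULTIPLIER` and `apply_status`, and the assumption that the multiplier is applied to the total score are hypothetical.

```python
# Multipliers from the text; the identifier spellings are assumptions.
STATUS_MULTIPLIER = {
    "active": 1.0,           # default, score unchanged
    "obsolete": 0.6,         # clearly devalued
    "in_development": 0.95,  # slightly devalued
    "in_planning": 0.95,     # slightly devalued
}


def apply_status(score, status="active"):
    """Scale a score according to the status of the entry."""
    return score * STATUS_MULTIPLIER[status]


# An obsolete entry needs a considerably higher base score to outrank
# an active one.
print(apply_status(0.9, "obsolete") < apply_status(0.6, "active"))
```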
How can users and data providers help to improve ranking?
Data providers may be interested in maximising the ranking of their entries. To do so, it is advisable to pay particular attention to the metadata quality of your entries, for example by providing the data via machine-readable interfaces and by including information on the licence to ensure easy reusability. You can find more information in the article on metadata quality. Please contact the umwelt.info team if you have any questions.
Users should find the search entries that are as relevant as possible to their queries on environmental and nature conservation issues. That is why the ranking is under continuous development and is constantly being improved. In the future, further factors will be taken into account, such as the geographical relevance of the search results. The search index will also be constantly expanded by integrating new data and information offers, so that more areas of environmental protection and nature conservation are covered well in umwelt.info. Users should not hesitate to contact umwelt.info if they receive less relevant or unsuitable results in their search or if they have any other comments on the results list for their search query. User feedback is welcome.