# Input files

As mentioned in the [commands reference](commands.md), there are two methods 
of filling the receipt cataloging database with shops, receipts and products. 
One way is to interactively create [new](commands.md#new-receipts-and-products) 
receipts and products, which also writes YAML files for later changes and 
synchronization. The second method is to create such YAML files from scratch 
and import them by [reading them in](commands.md#read-files).

The receipt, product and shop files follow a specific structure which allows 
them to be read in programmatically while still being fairly readable and 
succinct. The files are also meant to be more portable compared to a PostgreSQL 
database, for example, and the I/O tools of the module remain compatible with 
format changes as they are introduced, whereas [database 
migrations](commands.md#alembic-migration) are somewhat more tedious.

For validation and other automation purposes, [JSON schemas](schemas.rst) exist 
to describe the YAML structure, but these are less useful for human consumption. 
Therefore, we also provide a number of example files for the receipt as well as 
the product and shop metadata files.

:::{tip}
Remember that the files are meant to be stored relative to the data path and 
within the subdirectories discovered with the pattern configured in the 
[settings](configuration.md), so any mentions of filenames here are relative to 
those paths.
:::

## Receipts

The receipt YAML file contains a single receipt's metadata, such as the date of 
the receipt and the shop it is from, as well as the products and discounts. The 
following `receipt.yml` file shows an example receipt with some dummy data:

```{literalinclude} ../../samples/receipt.yml
```

Follow year-month-day (YMD) conventions for writing the date. The shop should 
be a simple identifier that is repeated for all receipts of that shop. The 
products are entries of a list, with each product a list of three or four 
elements. In order, the product has the following fields:

- The quantity of the product, either as a plain, exact number or as 
  a (fractional) number with a unit (such as kilograms), if the product is 
  weighed at purchase, for example.
- The label of the product, as it appears on the receipt. The convention is to 
  use lowercase labels, but this is up to you to choose, although it is a good 
  idea to write the label the same way when it appears on different receipts.
- The price of the product, multiplied by the quantity. This should not just be 
  the price of a singular product or the "unit price", because this field is 
  also used as-is for total price calculation of a receipt. However, discounts 
  (if those are listed separately) should not be deducted from this price, to 
  make price tracking across time and base price matching in product metadata 
  feasible.
- An optional indicator that the product received a separate discount. If the 
  receipt marks discounted items with a particular character (or short set of 
  characters), then that is a good value to use; otherwise, put some other 
  value here if the product is involved in such a discount. The indicator 
  itself may be used for shop-specific matching but does not affect the receipt 
  item otherwise.
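
As a rough illustration of how these fields fit together, the following Python 
sketch unpacks product entries that are assumed to be already loaded from YAML 
into plain lists. The sample data and the helper name `describe_item` are 
hypothetical and not part of the module.

```python
from decimal import Decimal

# Hypothetical product entries, as they might look after loading a receipt
# file: [quantity, label, price] with an optional discount indicator.
products = [
    [2, "bulk", "5.00", "bonus"],
    ["0.750kg", "weigh", "2.50"],
]

def describe_item(entry):
    """Split a product entry into named fields (illustrative helper)."""
    quantity, label, price = entry[:3]
    # The fourth element is optional and only marks a separate discount.
    indicator = entry[3] if len(entry) > 3 else None
    return {
        "quantity": quantity,
        "label": label,
        "price": Decimal(str(price)),
        "discounted": indicator is not None,
    }

items = [describe_item(entry) for entry in products]
# The price field already covers the whole quantity, so the subtotal before
# discounts is a plain sum.
subtotal = sum(item["price"] for item in items)
print(subtotal)  # 7.50
```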

Discounts are stored in a `bonus` entry, which is also an array of lists. Each 
list has at least two elements:

- The label of the discount as it appears on the receipt.
- The decrease in price that the discount provided across all the products 
  involved in the sale.
- Any further elements refer to labels defined in the `products` entry. These 
  should be listed in order of appearance and not refer to any products that 
  are not involved in any discount. If multiple products have the same label, 
  the order should in most cases make it clear which product is involved; if 
  somehow an earlier product is not involved in this discount but a later one 
  is, then mismatches may occur, so consider adjusting labels then. If 
  additional granularity is needed on how the discount is broken down over 
  products, then consider splitting up a discount.
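
The in-order label lookup described above could be sketched as follows. This is 
a minimal illustration under assumptions, not the module's actual matching 
code: the function name `assign_discount`, the sample data and the sign of the 
discount amount are all hypothetical.

```python
# Hypothetical data: product entries and one discount entry from a receipt.
products = [
    [2, "bulk", 5.00, "bonus"],
    [3, "bulk", 7.50],
    [4, "bulk", 8.00, "bonus"],
]
# Label, price decrease (sign convention assumed), then product labels.
discount = ["disco", -2.00, "bulk", "bulk"]

def assign_discount(products, discount):
    """Return indexes of the products that a discount's labels refer to."""
    labels = discount[2:]
    matched = []
    start = 0
    for label in labels:
        for index in range(start, len(products)):
            entry = products[index]
            # Only items with a discount indicator (fourth element) qualify.
            if entry[1] == label and len(entry) > 3:
                matched.append(index)
                # Later labels continue the search after this product.
                start = index + 1
                break
    return matched

print(assign_discount(products, discount))  # [0, 2]
```

The second `bulk` item is skipped because it has no discount indicator, which 
is why an item with the same label but without a discount does not cause 
a mismatch in this sketch.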

In the example above, there are six products with one having a quantity with 
a unit. Three products have the same label, but only two of them had a discount 
(which the bonus section refers to by their label in order). The final product 
also has a discount with a different indicator. The bonus section contains two 
discounts that refer to no product or a nonexistent product, respectively, but 
the discounts themselves still apply.

## Product inventories

Aside from primary receipt information, it is possible to augment product data 
with more metadata that help describe, group, shape and identify the items. 
Similarly to receipt files, product metadata is specified in YAML files.

By default, the convention is to have one YAML file for products from the same 
shop. However, by adjusting the products format in the data settings, you can 
also group together products with the same brand, category and/or type, on top 
of the shop. These YAML files are also called *inventories*, and an example of 
a file, `products-id.yml`, is shown below.

```{literalinclude} ../../samples/products-id.yml
```

The YAML file has a shop identifier, possibly any other shared fields (the 
brand, category or type already mentioned), and the products that have metadata 
defined for each of them in a list. In order to match the metadata with receipt 
product items, we use *matchers*:

- For labels, we match a product's label with any of the names in the list.
- For prices, we match a product's normalized price (i.e., after dividing the 
  value by the quantity, but without discounts) with one of the prices in the 
  list. Alternatively, the price matcher is an object with keys, which can 
  define a range using minimum and maximum prices (both interval ends are 
  required and compared with the normalized price), prices per year (with 
  a similar comparison) or prices for a particular unit (using the normalized 
  format for units as keys, such as 'kilogram' and not 'kg').
- For discounts, we match the labels of any of the discounts that a product is 
  involved in (so not the discount indicator provided directly with the item) 
  with any of the names in the list.
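
To make the price matcher more concrete, here is a small Python sketch of the 
list and range forms. The key names `minimum` and `maximum`, the exclusive 
bounds and the function names are assumptions made for illustration; the year 
and unit forms are left out.

```python
from decimal import Decimal

def normalized_price(price, quantity):
    """Price of a single unit: the item price divided by its quantity."""
    return Decimal(str(price)) / Decimal(str(quantity))

def price_matches(matcher, price, quantity):
    """Check a normalized price against a list or a minimum/maximum range."""
    unit_price = normalized_price(price, quantity)
    if isinstance(matcher, list):
        return unit_price in [Decimal(str(value)) for value in matcher]
    # Assumed key names; both ends of the interval must be present.
    low = Decimal(str(matcher["minimum"]))
    high = Decimal(str(matcher["maximum"]))
    # Bounds are treated as exclusive here, which is an assumption.
    return low < unit_price < high

print(price_matches([2.50, 2.00], 5.00, 2))  # True
print(price_matches({"minimum": 0.50, "maximum": 1.00}, 0.89, 1))  # True
print(price_matches([2.50], 7.50, 2))  # False
```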

Any of the matchers may be used to limit which items are considered to be the 
described product. Combinations of matches further restrict the match, which is 
important as matches with multiple top-level products lead to the items not 
being matched at all to any metadata. We later mention how to group an 
assortment of similar products together in a range with a generic product.

With the receipt and inventory examples above, the "weigh" product matches the 
fifth item on the receipt, while the "due" product matches the last item 
because the price of 0.89 is considered in the open interval. The third product 
(disregarding the range of products below it for now) matches the second and 
fourth item because they are involved in the "disco" discount and their 
normalized prices are 2.50 and 2.00, respectively.

For metadata, it is possible to define a product's brand, description, category 
and type, as text. This is free-form, although it is a good idea to write 
brands, categories and types in the same way, as a form of taxonomy across 
products. Additionally, we recognize the following numerical properties of 
products:

- The portions in a product, as a number. Use your own judgment on whether this 
  is relevant to provide; normally this is the number of discrete items in the 
  package, but for other products this is a recommended serving size.
- The weight of a product, using a quantity with a unit.
- The volume of a product, using a quantity with a unit.
- The percentage of alcohol contained in a product.

There are also two identifiers recognized for the products, which are the `sku` 
(stock-keeping unit) and the `gtin` (global trade item number). The former may 
be obtained from an external registry of the shop for lookup purposes and is 
free-form text, while the latter is a number of at most 14 digits, which may
correspond to the barcode on a product and is usually written with leading 
zeroes to pad it to the maximum length. These two identification codes should 
also be unique within the inventory (and the GTIN even across shops), so the 
`rechu new` command detects duplicates and gives the option to merge them.
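
The padding and uniqueness rules can be illustrated with a short Python sketch. 
It is not how `rechu new` is implemented; the helper `format_gtin` and the 
inventory data are hypothetical, and the GTIN check digit is not validated.

```python
from collections import Counter

def format_gtin(gtin):
    """Pad a GTIN with leading zeroes to the maximum length of 14 digits."""
    text = str(gtin)
    if not text.isdigit() or len(text) > 14:
        raise ValueError(f"not a valid GTIN: {gtin!r}")
    return text.zfill(14)

# Hypothetical products from one inventory.
inventory = [
    {"sku": "abc123", "gtin": 1234567890123},
    {"sku": "def456", "gtin": 1234567890123},
    {"sku": "ghi789"},
]

# Count each padded GTIN and keep the ones that occur more than once.
counts = Counter(format_gtin(p["gtin"]) for p in inventory if "gtin" in p)
duplicates = [gtin for gtin, count in counts.items() if count > 1]
print(duplicates)  # ['01234567890123']
```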

When a product comes in different forms which share the same properties or are 
otherwise hard to distinguish (like when they show up with the same labels, 
prices and/or discounts on the receipts), it is possible to define a range of 
product metadata items in an array. These inherit most fields from the generic 
product defined at the top level, except for sku/gtin identification codes. 
Product range metadata can override fields from the generic product and even 
leave out matchers, such as in the Small variant of the chocolate type product.
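
The inheritance between a generic product and its range can be pictured with 
the following Python sketch. The field names (`labels`, `type`, `description`, 
`range`) and the helper `resolve_range_item` are assumptions chosen for the 
example, so check the sample inventory and schemas for the actual keys.

```python
# Identification codes are not inherited by products in a range.
NOT_INHERITED = {"sku", "gtin"}

def resolve_range_item(generic, override):
    """Combine a generic product with one entry of its range."""
    inherited = {
        key: value for key, value in generic.items()
        if key not in NOT_INHERITED and key != "range"
    }
    # Fields set on the range entry take precedence over inherited ones.
    return {**inherited, **override}

# Hypothetical generic product with a single range entry.
generic = {
    "labels": ["choco"],
    "type": "chocolate",
    "sku": "generic-1",
    "range": [{"description": "Small", "sku": "small-1"}],
}
small = resolve_range_item(generic, generic["range"][0])
print(small)
```

In this sketch the range entry keeps the inherited `labels` and `type`, gets 
its own `sku`, and would have no `sku` at all if it did not define one itself.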

The matching algorithm prefers generic products over product range metadata 
when they have the same matchers, but with adjustments to their matchers, the 
metadata with the narrowest matchers prevails. This means that overriding 
a matcher to remove it completely decreases the specificity of that product 
(actually making it less likely to be used if it has a duplicate match), while 
overriding it to decrease the number of matching values increases its 
specificity. This is how the second and fourth products on the example receipt 
end up matching the second and first range product, respectively.

## Shops

When shop identifiers are used in receipts and products, the database ensures 
that there is a metadata model to refer to. Even if the shop is not mentioned 
in a shop inventory file, there will be an empty model for relations to work 
properly. Adding metadata to shops can be helpful to further distinguish the 
store chain and to enable shop-specific configuration.

Only one YAML file can be used to load shop metadata from. The path to the file 
can be adjusted in the shops data setting. This is an example `shops.yml` file:

```{literalinclude} ../../samples/shops.yml
```

In contrast to other files, there is no overarching object mapping in the shops 
inventory; instead, each shop is represented as an object in a primary array. 
Each shop must have a "key" with the identifier used by receipts and products. 
In addition, the shop defines optional text properties like name, website, 
Wikidata item and a products URL format, which refers to fields from the shop 
and a relevant product. Finally, the shop can have a number of discount 
indicator patterns, which are regular expressions that match portions of the 
short sequence of characters on a receipt item that indicates it was 
discounted.
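
As an illustration of how such patterns behave, the Python sketch below checks 
an item's indicator against a list of regular expressions. The patterns and the 
helper `is_discounted` are hypothetical, and using `re.search` (which matches 
anywhere in the text) is an assumption based on the patterns matching portions 
of the indicator.

```python
import re

# Hypothetical patterns for one shop; the real ones belong in the shops file.
discount_indicators = [r"^bonus$", r"\d+%"]

def is_discounted(indicator, patterns):
    """Check whether an item's indicator text matches any shop pattern."""
    if indicator is None:
        return False
    return any(re.search(pattern, indicator) for pattern in patterns)

print(is_discounted("25%", discount_indicators))    # True
print(is_discounted("bonus", discount_indicators))  # True
print(is_discounted("b", discount_indicators))      # False
```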