
2025-12-09 minmax

Having almost reached the end of the Crossref ingestion, the data database became unusable and stopped serving both read and write queries. The logs were full of entries, gigabytes and gigabytes of them, reporting a pagination inconsistency. The only explanation I can come up with is that running bulk loads too frequently breaks the database.

Starting over from scratch, this time without bulk loading.

Pasted image 20251205232137.png

arcangelo7 Dec 5, 2025 · opencitations/oc_meta

refactor: migrate from SPARQLWrapper to sparqlite

+1233 -1646 b0b71bd

https://pytest-benchmark.readthedocs.io/en/latest/

arcangelo7 Dec 3, 2025 · opencitations/oc_ocdm

feat: add benchmark infrastructure

Add pytest-benchmark configuration and benchmarks for testing provenance generation, reading, and writing operations.

+8252 -12 e357300
import pytest
from rdflib import Graph

# create_populated_graph_set, _compute_graph_changes and BENCHMARK_ROUNDS are
# provided by the oc_ocdm benchmark suite (their imports are omitted here).


class MockEntity:
    def __init__(self, current_graph, preexisting_graph):
        self.g = current_graph
        self.preexisting_graph = preexisting_graph
        self.to_be_deleted = False


class TestGraphDiff:
    @pytest.mark.benchmark(group="graph_diff")
    @pytest.mark.parametrize("entity_count", [50, 100, 150])
    def test_compute_graph_changes_modified(
        self, benchmark, redis_counter_handler, entity_count
    ):
        def setup():
            graph_set, brs = create_populated_graph_set(
                redis_counter_handler, entity_count
            )
            entities = []
            for entity in graph_set.res_to_entity.values():
                # Snapshot the current triples as the "preexisting" graph.
                preexisting = Graph()
                for triple in entity.g:
                    preexisting.add(triple)
                mock = MockEntity(entity.g, preexisting)
                entities.append(mock)
            # Change each bibliographic resource's title so that every br
            # has at least one modified triple.
            for i, br in enumerate(brs):
                br.has_title(f"Modified Title {i}")
            return (entities,), {}

        def compute_changes(entities):
            results = []
            for entity in entities:
                result = _compute_graph_changes(entity, "graph")
                results.append(result)
            return results

        result = benchmark.pedantic(
            compute_changes, setup=setup, rounds=BENCHMARK_ROUNDS
        )
        assert len(result) >= entity_count * 14
        entities_with_changes = sum(
            1
            for to_insert, to_delete, n_added, n_removed in result
            if n_added > 0 or n_removed > 0
        )
        assert entities_with_changes == entity_count, (
            f"Expected {entity_count} entities with changes, "
            f"got {entities_with_changes}"
        )

pytest-benchmark runs the tests and generates statistics automatically.

Diff with isomorphic graphs: Pasted image 20251204143443.png

Diff between sets: Pasted image 20251204143532.png

arcangelo7 Dec 4, 2025 · opencitations/oc_ocdm

fix(query_utils): replace isomorphic graph comparison with set operations [release]

Use simple set difference instead of to_isomorphic() and graph_diff() for detecting changes between graphs. This is valid because OCDM never uses blank nodes (all resources have URIs).

+429 -1168 5650079
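To make the change concrete, here is a minimal sketch (the entity URI and title triple are made up) of the two strategies: the canonicalization-based diff that was removed, and the plain set difference that replaces it, which is sound because OCDM identifies every resource with a URI.

from rdflib import Graph, Literal, URIRef
from rdflib.compare import graph_diff, to_isomorphic
from rdflib.namespace import DCTERMS

subj = URIRef("https://w3id.org/oc/meta/br/1")  # hypothetical entity
old_g = Graph()
old_g.add((subj, DCTERMS.title, Literal("Old Title")))
new_g = Graph()
new_g.add((subj, DCTERMS.title, Literal("New Title")))

# Old path: canonicalize both graphs, then diff. Only needed when blank
# nodes make triple identity ambiguous.
_, only_old, only_new = graph_diff(to_isomorphic(old_g), to_isomorphic(new_g))

# New path: iterating a Graph yields triples, so plain set difference works
# whenever every node is a URI or a literal.
to_delete = set(old_g) - set(new_g)
to_insert = set(new_g) - set(old_g)
assert to_delete == set(only_old) and to_insert == set(only_new)
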
arcangelo7 Dec 4, 2025 · opencitations/oc_ocdm

fix(query_utils): replace Graph objects with sets in _compute_graph_changes

Eliminate unnecessary Graph object creation when computing triple differences. The previous implementation created Graph objects and added triples in a loop, which was slow. Since callers only iterate over the results, sets work equally well and provide ~10x speedup in benchmarks.

+390 -58 9dbd09e
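The optimized shape is roughly the following (a plausible reduction, not the actual function body): build sets directly instead of populating Graph objects, matching the (to_insert, to_delete, n_added, n_removed) tuples unpacked in the benchmark above.

def compute_graph_changes(entity):
    # Hypothetical simplification of _compute_graph_changes: plain sets
    # instead of Graph objects, since callers only iterate over the results.
    current = set(entity.g)
    preexisting = set(entity.preexisting_graph)
    to_insert = current - preexisting
    to_delete = preexisting - current
    return to_insert, to_delete, len(to_insert), len(to_delete)
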

Pasted image 20251204191611.png

arcangelo7 Dec 4, 2025 · opencitations/oc_ocdm

fix(storer): remove unnecessary Dataset reconstruction in _store_in_file

The hack was originally needed with ConjunctiveGraph (rdflib < 7) to ensure provenance triples were written correctly. After migration to Dataset with quads() in rdflib 7.4.0, this reconstruction became a redundant copy operation.

Also adds storer benchmark for measuring file I/O performance.

+384 -12 5091653

Before: Pasted image 20251204172306.png

After: Pasted image 20251204172350.png

arcangelo7 Dec 4, 2025 · opencitations/oc_ocdm

fix(support): optimize find_paths with single regex match

Add parse_uri() function that extracts all URI components in one regex match instead of calling get_short_name(), get_prefix(), get_count() separately. Remove unused get_prov_subject_* functions.

+457 -106 e244b5b
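The idea, sketched below with an illustrative regex (the real pattern in oc_ocdm differs), is to run one compiled match and hand out all the captured components at once:

import re

# Illustrative only: one compiled regex capturing the pieces that
# get_short_name(), get_prefix(), and get_count() used to extract with
# three separate matches.
_URI_RE = re.compile(
    r"^(?P<base>.+)/(?P<short_name>[a-z]+)/(?P<prefix>0[1-9]+0)?(?P<count>[1-9][0-9]*)$"
)

def parse_uri(uri):
    m = _URI_RE.match(uri)
    return m.groupdict() if m else None

print(parse_uri("https://w3id.org/oc/meta/br/06101"))
# {'base': 'https://w3id.org/oc/meta', 'short_name': 'br', 'prefix': '0610', 'count': '1'}
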

Before: Pasted image 20251204173711.png

After: Pasted image 20251204173735.png

arcangelo7 Dec 4, 2025 · opencitations/oc_ocdm

perf(support): cache parse_uri results and refactor URI helper functions

Add LRU cache to parse_uri() and refactor get_short_name(), get_count(), get_prefix(), get_base_iri(), get_resource_number() to use cached parse_uri() instead of _get_match() which compiled regex on every call.

Benchmark improvement: ~1.6x speedup on find_paths operations.

+350 -33 efbe372
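Combined with the previous sketch, the caching layer might look like this; parse_uri and get_short_name are the real names from the commit, while the regex remains illustrative:

import re
from functools import lru_cache

_URI_RE = re.compile(
    r"^(?P<base>.+)/(?P<short_name>[a-z]+)/(?P<prefix>0[1-9]+0)?(?P<count>[1-9][0-9]*)$"
)

# lru_cache memoizes parse results, so repeated URIs (common when find_paths
# walks the same entities over and over) skip the regex entirely.
@lru_cache(maxsize=None)
def parse_uri(uri):
    m = _URI_RE.match(uri)
    return None if m is None else m.groupdict()

def get_short_name(uri):
    parsed = parse_uri(uri)
    return parsed["short_name"] if parsed else ""
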

Pasted image 20251204194107.png

arcangelo7 Dec 5, 2025 · opencitations/oc_ocdm

fix(sparql): migrate from SPARQLWrapper to sparqlite

[release]

+199 -255 a2b8957
arcangelo7 Dec 6, 2025 · opencitations/oc_ocdm

fix(sparql): add query chunking to prevent oversized SPARQL requests

Queries with more than 500 triples are now split into multiple chunks. The batching mechanism combines these chunks up to batch_size per request.

[release]

+139 -114 7b28aea

This logic used to live in oc_meta, but it required parsing the query, so I moved it into oc_ocdm. A chunk is either a single entity or a group of up to 500 triple patterns (because Virtuoso accepts at most 1000 patterns per query by default).
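A hedged sketch of that chunking rule (the actual implementation in oc_ocdm may differ): a chunk is either a single oversized entity, or an accumulation of whole entities up to 500 triple patterns.

MAX_PATTERNS = 500  # half of Virtuoso's default cap of 1000 patterns per query

def chunk_by_entity(entity_triples, max_size=MAX_PATTERNS):
    # entity_triples: one list of triple patterns per entity.
    chunk = []
    for triples in entity_triples:
        if len(triples) >= max_size:
            if chunk:
                yield chunk
                chunk = []
            yield list(triples)  # an oversized entity becomes its own chunk
        elif len(chunk) + len(triples) > max_size:
            yield chunk
            chunk = list(triples)
        else:
            chunk.extend(triples)
    if chunk:
        yield chunk
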

I moved the provenance database to another disk to reduce I/O contention.

arcangelo7 Dec 2, 2025 · skg-if/shacl-extractor

fix: generate per-module SHACL shapes with separate namespaces

  • Generate shapes in module-specific namespaces (e.g., skg_sh_agent:, skg_sh_grant:)
  • Add sh:targetClass only for root classes of each module
  • Use sh:node for nested entity validation instead of sh:class
  • Drop support for single-file ontologies (version < 1.0.1)
+811 -491 1284312
arcangelo7 Dec 4, 2025 · opencitations/sparqlite

test: add benchmark suite for SPARQL client comparison

Compares sparqlite, SPARQLWrapper, and rdflib against a Virtuoso endpoint. Includes analytics for generating performance reports.

+2039 -2 c882ebb
arcangelo7 Dec 4, 2025 · opencitations/sparqlite

docs: add benchmarks page comparing sparqlite vs rdflib and SPARQLWrapper

Add documentation describing benchmark methodology and results. Sparqlite shows significant performance advantages: 3-4x on CONSTRUCT, 2-2.5x on ASK, ~2x on UPDATE, and 1.4-1.8x on SELECT queries.

+3367 -337 95576bf

https://opencitations.github.io/sparqlite/architecture/benchmarks/

Each buffer caches one 8K page of data and occupies approximately 8700 bytes of memory (https://docs.openlinksw.com/virtuoso/ch-server/)

I was calculating the number of buffers as (memory × 0.85 × 0.66) / 8000; it should be (memory × 0.85 × 0.66) / 8700.
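In code, the corrected calculation (the 0.85 and 0.66 factors are the ones from my formula above):

BYTES_PER_BUFFER = 8700  # each buffer holds an 8K page but occupies ~8700 bytes

def number_of_buffers(memory_bytes):
    usable = memory_bytes * 0.85 * 0.66  # share of RAM dedicated to buffers
    return int(usable / BYTES_PER_BUFFER)

print(number_of_buffers(64 * 1024**3))  # 64 GB of RAM -> ~4.4M buffers
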

AsyncQueueMaxThreads: control the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better (https://github.com/dbcls/bh11/wiki/Virtuoso-configuration-tips)

“For a number of reasons, the rdf:type predicate (often expressed as a, thanks to SPARQL/Turtle semantic sugar) can be a performance killer. Removing those predicates from your graph pattern is likely to boost performance substantially. If needed, there are other ways to limit the solution set (such as by testing for attributes only possessed by entities your desired rdf:type) which do not have such negative performance impacts.” (https://stackoverflow.com/questions/39748060/complex-sparql-query-virtuoso-performance-hints)
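A toy illustration of that tip, with an entirely invented example.org schema (neither query reflects the actual data model): anchor the pattern on a property that only the target class carries instead of on rdf:type.

# Invented schema, purely to show the shape of the rewrite.
WITH_TYPE = """
SELECT ?s WHERE {
  ?s a <http://example.org/Article> ;
     <http://example.org/title> ?title .
}
"""

# Drop the rdf:type pattern and anchor on a property only Articles have.
WITHOUT_TYPE = """
SELECT ?s WHERE {
  ?s <http://example.org/publishedInIssue> ?issue ;
     <http://example.org/title> ?title .
}
"""
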

arcangelo7 Dec 5, 2025 · opencitations/virtuoso_utilities

feat(launch_virtuoso): add query parallelization and memory optimization

Add --parallel-threads option to configure CPU-based threading parameters. Automatically calculate AsyncQueueMaxThreads, ThreadsPerQuery, MaxClientConnections, and ServerThreads based on available CPU cores. Add MaxQueryMem calculation to prevent OOM errors during query execution. Enable adaptive vector sizing for better performance on large queries. Correct buffer size calculation from 8000 to 8700 bytes per buffer.

[release]

+307 -8 10c0803
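A minimal sketch of the CPU-based sizing this commit automates; only the AsyncQueueMaxThreads rule comes from the tips quoted above, and the other value is a placeholder, not virtuoso_utilities' actual formula.

import os

cores = os.cpu_count() or 1
config = {
    "AsyncQueueMaxThreads": int(cores * 1.5),  # 1.5 x cores, per the dbcls tips
    "ThreadsPerQuery": cores,  # placeholder assumption, not the real formula
}
print(config)
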
arcangelo7 Dec 5, 2025 · opencitations/virtuoso_utilities

feat(native_entrypoint): add docker-compose compatible entrypoint

Add virtuoso-native-launch command that configures Virtuoso from environment variables and delegates to the original Docker entrypoint. This enables using optimized Virtuoso configuration in docker-compose scenarios.

Add documentation and unit tests

+1637 -138 6b89c69
services:
  virtuoso:
    build:
      context: .
      dockerfile: Dockerfile.virtuoso
    environment:
      - VIRTUOSO_ENABLE_WRITE_PERMISSIONS=true
    volumes:
      - ./virtuoso-data:/opt/virtuoso-opensource/database
    ports:
      - "8890:8890"
      - "1111:1111"

FROM openlink/virtuoso-opensource-7:latest
RUN pip install virtuoso-utilities
ENTRYPOINT ["virtuoso-native-launch"]
arcangelo7 Dec 7, 2025 · opencitations/virtuoso_utilities

feat(benchmarks): add parallel SPARQL query benchmarks

Add pytest-benchmark tests measuring query performance at different parallelism levels (1, 25%, 50%, 75%, 100% of CPU cores) for SPO, DOI lookup, VVI, and mixed workloads.

Includes documentation page.

+5922 -6 51de91a

https://opencitations.github.io/virtuoso_utilities/benchmarks/
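The parallelism levels are measured along these lines (a sketch, where query_endpoint is a stand-in for the real benchmark's SPARQL client call):

import os
import time
from concurrent.futures import ThreadPoolExecutor

def run_level(query_endpoint, query, level_fraction):
    # Fire the same query from N worker threads and time the whole batch.
    workers = max(1, int((os.cpu_count() or 1) * level_fraction))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: query_endpoint(query), range(workers)))
    return time.perf_counter() - start

# e.g. run_level(endpoint, SPO_QUERY, 0.25) measures the 25%-of-cores level.
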

Pasted image 20251207175854.png

It is absurd that the same queries against the same database run perfectly in the benchmarks on my PC, while in production, on the same Virtuoso version, they trigger "Monitor: Locks are held for a long time".

I also tried running the same benchmark against the production database, and there too it shows no problems. A sign that the parallelism problem may not be Virtuoso's but my code's. Except I don't understand where, because the queries in the benchmark are identical in shape and size to the ones I run for Meta, and against the same database.

arcangelo7 Dec 6, 2025 · opencitations/python-package-template

feat: initial Python package template

Template for Python packages with:

  • Interactive setup.py script for project configuration
  • GitHub Actions workflows (tests, release, docs deployment)
  • UV-based dependency management with pyproject.toml
  • Semantic-release configuration for automated versioning
  • Optional Starlight documentation (generated dynamically via npm)
  • ISC license

Run python setup.py after creating a repository from this template.

+738 -0 bb0c921
arcangelo7 Dec 6, 2025 · opencitations/repository_setup_guides

docs: add Python package template guide

Add getting started guide for the template repository with:

  • Quick start section (5 steps to create a new package)
  • Under the hood section explaining workflows and configuration
  • Links to related guides for deeper understanding

Update homepage and sidebar to feature the template prominently.

+186 -2 eb57629

https://opencitations.github.io/repository_setup_guides/

arcangelo7 Dec 6, 2025 · dharc-org/aldrovandi-provenance

feat: add SharePoint folder structure extractor with cookie-based authentication

+26074 -3 103d4af

For each digitized object (263 in total across 6 rooms), generate meta.ttl and prov.nq in the 4 subfolders (raw, rawp, dcho, dchoo), with the metadata specific to each phase.

Subfolder | KG steps | Content                        | Software
raw/      | 00       | Raw acquisition photos         | Camera
rawp/     | 01       | 3D reconstruction              | 3DF Zephyr
dcho/     | 02-04    | Complete 3D model              | Blender, GIMP, Instant Meshes
dchoo/    | 05-06    | Optimized and published model  | CHAD-AP, Aton, Nextcloud
In reality, in the KG, for each object (e.g. the Carta Nautica, NR=1) there are two kinds of information:
  1. The DIGITIZATION process (steps 00-06)

These triples describe how the 3D model was created:

act/1/00/1 = "Federica Bonifazi photographed the object with a Nikon D7200"
act/1/01/1 = "Federica Bonifazi processed the photos with 3DF Zephyr"
act/1/02/1 = "Alice Bordignon modeled it with Blender"
…

  2. The history of the ORIGINAL OBJECT (ob00-ob08)

These triples describe the original physical object (the 1482 nautical chart):

act/1/ob00/1 = "Creation of the original work" (F28_Expression_Creation)
act/1/ob08/1 = "The museum curates this object" (curating)
itm/1/ob00/1 = the original physical item
exp/1/ob00/1 = the intellectual expression
wrk/1/ob00/1 = the work

BASE = "https://w3id.org/changes/4/aldrovandi"
raw/ -> act/{nr}/00/1, mdl/{nr}/00/1, tsp/{nr}/00/1, lic/{nr}/00/1, apl-lic/{nr}/00/1
rawp/ -> act/{nr}/01/1, mdl/{nr}/01/1, tsp/{nr}/01/1, lic/{nr}/01/1, apl-lic/{nr}/01/1
dcho/ -> act/{nr}/02-04/1, mdl/{nr}/02-04/1, tsp/{nr}/02-04/1, lic/{nr}/02-04/1, ...
dchoo/ -> act/{nr}/05-06/1, mdl/{nr}/05-06/1, tsp/{nr}/05-06/1, lic/{nr}/05-06/1, ...
raw/ -> only step 00 metadata
rawp/ -> step 00 + 01 metadata
dcho/ -> step 00 + 01 + 02 + 03 + 04 metadata
dchoo/ -> step 00 + 01 + 02 + 03 + 04 + 05 + 06 metadata
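The same cumulative rule, as data (derived directly from the mapping above, not from the extractor's code):

# Each subfolder accumulates the metadata of every step up to its last one.
FOLDER_STEPS = {
    "raw": ["00"],
    "rawp": ["00", "01"],
    "dcho": ["00", "01", "02", "03", "04"],
    "dchoo": ["00", "01", "02", "03", "04", "05", "06"],
}
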
data/output/
├── Sala1/
│   └── S1-01-CNR_CartaNautica/
│       ├── raw/
│       │   ├── meta.ttl   # step 00
│       │   └── prov.nq
│       ├── rawp/
│       │   ├── meta.ttl   # step 00 + 01
│       │   └── prov.nq
│       ├── dcho/
│       │   ├── meta.ttl   # step 00-04
│       │   └── prov.nq
│       └── dchoo/
│           ├── meta.ttl   # step 00-06
│           └── prov.nq
└── ...
  • Primary source and responsible agent: I already asked about these, right?
  • act/1/00/1: the first 1 is the object number? What is the second one?
  • I need to put both the digitization metadata and the original-object metadata in meta.ttl, right? Mapping steps 00-06 to raw-dchoo is "easy", but what do I do with ob00-ob08?
  • Primary source and responsible agent
  • act/1/00/1
    • If tomorrow someone performs a new acquisition activity on the same object, that becomes acquisition activity 2. In our case there is only activity 1. It is an extension mechanism.