Disclaimer: I’m in no way suggesting to directly manipulate data in the Alfresco database schema in general. It sometimes just helped me getting the job done (faster). I always feel better understanding persistent storage structures. That especially applies when data is
in production crucial, just like it can be the case with Alfresco. Thats even more the case when you’re running the community version as it is unsupported by Alfresco and its partners.
When I started working with Alfresco community sources in since 2007, I created a „quick and dirty“ diagram (based on v2.1) to get an overview how data is stored. Tables and relationships are pretty self-explanatory, so I only looked at it every once in a while when I needed help getting „a bigger picture“.
Quite recently I was given the challenge to upgrade an old 2.1 community system to 3.4.b. Data quality had suffered over the years for various reasons (i.e. transactional misbehaviour of the database). Various data migration patches failed as a matter of the data they were applied against. Given the fact that you can google almost „everything“ regarding Alfresco, I was quite suprised failing to find a (better) diagram than the one I created back in 2007. I finally started all over again creating a new one. It proved to be very valuable helping me resolving all data related problems during the migration and afterwards.
There are other situations when I go for raw SQL. Sometimes its just the fastest way to get what you want. Say, for example you want the node distribution across stores:
select count(*), concat(s.protocol,'://',s.identifier)
from alf_node n,alf_store s
where s.id=n.store_id group by n.store_id;
The diagram sources (MySQL Workbench) and a rendered PDF (the promised infographics ;) can be downloaded below.
PS: In the diagram, I tried to cluster tables semantically.
PPS: Just in case you dare to changes nodes: When in doubt, better rebuild the lucene index or you can run into unexpected behaviour.