csvToTk/toydb/Accessions:

"ID","accessionDate","accessionDispositionNote","accessionNumber","accessRestrictions","acquistionType","agreementReceived","agreementSent","cataloged","catalogedNote","conditionNote","description","extentNumber","extentType","inventory","processingPlan","restrictionsApply","rightsTransferred","title"
"2","2006-12-14","Entered By: OJ","2006.1214.2","",,"","","TRUE","Holdings In Public","Manuscript","Draft","0.5","Linear Feet","see CNN","Prelim Inv Completed","FALSE","FALSE","How I Would Have Done It"
"3","2007-02-20","Entered By: JohnStewart","2007.0220","May contain offensive truth",,"","","TRUE","Holdings In Plain View","PodCast","Final Version","1.5","Linear Feet","as seen on TV","Prelim Inv Completed","TRUE","FALSE","The Real Story About Fake News"

csvToTk/toydb/Names:

"ID","Name_personalPrimaryName","Name_personalRestOfName","Name_corporatePrimaryName","Name_contactAddress1","Name_contactCity","Name_contactRegion","Name_contactMailCode","Name_contactCountry","Name_contactPhone","Name_contactEmail"
"-2","personalPrimaryName","personalRestOfName","","contactAddress1","contactCity","contactRegion","contactMailCode","contactCountry","contactPhone","contactEmail"
"-1","","","corporatePrimaryName.","contactAddress1","contactCity","contactRegion","contactMailCode","contactCountry","contactPhone","contactEmail"
"2","Glue","Glitter ","","Elmer Lane","Stuckton","","0001","USA","444-222-0101","glitter@glue.com"
"1","","","Institute For Sticky Bits.","","","","","","","sticky@bits.edu"
"0","Glue","Glitter","Institute For Sticky Bits","Elmer Lane","Stuckton","","0001","USA","444-222-0101","glitter@glue.com"
csvToTk/toydb/Links:

"accessionNumber","tk_nameFunction","ID","Name_personalPrimaryName","Name_personalRestOfName","Name_corporatePrimaryName","Name_contactAddress1","Name_contactCity","Name_contactRegion","Name_contactMailCode","Name_contactCountry","Name_contactPhone","Name_contactEmail"
"2006.1214.2","Source","-2","personalPrimaryName","personalRestOfName","","contactAddress1","contactCity","contactRegion","contactMailCode","contactCountry","contactPhone","contactEmail"
"2007.0220","Subject","-1","","","corporatePrimaryName.","contactAddress1","contactCity","contactRegion","contactMailCode","contactCountry","contactPhone","contactEmail"
"2006.1214.2","Creator","2","Glue","Glitter ","","Elmer Lane","Stuckton","","0001","USA","444-222-0101","glitter@glue.com"
"2006.1214.3","Creator","2","Glue","Glitter ","","Elmer Lane","Stuckton","","0001","USA","444-222-0101","glitter@glue.com"
"2006.1214.2","Source","1","","","Intitute For Broken Bits","","","","","","","sticky@bits.edu"
"2006.1214.2","Source","0","Glue","Glitter","Institute For Sticky Bits","Elmer Lane","Stuckton","","0001","USA","444-222-0101","glitter@glue.com"

csvToTk/README.html:
csvToTk is a Perl script that converts CSV files to ingest files suitable for import by the Archivists' Toolkit.
run csvToTk -h to see the options available
As a courtesy it trims leading and trailing spaces from accession and name properties.
csvToTk: the script; it requires the DBI module (if it is not installed on your machine, get it from CPAN)
toydb: a toy database example with 2 accessions and 4 names
toydb/Accessions: the Accessions table in csv format
toydb/Names: the Names table in csv format
toydb/Links: the Links table in csv format, connecting accessions with names
At Georgia Tech we do not use all possible accession/name properties available in the toolkit. To add or remove the property fields used at your institution, you need to edit the script slightly: adjust the field name lists it uses, see FIELD_NAMES.
> csvToTk -d toydb -A Accesions.txt -N Names.txt
+++ Accessions: File 'Accesions.txt' 2 OK 0 FAILED
Name record on line 6 in toydb/Names is kaput, has bad name type:
  (Name_contactAddress1) => (Elmer Lane)
  (Name_contactPhone) => (444-222-0101)
  (Name_contactCountry) => (USA)
  (Name_contactMailCode) => (0001)
  (Name_corporatePrimaryName) => (Institute For Sticky Bits)
  (Name_contactCity) => (Stuckton)
  (Name_nameType) => (person and corporate)
  (Name_personalPrimaryName) => (Glue)
  (Name_contactEmail) => (glitter@glue.com)
  (Name_personalRestOfName) => (Glitter)
+++ Names: File 'Names.txt' 4 OK 1 FAILED

This produces the Accessions.txt and Names.txt files. csvToTk reports an error for the record on line 6 in toydb/Names because it contains person and corporate information. This record does not become part of the ingest file.
2 accessions are created as a result of ingesting Accessions.txt. 4 names and 4 more accessions are created as a result of ingesting Names.txt. The last 4 accessions have the numbers 9999.000.2/3/4/5. The 2, 3, 4, and 5 correspond to the lines of the toydb/Names table on which the name for each accession was defined. Creating fake accessions for names is an ugly workaround, necessary because the toolkit requires that names are ingested only in connection with accessions.
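csvToTk itself is Perl, but the fake-accession numbering scheme is easy to illustrate. The following Python sketch (the column names and sample rows imitate toydb/Names; everything else is an assumption, not the script's actual code) derives each fake accession number from the line of the Names table on which the name record appears:

```python
import csv
import io

# Two sample rows standing in for toydb/Names (header plus two name records).
names_csv = io.StringIO(
    '"ID","Name_personalPrimaryName","Name_corporatePrimaryName"\n'
    '"2","Glue",""\n'
    '"1","","Institute For Sticky Bits."\n'
)

# Each name gets a fake accession numbered 9999.000.<line>, where <line>
# is the 1-based line of the Names table on which the name record appears.
fake_accessions = []
for line_no, row in enumerate(csv.reader(names_csv), start=1):
    if line_no == 1:        # line 1 is the column header
        continue
    fake_accessions.append(f"9999.000.{line_no}")

print(fake_accessions)      # ['9999.000.2', '9999.000.3']
```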
> csvToTk -d toydb -t toolkit_test -L Links.sql
Can't find Accession (2006.1214.3) in toolkit database
  (Name_contactAddress1) => (Elmer Lane)
  (tk_nameFunction) => (Creator)
  (Name_contactCountry) => (USA)
  (Name_contactPhone) => (444-222-0101)
  (Name_contactMailCode) => (0001)
  (Name_contactCity) => (Stuckton)
  (Name_personalPrimaryName) => (Glue)
  (tk_nameId) => (39)
  (Name_contactEmail) => (glitter@glue.com)
  (accessionNumber) => (2006.1214.3)
  (Name_personalRestOfName) => (Glitter )
Link Record on line 5 in toydb/Links is kaput
Can't find Name in toydb/Names;
  (tk_nameFunction) => (Source)
  (Name_corporatePrimaryName) => (Intitute For Broken Bits)
  (Name_contactEmail) => (sticky@bits.edu)
  (accessionNumber) => (2006.1214.2)
Link Record on line 6 in toydb/Links is kaput
Can't find Name in toolkit_test database
  (Name_contactAddress1) => (Elmer Lane)
  (tk_nameFunction) => (Source)
  (Name_contactCountry) => (USA)
  (Name_contactPhone) => (444-222-0101)
  (Name_corporatePrimaryName) => (Institute For Sticky Bits)
  (Name_contactMailCode) => (0001)
  (Name_contactCity) => (Stuckton)
  (Name_personalPrimaryName) => (Glue)
  (Name_contactEmail) => (glitter@glue.com)
  (accessionNumber) => (2006.1214.2)
  (Name_personalRestOfName) => (Glitter)
Link Record on line 7 in toydb/Links is kaput
+++ Links: File 'Links.sql' 3 OK 3 FAILED
The script reads link information from toydb/Links. Each record/line in the csv table connects an accession with a name. Each line/record in toydb/Links is expected to

refer to an accession number that exists in the toolkit's database
contain name information that corresponds to a record in toydb/Names
refer to a name that exists in the toolkit's database
The script prints error messages if any of these conditions is violated. In the example, the first error complains that the link defined on line 5 in toydb/Links refers to an accession with an unknown number. The second error complains about the name not corresponding to anything in toydb/Names: 'Intitute For Broken Bits' is misspelled. The third error is a follow-up error: since the name record on line 6 in toydb/Names was not ingested (it contains person and corporate information), it cannot be found in the toolkit's database.
The script writes an INSERT SQL statement to Links.sql that will connect the toolkit's accession with the given name if (and only if) the link seems to be 'good'. It uses the tk_nameFunction value from the Links table to determine the function of this name. Allowed values are "Source", "Creator", and "Subject".
Accession numbers in Link tables are split into 4 parts using '.' as delimiter. For example, 2006.1214.2 is interpreted as
accessionNumber1 '2006'
accessionNumber2 '1214'
accessionNumber3 '2'
accessionNumber4 ''
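The splitting rule above can be sketched as follows; this is an illustrative Python version, not the script's actual Perl code. Missing trailing parts come back as empty strings, matching the example:

```python
def split_accession_number(num: str):
    """Split a toolkit accession number into its four parts on '.'.
    Missing trailing parts become empty strings."""
    parts = num.split(".")
    parts += [""] * (4 - len(parts))   # pad to exactly four parts
    return parts[:4]

print(split_accession_number("2006.1214.2"))
# ['2006', '1214', '2', '']
```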
The script selects the following accession fields:

"accessionDate", "accessionNumber", "title", "extentNumber", "extentType", "acquistionType", "agreementReceived", "agreementSent", "inventory", "processingPlan", "accessRestrictions", "restrictionsApply", "rightsTransferred", "cataloged", "catalogedNote", "accessionDispositionNote", "conditionNote", "description"

and the following name fields:

"Name_corporatePrimaryName", "Name_corporateSubordinate1", "Name_corporateSubordinate2", "Name_contactAddress1", "Name_contactCity", "Name_contactRegion", "Name_contactMailCode", "Name_contactCountry", "Name_contactPhone", "Name_contactEmail", "Name_personalPrimaryName", "Name_personalRestOfName", "Name_contactAddress1", "Name_contactCity", "Name_contactRegion", "Name_contactMailCode", "Name_contactCountry", "Name_contactPhone", "Name_contactEmail"

To adjust the script, edit the lists

my @accession_fields
my @names_fields

csvToTk/DataImport.xhtml:
Georgia Tech Archives is moving its accessions database from a home-grown SQL/PHP-based archival database tool to the Archivists' Toolkit, which was recently released. The easiest route, I thought, was to export the current data to two spreadsheets in csv format, clean it up, and import the results. It couldn't be that hard: just convert csv to Toolkit format.
That was as easy as pushing the right buttons in phpMyAdmin. We ended up with two tables: one described the donors, the other the accessions, linked to donors via donor ids.
From the outset we planned to clean up the data. Initially we thought we had to purge duplicate donor records, but it turned out that step wasn't actually necessary, so I skip the details here. The cleanup involved making accession numbers more regular, correcting typos that came to our attention when we saw all accessions/names in one big table, ...
Downloading and installing the Toolkit client on the Windows machines was a breeze. The same was unfortunately not true for my Ubuntu machine. First the installer refused to even do its magic, dying with mysterious messages about not finding system libraries. The installer that includes the JRE had the same problem. Luckily Jason Fowler posted his solution on the Toolkit's mailing list; see his blog entry.
The only thing I can add here: “vi -b” provides a nice binary edit as well, or do

cat [FileName] | sed "s/export LD_ASSUME_KERNEL/#export LD_ASSUME_KERNEL/"
Redefine column names in the csv files to match the names used in the Toolkit.
Write a Perl script to generate a Toolkit import file using the data from the csv files.
Let the Toolkit read it.
It did not turn out to be that easy.
The Toolkit offers to import accessions and names data from a flat csv-style file. It insists on tab-delimited input, which I found rather unreadable when I needed to check whether my script generated the correct information.
In addition, and more importantly, since this is a flat csv file in which each property can only head up one column, it is not possible to link an accession to more than one donor/name. Accessions and names in the Toolkit data model have a many-to-many relationship: a name can be linked to several accessions and an accession can have multiple names linked to it. Thus the import format is far less expressive than the actual data model supported. It also doesn't match well with the data institutions will want to import.
In our not untypical case, we have many donors that represent a person and a corporation, which need to be represented as two distinct name records in the Toolkit. Both records need to be linked to the accession that they donated. Given the Toolkit's import format this is not possible. We had to devise a workaround.
Here follows, without mentioning the false starts, how we got our accessions data into the Toolkit:
I wrote a script that read an accessions csv file and generated a Toolkit import file, selecting only those values from the csv file that are of interest to the Toolkit. This was straightforward. We could have imported a csv file straight into the SQL database, but I was wary of importing 'bad' data, like dates in the future, accession numbers that don't live up to Toolkit standards, ... I wanted to leave the consistency checking to the Toolkit.
Our archivist manually cleaned the donor information up. We ended up with a table that contained a name record for each person and each corporate donor. Where we previously had one record with a dual purpose we now had a name record for the corporation as well as the person. The archivist had to decide whether phone number, contact address information, and so forth applied to the person, the corporation, or both. There was no way around employing some human intelligence.
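The splitting step can be sketched as follows. This is an illustrative Python version, not what actually happened: it naively copies every contact field to both records, whereas in reality the archivist decided where each contact field belonged. The field names are the ones used in the Names table:

```python
def split_dual_donor(record):
    """Split a record that carries both a person and a corporate name into
    two separate name records (the Toolkit allows only one type per record).
    This sketch copies all contact fields to both records."""
    person_keys = ("Name_personalPrimaryName", "Name_personalRestOfName")
    corporate_keys = ("Name_corporatePrimaryName",)
    contact = {k: v for k, v in record.items()
               if k not in person_keys + corporate_keys}
    person = {**{k: record.get(k, "") for k in person_keys}, **contact}
    corporate = {**{k: record.get(k, "") for k in corporate_keys}, **contact}
    return person, corporate

dual = {
    "Name_personalPrimaryName": "Glue",
    "Name_personalRestOfName": "Glitter",
    "Name_corporatePrimaryName": "Institute For Sticky Bits",
    "Name_contactEmail": "glitter@glue.com",
}
person, corporate = split_dual_donor(dual)
print(person)     # person name plus the contact fields
print(corporate)  # corporate name plus the contact fields
```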
I added an option to my script to convert the donor name data into a Toolkit import file. The donors could not be imported by themselves, so I generated fake accessions with numbers '9999.000.1', '9999.000.2', ..., one accession for each name. After importing this file and deleting the 'fake' accessions, all information about accessions and donors was in the Toolkit's database, except for the way they are linked to each other.
With some more spreadsheet magic our archivist generated a Links table, which contained accession ids and name information. He basically had to replace donor ids in the accessions table with the full donor/name information.
I added another option to the script that read the Links table from its csv file. For each link it
checked whether the name information in the csv links table corresponded to a name in the csv names table (just to make sure the input data in the csv files was actually consistent)
looked for an accession with the given id in the Toolkit database
searched for the name from the Links table in the Toolkit database
if everything went well it generated an SQL statement that would link the Toolkit's accession with the given name
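The steps above can be sketched like this. The sketch is illustrative Python, not the actual Perl script: the two id mappings stand in for the Toolkit database lookups, and the table and column names in the generated INSERT are hypothetical placeholders, not the Toolkit's real schema.

```python
def validate_link(link, csv_names, tk_accessions, tk_names):
    """Check one Links row against the csv Names table and the (mocked)
    Toolkit database, then emit an INSERT statement or an error."""
    key = (link.get("Name_personalPrimaryName", ""),
           link.get("Name_corporatePrimaryName", ""))
    if key not in csv_names:
        return None, "Can't find Name in csv Names table"
    if link["accessionNumber"] not in tk_accessions:
        return None, f"Can't find Accession ({link['accessionNumber']}) in toolkit database"
    if key not in tk_names:
        return None, "Can't find Name in toolkit database"
    sql = ("INSERT INTO accession_name_link (accession_id, name_id, name_function) "
           f"VALUES ({tk_accessions[link['accessionNumber']]}, {tk_names[key]}, "
           f"'{link['tk_nameFunction']}');")
    return sql, None

sql, err = validate_link(
    {"accessionNumber": "2006.1214.2", "tk_nameFunction": "Creator",
     "Name_personalPrimaryName": "Glue"},
    csv_names={("Glue", "")},               # name keys seen in the csv Names table
    tk_accessions={"2006.1214.2": 7},       # accession number -> toolkit id (made up)
    tk_names={("Glue", ""): 39},            # name key -> toolkit id (made up)
)
print(sql)
```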
Last, we imported the generated SQL file into the Toolkit's database.
In a first round I generated the person and corporate name records by simply using name information like contact address, fax, and email for both name records. I got to the point where the Toolkit informed me that all name records were imported successfully. In the next step of importing links, my script complained about not finding many of the name records that were imported before. It turned out that the Toolkit had quietly assumed that the corporate records:
Georgia Tech, 404-898-0819
and
Georgia Tech, 786-234-1937
were the same. This is not an unreasonable assumption, but I would have liked to see a warning that the second record was not imported as a new corporate name. I was certainly glad my script had done some consistency checking.
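The matching behaviour we observed can be modelled with a sketch like this. It is an assumption about what the Toolkit does internally, inferred from the failed imports, not its actual code: corporate names appear to be keyed on the primary name alone, so records differing only in contact details collapse into one.

```python
# Two corporate records that differ only in contact details; the names and
# phone numbers are the ones from the example above.
records = [
    {"Name_corporatePrimaryName": "Georgia Tech", "Name_contactPhone": "404-898-0819"},
    {"Name_corporatePrimaryName": "Georgia Tech", "Name_contactPhone": "786-234-1937"},
]

seen = {}
for rec in records:
    key = rec["Name_corporatePrimaryName"]   # assumed matching key: primary name only
    if key in seen:
        print(f"record for '{key}' silently dropped: {rec['Name_contactPhone']}")
    else:
        seen[key] = rec

print(len(seen))   # only one corporate name survives
```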
Obviously we had to decide how to match up properties from our legacy database with the ones in the Toolkit, which turned out to be easy because the Toolkit offers a complete list of choices.
Our legacy database's definition of a donor does not match perfectly with the Toolkit's definition of a name. So there was some unavoidable work and data massaging necessary here.
The biggest complication we encountered was the fact that accessions are imported through only one table. Therefore accessions and donors can be imported only in pairs. The Toolkit is smart enough to generate only one name where the relevant name properties match across several accession import lines. Thus it supports a many-accessions-to-one-donor relationship, but it does not help in our case, where we needed to import accessions that were linked to multiple donors. Having to work around this restriction was the biggest obstacle to importing our data.
Now that I have this script in place, we will also be able to add creator links to accessions once we have that data cleaned up and arranged in a suitable way.
The Toolkit provides an 'initialize-toolkit-database' script. I got tired of calling that program up (yes, there are some pretty pictures), so I exported the database right after setting it up. When I needed a fresh, empty database I simply imported that file back in.
Attached
The script and its documentation can be browsed and downloaded from my website: Browse, Download tar file.
csvToTk/csvToTk.pl:

#!/usr/bin/perl

# Globals
#
use DBI;
use vars qw/ %opt /;
use strict vars;

my $dbDir            = "db";
my $sep              = "\t";
my $tkName           = "toolkit";
my $tkUser           = "root";
my $tkPasswd         = "";
my $accessionsTable  = "Accessions";
my $namesTable       = "Names";
my $linksTable       = "Links";
my $accessionsFile;
my $namesFile;
my $linksFile;
my $doTrace          = 0;
my $doQueryTrace     = 0;
my @accession_fields = ();
my @names_fields     = ();
my @link_fields      = ();

sub usage() {
    print STDERR << "EOF";
usage: $0 [-vVDAd]
    -A filename : print ingest file for accessions to given file
    -N filename : print ingest file for names to given file
    -L filename : print sql import file for accession,name links to given file
    -a name : use