Great minds on the XML structure, David. That's how I'm doing the cleanup-before-the-cleanup in Notepad++. I'm assigning arbitrary tags, and OpenRefine will parse the nested tags into columns.
I've used some RegEx on it so far, but brute-force copying and pasting also moves fairly quickly with the laptop and some light entertainment on the TV.
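For anyone who would rather script the tagging than do it by hand, here is a minimal Python sketch of the same idea. It assumes a line that looks like "place, Month day, year." and wraps it in arbitrary tags. The pattern, the tag names, and the sample line are all hypothetical and not taken from the book, so the expression would need adjusting to the real layout of the text file.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: "place, Month day, year" with an optional final period.
# The actual text file may be laid out differently.
DATE_LINE = re.compile(
    r"^(?P<place>.+?),\s+(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\.?$"
)

def tag_session_line(line: str) -> str:
    """Wrap a 'place, date' line in arbitrary tags; return other lines unchanged."""
    match = DATE_LINE.match(line.strip())
    if match is None:
        return line
    place = escape(match.group("place"))  # escape &, <, > so the XML stays valid
    date = escape(match.group("date"))
    return f"<session><place>{place}</place><date>{date}</date></session>"

# Made-up sample line for illustration only.
print(tag_session_line("Cityname, January 1, 1920."))
```

The same expression should also work in Notepad++'s Replace dialog with the regular expression search mode turned on, though the replacement syntax there differs from Python's.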
Maristella
-----Original Message-----
From: Association for Recorded Sound Discussion List <[log in to unmask]> On Behalf Of David Day
Sent: Thursday, May 23, 2019 12:41 PM
To: [log in to unmask]
Subject: [EXT] Re: [ARSCLIST] ARSCLIST Digest - 21 May 2019 to 22 May 2019 (#2019-119)
I would also be interested to see if you can succeed in converting the PDF to a database. I have taken a first step, which others may also have already done, by converting the PDF to a plain text file (Unicode UTF-8, Legacy Mac CR). You can access the text file here:
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbyu.box.com%2Fs%2Fs2otejraibhbfr5c8kohwasx2inxhm6u&data=02%7C01%7CMaristella.Feustle%40UNT.EDU%7Ce6098f6d16164aa746b108d6dfa5f3da%7C70de199207c6480fa318a1afcba03983%7C0%7C1%7C636942301195758328&sdata=VhXgqNaZB139GrgT0lRaUxjRGrvrPmjooxZptVorcOc%3D&reserved=0
I have used OpenRefine for a few other projects, and I am not sure it is the best solution, for the following reasons.
1. As noted, OpenRefine works with CSV files or files that are basically structured like a spreadsheet.
2. Converting the text file to CSV is possible, but it would require a lot of manual editing to get there. At least that is my understanding; I would be eager to learn if there are ways around that. The tools that Maristella mentioned might make the process easier.
3. The data is not a straightforward spreadsheet structure: for any given column or category of information there can be multiple entries, like repeating fields in the MARC format. There are ways around that, like using the “fill down” feature in OpenRefine, but that results in a lot of duplication.
4. An XML structure might be a better way to adapt this data for an online database. This approach would also involve a lot of manual editing, but it is a better way to deal with repeating data in the same category. I think it could be approached by creating a set of XML tags that would be appropriate for managing the data and then using something like Notepad++ to place those tags where appropriate in the text.
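To illustrate points 3 and 4, here is a small Python sketch of the trade-off. It is only an illustration under assumptions: the tag names (`session`, `performer`, `title`) and the placeholder values are invented for the example, not a proposed schema. The XML record holds the repeating field once per value, while flattening it to spreadsheet rows copies the performer onto every row, which is the duplication that "fill down" produces.

```python
import xml.etree.ElementTree as ET

# Tag names and values are placeholders chosen for illustration only.
record = ET.Element("session")
ET.SubElement(record, "performer").text = "PERFORMER A"
for title in ("TITLE 1", "TITLE 2", "TITLE 3"):
    ET.SubElement(record, "title").text = title  # repeating field, one element each

def flatten(session):
    """Return one (performer, title) row per <title>, repeating the performer."""
    performer = session.findtext("performer")
    return [(performer, t.text) for t in session.findall("title")]

# The XML form stores the performer once.
print(ET.tostring(record, encoding="unicode"))

# The flattened form repeats the performer on every row.
for row in flatten(record):
    print(row)
```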
All of this conversion could be approached more easily if there is access to the electronic file or data that was used to create the PDF. Does anyone know if that file or data is available?
Keep me posted on your progress. I would be happy to help if there is anything I could contribute.
David Day
On May 22, 2019, at 10:00 PM, ARSCLIST automatic digest system <[log in to unmask]<mailto:[log in to unmask]>> wrote:
From: ARSCLIST automatic digest system <[log in to unmask]<mailto:[log in to unmask]>>
Subject: ARSCLIST Digest - 21 May 2019 to 22 May 2019 (#2019-119)
Date: May 22, 2019 at 10:00:00 PM MDT
To: <[log in to unmask]<mailto:[log in to unmask]>>
Reply-To: Association for Recorded Sound Discussion List <[log in to unmask]<mailto:[log in to unmask]>>
There are 3 messages totaling 256 lines in this issue.
Topics of the day:
1. [EXT] [ARSCLIST] anyone database-ized Brian Rust's: Jazz Records 1917–1934? (3)
From: "Feustle, Maristella" <[log in to unmask]<mailto:[log in to unmask]>>
Subject: Re: [EXT] [ARSCLIST] anyone database-ized Brian Rust's: Jazz Records 1917–1934?
Date: May 22, 2019 at 6:09:28 AM MDT
Are you familiar with OpenRefine? It would take some work, but you could copy/paste this into a text file and wangle it thusly. The Transpose function will convert row headings into columns.
The text file itself will require some cleanup before the cleanup in OpenRefine. A reader like Notepad++ that allows mass edits with Regular Expressions (RegEx) will help.
Of course, I intend to mess with this on my own. If I get it to something I like, I'll share it.
Maristella
________________________________
From: Association for Recorded Sound Discussion List <[log in to unmask]<mailto:[log in to unmask]>> on behalf of Brewster Kahle <[log in to unmask]<mailto:[log in to unmask]>>
Sent: Tuesday, May 21, 2019 7:45:26 PM
To: [log in to unmask]<mailto:[log in to unmask]>
Subject: [EXT] [ARSCLIST] anyone database-ized Brian Rust's: Jazz Records 1917–1934?
It is a huge help that the pdf is available (thank you Mainspring!). I would like to use it for automatically finding dates for records in the Great 78 Project and point back to the right page in the book.
To do this I need it in a format like a CSV (but can convert it from any other database-like format)
Label, catno, matrix, performer, title, date, page-number
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2F78records.files.wordpress.com%2F2016%2F06%2Frust_jr_free-edition.pdf&data=02%7C01%7CMaristella.Feustle%40UNT.EDU%7Ce6098f6d16164aa746b108d6dfa5f3da%7C70de199207c6480fa318a1afcba03983%7C0%7C0%7C636942301195768318&sdata=txDi%2FLiQeJQ9IRUCovhhbJotkowkqljBDXN8teIZJYE%3D&reserved=0
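As a rough sketch of the target format, the column list above could be written out with Python's standard `csv` module as follows. The field names are the ones listed in this message; the single row of angle-bracket values is a placeholder, not real data from the book, and the helper name `rows_to_csv` is made up for the example.

```python
import csv
import io

# Column names as listed in the message above.
FIELDS = ["Label", "catno", "matrix", "performer", "title", "date", "page-number"]

def rows_to_csv(rows):
    """Serialize a list of dicts keyed by FIELDS into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder row: each value is just the field name in angle brackets.
placeholder = {field: f"<{field}>" for field in FIELDS}
print(rows_to_csv([placeholder]))
```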
I did this for Almost Complete ...
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Farchive.org%2Fdetails%2Falmostcomplete7800barr%2Fpage%2F115&data=02%7C01%7CMaristella.Feustle%40UNT.EDU%7Ce6098f6d16164aa746b108d6dfa5f3da%7C70de199207c6480fa318a1afcba03983%7C0%7C0%7C636942301195768318&sdata=NteQzwD3y8IkvGlzjykw8j95%2BrSLBCowIDhkR3P6XBU%3D&reserved=0
and American 45. and 78...
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Farchive.org%2Fstream%2Famerican45fortyf00dani%23page%2F38%2Fmode%2F1up&data=02%7C01%7CMaristella.Feustle%40UNT.EDU%7Ce6098f6d16164aa746b108d6dfa5f3da%7C70de199207c6480fa318a1afcba03983%7C0%7C0%7C636942301195768318&sdata=LumtTH7jECYQTAZL6LM8J%2FLXV4gjnac7M29hMKzYSCo%3D&reserved=0
by having these hand-keyed, and each month I use this for the new records.
Does anyone have any databased version of this? It must have started out in such a thing.
Thank you!
-brewster
From: Brewster Kahle <[log in to unmask]>
Subject: Re: [EXT] [ARSCLIST] anyone database-ized Brian Rust's: Jazz Records 1917–1934?
Date: May 22, 2019 at 9:51:59 AM MDT
thank you.
I never thought of using OpenRefine (I think that derived from the powerful tool Google bought with Metaweb), or other online tools. I wonder how much time it would take to massage this pdf.
if you get somewhere on this, please let me know.
-brewster
From: David Diehl <[log in to unmask]>
Subject: Re: [EXT] [ARSCLIST] anyone database-ized Brian Rust's: Jazz Records 1917–1934?
Date: May 22, 2019 at 12:05:54 PM MDT
If you're really in the mood for fun, Rust's British Dance Bands book has been converted to PDF, but the only text that has been OCR'ed is the recording locations and dates. https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fopenlibrary.org%2Fbooks%2FOL22062072M%2FBritish_dance_bands_on_record_1911_to_1945&data=02%7C01%7CMaristella.Feustle%40UNT.EDU%7Ce6098f6d16164aa746b108d6dfa5f3da%7C70de199207c6480fa318a1afcba03983%7C0%7C1%7C636942301195778312&sdata=c7P9q2kfQn6dmga9gjS4ZXHi3XIMSiJQ%2BtsMcIVfhME%3D&reserved=0