
Research team digitizes more than 100 years of Canadian infectious disease data

123 points - last Friday at 6:24 AM

Source
  • arjie

    today at 8:41 AM

    This is very cool. I wonder if in their process they stored raw scans as well, or if the transcriptions were made from the source material directly. The former would be fantastic if possible, since present or future OCR technology could then be cross-referenced against the transcriptions, both to catch human error in the dataset and to improve OCR by using it as a labeled dataset.

    It also seems like a huge effort to try to come up with a data model here for the normalized dataset. They mentioned it in the article as an aside but it seems like a pretty tough task.

    And my last thought is perhaps there is a sadness in our loss of population level insight into health with the advent of modern privacy concerns. A big source for genomics data is the UK Biobank which ties all sorts of information to a genome. I’m sure that someone could come up with dangers that this presents but it’s been a gift to so many researchers over time, and to so many people who suffer from genetic disease. I hope that in time people will be willing to volunteer sufficient information to be able to do population-level science, even knowing the dangers inherent.

    If you’re in the US, All of Us accepts participation and I found that doing so was quite easy. They will give you all the scary warnings, which are good to consider, but I hope many will find it worthwhile even knowing the risks.

    • akudha

      today at 2:00 AM

      What useful tools can be made from such a dataset?

      The other day I came across this pricing dataset https://oria-data.trillianthealth.com/ (this is just for pricing though)

      There must be some gem datasets like these - I wish I had the time (and expertise) to explore

      • rumplecat

        today at 7:22 AM

        Interesting that they manually transcribed the data to Excel. It would also be interesting to know how they mapped from the Excel files to the final dataset. I wonder if LLMs could do the switch from scans to structured data more efficiently, and how much of a hit to accuracy would be involved.

        • toomuchtodo

          yesterday at 11:27 PM

          https://journals.plos.org/globalpublichealth/article?id=10.1...

          https://canmod.net/digitization/

            • water-your-self

              today at 1:56 AM

              Second link is the database

          • tim-tday

            today at 3:24 AM

            Do you want computer viruses? Because that’s how you get computer viruses.

            • pointbob

              today at 3:05 AM

              [flagged]

              • temptemptemp111

                yesterday at 11:43 PM

                [dead]