
OCR for construction documents does not work, we fixed it

79 points - today at 4:05 PM


So we've built an API and trained models that detect fixtures, extract schedules, and analyze construction documents. Check us out!

More examples: - https://www.getanchorgrid.com/developer/docs/endpoints/drawi...

Main website: - https://www.getanchorgrid.com/developer

Why we did it: https://www.getanchorgrid.com/developer/docs/changelog/const...

  • Terr_

    today at 5:34 PM

    > OCR for construction documents does not work

    I'm reminded of the Xerox JBIG2 bug back in ~2013, where certain scan settings could silently replace numbers inside documents, and bad construction-plans were one of the cases that led to it being discovered. [0]

    It wasn't overt OCR per se; end users weren't intending to convert pixels to characters or vice versa.

    [0] https://www.youtube.com/watch?v=c0O6UXrOZJo&t=6m03s

      • hackcasual

        today at 8:30 PM

        JBIG2 does glyph binning, as you say not exactly OCR, but similar. So chunks of the image that look sufficiently similar get replaced with a reference to a single instance.
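To make the glyph-binning failure mode concrete, here is a minimal Python sketch. It is a toy, not the JBIG2 algorithm itself: the 3x5 bitmaps, the `pixel_diff` measure, and the `max_diff` threshold are all assumptions chosen for illustration. It shows how a matching threshold that is too loose lets an "8" be stored as a reference to a previously seen "6".

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # a tiny bitmap: rows of '#' (ink) and '.' (blank)


def pixel_diff(a: Glyph, b: Glyph) -> int:
    """Count differing pixels between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def bin_glyphs(glyphs: List[Glyph], max_diff: int) -> Tuple[List[Glyph], List[int]]:
    """Return (dictionary, refs); each input glyph becomes an index into dictionary."""
    dictionary: List[Glyph] = []
    refs: List[int] = []
    for g in glyphs:
        for i, proto in enumerate(dictionary):
            if pixel_diff(g, proto) <= max_diff:
                refs.append(i)  # reuse the prototype; this glyph's own pixels are dropped
                break
        else:
            dictionary.append(g)  # no close match: store as a new prototype
            refs.append(len(dictionary) - 1)
    return dictionary, refs


# Hypothetical 3x5 digits that differ by a single pixel.
SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")

page = [SIX, EIGHT, SIX]
dictionary, refs = bin_glyphs(page, max_diff=1)
decoded = [dictionary[i] for i in refs]  # what a reader of the compressed page sees
```

With `max_diff=1` the decoded page contains three sixes, and nothing in the output flags the substitution. With `max_diff=0` the two digits stay distinct.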

        • TehCorwiz

          today at 5:48 PM

          If I recall it was an artifact of the compression algo.

          Full context and details: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

      • sreekanth850

        today at 6:53 PM

        We’re taking a different path: building a parsing engine that converts CAD (DWG/DXF) into fully structured JSON with preserved semantics (no ML in the critical path). We also have a separate GIS parser that extracts vector data (features, layers, geometries) independently. I'd like to know how you handle consistency and reproducibility across runs when using models, and how you make it affordable, especially at scale, because as far as I know CAD and GIS need precision and accuracy.
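As a rough illustration of why native parsing is reproducible, the sketch below reads the ASCII DXF group-code format with only the Python standard library and emits JSON. It is not the engine described in the comment: `read_pairs`, `extract_lines`, and the `SAMPLE` drawing are hypothetical, only `LINE` entities are handled, and binary DWG files need far more machinery.

```python
import json
from typing import Dict, Iterator, List, Optional, Tuple

# DXF group codes for a LINE entity: 8 = layer, 10/20 = start x/y, 11/21 = end x/y.
COORDS = {10: ("start", 0), 20: ("start", 1), 11: ("end", 0), 21: ("end", 1)}


def read_pairs(text: str) -> Iterator[Tuple[int, str]]:
    """ASCII DXF is a flat sequence of (group code, value) line pairs."""
    lines = [ln.strip() for ln in text.splitlines()]
    for code, value in zip(lines[0::2], lines[1::2]):
        yield int(code), value


def extract_lines(text: str) -> List[Dict]:
    """Collect LINE entities from the ENTITIES section."""
    entities: List[Dict] = []
    current: Optional[Dict] = None
    in_entities = False
    expecting_section_name = False
    for code, value in read_pairs(text):
        if code == 0:  # group code 0 starts a new entity or structural marker
            if current is not None:
                entities.append(current)
            current = None
            expecting_section_name = value == "SECTION"
            if value == "ENDSEC":
                in_entities = False
            elif in_entities and value == "LINE":
                current = {"type": "LINE", "layer": None,
                           "start": [0.0, 0.0], "end": [0.0, 0.0]}
        elif code == 2 and expecting_section_name:
            in_entities = value == "ENTITIES"
            expecting_section_name = False
        elif current is not None:
            if code == 8:
                current["layer"] = value
            elif code in COORDS:
                key, axis = COORDS[code]
                current[key][axis] = float(value)
    return entities


# Hypothetical one-line drawing on a layer named A-WALL.
SAMPLE = "\n".join([
    "0", "SECTION", "2", "ENTITIES",
    "0", "LINE", "8", "A-WALL", "10", "0.0", "20", "0.0", "11", "12.5", "21", "0.0",
    "0", "ENDSEC", "0", "EOF",
])

parsed = extract_lines(SAMPLE)
as_json = json.dumps(parsed, sort_keys=True)
```

Because nothing here is probabilistic, parsing the same file twice yields byte-identical JSON, which is the reproducibility property the comment is asking the vision approach to match.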

          • wcisco17

            today at 7:37 PM

            interesting yeah parsing DWG/DXF natively makes sense when the source file is clean and well-structured. The precision argument is valid in controlled environments.

            The challenge we kept running into is that construction drawings in the wild aren’t always that clean. Unresolved xrefs, exploded dynamic blocks, version incompatibilities, SHX font substitutions — by the time a PDF hits a GC’s desk it’s often the only reliable artifact left. The CAD source may not even be available.

            That’s why we see vision becomes the more pragmatic path — not because it’s more precise than structured CAD parsing, but because PDFs are the actual lingua franca of construction. Every firm, every trade, every discipline hands off PDFs. So we made a bet on meeting the document where it actually lives.

            On consistency and reproducibility — that’s a real challenge with vision models. Our approach is to keep detection scope narrow and validate confidence scores on every output rather than trying to generalize broadly. Happy to go deeper on that if useful.
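One way to picture "narrow scope plus confidence validation" is the sketch below. It is an assumption about the general shape of such a filter, not AnchorGrid's actual pipeline: the `Detection` type, the `SUPPORTED` label set, and the `MIN_CONFIDENCE` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float  # model score in [0, 1]
    bbox: Tuple[int, int, int, int]  # x, y, width, height in pixels


SUPPORTED = {"door", "window"}  # hypothetical narrow detection scope
MIN_CONFIDENCE = 0.80           # hypothetical acceptance threshold


def triage(detections: Iterable[Detection]) -> Tuple[List[Detection], List[Detection]]:
    """Split in-scope detections into (accepted, needs_review)."""
    accepted: List[Detection] = []
    needs_review: List[Detection] = []
    for det in detections:
        if det.label not in SUPPORTED:
            continue  # out of scope: not reported at all
        if det.confidence >= MIN_CONFIDENCE:
            accepted.append(det)
        else:
            needs_review.append(det)  # low score: route to a human instead of guessing
    return accepted, needs_review


raw = [
    Detection("door", 0.95, (10, 10, 40, 40)),
    Detection("door", 0.55, (90, 10, 40, 40)),
    Detection("sprinkler", 0.99, (50, 50, 8, 8)),
]
accepted, needs_review = triage(raw)
```

Note that a filter like this only makes the reported output more trustworthy; it does not by itself make the underlying model deterministic across runs.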

            • oneneptune

              today at 7:38 PM

              Is this a service / product you plan to offer outwardly? I'd be interested in learning more. Use case: estimation.

          • petee

            today at 7:45 PM

            I ran the example doors given and it missed 9 swinging doors, some that were in double swing pairs, and a few that were just out on their own not clustered. Not bad overall though

            • i18nagentai

              today at 7:06 PM

              OCR accuracy on technical documents is one of those problems that looks 95% solved until you hit the edge cases. Construction docs are especially tricky because of mixed handwriting, stamps, revision clouds, and poor scan quality. Curious how you handle multi-language documents — a lot of international construction projects have specs in two or three languages on the same page.

              • frogguy

                today at 6:14 PM

                Looks cool! Where are you getting the data to fine-tune the CV models for element extraction? I'm worried there isn't a robust enough dataset to build a detection model that will generalize to all of the slightly different standards each discipline (and each firm, for that matter) uses.

                  • wcisco17

                    today at 6:32 PM

                    good q — we don't train on customer drawings. Our detection models are trained on a curated dataset of architectural drawings we've sourced and labeled ourselves, focused on the most common fixture and element types across CSI divisions.

                    The generalization problem you're pointing at is real and it's the hardest part of this. Our approach is to keep the detection scope tight — rather than trying to generalize across every firm's conventions, we train on a small but high-quality set of fixtures and optimize for precision within that scope.

                    The result is high confidence outputs on the elements we support, rather than mediocre coverage across everything.

                    We're expanding the detection surface incrementally as we validate accuracy division by division!

                      • dylan604

                        today at 7:09 PM

                        How in the world is an answer to a question from the account posting TFA replying directly to said question getting killed?

                • testUser1228

                  today at 5:37 PM

                  What do you foresee being the end use case for this (or most valuable use case)?

                    • wcisco17

                      today at 5:44 PM

                      Anyone building in or for construction tech — whether that's a startup building estimating or project management software, a construction company with an internal tech team solving this themselves, or a builder looking to automate their workflow. The common thread is drawings. Every one of those groups lives and dies by their ability to extract actionable data from a PDF that was never designed to be machine-readable. We're building the layer that makes that possible so they don't have to start from scratch.

                        • wang_li

                          today at 5:55 PM

                          Why does the workflow lie at the level of a real or virtual piece of paper and not in the metadata from the applications used to create that piece of paper? Seems like a CAD tool would allow you to identify each element of the drawing, assigning metadata as required.

                            • jsidney

                              today at 6:04 PM

                              Only a small set of construction stakeholders participate in the CAD ecosystem (e.g., architects, large GCs) while a broader set of stakeholders (subcontractors, trades, smaller GCs/CMs) do not receive BIM files and work with PDFs. CAD/BIM is a wonderful aspiration but for many the reality is PDFs.

                                • instig007

                                  today at 6:55 PM

                                  Re. "CAD/BIM", technically speaking CAD doesn't imply BIM, and the industry's promotion of BIM is akin to AI promotion among software engineering teams - the benefits aren't clear upon detailed review of the advertised capabilities. The CAD part, on the other hand, is generally recognized as the essential tooling for the profession and I'm surprised to hear that it just is a "wonderful aspiration".

                              • cyanydeez

                                today at 6:12 PM

                                Oh you sweet summer child. These drawings are anywhere from 0 to 120 years old, and might be anything from a file pulled off a floppy disk from 1970 to scanned, coffee-stained sheets of paper that sat folded a hundred times in a desk.

                                The world in which metadata is a common thing attached to any file doesn't exist, and probably never will, no matter how much you try to improve CAD workflow.

                    • Iulioh

                      today at 5:05 PM

                      When will this be available for 30000x8000px electrical diagrams?

                      I have to make a BOM and oh boy I hate my job

                        • oritron

                          today at 5:11 PM

                          What software made the bitmap? Seems like a step earlier in the pipeline could help generate a BOM more easily.

                            • Iulioh

                              today at 5:53 PM

                              I'm not really sure and I don't have access to it; I just receive flat PDFs or TIFFs.

                              A lot of them are "archival" so I'm pretty OOL

                                • dylan604

                                  today at 7:15 PM

                                  You might even be SOL

                                  It is telling that so many of the comments here assume that a person stuck with an impractical format could easily request the same thing in a different one. The assumption that this person never thought to ask whether something more convenient was available, and is just willfully toiling with the inconvenient version, is kind of insulting.

                                    • oneneptune

                                      today at 7:40 PM

                                      Also, in the construction industry you get an updated drawing file a day before the bidding closes... good luck getting the GC to send more detailed files (that they themselves got elsewhere) in that time. You're better off sending it to your estimation department in India and letting them work through the night to put together the new estimations.

                          • alexeischiopu

                            today at 5:30 PM

                            I’m building a similar platform, with electrical being furthest ahead - SLD, panels, lights, power, comms.

                            Also do doors, windows, and mechanical equipment.

                            dm, and I can include you in the next preview.

                              • testUser1228

                                today at 6:55 PM

                                I'm not sure how to dm on here, but I'm very interested

                                  • axus

                                    today at 7:37 PM

                                    You can paste "who is alexeischiopu" to a search engine, and since there isn't an athlete with the same name, a good candidate appears.

                                • Iulioh

                                  today at 5:55 PM

                                  I work in the automotive field, I don't know if this complicates the things further but I appreciate any help!

                              • jsidney

                                today at 5:11 PM

                                What do you hate the most?

                                  • stronglikedan

                                    today at 6:51 PM

                                    silly questions

                            • hspraggins77

                              today at 5:47 PM

                              Great points raised!

                              • alexeischiopu

                                today at 5:30 PM

                                Good idea :)

                                  • wcisco17

                                    today at 5:35 PM

                                    Thanks!!

                                • vessenes

                                  today at 5:31 PM

                                  cool. What's pricing like?

                                    • wcisco17

                                      today at 5:35 PM

                                      Thanks! https://www.getanchorgrid.com/developer/pricing

                                      Let me know if you find it useful or have any questions, happy to help.

                                        • vessenes

                                          today at 5:42 PM

                                          Thanks -- btw the Pricing link on the site pulls up a form, not that page.

                                  • achillesheels

                                    today at 4:57 PM

                                    Love it! Starbucks Venti Macchiato sip

                                    Love to give it to an arc client, not sure who the right person to implement this would be? Hmm…

                                    • fithisux

                                      today at 4:48 PM

                                      Of course it is not working. PDF and images are supposed to be tamper resistant. OCR tries to reverse engineer them.

                                        • kube-system

                                          today at 5:01 PM

                                          Since when is tamper resistance a part of PDF or any common image format?

                                            • pwagland

                                              today at 5:17 PM

                                              PDF files can be signed, that is tamper resistance. Tamper resistance doesn't have to make any difference to the readability of the document.

                                                • kube-system

                                                  today at 5:23 PM

                                                  So can any type of file -- that doesn't have any relevance to the supposed design of every file type in existence. Now, later versions of PDF do have explicit support for signatures, but what does this have to do with preventing OCR? OCR reads a file, it doesn't change the original file.

                                                    • fithisux

                                                      today at 6:01 PM

                                                      True but you can make modified copies if you reverse engineer it with OCR.

                                                      • ranger_danger

                                                        today at 5:40 PM

                                                        Some OCR solutions do change the original file, like OCRmyPDF. They take layers that were just images before and replace them with text layers so that you can search the document.

                                                          • kube-system

                                                            today at 5:48 PM

                                                            That isn't OCR, but an application of the resulting output of OCR. Again, a signature on a PDF or any type of file doesn't prevent you from reading it. (It also doesn't technically prevent you from changing it, it just enables the detection of changes to a particular file.)

                                                            There's nothing about PDFs or image formats that prevents anyone from doing OCR. Construction documents are difficult to OCR because OCR models are not well trained for them, and because they're very technical documents where small details are significant. It doesn't have anything to do with the file format.
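The distinction between "detects changes" and "prevents reading" can be shown with a small Python sketch. This is a stand-in, not how PDF signatures work: real PDF signatures use certificate-based public-key signing, while the example uses a keyed hash (HMAC) from the standard library, and the `sign` and `verify` helpers, the key, and the sample bytes are hypothetical.

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Compute a keyed digest over the exact bytes of the document."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """True only if the bytes are unchanged since signing."""
    return hmac.compare_digest(sign(data, key), tag)


key = b"hypothetical-signing-key"
document = b"DOOR SCHEDULE: D-101 36x84"
tag = sign(document, key)

# Reading needs neither the key nor the tag.
text = document.decode("ascii")

# Editing is not blocked either; it just stops verifying.
edited = document.replace(b"36", b"38")

original_ok = verify(document, key, tag)
edited_ok = verify(edited, key, tag)
```

Here `original_ok` is true and `edited_ok` is false, while `text` was readable the whole time, which is why a signature has no bearing on whether OCR or text extraction is possible.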

                                                    • ranger_danger

                                                      today at 5:38 PM

                                                      Can't one just remove the signature and re-sign it with anything else after tampering? Who verifies PDFs that hard?

                                                        • kube-system

                                                          today at 5:55 PM

                                                          If you're performing OCR, you're, almost by definition, disregarding the source file. The whole point of OCR is to be transformative.

                                                  • fithisux

                                                    today at 5:59 PM

                                                    You can't change a PDF; by design it is not easy to OCR.

                                                      • kube-system

                                                        today at 6:50 PM

                                                        PDFs are merely a collection of objects that can be read plainly from the file. Some of those are straight-up plain text that doesn't even need to be OCR'd; it can simply be extracted. It is also possible to embed image objects in PDFs (this is common for scanned files), which might be what you are thinking of. But this is not a design feature of PDF; it is just the output format of a scanner: an image. Editing a PDF is simply a matter of editing a file, which you can do as plainly as you would any other.
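A minimal Python sketch of "text that can simply be extracted" follows. It assumes an uncompressed content stream with simple literal strings; `PDF_FRAGMENT` is a hand-written object rather than a complete PDF file, and `extract_text` is a hypothetical helper. Real files usually compress their streams and use font-specific encodings, so a proper PDF library is the practical choice.

```python
import re
from typing import List

# A hand-written, uncompressed content-stream object (not a complete PDF file).
PDF_FRAGMENT = b"""4 0 obj
<< /Length 44 >>
stream
BT /F1 12 Tf 72 720 Td (DOOR SCHEDULE) Tj ET
endstream
endobj
"""

STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# Matches "(literal) Tj"; strings with nested or escaped parentheses are skipped.
TEXT_RE = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def extract_text(pdf_bytes: bytes) -> List[str]:
    """Return literal strings drawn with the Tj operator in uncompressed streams."""
    found: List[str] = []
    for stream in STREAM_RE.findall(pdf_bytes):
        for raw in TEXT_RE.findall(stream):
            found.append(raw.decode("latin-1"))
    return found


strings = extract_text(PDF_FRAGMENT)
```

A scanned page would instead contain an image object in place of the `Tj` text operators, which is the case where OCR becomes necessary.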

                                            • ware-intel

                                              today at 6:04 PM

                                              Your smart features look like a game changer! Nice job!