From Paper to Pixels: Azure AI in Historical Document Digitization and Translation

Heather_MacKinnon · Jul 16, 2024

For 64 years, a stack of letters lay unread in my grandfather’s trunk. These letters, written by relatives of my great-grandfather who immigrated from Poland in 1906, represent the last remnants of Polish language in my family. Over a decade ago, I promised my mother I would have them translated, but life got in the way. Recently, inspired by the success I've seen using OCR, digitization, and translation with various customer documents, I decided to tackle this personal project myself using the Microsoft AI services I am familiar with.

Iterating through Process and Technology

I gathered the letters, took images of each page with my phone, and uploaded them to Azure Storage. Sending the image files directly to Azure OpenAI's GPT-4o model resulted in a confusing mix of English and Polish, so I converted them to .pdf files and took advantage of Azure’s Document Intelligence Service, specifically the Read model, to identify the language and extract the text. The Document Analysis feature can recognize different styles, including handwritten text, and identify the language, reassuring me that the text would be extracted accurately. I used the Document Intelligence code samples repo to get started, and verified that the text was handwritten and correctly identified as Polish (P1). The best workflow involved pre-processing the documents with Azure Document Intelligence’s OCR capabilities, then passing the extracted Polish text to the GPT-4 model for translation into English. I used this git repo as a quick start, changed the model endpoint, and processed the letters in chunks of two pages per call to the model. This combination provided a reliable method to digitize and translate the documents effectively. Success! I had a general idea of the contents of the letters, so once my initial attempt proved successful, I felt motivated to continue. Within about an hour, I had processed all the letters and was ready to validate the results.

Figure 1. Excerpt from an original letter in Polish

The English translation of the above letter excerpt:

“Dear Joseph, I ask you to please ask your brothers Kazimierz and Jan to write to me. I deeply regret the loss of Piotr who died on the front in France. Tell me how your father is doing and if he is healthy. My dearest nephews, please come to Poland to visit us and improve our lives. If you cannot come, please send my family a parcel of clothes. We need both winter and summer clothing, as well as footwear—shoes and clogs. From your abundance, please collect some money and send it to us, poor farmers. I wish you happiness and success in the future. I send my warmest greetings to Joseph, Kazimierz, Jan, your father, and all our relatives.”

Although the translations seemed accurate, I needed confirmation from a native speaker before sharing with family. Luckily, a colleague who speaks Polish fluently offered to review a sample page. He confirmed that, while there were some nuances and minor errors in the translation, the overall message and sentiment were accurate. For example, "cannot replace me, an old man, in hard work" was interpreted by GPT-4o as "too young to replace me in heavy labor." While this analysis method was sufficient for my small project, more sophisticated metrics would be best used at scale. While this small personal project used only a subset of Azure services and was not intended for public use, this pattern could easily be integrated into a search service, such as Azure AI Search for discovery, or broadened to a “Chat with Your Data” pattern, for scalability.

Extending to cursive handwriting

After completing this project, I became curious about applying the same process to more complex documents, such as those written in cursive. Since cursive has not been taught as part of the common core at most U.S. schools since 2010, there is a real risk that future generations may lose the ability to read and preserve these historical documents. Institutions like the Smithsonian and the National Archives rely on volunteers to transcribe and preserve these records, but the pace is insufficient to capture written history before these skills disappear. The time has never been better to introduce AI for preservation of handwritten history.

To test this idea, I found a few family recipes written in cursive. This time, I uploaded image files of the recipes directly to GPT-4o. The results were impressive and did not require OCR as an initial step. Here’s a sample recipe from my grandmother:

Figure 2. Recipe handwritten in cursive

And here are the results from GPT-4o:

Sandies

Ingredients:

1 cup butter
1/3 cup granulated sugar
2 tablespoons water
2 teaspoons vanilla
2 cups flour
1 cup chopped nuts

Instructions:

Cream butter and sugar.
Add water and vanilla.
Then add flour and nuts.
Chill 4 hours.
Roll in cookie sheet, shape into fingers.
Bake at 325°F for 20 minutes.
Cool and roll in powdered sugar.

Amazingly, the model was even able to identify "cookie sheet," which was written in very small superscript text.

Prompt nuances matter

While I didn’t have to do much on the prompt engineering side, I realized how small changes in prompts can affect the quality of the result. In the case of the Polish letters, I initially prompted the GPT model to “translate this text from Polish to English”. I iterated a bit and found “translate these family letters from Polish to English” made the result a bit more readable and self-corrected on some misspellings. In the case of the recipes, I specifically prompted the GPT model to “read this family recipe” rather than just read what was in the image. This resulted in not only a very accurate result, but the model output separated the ingredients and the instructions without being explicitly written in the original recipe.

The journey from paper to pixels has never been more accessible or efficient, thanks to Azure Document Intelligence and Azure OpenAI. These powerful tools have proven their capability in digitizing and translating handwritten historical documents, preserving invaluable cultural and personal histories. My experience with translating my great-grandfather's letters and digitizing family recipes demonstrates the transformative potential of these technologies. By leveraging Azure's AI tools, we can ensure that the stories and knowledge contained in historical documents are not lost to time but are instead accessible to future generations.

Continue reading...

From Paper to Pixels: Azure AI in Historical Document Digitization and Translation

Heather_MacKinnon