Commit 369c05e

add docx metadata extractor tutorial
1 parent e086cab commit 369c05e

5 files changed: 44 additions, 0 deletions


README.md

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
     - [How to Build a Username Search Tool in Python](https://thepythoncode.com/code/social-media-username-finder-in-python). ([code](ethical-hacking/username-finder))
     - [How to Find Past Wi-Fi Connections on Windows in Python](https://thepythoncode.com/article/find-past-wifi-connections-on-windows-in-python). ([code](ethical-hacking/find-past-wifi-connections-on-windows))
     - [How to Remove Metadata from PDFs in Python](https://thepythoncode.com/article/how-to-remove-metadata-from-pdfs-in-python). ([code](ethical-hacking/pdf-metadata-remover))
+    - [How to Extract Metadata from Docx Files in Python](https://thepythoncode.com/article/docx-metadata-extractor-in-python). ([code](ethical-hacking/docx-metadata-extractor))

 - ### [Machine Learning](https://www.thepythoncode.com/topic/machine-learning)
     - ### [Natural Language Processing](https://www.thepythoncode.com/topic/nlp)
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# [How to Extract Metadata from Docx Files in Python](https://thepythoncode.com/article/docx-metadata-extractor-in-python)
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+import docx  # Import the docx library for working with Word documents.
+from pprint import pprint  # Import the pprint function for pretty printing.
+
+def extract_metadata(docx_file):
+    doc = docx.Document(docx_file)  # Create a Document object from the Word document file.
+    core_properties = doc.core_properties  # Get the core properties of the document.
+
+    metadata = {}  # Initialize an empty dictionary to store metadata.
+
+    # Extract core properties.
+    for prop in dir(core_properties):  # Iterate over all attributes of the core_properties object.
+        if prop.startswith('__'):  # Skip attributes starting with double underscores; they are not needed.
+            continue
+        value = getattr(core_properties, prop)  # Get the value of the property.
+        if callable(value):  # Skip callable attributes (methods).
+            continue
+        if prop == 'created' or prop == 'modified' or prop == 'last_printed':  # Check for datetime properties.
+            if value:
+                value = value.strftime('%Y-%m-%d %H:%M:%S')  # Convert datetime to string format.
+            else:
+                value = None
+        metadata[prop] = value  # Store the property and its value in the metadata dictionary.
+
+    # Extract custom properties (if available).
+    try:
+        custom_properties = core_properties.custom_properties  # Get the custom properties (if available).
+        if custom_properties:  # Check if custom properties exist.
+            metadata['custom_properties'] = {}  # Initialize a dictionary to store custom properties.
+            for prop in custom_properties:  # Iterate over custom properties.
+                metadata['custom_properties'][prop.name] = prop.value  # Store the custom property name and value.
+    except AttributeError:
+        # Custom properties not available in this version.
+        pass  # Skip custom properties extraction if the attribute is not available.
+
+    return metadata  # Return the metadata dictionary.
+
+
+
+docx_path = 'test.docx'  # Path to the Word document file.
+metadata = extract_metadata(docx_path)  # Call the extract_metadata function.
+pprint(metadata)  # Pretty print the metadata dictionary.
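
Review note: the script above reads metadata through `doc.core_properties`. For context, a .docx file is a ZIP package, and its core properties are stored in the `docProps/core.xml` part. The following is a minimal standard-library sketch of reading that part directly, not the commit's method. The names `read_core_xml` and `build_sample_package` are hypothetical, and the sample is a tiny in-memory ZIP holding only a core.xml part rather than a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_core_xml(docx_file):
    """Return {local tag name: text} for each element in docProps/core.xml."""
    with zipfile.ZipFile(docx_file) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}  # The part is optional; treat a missing one as "no metadata".
    root = ET.fromstring(xml_bytes)
    # Tags look like "{namespace-uri}creator"; keep only the local name.
    return {child.tag.rsplit("}", 1)[-1]: child.text for child in root}


def build_sample_package():
    """Hypothetical helper: an in-memory ZIP with only a core.xml part."""
    core_xml = (
        '<cp:coreProperties'
        ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
        ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
        ' xmlns:dcterms="http://purl.org/dc/terms/"'
        ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
        '<dc:title>Sample</dc:title>'
        '<dc:creator>Jane Doe</dc:creator>'
        '<dcterms:created xsi:type="dcterms:W3CDTF">2023-01-02T03:04:05Z</dcterms:created>'
        '</cp:coreProperties>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core_xml)
    buffer.seek(0)
    return buffer


sample_metadata = read_core_xml(build_sample_package())
print(sample_metadata)
```

Unlike the `dir()`-based loop in the commit, this approach returns only the fields actually present in the file, with timestamps left as the raw strings stored in the XML.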
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+python-docx
12.5 KB
Binary file not shown.
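
Review note: the custom-properties branch of the script relies on a `custom_properties` attribute of the core-properties object and silently skips extraction on `AttributeError`. python-docx does not appear to expose such an attribute there, so that branch may never populate anything. In the Office Open XML package layout, custom properties are stored in a separate `docProps/custom.xml` part. The sketch below reads that part with the standard library only; `read_custom_properties` and the sample package are hypothetical, and values are returned as the raw strings found in the XML without type conversion.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the <property> elements inside docProps/custom.xml.
CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def read_custom_properties(docx_file):
    """Return {property name: raw text value} from docProps/custom.xml, if present."""
    with zipfile.ZipFile(docx_file) as package:
        if "docProps/custom.xml" not in package.namelist():
            return {}  # Many documents have no custom properties at all.
        root = ET.fromstring(package.read("docProps/custom.xml"))
    properties = {}
    for prop in root.iter(CUSTOM_NS + "property"):
        value_element = next(iter(prop), None)  # The first child holds the typed value.
        properties[prop.get("name")] = (
            None if value_element is None else value_element.text
        )
    return properties


# Hypothetical sample: an in-memory ZIP with only a custom.xml part.
custom_xml = (
    '<Properties'
    ' xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"'
    ' xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">'
    '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Project">'
    '<vt:lpwstr>Apollo</vt:lpwstr></property>'
    '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Revision">'
    '<vt:i4>7</vt:i4></property>'
    '</Properties>'
)
sample_package = io.BytesIO()
with zipfile.ZipFile(sample_package, "w") as package:
    package.writestr("docProps/custom.xml", custom_xml)
sample_package.seek(0)

custom = read_custom_properties(sample_package)
print(custom)
```

The result could be stored under `metadata['custom_properties']` in the same shape the script already anticipates.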
