I noticed that many programs and services (such as Dropbox camera uploads) sometimes modify file metadata (such as the EXIF tags inside JPEGs), causing the hash of the file to change even if the actual image is unchanged.
This causes problems when comparing files to delete duplicates and when checking file integrity.
To solve this problem I thought of an algorithm that calculates the hash by analyzing ONLY the REAL IMAGE, without tags.
That brought me to another problem... the orientation of the image. Suppose you take a photo with the wrong orientation (it really happens quite often) and you rotate it to make it straight... the image file can be altered in three possible ways:
- you rotate it with a "stupid" program (like Windows Photo Viewer) that simply rotates the image and saves it back, recompressing it... AAARRRGHHHH !!! You lose the original file and you probably lose quality due to the recompression... This is absolutely to be avoided.
- you rotate it with a "smart" program that changes the "orientation" metadata tag, thus altering the file and consequently its hash.
- you rotate it with a "smart" program that performs lossless rotation. There are not many of these programs, but they exist. The problem is that even in this case the file gets altered, and so does its hash.
I therefore thought that I should calculate the hash on the real image only, stripping the metadata off the file and taking into account that the image may have been rotated losslessly. That means that the bit-stream composing the real image may be rearranged, so the image effectively has not one hash but four, one for each of the 0, 90, 180 and 270 degrees of lossless rotation.
In the end I came up with a solution I consider good, probably not perfect (even if at the moment I can't see any logic-related flaw) but reasonable: I calculate the hash of all four possible stream orderings, sort them alphabetically (MD5 is printed as hex), join them into a single string of text and finally calculate the hash of this string. Because the four hashes are sorted before being joined, the result is the same whichever of the four rotations the file is currently stored in.
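The steps above can be sketched roughly as follows. This is only an illustration, not the code of the script itself: the function names `rotation_hashes` and `combine_hashes` are made up for this example, and it assumes that `jpegtran -copy none` is enough to drop the metadata before hashing.

```shell
#!/bin/sh
# Illustrative sketch only; the function names are hypothetical.

# Print one MD5 per lossless rotation (0, 90, 180 and 270 degrees).
# Assumption: "jpegtran -copy none" drops the metadata markers, so only
# the image data ends up in the hash.
rotation_hashes() {
    for angle in 0 90 180 270; do
        if [ "$angle" -eq 0 ]; then
            jpegtran -copy none "$1"
        else
            jpegtran -copy none -rotate "$angle" "$1"
        fi | md5sum | cut -d' ' -f1
    done
}

# Sort the hex digests, join them into one string and hash that string.
combine_hashes() {
    printf '%s\n' "$@" | LC_ALL=C sort | tr -d '\n' | md5sum | cut -d' ' -f1
}

# Usage (needs a real JPEG plus jpegtran and md5sum on the PATH):
#   combine_hashes $(rotation_hashes photo.jpg)
```

Sorting with `LC_ALL=C` keeps the order of the digests independent of the locale, so the final hash does not depend on the machine it is computed on.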
I know, I know... MD5 is no longer considered collision-resistant, but its binary is available on almost every system, while SHA<something> isn't. Besides, I'm not doing cryptography, I just need a unique file name, and I make a collision even less likely by adding to the filename the timestamp of the moment the file was created.
It would be quite an event to get two photos taken in the very same second and with the very same hash calculated via the algorithm described above.
The most important commands used by this script are jpegtran and exiftool.
On Cygwin you can find jpegtran inside the jpeg package, while on Debian you find it inside libjpeg-progs.
Exiftool is a bit more difficult to set up on Cygwin, as it is not present in the standard repository, nor in cygports. You have to download the .tar.gz file from this page:
unpack it and copy the file exiftool and the directory lib together into a directory on your PATH.
Do not use the win32 binary (.exe) as it doesn't support stdin and stdout processing. Unfortunately it only accepts filenames for both input and output files (no, the dash does not work either).
On Debian it's easier, as you can find it in the libimage-exiftool-perl package.
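A quick way to verify that the tools mentioned above are reachable on the PATH before running anything is sketched below. The helper name `missing_commands` is hypothetical and is not part of the script.

```shell
#!/bin/sh
# Report which of the given commands cannot be found on the PATH.
# missing_commands is a hypothetical helper, not part of the original script.
missing_commands() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   missing_commands jpegtran exiftool md5sum || exit 1
```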
As written in the usage, the four-rotation algorithm is applied only to JPEG files because, with exiftool, this is the only file type for which the metadata removal is absolutely certain (it's written here in the "writer limitations" section).
For every other file type the hash is calculated on the whole file, including metadata... it's not nice, but I don't really know how to obtain the same result as above on other file types... if you find a way, do not hesitate to contact me.
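The split between JPEG and everything else could look something like the sketch below. How the script really recognizes a JPEG is not stated here, so checking the first three bytes of the file is an assumption, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; the detection method is an assumption.

# Succeed if the file starts with the JPEG signature bytes FF D8 FF.
is_jpeg() {
    magic=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "ffd8ff" ]
}

# Hash of the whole file, metadata included (the non-JPEG fallback).
whole_file_hash() {
    md5sum < "$1" | cut -d' ' -f1
}

# Usage:
#   if is_jpeg "$f"; then echo "use the four-rotation hash"
#   else whole_file_hash "$f"; fi
```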
The display command comes from the ImageMagick package, which is commonly available on both Cygwin and Debian.
One last note: the timestamp for the filename is obtained by searching the metadata in a specific order... if nothing is found there, the modification timestamp of the file is used instead.
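A minimal sketch of that lookup follows. The tag names and their order (DateTimeOriginal, CreateDate, ModifyDate) are an assumption, since the exact order is not listed here; the timestamp format and the function name `media_timestamp` are also made up for the example.

```shell
#!/bin/sh
# media_timestamp FILE: print a timestamp suitable for a file name.
# Assumption: the tags tried and their order are only a guess at what
# "a specific order" means; the function name is hypothetical.
media_timestamp() {
    for tag in DateTimeOriginal CreateDate ModifyDate; do
        # -s3 prints only the value, -d sets the date format
        ts=$(exiftool -s3 -d '%Y%m%d-%H%M%S' "-$tag" "$1" 2>/dev/null)
        if [ -n "$ts" ]; then
            printf '%s\n' "$ts"
            return 0
        fi
    done
    # Nothing found in the metadata: fall back to the modification time
    # of the file ("date -r FILE" is a GNU coreutils extension).
    date -r "$1" '+%Y%m%d-%H%M%S'
}

# Usage:
#   media_timestamp photo.jpg
```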
H4sIALPS5lsCA7VYbXPTSBL+bP2KRvGCQyLLDoQ7AqaW2gSOvQ2hIHt1VzEVFGlsa5E0Wmkc4kv8
3+/pGUmWLCeEqj1XYksz0z39+nTPbD1wL8LEvfDymWVZ/zr6+OndyfuRHYsg9DKReLEgh4b9Yf8Z
fi8WFIf+TESi78skl5H8eRp7YYS32LasLfpl5iVTEcnpAV5KKl9mmfCVCMh2Tm1KpKJvMgu/hsmU
woRyiT3yUM09FYJpQbgPwlheMtGxDMLJgg49JWy8sWTUGPMUqZkgkQQkJ/oxCnPFvGUWiIyUpEh4
lwIMQRX6eiMKQJuThz+KPCzPRC4z1be2OtXOb8JImK1KIt7QPQ1j7Ar2HuUi9TKMkR1ObLqIpP+V
VBZOpyIDA9jLg/px7CWBE4UJ9PwWKn/WL5R8CiW9SxkGWFsYG7KyAjk/T3h7HjVCeUFAMD52CKEl
ZMiVF6c0InvQ+Nh9yzr+z/vXx0ej7vVga+uxu8T7h9en/+D3n9zHS8tKhMCm5+JK5CNbXIUTJWVE
f6RiqjIvIUiUKDYwtlDaUtCK2D40T8I/oTZEmStC2IhnTykO9vN5TOIqhaIUXyIWJjLDu2Arda9r
uy3pBQXS6kCFB/RthmjCPCaW9IrcQFy6yTyKaO/VwyEWwhKJ1ekIfybJPvr48eTjAQIwzzlyxrah
G9ulhemBzYuvQkVDqzMJrUAmwrLmuTcVvW26tgwjB26wx0n3ugj25TgZJ3Z98vdPr98eHUCu46Ml
nZET0A05h/SZnzN+9vHMzmHftKjHCgQHFIiL+bQ9dVhOEUucc1RmEkaGQZljvoldBpoiPHRI9AIx
8eaROiCZRAsKwjyNvAUl4puOlu02h1NwmOcFeSsLXA4mzoRaHrR5+OBhwq+KSxcooBApxGiAlBOJ
v1gnTN6fnMKYCBFIS79+OHprFKUwJ9+L/HmklZ9rp6oZRtNM+iKYZ+JggzGGSBm9PxbmyLQ0FTrr
xaXIgE5CedDIo97Rv9+92aVf3/B3v99v22SPAQZCNcWAiJx+sAHsqwEqRvS0iJ8YYle7js3o5bn0
Q/PMmAa/Bi0qTnfNjzeNZJ7DDDkcWAbA8wFCY5oJGAcQMlO7LQ6ED5SOYxYtkYnD2Zoj12LvAjYR
AXhrEMpbpAyomVDzLGGMSWUItw2RzSqMjPGQLCV8TuQ8a+i3UZCazjn1BruQf5eGf8fX3t8G28hy
6MFg7+uigHhfxDJr2/KZ9mjiRbe7Q8uz8gnDzWaJonTmXQiF4I5gV47jthuSsTpthBnv6aVpFOoN
QQf71OL0Qvge5w4EaQW38gGWF4KGg8FPd4UjUH9GJcz2W1zeTWgh52yFAAXjm6dFWPFc8ZlkMiYJ
SeAeroMtTjo11CIV+S6lKHqQO5DaCzOBGquBXHK6Ks9X4EG9eEGCa3iLFawSIqsDYbIy9yGJ6m+w
5rFkGyYTiT1M0sIZM6XSA9f10C9I/u8jLKd5KlU/VO7eYLjvDp64tS7DMfwdJZ1igIuRo13Qn6m4
kI/BfW9/30IJQ/GArlOhZKpysoPDzEd3cfLhFJhuigwVH5/N0C1mQq4owTa+OhqHR/ZC5Fw4Oi9e
4Ptw04x+1aI0VtPaJ9tuDVVTWqeC+rZFWgBfC6CBdk20Uz3DYlTF/255Ht8ujy6KdwlSPorc800h
zWcIX+r2emzI94fOcHvbss6ou0WO+JMGKIkPHxZ8LR22ZdQWtbdsMhyEDjlI0BE57EeDfs8HZlnV
gji+TBeMcWhAzRLGR4fRj6m4vzB1ICG7+7NdtBWo0P/F+zVPLe26TNxxnKHncCaN+c1NhuY8LteN
TdMKFNrQZPCeWGgcDJYj9GPsE7O3ntPObE0VpurkQrFN0jBFVQ8jy3gYKTz6UrVhDuDkQiKKV6Lv
vVq1SzeEqpGSTW+4EqAvsDHE/ZnuRKDx3uodLbQe+WJ23qnvrKXVwd6SdpXyZeODxrO0T1KKXI3x
y5JTtrC7XRs2rBlhGw7g+UsvG3V7L+llD+1JTdmbotPc5hQoauRogOeidJ6x2Hp0+Xn0xbjSEINj
jd70co3orOZgkA43QE+5vIxsu/56bmpYYIZ9BILS29/LYBq5OgauNEGL7dngM+gfgH6eFEPM5ebm
tvXPf5QAZfkHKVDDN1GYNOt0eO1ZtzB6zeYbHNKyf3FcWA/JIXugCpdKojVWbOVG3NTc1AiD7nWL
urm6UrVOhfDOcnUeealGVRF9Tx420Iat/p8ilt4wAubiPjQrH5oiUjf1hgCuiW4CWUOPDvyl6X3z
8xX/seK8b+kyVpsCaz0yilOjA0B9pjdsiKWr7q2ydX4k7Jpp/4rsgui8VKym0XkJPXobYy7zbZIf
ZVA/7Aw1JHUScaVKXMJc8bjzfLC9DXnP9Bm4WgELoh9+8qyomXVaBpUaqNWJfjDJ1kGuLLN34l+F
rc2tGXT1WVrn/Tlq97nprVcirQIcsEH19+frAxqLGiMF1tzwJQODAUMmy8n3DXqQQQJjX6z7Y25T
UCoCtDZUnPuNToUiTtJa5rrdR+Pkkbu8G7iKPNSsimV2l0PI3rhcNw7Km+ZFuVt1R96q7HFo3aFt
r7Q989Hm010AX1Eh5ngS9FWrqEsXd01YrZumXzLBK4trtepii05w9uTjWHXX1lx4mnn+141jzXu6
tx8+1W/Lqnf6xNLY97jWK0pNCQeVJrUiUJav9bm1C7EGaKws8mWzAc3AstVDORySCCVg4aPcPTug
z647fVTGqTPH00x4AQfRsKT1aegMTWrdK27N1kDYuk6mgWD40Vn4F1mk7E6bB4qGbJXRvm+zO69K
/1JD3jP/b7ckG/I7TEym9HrkAwm3VvQ7fCvpvxwNn/Lvzg5peMcJu5Zm9f2uTBSz0zZ65arxKQSo
8VrzmsXHKDUqnLa19bi/5A14jC9D1e4u3+1W8bHpTNI875iyvpKqr9GPf8BtSbUiWO/iTddzB109
3eLLGtntNEWAr/qtDYemiumaMKPRXVo0s7+mtqY+oJN/6hNd0Ue159+8fvfb0aFdCajX/Zj1yrvo
/wHEsjLLbRkAAA==
The text block above is to be piped to | base64 -d | gunzip > mediarename
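For example, assuming the block has been saved as plain text to a file, the unpacking can be wrapped like this. The function name `unpack_script` and the file name `mediarename.b64` are only placeholders.

```shell
#!/bin/sh
# unpack_script B64FILE OUTFILE: decode the base64 text, decompress it
# and mark the result as executable. Names here are placeholders.
unpack_script() {
    base64 -d < "$1" | gunzip > "$2" && chmod +x "$2"
}

# Usage:
#   unpack_script mediarename.b64 mediarename
```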
1 comment:
yes yes... exactly the script I was looking for
THANKS AMICOMICO!