Week 7

>>DAVID MALAN: All right, welcome back. This is CS50. This is the start of week seven. So it’s been a while, so I thought we’d
take a whirlwind tour of where we left off and where we’re now going.>>So this thing here might have
caused some angst at first. But hopefully, you’re beginning to
acclimate to what this denotes here– star representing a pointer, which is
just what, in more layman’s terms? So it’s an address.>>So it’s the address of
something in memory. And we started to peel back the layers
a couple of weeks ago, things like GetString and other such functions
all this time have been returning addresses of things in memory, like the
address of the first character in some sequence.>>So we also introduced valgrind, which
you’ll start to use for this problem set, particularly for the next
problem set as well. And valgrind does what for us? It checks for memory leaks, and it
also checks for abuse of memory.>>It can, with some probability, detect if
your code is going to touch memory that it simply shouldn’t. So not necessarily a leak, but if you
go beyond the boundaries of some array, and you actually run valgrind
and induce that behavior while valgrind is running in your program is
running inside of it, you’ll get messages like this– “invalid write of
size 4,” which, recall a couple of weeks ago meant that I had accidentally
like on one int too far beyond the boundaries of an array. And so size 4 means here the size
of that particular int.>>So take reassurance in the fact that
valgrind’s output, the format of it, is just atrocious. It’s really hard to see through the mess
for the interesting information. So what we’ve done here is just excerpt
some of the couple of more interesting lines. But realize that 80% of valgrind’s
output is going to be a bit of a distraction.>>Just look for patterns like these–
invalid right, invalid read, 40 bytes and some number of blocks are definitely
lost, keywords like that. And what you’ll hopefully see is some
kind of trace of what function the mistake is actually in. In this case here, in what line of
my code was the error apparently? >>26 in a file called memory.c, which was
the example we were playing with at the time. So it’s probably not in malloc. It was probably in my code instead. So we’ll see this again
and again before long.>>So scanf, this came up in a
couple of forms thus far. We saw sscanf briefly. It was something a number of
you dived into in your preparations for the quiz. And scanf is actually what the CS50
library’s been using underneath the hood for quite some time in order
to get input from the user.>>For instance, if I move over to the CS50
appliance here, let me open up an example today that’s called scanf-0.c
And it’s super simple. It’s just a few lines of code. But it demonstrates really how getInt
has been working all of this time.>>In this program here, in line 16
, notice that I declare an int. So no pointers, nothing magical
there, just an int. Then in line 17, I prompt the
user for a number, please. Then in late 18, I use scanf here. And I specified, kind of like printf,
that I’m expecting quote unquote percent i.>>So percent i, of course,
denotes an int. But notice what the second
argument to scanf is. How would you describe the second
argument after the comma? What is that?>>It’s the address of x. So this is useful because by providing
scanf with the address of x, what does that empower that function to do? Not just go there, but also do what?>>Make a change to it. Because you can go there, it’s sort of
like a map to a location in memory. And so long as you provide scanf, or
any function with such a map, that function can go there, and not only
look at the value, but it can also change that value, which is useful if
the purpose in life of scanf is to scan input from the user, specifically
from the keyboard. And the f denotes formatted, just like
printf, the f denotes a formatted string that you want to print.>>So in short, this line 18 simply says,
try to read an int from the user’s keyboard and store it inside of x, at
whatever address x happens to live at. And then lastly, line 19 just says,
thanks for the int, in this case.>>So let me go ahead and make this. So make scanf 0. Let me go ahead and zoom in. I’ll go and run this with
dots slash scanf 0. Number, please? 50. Thanks for the 50. So it’s quite simple.>>Now what is it not doing? It’s not doing a whole bunch
of error checking. For instance, if I don’t cooperate,
and I don’t type in a number, but instead I write something like “hello,”
that’s just kind of strange. And so one of the things the CS50
library has been doing for us for some time is that reprompting
and reprompting.>>The retry phrase recall was in cs50.c,
and that’s the reason that getInt in the CS50 library is actually a whole
bunch of lines long, because we’re checking for stupid stuff like this. Did the user not give
us, in fact, an int? Did he or she give us something
like an alphabetical letter? If so, we want to detect
that and yell at them.>>But things get more interesting
in this next example. If I go to scanf-1.c, what is the one
thing that is fundamentally changed in this next example? I’m using char*, of course,
instead of int.>>So this is interesting, because char*,
recall, is really just the same thing as string. So it feels like maybe this is a super
simple implementation of GetString. But I’ve peeled back the layer
of the CS50 library, so I’m calling this char* now. So let’s see where, if anywhere,
we go wrong.>>Line 17– I again say, please give me something,
in this case, a string. And then in the next line, I call scanf,
again, giving it a format code, but this time percent s. And then this time, I’m
giving it buffer.>>Now notice, I’m not using
the ampersand. But why is that probably OK here? Because what is buffer already? It’s already a pointer. It’s already an address.>>And let’s this word “confuse,” let me
just call it s, for instance, for simplicity. But I’ve called it buffer because in
general, in programming, if you have a chunk of memory, which a string really
just is, you might call it a buffer. It’s a place to store information.>>Similar to things like YouTube, when
they’re buffering, so to speak, that just means it’s downloading bits from
the internet and storing them in a local array, a local chunk of memory so
that you can watch it later without it skipping or hanging on
you while playing back.>>So there’s a problem here though,
because I’m telling scanf, expect a string from the user. Here’s the address of
a chunk of memory. Put that string there. Why is that bound give
us trouble, though?>>What’s that? Am I allowed to access
that part of memory? You know, I don’t know. Because has buffer been initialized
to anything? Not really. And so it’s what we’ve been calling
a garbage value, which isn’t a formal word. It just means we have no idea what bits
are inside of the four bytes that I have allocated as buffer.>>I have not called malloc. I’ve definitely not called GetString. So who knows what is actually
inside of buffer? And yet telling scanf blindly, go there
and put whatever the user typed.>>So what is likely to cause
in our code if we run it? Probably a segfault. Maybe not, but probably a segfault. And I say maybe not because sometimes
you do, sometimes you don’t get a segfault. Sometimes you just get lucky, but
it’s nonetheless going to be a bug in our program.>>So let me go ahead and compile this. I’m going to do it the old school way. So clang dash 0, scanf-1,
scanf-1.c, Enter. Oops, too old school. Let’s see. Where did I go? Oh, char* buffer. Oh, thank you– Save, OK– very old school. All right, it’s been a while.>>So I’ve just saved the file after
making that temporary change a moment ago. And now I have compiled it
manually with Clang. And now I’m going to go ahead
and run scanf-1, Enter. String please. I’ll type in “hello.”>>And now, here’s where, frankly, printf
can is a little annoying. It’s not actually going to
segfault in this case. Printf is a little special because
it’s so super commonly used that essentially printf is doing
us a favor and realizing, that’s not a valid pointer. Let me take it upon myself to just print
out in parentheses null, even though it’s not necessarily what
we ourselves expected.>>So we can’t really easily induce a
segfault with this, but clearly this is not the behavior I wanted. So what’s the simple solution? Well, in scanf-2, let me propose that
instead of actually just allocating a char*, let me be a little smarter about
this, and let me allocate buffer as a sequence of 16 chars.>>So I can do this in a couple of ways. I could absolutely use malloc. But I can go back to week two when
I just needed a whole bunch of characters. That’s just an array. So let me instead redefine buffer
to be an array of 16 characters.>>And now, when I pass buffer in– and this is something we didn’t
talk about in week two– but you can treat an array as
though it’s an address. Technically, as we’ve seen, they’re
a little bit different. But scanf won’t mind if you pass it
the name of an array, because what Clang will do for us is essentially
treat the name of that array as the address of the chunk of 16 bytes.>>So this is better. This means now that I can hopefully
do the following. Let me zoom out for a moment and
do make scanf-2, compiled OK. Now let me do got slash scanf-2. String please. “Hello.” And it
seemed to work this time.>>But can someone propose a scenario
in which it might not still work? Yeah? Something longer than 16 characters. And actually, we can be
a little more precise. Something longer then 15 characters,
because really we need to keep in mind that we need that backslash zero
implicitly at the end of the string, which is an aside scanf will typically
take care of for us.>>So let me do something like– sometimes we can just
leave it like that. OK, so we’ve now induced
our segmentation fault. Why? Because I typed to more than 15
characters, and so we’ve actually touched memory that I actually
should not have.>>So what’s really the solution here? Well, what if we need a longer string? Well, we maybe make it 32 bytes. Well, what if that’s not long enough? How about 64 bytes? What if that’s not long enough? How about 128 or 200 bytes? What really is the solution here in the
general case, if we don’t know in advance what the user’s going to type? >>It’s just kind of a big pain in the ass,
to be honest, which is why the CS50 library has a few dozen lines of
code that collectively implement GetString string in a way that we don’t
have to know in advance what the user is going to type. In particular, if you look back at
cs50.c from two weeks ago, you’ll see that GetString actually does
not use scanf in this way. Rather, it reads one character
at a time.>>Because the one nice thing about
reading one character is we can guarantee ourselves to always
have at least one char. I can just declare a char, and then take
these truly baby steps to just read one character in at a
time from the keyboard. And then, what you’ll see GetString
does is every time it runs out of, say, 16 bytes of memory, it uses
malloc, or a cousin thereof, to allocate more memory, copying the old
memory into the new, and then crawling along, getting one character at a time,
and when it runs out of that chunk of memory, throws it away, grabs
a bigger chunk of memory, copies old into new, and repeats. And it’s truly a pain to actually
implement something as simple as getting input from a user.>>So you can use scanf. You can use other similar functions. And a lot of textbooks and online
examples do, but they’re all vulnerable to problems like this. And ultimately, getting a segfault
is kind of annoying. It’s not good for the user.>>But in the worst case, what does
it fundamentally put your code at risk of? Some kind of attack, potentially. We talked about one such attack–
overflowing the stack. But in general, if you’re allowed to
overflow a buffer, like we did a couple of weeks ago, with just writing
more than “hello” on the stack, you can indeed take over, potentially, a
computer, or at least get at data that doesn’t belong to you.>>So in short, this is why we have
those training wheels. But now, we begin to take them off,
as our programs no longer need, necessarily, input from the user. But in the case of problem set six,
your input will come from a huge dictionary file with 150 some
odd thousand words.>>So you won’t have to worry about
the user’s arbitrary input. We will give you some assumptions
about that file. Any questions on pointers or scanf
or user input in general?>>All right, so a quick look then at one
trailing topic from two weeks ago. And that was this notion of a struct. Not that– this notion of a
struct, which was what? What did struct do for us? >>Define– sorry? Define a variable type. So sort of. We’re actually combining two topics. So with typedef, recall that we can
declare a type of our own, like a synonym, like string for char*. But using typedef and struct, we can
create truly our own data structures.>>For instance, if I go back into gedit
here for just a moment, and I go ahead and do something like, let me save
this as, let’s say, structs.c temporarily, I’m just going
to go ahead and include standardio.h, int main void. And then in here, suppose that I want
to write a program that stores multiple students from multiple
houses, for instance. So it’s like a registrarial
database of some sort.>>So if I need the name one student, I
might do something like char* name, and I’ll do something like– actually, let’s use the CS50 library
for just a moment to make this a little simpler, so we can borrow
those dozens of lines of code. And let’s just keep it simple. We’ll keep it string,
and now GetString.>>So I claim now that I’ve stored the name
of some student, and the house of some student, simply using variables
like we did and in week one. But suppose I now want to support
multiple students. All right, so my instincts are to do
string name2, gets GetString, string house2 gets GetString. And then our third student,
let’s do name3 GetString.>>All right, so this is hopefully striking
you as kind of stupid, because this process is really never
going to end, and it’s just going to make my code look worse
and worse and worse. But we solved this too in week two. What was our relatively clean solution
when we had multiple variables of the same data type that are all related, but
we didn’t want this atrocious mess of similarly named variables? What did we do instead?>>So I think I heard a few places. We had an array. If you want multiple instances of
something, why don’t we clean this all up and just say, give me
array called names?>>And for now, let’s hard code 3. And then give me another array
called houses, and let me for now hard code 3. And I’ve massively cleaned up the
mess that I just created. Now, I’ve still hard coded 3, but even
the 3 could dynamically come from the user, or argv, or the like. So this is already cleaner.>>But what’s annoying about this is that
now, even though a name is somehow fundamentally linked to
a student’s house– it’s a student that I really
want to represent– I now have two arrays that are parallel
in the sense that they’re the same size, and names bracket 0
presumably maps to houses bracket 0, and names bracket 1 maps
to houses bracket 1. In other words, that student lives in
that house, and that other student lives in that other house. But surely this could be
done even more cleanly.>>Well, it can, in fact. And let me go ahead and open
up structs.h, and you’ll see this idea here. Notice that I’ve used typedef, as you
alluded to a moment ago to declare our own data type. But I’m also using this other keyword
called struct which gives me a new data structure.>>And this data structure I claim is going
to have two things inside of it– a string called name, and
a string called house. And the name I’m going to give to
this data structure is going to be called student. I could call it anything I want,
but this semantically make sense to me in my mind.>>So now, if I open up a better version
of the program I started writing there, let me scroll to the top. And there’s some more lines of code
here, but let me focus for the moment on one. I’ve declared a constant called students
and hard coded 3 for now. But now, notice how clean
my code begins to get.>>In line 22, I declare
array of students. And notice that student is apparently
now a data type. Because at the top of this file, notice
I’ve included that header file that I pulled up just a moment ago. And that header file quite simply had
this definition of a student.>>So now, I’ve created my own custom data
type that the authors of C years ago didn’t think of in advance. But no problem. I can make it myself. So this is an array called students,
each of whose members is a student structure. And I want three of those
in the array.>>And now, what does the rest
of this program do? I needed something a little arbitrary. So from online 24 onward,
I iterate from 0 to 3. I then ask the user for
the student’s name. And then I use GetString as before. Then I ask for the student’s house,
and I use GetString as before.>>But notice– slightly new
piece of syntax– I can still index to the i-th student,
but how do I get at the specific data field inside of the struct? Well, what’s apparently the
new piece of syntax? It’s just the dot operator.>>We’ve not really seen this before. You’ve seen it in pset five if you’ve
dived in already with bitmap files. But the dot just means inside of this
struct or multiple fields, give dot name, or give me dot house. That means go inside of the struct
and get those particular fields.>>What does the rest of this program do? It’s not all that sexy. Notice that I iterate from 0 to 3 again,
and I simply create an English phrase like so and so is in such and
such a house, passing in dot name from the i-th student and their
house as well.>>And then lastly, now we’ll start to get
anal about this, now that we’re familiar with what malloc and
other functions have been doing all this time. Why do I have to free both name
and house, even though I did not call malloc?>>GetString did. And that was the dirty little secret for
several weeks, but GetString has been leaking memory all over the
place all semester thus far. And valgrand will finally
reveal this to us.>>But it’s not a big deal, because I know
that I can simply free the name and the house, though technically, to
be super, super safe, I should be doing some error checking here. What are your instincts telling you? What should I be checking for
before I free what is a string, aka which a char*?>>I should really be checking if students
bracket i dot name does not equal null. Then it’ll be OK to go ahead and free
that pointer, and same or the other one as well. If students bracket i dot house is not
equal to null, this now will protect against the corner case in which
GetString returns something like null. And we saw a moment ago, printf will
protect us up here by just saying null, which is going to look weird. But at least it won’t segfault,
as we have seen.>>Well, let me do one other thing here.
structs-0 is kind of a stupid program because I enter all this data, and then
it’s lost once the program ends. But let me go ahead and do this. Let me make the terminal
window a bit bigger. Let me make structs-1, which
is a new version of this.>>I’ll zoom in a little bit. And now let me run dot
slash structs-1. Student’s name– David Mather, let’s do Rob Kirkland,
let’s do Lauren Leverett. What’s interesting now is notice– and I only know this because
I wrote the program– there’s a file now on my current
directory called students.csv. Some of you might have seen
these in the real world.>>What’s a CSV file? Comma-separated values. It’s sort of like a poor man’s
version of an Excel file. It’s a table of rows and columns that
you can open in a program like Excel, or Numbers on a Mac.>>And if I open this file here on gedit,
notice– and the numbers aren’t there. That’s just gedit telling
me line numbers. Notice on the first line of this
file is David and Mather. The next line is Rob comma Kirkland. And the third line is Lauren
comma Leverett.>>So what have I created? I’ve now written a C program that
effectively can generate spreadsheets that can be opened in a
program like Excel. Not all that compelling a data set, but
if you have much larger chunks of data that you actually want to
manipulate and make graphs of and the like, this is perhaps one
way to create that data. Moreover, CSVs are actually super common
just for storing simple data– Yahoo Finance, for instance, if you get
stock quotes via their so-called API, the free service that lets you
get current up-to-the-date stock quotes for companies, they
give the data back in the super simple CSV format.>>So how did we do that? Well notice, most of this program’s
almost the same. But notice down here, rather than print
the students out, on line 35 onward, I claim that I’m saving the
students to disk, so saving a file.>>So notice I’m declaring a FILE*– now, this is kind of an anomaly in C.
For whatever reason, FILE is all caps, which is not like most other data types
in C. But this is a built-in data type, FILE*. And I’m declaring a pointer to a file,
is how you can think of that.>>fopen means open file. What file do you want to open? I want to open a file that I will
arbitrarily call students.csv. I could call that anything I want.>>And then take a guess. What does the second argument
to fopen probably mean? Right, w for write, could
be r for read. There’s a for append if you
want to add rows and not overwrite the whole thing.>>But I just want to create this file
once, so I’ll use quote unquote w. And I know that only from having read
the documentation, or the man page. If file is not null– in other words,
if nothing went wrong there– let me iterate over the
students from 0 to 3.>>And now notice there’s something
ever so slightly different about line 41 here. It’s not printf. It’s fprintf for file printf. So it’s going to write to file. Which file? The one whose pointer you specify
as the first argument.>>Then we specify a format string. Then we specify what string we want to
plug in for the first percent s, and then another variable or
the second percent s. Then we close the file with fclose. Than I free the memory as before, though
I should go back in and add some checks for null.>>And that’s it. fopen, fprintf, fclose gives me the
ability to create text files. Now, you’ll see in problem set five,
which involves images, you’ll be using binary files instead. But fundamentally, the idea is the same,
even though the functions you’ll see are a little bit different.>>So whirlwind tour, but you will get
all too familiar with file I/O– input and output– with pset five. And any questions about the
initial basics here? Yeah? >>What if you try to free a null value? I believe, unless free has gotten a
little more user-friendly, you can potentially segfault. Passing it null is bad because I don’t
believe free bothers to check for you, because it would potentially be a waste
of time for it to do itself for everyone in the world. Good question, though.>>All right, so this kind of gets
us to an interesting topic. The theme of problem set
five is forensics. At least that’s a portion
of the problem set. Forensics generally refers to the
recovery of information that may or may not have been deleted
deliberately. And so I thought I’d give you a quick
taste of what is really going on all this time underneath the
hood of your computer.>>For instance, if you have inside of your
laptop or your desktop computer a hard drive, it’s either a mechanical
device that actually spins– there’s circular things called platters
that look quite like what I just had up on the screen here, though
this is increasingly old school. This is a three-and-a-half-inch
hard drive. And three and a half inches refers of
with of the thing when you install it in a computer.>>Many of you guys in your laptops now
have solid-state drives, or SSDs, which have no moving parts. They’re more like RAM and less like
these mechanical devices. But the ideas are still the same,
certainly as they relate to problem set five.>>And if you think about now a hard drive
represents being a circle, which I’ll draw like this here. When you create a file on your computer,
whether it’s an SSD, or in this case, an older school hard drive,
that file comprises multiple bits. Let’s say that it’s this 0 and 1,
a whole bunch of 0s and 1s. So this is my whole hard drive. This is apparently a pretty big file. And it is using up the 0s and 1s at that
portion of the physical platter.>>Well, what is that physical portion? Well, it turns out that on a hard drive,
at least of this type, there’s these tiny little magnetic particles. And they essentially have north and
south poles to them, so that if you turn one of those magnetic particles
this way, you might say that it’s representing a 1. And if it’s upside down south to
north, you might say that it’s representing a 0.>>So in the real physical world, that’s
how you could represent something in binary state of the 0 and a 1. So that’s all a file is. There’s a whole bunch of magnetic
particles that are their this way or this way, creating patterns
of 0s and 1s.>>But it turns out when you save a file,
some information is saved separately. So this is a little table,
a directory, so to speak. And I’ll call this column name, and
I’ll call this column location.>>And I’m going to say, suppose
this is my resume. My resume.doc is stored at
location, let’s say 123. I always go for that number. But suffice it to say that just like
in RAM, you can take a hard drive that’s a gigabyte or 200 gigabytes
or a terabyte, and you can number all of the bytes. You can number all chunks of 8 bits.>>So we’ll say that this
is location 123. So this directory inside of my operating
system remembers that my resume is at location 123. But it gets interesting when
you delete a file.>>So for instance– and thankfully, most of the world has
caught onto this– what happens when you drag a file to your Mac OS Trash
or your Windows Recycle Bin? What’s the purpose of doing that? It’s obviously to get rid of the file,
but what does the act of dragging and dropping into your Trash or your
Recycle Bin do on a computer?>>Absolutely nothing, really. It’s just like a folder. It’s a special folder, to be sure. But does it actually delete the file?>>Well, no, because some of you probably
have been like, oh damn, you didn’t mean to do that. So you double click the
Trash or Recycle Bin. You’ve poked around and you’ve recovered
the file just by dragging it out of there. So clearly, it’s not necessarily
deleting it.>>OK, you’re smarter than that. You know that just dragging it into the
Trash or Recycle Bin doesn’t mean you’re emptying the trash. So you go up to the menu, and you say
Empty Trash or Empty Recycle Bin. Then what happens? >>Yeah, so it is deleted more so. But all that happens is this. The computer forgets where
resume.doc was.>>But what has not changed apparently
in the picture? The bits, the 0s and 1s that I claim are
on site of some physical aspect of the hardware. They’re still there. It’s just the computer has
forgotten what they are.>>So it’s essentially freed the file’s
bits so that they can be reused. But not until you create more files,
and more files, and more files will probabilistically, those 0s and 1s,
those magnetic particles, get reused, upside or right side up, for
other files, 0s and 1s.>>So you have this window of time. And it’s not of predictable
length, really. It depends on the size of your hard
drive and how many files you have and how quickly you make new ones. But there’s this window of time during
which that file is still perfectly recoverable.>>So if you ever use programs like McAfee
or Norton to try to recover data, all they’re doing is trying to
recover this so-called directory to figure out where your file was. And sometimes Norton and will say,
file is 93% recoverable. Well, what does that mean? That just means that some other file
coincidentally ended up using, say, those bits out of your original file.>>So what is actually involved
in recovering data? Well, if you don’t have something like
Norton pre-installed on your computer, the best you can sometimes do is look
at the entire hard drive looking for patterns of bits. And one of the themes of problem set
five is that you will search the equivalent of a hard drive, a forensic
image of a compact flash card from a digital camera, searching for the 0s
and 1s that typically, with high probability, represent the
start of a JPEG image.>>And you guys can recover those images by
assuming, if I see this pattern of bits on the forensic image, with
high probability, that marks the start of a JPEG. And if I see the same pattern again,
that probably marks the start of another JPEG, and another
JPEG, and another JPEG. And this is typically how
data recovery will work. What’s nice about JPEGs is even though
the file format itself is somewhat complex, the beginning of every such
file is actually fairly identifiable and simple, as you will see,
if you’ve not already.>>So let’s take a closer look underneath
the hood as to exactly what’s been going on, and what these 0s and 1s
are, to give you a bit more of a context for this particular challenge.>>[VIDEO PLAYBACK]>>-Where your PC stores most
of its permanent data. To do that, the data travels from RAM
along with software signals that tell the hard drive how to store that data. The hard drive circuits translate
those signals into voltage fluctuations. These, in turn, control the hard drive’s
moving parts, some of the few moving parts left in the
modern computer.>>Some of the signals control a motor
which spins metal-coated platters. Your data is actually stored
on these platters. Other signals move the read/write
heads to read or write data on the platters. This machinery so precise that a human
hair couldn’t even pass between the heads and spinning platters. Yet, it all works at terrific speeds.>>[END VIDEO PLAYBACK]>>DAVID MALAN: Zoom in a little
deeper now at what’s actually on those platters.>>[VIDEO PLAYBACK]>>-Let’s look at what we just
saw in slow motion. When a brief pulse of electricity is
sent to the read/write head, if flips on a tiny electromagnetic for
a fraction of a second. The magnet creates a field, which
changes the polarity of a tiny, tiny portion of the metal particles which
coat each platter surface.>>A pattern series of these tiny,
charged-up areas on the disk represents a single bit of
data in the binary number system used by computers. Now, if the current is sent one way
through the read/write head, the area is polarized in one direction. If the current is sent in the
opposite direction, the polarization is reversed.>>How you get data off the hard disk? Just reverse the process. So it’s the particles on the disk
that get the current in the read/write head moving. Put together millions of these
magnetized segments, and you’ve got a file.>>Now, the pieces of a single file may
be scattered all over a drive’s platters, kind of like the mess
of papers on your desk. So a special extra file keeps track
of where everything is. Don’t you wish you had
something like that?>>[END VIDEO PLAYBACK]>>DAVID MALAN: OK, probably not. So how many of you guys
grew up with these? OK, so it’s fewer and fewer
hands every year. But I’m glad you’re at least familiar
with them, because this and our own book demo, sadly, are dying a very
slow death here of familiarity.>>But this is what I, at least, back in
high school, used use for backups. And it was amazing, because you
could store 1.4 megabytes on this particular disk. And this was the high density version,
as indicated by the HD, which has meaning before today’s HD videos.>>Standard density was 800 kilobytes. And before that, there were
400-kilobyte disks. And before that, there were 5 and 1/4
inch disks, which were truly floppy, and a little wider and taller
than these things here. But you can actually see the so-called
floppy aspect of these disks.>>And functionally, they’re actually
pretty similar to hard drives of at least this type. Again, SSDs in newer computers
work a little differently. But if you move that little metal tab,
you can actually see a little cookie, or platter.>>It’s not metal like this one. This one’s actually some cheaper
plastic material. And you can kind of wiggle it. And you’ve trully just wiped off some
number of bits or magnetic particles from this disk.>>So thankfully, there’s nothing on it. If that thing’s in the way– and cover
your eyes and those of your neighbor– you can just kind of pull this
whole sheath off like that. But there’s a little spring, so be
aware of that with your eyes. So now you have truly a floppy disk.>>And what’s remarkable about this
is that in as much as this is a small-scale representation of a larger
hard drive, these things are super, super simple. If you pinch the bottom of it, now that
that metal thing’s off, and peel them open, all there is is two pieces of
felt and the so-called floppy disk with a piece of metal on the inside.>>And there goes half of
my disk’s contents. There goes another half of them. But that’s all that was spinning inside
of your computer in yesteryear. >>And again, to put this into perspective,
how big is most of your hard drives these days? 500 gigabytes, a terabyte, maybe in
a desktop computer, 2 terabytes, 3 terabytes, 4 terabytes, right? This is one megabyte, give or take,
which can’t even fit a typical MP3 anymore these days, or some
similar music file.>>So a little souvenir for you today, and
also to help contextualize what we’ll be taking for granted
now in problem set five. So those are yours to keep. So let me transition to where will be
spending the next pset as well. So we’ve now set this page for– oh,
a couple of announcements quickly.>>This Friday, if you would like join CS50
for lunch, go to the usual place, cs50.net/rsvp. And final project– so per the syllabus, we’ve posted the
final project specification already. Realize that that doesn’t mean
it’s due particularly soon. It’s posted, really, just to get
you guys thinking about it. And indeed, a super significant
percentage of you will be tackling final projects on material that we
haven’t even gotten to in the class, but will as early as next week.>>Notice, though, that the spec calls for
a few different components of the final project. The first, in a few weeks, is a
pre-proposal, a pretty casual email to your TF to tell him or what you’re
thinking about for your project, with no commitment. Proposal will be your particular
commitment, saying, here, this is what I’d like to do for my project. What do you think? Too big? Too small? Is it manageable? And you see the spec for more details.>>Couple of weeks after that is the status
report, which is a similarly casual email to your TF to say just how
far behind you are in your final project’s implementation, followed by
the CS50 Hackathon to which everyone is invited, which will be an event from
8:00 PM on one evening till 7:00 AM the next morning. Pizza, as I may have mentioned in week
zero, wil be served at 9:00 PM, Chinese food at 1:00 AM. And if you’re still awake at 5:00 AM,
we’ll take you to IHOP for breakfast.>>So the Hackathon is one of the more
memorable experiences in the class. Then the implementation is due, and
then the climactic CS50 Fair. More details on all of these
in the weeks to come.>>But let’s go back to something
old school– again, an array. So an array was nice, because it solves
problems like we saw just a moment ago with student structures
getting a little out of control if we want to have student one, student two,
student three, student dot dot dot, some arbitrary number of students.>>So arrays, a few weeks ago, swooped in
and solved all of our problems of not knowing in advance how many things
of some type we might want. And we’ve seen that structs can help us
further organize our code and keep conceptually similar variables, like a
name and a house, together, so that we can treat them as one entity, inside
of which there are smaller pieces.>>But arrays have some disadvantages. What are some of the disadvantages
we’ve encountered with arrays thus far? What’s that? Fixed size– so even though you might
be able to allocate memory for an array, once you know how many students
you have, how many characters you have from the user, once you’ve allocated
the array, you’ve kind of painted yourself into a corner.>>Because you cannot insert new elements
into the middle of an array. You can’t insert more elements
at the end of an array. Really, you have to resort to creating a
whole new array, as we’ve discussed, copying the old into the new. And again, that is the headache that
GetString deals with for you.>>But again, you can’t even insert
something into the middle of the array if the rate isn’t entirely filled. For instance, if this array here of size
six only has five things in it, well, you could just tack
something onto the end. But what if you want to insert something
into the middle of the array, even though it might have
five out of six things in it?>>Well, what did we do when we had all
of our human volunteers onstage in weeks past? If we wanted to put someone here, either
these people how to move this way, or these people how to move this
way, and that became expensive. The shifting of people inside of an
array ended up adding up and costing us time, hence lot of our n squared
running times like insertion sort, for instance, in the worst case. So arrays are great, but you have to
know in advance how big you want them.>>So OK, here’s a solution. If I don’t know in advance how many
students I might have, and I know once I decide, though, I’m stuck with that
many students, why don’t I just always allocate twice as much space
as I might think I need? Is that not a reasonable solution?>>Realistically, I don’t think that we’re
going to need more than 50 slots in an array for a medium-size class,
so let’s just round up. I’ll make 100 slots in my array, just
so that we can definitely get the number of students I expect to
be in some medium-size class. So why not just round up and allocate
more memory, typically, for an array than you think you might even need? What’s this simple pushback
to that idea?>>You’re just wasting memory. Literally every program you write then
is maybe using twice as much memory as you actually need. And that just doesn’t feel like a
particularly elegant solution. Moreover, it just decreases the
probability of a problem. If you happen to have a popular course
one semester and you have 101 students, your program is still
fundamentally facing the same issue.>>So thankfully, there’s a solution to
this ad all our problems in the form of data structures that are
more complex than the ones we’ve seen thus far. This, I claim, is a linked list. This is a list of numbers– 9, 17, 22, 26, and 34– that have been linked together by way
of what I’ve drawn as arrows.>>In other words, if I wanted to represent
an array, I could do something like this. And I’ll put this on the overhead
in just a moment. I could do– hello, all right. Stand by. New computer here, clear– all right.>>So if I have these numbers in array– 9, 17, 22, 26, 24– not necessarily to scale. All right, so here is my array– oh my god. All right, so here is my array. Oh my god.>>[LAUGHTER]>>DAVID MALAN: Pretend. It’s too much effort to go back
and fix that, so there– 26. So we have this array of
9, 17, 22, 26, and 34. For those of you can see the
embarrassing mistake I just made, there it is.>>So I claim that this is a
very efficient solution. I’ve allocated as many ints as
I need– one, two, three, four, five, or six– and I’ve then stored the numbers
inside of this array. But suppose, then, I want to insert
a value like the number 8? Well, where does it go? Suppose I want to insert
a number like 20. Well, where does it go? Somewhere there in the middle,
or the number 35 has to go somewhere at the end. But I’m all out of space.>>And so this is a fundamental challenge
of arrays that does are the solution. I claimed a moment ago, GetString
solves this problem. If you want to insert a sixth number
into this array, what is at least one solution you can fall back on for sure,
just like we do with GetString? What’s that?>>Well, make it bigger is
easier said than done. We can’t necessarily make the array
bigger, but what can we do? Make a new array that’s bigger, of size
6, or maybe size 10, if we want to get ahead of things, and then copy
the old array into the new, and then free the old array.>>But what’s the running time
now of that process? It’s big O of n, because the copying
is going to cost you some units of time, so not so ideal if we have to
allocate a new array, which is going to consume twice as much
memory temporarily. Copy old into new– I mean, it’s just a headache, which
is, again, why we wrote GetString for you.>>So what might we do instead? Well, what if our data structure
actually has gaps in it? Suppose that I relax my goal of having
contiguous chunks of memory, where 9 is right next to 17, which is
right next to 22, and so on.>>And suppose that 9 can be over here in
RAM, and 17 can be over here in RAM, and 22 can be over here in RAM. In other words, I don’t need them
even back to back anymore. I just have to somehow thread a needle
through each of these numbers, or each of these nodes, as we’ll call the
rectangles as I’ve drawn them, to remember how to get to the last
such node from the first.>>So what is the programming construct
we’ve seen quite recently with which I can implement that thread, or
drawn here, with which I can implement those arrows? So pointers, right? If I allocate not just an
int, but a node– and by node, I just mean container. And visually, I mean a rectangle. So a node apparently needs
to contain two values– the int itself, and then, as implied by
the bottom half of the rectangle, enough space for an int.>>So just thinking ahead here,
how big is this node, this container in question? How many bytes for the int? Presumably 4, if it’s
the same as usual. And then how many bytes
for the pointer? 4. So this container, or this node, is
going to be an 8-byte structure. Oh, and that’s a happy coincidence that
we just introduced this notion of a struct, or a C structure.>>So I claim that I want to take a step
toward this more sophisticated implementation of a list of numbers, a
linked list of numbers, I need to do a little more thinking up front and
declare not just an int, but a struct that I’ll call, conventionally
here, node. We could call it anything we want, but
node is going to be thematic in a lot of the things we start looking at now.>>Inside of that node is an int n. And then this syntax, a little
weird at first glance– struct node* next. Well pictorially, what is that? That is the bottom half of
the rectangle that we saw just a moment ago.>>But why am I saying struct node*
as opposed to just node*? Because if that pointer is pointing
at another node, it’s just the address of a node. That’s consistent with what we’ve
discussed about pointers thus far. But why, if I claim this structure is
called node, do I have to say struct node inside here?>>Exactly. It’s sort of a stupid reality of C.
The typedef, so to speak, hasn’t happened yet. C is super literal. It reads your code top to
bottom, left to right. And until it hits that semicolon on the
bottom line, guess what doesn’t exist as a data type? Node, quote unquote node.>>But because of the more verbose
declaration I did on the first line– typedef struct node– because that came first, before the
curly braces, that’s sort of like pre-educating Clang that, you
know what, give me a struct called struct node. Frankly, I don’t like calling things
struct node, struct node all throughout my code. But I’ll only use it once, just inside,
so that I can effectively create a sort of circular reference, not
a pointer to myself per se, but a pointer to another of
an identical type.>>So it turns out that on a data structure
like this, there’s a few operations that might be
of interest to us. We might want to insert
into a list like this. We might want to delete
from a list like this. We might want to search the list for a
value, or more generally, traverse. And traverse is just a fancy way of
saying start at the left and move all the way to the right.>>And notice, even with this slightly more
sophisticated data structure, let me propose that we can borrow some of
the ideas of the past two weeks and implement a function called
search like this. It’s going to return true or
false, indicating, yes or no, n is in the list. Its second argument is a pointer
to the list itself, so a pointer to a node.>>All I’m going to then do is declare
a temporary variable. We’ll call it ptr by convention,
for pointer. And I assign it equal to the
beginning of the list.>>And now notice the while loop. So long as pointer is not equal
to null, I’m going to check. Is pointer arrow n equal to
the n that was passed in? And wait a minute– new
piece of syntax. What is arrow all of a sudden? Yeah? >>Exactly. So whereas a few minutes ago, we used
the dot notation to access something inside of a the struct, if the variable
you have is not the struct itself, but a pointer to a struct,
thankfully, a piece of syntax that finally makes intuitive sense. The arrow means to follow the pointer,
like our arrows typically mean pictorially, and go at
data field inside. So arrow is the same thing as dot, but
you use it when you have a pointer.>>So just to recap then, if the n field
inside of the struct called pointer equals equals n, return true. Otherwise, this line here– pointer
equals pointer next. So what this is doing, notice, is if I
am currently pointing at the struct containing 9, and 9 is not the number
I’m looking for– suppose I’m looking for n equals 50– I’m going to update my temporary pointer
to not point at this node anymore, but pointer arrow next, which
is going to put me up here.>>Now, I realized is a whirlwind
introduction. On Wednesday, we’ll actually do this
with some humans and with some more code at a slower pace. But realize, we’re now making our data
structures more complex so that our algorithms can get more efficient, which
is going to be requisite for pset six, when we load in, again, those
150,000 words, but need to do so efficiently, and ideally, create a
program that runs for our users not in linear, not in n squared, but in
constant time, in the ideal.>>We’ll see you on Wednesday.>>SPEAKER: At the next CS50, David
forgets his base case.>>DAVID MALAN: And that’s how you send
text messages with C. What the–>>[VARIOUS TEXT MESSAGE

Stephen Childs

Leave a Reply

Your email address will not be published. Required fields are marked *